 *************************************
 SequenceFile
 *************************************

 -----------------------------------------
 Introduction
 -----------------------------------------

 A SequenceFile is a flat file consisting of binary key/value pairs.
 It is a basic file format provided by Hadoop, and Hive also supports it as a table storage format.

 The ``USING sequencefile`` clause creates a table stored as a SequenceFile. Here is an example statement that creates a table using ``SequenceFile``:

 .. code-block:: sql

  CREATE TABLE table1 (id int, name text, score float, type text)
  USING sequencefile;

 Tajo also provides Hive compatibility for SequenceFile. The above statement can be written in Hive as follows:

 .. code-block:: sql

  CREATE TABLE table1 (id int, name string, score float, type string)
  STORED AS sequencefile;

 -----------------------------------------
 SerializerDeserializer (SerDe)
 -----------------------------------------

 Two SerDe classes are available for SequenceFile:

  + TextSerializerDeserializer: This class can read and write data in plain text file format.
  + BinarySerializerDeserializer: This class can read and write data in binary file format.

 In Tajo, the default is the plain-text SerDe, so the example statement above creates the table with TextSerializerDeserializer. To use BinarySerializerDeserializer instead, specify it with the ``sequencefile.serde`` property:

 .. code-block:: sql

  CREATE TABLE table1 (id int, name text, score float, type text)
  USING sequencefile with ('sequencefile.serde'='org.apache.tajo.storage.BinarySerializerDeserializer');

 In Hive, the above statement can be written as follows:

 .. code-block:: sql

  CREATE TABLE table1 (id int, name string, score float, type string)
  ROW FORMAT SERDE
   'org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe'
  STORED AS sequencefile;

 -----------------------------------------
 Writer
 -----------------------------------------

 There are three SequenceFile Writers based on the SequenceFile.CompressionType used to compress key/value pairs:

  + Writer : Uncompressed records.
  + RecordCompressWriter : Record-compressed files; only the values are compressed.
  + BlockCompressWriter : Block-compressed files; keys and values are collected separately into 'blocks' and compressed. The size of a 'block' is configurable.

 In Tajo, the default is the uncompressed Writer. To use RecordCompressWriter, set the ``compression.type`` property to ``RECORD`` and specify a codec with the ``compression.codec`` property:

 .. code-block:: sql

  CREATE TABLE table1 (id int, name text, score float, type text)
  USING sequencefile with ('compression.type'='RECORD','compression.codec'='org.apache.hadoop.io.compress.SnappyCodec');

 In Hive, you need to specify the following settings:

 .. code-block:: sql

  hive> SET hive.exec.compress.output = true;
  hive> SET mapred.output.compression.type = RECORD;
  hive> SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
  hive> CREATE TABLE table1 (id int, name string, score float, type string) STORED AS sequencefile;

 To use BlockCompressWriter, set the ``compression.type`` property to ``BLOCK`` and specify a codec with the ``compression.codec`` property:

 .. code-block:: sql

  CREATE TABLE table1 (id int, name text, score float, type text)
  USING sequencefile with ('compression.type'='BLOCK','compression.codec'='org.apache.hadoop.io.compress.SnappyCodec');

 In Hive, you need to specify the following settings:

 .. code-block:: sql

  hive> SET hive.exec.compress.output = true;
  hive> SET mapred.output.compression.type = BLOCK;
  hive> SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
  hive> CREATE TABLE table1 (id int, name string, score float, type string) STORED AS sequencefile;

 Either SerDe (TextSerializerDeserializer or BinarySerializerDeserializer) can be combined with the compression properties.
 Here is an example statement for this case:

 .. code-block:: sql

  CREATE TABLE table1 (id int, name text, score float, type text)
  USING sequencefile with ('sequencefile.serde'='org.apache.tajo.storage.BinarySerializerDeserializer', 'compression.type'='BLOCK','compression.codec'='org.apache.hadoop.io.compress.SnappyCodec');

 In Hive, you need to specify the following settings:

 .. code-block:: sql

  hive> SET hive.exec.compress.output = true;
  hive> SET mapred.output.compression.type = BLOCK;
  hive> SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
  hive> CREATE TABLE table1 (id int, name string, score float, type string)
        ROW FORMAT SERDE
          'org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe'
        STORED AS sequencefile;
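
 The compression properties should combine with the text SerDe in the same way as with the binary SerDe. The following statement is a sketch rather than a verified one: the class name ``org.apache.tajo.storage.TextSerializerDeserializer`` is an assumption, inferred from the package of the binary SerDe shown above, and record compression with Snappy is chosen only for illustration:

 .. code-block:: sql

  -- Assumed class name for the text SerDe; check it against your Tajo version.
  CREATE TABLE table1 (id int, name text, score float, type text)
  USING sequencefile with ('sequencefile.serde'='org.apache.tajo.storage.TextSerializerDeserializer', 'compression.type'='RECORD','compression.codec'='org.apache.hadoop.io.compress.SnappyCodec');

 Because the text SerDe is the default in Tajo, omitting ``sequencefile.serde`` from this statement should create an equivalent table.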