 *************************************
 Parquet
 *************************************

 Parquet is a columnar storage format for Hadoop. Parquet is designed to make the advantages of compressed,
 efficient columnar data representation available to any project in the Hadoop ecosystem,
 regardless of the choice of data processing framework, data model, or programming language.
 For more details, please refer to `Parquet File Format <http://parquet.io/>`_.

 =========================================
 How to Create a Parquet Table?
 =========================================

 If you are not familiar with the ``CREATE TABLE`` statement, please refer to Data Definition Language :doc:`/sql_language/ddl`.

 To specify the file format for a table, use the ``USING`` clause in the ``CREATE TABLE``
 statement. Below is an example statement that creates a table stored as Parquet files.

 .. code-block:: sql

   CREATE TABLE table1 (
     id int,
     name text,
     score float,
     type text
   ) USING PARQUET;

 =========================================
 Physical Properties
 =========================================

 Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters.
 The ``WITH`` clause in the ``CREATE TABLE`` statement allows users to set those parameters.

 The Parquet format currently provides the following physical properties.

 * ``parquet.block.size``: The size of a row group buffered in memory, which bounds memory usage when writing. Larger values improve read I/O but consume more memory when writing. The default is 134217728 bytes (128 * 1024 * 1024).
 * ``parquet.page.size``: The page size used for compression. A block is composed of pages, and each page can be decompressed independently when reading. A page is the smallest unit that must be read in full to access a single record. If this value is too small, compression deteriorates. The default is 1048576 bytes (1 * 1024 * 1024).
 * ``parquet.compression``: The compression algorithm used to compress pages. It must be one of ``uncompressed``, ``snappy``, ``gzip``, or ``lzo``. The default is ``uncompressed``.
 * ``parquet.enable.dictionary``: A boolean that enables or disables dictionary encoding. It must be either ``true`` or ``false``. The default is ``true``.
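
 As a sketch of how these properties might be passed through the ``WITH`` clause, the statement below
 creates a Snappy-compressed table with a 256 MB row group (268435456 = 256 * 1024 * 1024). The table name,
 columns, and chosen values are illustrative only, and the quoted ``'key'='value'`` form is assumed from the
 way Tajo sets properties for its other storage formats; verify it against your Tajo version.

 .. code-block:: sql

   -- Illustrative values; any property left out keeps its default.
   CREATE TABLE table2 (
     id int,
     name text,
     score float,
     type text
   ) USING PARQUET WITH (
     'parquet.block.size' = '268435456',
     'parquet.page.size' = '1048576',
     'parquet.compression' = 'snappy',
     'parquet.enable.dictionary' = 'true'
   );

 With these values each row group spans up to 256 pages of 1 MB each, trading more writer memory for larger
 sequential reads.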

 =========================================
 Compatibility Issues with Apache Hive™
 =========================================

 At the moment, Tajo only supports flat relational tables.
 As a result, Tajo's Parquet storage type does not support nested schemas.
 However, we are currently working on adding support for nested schemas and non-scalar types (`TAJO-710 <https://issues.apache.org/jira/browse/TAJO-710>`_).