docs/xdocs/language_manual/data-manipulation-statements.xml - hive - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
 -->

 <document>

   <properties>
     <title>Hadoop Hive- Data Manipulation Statements</title>
     <author email="hive-user@hadoop.apache.org">Hadoop Hive Documentation Team</author>
   </properties>

   <body>

     <section name="Create Table Syntax" href="create_table_syntax">

     <source><![CDATA[
 CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
   [(col_name data_type [COMMENT col_comment], ...)]
   [COMMENT table_comment]
   [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
   [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
   [ROW FORMAT row_format]
   [STORED AS file_format]
   [LOCATION hdfs_path]
   [TBLPROPERTIES (property_name=property_value, ...)]
   [AS select_statement]

 CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
   LIKE existing_table_name
   [LOCATION hdfs_path]

 data_type
   : primitive_type
   | array_type
   | map_type
   | struct_type

 primitive_type
   : TINYINT
   | SMALLINT
   | INT
   | BIGINT
   | BOOLEAN
   | FLOAT
   | DOUBLE
   | STRING

 array_type
   : ARRAY < data_type >

 map_type
   : MAP < primitive_type, data_type >

 struct_type
   : STRUCT < col_name : data_type [COMMENT col_comment], ...>

 row_format
   : DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
         [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
   | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

 file_format:
   : SEQUENCEFILE
   | TEXTFILE
   | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
 ]]></source>

 <p>
 CREATE TABLE creates a table with the given name. An error is thrown if a table or view with the same name already exists. You can use IF NOT EXISTS to skip the error.
 </p>

 <p>
 The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
 </p>
 The LIKE form of CREATE TABLE allows you to copy an existing table definition exactly (without copying its data).

 <p>
 You can create tables with custom SerDe or using native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified. You can use the DELIMITED clause to read delimited files. Use the SERDE clause to create a table with custom SerDe. Refer to SerDe section of the User Guide for more information on SerDe.
 </p>

 <p>
 You must specify a list of a columns for tables that use a native SerDe. Refer to the Types part of the User Guide for the allowable column types. A list of columns for tables that use a custom SerDe may be specified but Hive will query the SerDe to determine the actual list of columns for this table.
 </p>

 <p>
 Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about Hive/CompressedStorage if you are planning to keep data compressed in your Hive tables. Use INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat and OutputFormat class as a string literal, e.g. 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'.
 </p>

 <p>
 Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.
 </p>

 <p>
 Table names and column names are case insensitive but SerDe and property names are case sensitive. Table and column comments are string literals (single-quoted). The TBLPROPERTIES clause allows you to tag the table definition with your own metadata key/value pairs.
 </p>

 <p>A create table example:</p>
   <source><![CDATA[CREATE TABLE page_view(viewTime INT, userid BIGINT,
      page_url STRING, referrer_url STRING,
      ip STRING COMMENT 'IP Address of the User')
  COMMENT 'This is the page view table'
  PARTITIONED BY(dt STRING, country STRING)
  STORED AS SEQUENCEFILE;]]></source>

  <p>The statement above creates the page_view table with viewTime, userid, page_url, referrer_url, and ip columns (including comments). The table is also partitioned and data is stored in sequence files. The data format in the files is assumed to be field-delimited by ctrl-A and row-delimited by newline.
   </p>

 </section>

 <section name="Create Table as Select (CTAS)" href="ctas?">

   <p>
   Tables can also be created and populated by the results of a query in one create-table-as-select (CTAS) statement. The table created by CTAS is atomic, meaning that the table is not seen by other users until all the query results are populated. So other users will either see the table with the complete results of the query or will not see the table at all.
   </p>

   <p>
   There are two parts in CTAS, the SELECT part can be any SELECT statement supported by HiveQL. The CREATE part of the CTAS takes the resulting schema from the SELECT part and creates the target table with other table properties such as the SerDe and storage format. The only restrictions in CTAS is that the target table cannot be a partitioned table (nor can it be an external table).
   </p>

   <source><![CDATA[CREATE TABLE page_view(viewTime INT, userid BIGINT,
      page_url STRING, referrer_url STRING,
      ip STRING COMMENT 'IP Address of the User')
  COMMENT 'This is the page view table'
  PARTITIONED BY(dt STRING, country STRING)
  STORED AS SEQUENCEFILE;
 ]]></source>

 </section>

 <section name="Using SerDes" href="SerDes">

 <p>
 This example CTAS statement creates the target table new_key_value_store with the
 schema (new_key DOUBLE, key_value_pair STRING) derived from the results of the
 SELECT statement. If the SELECT statement does not specify column aliases, the
 column names will be automatically assigned to _col0, _col1, and _col2 etc.
 In addition, the new target table is created using a specific SerDe and a storage
 format independent of the source tables in the SELECT statement.
 </p>

 <source><![CDATA[CREATE TABLE new_key_value_store
    ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
    STORED AS RCFile AS
 SELECT (key % 1024) new_key, concat(key, value) key_value_pair
 FROM key_value_store
 SORT BY new_key, key_value_pair;
 ]]></source>

 <p>
 <b>Being able to select data from one table to another is one of the most
 powerful features of Hive. Hive handles the conversion of the data from the source
 format to the destination format as the query is being executed!</b>
 </p>

 </section>

 <section name="Bucketed Sorted Table" href="bucketed_sorted_table">

 <source><![CDATA[CREATE TABLE page_view(viewTime INT, userid BIGINT,
      page_url STRING, referrer_url STRING,
      ip STRING COMMENT 'IP Address of the User')
  COMMENT 'This is the page view table'
  PARTITIONED BY(dt STRING, country STRING)
  CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
    COLLECTION ITEMS TERMINATED BY '\002'
    MAP KEYS TERMINATED BY '\003'
  STORED AS SEQUENCEFILE;
 ]]></source>

 <p>In the example above, the page_view table is bucketed (clustered by) userid and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency. MAP KEYS and COLLECTION ITEMS keywords can be used if any of the columns are lists or maps.
 </p>

 <p>
 The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table -- only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query. See
 <a href="working_with_bucketed_tables.html">Working with Bucketed tables</a> to see how these
 are used.
 </p>

 </section>

 <section name="External Tables" href="external_table?">

 <p>
 Unless a table is specified as EXTERNAL it will be stored inside a folder specified by the
 configuration property hive.metastore.warehouse.dir.
 EXTERNAL tables points to any hdfs location for its storage. You still have to make sure that the data is format is specified to match the data.

 </p>
 <source><![CDATA[CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
      page_url STRING, referrer_url STRING,
      ip STRING COMMENT 'IP Address of the User',
      country STRING COMMENT 'country of origination')
  COMMENT 'This is the staging page view table'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
  STORED AS TEXTFILE
  LOCATION '<hdfs_location>';
  ]]></source>

 </section>

 <section name="Create Table ... Like" href="create_table_like?">

 <p>The statement above creates a new empty_key_value_store table whose definition exactly matches the existing key_value_store in all particulars other than table name. The new table contains no rows.
 </p>

 <source><![CDATA[CREATE TABLE empty_key_value_store
 LIKE key_value_store;
 ]]></source>

 </section>

 <section name="drop" href="drop">
 <p>Drop it like it is hot</p>
 </section>
   </body>
 </document>
	<?xml version="1.0" encoding="UTF-8"?>
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->

	<document>

	<properties>
	<title>Hadoop Hive- Data Manipulation Statements</title>
	<author email="hive-user@hadoop.apache.org">Hadoop Hive Documentation Team</author>
	</properties>

	<body>

	<section name="Create Table Syntax" href="create_table_syntax">

	<source><![CDATA[
	CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
	[(col_name data_type [COMMENT col_comment], ...)]
	[COMMENT table_comment]
	[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
	[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC\|DESC], ...)] INTO num_buckets BUCKETS]
	[ROW FORMAT row_format]
	[STORED AS file_format]
	[LOCATION hdfs_path]
	[TBLPROPERTIES (property_name=property_value, ...)]
	[AS select_statement]

	CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
	LIKE existing_table_name
	[LOCATION hdfs_path]

	data_type
	: primitive_type
	\| array_type
	\| map_type
	\| struct_type

	primitive_type
	: TINYINT
	\| SMALLINT
	\| INT
	\| BIGINT
	\| BOOLEAN
	\| FLOAT
	\| DOUBLE
	\| STRING

	array_type
	: ARRAY < data_type >

	map_type
	: MAP < primitive_type, data_type >

	struct_type
	: STRUCT < col_name : data_type [COMMENT col_comment], ...>

	row_format
	: DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
	[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
	\| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

	file_format:
	: SEQUENCEFILE
	\| TEXTFILE
	\| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
	]]></source>

	<p>
	CREATE TABLE creates a table with the given name. An error is thrown if a table or view with the same name already exists. You can use IF NOT EXISTS to skip the error.
	</p>

	<p>
	The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
	</p>
	The LIKE form of CREATE TABLE allows you to copy an existing table definition exactly (without copying its data).

	<p>
	You can create tables with custom SerDe or using native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified. You can use the DELIMITED clause to read delimited files. Use the SERDE clause to create a table with custom SerDe. Refer to SerDe section of the User Guide for more information on SerDe.
	</p>

	<p>
	You must specify a list of a columns for tables that use a native SerDe. Refer to the Types part of the User Guide for the allowable column types. A list of columns for tables that use a custom SerDe may be specified but Hive will query the SerDe to determine the actual list of columns for this table.
	</p>

	<p>
	Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about Hive/CompressedStorage if you are planning to keep data compressed in your Hive tables. Use INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat and OutputFormat class as a string literal, e.g. 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'.
	</p>

	<p>
	Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.
	</p>

	<p>
	Table names and column names are case insensitive but SerDe and property names are case sensitive. Table and column comments are string literals (single-quoted). The TBLPROPERTIES clause allows you to tag the table definition with your own metadata key/value pairs.
	</p>

	<p>A create table example:</p>
	<source><![CDATA[CREATE TABLE page_view(viewTime INT, userid BIGINT,
	page_url STRING, referrer_url STRING,
	ip STRING COMMENT 'IP Address of the User')
	COMMENT 'This is the page view table'
	PARTITIONED BY(dt STRING, country STRING)
	STORED AS SEQUENCEFILE;]]></source>

	<p>The statement above creates the page_view table with viewTime, userid, page_url, referrer_url, and ip columns (including comments). The table is also partitioned and data is stored in sequence files. The data format in the files is assumed to be field-delimited by ctrl-A and row-delimited by newline.
	</p>

	</section>

	<section name="Create Table as Select (CTAS)" href="ctas?">

	<p>
	Tables can also be created and populated by the results of a query in one create-table-as-select (CTAS) statement. The table created by CTAS is atomic, meaning that the table is not seen by other users until all the query results are populated. So other users will either see the table with the complete results of the query or will not see the table at all.
	</p>

	<p>
	There are two parts in CTAS, the SELECT part can be any SELECT statement supported by HiveQL. The CREATE part of the CTAS takes the resulting schema from the SELECT part and creates the target table with other table properties such as the SerDe and storage format. The only restrictions in CTAS is that the target table cannot be a partitioned table (nor can it be an external table).
	</p>

	<source><![CDATA[CREATE TABLE page_view(viewTime INT, userid BIGINT,
	page_url STRING, referrer_url STRING,
	ip STRING COMMENT 'IP Address of the User')
	COMMENT 'This is the page view table'
	PARTITIONED BY(dt STRING, country STRING)
	STORED AS SEQUENCEFILE;
	]]></source>

	</section>

	<section name="Using SerDes" href="SerDes">

	<p>
	This example CTAS statement creates the target table new_key_value_store with the
	schema (new_key DOUBLE, key_value_pair STRING) derived from the results of the
	SELECT statement. If the SELECT statement does not specify column aliases, the
	column names will be automatically assigned to _col0, _col1, and _col2 etc.
	In addition, the new target table is created using a specific SerDe and a storage
	format independent of the source tables in the SELECT statement.
	</p>

	<source><![CDATA[CREATE TABLE new_key_value_store
	ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
	STORED AS RCFile AS
	SELECT (key % 1024) new_key, concat(key, value) key_value_pair
	FROM key_value_store
	SORT BY new_key, key_value_pair;
	]]></source>

	<p>
	<b>Being able to select data from one table to another is one of the most
	powerful features of Hive. Hive handles the conversion of the data from the source
	format to the destination format as the query is being executed!</b>
	</p>

	</section>

	<section name="Bucketed Sorted Table" href="bucketed_sorted_table">

	<source><![CDATA[CREATE TABLE page_view(viewTime INT, userid BIGINT,
	page_url STRING, referrer_url STRING,
	ip STRING COMMENT 'IP Address of the User')
	COMMENT 'This is the page view table'
	PARTITIONED BY(dt STRING, country STRING)
	CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
	ROW FORMAT DELIMITED
	FIELDS TERMINATED BY '\001'
	COLLECTION ITEMS TERMINATED BY '\002'
	MAP KEYS TERMINATED BY '\003'
	STORED AS SEQUENCEFILE;
	]]></source>

	<p>In the example above, the page_view table is bucketed (clustered by) userid and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency. MAP KEYS and COLLECTION ITEMS keywords can be used if any of the columns are lists or maps.
	</p>

	<p>
	The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table -- only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query. See
	<a href="working_with_bucketed_tables.html">Working with Bucketed tables</a> to see how these
	are used.
	</p>

	</section>

	<section name="External Tables" href="external_table?">

	<p>
	Unless a table is specified as EXTERNAL it will be stored inside a folder specified by the
	configuration property hive.metastore.warehouse.dir.
	EXTERNAL tables points to any hdfs location for its storage. You still have to make sure that the data is format is specified to match the data.

	</p>
	<source><![CDATA[CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
	page_url STRING, referrer_url STRING,
	ip STRING COMMENT 'IP Address of the User',
	country STRING COMMENT 'country of origination')
	COMMENT 'This is the staging page view table'
	ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
	STORED AS TEXTFILE
	LOCATION '<hdfs_location>';
	]]></source>

	</section>

	<section name="Create Table ... Like" href="create_table_like?">

	<p>The statement above creates a new empty_key_value_store table whose definition exactly matches the existing key_value_store in all particulars other than table name. The new table contains no rows.
	</p>

	<source><![CDATA[CREATE TABLE empty_key_value_store
	LIKE key_value_store;
	]]></source>

	</section>

	<section name="drop" href="drop">
	<p>Drop it like it is hot</p>
	</section>
	</body>
	</document>