docs/xdocs/language_manual/working_with_bucketed_tables.xml - hive - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
 -->

 <document>

   <properties>
     <title>Hadoop Hive- Working with Bucketed Tables</title>
     <author email="hive-user@hadoop.apache.org">Hadoop Hive Documentation Team</author>
   </properties>

   <body>

  <section name="Defining Bucketed Tables" href="defining_bucketed_tables?">

 <p>
 This is a brief example on creating a populating bucketed tables. Bucketed tables
 are fantastic in that they allow much more efficient sampling than do non-bucketed
 tables, and they may later allow for time saving operations such as mapside joins.
 However, the bucketing specified at table creation is not enforced when the table
 is written to, and so it is possible for the table's metadata to advertise
 properties which are not upheld by the table's actual layout. This should obviously
 be avoided. Here's how to do it right.
 </p>
 <p>First there’s table creation:</p>

  <source><![CDATA[CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
 COMMENT 'A bucketed copy of user_info'
 PARTITIONED BY(ds STRING)
 CLUSTERED BY(user_id) INTO 256 BUCKETS;
  ]]></source>

 <p>notice that we define user_id as the bucket column</p>
 </section>

 <section name="Populating Bucketed Tables" href="populating_bucketed_tables?">

  <source><![CDATA[set hive.enforce.bucketing = true;
 FROM user_id
 INSERT OVERWRITE TABLE user_info_bucketed
 PARTITION (ds='2009-02-25')
 SELECT userid, firstname, lastname WHERE ds='2009-02-25';
  ]]></source>

 <p>The command <strong>set hive.enforce.bucketing = true;</strong>  allows the
 correct number of reducers and the cluster by column to be automatically selected
 based on the table. Otherwise, you would need to set the number of reducers to be
 the same as the number of buckets with
 <strong>set mapred.reduce.tasks = 256;</strong> and have a
 <strong>CLUSTER BY ...</strong> clause in the select.</p>

 </section>

 <section name="Bucketing Explained" href="bucketing_explained?">
 <p>
 How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.
 </p>

 </section>

 <section name="What can go wrong?" href="bucketing_gone_wrong?">
 <p>
 So, what can go wrong? As long as you
 <strong>set hive.enforce.bucketing = true</strong>, and use the syntax above,
 the tables should be populated properly. Things can go wrong if the bucketing
 column type is different during the insert and on read, or if you manually
 cluster by a value that's different from the table definition.
 </p>
 </section>
 </body>
 </document>
	<?xml version="1.0" encoding="UTF-8"?>
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->

	<document>

	<properties>
	<title>Hadoop Hive- Working with Bucketed Tables</title>
	<author email="hive-user@hadoop.apache.org">Hadoop Hive Documentation Team</author>
	</properties>

	<body>

	<section name="Defining Bucketed Tables" href="defining_bucketed_tables?">

	<p>
	This is a brief example on creating a populating bucketed tables. Bucketed tables
	are fantastic in that they allow much more efficient sampling than do non-bucketed
	tables, and they may later allow for time saving operations such as mapside joins.
	However, the bucketing specified at table creation is not enforced when the table
	is written to, and so it is possible for the table's metadata to advertise
	properties which are not upheld by the table's actual layout. This should obviously
	be avoided. Here's how to do it right.
	</p>
	<p>First there’s table creation:</p>

	<source><![CDATA[CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
	COMMENT 'A bucketed copy of user_info'
	PARTITIONED BY(ds STRING)
	CLUSTERED BY(user_id) INTO 256 BUCKETS;
	]]></source>

	<p>notice that we define user_id as the bucket column</p>
	</section>

	<section name="Populating Bucketed Tables" href="populating_bucketed_tables?">

	<source><![CDATA[set hive.enforce.bucketing = true;
	FROM user_id
	INSERT OVERWRITE TABLE user_info_bucketed
	PARTITION (ds='2009-02-25')
	SELECT userid, firstname, lastname WHERE ds='2009-02-25';
	]]></source>

	<p>The command <strong>set hive.enforce.bucketing = true;</strong> allows the
	correct number of reducers and the cluster by column to be automatically selected
	based on the table. Otherwise, you would need to set the number of reducers to be
	the same as the number of buckets with
	<strong>set mapred.reduce.tasks = 256;</strong> and have a
	<strong>CLUSTER BY ...</strong> clause in the select.</p>

	</section>

	<section name="Bucketing Explained" href="bucketing_explained?">
	<p>
	How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.
	</p>

	</section>

	<section name="What can go wrong?" href="bucketing_gone_wrong?">
	<p>
	So, what can go wrong? As long as you
	<strong>set hive.enforce.bucketing = true</strong>, and use the syntax above,
	the tables should be populated properly. Things can go wrong if the bucketing
	column type is different during the insert and on read, or if you manually
	cluster by a value that's different from the table definition.
	</p>
	</section>
	</body>
	</document>