| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <document> |
| |
| <properties> |
| <title>Hadoop Hive- Working with Bucketed Tables</title> |
| <author email="hive-user@hadoop.apache.org">Hadoop Hive Documentation Team</author> |
| </properties> |
| |
| <body> |
| |
| <section name="Defining Bucketed Tables" href="defining_bucketed_tables?"> |
| |
| <p> |
| This is a brief example on creating a populating bucketed tables. Bucketed tables |
| are fantastic in that they allow much more efficient sampling than do non-bucketed |
| tables, and they may later allow for time saving operations such as mapside joins. |
| However, the bucketing specified at table creation is not enforced when the table |
| is written to, and so it is possible for the table's metadata to advertise |
| properties which are not upheld by the table's actual layout. This should obviously |
| be avoided. Here's how to do it right. |
| </p> |
| <p>First there’s table creation:</p> |
| |
| <source><![CDATA[CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING) |
| COMMENT 'A bucketed copy of user_info' |
| PARTITIONED BY(ds STRING) |
| CLUSTERED BY(user_id) INTO 256 BUCKETS; |
| ]]></source> |
| |
| <p>notice that we define user_id as the bucket column</p> |
| </section> |
| |
| <section name="Populating Bucketed Tables" href="populating_bucketed_tables?"> |
| |
| <source><![CDATA[set hive.enforce.bucketing = true; |
| FROM user_id |
| INSERT OVERWRITE TABLE user_info_bucketed |
| PARTITION (ds='2009-02-25') |
| SELECT userid, firstname, lastname WHERE ds='2009-02-25'; |
| ]]></source> |
| |
| <p>The command <strong>set hive.enforce.bucketing = true;</strong> allows the |
| correct number of reducers and the cluster by column to be automatically selected |
| based on the table. Otherwise, you would need to set the number of reducers to be |
| the same as the number of buckets with |
| <strong>set mapred.reduce.tasks = 256;</strong> and have a |
| <strong>CLUSTER BY ...</strong> clause in the select.</p> |
| |
| </section> |
| |
| <section name="Bucketing Explained" href="bucketing_explained?"> |
| <p> |
| How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets. |
| </p> |
| |
| </section> |
| |
| <section name="What can go wrong?" href="bucketing_gone_wrong?"> |
| <p> |
| So, what can go wrong? As long as you |
| <strong>set hive.enforce.bucketing = true</strong>, and use the syntax above, |
| the tables should be populated properly. Things can go wrong if the bucketing |
| column type is different during the insert and on read, or if you manually |
| cluster by a value that's different from the table definition. |
| </p> |
| </section> |
| </body> |
| </document> |