How to Hive
-----------

DISCLAIMER: This is a prototype version of Hive and is NOT production
quality. It is provided mainly as a way of illustrating the capabilities
of Hive and is provided as-is. However - we are working hard to make
Hive a production quality system. Hive has only been tested on unix (linux)
and mac systems using Java 1.6 for now - although it may very well work
on other similar platforms. It does not work on Cygwin right now. Most of
our testing has been on Hadoop 0.17 - so we would advise running it against
this version of hadoop, even though it may compile/work against other versions.

Useful mailing lists
--------------------

1. hive-user@hadoop.apache.org - To discuss and ask usage questions. Send an
   empty email to hive-user-subscribe@hadoop.apache.org in order to subscribe
   to this mailing list.

2. hive-core@hadoop.apache.org - For discussions around code, design and
   features. Send an empty email to hive-core-subscribe@hadoop.apache.org in
   order to subscribe to this mailing list.

3. hive-commits@hadoop.apache.org - To monitor commits to the source
   repository. Send an empty email to hive-commits-subscribe@hadoop.apache.org
   in order to subscribe to this mailing list.

Downloading and building
------------------------

- svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive_trunk
- cd hive_trunk
- ant -Dtarget.dir=<your-install-dir> -Dhadoop.version='0.17.0' package

You can replace 0.17.0 with 0.18.1, 0.19.0 etc. to match the version of hadoop
that you are using.
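
For example, to build against hadoop 0.18.1 and install into /home/me/hive-dist
(both the path and the version here are purely illustrative):

- ant -Dtarget.dir=/home/me/hive-dist -Dhadoop.version='0.18.1' package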

In the rest of the README, we use dist and <install-dir> interchangeably.

Running Hive
------------

Hive uses hadoop, which means:
- you must have hadoop in your path, OR
- export HADOOP=<hadoop-install-dir>/bin/hadoop

To use the hive command line interface (cli) from the shell:
$ cd <install-dir>
$ bin/hive
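
For example, a full session on a machine where hadoop is not already in the
path might look like this (the hadoop install location is illustrative):

$ export HADOOP=/usr/local/hadoop-0.17.0/bin/hadoop
$ cd <install-dir>
$ bin/hive
hive>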

Using Hive
----------

Configuration management overview
---------------------------------

- hive configuration is stored in <install-dir>/conf/hive-default.xml
  and the log4j configuration in <install-dir>/conf/hive-log4j.properties
- hive configuration is an overlay on top of hadoop - meaning the
  hadoop configuration variables are inherited by default.
- hive configuration can be manipulated by:
  o editing hive-default.xml and defining any desired variables
    (including hadoop variables) in it
  o from the cli using the SET command (see below)
  o by invoking hive using the syntax:
    * bin/hive -hiveconf x1=y1 -hiveconf x2=y2
      this sets the variables x1 and x2 to y1 and y2
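
For example, to point a single session at a particular job tracker (the host
and port here are illustrative):

  bin/hive -hiveconf mapred.job.tracker=myhost.mycompany.com:50030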


Error Logs
----------
Hive uses log4j for logging. By default logs are not emitted to the
console by the cli. They are stored in the file:
- /tmp/{user.name}/hive.log

If the user wishes, the logs can be emitted to the console by adding
the argument shown below:
- bin/hive -hiveconf hive.root.logger=INFO,console

Note that setting hive.root.logger via the SET command does not
change logging properties since they are determined at initialization time.

Error logs are very useful to debug problems. Please send them with
any bugs (of which there are many!) to the JIRA at
https://issues.apache.org/jira/browse/HIVE.


DDL Operations
--------------

Creating Hive tables and browsing through them:

hive> CREATE TABLE pokes (foo INT, bar STRING);

Creates a table called pokes with two columns, the first an integer and
the second a string.

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Creates a table called invites with two columns and a partition column
called ds. The partition column is a virtual column. It is not part
of the data itself, but is derived from the partition that a particular
dataset is loaded into.


By default tables are assumed to be of text input format and the
delimiters are assumed to be ^A (ctrl-a). We will soon publish additional
commands/recipes to add binary (sequencefile) data, configurable
delimiters etc.
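
For example, a valid row for the pokes table in the default format is just the
two fields joined by a ctrl-a byte. One way to generate such a file from the
shell (the file name and values are illustrative):

$ printf '1\001value_1\n2\001value_2\n' > /tmp/pokes.txt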

hive> SHOW TABLES;

lists all the tables.

hive> SHOW TABLES '.*s';

lists all the tables that end with 's'. The pattern matching follows Java
regular expressions. Check out this link for documentation:
http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html

hive> DESCRIBE invites;

shows the list of columns.

hive> DESCRIBE EXTENDED invites;

shows the list of columns plus any other meta information about the table.

Altering tables. Table names can be changed and columns can be added or replaced:

hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE pokes REPLACE COLUMNS (c1 INT, c2 STRING);
hive> ALTER TABLE events RENAME TO 3koobecaf;

Dropping tables:
hive> DROP TABLE pokes;


Metadata Store
--------------

Metadata is stored in an embedded Derby database whose location is determined
by the hive configuration variable named javax.jdo.option.ConnectionURL. By
default (see conf/hive-default.xml), this location is ./metastore_db.

Right now, in the default configuration, this metadata can only be seen by
one user at a time.

The metastore can be stored in any database that is supported by JPOX. The
location and the type of the RDBMS can be controlled by the two variables
'javax.jdo.option.ConnectionURL' and 'javax.jdo.option.ConnectionDriverName'.
Refer to the JDO (or JPOX) documentation for more details on supported
databases. The database schema is defined in the JDO metadata annotations
file package.jdo at src/contrib/hive/metastore/src/model.
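
For example, to move the embedded Derby database to a fixed location, the
ConnectionURL can be overridden in hive-default.xml along these lines (the
databaseName path is illustrative):

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/home/me/metastore_db;create=true</value>
  </property>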

In the future, the metastore itself could be a standalone server.


DML Operations
--------------

Loading data from flat files into Hive:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

Loads a file that contains two columns separated by ctrl-a into the pokes
table. 'LOCAL' signifies that the input file is on the local file system.
If 'LOCAL' is omitted then the file is looked for in HDFS.

The keyword 'OVERWRITE' signifies that existing data in the table is deleted.
If the 'OVERWRITE' keyword is omitted, data files are appended to existing
data sets.
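
For example, to load a file that has already been copied into HDFS (the path
here is illustrative), omit LOCAL:

hive> LOAD DATA INPATH '/user/me/kv1.txt' OVERWRITE INTO TABLE pokes;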

NOTES:
- NO verification of the data against the schema is performed.
- If the file is in hdfs, it is moved into the hive-controlled file system
  namespace. The root of the hive directory is specified by the option
  hive.metastore.warehouse.dir in hive-default.xml. We would advise that
  this directory exist before you try to create tables via Hive.

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the
table invites. The table invites must be created as partitioned by the key
ds for this to succeed.

Loading/Extracting data using Queries
-------------------------------------

Runtime configuration
---------------------

- Hive queries are executed as map-reduce jobs and as such the behavior
  of such queries can be controlled by the hadoop configuration variables.

- The cli can be used to set any hadoop (or hive) configuration variable. For example:
  o hive> SET mapred.job.tracker=myhost.mycompany.com:50030
  o hive> SET -v
  The latter shows all the current settings. Without the -v option, only the
  variables that differ from the base hadoop configuration are displayed.

- In particular, the number of reducers should be set to a reasonable number
  to get good performance (the default is 1!), as in the example below.
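
  A minimal illustration (the value 32 is arbitrary - pick one appropriate
  for your cluster and data size):

  hive> SET mapred.reduce.tasks=32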


EXAMPLE QUERIES
---------------

Some example queries are shown below. More are available in the hive code:
ql/src/test/queries/{positive,clientpositive}.

SELECTS and FILTERS
-------------------

hive> SELECT a.foo FROM invites a;

Selects column 'foo' from all rows of the invites table. The results are not
stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a hive table, local
directory or HDFS directory) is optional.

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a;

Selects all rows from the invites table into an HDFS directory. The result
data is in files (the number depending on the number of mappers) in that
directory.
NOTE: partition columns, if any, are selected by the use of *. They can also
be specified in the projection clauses, as shown below.
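
Here the partition column ds of invites is named explicitly in the projection
(a minimal sketch):

hive> SELECT a.foo, a.ds FROM invites a;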

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

Selects all rows from the pokes table into a local directory.

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

The last query computes the sum of a column. avg, min and max can also be used.

NOTE: there are some flaws with the type system that cause doubles to be
returned where integer types would be expected. We expect to fix these in
the coming week.

GROUP BY
--------

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

NOTE: Currently Hive always uses a two-stage map-reduce for the groupby
operation. This is to handle skews in the input data. We will be optimizing
this in the coming weeks.

JOIN
----

hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

MULTITABLE INSERT
-----------------
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

STREAMING
---------
hive> FROM invites a INSERT OVERWRITE TABLE events
    > MAP a.foo, a.bar USING '/bin/cat'
    > AS oof, rab WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (as in
hadoop streaming). Similarly, streaming can be used on the reduce side; see
the sketch below, and look for the files mapreduce*.q among the example
queries.
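
A sketch of reduce-side streaming, following the pattern of the mapreduce*.q
test files (the use of /bin/cat for both scripts and the CLUSTER BY key are
purely illustrative):

hive> FROM (
    >   FROM invites a
    >   MAP a.foo, a.bar USING '/bin/cat' AS oof, rab
    >   CLUSTER BY oof
    > ) tmap
    > INSERT OVERWRITE TABLE events
    > REDUCE tmap.oof, tmap.rab USING '/bin/cat' AS oof, rab;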

KNOWN BUGS/ISSUES
-----------------
* The hive cli may hang for a couple of minutes because of a bug in getting
  metadata from the derby database. Let it run and you'll be fine!
* The hive cli creates derby.log in the directory from which it was invoked.
* COUNT(*) does not work for now. Use COUNT(1) instead.
* ORDER BY is not supported yet.
* CASE is not supported yet.
* Only string and thrift types (http://developers.facebook.com/thrift) have been tested.
* When doing a join, put the table with the largest number of rows per join
  key rightmost in the JOIN clause, as in the example below. Otherwise you
  may see OutOfMemory errors.
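
  If big_t has many more rows per key than small_t (the table names and
  columns here are hypothetical), prefer:

  hive> FROM small_t t1 JOIN big_t t2 ON (t1.key = t2.key) INSERT OVERWRITE TABLE events SELECT t1.key, t2.value;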

FUTURE FEATURES
---------------
* EXPLODE function to generate multiple rows from a column of list type.
* Table statistics for query optimization.

Developing Hive using Eclipse
-----------------------------
1. Follow the 3 steps in the "Downloading and building" section above.

2. Change the first line in conf/hive-log4j.properties to the following
   line to see error messages on the console:
   hive.root.logger=INFO,console

3. Run the tests to make sure everything works. This may take 20 minutes.
   ant -Dhadoop.version='0.17.0' -logfile test.log test

4. Create an empty java project in Eclipse and close it.

5. Add the following section to the Eclipse project's .project file, replacing
   /xxx/ with the directory that contains your hive_trunk checkout:

   <linkedResources>
     <link>
       <name>cli_src_java</name>
       <type>2</type>
       <location>/xxx/hive_trunk/cli/src/java</location>
     </link>
     <link>
       <name>common_src_java</name>
       <type>2</type>
       <location>/xxx/hive_trunk/common/src/java</location>
     </link>
     <link>
       <name>metastore_src_gen-javabean</name>
       <type>2</type>
       <location>/xxx/hive_trunk/metastore/src/gen-javabean</location>
     </link>
     <link>
       <name>metastore_src_java</name>
       <type>2</type>
       <location>/xxx/hive_trunk/metastore/src/java</location>
     </link>
     <link>
       <name>metastore_src_model</name>
       <type>2</type>
       <location>/xxx/hive_trunk/metastore/src/model</location>
     </link>
     <link>
       <name>ql_src_java</name>
       <type>2</type>
       <location>/xxx/hive_trunk/ql/src/java</location>
     </link>
     <link>
       <name>serde_src_gen-java</name>
       <type>2</type>
       <location>/xxx/hive_trunk/serde/src/gen-java</location>
     </link>
     <link>
       <name>serde_src_java</name>
       <type>2</type>
       <location>/xxx/hive_trunk/serde/src/java</location>
     </link>
   </linkedResources>

6. Add the following list to the Eclipse project's .classpath file, again
   replacing /xxx/ as above. Note that the hadoop jar on the last line should
   match the hadoop version you built against:

   <classpathentry kind="src" path="src"/>
   <classpathentry kind="src" path="metastore_src_model"/>
   <classpathentry kind="src" path="metastore_src_gen-javabean"/>
   <classpathentry kind="src" path="serde_src_gen-java"/>
   <classpathentry kind="src" path="cli_src_java"/>
   <classpathentry kind="src" path="ql_src_java"/>
   <classpathentry kind="src" path="metastore_src_java"/>
   <classpathentry kind="src" path="serde_src_java"/>
   <classpathentry kind="src" path="common_src_java"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/asm-3.1.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/commons-cli-2.0-SNAPSHOT.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/commons-collections-3.2.1.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/commons-lang-2.4.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/commons-logging-1.0.4.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/commons-logging-api-1.0.4.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/derby.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/jdo2-api-2.1.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/jpox-core-1.2.2.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/jpox-enhancer-1.2.2.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/jpox-rdbms-1.2.2.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/libfb303.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/libthrift.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/log4j-1.2.15.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/lib/velocity-1.5.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/ql/lib/antlr-3.0.1.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/ql/lib/antlr-runtime-3.0.1.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/ql/lib/commons-jexl-1.1.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/ql/lib/stringtemplate-3.1b1.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/cli/lib/jline-0.9.94.jar"/>
   <classpathentry kind="lib" path="/xxx/hive_trunk/build/hadoopcore/hadoop-0.19.0/hadoop-0.19.0-core.jar"/>

7. Try building hive inside Eclipse, and develop using Eclipse.


Development Tips
----------------
* You can use the following lines to run a specific testcase with a specific
  query file:
  ant -Dhadoop.version='0.17.0' -Dtestcase=TestParse -Dqfile=udf4.q test
  ant -Dhadoop.version='0.17.0' -Dtestcase=TestParseNegative -Dqfile=invalid_dot.q test
  ant -Dhadoop.version='0.17.0' -Dtestcase=TestCliDriver -Dqfile=udf1.q test
  ant -Dhadoop.version='0.17.0' -Dtestcase=TestNegativeCliDriver -Dqfile=invalid_tbl_name.q test