| tag | 98e28b7e6f1c0e83f33ca5a93799c07f466a84b7 |
|---|---|
| tagger | Prasanna Rajaperumal \<prasanna@uber.com\> Thu Jan 05 19:43:28 2017 -0800 |
| object | c1f2d1e456c385b6c1b5c07e95715acd0d5bdeb1 |

[maven-release-plugin] copy for tag hoodie-0.2.8

| commit | c1f2d1e456c385b6c1b5c07e95715acd0d5bdeb1 |
|---|---|
| author | Prasanna Rajaperumal \<prasanna@uber.com\> Thu Jan 05 19:43:25 2017 -0800 |
| committer | Prasanna Rajaperumal \<prasanna@uber.com\> Thu Jan 05 19:43:25 2017 -0800 |
| tree | 8b23626e193cb04ac4ff1fe7e0150effa537426f |
| parent | 9071220a300726f05948ab9532596e064e14711b |

[maven-release-plugin] prepare release hoodie-0.2.8
Hoodie provides the following abilities on a Hive table:

Ultimately, it makes the built Hive table queryable via Spark and Presto as well.
We have embraced the Google Java code style. Please set up your IDE accordingly with the style files from [here](https://github.com/google/styleguide).

Check out the code and pull it into IntelliJ as a normal Maven project.

You might want to add your Spark assembly jar to the project dependencies under "Module Settings", to be able to run Spark from the IDE.

Set up your local Hadoop/Hive test environment. See this for reference.
Create the output folder on your local HDFS:

```
hdfs dfs -mkdir -p /tmp/hoodie/sample-table
```
You can run the HoodieClientExample class to place a set of inserts and updates onto your HDFS at /tmp/hoodie/sample-table.

Add in the hoodie-mr jar, so Hive can pick up the right files to answer the query.
```
hive> add jar file:///tmp/hoodie-mr-0.1.jar;
Added [file:///tmp/hoodie-mr-0.1.jar] to class path
Added resources: [file:///tmp/hoodie-mr-0.1.jar]
```
Then, you need to create the table and register the sample partitions:
```
drop table hoodie_test;

CREATE EXTERNAL TABLE hoodie_test(
  `_row_key` string,
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  rider string,
  driver string,
  begin_lat double,
  begin_lon double,
  end_lat double,
  end_lon double,
  fare double)
PARTITIONED BY (`datestr` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'com.uber.hoodie.hadoop.HoodieInputFormat'
  OUTPUTFORMAT 'com.uber.hoodie.hadoop.HoodieOutputFormat'
LOCATION 'hdfs:///tmp/hoodie/sample-table';

ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2016-03-15') LOCATION 'hdfs:///tmp/hoodie/sample-table/2016/03/15';
ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-16') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/16';
ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/17';
```
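Registering partitions by hand gets tedious, and the ALTER TABLE statements above follow a mechanical pattern. As a sketch (this glue script is not part of Hoodie; it simply assumes the sample layout of `YYYY/MM/DD` partition directories under the base path), you could generate them:

```shell
# Sketch: emit the ALTER TABLE statements for a list of YYYY/MM/DD
# partition directories under the sample table's base path.
# Illustrative glue only, not part of Hoodie.
base='hdfs:///tmp/hoodie/sample-table'
for d in 2016/03/15 2015/03/16 2015/03/17; do
  datestr=${d//\//-}   # turn 2016/03/15 into 2016-03-15
  echo "ALTER TABLE \`hoodie_test\` ADD IF NOT EXISTS PARTITION (datestr='${datestr}') LOCATION '${base}/${d}';"
done
```

The output can be piped into `hive -e "$(...)"` or saved to an `.hql` file and run in one shot.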
Let's first perform a query on the latest committed snapshot of the table
```
hive> select count(*) from hoodie_test;
...
OK
100
Time taken: 18.05 seconds, Fetched: 1 row(s)
hive>
```
Let's now perform a query to obtain the rows changed since a commit in the past:
```
hive> set hoodie.scan.mode=INCREMENTAL;
hive> set hoodie.last.commitTs=001;
hive> select `_hoodie_commit_time`, rider, driver from hoodie_test limit 10;
OK
All commits :[001, 002]
002	rider-001	driver-001
002	rider-001	driver-001
002	rider-002	driver-002
002	rider-001	driver-001
002	rider-001	driver-001
002	rider-002	driver-002
002	rider-001	driver-001
002	rider-002	driver-002
002	rider-002	driver-002
002	rider-001	driver-001
Time taken: 0.056 seconds, Fetched: 10 row(s)
hive>
```
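Conceptually, the incremental mode filters out rows whose `_hoodie_commit_time` is not later than the last commit you already consumed. A toy illustration of that filter follows (plain awk over made-up rows; this is not how HoodieInputFormat is actually implemented, just the idea):

```shell
# Toy illustration of incremental pull: keep only rows committed after
# the last commit we already consumed. The rows below are made up.
last_commit="001"
rows='001 rider-000 driver-000
002 rider-001 driver-001
002 rider-002 driver-002'
incremental=$(echo "$rows" | awk -v last="$last_commit" '$1 > last')
echo "$incremental"
```

Only the two rows from commit 002 survive the filter; the row already seen at commit 001 is dropped.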
Spark is super easy once you get Hive working as above. Just spin up a Spark shell as below:
```
$ cd $SPARK_INSTALL
$ export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
$ spark-shell --jars /tmp/hoodie-mr-0.1.jar --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false

scala> sqlContext.sql("show tables").show(10000)
scala> sqlContext.sql("describe hoodie_test").show(10000)
scala> sqlContext.sql("select count(*) from hoodie_test").show(10000)
```
Check out the 'hoodie-integration' branch, build off it, and place your installation somewhere.
Copy the hoodie-mr jar into $PRESTO_INSTALL/plugin/hive-hadoop2/
Change your catalog config so Presto respects the HoodieInputFormat:
```
$ cat etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:10000
hive.respect-input-format-splits=true
```
Start up your server, and you should be able to query the same Hive table via Presto:
```
show columns from hive.default.hoodie_test;
select count(*) from hive.default.hoodie_test;
```
NOTE: As of now, Presto has trouble accessing HDFS locally; as a workaround, create a new table as above, backed by the local filesystem (file://).
<todo - change this after packaging is done>
If all is good, you should get a command prompt similar to this one:
```
prasanna@:~/hoodie/hoodie-cli$ ./hoodie-cli.sh
16/07/13 21:27:47 INFO xml.XmlBeanDefinitionReader: Loading XML bean definitions from URL [jar:file:/home/prasanna/hoodie/hoodie-cli/target/hoodie-cli-0.1-SNAPSHOT.jar!/META-INF/spring/spring-shell-plugin.xml]
16/07/13 21:27:47 INFO support.GenericApplicationContext: Refreshing org.springframework.context.support.GenericApplicationContext@372688e8: startup date [Wed Jul 13 21:27:47 UTC 2016]; root of context hierarchy
16/07/13 21:27:47 INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330 'javax.inject.Inject' annotation found and supported for autowiring
============================================
*                                          *
*    _    _                 _ _            *
*   | |  | |               | (_)           *
*   | |__| | ___   ___   __| |_  ___       *
*   |  __  |/ _ \ / _ \ / _` | |/ _ \      *
*   | |  | | (_) | (_) | (_| | |  __/      *
*   |_|  |_|\___/ \___/ \__,_|_|\___|      *
*                                          *
============================================
Welcome to Hoodie CLI. Please type help if you are looking for help.
hoodie->
```
* `connect --path [dataset_path]` : Connect to the specified dataset by its path
* `commits show` : Show all details about the commits
* `commits refresh` : Refresh the commits from HDFS
* `commit rollback --commit [commitTime]` : Roll back a commit
* `commit showfiles --commit [commitTime]` : Show details of a commit (lists all the files modified, along with other metrics)
* `commit showpartitions --commit [commitTime]` : Show details of a commit (lists statistics aggregated at the partition level)
* `commits compare --path [otherBasePath]` : Compare the current dataset's commits with those at the given path and report how many commits it is behind or ahead
* `stats wa` : Calculate the commit-level and overall write amplification factor (total records written / total records upserted)
* `help`
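The number reported by `stats wa` is plain arithmetic: write amplification = total records written / total records upserted. A quick sketch of the calculation, with made-up counts (the CLI derives the real counts from commit metadata):

```shell
# Write amplification = total records written / total records upserted.
# The counts below are made up for illustration.
total_written=250
total_upserted=100
wa=$(awk -v w="$total_written" -v u="$total_upserted" 'BEGIN { printf "%.2f", w / u }')
echo "write amplification factor: $wa"
```

Here, each upserted record caused 2.5 records to be rewritten on storage, which is the cost of Hoodie's copy-on-write style updates.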
We :heart: contributions. If you find a bug in the library or would like to add new features, go ahead and open an issue or a pull request against this repo. Before you do so, please sign the Uber CLA. Also, be sure to write unit tests for your bug fix or feature, to show that it works as expected.