
Hoodie - Spark Library For Upserts & Incremental Consumption


Core Functionality

Hoodie provides the following abilities on a Hive table:

  • Upsert (how do I change the table efficiently?)
  • Incremental consumption (how do I obtain records that changed?)

Ultimately, the resulting Hive table is also queryable via Spark & Presto.

Code & Project Structure

  • hoodie-client : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table
  • hoodie-common : Common code shared between different artifacts of Hoodie

We have embraced the Google Java code style. Please set up your IDE accordingly, using the style files from https://github.com/google/styleguide

Quickstart

Check out the code and import it into IntelliJ as a normal Maven project.

You might want to add your Spark assembly jar to the project dependencies under “Module Settings”, to be able to run Spark from the IDE.

Set up your local Hadoop/Hive test environment. See this for reference
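Before running the examples below, you can also build everything from the command line, since this is a standard Maven project (a sketch from the repository root; -DskipTests simply skips the unit tests):

mvn clean install -DskipTests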

Run the Hoodie Test Job

Create the output folder on your local HDFS

hdfs dfs -mkdir -p /tmp/hoodie/sample-table

You can run the HoodieClientExample class to place a set of inserts + updates onto your HDFS at /tmp/hoodie/sample-table
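If running it from the IDE is inconvenient, a rough spark-submit sketch is below; both the fully qualified class name and the jar path are placeholders you should verify against your checkout of hoodie-client:

spark-submit --class com.uber.hoodie.HoodieClientExample --master local[2] <path-to-hoodie-client-jar>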

Access via Hive

Add the hoodie-mr jar so that Hive can pick up the right files to answer the query.

hive> add jar file:///tmp/hoodie-mr-0.1.jar;
Added [file:///tmp/hoodie-mr-0.1.jar] to class path
Added resources: [file:///tmp/hoodie-mr-0.1.jar]
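If you would rather not add the jar in every session, Hive can also pick it up at startup via HIVE_AUX_JARS_PATH (same illustrative jar path as above; whether this is honored depends on your Hive setup):

export HIVE_AUX_JARS_PATH=/tmp/hoodie-mr-0.1.jar
hive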

Then, you need to create a table and register the sample partitions

drop table hoodie_test;
CREATE EXTERNAL TABLE hoodie_test(`_row_key`  string,
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
 rider string,
 driver string,
 begin_lat double,
 begin_lon double,
 end_lat double,
 end_lon double,
 fare double)
PARTITIONED BY (`datestr` string)
ROW FORMAT SERDE
   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
   'com.uber.hoodie.hadoop.HoodieInputFormat'
OUTPUTFORMAT
   'com.uber.hoodie.hadoop.HoodieOutputFormat'
LOCATION
   'hdfs:///tmp/hoodie/sample-table';

ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2016-03-15') LOCATION 'hdfs:///tmp/hoodie/sample-table/2016/03/15';
ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-16') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/16';
ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/17';
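To double-check that the partitions were registered, you can list them (plain HiveQL, nothing Hoodie-specific):

hive> show partitions hoodie_test;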

Let's first perform a query on the latest committed snapshot of the table

hive> select count(*) from hoodie_test;
...
OK
100
Time taken: 18.05 seconds, Fetched: 1 row(s)
hive>

Let's now perform a query to obtain the rows changed since a commit in the past

hive> set hoodie.scan.mode=INCREMENTAL;
hive> set hoodie.last.commitTs=001;
hive> select `_hoodie_commit_time`, rider, driver from hoodie_test limit 10;
OK
All commits :[001, 002]
002	rider-001	driver-001
002	rider-001	driver-001
002	rider-002	driver-002
002	rider-001	driver-001
002	rider-001	driver-001
002	rider-002	driver-002
002	rider-001	driver-001
002	rider-002	driver-002
002	rider-002	driver-002
002	rider-001	driver-001
Time taken: 0.056 seconds, Fetched: 10 row(s)
hive>
hive>

Access via Spark

Spark is super easy once you get Hive working as above. Just spin up a Spark shell as below:

$ cd $SPARK_INSTALL
$ export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
$ spark-shell --jars /tmp/hoodie-mr-0.1.jar --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false


scala> sqlContext.sql("show tables").show(10000)
scala> sqlContext.sql("describe hoodie_test").show(10000)
scala> sqlContext.sql("select count(*) from hoodie_test").show(10000)

Access via Presto

Check out the ‘hoodie-integration’ branch, build it, and place your installation somewhere.

  • Copy the hoodie-mr jar into $PRESTO_INSTALL/plugin/hive-hadoop2/

  • Change your catalog config to make Presto respect the HoodieInputFormat:

$ cat etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:10000
hive.respect-input-format-splits=true

Start up your server and you should be able to query the same Hive table via Presto:

show columns from hive.default.hoodie_test;
select count(*) from hive.default.hoodie_test;
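These can be issued from the Presto CLI; a minimal sketch, assuming a coordinator on localhost:8080 (adjust to your installation):

$ presto --server localhost:8080 --catalog hive --schema default
presto:default> select count(*) from hoodie_test;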

NOTE: As of now, Presto has trouble accessing HDFS locally; as a workaround, create a new table as above, backed by the local filesystem (file://).

Planned

  • Support for Self Joins - As of now, you cannot incrementally consume the same table more than once, since the InputFormat does not understand the QueryPlan.
  • Hoodie Spark Datasource - Allows for reading and writing data back using Apache Spark natively (without falling back to InputFormat), which can be more performant
  • Hoodie Presto Connector - Allows for querying data managed by Hoodie using Presto natively, which can again boost performance

Hoodie Admin CLI

Launching Command Line

<todo - change this after packaging is done>

  • mvn clean install in hoodie-cli
  • ./hoodie-cli
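In shell terms, that amounts to something like the following (paths relative to the repository root; the script name matches the session below):

$ cd hoodie-cli
$ mvn clean install
$ ./hoodie-cli.sh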

If all goes well, you should get a command prompt similar to this one:

prasanna@:~/hoodie/hoodie-cli$ ./hoodie-cli.sh 
16/07/13 21:27:47 INFO xml.XmlBeanDefinitionReader: Loading XML bean definitions from URL [jar:file:/home/prasanna/hoodie/hoodie-cli/target/hoodie-cli-0.1-SNAPSHOT.jar!/META-INF/spring/spring-shell-plugin.xml]
16/07/13 21:27:47 INFO support.GenericApplicationContext: Refreshing org.springframework.context.support.GenericApplicationContext@372688e8: startup date [Wed Jul 13 21:27:47 UTC 2016]; root of context hierarchy
16/07/13 21:27:47 INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330 'javax.inject.Inject' annotation found and supported for autowiring
============================================
*                                          *
*     _    _                 _ _           *
*    | |  | |               | (_)          *
*    | |__| | ___   ___   __| |_  ___      *
*    |  __  |/ _ \ / _ \ / _` | |/ _ \     *
*    | |  | | (_) | (_) | (_| | |  __/     *
*    |_|  |_|\___/ \___/ \__,_|_|\___|     *
*                                          *
============================================

Welcome to Hoodie CLI. Please type help if you are looking for help. 
hoodie->

Commands

  • connect --path [dataset_path] : Connect to the specific dataset by its path

  • commits show : Show all details about the commits

  • commits refresh : Refresh the commits from HDFS

  • commit rollback --commit [commitTime] : Rollback a commit

  • commit showfiles --commit [commitTime] : Show details of a commit (lists all the files modified along with other metrics)

  • commit showpartitions --commit [commitTime] : Show details of a commit (lists statistics aggregated at partition level)

  • commits compare --path [otherBasePath] : Compares the current dataset's commits with those at the path provided and tells you how many commits it is behind or ahead

  • stats wa : Calculate commit level and overall write amplification factor (total records written / total records upserted)

  • help
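Putting a few of these together, a typical session might look like the sketch below (the dataset path is the sample table from earlier, and the commit time comes from the Hive output above; command output omitted):

hoodie->connect --path /tmp/hoodie/sample-table
hoodie->commits show
hoodie->commit showpartitions --commit 002
hoodie->stats wa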

Contributing

We :heart: contributions. If you find a bug in the library or would like to add new features, go ahead and open issues or pull requests against this repo. Before you do so, please sign the Uber CLA. Also, be sure to write unit tests for your bug fix or feature to show that it works as expected.