```
commit    dd92a9db6b288def8159f30336f6793239882c9d
author    asingh <asingh@cloudera.com>  Tue May 26 14:31:51 2015 -0700
committer Ryan Blue <blue@apache.org>  Tue May 26 14:31:51 2015 -0700
tree      f324d19957cd8626ea2667804c5c8f39d4dbde11
parent    8769d0f2cc4b7555dc025b7c0e49a81346a1e2dd
```
PARQUET-223: Add builders for MAP and LIST types

As of now, Parquet does not provide builders for Maps and Lists. This leaves margin for user errors. Having Map and List builders will make it easier for users to build these types.

Author: asingh <asingh@cloudera.com>

Closes #148 from SinghAsDev/map and squashes the following commits:

- cc7da06 [asingh] Pull changes made by Ryan
- 825b5b8 [asingh] Remove non-functional changes
- bec675b [asingh] Remove required and optional version of methods that take pre-built Type
- 6dcaa78 [asingh] Address review comments and some clean up
- 544d1e4 [asingh] Add key(Type) and value(Type) variants to MapBuilder
- f2a1697 [asingh] Add listKey support
- 68c06f5 [asingh] Add support for null value in MapBuilder
- f31f2b0 [asingh] Add more tests to cover list and map value types in map builder
- f035439 [asingh] Add Map and List value types to map
- 1afa2c7 [asingh] Address review comments
- 484495b [asingh] PARQUET-223: Add builders for MAP and LIST types
Parquet-MR contains the Java implementation of the Parquet format. Parquet is a columnar storage format for Hadoop; it provides efficient storage and encoding of data. Parquet uses the record shredding and assembly algorithm described in the Dremel paper to represent nested structures.
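As a rough illustration of the record-shredding idea: for a repeated field, each stored value carries a repetition level (did it start a new record, or continue the previous value's list?) and a definition level (how much of its path was actually present). The sketch below is a self-contained toy for a single repeated field with maximum repetition and definition level 1; it is not Parquet's actual internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy Dremel-style shredding of one repeated field (e.g. "repeated binary name").
// Each emitted triple is (value, repetitionLevel, definitionLevel).
public class ShredSketch {
  public static List<Object[]> shred(List<List<String>> records) {
    List<Object[]> column = new ArrayList<>();
    for (List<String> record : records) {
      if (record.isEmpty()) {
        // Field absent from this record: a null placeholder with
        // definition level 0 preserves the record boundary.
        column.add(new Object[] {null, 0, 0});
      } else {
        for (int i = 0; i < record.size(); i++) {
          // Repetition level 0 starts a new record; 1 continues its list.
          column.add(new Object[] {record.get(i), i == 0 ? 0 : 1, 1});
        }
      }
    }
    return column;
  }

  public static void main(String[] args) {
    List<Object[]> col = shred(Arrays.asList(
        Arrays.asList("alice", "bob"),      // one record with two values
        Collections.<String>emptyList(),    // a record where the field is absent
        Arrays.asList("carol")));
    for (Object[] triple : col) {
      System.out.println(Arrays.toString(triple));
    }
    // prints: [alice, 0, 1], [bob, 1, 1], [null, 0, 0], [carol, 0, 1]
  }
}
```

Because every value records these two levels, the column can be read back and reassembled into the original nested records without any row-oriented bookkeeping.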
You can find some details about the format and intended use cases in our Hadoop Summit 2013 presentation.
Parquet-MR uses Maven to build and depends on both the Thrift and Protocol Buffers (protoc) compilers.
To build and install the protobuf compiler, run:
```
wget http://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
tar xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure
make
sudo make install
sudo ldconfig
```
To build and install the thrift compiler, run:
```
wget -nv http://archive.apache.org/dist/thrift/0.7.0/thrift-0.7.0.tar.gz
tar xzf thrift-0.7.0.tar.gz
cd thrift-0.7.0
chmod +x ./configure
./configure --disable-gen-erl --disable-gen-hs --without-ruby --without-haskell --without-erlang
sudo make install
```
Once protobuf and thrift are available in your path, you can build the project by running:
```
mvn clean install
```
Parquet is a very active project, and new features are being added quickly; below is the state as of June 2013.
Input and Output formats are provided for Hadoop MapReduce. Note that to use an Input or Output format, you need to implement a WriteSupport or ReadSupport class, which implements the conversion of your objects to and from a Parquet schema.
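The shape of that conversion can be sketched as an event stream: a WriteSupport-style class walks one of your objects and emits start/end field events, which the writer turns into columnar data. The interface and class below are simplified, self-contained stand-ins for illustration only; they are not the real Parquet WriteSupport/RecordConsumer classes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for a Parquet-style record consumer (not the real API).
interface EventConsumer {
  void startMessage();
  void startField(String name, int index);
  void addValue(Object value);
  void endField(String name, int index);
  void endMessage();
}

// A WriteSupport-style adapter: it knows how to walk one record type
// (here a hypothetical Person) and emit field events for it.
class PersonWriteSupport {
  static class Person {
    final String name;
    final int age;
    Person(String name, int age) { this.name = name; this.age = age; }
  }

  void write(Person p, EventConsumer consumer) {
    consumer.startMessage();
    consumer.startField("name", 0);   // field name and its index in the schema
    consumer.addValue(p.name);
    consumer.endField("name", 0);
    consumer.startField("age", 1);
    consumer.addValue(p.age);
    consumer.endField("age", 1);
    consumer.endMessage();
  }
}
```

A ReadSupport class plays the symmetric role on the read path, materializing your objects from the events produced while scanning the file.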
We have implemented this for two popular data formats, providing a clean migration path as well:
Thrift integration is provided by the parquet-thrift sub-project. If you are using Thrift through Scala, you may be using Twitter's Scrooge. If that's the case, not to worry -- we took care of the Scrooge/Apache Thrift glue for you in the parquet-scrooge sub-project.
Avro conversion is implemented via the parquet-avro sub-project.
See the APIs for details.
A Loader and a Storer are provided to read and write Parquet files with Apache Pig.
Storing data into Parquet in Pig is simple:
```
-- options you might want to fiddle with
SET parquet.page.size 1048576 -- default. this is your min read/write unit.
SET parquet.block.size 134217728 -- default. your memory budget for buffering data
SET parquet.compression lzo -- or you can use none, gzip, snappy
STORE mydata into '/some/path' USING parquet.pig.ParquetStorer;
```
Reading in Pig is also simple:
```
mydata = LOAD '/some/path' USING parquet.pig.ParquetLoader();
```
If the data was stored using Pig, things will “just work”. If the data was stored using another method, you will need to provide the Pig schema equivalent to the data you stored (you can also write the schema to the file footer while writing it -- but that's pretty advanced). We will provide a basic automatic schema conversion soon.
Hive integration is provided via the parquet-hive sub-project.
To run the unit tests:

```
mvn test
```

To build the jars:

```
mvn package
```
The build runs in Travis CI.
For snapshot releases, add the Sonatype snapshot repository and use the snapshot versions:

```xml
<repositories>
  <repository>
    <id>sonatype-nexus-snapshots</id>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-common</artifactId>
    <version>1.0.0-SNAPSHOT</version>
  </dependency>
  <dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-encoding</artifactId>
    <version>1.0.0-SNAPSHOT</version>
  </dependency>
  <dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-column</artifactId>
    <version>1.0.0-SNAPSHOT</version>
  </dependency>
  <dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.0.0-SNAPSHOT</version>
  </dependency>
</dependencies>
```
For official releases:

```xml
<dependencies>
  <dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-common</artifactId>
    <version>1.0.0</version>
  </dependency>
  <dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-encoding</artifactId>
    <version>1.0.0</version>
  </dependency>
  <dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-column</artifactId>
    <version>1.0.0</version>
  </dependency>
  <dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.0.0</version>
  </dependency>
</dependencies>
```
If you are looking for some ideas on what to contribute, check out GitHub issues for this project labeled “Pick me up!”. Comment on the issue and/or contact the parquet-dev group with your questions and ideas.
We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some common issues that are not code structure related, but still important:
To add license headers to the source files, run the following command:

```
mvn license:format
```

Use spaces around operators and after commas: not `a+b` but `a + b`, and not `foo(int a,int b)` but `foo(int a, int b)`.
We hold ourselves and the Parquet developer community to a code of conduct as described by Twitter OSS: https://github.com/twitter/code-of-conduct/blob/master/code-of-conduct.md.
Copyright 2012-2013 Twitter, Inc.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0