Parquet-MR contains the Java implementation of the Parquet format. Parquet is a columnar storage format for Hadoop; it provides efficient storage and encoding of data. Parquet uses the record shredding and assembly algorithm described in the Dremel paper to represent nested structures.
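The shredding idea can be illustrated with a toy sketch: every field of a record becomes its own column of values. This is illustrative only (the record and field names are made up); the real algorithm additionally tracks repetition and definition levels to handle nested and repeated fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ShreddingSketch {
    // Shred a list of flat records (field name -> value) into per-column value lists.
    static Map<String, List<Object>> shred(List<Map<String, Object>> records) {
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        for (Map<String, Object> record : records) {
            for (Map.Entry<String, Object> field : record.entrySet()) {
                columns.computeIfAbsent(field.getKey(), k -> new ArrayList<>())
                       .add(field.getValue());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> records = new ArrayList<>();
        Map<String, Object> r1 = new LinkedHashMap<>();
        r1.put("name", "alice"); r1.put("age", 30);
        Map<String, Object> r2 = new LinkedHashMap<>();
        r2.put("name", "bob"); r2.put("age", 25);
        records.add(r1); records.add(r2);

        Map<String, List<Object>> columns = shred(records);
        // All values of one column end up stored together, which is what
        // enables column-specific encoding and compression.
        System.out.println(columns.get("name")); // [alice, bob]
        System.out.println(columns.get("age"));  // [30, 25]
    }
}
```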
You can find some details about the format and intended use cases in our Hadoop Summit 2013 presentation.
Parquet-MR uses Maven to build and depends on the Thrift compiler (protoc is now managed by a Maven plugin).
To build and install the thrift compiler, run:
```
wget -nv http://archive.apache.org/dist/thrift/0.12.0/thrift-0.12.0.tar.gz
tar xzf thrift-0.12.0.tar.gz
cd thrift-0.12.0
chmod +x ./configure
./configure --disable-libs
sudo make install
```
If you're on OSX and use Homebrew, you can instead install Thrift 0.12.0 with brew and ensure that it comes first in your PATH:

```
brew install thrift@0.12
export PATH="/usr/local/opt/thrift@0.12/bin:$PATH"
```
Once thrift is available in your path, you can build the project by running:
```
LC_ALL=C mvn clean install
```
Parquet is a very active project, and new features are being added quickly. Here are a few features:
Input and Output formats. Note that to use an Input or Output format, you need to implement a WriteSupport or ReadSupport class, which will implement the conversion of your object to and from a Parquet schema.
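As a minimal sketch of what such a class looks like, here is a hypothetical WriteSupport for records consisting of a single int32 field (the class name, schema, and field name are made up for illustration; converting your own types will be more involved):

```java
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.RecordConsumer;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Hypothetical WriteSupport that writes records with one required int32 field.
public class IntRecordWriteSupport extends WriteSupport<Integer> {
  private static final MessageType SCHEMA =
      MessageTypeParser.parseMessageType("message int_record { required int32 value; }");
  private RecordConsumer consumer;

  @Override
  public WriteContext init(Configuration configuration) {
    // Declare the Parquet schema this support writes, plus any extra metadata.
    return new WriteContext(SCHEMA, new HashMap<String, String>());
  }

  @Override
  public void prepareForWrite(RecordConsumer recordConsumer) {
    this.consumer = recordConsumer;
  }

  @Override
  public void write(Integer record) {
    // Convert one object into Parquet events: message -> field -> value.
    consumer.startMessage();
    consumer.startField("value", 0);
    consumer.addInteger(record);
    consumer.endField("value", 0);
    consumer.endMessage();
  }
}
```

A matching ReadSupport does the reverse, materializing your objects from column values.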
We've implemented this for 2 popular data formats to provide a clean migration path as well:
Thrift integration is provided by the parquet-thrift sub-project. If you are using Thrift through Scala, you may be using Twitter's Scrooge. If that's the case, not to worry -- we took care of the Scrooge/Apache Thrift glue for you in the parquet-scrooge sub-project.
Avro conversion is implemented via the parquet-avro sub-project.
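For instance, a round trip with Avro GenericRecords might look like this sketch (the schema, record values, and file path are made up for illustration):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroRoundTrip {
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
          + "[{\"name\":\"name\",\"type\":\"string\"}]}");

  // Writes a single record to `file` and reads it back, returning the "name" field.
  static String roundTrip(String file) throws Exception {
    GenericRecord user = new GenericData.Record(SCHEMA);
    user.put("name", "alice");

    Path path = new Path(file);
    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(path)
            .withSchema(SCHEMA)
            .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
            .build()) {
      writer.write(user);
    }
    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(path).build()) {
      return reader.read().get("name").toString();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(roundTrip("/tmp/users.parquet")); // prints: alice
  }
}
```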
See the APIs:
A Loader and a Storer are provided to read and write Parquet files with Apache Pig.
Storing data into Parquet in Pig is simple:
```
-- options you might want to fiddle with
SET parquet.page.size 1048576 -- default. this is your min read/write unit.
SET parquet.block.size 134217728 -- default. your memory budget for buffering data
SET parquet.compression lzo -- or you can use none, gzip, snappy
STORE mydata into '/some/path' USING parquet.pig.ParquetStorer;
```
Reading in Pig is also simple:
```
mydata = LOAD '/some/path' USING parquet.pig.ParquetLoader();
```
If the data was stored using Pig, things will “just work”. If the data was stored using another method, you will need to provide the Pig schema equivalent to the data you stored (you can also write the schema to the file footer while writing it -- but that's pretty advanced). We will provide a basic automatic schema conversion soon.
Hive integration is provided via the parquet-hive sub-project.
Hive integration is now deprecated within the Parquet project. It is now maintained by Apache Hive.
To run the unit tests:

```
mvn test
```

To build the jars:

```
mvn package
```
The build runs in Travis CI.
The current release is version 1.11.0. To use it, add the following Maven dependencies:
```
<dependencies>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-common</artifactId>
    <version>1.11.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-encoding</artifactId>
    <version>1.11.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-column</artifactId>
    <version>1.11.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.11.0</version>
  </dependency>
</dependencies>
```
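With the parquet-hadoop dependency on the classpath, a quick way to sanity-check that everything is wired up is to read a file's schema out of its footer. A minimal sketch (the helper and file path are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class PrintSchema {
  // Reads the footer of a Parquet file and returns its schema.
  static MessageType schemaOf(String file) throws Exception {
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(file), new Configuration()))) {
      return reader.getFooter().getFileMetaData().getSchema();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(schemaOf(args[0]));
  }
}
```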
We prefer to receive contributions in the form of GitHub pull requests. Please send pull requests against the parquet-mr Git repository. If you've previously forked Parquet from its old location, you will need to add a remote or update your origin remote to https://github.com/apache/parquet-mr.git
If you are looking for some ideas on what to contribute, check out jira issues for this project labeled “pick-me-up”. Comment on the issue and/or contact dev@parquet.apache.org with your questions and ideas.
If you’d like to report a bug but don’t have time to fix it, you can still post it to our issue tracker, or email the mailing list dev@parquet.apache.org
To contribute a patch, make sure your code passes the unit tests by running

```
mvn test
```

in the root directory. We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some common issues that are not code structure related, but still important: put spaces around binary operators (not `a+b` but `a + b`) and after commas (not `foo(int a,int b)` but `foo(int a, int b)`). Thank you for getting involved!
We hold ourselves and the Parquet developer community to two codes of conduct:
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0 See also: