commit    8bc2172e4cc6fc03795f506a656c3e91fa3c13ec
author    kumarvishal09 <kumarvishal1802@gmail.com>  Mon Apr 29 17:33:29 2019 +0800
committer Jacky Li <jacky.likun@qq.com>              Mon Jan 20 23:19:47 2020 +0800
tree      1dc606d21628d87cb2f445d086d273b27a24f532
parent    f8a157327c4769ad586b8efd026f0e1a943bcf1e
[CARBONDATA-3271] Integrating deep learning framework TensorFlow

Why is this PR needed?

AI model training is becoming more and more popular. Many AI frameworks currently use raw data files or row-format data files for model training, so they cannot provide the projection, filtering, and fast-scan capability of a columnar store. If CarbonData supports AI frameworks, it can speed up model training by increasing IO throughput, and give AI developers more flexible selection of training sets.

What changes were proposed in this PR?

Added a basic framework that:

- Supports shuffle read, which reads the data in random order when feeding data to the training model for each epoch.
- Supports data caching, both local-disk and in-memory, to improve reading speed across multiple epochs.
- Supports parallel reading using thread pools and process pools in Python.
- Supports reading data from object storage.
- Supports the manifest format and the CarbonData folder format.

TensorFlow integration: new Python APIs in pycarbon allow TensorFlow to read data from CarbonData files for model training.

Important files, please review in detail:

- reader.py
- tensorflow.py
- carbon.py
- carbon_arrow_reader_worker.py
- carbon_py_dict_reader_worker.py
- carbon_reader.py
- carbon_tf_utils.py
- carbon_dataset_metadata.py

Does this PR introduce any user interface change?

Yes. (please explain the change and update document) Main new interfaces:

```
def make_reader(dataset_url=None,
                workers_count=10,
                results_queue_size=100,
                num_epochs=1,
                obs_client=None,
                shuffle=True,
                schema_fields=None,
                is_batch=True,
                reader_pool_type='thread',
                data_format='carbon',
                cache_properties={'cache_type': None,
                                  'cache_location': None,
                                  'cache_size_limit': None,
                                  'cache_row_size_estimate': None,
                                  'cache_extra_settings': None},
                **properties):

def make_tensor(reader, shuffling_queue_capacity=0, min_after_dequeue=0):

def make_dataset(reader):
```

Is any new testcase added?
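As a rough illustration of how the `make_reader` contract above behaves, here is a minimal stdlib-only stand-in. This is not pycarbon: the dataset rows and iteration logic are fabricated for illustration, and only the parameter semantics mirror the documented signature.

```python
import random
from contextlib import contextmanager

# Hypothetical stand-in for pycarbon's make_reader, for illustration only:
# the dataset is faked in memory, but the num_epochs/shuffle semantics
# mirror the signature documented above.
@contextmanager
def make_reader(dataset_url=None, workers_count=10, num_epochs=1,
                shuffle=True, **properties):
    rows = [{"id": i, "source": dataset_url} for i in range(5)]  # fake data

    def iterate():
        for _ in range(num_epochs):
            epoch = list(rows)
            if shuffle:
                random.shuffle(epoch)  # random order per epoch (shuffle read)
            yield from epoch

    yield iterate()

# Reads every row twice (two epochs), in original order since shuffle=False.
with make_reader("file:///tmp/mnist", num_epochs=2, shuffle=False) as reader:
    count = sum(1 for _ in reader)
print(count)  # 10
```

The context-manager shape matters: the real reader owns worker pools and caches, so callers should use it in a `with` block to guarantee cleanup.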
Yes: https://issues.apache.org/jira/browse/CARBONDATA-3254

Example:

1. Setup

```
cd /yourpath/carbondata/python/
PYTHONPATH=/yourpath/carbondata/python/ pip install . --user
```

2. Generating a pycarbon dataset from MNIST data

The user should do some configuration first:

- Configure pyspark and add the carbon assembly jar to the pyspark/jars folder; the jar can be compiled from the CarbonData project. The default Java SDK jar is in carbondata/store/sdk/target; the user can also specify the jar location with --carbon-sdk-path in generate_pycarbon_dataset.py.
- Set JAVA_HOME, PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON in your system environment.

This creates both a train and a test carbon dataset:

```
cd pycarbon/tests/mnist/dataset_with_unischema
python generate_pycarbon_mnist.py
```

If the user did not compile CarbonData, they can specify the CarbonData Java SDK jar like:

```
python generate_pycarbon_mnist.py --carbon-sdk-path /your_path/carbondata/store/sdk/target/carbondata-sdk.jar
```

3. TensorFlow training using the Carbon MNIST dataset

This invokes a training run using the MNIST carbondata for 1 epoch, with a batch size of 100, logging every 10 intervals:

```
python tf_example_carbon_unified_api.py
```

If the user did not compile CarbonData, they can specify the CarbonData Java SDK jar like:

```
python tf_example_carbon_unified_api.py --carbon-sdk-path /your_path/carbondata/store/sdk/target/carbondata-sdk.jar
```

This closes #3479
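The shuffle-read, parallel-read, and memory-cache behaviour described in this PR can be sketched with the Python standard library alone. The class and method names below are invented for illustration and are not pycarbon APIs; "blocklets" stand in for CarbonData's read units.

```python
import random
from concurrent.futures import ThreadPoolExecutor

class CachedShuffleReader:
    """Illustrative sketch (not pycarbon): blocklets are read in parallel by
    a thread pool, optionally cached in memory, and re-shuffled each epoch."""

    def __init__(self, blocklet_ids, read_fn, workers_count=10, cache_type="memory"):
        self.blocklet_ids = list(blocklet_ids)
        self.read_fn = read_fn          # caller-supplied IO function
        self.workers_count = workers_count
        self.cache = {} if cache_type == "memory" else None

    def _read(self, bid):
        if self.cache is not None and bid in self.cache:
            return self.cache[bid]      # later epochs hit the memory cache
        rows = self.read_fn(bid)
        if self.cache is not None:
            self.cache[bid] = rows
        return rows

    def epochs(self, num_epochs=1, shuffle=True, seed=None):
        rng = random.Random(seed)
        for _ in range(num_epochs):
            order = list(self.blocklet_ids)
            if shuffle:
                rng.shuffle(order)      # new random read order per epoch
            with ThreadPoolExecutor(max_workers=self.workers_count) as pool:
                for rows in pool.map(self._read, order):
                    yield from rows

# Usage: fake IO that records how often a real (uncached) read happens.
reads = []
def fake_io(bid):
    reads.append(bid)
    return [(bid, i) for i in range(2)]

reader = CachedShuffleReader(range(3), fake_io, workers_count=2)
rows = list(reader.epochs(num_epochs=2, seed=0))
print(len(rows), len(reads))  # 12 rows total, but only 3 real reads
```

The second epoch is served entirely from the cache, which is the point of the local-disk/memory-cache feature: multi-epoch training pays the IO cost only once.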
Apache CarbonData is an indexed columnar data store solution for fast analytics on big data platforms such as Apache Hadoop and Apache Spark.
You can find the latest CarbonData documentation and learn more at: http://carbondata.apache.org
The CarbonData file format is a columnar store in HDFS. It has many features of a modern columnar format, such as splittable files, compression schemes, and complex data types, and it also has the following unique features:
CarbonData is built using Apache Maven; to build CarbonData, refer to the build instructions in the repository.
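A typical Maven build from the repository root looks like the following; the exact flags and profiles here are an assumption, so consult the project's build documentation for the authoritative command:

```shell
# Assumed typical command; check the repository's build guide for the
# supported Maven profiles and options.
mvn -DskipTests clean package
```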
This is an active open source project for everyone, and we are always open to people who want to use this system or contribute to it. This guide introduces how to contribute to CarbonData.
To get involved in CarbonData:
Apache CarbonData is an open source project of The Apache Software Foundation (ASF).