[CARBONDATA-3271] Integrating deep learning framework TensorFlow

Why is this PR needed?
Nowadays AI model training is getting more and more popular. Currently many AI frameworks use raw data files or row-format data files for model training, which cannot provide the projection, filtering, and fast scan capabilities of a columnar store. If CarbonData supports AI frameworks, it can speed up model training by increasing IO throughput, and give AI developers more flexible training-set selection.

What changes were proposed in this PR?
Added a basic framework:
- Supports shuffle read, which reads the data in random order when feeding data to the training model for each epoch.
- Supports data caching, both on local disk and in memory, to improve read speed across multiple epochs.
- Supports parallel reading using a thread pool or process pool in Python.
- Supports reading data from object storage.
- Supports the manifest format and the CarbonData folder format.

Support for TensorFlow on top of the framework:
TensorFlow integration: new Python APIs in pycarbon that let TensorFlow read data from CarbonData files for model training.

Important files, please review in detail:
reader.py
tensorflow.py
carbon.py
carbon_arrow_reader_worker.py
carbon_py_dict_reader_worker.py
carbon_reader.py
carbon_tf_utils.py
carbon_dataset_metadata.py

Does this PR introduce any user interface change?
Yes. New Python APIs are added in pycarbon; the main interfaces are listed below.

Main new interfaces:
```
def make_reader(dataset_url=None,
                workers_count=10,
                results_queue_size=100,
                num_epochs=1,
                obs_client=None,
                shuffle=True,
                schema_fields=None,
                is_batch=True,
                reader_pool_type='thread',
                data_format='carbon',
                cache_properties={'cache_type': None, 'cache_location': None, 'cache_size_limit': None,
                                  'cache_row_size_estimate': None, 'cache_extra_settings': None},
                **properties
                ):

def make_tensor(reader, shuffling_queue_capacity=0, min_after_dequeue=0):

def make_dataset(reader):
```
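
A minimal usage sketch (not part of the PR) of how these interfaces could be wired into TensorFlow 1.x; the `pycarbon.reader` import path, the context-manager usage, and the exact return type of `make_dataset` (assumed to be a `tf.data.Dataset`) are assumptions based on the file list above:

```python
# Hypothetical sketch; import path, context-manager usage, and return types are
# assumptions, not confirmed pycarbon behavior.
import tensorflow as tf
from pycarbon.reader import make_reader, make_dataset  # assumed module path

# Shuffled, thread-pooled read of a CarbonData dataset (parameters as in the
# make_reader signature above; the dataset URL is illustrative).
with make_reader('file:///tmp/mnist/train',
                 workers_count=10,
                 reader_pool_type='thread',
                 shuffle=True,
                 num_epochs=1) as reader:
    dataset = make_dataset(reader)            # assumed to return a tf.data.Dataset
    batch = dataset.make_one_shot_iterator().get_next()
    with tf.Session() as sess:
        try:
            while True:
                sess.run(batch)               # fetch one batch of training data
        except tf.errors.OutOfRangeError:
            pass                              # reader exhausted after num_epochs
```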

Is any new testcase added?
Yes

https://issues.apache.org/jira/browse/CARBONDATA-3254

Example:

1. Setup

cd /yourpath/carbondata/python/
PYTHONPATH=/yourpath/carbondata/python/
pip install . --user

2. Generating a Pycarbon Dataset from MNIST Data

Users should do some configuration first:

Configure pyspark and add the carbon assembly jar to the pyspark/jars folder; the jar can be compiled from the CarbonData project.

The default Java SDK jar is in carbondata/store/sdk/target; users can also specify the jar location with --carbon-sdk-path in generate_pycarbon_dataset.py.

Set JAVA_HOME, PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON in your system environment.

The following creates both train and test carbon datasets:

cd pycarbon/tests/mnist/dataset_with_unischema
python generate_pycarbon_mnist.py

If users did not compile CarbonData, they can specify the CarbonData Java SDK jar explicitly:

python generate_pycarbon_mnist.py --carbon-sdk-path /your_path/carbondata/store/sdk/target/carbondata-sdk.jar

3. TensorFlow training using the Carbon MNIST Dataset

This invokes a training run using the MNIST carbon data for 1 epoch, with a batch size of 100, logging every 10 intervals.

python tf_example_carbon_unified_api.py
If users did not compile CarbonData, they can specify the CarbonData Java SDK jar explicitly:

python tf_example_carbon_unified_api.py --carbon-sdk-path /your_path/carbondata/store/sdk/target/carbondata-sdk.jar
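
For orientation, a conceptual sketch (not the actual tf_example_carbon_unified_api.py) of how such a training loop might feed CarbonData batches into a TensorFlow model through make_tensor; the import path, the dataset URL, and the field names image/digit are illustrative assumptions:

```python
# Hypothetical sketch of an MNIST training loop over a pycarbon reader.
# Import path, dataset URL, and field names are assumptions, not the PR's script.
import tensorflow as tf
from pycarbon.reader import make_reader, make_tensor  # assumed module path

with make_reader('file:///tmp/mnist/train', num_epochs=1) as reader:
    batch = make_tensor(reader, shuffling_queue_capacity=1000, min_after_dequeue=100)
    image = tf.reshape(tf.cast(batch.image, tf.float32), [-1, 784])  # assumed field name
    label = tf.cast(batch.digit, tf.int64)                           # assumed field name

    logits = tf.layers.dense(image, 10)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=label, logits=logits))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        step = 0
        try:
            while True:
                _, loss_value = sess.run([train_op, loss])
                step += 1
                if step % 10 == 0:           # log every 10 steps
                    print('step %d, loss %.4f' % (step, loss_value))
        except tf.errors.OutOfRangeError:
            pass                             # reader exhausted after the configured epochs
```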

This closes #3479