[CARBONDATA-3271] Integrating deep learning framework TensorFlow

Why is this PR needed?
Nowadays AI model training is getting more and more popular. Currently many AI frameworks use raw data files or row-format data files for model training, which cannot provide the projection, filtering, and fast scan capabilities of a columnar store. If CarbonData supports AI frameworks, it can speed up model training by increasing IO throughput, and give AI developers more flexible training-set selection.

What changes were proposed in this PR?
Added a basic framework:
- Supports shuffle read, which reads the data in random order when feeding data to the training model for each epoch.
- Supports data caching, both on local disk and in memory, to improve read speed across multiple epochs.
- Supports parallel reading using a thread pool or process pool in Python.
- Supports reading data from object storage.
- Supports the manifest format and the CarbonData folder format.

Support for TensorFlow on top of the framework:
TensorFlow integration: new Python APIs in pycarbon that let TensorFlow read data from CarbonData files for model training.

Important files, please review in detail:
reader.py
tensorflow.py
carbon.py
carbon_arrow_reader_worker.py
carbon_py_dict_reader_worker.py
carbon_reader.py
carbon_tf_utils.py
carbon_dataset_metadata.py

Does this PR introduce any user interface change?
Yes. New Python APIs are added in pycarbon; the main interfaces are listed below.

Main new interfaces:
```
def make_reader(dataset_url=None,
                workers_count=10,
                results_queue_size=100,
                num_epochs=1,
                obs_client=None,
                shuffle=True,
                schema_fields=None,
                is_batch=True,
                reader_pool_type='thread',
                data_format='carbon',
                cache_properties={'cache_type': None, 'cache_location': None, 'cache_size_limit': None,
                                  'cache_row_size_estimate': None, 'cache_extra_settings': None},
                **properties
                ):

def make_tensor(reader, shuffling_queue_capacity=0, min_after_dequeue=0):

def make_dataset(reader):
```
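
A minimal usage sketch (not part of the PR) of how these interfaces could be wired into TensorFlow 1.x; the `pycarbon.reader` import path, the context-manager usage, and the exact return type of `make_dataset` (assumed to be a `tf.data.Dataset`) are assumptions based on the file list above:

```python
# Hypothetical sketch; import path, context-manager usage, and return types are
# assumptions, not confirmed pycarbon behavior.
import tensorflow as tf
from pycarbon.reader import make_reader, make_dataset  # assumed module path

# Shuffled, thread-pooled read of a CarbonData dataset (parameters as in the
# make_reader signature above; the dataset URL is illustrative).
with make_reader('file:///tmp/mnist/train',
                 workers_count=10,
                 reader_pool_type='thread',
                 shuffle=True,
                 num_epochs=1) as reader:
    dataset = make_dataset(reader)            # assumed to return a tf.data.Dataset
    batch = dataset.make_one_shot_iterator().get_next()
    with tf.Session() as sess:
        try:
            while True:
                sess.run(batch)               # fetch one batch of training data
        except tf.errors.OutOfRangeError:
            pass                              # reader exhausted after num_epochs
```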

Is any new testcase added?
Yes

https://issues.apache.org/jira/browse/CARBONDATA-3254

Example:

1. Setup

cd /yourpath/carbondata/python/
PYTHONPATH=/yourpath/carbondata/python/
pip install . --user

2. Generating a Pycarbon Dataset from MNIST Data

Users should do some configuration first:

Configure pyspark and add the carbon assembly jar to the pyspark/jars folder; the jar can be compiled from the CarbonData project.

The default Java SDK jar is in carbondata/store/sdk/target; users can also specify the jar location with --carbon-sdk-path in generate_pycarbon_dataset.py.

Set JAVA_HOME, PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON in your system environment.

The following creates both train and test carbon datasets:

cd pycarbon/tests/mnist/dataset_with_unischema
python generate_pycarbon_mnist.py

If users did not compile CarbonData, they can specify the CarbonData Java SDK jar explicitly:

python generate_pycarbon_mnist.py --carbon-sdk-path /your_path/carbondata/store/sdk/target/carbondata-sdk.jar

3. TensorFlow training using the Carbon MNIST Dataset

This invokes a training run using the MNIST carbon data for 1 epoch, with a batch size of 100, logging every 10 intervals.

python tf_example_carbon_unified_api.py
If users did not compile CarbonData, they can specify the CarbonData Java SDK jar explicitly:

python tf_example_carbon_unified_api.py --carbon-sdk-path /your_path/carbondata/store/sdk/target/carbondata-sdk.jar
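
For orientation, a conceptual sketch (not the actual tf_example_carbon_unified_api.py) of how such a training loop might feed CarbonData batches into a TensorFlow model through make_tensor; the import path, the dataset URL, and the field names image/digit are illustrative assumptions:

```python
# Hypothetical sketch of an MNIST training loop over a pycarbon reader.
# Import path, dataset URL, and field names are assumptions, not the PR's script.
import tensorflow as tf
from pycarbon.reader import make_reader, make_tensor  # assumed module path

with make_reader('file:///tmp/mnist/train', num_epochs=1) as reader:
    batch = make_tensor(reader, shuffling_queue_capacity=1000, min_after_dequeue=100)
    image = tf.reshape(tf.cast(batch.image, tf.float32), [-1, 784])  # assumed field name
    label = tf.cast(batch.digit, tf.int64)                           # assumed field name

    logits = tf.layers.dense(image, 10)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=label, logits=logits))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        step = 0
        try:
            while True:
                _, loss_value = sess.run([train_op, loss])
                step += 1
                if step % 10 == 0:           # log every 10 steps
                    print('step %d, loss %.4f' % (step, loss_value))
        except tf.errors.OutOfRangeError:
            pass                             # reader exhausted after the configured epochs
```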

This closes #3479