[GRIFFIN-326] New Data Connector for Elasticsearch

**What changes were proposed in this pull request?**

This ticket proposes the following changes:
- Deprecate the current implementation in favour of a direct implementation based on the official [elasticsearch-hadoop](https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20) library.
- This library is built on the DataSource API available in Spark 2.2.x+ and thus brings support for filter pushdown, column pruning, unified reads and writes, and additional optimizations.
- Many configuration options are available for ES connectivity; [check here](https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java).
- Filters can be applied as expressions directly on the data frame and are pushed down automatically to the source.
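As a rough sketch of how reads through the DataSource API look with elasticsearch-hadoop (host, port, and index values below are placeholders, and the option keys come from elasticsearch-hadoop's `ConfigurationOptions`):

```scala
import org.apache.spark.sql.SparkSession

object EsReadSketch {
  def main(args: Array[String]): Unit = {
    // Options understood by elasticsearch-hadoop; host/index values are placeholders.
    val esOptions = Map(
      "es.nodes"    -> "localhost",
      "es.port"     -> "9200",
      "es.resource" -> "my-index" // hypothetical index name
    )

    val spark = SparkSession.builder()
      .appName("es-connector-sketch")
      .master("local[*]")
      .getOrCreate()

    // Read through the DataSource API provided by elasticsearch-hadoop.
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .options(esOptions)
      .load()

    // The filter below is pushed down to Elasticsearch as a query,
    // and the select benefits from column pruning.
    df.filter(df("status") === "active")
      .select("id", "status")
      .show()

    spark.stop()
  }
}
```

Running this requires a reachable Elasticsearch cluster and the elasticsearch-hadoop jar on the classpath.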

**Does this PR introduce any user-facing change?**
Yes. As mentioned above, the old connector has been deprecated, and the configuration structure for the Elasticsearch data connector has changed.
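The exact new configuration layout is defined by this PR; as a purely hypothetical sketch (the surrounding field names are illustrative, while the `es.*` keys are standard elasticsearch-hadoop options), a connector entry might pass options through to the library like this:

```json
{
  "name": "source",
  "connector": {
    "type": "elasticsearch",
    "config": {
      "es.nodes": "localhost",
      "es.port": "9200",
      "es.resource": "my-index"
    }
  }
}
```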

**How was this patch tested?**
Tested with the Griffin test suite and additional unit test cases.

Author: chitralverma <chitralverma@gmail.com>

Closes #569 from chitralverma/new-elastic-search-connector.
32 files changed
tree: a23a5d2f3aeb4a01595da282e4816ed26d0ea938
README.md

Apache Griffin


Data quality (DQ) is a key criterion for many data consumers, such as IoT and machine learning applications; however, there is no standard agreement on how to determine "good" data. Apache Griffin is a model-driven data quality service platform where you can examine your data on demand. It provides a standard process to define data quality measures, executions, and reports, allowing these examinations across multiple data systems. When you don't trust your data, or are concerned that poorly controlled data can negatively impact critical decisions, you can use Apache Griffin to ensure data quality.

Getting Started

Quick Start

You can try running Griffin in Docker by following the Docker guide.

Environment for Dev

Follow the Apache Griffin Development Environment Build Guide to set up a development environment.
If you want to contribute code to Griffin, please follow the Apache Griffin Development Code Style Config Guide to keep a consistent code style.

Deployment at Local

If you want to deploy Griffin in your local environment, please follow the Apache Griffin Deployment Guide.

Community

For more information about Griffin, please visit our website: the Griffin home page.

You can contact us via email, and you can subscribe to the latest information by sending an email to the dev and user lists:

dev-subscribe@griffin.apache.org
users-subscribe@griffin.apache.org

You can access our issues on the JIRA page.

Contributing

See How to Contribute for details on how to contribute code, documentation, etc.

Here's the most direct way to get your work merged into Apache Griffin:

  • Fork the project on GitHub
  • Clone your fork
  • Implement your feature or bug fix and commit your changes
  • Push the branch to your fork
  • Send a pull request to the Apache Griffin master branch

References