GORA-664 Add datastore for Elasticsearch (#234)

* Create basic gora-elasticsearch module

* Bump Elasticsearch version and remove redundant dependency

* Implement connection and basic schema management

- Create ElasticsearchStore class with connection initialization
- Create basic Elasticsearch types mapping
- Implement the necessary files for mapping representation (ElasticsearchMapping, ElasticsearchMappingBuilder)
- Read schema from mapping file
- Cover initialization with test

* Set up Elasticsearch client parameters

- Created gora.properties file with configuration properties
- Loaded connection parameters from configuration
- Implemented connection to Elasticsearch cluster with ElasticsearchParameters
- Covered ElasticsearchParameters with tests
- Added javadoc descriptions

* Add a property for choosing the authentication method

* Implement testing with Elasticsearch container

- Added testing dependencies
- Added GoraElasticsearchTestDriver with Elasticsearch container
- Added javadoc descriptions to GoraElasticsearchTestDriver class
- Fixed two existing tests in accordance with Elasticsearch container

* Implement some methods for schema management

Implemented schemaExists, createSchema, deleteSchema and flush methods

* Add XSD validation file for the XML mapping

* Fix XSD validation

- Relocated gora-elasticsearch.xsd file to main resources
- Covered XSD validation with test
- Added gora-elasticsearch-mapping-invalid.xml file for test

* Set up Elasticsearch container's authentication parameters

* Implement exists method

* Add comments for the connection parameters

* Fix authentication

- Set up password to Elasticsearch container properly
- Set default Elasticsearch container server’s username in gora.properties
- Added exceptions for missing arguments in authentication

* Add parameter for the XSD validation

- Defined a parameter for the XSD validation
- Added a test case for the parameter
- Made ElasticsearchStore read mapping file from properties, not configuration

* Implement some basic Input-Output operations for schema management

- Implemented delete, get and put methods
- Implemented newInstance and getUnionSchema utility methods
- Implemented basic serialization/deserialization for primitive AVRO types

* Fix createSchema method

- Added mappings while creating an Elasticsearch index
- Added getter and setter to Datatype enum

* Implement serialization/deserialization for some Avro data types

- Implemented serializeFieldValue and deserializeFieldValue methods for ARRAY, BOOLEAN, BYTES and FIXED Avro data types
- Fixed deserialization for STRING Avro data type
- Added javadoc descriptions

* Fix NPE when getting a non-existent Elasticsearch document

* Implement serialization/deserialization for MAP Avro data type

* Refactor serialization/deserialization to have better javadocs and arguments

* Implement serialization/deserialization for RECORD Avro data type

* Implement serialization/deserialization for UNION Avro data type

* Fix passed Schema argument for ARRAY deserialization

* Fix BYTES deserialization for Base64 encoded String

* Ignore testGet3UnionField test

* Add javadoc descriptions to serialization and deserialization methods

* Implement newQuery method

* Implement deleteByQuery method

* Use an Enum instead of literal strings for the Authentication Type parameter

* Use parameterized logging instead of string concatenation

* Implement execute method

* Implement getPartitions method

* Add scaling_factor support

* Remove unsupported Elasticsearch data types

* Implement Metadata Analyzer for Elasticsearch Store

* Try to fix range query by “_id” field

* Fix execute method by adding a special "gora_id" field

* Implement deleting specific fields of the records in deleteByQuery method

* Implement MapReduce test

* Fix flush method by using refresh

* Address reviewer's comments

* Add Elasticsearch specific logging dependency
28 files changed
tree: d94c1636a9a44f52cf40ef137d6a5ec9d2c29462
  1. .github/
  2. bin/
  3. conf/
  4. gora-accumulo/
  5. gora-aerospike/
  6. gora-benchmark/
  7. gora-cassandra/
  8. gora-compiler/
  9. gora-compiler-cli/
  10. gora-core/
  11. gora-couchdb/
  12. gora-dynamodb/
  13. gora-elasticsearch/
  14. gora-goraci/
  15. gora-gradle-plugin/
  16. gora-hbase/
  17. gora-hive/
  18. gora-ignite/
  19. gora-infinispan/
  20. gora-jcache/
  21. gora-jet/
  22. gora-kudu/
  23. gora-lucene/
  24. gora-maven-plugin/
  25. gora-mongodb/
  26. gora-orientdb/
  27. gora-pig/
  28. gora-redis/
  29. gora-rethinkdb/
  30. gora-solr/
  31. gora-sql/
  32. gora-tutorial/
  33. sources-dist/
  34. .asf.yaml
  35. .gitignore
  36. CHANGES.md
  37. Jenkinsfile
  38. KEYS
  39. LICENSE.md
  40. NOTICE.md
  41. pom.xml
  42. README.md
README.md

Apache Gora Project

The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce, Apache Spark, Apache Flink and Apache Pig support.

Why Gora?

Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from that of their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases where one needs the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use ORM framework with data store specific mappings and built-in Apache Hadoop support.

The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.

  • Data Persistence : Persisting objects to column stores such as HBase, Cassandra and Hypertable; key-value stores such as Voldemort and Redis; SQL databases such as MySQL and HSQLDB; and flat files in the local file system or Hadoop HDFS.

  • Data Access : An easy-to-use, Java-friendly common API for accessing the data regardless of its location.

  • Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.

  • Analysis : Accessing and analyzing the data through adapters for Apache Pig, Apache Hive and Cascading.

  • MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.
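The "Data Access" goal above can be illustrated with a minimal sketch. This is not Gora's actual API: the `SimpleStore` interface, the `InMemoryStore` backend and the `recordVisit` helper are hypothetical names invented for this example. They only show the general idea of application code written against one common store interface, so that the backend can be swapped without changing that code.

```java
import java.util.HashMap;
import java.util.Map;

class DataAccessSketch {

  /** Hypothetical, simplified store interface; an assumption, not Gora's real DataStore. */
  interface SimpleStore<K, T> {
    void put(K key, T value);

    T get(K key);

    boolean delete(K key);
  }

  /** Toy in-memory backend standing in for a real data store. */
  static class InMemoryStore<K, T> implements SimpleStore<K, T> {
    private final Map<K, T> rows = new HashMap<>();

    @Override
    public void put(K key, T value) {
      rows.put(key, value);
    }

    @Override
    public T get(K key) {
      return rows.get(key);
    }

    @Override
    public boolean delete(K key) {
      // Reports whether a row was actually removed.
      return rows.remove(key) != null;
    }
  }

  /**
   * Application logic that depends only on the interface, so it works
   * unchanged with any backend that implements SimpleStore.
   */
  static int recordVisit(SimpleStore<String, Integer> store, String url) {
    Integer current = store.get(url);
    int updated = (current == null) ? 1 : current + 1;
    store.put(url, updated);
    return updated;
  }

  public static void main(String[] args) {
    SimpleStore<String, Integer> store = new InMemoryStore<>();
    recordVisit(store, "http://gora.apache.org");
    int visits = recordVisit(store, "http://gora.apache.org");
    System.out.println("visits=" + visits);
  }
}
```

Here `recordVisit` never refers to `InMemoryStore` directly. A different implementation of the same hypothetical interface could be passed in without touching the calling code, which is the property the roadmap item describes.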

Background

ORM stands for Object-Relational Mapping. It is a technology that abstracts the persistence layer (mostly relational databases) so that plain domain-level objects can be used without the cumbersome effort of saving and loading data to and from the database. Gora differs from current solutions in that:

  • Gora focuses specifically on NoSQL data stores, but also has limited support for SQL databases.

  • The main use case for Gora is to access/analyze big data using Hadoop.

  • Gora uses Avro for bean definition, not byte code enhancement or annotations.

  • Object-to-data store mappings are backend specific, so that the full data model of each backend can be utilized.

  • Gora is simple since it ignores complex SQL mappings.

  • Gora will support persistence, indexing and analysis of data, using Pig, Lucene, Hive, etc.

For the latest information about Gora, please visit our website at:

http://gora.apache.org

License

Gora is provided under the Apache License, version 2.0. See LICENSE.md for more details.