[BAHIR-213] Faster S3 file Source for Structured Streaming with SQS (#91)

Using FileStreamSource to read files from an S3 bucket has problems
in terms of both costs and latency:

Latency: Listing all the files in S3 buckets every micro-batch can be both
slow and resource-intensive.

Costs: Making List API requests to S3 every micro-batch can be costly.

The solution is to use Amazon Simple Queue Service (SQS), which lets
you find new files written to the S3 bucket without having to list all the
files every micro-batch.

S3 buckets can be configured to send a notification to an Amazon SQS queue
on Object Create / Object Delete events. For details, see the AWS documentation
on Configuring S3 Event Notifications.
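
For illustration, here is a minimal sketch of wiring a bucket's ObjectCreated events to an SQS queue using the AWS SDK for Java v1 from Scala. The bucket name, queue ARN, and region are placeholders, and the queue's access policy must separately allow S3 to send messages to it:

import java.util.EnumSet

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{BucketNotificationConfiguration, QueueConfiguration, S3Event}

// Placeholders: substitute your own bucket name and queue ARN.
val bucketName = "my-input-bucket"
val queueArn = "arn:aws:sqs:us-east-1:123456789012:my-queue"

val s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build()

// Send an event to the queue whenever a new object is created in the bucket.
val notification = new BucketNotificationConfiguration()
  .addConfiguration("new-object-events",
    new QueueConfiguration(queueArn, EnumSet.of(S3Event.ObjectCreated)))

s3.setBucketNotificationConfiguration(bucketName, notification)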

Spark can leverage this to find new files written to the S3 bucket by reading
notifications from the SQS queue instead of listing files every micro-batch.
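
To make the mechanism concrete, the sketch below (not the actual SQSSource implementation) polls such a queue with the AWS SDK for Java v1 and prints each S3 event notification; the queue URL and region are placeholders:

import scala.collection.JavaConverters._

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest

val queueUrl = "https://QUEUE_URL"  // placeholder

val sqs = AmazonSQSClientBuilder.standard().withRegion("us-east-1").build()

// Long-poll for up to 10 notifications at a time instead of listing the bucket.
val request = new ReceiveMessageRequest(queueUrl)
  .withMaxNumberOfMessages(10)
  .withWaitTimeSeconds(20)

sqs.receiveMessage(request).getMessages.asScala.foreach { message =>
  // The body is the S3 event notification JSON; the new object's bucket and key
  // appear under Records[*].s3.bucket.name and Records[*].s3.object.key.
  println(message.getBody)
  // Remove the notification once the new file has been recorded for processing.
  sqs.deleteMessage(queueUrl, message.getReceiptHandle)
}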

This PR adds a new SQSSource which uses an Amazon SQS queue to find
new files every micro-batch.

Usage

val inputDf = spark
  .readStream
  .format("s3-sqs")
  .schema(schema)
  .option("fileFormat", "json")
  .option("sqsUrl", "https://QUEUE_URL")
  .option("region", "us-east-1")
  .load()
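
The resulting inputDf is an ordinary streaming DataFrame, so it can be consumed with the standard Structured Streaming sink API. A minimal sketch that writes it out as Parquet (the checkpoint and output paths are placeholders):

import org.apache.spark.sql.streaming.Trigger

val query = inputDf.writeStream
  .format("parquet")
  .option("checkpointLocation", "/path/to/checkpoint")  // placeholder
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start("/path/to/output")                             // placeholder

query.awaitTermination()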
README.md

Apache Bahir

Apache Bahir provides extensions to distributed analytics platforms such as Apache Spark & Apache Flink.

http://bahir.apache.org/

Apache Bahir origins

The initial Bahir source code (see issue BAHIR-1) contains the source for the Apache Spark streaming connectors for akka, mqtt, twitter, and zeromq, extracted from Apache Spark revision 8301fad (just before those streaming connectors were deleted from Spark).

Source code structure

Source code folder structure:

- streaming-akka
  - examples/src/main/...
  - src/main/...
- streaming-mqtt
  - examples
  - src
  - python
- ...

Building Bahir

Bahir is built using Apache Maven. To build Bahir and its example programs, run:

mvn -DskipTests clean install

Running tests

Testing first requires building Bahir. Once Bahir is built, tests can be run using:

mvn test

Example programs

Each extension currently available in Apache Bahir has an example application located under the “examples” folder.

Documentation

Currently, each submodule has its own README.md, with information on example usages and API.

Furthermore, to generate scaladocs for each module:

$ mvn package

Scaladocs are generated in MODULE_NAME/target/site/scaladocs/index.html, where MODULE_NAME is one of sql-streaming-mqtt, streaming-akka, streaming-mqtt, streaming-zeromq, or streaming-twitter.

A note about Apache Spark integration

Currently, each module in Bahir is available through Spark Packages. Please follow the Linking subsection in each module's README.md for more details.