[BAHIR-213] Faster S3 file Source for Structured Streaming with SQS (#91)

Using FileStreamSource to read files from an S3 bucket has problems
in terms of both costs and latency:

Latency: Listing all the files in S3 buckets every micro-batch can be both
slow and resource-intensive.

Costs: Making List API requests to S3 every micro-batch can be costly.

The solution is to use Amazon Simple Queue Service (SQS), which lets
you find new files written to the S3 bucket without having to list all the
files every micro-batch.

S3 buckets can be configured to send a notification to an Amazon SQS queue
on Object Create / Object Delete events. For details, see the AWS documentation
on Configuring S3 Event Notifications.
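
For illustration, here is a minimal sketch of wiring a bucket's ObjectCreated events to an SQS queue using the AWS SDK for Java v1 from Scala. The bucket name, queue ARN, and region are placeholders, and the queue's access policy must separately allow S3 to send messages to it:

import java.util.EnumSet

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{BucketNotificationConfiguration, QueueConfiguration, S3Event}

// Placeholders: substitute your own bucket name and queue ARN.
val bucketName = "my-input-bucket"
val queueArn = "arn:aws:sqs:us-east-1:123456789012:my-queue"

val s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build()

// Send an event to the queue whenever a new object is created in the bucket.
val notification = new BucketNotificationConfiguration()
  .addConfiguration("new-object-events",
    new QueueConfiguration(queueArn, EnumSet.of(S3Event.ObjectCreated)))

s3.setBucketNotificationConfiguration(bucketName, notification)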

Spark can leverage this to find new files written to the S3 bucket by reading
notifications from the SQS queue instead of listing files every micro-batch.
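
To make the mechanism concrete, the sketch below (not the actual SQSSource implementation) polls such a queue with the AWS SDK for Java v1 and prints each S3 event notification; the queue URL and region are placeholders:

import scala.collection.JavaConverters._

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest

val queueUrl = "https://QUEUE_URL"  // placeholder

val sqs = AmazonSQSClientBuilder.standard().withRegion("us-east-1").build()

// Long-poll for up to 10 notifications at a time instead of listing the bucket.
val request = new ReceiveMessageRequest(queueUrl)
  .withMaxNumberOfMessages(10)
  .withWaitTimeSeconds(20)

sqs.receiveMessage(request).getMessages.asScala.foreach { message =>
  // The body is the S3 event notification JSON; the new object's bucket and key
  // appear under Records[*].s3.bucket.name and Records[*].s3.object.key.
  println(message.getBody)
  // Remove the notification once the new file has been recorded for processing.
  sqs.deleteMessage(queueUrl, message.getReceiptHandle)
}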

This PR adds a new SQSSource which uses an Amazon SQS queue to find
new files every micro-batch.

Usage

val inputDf = spark
  .readStream
  .format("s3-sqs")
  .schema(schema)
  .option("fileFormat", "json")
  .option("sqsUrl", "https://QUEUE_URL")
  .option("region", "us-east-1")
  .load()
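
The resulting inputDf is an ordinary streaming DataFrame, so it can be consumed with the standard Structured Streaming sink API. A minimal sketch that writes it out as Parquet (the checkpoint and output paths are placeholders):

import org.apache.spark.sql.streaming.Trigger

val query = inputDf.writeStream
  .format("parquet")
  .option("checkpointLocation", "/path/to/checkpoint")  // placeholder
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start("/path/to/output")                             // placeholder

query.awaitTermination()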
README.md

Apache Bahir

Apache Bahir provides extensions to distributed analytics platforms such as Apache Spark & Apache Flink.

http://bahir.apache.org/

Apache Bahir origins

The initial Bahir source code (see issue BAHIR-1) contains the source for the Apache Spark streaming connectors for akka, mqtt, twitter, and zeromq, extracted from Apache Spark revision 8301fad (just before those streaming connectors were deleted from Spark).

Source code structure

Source code folder structure:

- streaming-akka
  - examples/src/main/...
  - src/main/...
- streaming-mqtt
  - examples
  - src
  - python
- ...

Building Bahir

Bahir is built using Apache Maven. To build Bahir and its example programs, run:

mvn -DskipTests clean install

Running tests

Testing first requires building Bahir. Once Bahir is built, tests can be run using:

mvn test

Example programs

Each extension currently available in Apache Bahir has an example application located under the “examples” folder.

Documentation

Currently, each submodule has its own README.md, with information on example usages and API.

Furthermore, to generate scaladocs for each module:

$ mvn package

Scaladocs are generated in MODULE_NAME/target/site/scaladocs/index.html, where MODULE_NAME is one of sql-streaming-mqtt, streaming-akka, streaming-mqtt, streaming-zeromq, or streaming-twitter.

A note about Apache Spark integration

Currently, each module in Bahir is available through Spark Packages. Please follow the Linking subsection in each module's README.md for more details.