commit d036820c0efa1b2e9b8021506164b67582352dff
author: abhishekd0907 <43843989+abhishekd0907@users.noreply.github.com>  Sun Dec 29 02:10:26 2019 +0530
committer: Luciano Resende <lresende@apache.org>  Sat Dec 28 21:40:26 2019 +0100
tree: 2a91e4e8b54720a8c6d15076b5bdbcedaee2fdc6
parent: 1628c761f238ad064e21749bfe9ac37f0ee3503c
[BAHIR-213] Faster S3 file source for Structured Streaming with SQS (#91)

Using FileStreamSource to read files from an S3 bucket has problems in terms of both cost and latency:

- Latency: listing all the files in S3 buckets every micro-batch can be both slow and resource-intensive.
- Costs: making List API requests to S3 every micro-batch can be costly.

The solution is to use Amazon Simple Queue Service (SQS), which lets you find new files written to an S3 bucket without the need to list all the files every micro-batch. S3 buckets can be configured to send a notification to an Amazon SQS queue on Object Create / Object Delete events; for details, see the AWS documentation on Configuring S3 Event Notifications. Spark can leverage this to find new files written to an S3 bucket by reading notifications from the SQS queue instead of listing files every micro-batch. This PR adds a new SQSSource which uses an Amazon SQS queue to find new files every micro-batch.

Usage:

```scala
val inputDf = spark
  .readStream
  .format("s3-sqs")
  .schema(schema)
  .option("fileFormat", "json")
  .option("sqsUrl", "https://QUEUE_URL")
  .option("region", "us-east-1")
  .load()
```
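The notifications S3 delivers to SQS are JSON documents describing the affected bucket and object. As a rough illustration of the kind of information such a source must extract from each message, here is a self-contained sketch; the sample message follows the documented S3 event structure, but the regex-based extraction and the `S3EventSketch` name are simplifications for illustration only, not Bahir's actual parsing code:

```scala
// Sketch: pulling the bucket name and object key out of an S3 event
// notification message, as delivered to an SQS queue. A real implementation
// would use a proper JSON parser; the regexes here are for illustration.
object S3EventSketch {
  val sampleMessage: String =
    """{"Records":[{"eventName":"ObjectCreated:Put",
      |"s3":{"bucket":{"name":"my-bucket"},
      |"object":{"key":"data/part-0001.json"}}}]}""".stripMargin

  private val bucketRe = """"bucket":\{"name":"([^"]+)"""".r
  private val keyRe = """"object":\{"key":"([^"]+)"""".r

  // Returns the (bucket, key) pair if both fields are present.
  def bucketAndKey(msg: String): Option[(String, String)] =
    for {
      b <- bucketRe.findFirstMatchIn(msg).map(_.group(1))
      k <- keyRe.findFirstMatchIn(msg).map(_.group(1))
    } yield (b, k)

  def main(args: Array[String]): Unit =
    println(bucketAndKey(sampleMessage)) // Some((my-bucket,data/part-0001.json))
}
```

Each micro-batch, the source can then read only the newly notified keys rather than listing the whole bucket.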
Apache Bahir provides extensions to distributed analytics platforms such as Apache Spark & Apache Flink.
The initial Bahir source code (see issue BAHIR-1) contains the Apache Spark streaming connectors for Akka, MQTT, Twitter, and ZeroMQ, extracted from Apache Spark revision 8301fad (before the deletion of those streaming connectors from Spark).
Source code folder structure:
- streaming-akka
  - examples/src/main/...
  - src/main/...
- streaming-mqtt
  - examples
  - src
  - python
- ...
Bahir is built using Apache Maven. To build Bahir and its example programs, run:
mvn -DskipTests clean install
Testing first requires building Bahir. Once Bahir is built, tests can be run using:
mvn test
Each extension currently available in Apache Bahir has an example application located under the “examples” folder.
Currently, each submodule has its own README.md, with information on example usages and API.
Furthermore, to generate scaladocs for each module:
$ mvn package
Scaladocs are generated in `MODULE_NAME/target/site/scaladocs/index.html`, where `MODULE_NAME` is one of:

- sql-streaming-mqtt
- streaming-akka
- streaming-mqtt
- streaming-zeromq
- streaming-twitter
Currently, each module in Bahir is available through Spark Packages. Please follow the Linking subsection in the module-specific README.md for more details.
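For illustration, linking a Bahir module into a Maven build typically takes the form below. The group ID is Apache Bahir's, but the artifact ID and version shown are placeholders assumed for this sketch; consult the module's README for its exact coordinates:

```xml
<!-- Sketch only: the artifact ID and version below are illustrative
     placeholders; each module's README lists its real coordinates. -->
<dependency>
  <groupId>org.apache.bahir</groupId>
  <artifactId>spark-streaming-mqtt_2.11</artifactId>
  <version>2.4.0</version>
</dependency>
```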