While Gobblin is not tied to any specific cloud provider, Amazon Web Services is a popular choice. This document outlines how Gobblin can publish data to S3. Specifically, it provides a step-by-step guide to setting up Gobblin on Amazon EC2, running Gobblin on EC2, and publishing data from EC2 to S3.
It is recommended to configure Gobblin to first write data to EBS, and then publish the data to S3. This is the recommended approach because there are a few caveats when working with S3. See the Hadoop and S3 section for more details.
This document also provides a step-by-step guide for launching and configuring an EC2 instance and creating an S3 bucket. However, it is by no means an authoritative guide to working with AWS; it only provides high-level steps. The best place to learn how to use AWS is the Amazon documentation.
A majority of Gobblin's code base uses Hadoop's `FileSystem` object to read and write data. The `FileSystem` object is an abstract class, and typical implementations either write to the local file system or to HDFS. There has been significant work to create an implementation of the `FileSystem` object that reads and writes to S3. The best guide to the different S3 `FileSystem` implementations is here.
There are a few different S3 `FileSystem` implementations; the two of note are the `s3a` and the `s3` file systems. The `s3a` file system is relatively new and is only available as of Hadoop 2.6.0 (see the original JIRA for more information). The `s3` file system has been around for a while.
The `s3a` file system uploads files to a specified bucket. The data uploaded to S3 via this file system is interoperable with other S3 tools. However, there are a few caveats when working with this file system:

- The `S3AFileSystem.rename(Path, Path)` operation will actually copy data from the source `Path` to the destination `Path`, and then delete the source `Path` (see the source code for more information)
- When using `S3AFileSystem.create(...)`, data will first be written to a staging file on the local file system, and when the file is closed, the staging file will be uploaded to S3 (see the source code for more information)
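To make the rename caveat concrete, here is a local-filesystem sketch (file names are made up for illustration) of what `S3AFileSystem.rename` effectively does: a full copy of the data, followed by a delete of the source.

```shell
# Simulate s3a's rename: there is no atomic server-side move on S3,
# so a "rename" copies the object's bytes and then deletes the source.
src=/tmp/s3a-rename-demo-src.txt
dest=/tmp/s3a-rename-demo-dest.txt
echo "record-1" > "$src"
cp "$src" "$dest"   # step 1: copy all data to the destination path
rm "$src"           # step 2: delete the source path
cat "$dest"
```

On a real S3 bucket, step 1 copies every byte of the object, so the cost of a rename grows with the size of the data, unlike a cheap metadata-only rename on HDFS or a local disk.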
Thus, when using the `s3a` file system with Gobblin, it is recommended to configure Gobblin to first write its staging data to the local file system, and then to publish the data to S3. The reason this is the recommended approach is that each Gobblin `Task` writes data to a staging file, and once the file has been completely written, the `Task` publishes the file to an output directory (it does this using a rename operation). Finally, the `DataPublisher` moves the files from the output directory to their final directory (again using a rename operation). This requires two rename operations and would be very inefficient if a `Task` wrote directly to S3.
Furthermore, writing directly to S3 requires creating a staging file on the local file system, and then creating a `PutObjectRequest` to upload the data to S3. This is logically equivalent to simply configuring Gobblin to write to a local file and then publish it to S3.
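The two renames in that flow can be sketched on a local file system (the directory names here are illustrative, not Gobblin's actual defaults):

```shell
# Sketch of Gobblin's publish flow: task staging -> task output -> final dir.
work=/tmp/gobblin-flow-demo
mkdir -p "$work/task-staging" "$work/task-output" "$work/final"
echo "some-extracted-record" > "$work/task-staging/part-0.txt"
# rename 1: the Task publishes its fully written staging file to the output dir
mv "$work/task-staging/part-0.txt" "$work/task-output/part-0.txt"
# rename 2: the DataPublisher moves the output file to the final directory
mv "$work/task-output/part-0.txt" "$work/final/part-0.txt"
cat "$work/final/part-0.txt"
```

On a local file system both `mv` calls are cheap metadata operations; if the staging and output directories lived on `s3a`, each would be a full copy-and-delete of the data, which is why writing staging data locally and publishing to S3 once is the recommended setup.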
The `s3` file system stores files as blocks, similar to how HDFS stores blocks. This makes renaming files more efficient, but data written using this file system is not interoperable with other S3 tools. This limitation may make the `s3` file system less desirable, so the majority of this document focuses on the `s3a` file system, although most of the walkthrough should also apply to the `s3` file system.
This section provides a step-by-step guide to setting up an EC2 instance and an S3 bucket, installing Gobblin on EC2, and configuring Gobblin to publish data to S3.
This guide uses the free tier provided by AWS to set up EC2 and S3. In order to use EC2 and S3, one first needs to sign up for an AWS account; the easiest way to get started with AWS is through the free tier.
Click Launch Instance to create a new EC2 instance. Before the instance actually starts to run, there are a few more configuration steps necessary:
- SSH into the instance: `ssh -i my-private-key-file.pem ec2-user@instance-name`
- The `instance-name` can be taken from the `Public DNS` field in the instance information
- If SSH complains that the permissions on the private key file are too open, execute `chmod 600 my-private-key-file.pem` to fix this
- It may be convenient to set up a `~/.ssh/config` file instead of specifying the private key file on every connection
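As an example, a `~/.ssh/config` entry along the following lines (the host alias and key file name are placeholders) lets you connect with just `ssh gobblin-ec2`:

```
Host gobblin-ec2
    HostName <Public DNS of the instance>
    User ec2-user
    IdentityFile ~/my-private-key-file.pem
```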
After following the above steps, you should be able to freely SSH into the launched EC2 instance, and monitor / control the instance from the EC2 dashboard.
Before setting up Gobblin, you need to install Java. Depending on the AMI you are running, Java may or may not already be installed (you can check whether Java is already installed by executing `java -version`):

- Execute `sudo yum install java-1.8.0-openjdk*` to install OpenJDK 8
- Set the `JAVA_HOME` environment variable, for example in your `~/.bashrc` file
- The correct value of `JAVA_HOME` can be found by executing `` readlink `which java` ``
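For example, the following lines could be appended to `~/.bashrc` (the OpenJDK path below is an assumption; confirm the actual path on your instance with `` readlink `which java` ``):

```shell
# Point JAVA_HOME at the installed JDK. The path here is an assumption
# for an Amazon Linux AMI; verify it with: readlink -f "$(which java)"
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
export PATH="$JAVA_HOME/bin:$PATH"
echo "$JAVA_HOME"
```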
To create an S3 bucket, go to the S3 dashboard
```shell
cd gobblin
./gradlew clean build -PuseHadoop2 -PhadoopVersion=2.6.0 -x test
```
```shell
scp -i my-private-key-file.pem gobblin-dist-[project-version].tar.gz ec2-user@instance-name:
```

Then, on the EC2 instance, extract the distribution and download the AWS SDK and Hadoop AWS jars into the Gobblin lib folder:

```shell
tar -xvf gobblin-dist-[project-version].tar.gz
curl http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar > gobblin-dist/lib/aws-java-sdk-1.7.4.jar
curl http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.6.0/hadoop-aws-2.6.0.jar > gobblin-dist/lib/hadoop-aws-2.6.0.jar
```
Assuming we are running Gobblin in standalone mode, the following configuration options need to be modified in the configuration file:

- Change `data.publisher.fs.uri` and set it to the URI of the S3 bucket (for example, `s3a://my-bucket-name`)
- Set `fs.s3a.access.key` and `fs.s3a.secret.key` to the appropriate values
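Putting the above together, the relevant configuration might look like the following sketch (the bucket name and credentials are placeholders, and `writer.fs.uri` is included only to make the write-locally-then-publish setup explicit):

```properties
# Write task staging/output data to the local file system first
writer.fs.uri=file:///
# Publish the final data to S3 via the s3a file system
data.publisher.fs.uri=s3a://my-bucket-name
# AWS credentials used by the s3a file system (placeholders)
fs.s3a.access.key=MY_ACCESS_KEY
fs.s3a.secret.key=MY_SECRET_KEY
```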
Assuming we want Gobblin to run in standalone mode, follow the usual steps for standalone deployment.
```shell
sh bin/gobblin-standalone.sh start --workdir /home/ec2-user/gobblin-dist/work --logdir /home/ec2-user/gobblin-dist/logs --conf /home/ec2-user/gobblin-dist/config
```
If you are running on the Amazon free tier, you will probably get an error in the `nohup.out` file saying there is insufficient memory for the JVM. To fix this, add `--jvmflags "-Xms256m -Xmx512m"` to the start command above.
Data should be written to S3 during the publishing phase of Gobblin. One can confirm data was successfully written to S3 by looking at the S3 dashboard.
It is possible to write to an S3 bucket from outside an EC2 instance. The setup steps are similar to the walkthrough outlined above. For more information on writing to S3 from outside AWS, check out this article.