With Heron, you have the option to use HDFS as stable storage for user-submitted topology jars. Since HDFS replicates the data, it provides a scalable mechanism for distributing those jars. This is desirable when a topology runs in a distributed cluster and requires several hundred containers.
There are a few things you should be aware of when using the HDFS uploader.
You can make Heron use the HDFS uploader by modifying the `uploader.yaml` config file specific to the Heron cluster. You'll need to specify the following for each cluster:
* `heron.class.uploader` --- Indicates the uploader class to be loaded. You should set this to `org.apache.heron.uploader.hdfs.HdfsUploader`.

* `heron.uploader.hdfs.config.directory` --- Specifies the directory of the config files for Hadoop. The Hadoop client uses these files to upload the topology jar.

* `heron.uploader.hdfs.topologies.directory.uri` --- URI of the directory for uploading topology jars. If several clusters share the same storage, the name of the directory should be unique per cluster; in that case, you can use the Heron environment variable `${CLUSTER}`, which is substituted with the cluster name for distinction.
Below is an example configuration (in `uploader.yaml`) for an HDFS uploader:
```yaml
# uploader class for transferring the topology jar/tar files to storage
heron.class.uploader: org.apache.heron.uploader.hdfs.HdfsUploader

# directory of config files for hadoop client to read from
heron.uploader.hdfs.config.directory: /home/hadoop/hadoop

# name of the directory to upload topologies for HDFS uploader
heron.uploader.hdfs.topologies.directory.uri: hdfs://heron/topologies/${CLUSTER}
```
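For intuition about how these two settings work together, here is a minimal Java sketch of an HDFS upload using the Hadoop client API. This is not the actual `HdfsUploader` implementation (which may invoke the `hadoop` CLI instead); the config file paths, the `prod` cluster name, and the `TopologyJarUpload` class name are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TopologyJarUpload {
  public static void main(String[] args) throws Exception {
    // Read the Hadoop client config from the directory pointed to by
    // heron.uploader.hdfs.config.directory (paths assumed from the example above).
    Configuration conf = new Configuration();
    conf.addResource(new Path("/home/hadoop/hadoop/core-site.xml"));
    conf.addResource(new Path("/home/hadoop/hadoop/hdfs-site.xml"));

    // Destination mirrors heron.uploader.hdfs.topologies.directory.uri,
    // with ${CLUSTER} already substituted (here with a hypothetical "prod").
    Path topologiesDir = new Path("hdfs://heron/topologies/prod");
    Path localJar = new Path(args[0]); // local topology jar to upload

    FileSystem fs = FileSystem.get(topologiesDir.toUri(), conf);
    fs.mkdirs(topologiesDir);                      // create the directory if absent
    fs.copyFromLocalFile(localJar, topologiesDir); // HDFS replicates the jar
    System.out.println("Uploaded " + localJar + " to " + topologiesDir);
  }
}
```

Once the jar is in HDFS, each container can fetch it from the topologies directory rather than from the submitting machine, which is what lets this mechanism scale to several hundred containers.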