A Gobblin ingestion flow can be embedded into a Java application using the EmbeddedGobblin class.
The following code will run a Hello-World Gobblin job as an embedded application using a template. This will simply print “Hello World <i>!” to stdout a few times.
EmbeddedGobblin embeddedGobblin = new EmbeddedGobblin("TestJob")
    .setTemplate(ResourceBasedJobTemplate.forResourcePath("templates/hello-world.template"));
JobExecutionResult result = embeddedGobblin.run();
Note: EmbeddedGobblin starts and destroys an embedded Gobblin instance every time run() is called. If an application needs to run a large number of Gobblin jobs, it should instantiate and manage its own Gobblin driver.
The code snippet above creates an EmbeddedGobblin instance. This instance can run arbitrary Gobblin ingestion jobs and allows the use of templates. However, the user must configure the job using the exact configuration key for each feature.
An alternative is to use a subclass of EmbeddedGobblin that provides methods to configure the job more easily. For example, an easier way to run a Gobblin distcp job is to use EmbeddedGobblinDistcp:
EmbeddedGobblinDistcp distcp = new EmbeddedGobblinDistcp(sourcePath, targetPath).delete();
distcp.run();
This subclass automatically knows which template to use and which configurations the job requires (these are passed as constructor parameters). It also provides convenience methods for the most common configurations (in the case above, the method delete() instructs the job to delete files that exist in the target but not in the source).
The following is a non-exhaustive list of available subclasses of EmbeddedGobblin:
- EmbeddedGobblinDistcp: distributed copy between Hadoop-compatible file systems.
- EmbeddedWikipediaExample: a getting-started example that pulls page updates from Wikipedia.
EmbeddedGobblin allows any configuration that a standalone Gobblin job would allow. EmbeddedGobblin itself provides a few convenience methods to alter the behavior of the Gobblin framework. Other methods allow users to set a job template or set job-level configurations.
Method | Parameters | Description |
---|---|---|
mrMode | N/A | Runs Gobblin in MR mode. |
setTemplate | Template object to use | Use a job template. |
useStateStore | State store directory | By default, embedded Gobblin is stateless and disables the state store. This method enables the state store at the indicated location, which allows the job to use watermarks from previous jobs. |
distributeJar | Path to jar in local fs | Indicates that a specific jar is needed by Gobblin workers when running in distributed mode (e.g. MR mode). Gobblin will automatically add this jar to the classpath of the workers. |
setConfiguration | key - value pair | Sets a job level configuration. |
setJobTimeout | timeout and time unit, or ISO period | Sets the timeout for the Gobblin job. run() will throw a TimeoutException if the job is not done after this period. (Default: 10 days) |
setLaunchTimeout | timeout and time unit, or ISO period | Sets the timeout for launching the Gobblin job. run() will throw a TimeoutException if the job has not started after this period. (Default: 10 seconds) |
setShutdownTimeout | timeout and time unit, or ISO period | Sets the timeout for shutting down embedded Gobblin after the job has finished. run() will throw a TimeoutException if the method has not returned within the timeout after the job finishes. Note that a TimeoutException may indicate that Gobblin could not release JVM resources, including threads. |
In addition to the above, subclasses of EmbeddedGobblin might offer their own convenience methods.
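The timeout setters in the table accept either a numeric timeout with a time unit or an ISO period string. As a rough illustration of how the two forms relate, the following sketch uses only the JDK's java.time.Duration to parse ISO-8601 strings. The class and method names are hypothetical, and Gobblin itself may parse period strings with a different library and accept a slightly different set of formats.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class TimeoutForms {
    // Converts an ISO-8601 duration string such as "PT10S" or "P10D" to milliseconds.
    static long isoToMillis(String iso) {
        return Duration.parse(iso).toMillis();
    }

    // Converts an explicit (timeout, unit) pair to milliseconds.
    static long pairToMillis(long timeout, TimeUnit unit) {
        return unit.toMillis(timeout);
    }

    public static void main(String[] args) {
        // "PT10S" matches the documented default launch timeout of 10 seconds.
        System.out.println(isoToMillis("PT10S") == pairToMillis(10, TimeUnit.SECONDS)); // prints "true"
        // "P10D" matches the documented default job timeout of 10 days.
        System.out.println(isoToMillis("P10D") == pairToMillis(10, TimeUnit.DAYS)); // prints "true"
    }
}
```

Under these assumptions, passing "PT10S" or the pair (10, TimeUnit.SECONDS) to a timeout setter would describe the same duration.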
After EmbeddedGobblin has been configured, it can be run with one of two methods:
- run(): blocking call. Returns a JobExecutionResult after the job finishes and Gobblin shuts down.
- runAsync(): asynchronous call. Returns a JobExecutionDriver, which implements Future<JobExecutionResult>.
Developers can extend EmbeddedGobblin to provide users with easier ways to launch a particular type of job. For an example, see EmbeddedGobblinDistcp.
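As a sketch of how a caller might consume the two run methods described earlier (blocking run() versus runAsync() returning a Future), the following uses only JDK concurrency types. FakeJob and Result are hypothetical stand-ins for a configured EmbeddedGobblin and for JobExecutionResult. They are not Gobblin classes, and the real methods may declare different exceptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AsyncRunSketch {
    /** Hypothetical stand-in for JobExecutionResult. */
    static final class Result {
        final boolean successful;
        Result(boolean successful) { this.successful = successful; }
    }

    /** Hypothetical stand-in for a configured EmbeddedGobblin instance. */
    static final class FakeJob {
        // Asynchronous form: returns at once with a Future for the eventual result.
        Future<Result> runAsync() {
            return CompletableFuture.supplyAsync(() -> new Result(true));
        }

        // Blocking form: waits until the result is available.
        Result run() {
            try {
                return runAsync().get();
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException("job failed", e);
            }
        }
    }

    // Starts the job, then waits at most the given time for its result.
    static Result runWithDeadline(FakeJob job, long timeout, TimeUnit unit) {
        Future<Result> pending = job.runAsync();
        // The caller could do unrelated work here while the job runs.
        try {
            return pending.get(timeout, unit);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            pending.cancel(true);
            throw new IllegalStateException("job did not complete", e);
        }
    }

    public static void main(String[] args) {
        FakeJob job = new FakeJob();
        System.out.println(job.run().successful);
        System.out.println(runWithDeadline(job, 5, TimeUnit.SECONDS).successful);
    }
}
```

The asynchronous form suits applications that need to keep working, or to enforce their own deadline, while a job is in flight. The blocking form is simpler when the caller has nothing else to do.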
Best practices:
- A subclass of EmbeddedGobblin is typically based on a template. The template should be automatically loaded on construction and the constructor should call setTemplate(myTemplate).
- Users should be able to call new MyEmbeddedGobblinExtension(params...).run() and get a sensible job run.
- Convenience methods should be provided for the most common configurations. For example:
  public EmbeddedGobblinDistcp simulate() {
    this.setConfiguration(CopySource.SIMULATE, Boolean.toString(true));
    return this;
  }
- If the job requires jars beyond the core Gobblin jars (see EmbeddedGobblin#getCoreGobblinJars for this list), then the constructor should call distributeJar(myJar) for the additional jars.
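The practices above can be sketched without the Gobblin dependency. In the following, BaseJob is a hypothetical stand-in for EmbeddedGobblin reduced to a configuration map, and CopyJob, the example.* keys, and the template path are invented for illustration. Only the overall shape (template and required settings fixed in the constructor, convenience methods that return this) mirrors the guidance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SubclassSketch {
    /** Hypothetical stand-in for EmbeddedGobblin: it only collects settings. */
    static class BaseJob {
        protected final Map<String, String> config = new LinkedHashMap<>();
        protected String template;

        BaseJob setTemplate(String templatePath) {
            this.template = templatePath;
            return this;
        }

        BaseJob setConfiguration(String key, String value) {
            config.put(key, value);
            return this;
        }
    }

    /** Hypothetical subclass for a copy-style job. */
    static class CopyJob extends BaseJob {
        CopyJob(String source, String target) {
            // Load the template and the required settings on construction,
            // so that new CopyJob(src, dst) needs no further setup.
            setTemplate("templates/example-copy.template");
            setConfiguration("example.source", source);
            setConfiguration("example.target", target);
        }

        // Convenience method: hides the exact key and returns the subclass
        // type so that calls can be chained.
        CopyJob simulate() {
            setConfiguration("example.simulate", Boolean.toString(true));
            return this;
        }
    }

    public static void main(String[] args) {
        CopyJob job = new CopyJob("/data/in", "/data/out").simulate();
        System.out.println(job.template);
        System.out.println(job.config);
    }
}
```

Returning the subclass type from each convenience method keeps subclass-specific methods available after every chained call, which is the same pattern the simulate() example relies on.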