Don't want to manually create spouts and bolts? Try the Heron Streamlet API. If you find manually creating and connecting spouts and bolts overly cumbersome, we recommend trying the Heron Streamlet API for Java, which lets you express your topology logic in a streamlined style inspired by functional programming concepts.
A Heron topology is a directed acyclic graph (DAG) used to process streams of data. Topologies can be stateless or stateful depending on your use case.
Heron topologies consist of two basic components: spouts, which feed data into a topology, and bolts, which process that data.
Spouts and bolts are connected to one another via streams of data. Below is a visual illustration of a simple Heron topology:
In the diagram above, spout S1 feeds data to bolts B1 and B2 for processing; in turn, bolt B1 feeds processed data to bolts B3 and B4, while bolt B2 feeds processed data to bolt B4. This is just a simple example; you can create arbitrarily complex topologies in Heron.
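The example topology described above can be written down as a small graph to make the DAG property concrete. The following is a minimal sketch in plain Python, not Heron code: the `TOPOLOGY` dictionary and the `topological_order` helper are hypothetical names introduced for illustration. The helper returns one valid processing order and rejects any graph that contains a cycle.

```python
from collections import deque

# Edges of the example topology: each component maps to its downstream components.
TOPOLOGY = {
    "S1": ["B1", "B2"],
    "B1": ["B3", "B4"],
    "B2": ["B4"],
    "B3": [],
    "B4": [],
}


def topological_order(graph):
    """Return components in a valid processing order; raise if the graph has a cycle."""
    # Count incoming streams for every component.
    indegree = {node: 0 for node in graph}
    for downstream in graph.values():
        for node in downstream:
            indegree[node] += 1

    # Components with no incoming streams (the spouts) can run first.
    ready = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in graph[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Any component left unvisited is part of a cycle.
    if len(order) != len(graph):
        raise ValueError("graph contains a cycle, so it is not a valid DAG")
    return order


print(topological_order(TOPOLOGY))  # ['S1', 'B1', 'B2', 'B3', 'B4']
```

Because every stream flows from an earlier component to a later one in this order, data never loops back to a component that has already handled it.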
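There are currently two APIs available that you can use to build Heron topologies: the higher-level Heron Streamlet API and the lower-level topology API, which is based on spouts and bolts.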
Once you've set up a Heron cluster, you can use Heron's CLI tool to manage the entire lifecycle of a topology, which typically goes through the following stages:
A topology's logical plan is analogous to a database query plan in that it maps out the basic operations associated with a topology. Here's an example logical plan for the example Streamlet API topology below:
Whether you use the Heron Streamlet API or the topology API, Heron automatically transforms the processing logic that you create into both a logical plan and a physical plan.
A topology's physical plan is related to its logical plan, with the crucial difference that a physical plan determines the "physical" execution logic of a topology, i.e., how topology processes are divided between containers. Here's a basic visual representation of a physical plan:
In this example, a Heron topology consists of two spouts and five different bolts (each of which has multiple instances) that have automatically been distributed between five different containers.
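To give a feel for what "distributed between containers" means, here is a minimal sketch that spreads component instances over containers in round-robin order. This is an illustrative assumption, not a description of how Heron actually packs instances: the `pack_round_robin` function, the component names, and the parallelism values are all hypothetical.

```python
def pack_round_robin(instances, num_containers):
    """Assign instances to containers 1..num_containers in round-robin order."""
    if num_containers < 1:
        raise ValueError("a topology needs at least one container")
    containers = {container_id: [] for container_id in range(1, num_containers + 1)}
    for position, instance in enumerate(instances):
        container_id = position % num_containers + 1
        containers[container_id].append(instance)
    return containers


# Hypothetical component names and parallelism values, for illustration only.
parallelism = {"spout-1": 2, "spout-2": 2, "bolt-1": 3}
instances = [f"{name}#{i}" for name, count in parallelism.items() for i in range(count)]

plan = pack_round_robin(instances, num_containers=5)
for container_id, assigned in plan.items():
    print(container_id, assigned)
```

The logical plan only says which components exist and how they connect; a mapping like `plan` is the extra information a physical plan adds.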
Windowed computations gather results from a topology or topology component within a specified finite time frame rather than, say, on a per-tuple basis.
Here are some examples of window operations:
Sliding windows are windows that overlap, as in this figure:
For sliding windows, you need to specify two things: the length (duration) of the window and the sliding interval, which determines when each new window begins.
In the figure above, the duration of the window is 20 seconds, while the sliding interval is 10 seconds. Each new window begins 10 seconds into the current window.
With sliding time windows, data can be processed in more than one window. Tuples 3, 4, and 5 above are processed in both window 1 and window 2 while tuples 6, 7, and 8 are processed in both window 2 and window 3.
Setting the duration of a window to 16 seconds and the sliding interval to 12 seconds would produce this window arrangement:
Here, the sliding interval determines that a new window is always created 12 seconds into the current window.
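The relationship between window duration and sliding interval can be checked with a few lines of arithmetic. The sketch below is plain Python rather than Heron API code, and `sliding_windows_for` is a hypothetical helper. It assumes the first window starts at time 0 and that each window covers the half-open range from its start time up to, but not including, start plus duration.

```python
def sliding_windows_for(timestamp, duration, slide):
    """Return the start times of every window [start, start + duration) containing timestamp."""
    if duration <= 0 or slide <= 0:
        raise ValueError("duration and slide must be positive")
    starts = []
    # The latest window that can contain the timestamp starts at or before it.
    start = (timestamp // slide) * slide
    # Walk backwards through earlier windows until they no longer reach the timestamp.
    while start >= 0 and start + duration > timestamp:
        starts.append(start)
        start -= slide
    return sorted(starts)


# Duration 20 s, sliding interval 10 s: a tuple at t=12 falls into two windows.
print(sliding_windows_for(12, duration=20, slide=10))  # [0, 10]

# Duration 16 s, sliding interval 12 s: the overlap is only 4 seconds long.
print(sliding_windows_for(13, duration=16, slide=12))  # [0, 12]
print(sliding_windows_for(17, duration=16, slide=12))  # [12]
```

The closer the sliding interval is to the duration, the smaller the overlap and the fewer tuples are processed twice.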
Tumbling windows are windows that don't overlap, as in this figure:
Tumbling windows don't overlap because a new window doesn't begin until the current window has elapsed. For tumbling windows, you need to specify only the duration of the window; there is no sliding interval.
With tumbling windows, data are never processed in more than one window because the windows never overlap. In the figure above, the duration of each window is 20 seconds.
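Since tumbling windows never overlap, each tuple maps to exactly one window, and that window can be found by integer division. The following is a minimal sketch in plain Python; `tumbling_windows` is a hypothetical helper, and the timestamps are made-up values in seconds, with the first window assumed to start at time 0.

```python
from collections import defaultdict


def tumbling_windows(timestamps, duration):
    """Group timestamps by the index of the single tumbling window each one falls into."""
    windows = defaultdict(list)
    for ts in timestamps:
        # Window k covers the range [k * duration, (k + 1) * duration).
        windows[int(ts // duration)].append(ts)
    return dict(windows)


# With 20-second windows, a tuple arriving at t=20 opens the second window (index 1).
print(tumbling_windows([1, 5, 19, 20, 38, 41], duration=20))
# {0: [1, 5, 19], 1: [20, 38], 2: [41]}
```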
Count windows are specified on the basis of the number of tuples processed rather than a time interval. A count window of 100 means that a window elapses after 100 tuples have been processed, regardless of clock time.
With count windows, this scenario (for a count window of 50) would be completely normal:
Window | Tuples processed | Clock time |
---|---|---|
1 | 50 | 10 seconds |
2 | 50 | 12 seconds |
3 | 50 | 1 hour, 12 minutes |
4 | 50 | 5 seconds |
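A count window can be sketched as simple batching: a window closes when it holds the configured number of tuples, however long that takes. The generator below is an illustrative plain-Python sketch, not Heron code, and `count_windows` is a hypothetical name. It assumes that a trailing window that never fills up is not emitted.

```python
def count_windows(tuples, size):
    """Yield consecutive windows that each contain exactly `size` tuples."""
    if size < 1:
        raise ValueError("window size must be at least 1")
    window = []
    for item in tuples:
        window.append(item)
        if len(window) == size:
            # The window closes on the tuple count alone; clock time plays no role.
            yield window
            window = []


print(list(count_windows(range(7), size=3)))  # [[0, 1, 2], [3, 4, 5]]
```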
Time windows differ from count windows because you need to specify a time duration (in seconds) rather than a number of tuples processed.
With time windows, this scenario (for a time window of 30 seconds) would be completely normal:
Window | Tuples processed | Clock time |
---|---|---|
1 | 150 | 30 seconds |
2 | 50 | 30 seconds |
3 | 0 | 30 seconds |
4 | 375 | 30 seconds |
As explained above, windows differ along two axes: sliding (overlapping) vs. tumbling (non-overlapping) and count vs. time. This produces four total types: sliding time windows, sliding count windows, tumbling time windows, and tumbling count windows.
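One of those combinations, the sliding count window, can be sketched as follows. This is an illustrative plain-Python example rather than Heron API code, and `sliding_count_windows` is a hypothetical helper. It assumes both the window length and the sliding interval are expressed as numbers of tuples, and that only full windows are produced.

```python
def sliding_count_windows(tuples, size, slide):
    """Return every full window of `size` tuples, starting a new window every `slide` tuples."""
    if size < 1 or slide < 1:
        raise ValueError("size and slide must be at least 1")
    items = list(tuples)
    return [items[start:start + size] for start in range(0, len(items) - size + 1, slide)]


# Windows of 4 tuples that advance by 2 tuples: tuples 2 and 3 appear in both windows.
print(sliding_count_windows(range(6), size=4, slide=2))  # [[0, 1, 2, 3], [2, 3, 4, 5]]

# Setting slide equal to size gives tumbling count windows with no overlap.
print(sliding_count_windows(range(6), size=3, slide=3))  # [[0, 1, 2], [3, 4, 5]]
```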
When creating topologies using the Streamlet API, there are three types of resources that you can specify: the number of containers, CPU, and RAM.
For each topology, there are defaults for each resource type:
Resource | Default | Minimum |
---|---|---|
Number of containers | 1 | 1 |
CPU | 1.0 | 1.0 |
RAM | 512 MB | 192 MB |
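The defaults and minimums in the table can be captured in a small validation sketch. The code below is plain Python and does not use any Heron API: the `TopologyResources` class, its field names, and the `MIN_*` constants are hypothetical and exist only to illustrate how the table's values relate to one another.

```python
from dataclasses import dataclass

# Minimums taken from the table above.
MIN_CONTAINERS = 1
MIN_CPU = 1.0
MIN_RAM_MB = 192


@dataclass(frozen=True)
class TopologyResources:
    """Hypothetical holder for per-topology resources, using the table's defaults."""
    num_containers: int = 1
    cpu: float = 1.0
    ram_mb: int = 512

    def __post_init__(self):
        # Reject any value that falls below the documented minimum.
        if self.num_containers < MIN_CONTAINERS:
            raise ValueError(f"need at least {MIN_CONTAINERS} container")
        if self.cpu < MIN_CPU:
            raise ValueError(f"CPU must be at least {MIN_CPU}")
        if self.ram_mb < MIN_RAM_MB:
            raise ValueError(f"RAM must be at least {MIN_RAM_MB} MB")


print(TopologyResources())                              # all defaults
print(TopologyResources(num_containers=5, ram_mb=1024))  # explicit overrides
```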
For instructions on allocating resources to topologies, see the language-specific documentation for:
Heron's original topology API required using a fundamentally tuple-driven data model. You can find more information in Heron's Data Model.