Heron is a general-purpose stream processing engine designed for high performance, low latency, isolation, reliability, and ease of use for developers and administrators alike. Heron was open sourced by Twitter.
We recommend reading Heron's Design Goals and Heron Topologies in conjunction with this guide.
The sections below describe the core components of Heron's architecture, from the components inside each topology to cluster-level services, and walk through what happens when a topology is submitted.
You can think of a Heron cluster as a mechanism for managing the lifecycle of stream-processing entities called topologies. Topologies can be written in Java or Python.
More information can be found in the Heron Topologies document.
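To make this concrete, below is a minimal sketch of a Java topology written against Heron's topology API. The `WordSpout` and `PrintBolt` classes are illustrative examples defined inline, and exact method signatures may vary across Heron versions:

```java
import java.util.Map;
import java.util.Random;

import org.apache.heron.api.Config;
import org.apache.heron.api.HeronSubmitter;
import org.apache.heron.api.bolt.BaseRichBolt;
import org.apache.heron.api.bolt.OutputCollector;
import org.apache.heron.api.spout.BaseRichSpout;
import org.apache.heron.api.spout.SpoutOutputCollector;
import org.apache.heron.api.topology.OutputFieldsDeclarer;
import org.apache.heron.api.topology.TopologyBuilder;
import org.apache.heron.api.topology.TopologyContext;
import org.apache.heron.api.tuple.Fields;
import org.apache.heron.api.tuple.Tuple;
import org.apache.heron.api.tuple.Values;

public final class WordTopology {
  // Illustrative spout: emits a random word from a fixed list.
  public static class WordSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] words = {"heron", "storm", "stream"};
    private final Random random = new Random();

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void nextTuple() {
      collector.emit(new Values(words[random.nextInt(words.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  // Illustrative bolt: prints each word it receives.
  public static class PrintBolt extends BaseRichBolt {
    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context,
                        OutputCollector collector) {
    }

    @Override
    public void execute(Tuple tuple) {
      System.out.println(tuple.getStringByField("word"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("word", new WordSpout(), 2);
    builder.setBolt("print", new PrintBolt(), 2).shuffleGrouping("word");

    Config conf = new Config();
    conf.setNumStmgrs(2);  // request one Stream Manager (container) per shard
    HeronSubmitter.submitTopology(args[0], conf, builder.createTopology());
  }
}
```

Submitting this class via the Heron CLI runs the main function, which builds the topology and hands it to `HeronSubmitter` for deployment.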
Apache Storm is a stream processing system originally open sourced by Twitter in 2011. Heron, also developed at Twitter, was created to overcome many of the shortcomings that Storm exhibited when run in production at Twitter scale.
Shortcoming | Solution |
---|---|
Resource isolation | Heron uses process-based isolation, both between topologies and between containers within a topology. This is more reliable and easier to monitor and debug than Storm's model, in which components share communication threads in the same JVM. |
Resource efficiency | Storm requires scheduler resources to be provisioned up front, which can lead to over-provisioning. Heron avoids this problem by using cluster resources on demand. |
Throughput | For a variety of architectural reasons, Heron has consistently been shown to provide much higher throughput and much lower latency than Storm. |
Heron was built to be fully backwards compatible with Storm, enabling topology developers to use Heron to run topologies created with Storm's topology API.
Currently, Heron is compatible with topologies written using the Storm topology API.
If you have existing topologies created using the Storm API, you can make them Heron compatible by following these simple instructions.
Heron was initially developed at Twitter with a few main goals in mind: overcoming Storm's shortcomings at Twitter scale while remaining fully compatible with Storm's topology API.
For a more in-depth discussion of Heron and Storm, see the Twitter Heron: Stream Processing at Scale paper.
Heron thus enables you to achieve major gains along a variety of axes, including throughput, latency, and reliability, without needing to sacrifice engineering resources.
For a description of the core goals of Heron as well as the principles that have guided its development, see Heron Design Goals.
From an architectural standpoint, Heron was built as an interconnected set of modular components.
The following core components of Heron topologies are discussed in depth in the sections below:

* Topology Master
* Container
* Stream Manager
* Heron Instance
* Metrics Manager
The Topology Master (TM) manages a topology throughout its entire lifecycle, from the time it's submitted until it's ultimately killed. When `heron` deploys a topology, it starts a single TM and multiple containers. The TM creates an ephemeral ZooKeeper node to ensure that there's only one TM for the topology and that the TM is easily discoverable by any process in the topology. The TM also constructs the physical plan for the topology, which it relays to the other components.
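The ephemeral-node pattern described above can be sketched with the plain Apache ZooKeeper client, as below. The znode path and payload here are purely illustrative; Heron's actual znode layout is an internal detail:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public final class TMasterRegistration {
  public static void register(ZooKeeper zk, String topology, String hostPort)
      throws KeeperException, InterruptedException {
    String path = "/heron/tmasters/" + topology;  // hypothetical path
    try {
      // EPHEMERAL: the node disappears if the TM's session dies, and the
      // create fails with NODEEXISTS if another TM is already registered.
      zk.create(path, hostPort.getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    } catch (KeeperException.NodeExistsException e) {
      throw new IllegalStateException(
          "A Topology Master is already running for " + topology, e);
    }
  }

  // Any process in the topology can discover the TM by reading the node.
  public static String discover(ZooKeeper zk, String topology)
      throws KeeperException, InterruptedException {
    byte[] data = zk.getData("/heron/tmasters/" + topology, false, null);
    return new String(data);
  }
}
```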
TMs have a variety of configurable parameters that you can adjust at each phase of a topology's lifecycle.
Each Heron topology consists of multiple containers, each of which houses multiple Heron Instances, a Stream Manager, and a Metrics Manager. Containers communicate with the topology's TM to ensure that the topology forms a fully connected graph.
For an illustration, see the figure in the Topology Master section above.
In Heron, all topology containerization is handled by the scheduler, be it Mesos, Kubernetes, YARN, or something else. Heron schedulers typically use cgroups to manage Heron topology processes.
The Stream Manager (SM) manages the routing of tuples between topology components. Each [Heron Instance]({{< ref "#heron-instance" >}}) in a topology connects to its local SM, while all of the SMs in a given topology connect to one another to form a network. Below is a visual illustration of a network of SMs:
In addition to being a routing engine for data streams, SMs are responsible for propagating back pressure within the topology when necessary. Below is an illustration of back pressure:
In the diagram above, assume that bolt B3 (in container A) receives all of its inputs from spout S1 and is running more slowly than the other components. In response, the SM for container A will refuse input from the SMs in containers C and D; the socket buffers in those containers will then fill up, which could lead to throughput collapse.
In a situation like this, Heron's back pressure mechanism will kick in. The SM in container A will send a message to all the other SMs; those SMs will then cut off inputs from their local spouts, and no new data will be accepted into the topology.
Once the lagging bolt (B3) begins functioning normally, the SM in container A will notify the other SMs and stream routing within the topology will return to normal.
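The sketch below conveys the general idea using hypothetical high/low watermarks on buffered bytes. It is a simplified illustration, not Heron's actual implementation, and all names and thresholds are invented for the example:

```java
import java.util.List;

// Stand-in for the messages one SM sends to its peers.
interface PeerStreamManager {
  void startBackPressure(String topology);
  void stopBackPressure(String topology);
}

final class BackPressureMonitor {
  private static final long HIGH_WATERMARK = 100 * 1024 * 1024;  // bytes, illustrative
  private static final long LOW_WATERMARK = 50 * 1024 * 1024;

  private final String topology;
  private final List<PeerStreamManager> peers;
  private boolean underBackPressure = false;

  BackPressureMonitor(String topology, List<PeerStreamManager> peers) {
    this.topology = topology;
    this.peers = peers;
  }

  // Called whenever the bytes buffered for a slow instance change.
  void onBufferSize(long bufferedBytes) {
    if (!underBackPressure && bufferedBytes > HIGH_WATERMARK) {
      underBackPressure = true;
      // Peers stop reading from their local spouts until further notice.
      peers.forEach(p -> p.startBackPressure(topology));
    } else if (underBackPressure && bufferedBytes < LOW_WATERMARK) {
      underBackPressure = false;
      // The buffer has drained; normal stream routing resumes.
      peers.forEach(p -> p.stopBackPressure(topology));
    }
  }
}
```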
SMs have a variety of configurable parameters that you can adjust at each phase of a topology's lifecycle.
A Heron Instance (HI) is a process that handles a single task of a spout or bolt, which allows for easy debugging and profiling.
Currently, Heron Instances are implemented only in Java, so all HIs are JVM processes, but this will change in the future.
HIs have a variety of configurable parameters that you can adjust at each phase of a topology's lifecycle.
Each topology runs a Metrics Manager (MM) that collects and exports metrics from all components in a [container]({{< ref "#container" >}}). It then routes those metrics both to the [Topology Master]({{< ref "#topology-master" >}}) and to external collectors, such as Scribe, Graphite, or analogous systems.
You can adapt Heron to support additional systems by implementing your own custom metrics sink.
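As a sketch, a sink that writes every metric to standard output might look like the following. It is based on the `IMetricsSink` interface in Heron's SPI; package and method names follow recent Apache Heron releases and may differ in older versions:

```java
import java.util.Map;

import org.apache.heron.spi.metricsmanager.metrics.MetricsInfo;
import org.apache.heron.spi.metricsmanager.metrics.MetricsRecord;
import org.apache.heron.spi.metricsmanager.sink.IMetricsSink;
import org.apache.heron.spi.metricsmanager.sink.SinkContext;

public class StdoutSink implements IMetricsSink {
  @Override
  public void init(Map<String, Object> conf, SinkContext context) {
    // Read sink-specific settings from conf here (none needed for stdout).
  }

  @Override
  public void processRecord(MetricsRecord record) {
    // Each record carries the source component plus a batch of metrics.
    for (MetricsInfo info : record.getMetrics()) {
      System.out.println(record.getSource() + " "
          + info.getName() + "=" + info.getValue());
    }
  }

  @Override
  public void flush() { }  // called periodically; nothing buffered here

  @Override
  public void close() { }
}
```

A custom sink also needs to be registered in the Metrics Manager's sink configuration; see the custom metrics sink documentation for the details.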
All of the components listed in the sections above can be found in each topology. The components listed below are cluster-level components that function outside of particular topologies.
Heron has a CLI tool called `heron` that is used to manage topologies. Documentation can be found in Managing Topologies.
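For example, submitting one of the example topology JARs to a local cluster looks something like this (the JAR path and class name below are illustrative and may differ across installations): `heron submit local ~/.heron/examples/heron-api-examples.jar org.apache.heron.examples.api.ExclamationTopology ExclamationTopology`.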
The Heron API server handles all requests from the Heron CLI tool, uploads topology artifacts to the designated storage system, and interacts with the scheduler.
When running Heron locally, you won't need to deploy or configure the Heron API server.
The Heron Tracker (or just Tracker) is a centralized gateway for cluster-wide information about topologies, including which topologies are running, being launched, being killed, etc. It relies on the same ZooKeeper nodes as the topologies in the cluster and exposes that information through a JSON REST API. The Tracker can be run within your Heron cluster (on the same set of machines managed by your Heron scheduler) or outside of it.
Instructions on running the Tracker, including JSON API docs, can be found in Heron Tracker.
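For example, assuming a Tracker running locally on its default port (8888), you could fetch the JSON listing of topologies from its `/topologies` endpoint with a few lines of Java:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class TrackerQuery {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8888/topologies"))  // default Tracker port
        .build();  // GET is the default method
    HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());  // raw JSON from the Tracker
  }
}
```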
Heron UI is a rich visual interface that you can use to interact with topologies. Through Heron UI you can see color-coded visual representations of the logical and physical plan of each topology in your cluster.
For more information, see the Heron UI document.
The diagram below illustrates what happens when you submit a Heron topology:
{{< diagram width="80" url="https://www.lucidchart.com/publicSegments/view/766a2ee5-7a07-4eff-9fde-dd79d6cc355e/image.png" >}}
Component | Description |
---|---|
Client | When a topology is submitted using the `heron submit` command of the Heron CLI tool, the CLI first executes the main function of the topology and creates a `.defn` file containing the topology's logical plan. It then runs `org.apache.heron.scheduler.SubmitterMain`, which is responsible for uploading the topology artifact to the Heron API server. |
Heron API server | When the Heron API server is notified that a topology is being submitted, it does two things. First, it uploads the topology artifacts (a JAR for Java or a PEX for Python, plus a few other files) to a storage service; Heron supports multiple uploaders for a variety of storage systems, such as Amazon S3, HDFS, and the local filesystem. Second, it notifies the Heron scheduler, as described in the next row. |
Heron scheduler | When the Heron CLI (client) submits a topology to the Heron API server, the API server notifies the Heron scheduler and also provides the scheduler with the topology's logical plan, physical plan, and some other artifacts. The scheduler, be it Mesos, Aurora, the local filesystem, or something else, then deploys the topology using containers. |
Storage | When the topology is deployed to containers by the scheduler, the code running in those containers then downloads the remaining necessary topology artifacts (essentially the code that will run in those containers) from the storage system. |
Shared Services
When the main scheduler (`org.apache.heron.scheduler.SchedulerMain`) is invoked by the launcher, it fetches the submitted topology artifact from the topology storage, initializes the State Manager, and prepares a physical plan that specifies how multiple instances should be packed into containers. Then, it starts the specified scheduler, such as `org.apache.heron.scheduler.local.LocalScheduler`, which invokes the `heron-executor` for each container.
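For reference, custom schedulers implement the `IScheduler` interface from Heron's SPI. The skeleton below is only a sketch; package and method names follow recent Apache Heron releases and may differ across versions:

```java
import java.util.Collections;
import java.util.List;

import org.apache.heron.proto.scheduler.Scheduler;
import org.apache.heron.spi.common.Config;
import org.apache.heron.spi.packing.PackingPlan;
import org.apache.heron.spi.scheduler.IScheduler;

public class SketchScheduler implements IScheduler {
  @Override
  public void initialize(Config config, Config runtime) {
    // Read cluster-specific settings (endpoints, credentials, and so on).
  }

  @Override
  public boolean onSchedule(PackingPlan packing) {
    // Launch one container (running heron-executor) per entry in the plan.
    for (PackingPlan.ContainerPlan container : packing.getContainers()) {
      System.out.println("would launch container " + container.getId());
    }
    return true;
  }

  @Override
  public boolean onKill(Scheduler.KillTopologyRequest request) {
    return true;  // tear down all containers for the topology
  }

  @Override
  public boolean onRestart(Scheduler.RestartTopologyRequest request) {
    return true;  // restart the requested container(s)
  }

  @Override
  public boolean onUpdate(Scheduler.UpdateTopologyRequest request) {
    return true;  // apply a new packing plan
  }

  @Override
  public List<String> getJobLinks() {
    return Collections.emptyList();  // links to external dashboards, if any
  }

  @Override
  public void close() { }
}
```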
Topologies
A `heron-executor` process is started for each container and is responsible for executing the Topology Master or the Heron Instances (bolts/spouts) that are assigned to the container. Note that the Topology Master is always executed in container 0. When `heron-executor` executes normal Heron Instances (i.e., in containers other than container 0), it first prepares the Stream Manager and the Metrics Manager before starting `org.apache.heron.instance.HeronInstance` for each instance that is assigned to the container.
A Heron Instance has two threads: the gateway thread and the slave thread. The gateway thread is mainly responsible for communicating with the Stream Manager and the Metrics Manager using `StreamManagerClient` and `MetricsManagerClient`, respectively, as well as for sending tuples to and receiving tuples from the slave thread. The slave thread, in turn, runs either the spout or the bolt of the topology, based on the physical plan.
When a new Heron Instance is started, its `StreamManagerClient` establishes a connection and registers itself with the Stream Manager. After successful registration, the gateway thread sends the physical plan to the slave thread, which then executes the assigned instance accordingly.
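The sketch below conveys this two-thread design in simplified form. It is illustrative only, not Heron's actual implementation; the placeholder methods stand in for network I/O and for the user's spout/bolt code:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class InstanceSketch {
  // In-memory queues connecting the two threads.
  private final BlockingQueue<Object> inQueue = new LinkedBlockingQueue<>();
  private final BlockingQueue<Object> outQueue = new LinkedBlockingQueue<>();

  void start() {
    Thread gateway = new Thread(() -> {
      while (true) {
        // 1. Read tuples arriving from the local Stream Manager (elided)
        //    and hand them to the slave thread.
        inQueue.offer(receiveFromStreamManager());
        // 2. Drain tuples the slave thread emitted and send them back out.
        Object emitted;
        while ((emitted = outQueue.poll()) != null) {
          sendToStreamManager(emitted);
        }
      }
    }, "gateway");

    Thread slave = new Thread(() -> {
      while (true) {
        try {
          Object tuple = inQueue.take();           // tuple from the gateway
          outQueue.offer(executeUserCode(tuple));  // run the bolt/spout logic
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }, "slave");

    gateway.start();
    slave.start();
  }

  // Placeholders standing in for network I/O and user spout/bolt code.
  private Object receiveFromStreamManager() { return new Object(); }
  private void sendToStreamManager(Object tuple) { }
  private Object executeUserCode(Object tuple) { return tuple; }
}
```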
Heron is primarily written in Java, C++, and Python.
A detailed guide to the Heron codebase can be found here.