This page provides some background about stream processing, describes what Samza is, and why it was built.
Messaging systems are a popular way of implementing near-realtime asynchronous computation. Messages can be added to a message queue (ActiveMQ, RabbitMQ), pub-sub system (Kestrel, Kafka), or log aggregation system (Flume, Scribe) when something happens. Downstream consumers read messages from these systems, and process them or take actions based on the message contents.
Suppose you have a website, and every time someone loads a page, you send a “user viewed page” event to a messaging system. You might then have consumers which do any of the following:
A messaging system lets you decouple all of this work from the actual web page serving.
A messaging system is a fairly low-level piece of infrastructure—it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find that there are a lot of tricky problems that have to be solved in the processing layer. Samza aims to help with these problems.
Consider the counting example, above (count page views and update a dashboard). What happens when the machine that your consumer is running on fails, and your current counter values are lost? How do you recover? Where should the processor be run when it restarts? What if the underlying messaging system sends you the same message twice, or loses a message? (Unless you are careful, your counts will be incorrect.) What if you want to count page views grouped by the page URL? How do you distribute the computation across multiple machines if it's too much for a single machine to handle?
Stream processing is a higher level of abstraction on top of messaging systems, and it's meant to address precisely this category of problems.
Samza is a stream processing framework with the following features:
The available open source stream processing systems are actually quite young, and no single system offers a complete solution. New problems in this area include: how a stream processor's state should be managed, whether or not a stream should be buffered remotely on disk, what to do when duplicate messages are received or messages are lost, and how to model underlying messaging systems.
Samza's main differentiators are:
For a more in-depth discussion on Samza, and how it relates to other stream processing systems, have a look at Samza's Comparisons documentation.