Every message that is read from or written to a stream or a persistent state store needs to eventually be serialized to bytes (which are sent over the network or written to disk). There are various places where that serialization and deserialization can happen:
You can use whatever makes sense for your job; Samza doesn‘t impose any particular data model or serialization scheme on you. However, the cleanest solution is usually to use Samza’s serde layer. The following configuration example shows how to use it.
{% highlight jproperties %}
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
task.inputs=kafka.PageViewEvent
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
serializers.registry.integer.class=org.apache.samza.serializers.IntegerSerdeFactory
systems.kafka.streams.PageViewEvent.samza.key.serde=integer systems.kafka.streams.PageViewEvent.samza.msg.serde=json
stores.LastPageViewPerUser.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory stores.LastPageViewPerUser.changelog=kafka.last-page-view-per-user stores.LastPageViewPerUser.key.serde=integer stores.LastPageViewPerUser.msg.serde=json {% endhighlight %}
Each serde is defined with a factory class. Samza comes with several builtin serdes for UTF-8 strings, binary-encoded integers, JSON and more. The following is a comprehensive list of supported serdes in Samza.
You can also create your own serializer by implementing the SerdeFactory interface.
The name you give to a serde (such as “json” and “integer” in the example above) is only for convenience in your job configuration; you can choose whatever name you like. For each stream and each state store, you can use the serde name to declare how messages should be serialized and deserialized.
If you don't declare a serde, Samza simply passes objects through between your task instance and the system stream. In that case your task needs to send and receive whatever type of object the underlying client library uses.
All the Samza APIs for sending and receiving messages are typed as Object. This means that you have to cast messages to the correct type before you can use them. It's a little bit more code, but it has the advantage that Samza is not restricted to any particular data model.