blob: f2595208d4d761ee46382da4faa862dcbff39e30 [file] [log] [blame] [view]
---
layout: case-study
hide_title: true # so we have control in case-study layout, but can still use page
title: On using Samza at Optimizely to compute analytics over session windows.
study_domain: optimizely.com
priority: 2
menu_title: Optimizely
exclude_from_loop: false
excerpt_separator: <!--more-->
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
On using Samza at Optimizely to compute analytics over session windows
<!--more-->
Optimizely is the worlds leading experimentation platform, enabling businesses to
deliver continuous experimentation and personalization across websites, mobile
apps and connected devices. At Optimizely, billions of events are tracked on a
daily basis and session metrics are provided to their users in real-time.
Prior to introducing Samza for their realtime computation, the
engineering team at Optimizely built their data-pipeline using a complex
[Lambda architecture](http://lambda-architecture.net/) using
[Druid and Hbase](https://medium.com/engineers-optimizely/building-a-scalable-data-pipeline-bfe3f531eb38).
Since some session metrics were computed using Map-Reduce jobs, they
could be delayed up to hours after the events are received. As business requirements evolved,
this solution became more and [more challenging](https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-1-aed2051dd7a3) to scale.
The engineering team at Optimizely turned to stream processing to reduce latencies.
In their solution, each up-stream client associates a _sessionId_ with the events it generates. Upon receiving each event, the Samza job extracts various
fields (e.g. ip address, location information, browser version, etc) and updates aggregated metrics
for the session. At the end of a time-window, the merged metrics for that session are ingested to HBase.
With the new solution <br/>
- The median query latency was reduced from 40+ ms to 5 ms <br/>
- Session metrics are now available in real-time <br/>
- Write-rate to Hbase is reduced, since the metrics are pre-aggregated by Samza<br/>
- Storage requirements on Hbase are drastically reduced <br/>
- Lower development effort thanks to out-of-the-box Kafka integration <br/>
Here is a testimonial from Optimizely
At Optimizely, we have built the worlds leading experimentation platform,
which ingests billions of click-stream events a day from millions of visitors
for analysis. Apache Samza has been a great asset to Optimizely's Event
ingestion pipeline allowing us to perform large scale, real time stream
computing such as aggregations (e.g. session computations) and data enrichment
on a multiple billion events / day scale. The programming model, durability
and the close integration with Apache Kafka fit our needs perfectly” says
Vignesh Sukumar, Senior Engineering Manager at Optimizely.
In addition to this case-study, Apache Samza is also leveraged for other usecases such as
data-enrichment, re-partitioning of event streams and computing realtime metrics etc.
Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*
More information
- [From batching to streaming at Optimizely - Part 1](https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-1-aed2051dd7a3)
- [From batching to streaming at Optimizely - Part 2](https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-2-b596350a7820)