| % THIS IS AN EXAMPLE DOCUMENT FOR VLDB 2012 |
| % based on ACM SIGPROC-SP.TEX VERSION 2.7 |
| % Modified by Gerald Weber <gerald@cs.auckland.ac.nz> |
| % Removed the requirement to include *bbl file in here. (AhmetSacan, Sep2012) |
| % Fixed the equation on page 3 to prevent line overflow. (AhmetSacan, Sep2012) |
| |
| \documentclass{vldb} |
| \usepackage{graphicx} |
| \usepackage{balance} % for \balance command ON LAST PAGE (only there!) |
| \usepackage{fontspec} |
| \usepackage{hyperref} |
| \graphicspath{{figures/}} |
| \usepackage{enumitem} |
| |
| \begin{document} |
| |
| % ****************** TITLE **************************************** |
| |
| \title{Druid: Open Source Real-time Analytics at Scale} |
| |
| % possible, but not really needed or used for PVLDB: |
| %\subtitle{[Extended Abstract] |
| %\titlenote{A full version of this paper is available as\textit{Author's Guide to Preparing ACM SIG Proceedings Using \LaTeX$2_\epsilon$\ and BibTeX} at \texttt{www.acm.org/eaddress.htm}}} |
| |
| % ****************** AUTHORS ************************************** |
| |
| % You need the command \numberofauthors to handle the 'placement |
| % and alignment' of the authors beneath the title. |
| % |
| % For aesthetic reasons, we recommend 'three authors at a time' |
| % i.e. three 'name/affiliation blocks' be placed beneath the title. |
| % |
| % NOTE: You are NOT restricted in how many 'rows' of |
| % "name/affiliations" may appear. We just ask that you restrict |
| % the number of 'columns' to three. |
| % |
| % Because of the available 'opening page real-estate' |
| % we ask you to refrain from putting more than six authors |
| % (two rows with three columns) beneath the article title. |
| % More than six makes the first-page appear very cluttered indeed. |
| % |
| % Use the \alignauthor commands to handle the names |
| % and affiliations for an 'aesthetic maximum' of six authors. |
| % Add names, affiliations, addresses for |
| % the seventh etc. author(s) as the argument for the |
| % \additionalauthors command. |
| % These 'additional authors' will be output/set for you |
| % without further effort on your part as the last section in |
| % the body of your article BEFORE References or any Appendices. |
| |
\numberofauthors{6} % in this sample file, there are a *total*
% of NINE authors. SIX appear on the 'first-page' (for formatting
% reasons) and the remaining three appear in the \additionalauthors section.
| |
| \author{ |
| % You can go ahead and credit any number of authors here, |
| % e.g. one 'row of three' or two rows (consisting of one row of three |
| % and a second row of one, two or three). |
| % |
| % The command \alignauthor (no curly braces needed) should |
| % precede each author name, affiliation/snail-mail address and |
| % e-mail address. Additionally, tag each line of |
| % affiliation/address with \affaddr, and tag the |
| % e-mail address with \email. |
| % |
| % 1st. author |
| \alignauthor |
| Fangjin Yang\\ |
| \affaddr{Metamarkets Group, Inc.}\\ |
| \email{fangjin@metamarkets.com} |
| % 2nd. author |
| \alignauthor |
| Eric Tschetter\\ |
| \email{echeddar@gmail.com} |
| % 3rd. author |
| \alignauthor |
| Xavier Léauté\\ |
| \affaddr{Metamarkets Group, Inc.}\\ |
| \email{xavier@metamarkets.com} |
| \and % use '\and' if you need 'another row' of author names |
| % 4th. author |
| \alignauthor |
| Nishant Bangarwa\\ |
| \affaddr{Metamarkets Group, Inc.}\\ |
| \email{nishant@metamarkets.com} |
| % 5th. author |
| \alignauthor |
| Nelson Ray\\ |
| \email{ncray86@gmail.com} |
| % 6th. author |
| \alignauthor |
| Gian Merlino\\ |
| \affaddr{Metamarkets Group, Inc.}\\ |
| \email{gian@metamarkets.com} |
| } |
| % There's nothing stopping you putting the seventh, eighth, etc. |
| % author on the opening page (as the 'third row') but we ask, |
| % for aesthetic reasons that you place these 'additional authors' |
| % in the \additional authors block, viz. |
| \additionalauthors{Additional authors: Deep Ganguli (Metamarkets Group, Inc., {\texttt{deep@metamarkets.com}}), Himadri Singh (Metamarkets Group, Inc., {\texttt{himadri@metamarkets.com}}), Igal Levy (Metamarkets Group, Inc., {\texttt{igal@metamarkets.com}})} |
| \date{14 March 2014} |
| % Just remember to make sure that the TOTAL number of authors |
| % is the number that will appear on the first page PLUS the |
| % number that will appear in the \additionalauthors section. |
| |
| |
| \maketitle |
| |
| \begin{abstract} |
| Druid is an open |
| source\footnote{\href{https://github.com/druid-io/druid}{https://github.com/druid-io/druid}} |
| data store built for exploratory analytics on large data sets. Druid supports |
| fast data aggregation, low latency data ingestion, and arbitrary data |
| exploration. The system combines a column-oriented storage layout, a |
| distributed, shared-nothing architecture, and an advanced indexing structure to |
| return queries on billions of rows in milliseconds. Druid is petabyte scale and |
| is deployed in production at several technology companies. |
| \end{abstract} |
| |
| \section{Introduction} |
| The recent proliferation of internet technology has created a surge |
| in machine-generated events. Individually, these events contain minimal useful |
information and are of low value. Given the time and resources required to
extract meaning from large collections of events, many companies chose to
discard this data instead.
| |
A few years ago, Google introduced MapReduce as its mechanism for leveraging
commodity hardware to index the internet and analyze logs. The Hadoop project
| soon followed and was largely patterned after the insights that came out of the |
| original MapReduce paper. Hadoop has contributed much to helping companies |
| convert their low-value event streams into high-value aggregates for a variety |
of applications such as business intelligence and A/B testing.
| |
As with many great systems, Hadoop has opened our eyes to a new space of
problems. Specifically, Hadoop excels at storing and providing access to large
amounts of data; however, it does not make any performance guarantees around
how quickly that data can be accessed. Furthermore, although Hadoop is a
| highly available system, performance degrades under heavy concurrent load. |
| Lastly, while Hadoop works well for storing data, it is not optimized for |
| ingesting data and making that data immediately readable. |
| |
| \subsection{The Need for Druid} |
| Druid was originally designed to solve problems around ingesting and exploring |
large quantities of transactional events (log data). This form of timeseries
data (OLAP data) is commonly found in the business intelligence space, and the
nature of the data tends to be very append-heavy. Events typically
| have three distinct components: a timestamp column indicating when the event |
| occurred, a set of dimension columns indicating various attributes about the |
| event, and a set of metric columns containing values (usually numeric) that can |
| be aggregated. Queries are typically issued for the sum of some set of metrics, |
| filtered by some set of dimensions, over some span of time. |
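
As a minimal illustration of this structure, a single event might be
represented as follows (the field names are our own, not drawn from any
particular production schema):
{\small\begin{verbatim}
# A hypothetical sales event: one timestamp,
# string dimensions, and numeric metrics.
# Field names are illustrative only.
event = {
    "timestamp": "2014-01-01T01:00:00Z",
    "city": "San Francisco",   # dimension
    "product": "widget",       # dimension
    "units": 1,                # metric
    "revenue": 25.0,           # metric
}
\end{verbatim}}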
| |
| The Druid project first began out of necessity at Metamarkets to power a |
| business intelligence dashboard that allowed users to arbitrarily explore and |
| visualize event streams. Existing open source Relational Database Management |
| Systems, cluster computing frameworks, and NoSQL key/value stores were unable |
| to provide a low latency data ingestion and query platform for an interactive |
| dashboard. Queries needed to return fast enough to allow the data |
| visualizations in the dashboard to update interactively. |
| |
| In addition to the query latency needs, the system had to be multi-tenant and |
| highly available, as the dashboard is used in a highly concurrent environment. |
| Downtime is costly and many businesses cannot afford to wait if a system is |
| unavailable in the face of software upgrades or network failure. Finally, |
| Metamarkets also wanted to allow users and alerting systems to be able to make |
business decisions in ``real-time''. The time from when an event is created to
| when that event is queryable determines how fast users and systems are able to |
| react to potentially catastrophic occurrences in their systems. |
| |
| The problems of data exploration, ingestion, and availability span multiple |
| industries. Since Druid was open sourced in October 2012, it has been deployed as a |
| video, network monitoring, operations monitoring, and online advertising |
| analytics platform at multiple companies\footnote{\href{http://druid.io/druid.html}{http://druid.io/druid.html}}. |
| |
| \begin{figure*} |
| \centering |
| \includegraphics[width = 4.5in]{cluster} |
| \caption{An overview of a Druid cluster and the flow of data through the cluster.} |
| \label{fig:cluster} |
| \end{figure*} |
| |
| \section{Architecture} |
A Druid cluster consists of different types of nodes, and each node type is
designed to perform a specific set of tasks. We believe this design separates
concerns and reduces the overall complexity of the system. The different node types
| operate fairly independently of each other and there is minimal interaction among |
| them. Hence, intra-cluster communication failures have minimal impact on data |
| availability. To solve complex data analysis problems, the different node |
types come together to form a fully working system. The composition of a Druid
cluster and the flow of data through it are shown in Figure~\ref{fig:cluster}. All Druid
| nodes announce their availability and the data they are serving over |
| Zookeeper\cite{hunt2010zookeeper}. |
| |
| \subsection{Real-time Nodes} |
| Real-time nodes encapsulate the functionality to ingest and query event |
| streams. Events indexed via these nodes are immediately available for querying. |
| These nodes are only concerned with events for some small time range. They |
| periodically hand off batches of immutable events to other nodes in the Druid |
| cluster that are specialized in dealing with batches of immutable events. |
| |
| Real-time nodes maintain an in-memory index buffer for all incoming events. |
| These indexes are incrementally populated as new events are ingested and the |
| indexes are also directly queryable. To avoid heap overflow problems, real-time |
| nodes persist their in-memory indexes to disk either periodically or after some |
| maximum row limit is reached. This persist process converts data stored in the |
| in-memory buffer to a column oriented storage format. Each persisted index is |
| immutable and real-time nodes load persisted indexes into off-heap memory such |
| that they can still be queried. On a periodic basis, each real-time node will |
| schedule a background task that searches for all locally persisted indexes. The |
| task merges these indexes together and builds an immutable block of data that |
contains all the events that have been ingested by a real-time node for some
span of time. We refer to this block of data as a ``segment''. During the
handoff stage, a real-time node uploads this segment to permanent backup
storage, typically a distributed file system that Druid calls ``deep storage''.
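
The following is a small sketch of this ingest, persist, and handoff cycle;
the class and method names are illustrative shorthand, not Druid's actual API:
{\small\begin{verbatim}
# Sketch of the real-time node ingest/persist/
# handoff cycle; names are illustrative only.
class RealtimeNode:
    def __init__(self, max_rows, deep_storage):
        self.buffer = []     # in-memory, queryable
        self.persisted = []  # immutable local indexes
        self.max_rows = max_rows
        self.deep_storage = deep_storage

    def ingest(self, event):
        # new events are immediately queryable
        self.buffer.append(event)
        if len(self.buffer) >= self.max_rows:
            self.persist()

    def persist(self):
        # freeze the buffer as an immutable index
        self.persisted.append(tuple(self.buffer))
        self.buffer = []

    def handoff(self):
        # merge local indexes into a single segment
        # and upload it to deep storage
        segment = [r for idx in self.persisted
                     for r in idx]
        self.deep_storage.append(segment)
        self.persisted = []
\end{verbatim}}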
| |
| \subsection{Historical Nodes} |
| Historical nodes encapsulate the functionality to load and serve the immutable |
| blocks of data (segments) created by real-time nodes. In many real-world |
| workflows, most of the data loaded in a Druid cluster is immutable and hence |
| historical nodes are typically the main workers of a Druid cluster. Historical |
| nodes follow a shared-nothing architecture and there is no single point of |
| contention among the nodes. The nodes have no knowledge of one another and are |
| operationally simple; they only know how to load, drop, and serve immutable |
| segments. |
| |
| \subsection{Broker Nodes} |
| Broker nodes act as query routers to historical and real-time nodes. Broker |
| nodes understand what segments are queryable and where those segments are |
| located. Broker nodes route incoming queries such that the queries hit the |
| right historical or real-time nodes. Broker nodes also merge partial results |
| from historical and real-time nodes before returning a final consolidated |
| result to the caller. |
| |
| \subsection{Coordinator Nodes} |
| Druid coordinator nodes are primarily in charge of data management and |
| distribution on historical nodes. The coordinator nodes tell historical nodes |
| to load new data, drop outdated data, replicate data, and move data to load |
| balance. Coordinator nodes undergo a |
| leader-election process that determines a single node that runs the coordinator |
| functionality. The remaining coordinator nodes act as redundant backups. |
| |
| A coordinator node runs periodically to determine the current state of the |
| cluster. It makes decisions by comparing the expected state of the cluster with |
| the actual state of the cluster at the time of the run. Coordinator nodes also |
| maintain a connection to a MySQL database that contains additional operational |
| parameters and configurations. One of the key pieces of information located in |
| the MySQL database is a table that contains a list of all segments that should |
| be served by historical nodes. This table can be updated by any service that |
| creates segments, such as real-time nodes. |
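
A minimal sketch of a single coordinator run, assuming the expected state is
the segment list in the MySQL table and the actual state is what historical
nodes currently serve, might look as follows (all names are illustrative):
{\small\begin{verbatim}
# Sketch of one coordinator reconciliation pass;
# all names are illustrative.
def coordinator_run(expected, nodes):
    # expected: segment ids from the MySQL table
    # nodes: node id -> set of served segment ids
    actual = set()
    for served in nodes.values():
        actual |= served

    to_load = expected - actual   # assign these
    to_drop = actual - expected   # remove these

    for seg in sorted(to_load):
        # crude load balancing: pick the node
        # currently serving the fewest segments
        target = min(nodes,
                     key=lambda n: len(nodes[n]))
        nodes[target].add(seg)
    for served in nodes.values():
        served -= to_drop
    return to_load, to_drop
\end{verbatim}}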
| |
| \subsection{Query Processing} |
| Data tables in Druid (called \emph{data sources}) are collections of |
| timestamped events partitioned into a set of segments, where each segment |
| is typically 5--10 million rows. Formally, we define a segment as a collection |
| of rows of data that span some period in time. Segments represent the |
| fundamental storage unit in Druid and replication and distribution are done at |
| a segment level. |
| |
| Druid segments are stored in a column orientation. Given that Druid is best |
| used for aggregating event streams (all data going into Druid must have a |
timestamp), the advantages of storing aggregate information as columns rather
than rows are well documented \cite{abadi2008column}. Column storage allows for more
| efficient CPU usage as only what is needed is actually loaded and scanned. |
| |
| Druid has multiple column types to represent various data formats. Depending on |
| the column type, different compression methods are used to reduce the cost of |
| storing a column in memory and on disk. For example, if an entire column only |
| contains string values, storing the raw strings is unnecessarily costly. |
| String columns can be dictionary encoded instead. Dictionary encoding is a |
| common method to compress data in column stores. |
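
As a minimal sketch of the idea (not Druid's actual implementation),
dictionary encoding replaces each string value with a small integer id and
stores the mapping once:
{\small\begin{verbatim}
# Minimal sketch of dictionary-encoding a column.
def dictionary_encode(column):
    dictionary = {}   # string value -> integer id
    encoded = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        encoded.append(dictionary[value])
    return dictionary, encoded

# ["San Francisco", "San Francisco",
#  "New York", "New York"] encodes to the
# dictionary {"San Francisco": 0, "New York": 1}
# and the integer column [0, 0, 1, 1].
\end{verbatim}}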
| |
| In many real world OLAP workflows, queries are issued for the aggregated |
| results of some set of metrics where some set of dimension specifications are |
| met. Consider Table~\ref{tab:sample_data}. An example query for this table may |
| ask: ``How much revenue was generated in the first hour of 2014-01-01 in the |
| city of San Francisco?". This query is filtering a sales data set based on a |
| Boolean expression of dimension values. In many real world data sets, dimension |
| columns contain strings and metric columns contain numbers. Druid creates |
| additional lookup indices for string columns such that only those rows that |
| pertain to a particular query filter are ever scanned. |
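
The example query above, expressed roughly in Druid's JSON query language
(shown here as a Python dictionary, with an illustrative data source name),
might look like the following:
{\small\begin{verbatim}
query = {
    "queryType": "timeseries",
    "dataSource": "sales",
    "intervals":
      ["2014-01-01T01:00:00Z/2014-01-01T02:00:00Z"],
    "granularity": "hour",
    "filter": {"type": "selector",
               "dimension": "city",
               "value": "San Francisco"},
    "aggregations": [{"type": "doubleSum",
                      "name": "revenue",
                      "fieldName": "revenue"}],
}
\end{verbatim}}
Over the sample data in Table~\ref{tab:sample_data}, this query would return a
revenue sum of 67 for the matching hour.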
| |
| \begin{table} |
| \centering |
| \begin{tabular}{| l | l | l |} |
| \hline |
| \textbf{Timestamp} & \textbf{City} & \textbf{Revenue} \\ \hline |
| 2014-01-01T01:00:00Z & San Francisco & 25 \\ \hline |
| 2014-01-01T01:00:00Z & San Francisco & 42 \\ \hline |
| 2014-01-01T02:00:00Z & New York & 17 \\ \hline |
| 2014-01-01T02:00:00Z & New York & 170 \\ \hline |
| \end{tabular} |
| \caption{Sample sales data set.} |
| \label{tab:sample_data} |
| \end{table} |
| |
| For each unique city in |
| Table~\ref{tab:sample_data}, we can form some representation |
| indicating in which table rows a particular city is seen. We can |
| store this information in a binary array where the array indices |
represent our rows. If a particular city is seen in a certain
row, that array index is marked as \texttt{1}. For example:
| {\small\begin{verbatim} |
| San Francisco -> rows [0, 1] -> [1][1][0][0] |
| New York -> rows [2, 3] -> [0][0][1][1] |
| \end{verbatim}} |
| |
| \texttt{San Francisco} is seen in rows \texttt{0} and \texttt{1}. This mapping of column values |
| to row indices forms an inverted index \cite{tomasic1993performance}. To know which |
| rows contain {\ttfamily San Francisco} or {\ttfamily New York}, we can \texttt{OR} together |
| the two arrays. |
| {\small\begin{verbatim} |
[1][1][0][0] OR [0][0][1][1] = [1][1][1][1]
| \end{verbatim}} |
| |
| This approach of performing Boolean operations on large bitmap sets is commonly |
| used in search engines. Druid compresses each bitmap index using the Concise |
| algorithm \cite{colantonio2010concise}. All Boolean operations on top of these |
| Concise sets are done without decompressing the set. |
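
For illustration only, the uncompressed form of this inverted index and the
\texttt{OR} above can be sketched as follows; Druid itself keeps these bitmaps
Concise-compressed and operates on them directly:
{\small\begin{verbatim}
# Uncompressed sketch of the inverted index
# and the OR above.
rows = ["San Francisco", "San Francisco",
        "New York", "New York"]

bitmaps = {}   # city -> bitmap over row indices
for i, city in enumerate(rows):
    bitmaps.setdefault(city,
                       [0] * len(rows))[i] = 1

sf = bitmaps["San Francisco"]  # [1, 1, 0, 0]
ny = bitmaps["New York"]       # [0, 0, 1, 1]
both = [a | b for a, b in zip(sf, ny)]
# both == [1, 1, 1, 1]
\end{verbatim}}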
| |
| \subsection{Query Capabilities} |
| Druid supports many types of aggregations including double sums, long sums, |
| minimums, maximums, and complex aggregations such as cardinality estimation and |
| approximate quantile estimation. The results of aggregations can be combined |
in mathematical expressions to form other aggregations. Druid supports
several query types, ranging from simple aggregates over a time interval to
groupBys and approximate top-K queries.
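
As an example of a non-trivial query type, an approximate top-K query (``the
five cities with the highest revenue over a day'') might be written roughly as
follows in Druid's JSON query language; the data source name is illustrative:
{\small\begin{verbatim}
# Approximate shape of a top-K (topN) query;
# the data source name is illustrative.
top_cities = {
    "queryType": "topN",
    "dataSource": "sales",
    "dimension": "city",
    "metric": "revenue",
    "threshold": 5,
    "granularity": "all",
    "intervals":
      ["2014-01-01T00:00:00Z/2014-01-02T00:00:00Z"],
    "aggregations": [{"type": "doubleSum",
                      "name": "revenue",
                      "fieldName": "revenue"}],
}
\end{verbatim}}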
| |
| \section{Performance} |
| Druid runs in production at several organizations, and to briefly demonstrate its |
performance, we have chosen to share some real-world numbers for the main production
| cluster running at Metamarkets in early 2014. For comparison with other databases |
| we also include results from synthetic workloads on TPC-H data. |
| |
| \subsection{Query Performance} |
| Query latencies are shown in Figure~\ref{fig:query_latency} for a cluster |
| hosting approximately 10.5TB of data using 1302 processing threads and 672 |
| total cores (hyperthreaded). There are approximately 50 billion rows of data in |
| this cluster. |
| |
| \begin{figure} |
| \centering |
| \includegraphics[width = 2.3in]{avg_query_latency} |
| \caption{Query latencies of production data sources.} |
| \label{fig:query_latency} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \includegraphics[width = 2.3in]{tpch_100gb} |
| \caption{Druid \& MySQL benchmarks -- 100GB TPC-H data.} |
| \label{fig:tpch_100gb} |
| \end{figure} |
| |
The average query load during this period was approximately 1,000 queries per
minute. The number of dimensions in the various data sources varies from 25 to 78,
and the number of metrics from 8 to 35. Across all the various data sources, average
| query latency is approximately 550 milliseconds, with 90\% of queries returning |
| in less than 1 second, 95\% in under 2 seconds, and 99\% of queries returning |
| in less than 10 seconds. |
| |
| Approximately 30\% of the queries are standard |
| aggregates involving different types of metrics and filters, 60\% of queries |
| are ordered group bys over one or more dimensions with aggregates, and 10\% of |
| queries are search queries and metadata retrieval queries. The number of |
| columns scanned in aggregate queries roughly follows an exponential |
| distribution. Queries involving a single column are very frequent, and queries |
| involving all columns are very rare. |
| |
We also present Druid benchmarks on TPC-H data in Figure~\ref{fig:tpch_100gb}.
| Most TPC-H queries do not directly apply to Druid, so we selected queries more |
| typical of Druid's workload to demonstrate query performance. As a comparison, |
we also provide the results of the same queries run on MySQL with the MyISAM
engine (InnoDB was slower in our experiments).
| |
We benchmarked Druid's scan rate at 53,539,211 rows/second/core for a
\texttt{select count(*)} equivalent query over a given time interval and
| 36,246,530 rows/second/core for a \texttt{select sum(float)} type query. |
| |
| \subsection{Data Ingestion Performance} |
| To showcase Druid's data ingestion latency, we selected several production |
| datasources of varying dimensions, metrics, and event volumes. Druid's data |
| ingestion latency is heavily dependent on the complexity of the data set being |
| ingested. The data complexity is determined by the number of dimensions in each |
| event, the number of metrics in each event, and the types of aggregations we |
| want to perform on those metrics. |
| |
| \begin{figure} |
| \centering |
| \includegraphics[width = 2.3in]{ingestion_rate} |
| \caption{Combined cluster ingestion rates.} |
| \label{fig:ingestion_rate} |
| \end{figure} |
| |
For the given datasources, the number of dimensions varies from 5 to 35, and the
number of metrics varies from 2 to 24. The peak ingestion rate we measured in
production was 22,914.43 events/second/core on a datasource with 30 dimensions
and 19 metrics.
| |
The latency measurements we presented are sufficient to address our stated
problems of interactivity. We would prefer less variability in these latencies,
which could be achieved by adding additional hardware, but we have not chosen
to do so because of cost concerns.
| |
| \section{Demonstration Details} |
| |
We would like to give an end-to-end demonstration of Druid, from setting up a
cluster and ingesting data to structuring queries and obtaining results. We would
| also like to showcase how to solve real-world data analysis problems with Druid |
| and demonstrate tools that can be built on top of it, including interactive |
| data visualizations, approximate algorithms, and machine-learning components. |
| We already use similar tools in production. |
| |
| \subsection{Setup} |
| |
| Users will be able to set up a local Druid cluster to better understand the |
| components and architecture of the system. Druid is designed to run on |
commodity hardware, and Druid nodes are simply Java processes that need to be
| started up. The local setup will allow users to ingest data from Twitter's |
| public API and query it. We will also provide users access to an AWS hosted |
| Druid cluster that contains several terabytes of Twitter data that we have been |
| collecting for over 2 years. There are over 3 billion tweets in this data set, |
| and new events are constantly being ingested. We will walk through a variety of |
| different queries to demonstrate Druid's arbitrary data-exploration |
| capabilities. |
| |
| Finally, we will teach users how to build a simple interactive dashboard on top |
| of Druid. The dashboard will use some of Druid's more powerful features such as |
| approximate algorithms for quickly determining the cardinality of sets, and |
| machine learning algorithms for scientific computing problems such as anomaly |
| detection. These use cases represent some of the more interesting problems we |
| use Druid for in production. |
| |
| \subsection{Goals} |
| |
We will not only walk users through solving real-world problems with Druid and
the different tools that have been built on top of it, but also answer
conference-specific questions such as what the trending tweets and topics at
VLDB are, what netizens in the surrounding area are conversing about, and even
the overall sentiment of VLDB-related tweets. Our goal is to clearly explain
why the architecture of Druid makes it well suited to certain types of queries,
and to convey the potential of the system as a real-time analytics platform.
| |
| %\end{document} % This is where a 'short' article might terminate |
| |
| % ensure same length columns on last page (might need two sub-sequent latex runs) |
| \balance |
| |
| %ACKNOWLEDGMENTS are optional |
| \section{Acknowledgments} |
| Druid could not have been built without the help of many great people in the |
| community. We want to thank everyone that has contributed to the Druid |
| codebase for their invaluable support. |
| |
| % The following two commands are all you need in the |
| % initial runs of your .tex file to |
| % produce the bibliography for the citations in your paper. |
| \bibliographystyle{abbrv} |
\bibliography{druid_demo} % druid_demo.bib is the name of the Bibliography in this case
| % You must have a proper ".bib" file |
| % and remember to run: |
| % latex bibtex latex latex |
| % to resolve all references |
| |
| \end{document} |