docs/architecture.rst - pinot - Git at Google

 ..
 .. Licensed to the Apache Software Foundation (ASF) under one
 .. or more contributor license agreements.  See the NOTICE file
 .. distributed with this work for additional information
 .. regarding copyright ownership.  The ASF licenses this file
 .. to you under the Apache License, Version 2.0 (the
 .. "License"); you may not use this file except in compliance
 .. with the License.  You may obtain a copy of the License at
 ..
 ..   http://www.apache.org/licenses/LICENSE-2.0
 ..
 .. Unless required by applicable law or agreed to in writing,
 .. software distributed under the License is distributed on an
 .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 .. KIND, either express or implied.  See the License for the
 .. specific language governing permissions and limitations
 .. under the License.
 ..

 Architecture
 ============

 .. figure:: pinot-architecture.png

    Pinot Architecture Overview

 Terminology
 -----------

 *Table*
     A table is a logical abstraction to refer to a collection of related data. It consists of columns and rows (documents).
 *Segment*
     Data in table is divided into (horizontal) shards referred to as segments.

 Pinot Components
 ----------------

 *Pinot Controller*
     Manages other pinot components (brokers, servers) as well as controls assignment of tables/segments to servers.
 *Pinot Server*
     Hosts one or more segments and serves queries from those segments
 *Pinot Broker*
     Accepts queries from clients and routes them to one or more servers, and returns consolidated response to the client.

 Pinot leverages `Apache Helix <http://helix.apache.org>`_ for cluster management.
 Helix is a cluster management framework to manage replicated, partitioned resources in a distributed system.
 Helix uses Zookeeper to store cluster state and metadata.

 Briefly, Helix divides nodes into three logical components based on their responsibilities:

 *Participant*
     The nodes that host distributed, partitioned resources
 *Spectator*
     The nodes that observe the current state of each Participant and use that information to access the resources.
     Spectators are notified of state changes in the cluster (state of a participant, or that of a partition in a participant).
 *Controller*
     The node that observes and controls the Participant nodes. It is responsible for coordinating all transitions
     in the cluster and ensuring that state constraints are satisfied while maintaining cluster stability

 Pinot Controller hosts Helix Controller, in addition to hosting REST APIs for Pinot cluster administration and data ingestion.
 There can be multiple instances of Pinot controller for redundancy. If there are multiple controllers, Pinot expects that all
 of them are configured with the same back-end storage system so that they have a common view of the segments (*e.g.* NFS).
 Pinot can use other storage systems such as HDFS or `ADLS <https://azure.microsoft.com/en-us/services/storage/data-lake-storage/>`_.

 Pinot Servers are modeled as Helix Participants, hosting Pinot tables (referred to as *resources* in helix terminology).
 Segments of a table are modeled as Helix partitions (of a resource). Thus, a Pinot server hosts one or more helix partitions of one
 or more helix resources (*i.e.* one or more segments of one or more tables).

 Pinot Brokers are modeled as Spectators. They need to know the location of each segment of a table (and each replica of the
 segments)
 and route requests to the
 appropriate server that hosts the segments of the table being queried. The broker ensures that all the rows of the table
 are queried exactly once so as to return correct, consistent results for a query. The brokers (or servers) may optimize
 to prune some of the segments as long as accuracy is not satisfied. In case of hybrid tables, the brokers ensure that
 the overlap between realtime and offline segment data is queried exactly once.
 Helix provides the framework by which spectators can learn the location (*i.e.* participant) in which each partition
 of a resource resides. The brokers use this mechanism to learn the servers that host specific segments of a table.

 Pinot Tables
 ------------

 Pinot supports realtime, or offline, or hybrid tables. Data in Pinot tables is contained in the segments
 belonging to that table. A Pinot table is modeled as a Helix resource.  Each segment of a table is modeled as a Helix Partition,

 Table Schema defines column names and their metadata. Table configuration and schema is stored in zookeeper.

 Offline tables ingest pre-built pinot-segments from external data stores, whereas Reatime tables
 ingest data from streams (such as Kafka) and build segments.

 A hybrid Pinot table essentially has both realtime as well as offline tables.
 In such a table, offline segments may be pushed periodically (say, once a day). The retention on the offline table
 can be set to a high value (say, a few years) since segments are coming in on a periodic basis, whereas the retention
 on the realtime part can be small (say, a few days). Once an offline segment is pushed to cover a recent time period,
 the brokers automatically switch to using the offline table for segments in _that_ time period, and use realtime table
 only to cover later segments for which offline data may not be available yet.

 Note that the query does not know the existence of offline or realtime tables. It only specifies the table name
 in the query.


 Ingesting Offline data
 ^^^^^^^^^^^^^^^^^^^^^^
 Segments for offline tables are constructed outside of Pinot, typically in Hadoop via map-reduce jobs
 and ingested into Pinot via REST API provided by the Controller.
 Pinot provides libraries to create Pinot segments out of input files in AVRO, JSON or CSV formats in a hadoop job, and push
 the constructed segments to the controlers via REST APIs.

 When an Offline segment is ingested, the controller looks up the table's configuration and assigns the segment
 to the servers that host the table. It may assign multiple servers for each servers depending on the number of replicas
 configured for that table.

 Pinot supports different segment assignment strategies that are optimized for various use cases.

 Once segments are assigned, Pinot servers get notified via Helix to "host" the segment. The servers download the segments
 (as a cached local copy to serve queries) and load them into local memory. All segment data is maintained in memory as long
 as the server hosts that segment.

 Once the server has loaded the segment, Helix notifies brokers of the availability of these segments. The brokers
 start include the new
 segments for queries. Brokers support different routing strategies depending on the type of table, the segment assignment
 strategy and the use case.

 Data in offline segments are immmutable (Rows cannot be added, deleted, or modified). However, segments may be replaced modified data.

 Ingesting Realtime Data
 ^^^^^^^^^^^^^^^^^^^^^^^
 Segments for realtime tables are constructed by Pinot servers. The servers ingest rows from realtime streams (such as
 Kafka) until
 some completion threshold (such as number of rows, or a time threshold) and build a segment out of those rows. Depending
 on the type of ingestion mechanism used (stream or partition level), segments may be locally stored in the servers
 or in the controller's segment store.

 Multiple servers may ingest the same data to increase availability and share query load.

 Once a realtime segment is built and loaded the servers continue
 to consume from where they left off.

 Realtime segments are immutable once they are completed. While realtime segments are being consumed they are mutable,
 in the sense that new rows can be added to them. Rows cannot be deleted from segments.


 See `Consuming and Indexing rows in Realtime <https://cwiki.apache.org/confluence/display/PINOT/Consuming+and+Indexing+rows+in+Realtime>`_ for details.


 Pinot Segments
 --------------

 A segment is laid out in a columnar format
 so that it can be directly mapped into memory for serving queries. Columns may be single or multi-valued. Column types may be
 STRING, INT, LONG, FLOAT, DOUBLE or BYTES. Columns may be declared to be metric or dimension (or specifically as a time dimension)
 in the schema.

 Pinot uses dictionary encoding to store values as a dictionary ID. Columns may be configured to be "no-dictionary" column in which
 case raw values are stored. Dictionary IDs are encoded using minimum number of bits for efficient storage (_e.g._ a column with cardinality
 of 3 will use only 3 bits for each dictionary ID).

 There is a forward index for each column compressed appropriately for efficient memory use.  In addition, optional inverted indices can be
 configured for any set of columns. Inverted indices, while taking up more storage, offer better query performance.

 Specialized indexes like StartTree index is also supported.
	..
	.. Licensed to the Apache Software Foundation (ASF) under one
	.. or more contributor license agreements. See the NOTICE file
	.. distributed with this work for additional information
	.. regarding copyright ownership. The ASF licenses this file
	.. to you under the Apache License, Version 2.0 (the
	.. "License"); you may not use this file except in compliance
	.. with the License. You may obtain a copy of the License at
	..
	.. http://www.apache.org/licenses/LICENSE-2.0
	..
	.. Unless required by applicable law or agreed to in writing,
	.. software distributed under the License is distributed on an
	.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	.. KIND, either express or implied. See the License for the
	.. specific language governing permissions and limitations
	.. under the License.
	..

	Architecture
	============

	.. figure:: pinot-architecture.png

	Pinot Architecture Overview

	Terminology
	-----------

	Table
	A table is a logical abstraction to refer to a collection of related data. It consists of columns and rows (documents).
	Segment
	Data in table is divided into (horizontal) shards referred to as segments.

	Pinot Components
	----------------

	Pinot Controller
	Manages other pinot components (brokers, servers) as well as controls assignment of tables/segments to servers.
	Pinot Server
	Hosts one or more segments and serves queries from those segments
	Pinot Broker
	Accepts queries from clients and routes them to one or more servers, and returns consolidated response to the client.

	Pinot leverages `Apache Helix <http://helix.apache.org>`_ for cluster management.
	Helix is a cluster management framework to manage replicated, partitioned resources in a distributed system.
	Helix uses Zookeeper to store cluster state and metadata.

	Briefly, Helix divides nodes into three logical components based on their responsibilities:

	Participant
	The nodes that host distributed, partitioned resources
	Spectator
	The nodes that observe the current state of each Participant and use that information to access the resources.
	Spectators are notified of state changes in the cluster (state of a participant, or that of a partition in a participant).
	Controller
	The node that observes and controls the Participant nodes. It is responsible for coordinating all transitions
	in the cluster and ensuring that state constraints are satisfied while maintaining cluster stability

	Pinot Controller hosts Helix Controller, in addition to hosting REST APIs for Pinot cluster administration and data ingestion.
	There can be multiple instances of Pinot controller for redundancy. If there are multiple controllers, Pinot expects that all
	of them are configured with the same back-end storage system so that they have a common view of the segments (e.g. NFS).
	Pinot can use other storage systems such as HDFS or `ADLS <https://azure.microsoft.com/en-us/services/storage/data-lake-storage/>`_.

	Pinot Servers are modeled as Helix Participants, hosting Pinot tables (referred to as resources in helix terminology).
	Segments of a table are modeled as Helix partitions (of a resource). Thus, a Pinot server hosts one or more helix partitions of one
	or more helix resources (i.e. one or more segments of one or more tables).

	Pinot Brokers are modeled as Spectators. They need to know the location of each segment of a table (and each replica of the
	segments)
	and route requests to the
	appropriate server that hosts the segments of the table being queried. The broker ensures that all the rows of the table
	are queried exactly once so as to return correct, consistent results for a query. The brokers (or servers) may optimize
	to prune some of the segments as long as accuracy is not satisfied. In case of hybrid tables, the brokers ensure that
	the overlap between realtime and offline segment data is queried exactly once.
	Helix provides the framework by which spectators can learn the location (i.e. participant) in which each partition
	of a resource resides. The brokers use this mechanism to learn the servers that host specific segments of a table.

	Pinot Tables
	------------

	Pinot supports realtime, or offline, or hybrid tables. Data in Pinot tables is contained in the segments
	belonging to that table. A Pinot table is modeled as a Helix resource. Each segment of a table is modeled as a Helix Partition,

	Table Schema defines column names and their metadata. Table configuration and schema is stored in zookeeper.

	Offline tables ingest pre-built pinot-segments from external data stores, whereas Reatime tables
	ingest data from streams (such as Kafka) and build segments.

	A hybrid Pinot table essentially has both realtime as well as offline tables.
	In such a table, offline segments may be pushed periodically (say, once a day). The retention on the offline table
	can be set to a high value (say, a few years) since segments are coming in on a periodic basis, whereas the retention
	on the realtime part can be small (say, a few days). Once an offline segment is pushed to cover a recent time period,
	the brokers automatically switch to using the offline table for segments in _that_ time period, and use realtime table
	only to cover later segments for which offline data may not be available yet.

	Note that the query does not know the existence of offline or realtime tables. It only specifies the table name
	in the query.


	Ingesting Offline data
	^^^^^^^^^^^^^^^^^^^^^^
	Segments for offline tables are constructed outside of Pinot, typically in Hadoop via map-reduce jobs
	and ingested into Pinot via REST API provided by the Controller.
	Pinot provides libraries to create Pinot segments out of input files in AVRO, JSON or CSV formats in a hadoop job, and push
	the constructed segments to the controlers via REST APIs.

	When an Offline segment is ingested, the controller looks up the table's configuration and assigns the segment
	to the servers that host the table. It may assign multiple servers for each servers depending on the number of replicas
	configured for that table.

	Pinot supports different segment assignment strategies that are optimized for various use cases.

	Once segments are assigned, Pinot servers get notified via Helix to "host" the segment. The servers download the segments
	(as a cached local copy to serve queries) and load them into local memory. All segment data is maintained in memory as long
	as the server hosts that segment.

	Once the server has loaded the segment, Helix notifies brokers of the availability of these segments. The brokers
	start include the new
	segments for queries. Brokers support different routing strategies depending on the type of table, the segment assignment
	strategy and the use case.

	Data in offline segments are immmutable (Rows cannot be added, deleted, or modified). However, segments may be replaced modified data.

	Ingesting Realtime Data
	^^^^^^^^^^^^^^^^^^^^^^^
	Segments for realtime tables are constructed by Pinot servers. The servers ingest rows from realtime streams (such as
	Kafka) until
	some completion threshold (such as number of rows, or a time threshold) and build a segment out of those rows. Depending
	on the type of ingestion mechanism used (stream or partition level), segments may be locally stored in the servers
	or in the controller's segment store.

	Multiple servers may ingest the same data to increase availability and share query load.

	Once a realtime segment is built and loaded the servers continue
	to consume from where they left off.

	Realtime segments are immutable once they are completed. While realtime segments are being consumed they are mutable,
	in the sense that new rows can be added to them. Rows cannot be deleted from segments.


	See `Consuming and Indexing rows in Realtime <https://cwiki.apache.org/confluence/display/PINOT/Consuming+and+Indexing+rows+in+Realtime>`_ for details.


	Pinot Segments
	--------------

	A segment is laid out in a columnar format
	so that it can be directly mapped into memory for serving queries. Columns may be single or multi-valued. Column types may be
	STRING, INT, LONG, FLOAT, DOUBLE or BYTES. Columns may be declared to be metric or dimension (or specifically as a time dimension)
	in the schema.

	Pinot uses dictionary encoding to store values as a dictionary ID. Columns may be configured to be "no-dictionary" column in which
	case raw values are stored. Dictionary IDs are encoded using minimum number of bits for efficient storage (_e.g._ a column with cardinality
	of 3 will use only 3 bits for each dictionary ID).

	There is a forward index for each column compressed appropriately for efficient memory use. In addition, optional inverted indices can be
	configured for any set of columns. Inverted indices, while taking up more storage, offer better query performance.

	Specialized indexes like StartTree index is also supported.