blob: e836f7b832d3057fc19869a9bbc912221289a5ef [file] [log] [blame]
..
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
..
.. http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
..
Architecture
============
.. figure:: pinot-architecture.png
Pinot Architecture Overview
Terminology
-----------
*Table*
A table is a logical abstraction to refer to a collection of related data. It consists of columns and rows (documents).
*Segment*
Data in table is divided into (horizontal) shards referred to as segments.
Pinot Components
----------------
*Pinot Controller*
Manages other pinot components (brokers, servers) as well as controls assignment of tables/segments to servers.
*Pinot Server*
Hosts one or more segments and serves queries from those segments
*Pinot Broker*
Accepts queries from clients and routes them to one or more servers, and returns consolidated response to the client.
Pinot leverages `Apache Helix <http://helix.apache.org>`_ for cluster management.
Helix is a cluster management framework to manage replicated, partitioned resources in a distributed system.
Helix uses Zookeeper to store cluster state and metadata.
Briefly, Helix divides nodes into three logical components based on their responsibilities:
*Participant*
The nodes that host distributed, partitioned resources
*Spectator*
The nodes that observe the current state of each Participant and use that information to access the resources.
Spectators are notified of state changes in the cluster (state of a participant, or that of a partition in a participant).
*Controller*
The node that observes and controls the Participant nodes. It is responsible for coordinating all transitions
in the cluster and ensuring that state constraints are satisfied while maintaining cluster stability
Pinot Controller hosts Helix Controller, in addition to hosting REST APIs for Pinot cluster administration and data ingestion.
There can be multiple instances of Pinot controller for redundancy. If there are multiple controllers, Pinot expects that all
of them are configured with the same back-end storage system so that they have a common view of the segments (*e.g.* NFS).
Pinot can use other storage systems such as HDFS or `ADLS <https://azure.microsoft.com/en-us/services/storage/data-lake-storage/>`_.
Pinot Servers are modeled as Helix Participants, hosting Pinot tables (referred to as *resources* in helix terminology).
Segments of a table are modeled as Helix partitions (of a resource). Thus, a Pinot server hosts one or more helix partitions of one
or more helix resources (*i.e.* one or more segments of one or more tables).
Pinot Brokers are modeled as Spectators. They need to know the location of each segment of a table (and each replica of the
segments)
and route requests to the
appropriate server that hosts the segments of the table being queried. The broker ensures that all the rows of the table
are queried exactly once so as to return correct, consistent results for a query. The brokers (or servers) may optimize
to prune some of the segments as long as accuracy is not satisfied. In case of hybrid tables, the brokers ensure that
the overlap between realtime and offline segment data is queried exactly once.
Helix provides the framework by which spectators can learn the location (*i.e.* participant) in which each partition
of a resource resides. The brokers use this mechanism to learn the servers that host specific segments of a table.
Pinot Tables
------------
Pinot supports realtime, or offline, or hybrid tables. Data in Pinot tables is contained in the segments
belonging to that table. A Pinot table is modeled as a Helix resource. Each segment of a table is modeled as a Helix Partition,
Table Schema defines column names and their metadata. Table configuration and schema is stored in zookeeper.
Offline tables ingest pre-built pinot-segments from external data stores, whereas Reatime tables
ingest data from streams (such as Kafka) and build segments.
A hybrid Pinot table essentially has both realtime as well as offline tables.
In such a table, offline segments may be pushed periodically (say, once a day). The retention on the offline table
can be set to a high value (say, a few years) since segments are coming in on a periodic basis, whereas the retention
on the realtime part can be small (say, a few days). Once an offline segment is pushed to cover a recent time period,
the brokers automatically switch to using the offline table for segments in _that_ time period, and use realtime table
only to cover later segments for which offline data may not be available yet.
Note that the query does not know the existence of offline or realtime tables. It only specifies the table name
in the query.
Ingesting Offline data
^^^^^^^^^^^^^^^^^^^^^^
Segments for offline tables are constructed outside of Pinot, typically in Hadoop via map-reduce jobs
and ingested into Pinot via REST API provided by the Controller.
Pinot provides libraries to create Pinot segments out of input files in AVRO, JSON or CSV formats in a hadoop job, and push
the constructed segments to the controlers via REST APIs.
When an Offline segment is ingested, the controller looks up the table's configuration and assigns the segment
to the servers that host the table. It may assign multiple servers for each servers depending on the number of replicas
configured for that table.
Pinot supports different segment assignment strategies that are optimized for various use cases.
Once segments are assigned, Pinot servers get notified via Helix to "host" the segment. The servers download the segments
(as a cached local copy to serve queries) and load them into local memory. All segment data is maintained in memory as long
as the server hosts that segment.
Once the server has loaded the segment, Helix notifies brokers of the availability of these segments. The brokers
start include the new
segments for queries. Brokers support different routing strategies depending on the type of table, the segment assignment
strategy and the use case.
Data in offline segments are immmutable (Rows cannot be added, deleted, or modified). However, segments may be replaced modified data.
Ingesting Realtime Data
^^^^^^^^^^^^^^^^^^^^^^^
Segments for realtime tables are constructed by Pinot servers. The servers ingest rows from realtime streams (such as
Kafka) until
some completion threshold (such as number of rows, or a time threshold) and build a segment out of those rows. Depending
on the type of ingestion mechanism used (stream or partition level), segments may be locally stored in the servers
or in the controller's segment store.
Multiple servers may ingest the same data to increase availability and share query load.
Once a realtime segment is built and loaded the servers continue
to consume from where they left off.
Realtime segments are immutable once they are completed. While realtime segments are being consumed they are mutable,
in the sense that new rows can be added to them. Rows cannot be deleted from segments.
See `Consuming and Indexing rows in Realtime <https://cwiki.apache.org/confluence/display/PINOT/Consuming+and+Indexing+rows+in+Realtime>`_ for details.
Pinot Segments
--------------
A segment is laid out in a columnar format
so that it can be directly mapped into memory for serving queries. Columns may be single or multi-valued. Column types may be
STRING, INT, LONG, FLOAT, DOUBLE or BYTES. Columns may be declared to be metric or dimension (or specifically as a time dimension)
in the schema.
Pinot uses dictionary encoding to store values as a dictionary ID. Columns may be configured to be "no-dictionary" column in which
case raw values are stored. Dictionary IDs are encoded using minimum number of bits for efficient storage (_e.g._ a column with cardinality
of 3 will use only 3 bits for each dictionary ID).
There is a forward index for each column compressed appropriately for efficient memory use. In addition, optional inverted indices can be
configured for any set of columns. Inverted indices, while taking up more storage, offer better query performance.
Specialized indexes like StartTree index is also supported.