Pinot is a real-time distributed OLAP datastore, used at LinkedIn to deliver scalable real-time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally.
These three presentations on Pinot give an overview of Pinot and how it is used at LinkedIn.
Pinot is well suited for analytical use cases on immutable append-only data that require low latency between an event being ingested and it being available to be queried.
Because of the design choices we made to achieve these goals, there are certain limitations in Pinot.
Pinot works very well for querying time-series data with many dimensions and metrics. For example, it can answer analytical queries over data such as profile views or ad campaign performance: who viewed this profile in the last week, or how many ads were clicked per campaign.
Before we get to the quick start, let's go over the terminology.
Pinot has the following roles/components: the Controller, which manages cluster state and coordinates the other components; the Broker, which accepts queries and routes them to servers; and the Server, which hosts data segments and executes queries against them.
Pinot leverages Apache Helix for cluster management.
More information on Pinot design and architecture can be found here.
You can either build Pinot manually or use Docker to run Pinot.
```
git clone https://github.com/linkedin/pinot.git
cd pinot
mvn install package -DskipTests
cd pinot-distribution/target/pinot-0.016-pkg
chmod +x bin/*.sh
```
We will load baseball stats from 1878 to 2013 into Pinot and run queries against them. The dataset contains 100,000 records and 15 columns.
Execute the quick-start-offline.sh script in the bin folder, which deploys Zookeeper, the controller, broker, and server, adds the baseballStats schema and table, builds an index segment, and pushes it to the controller.
If you have Docker, run `docker run -it -p 9000:9000 linkedin/pinot-quickstart-offline`. If you have built Pinot, run `bin/quick-start-offline.sh`.
We should see the following output:
```
Deployed Zookeeper
Deployed controller, broker and server
Added baseballStats schema
Creating baseballStats table
Built index segment for baseballStats
Pushing segments to the controller
```
At this point we can post queries. Here are some sample queries:
```sql
/* Total number of documents in the table */
select count(*) from baseballStats limit 0

/* Top 5 run scorers of all time */
select sum('runs') from baseballStats group by playerName top 5 limit 0

/* Top 5 run scorers of the year 2000 */
select sum('runs') from baseballStats where yearID=2000 group by playerName top 5 limit 0

/* Top 10 run scorers after 2000 */
select sum('runs') from baseballStats where yearID>=2000 group by playerName limit 0

/* Select playerName, runs, homeRuns for 10 records and order them by yearID */
select playerName, runs, homeRuns from baseballStats order by yearID limit 10
```
There are three ways to interact with Pinot: a simple web interface, a REST API, and a Java client. Open your browser, go to http://localhost:9000/query/, and run any of the queries above. See Pinot Query Syntax for more info.
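For programmatic access, here is a minimal sketch of a query issued through the Java client. It assumes the pinot-java-client module is on the classpath and that the quick-start broker listens on localhost:8099; the broker address is an assumption and may differ in your setup.

```java
import com.linkedin.pinot.client.Connection;
import com.linkedin.pinot.client.ConnectionFactory;
import com.linkedin.pinot.client.ResultSet;
import com.linkedin.pinot.client.ResultSetGroup;

public class QuickStartClient {
  public static void main(String[] args) {
    // Assumed broker address for the quick start; adjust to your deployment.
    Connection connection = ConnectionFactory.fromHostList("localhost:8099");

    // Run one of the sample queries from above.
    ResultSetGroup results =
        connection.execute("select count(*) from baseballStats limit 0");

    // Print every row of the first result set.
    ResultSet resultSet = results.getResultSet(0);
    for (int row = 0; row < resultSet.getRowCount(); row++) {
      System.out.println(resultSet.getString(row, 0));
    }
  }
}
```

The REST API accepts the same queries over HTTP, so any HTTP client can be used instead of the Java client.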
There are two ways to ingest data into Pinot: batch and realtime. The baseball stats example above demonstrated batch ingestion. Typically these batch jobs run on Hadoop periodically (e.g., every hour, day, week, or month), so data freshness depends on the job granularity.
Let's look at an example where we ingest data in realtime. We will subscribe to the meetup.com RSVP feed and index the RSVP events in real time. Execute the quick-start-realtime.sh script in the bin folder, which starts Kafka, creates the meetupRSVPEvents topic, starts the controller, server, and broker, adds the schema and table, and begins publishing the meetup data stream to Kafka.
If you have Docker, run `docker run -it -p 9000:9000 linkedin/pinot-quickstart-realtime`. If you have built Pinot, run `bin/quick-start-realtime.sh`.
We should see the following output:
```
Starting Kafka
Created topic "meetupRSVPEvents".
Starting controller, server and broker
Added schema and table
Realtime quick start setup complete
Starting meetup data stream and publishing to kafka
```
Open the Pinot Query Console at http://localhost:9000/query and run queries. Here are some sample queries:
```sql
/* Total number of documents in the table */
select count(*) from meetupRsvp limit 0

/* Top 10 cities with the most RSVPs */
select sum(rsvp_count) from meetupRsvp group by group_city top 10 limit 0

/* Show the 10 most recent RSVPs */
select * from meetupRsvp order by mtime limit 10

/* Show the top 10 RSVP'ed events */
select sum(rsvp_count) from meetupRsvp group by event_name top 10 limit 0
```
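Under the hood, the quick start bridges the meetup.com RSVP feed into Pinot by publishing each event as a JSON message to the meetupRSVPEvents Kafka topic. Here is a minimal sketch of such a producer using Kafka's Java producer API; the broker address and the exact event payload are assumptions (the field names are taken from the sample queries above, and the real feed carries more fields).

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RsvpProducerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Assumed Kafka broker address (the Kafka default); adjust to your setup.
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // Hypothetical event payload; field names come from the sample queries above.
    String event = "{\"event_name\":\"Tech Meetup\",\"group_city\":\"Sunnyvale\","
        + "\"rsvp_count\":1,\"mtime\":" + System.currentTimeMillis() + "}";

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Publish to the topic created by the realtime quick start.
      producer.send(new ProducerRecord<>("meetupRSVPEvents", event));
    }
  }
}
```

Once a message lands on the topic, the Pinot server consumes it and the row becomes queryable shortly afterwards.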
At LinkedIn, Pinot powers more than 50 applications, such as Who Viewed My Profile and Who Viewed My Jobs, with interactive-level response times. It ingests close to a billion events per day in real time and processes 100 million queries per day.
Please join or post questions to this group: https://groups.google.com/forum/#!forum/pinot_users