..
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
..
.. http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
..
.. warning:: The documentation is not up-to-date and has moved to `Apache Pinot Docs <https://docs.pinot.apache.org/>`_.
.. _getting-started:
Getting Started
===============
A quick way to get familiar with Pinot is to run the Pinot examples. The examples can be run either by compiling the
code or by running the prepackaged Docker images.
To demonstrate Pinot, let's start a simple one-node cluster, along with the required ZooKeeper instance. This demo setup also
creates a table, generates some Pinot segments, and then uploads them to the cluster to make them queryable.
All of the setup is automated, so the only thing required at the beginning is to start the demonstration cluster.
.. _compiling-code-section:
Compiling the code
~~~~~~~~~~~~~~~~~~
You can run the Pinot demonstration by checking out the code on GitHub, compiling it, and running it. Compiling
Pinot requires JDK 8 or later and Apache Maven 3.
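As a quick sanity check before building (the exact version strings vary by platform; any JDK 8+ and Maven 3.x will do):
.. code-block:: none
$ java -version
$ mvn -version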
#. Check out the code from GitHub (https://github.com/apache/incubator-pinot)
#. With Maven installed, run ``mvn install package -DskipTests -Pbin-dist`` in the directory in which you checked out Pinot.
#. Make the generated scripts executable:
.. code-block:: none
cd pinot-distribution/target/apache-pinot-incubating-<version>-SNAPSHOT-bin/apache-pinot-incubating-<version>-SNAPSHOT-bin
chmod +x bin/*.sh
Trying out the batch quickstart demo
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To run the demo with compiled code:
``bin/quick-start-batch.sh``
Once the Pinot cluster is running, you can query it by opening the query console at http://localhost:9000
You can also query Pinot through the REST API, as well as with the Java client. As these are outside the scope of this
introduction, the reference documentation for the Pinot client APIs is in the :doc:`client_api` section.
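For example, assuming the quickstart broker is listening on port 8000 (as in the ``PostQuery`` examples later in this guide), a PQL query can be posted to the broker's REST endpoint with ``curl`` (a minimal sketch of the broker query API):
.. code-block:: none
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"pql": "SELECT count(*) FROM baseballStats LIMIT 0"}' \
    http://localhost:8000/query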
Pinot uses PQL, a SQL-like query language, to query data. Here are some sample queries:
.. code-block:: sql
/*Total number of documents in the table*/
SELECT count(*) FROM baseballStats LIMIT 0
/*Top 5 run scorers of all time*/
SELECT sum('runs') FROM baseballStats GROUP BY playerName TOP 5 LIMIT 0
/*Top 5 run scorers of the year 2000*/
SELECT sum('runs') FROM baseballStats WHERE yearID=2000 GROUP BY playerName TOP 5 LIMIT 0
/*Top 10 run scorers after 2000*/
SELECT sum('runs') FROM baseballStats WHERE yearID>=2000 GROUP BY playerName TOP 10 LIMIT 0
/*Select playerName,runs,homeRuns for 10 records from the table and order them by yearID*/
SELECT playerName,runs,homeRuns FROM baseballStats ORDER BY yearID LIMIT 10
The full reference for the PQL query language is present in the :ref:`pql` section of the Pinot documentation.
Trying out the streaming quickstart demo
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Pinot can ingest data from streaming sources such as Kafka.
To run the demo with compiled code:
``bin/quick-start-streaming.sh``
Once started, the demo will start Kafka, create a Kafka topic, and create a realtime Pinot table. Once the table is
created, Pinot will start ingesting events from the Kafka topic into it. The demo also starts a consumer that reads
events from the Meetup API and pushes them into the newly created Kafka topic, so new Meetup RSVP events show up in
Pinot shortly after they occur.
.. role:: sql(code)
:language: sql
To watch new events appear, one can run :sql:`SELECT * FROM meetupRsvp ORDER BY mtime DESC LIMIT 50` repeatedly; it shows the
most recent events that were ingested by Pinot. A convenient way to do this is shown below.
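For example, from a second terminal in the distribution directory, the standard ``watch`` utility can re-run the query every five seconds (a small convenience sketch, assuming the broker listens on port 8000 as in the batch demo; a plain shell loop works just as well):
.. code-block:: none
$ watch -n 5 "bin/pinot-admin.sh PostQuery -brokerPort 8000 \
    -query 'SELECT * FROM meetupRsvp ORDER BY mtime DESC LIMIT 50'"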
Experimenting with Pinot
~~~~~~~~~~~~~~~~~~~~~~~~
Now we have a quickstart Pinot cluster running locally. Below are step-by-step instructions on
how to add a simple table to the Pinot system, upload a segment, and query it.
Suppose we have a transcript in CSV format containing students' basic info and their scores for each subject.
+------------+------------+-----------+-----------+-----------+-----------+
| studentID | firstName | lastName | gender | subject | score |
+============+============+===========+===========+===========+===========+
| 200 | Lucy | Smith | Female | Maths | 3.8 |
+------------+------------+-----------+-----------+-----------+-----------+
| 200 | Lucy | Smith | Female | English | 3.5 |
+------------+------------+-----------+-----------+-----------+-----------+
| 201 | Bob | King | Male | Maths | 3.2 |
+------------+------------+-----------+-----------+-----------+-----------+
| 202 | Nick | Young | Male | Physics | 3.6 |
+------------+------------+-----------+-----------+-----------+-----------+
Along with the CSV data file, we will need a separate CSV record reader config file in JSON format.
First, however, we will create a working directory called ``getting-started`` (in this example, it lives on ``Desktop``), with two additional directories inside it called ``data``
and ``config``.
We will also store the path of the working directory in a variable called ``WORKING_DIR``.
.. code-block:: none
$ mkdir getting-started
$ WORKING_DIR=/Users/host1/Desktop/getting-started
$ cd $WORKING_DIR
$ mkdir data
$ mkdir config
We will create the transcript CSV file in ``data``, and the CSV config file in ``config``.
.. code-block:: none
$ touch data/test.csv
$ touch config/csv-record-reader-config.json
The ``test.csv`` file should look like this, with no header line at the top:
.. code-block:: none
200,Lucy,Smith,Female,Maths,3.8
200,Lucy,Smith,Female,English,3.5
201,Bob,King,Male,Maths,3.2
202,Nick,Young,Male,Physics,3.6
Instead of using a header line, we will use the CSV config JSON file ``csv-record-reader-config.json`` to specify the header:
.. code-block:: json
{
  "header": "studentID,firstName,lastName,gender,subject,score",
  "fileFormat": "CSV"
}
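If your CSV layout differs from the defaults, the record reader config can carry a few more options. As an illustration only (check the ``CSVRecordReaderConfig`` reference for the options your build supports), a file using semicolons as the field separator could be described like this:
.. code-block:: json
{
  "header": "studentID,firstName,lastName,gender,subject,score",
  "fileFormat": "CSV",
  "delimiter": ";"
}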
In order to set up a table, we need to specify the schema of this transcript in ``transcript-schema.json``, which we will store in ``config``:
.. code-block:: none
$ touch config/transcript-schema.json
``transcript-schema.json`` should look like this:
.. code-block:: json
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {
      "name": "studentID",
      "dataType": "STRING"
    },
    {
      "name": "firstName",
      "dataType": "STRING"
    },
    {
      "name": "lastName",
      "dataType": "STRING"
    },
    {
      "name": "gender",
      "dataType": "STRING"
    },
    {
      "name": "subject",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "score",
      "dataType": "FLOAT"
    }
  ]
}
Then, we need to specify the table config in another JSON file (also stored in ``config``), which links the schema to the table:
.. code-block:: none
$ touch config/transcript-table-config.json
``transcript-table-config.json`` should look like this:
.. code-block:: json
{
  "tableName": "transcript",
  "segmentsConfig": {
    "replication": "1",
    "schemaName": "transcript",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": [],
    "loadMode": "HEAP",
    "lazyLoad": "false"
  },
  "tableType": "OFFLINE",
  "metadata": {}
}
To create the Pinot table, we can navigate to the directory under ``pinot-distribution`` that contains
``pinot-admin.sh``, and use the command below:
.. code-block:: none
$ ./pinot-admin.sh AddTable -schemaFile $WORKING_DIR/config/transcript-schema.json -tableConfigFile $WORKING_DIR/config/transcript-table-config.json -exec
Executing command: AddTable -tableConfigFile /Users/host1/Desktop/getting-started/config/transcript-table-config.json -schemaFile /Users/host1/Desktop/getting-started/config/transcript-schema.json -controllerHost [controller_host] -controllerPort 9000 -exec
{"status":"Table transcript_OFFLINE successfully added"}
At this point, the directory tree under ``getting-started`` should look like this:
.. code-block:: none
getting-started
|-- data
|   |-- test.csv
|-- config
|   |-- csv-record-reader-config.json
|   |-- transcript-schema.json
|   |-- transcript-table-config.json
In order to upload our data to the Pinot cluster, we need to convert our CSV file into a Pinot segment, which will be written to a new directory, ``$WORKING_DIR/test2``:
.. code-block:: none
$ ./pinot-admin.sh CreateSegment -dataDir $WORKING_DIR/data -format CSV -outDir $WORKING_DIR/test2 -tableName transcript -segmentName transcript_0 -overwrite -schemaFile $WORKING_DIR/config/transcript-schema.json -readerConfigFile $WORKING_DIR/config/csv-record-reader-config.json
Executing command: CreateSegment -generatorConfigFile null -dataDir /Users/host1/Desktop/getting-started/data -format CSV -outDir /Users/host1/Desktop/getting-started/test2 -overwrite true -tableName transcript -segmentName transcript_0 -timeColumnName null -schemaFile /Users/host1/Desktop/getting-started/config/transcript-schema.json -readerConfigFile /Users/host1/Desktop/getting-started/config/csv-record-reader-config.json -enableStarTreeIndex false -starTreeIndexSpecFile null -hllSize 9 -hllColumns null -hllSuffix _hll -numThreads 1
Accepted files: [file:/Users/host1/Desktop/getting-started/data/test.csv]
Finished building StatsCollector!
Collected stats for 4 documents
Created dictionary for STRING column: studentID with cardinality: 1, max length in bytes: 4, range: null to null
Created dictionary for STRING column: firstName with cardinality: 3, max length in bytes: 4, range: Bob to Nick
Created dictionary for STRING column: lastName with cardinality: 3, max length in bytes: 5, range: King to Young
Created dictionary for FLOAT column: score with cardinality: 4, range: 3.2 to 3.8
Created dictionary for STRING column: gender with cardinality: 2, max length in bytes: 6, range: Female to Male
Created dictionary for STRING column: subject with cardinality: 3, max length in bytes: 7, range: English to Physics
Start building IndexCreator!
Finished records indexing in IndexCreator!
Finished segment seal!
Converting segment: /Users/host1/Desktop/getting-started/test2/transcript_0_0 to v3 format
v3 segment location for segment: transcript_0_0 is /Users/host1/Desktop/getting-started/test2/transcript_0_0/v3
Deleting files in v1 segment directory: /Users/host1/Desktop/getting-started/test2/transcript_0_0
Driver, record read time : 1
Driver, stats collector time : 0
Driver, indexing time : 0
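Before uploading, we can take a look at what was generated; per the log above, the segment was converted to the v3 format under ``transcript_0_0/v3`` (the exact file names inside vary by version):
.. code-block:: none
$ ls $WORKING_DIR/test2/transcript_0_0/v3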
Once we have the Pinot segment, we can upload it to our cluster:
.. code-block:: none
$ ./pinot-admin.sh UploadSegment -segmentDir $WORKING_DIR/test2/
Executing command: UploadSegment -controllerHost [controller_host] -controllerPort 9000 -segmentDir /Users/host1/Desktop/getting-started/test2/
Compressing segment transcript_0_0
Uploading segment transcript_0_0.tar.gz
Sending request: http://[controller_host]:9000/v2/segments to controller: [controller_host], version: 0.2.0-SNAPSHOT-68092ab9eb83af173d725ec685c22ba4eb5bacf9
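To confirm the segment landed in the cluster, we can list the segments of the table through the controller (again assuming it runs on ``localhost:9000``):
.. code-block:: none
$ curl http://localhost:9000/tables/transcript/segments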
You did it! Now we can query the data in Pinot.
To get the total number of rows in the table:
.. code-block:: none
$ ./pinot-admin.sh PostQuery -brokerPort 8000 -query "select count(*) from transcript"
Executing command: PostQuery -brokerHost [controller_host] -brokerPort 8000 -query select count(*) from transcript
Result: {"aggregationResults":[{"function":"count_star","value":"4"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":4,"numEntriesScannedInFilter":0,"numEntriesScannedPostFilter":0,"numGroupsLimitReached":false,"totalDocs":4,"timeUsedMs":7,"segmentStatistics":[],"traceInfo":{}}
To get the average score for the subject Maths:
.. code-block:: none
$ ./pinot-admin.sh PostQuery -brokerPort 8000 -query "select avg(score) from transcript where subject = \"Maths\""
Executing command: PostQuery -brokerHost [controller_host] -brokerPort 8000 -query select avg(score) from transcript where subject = "Maths"
Result: {"aggregationResults":[{"function":"avg_score","value":"3.50000"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":2,"numEntriesScannedInFilter":4,"numEntriesScannedPostFilter":2,"numGroupsLimitReached":false,"totalDocs":4,"timeUsedMs":33,"segmentStatistics":[],"traceInfo":{}}
To get the average score for Lucy Smith:
.. code-block:: none
$ ./pinot-admin.sh PostQuery -brokerPort 8000 -query "select avg(score) from transcript where firstName = \"Lucy\" and lastName = \"Smith\""
Executing command: PostQuery -brokerHost [controller_host] -brokerPort 8000 -query select avg(score) from transcript where firstName = "Lucy" and lastName = "Smith"
Result: {"aggregationResults":[{"function":"avg_score","value":"3.65000"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":2,"numEntriesScannedInFilter":6,"numEntriesScannedPostFilter":2,"numGroupsLimitReached":false,"totalDocs":4,"timeUsedMs":67,"segmentStatistics":[],"traceInfo":{}}