blob: 1cc28399dff8434f2984a470c81af46c25675939 [file] [log] [blame] [view]
---
layout: doc_page
title: "Tutorial: Querying data"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Tutorial: Querying data
This tutorial will demonstrate how to query data in Druid, with examples for Druid's native query format and Druid SQL.
The tutorial assumes that you've already completed one of the 4 ingestion tutorials, as we will be querying the sample Wikipedia edits data.
* [Tutorial: Loading a file](../tutorials/tutorial-batch.html)
* [Tutorial: Loading stream data from Kafka](../tutorials/tutorial-kafka.html)
* [Tutorial: Loading a file using Hadoop](../tutorials/tutorial-batch-hadoop.html)
* [Tutorial: Loading stream data using Tranquility](../tutorials/tutorial-tranquility.html)
## Native JSON queries
Druid's native query format is expressed in JSON. We have included a sample native TopN query under `quickstart/tutorial/wikipedia-top-pages.json`:
```json
{
"queryType" : "topN",
"dataSource" : "wikipedia",
"intervals" : ["2015-09-12/2015-09-13"],
"granularity" : "all",
"dimension" : "page",
"metric" : "count",
"threshold" : 10,
"aggregations" : [
{
"type" : "count",
"name" : "count"
}
]
}
```
This query retrieves the 10 Wikipedia pages with the most page edits on 2015-09-12.
Let's submit this query to the Druid broker:
```bash
curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages.json http://localhost:8082/druid/v2?pretty
```
You should see the following query results:
```json
[ {
"timestamp" : "2015-09-12T00:46:58.771Z",
"result" : [ {
"count" : 33,
"page" : "Wikipedia:Vandalismusmeldung"
}, {
"count" : 28,
"page" : "User:Cyde/List of candidates for speedy deletion/Subpage"
}, {
"count" : 27,
"page" : "Jeremy Corbyn"
}, {
"count" : 21,
"page" : "Wikipedia:Administrators' noticeboard/Incidents"
}, {
"count" : 20,
"page" : "Flavia Pennetta"
}, {
"count" : 18,
"page" : "Total Drama Presents: The Ridonculous Race"
}, {
"count" : 18,
"page" : "User talk:Dudeperson176123"
}, {
"count" : 18,
"page" : "Wikipédia:Le Bistro/12 septembre 2015"
}, {
"count" : 17,
"page" : "Wikipedia:In the news/Candidates"
}, {
"count" : 17,
"page" : "Wikipedia:Requests for page protection"
} ]
} ]
```
## Druid SQL queries
Druid also supports a dialect of SQL for querying. Let's run a SQL query that is equivalent to the native JSON query shown above:
```
SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
```
The SQL queries are submitted as JSON over HTTP.
### TopN query example
The tutorial package includes an example file that contains the SQL query shown above at `quickstart/tutorial/wikipedia-top-pages-sql.json`. Let's submit that query to the Druid broker:
```bash
curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages-sql.json http://localhost:8082/druid/v2/sql
```
The following results should be returned:
```json
[
{
"page": "Wikipedia:Vandalismusmeldung",
"Edits": 33
},
{
"page": "User:Cyde/List of candidates for speedy deletion/Subpage",
"Edits": 28
},
{
"page": "Jeremy Corbyn",
"Edits": 27
},
{
"page": "Wikipedia:Administrators' noticeboard/Incidents",
"Edits": 21
},
{
"page": "Flavia Pennetta",
"Edits": 20
},
{
"page": "Total Drama Presents: The Ridonculous Race",
"Edits": 18
},
{
"page": "User talk:Dudeperson176123",
"Edits": 18
},
{
"page": "Wikipédia:Le Bistro/12 septembre 2015",
"Edits": 18
},
{
"page": "Wikipedia:In the news/Candidates",
"Edits": 17
},
{
"page": "Wikipedia:Requests for page protection",
"Edits": 17
}
]
```
### dsql client
For convenience, the Druid package includes a SQL command-line client, located at `bin/dsql` from the Druid package root.
Let's now run `bin/dsql`; you should see the following prompt:
```bash
Welcome to dsql, the command-line client for Druid SQL.
Type "\h" for help.
dsql>
```
To submit the query, paste it to the `dsql` prompt and press enter:
```bash
dsql> SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
┌──────────────────────────────────────────────────────────┬───────┐
│ page │ Edits │
├──────────────────────────────────────────────────────────┼───────┤
│ Wikipedia:Vandalismusmeldung │ 33 │
│ User:Cyde/List of candidates for speedy deletion/Subpage │ 28 │
│ Jeremy Corbyn │ 27 │
│ Wikipedia:Administrators' noticeboard/Incidents 21
Flavia Pennetta 20
Total Drama Presents: The Ridonculous Race 18
User talk:Dudeperson176123 18
Wikipédia:Le Bistro/12 septembre 2015 18
Wikipedia:In the news/Candidates 17
Wikipedia:Requests for page protection 17
└──────────────────────────────────────────────────────────┴───────┘
Retrieved 10 rows in 0.06s.
```
### Additional Druid SQL queries
#### Timeseries
`SELECT FLOOR(__time to HOUR) AS HourTime, SUM(deleted) AS LinesDeleted FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY FLOOR(__time to HOUR);`
```bash
dsql> SELECT FLOOR(__time to HOUR) AS HourTime, SUM(deleted) AS LinesDeleted FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY FLOOR(__time to HOUR);
┌──────────────────────────┬──────────────┐
HourTime LinesDeleted
├──────────────────────────┼──────────────┤
2015-09-12T00:00:00.000Z 1761
2015-09-12T01:00:00.000Z 16208
2015-09-12T02:00:00.000Z 14543
2015-09-12T03:00:00.000Z 13101
2015-09-12T04:00:00.000Z 12040
2015-09-12T05:00:00.000Z 6399
2015-09-12T06:00:00.000Z 9036
2015-09-12T07:00:00.000Z 11409
2015-09-12T08:00:00.000Z 11616
2015-09-12T09:00:00.000Z 17509
2015-09-12T10:00:00.000Z 19406
2015-09-12T11:00:00.000Z 16284
2015-09-12T12:00:00.000Z 18672
2015-09-12T13:00:00.000Z 30520
2015-09-12T14:00:00.000Z 18025
2015-09-12T15:00:00.000Z 26399
2015-09-12T16:00:00.000Z 24759
2015-09-12T17:00:00.000Z 19634
2015-09-12T18:00:00.000Z 17345
2015-09-12T19:00:00.000Z 19305
2015-09-12T20:00:00.000Z 22265
2015-09-12T21:00:00.000Z 16394
2015-09-12T22:00:00.000Z 16379
2015-09-12T23:00:00.000Z 15289
└──────────────────────────┴──────────────┘
Retrieved 24 rows in 0.08s.
```
#### GroupBy
`SELECT channel, SUM(added) FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY channel ORDER BY SUM(added) DESC LIMIT 5;`
```bash
dsql> SELECT channel, SUM(added) FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY channel ORDER BY SUM(added) DESC LIMIT 5;
┌───────────────┬─────────┐
channel EXPR$1
├───────────────┼─────────┤
#en.wikipedia │ 3045299 │
#it.wikipedia │ 711011 │
#fr.wikipedia │ 642555 │
#ru.wikipedia │ 640698 │
#es.wikipedia │ 634670 │
└───────────────┴─────────┘
Retrieved 5 rows in 0.05s.
```
#### Scan
` SELECT user, page FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 02:00:00' AND TIMESTAMP '2015-09-12 03:00:00' LIMIT 5;`
```bash
dsql> SELECT user, page FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 02:00:00' AND TIMESTAMP '2015-09-12 03:00:00' LIMIT 5;
┌────────────────────────┬────────────────────────────────────────────────────────┐
user page
├────────────────────────┼────────────────────────────────────────────────────────┤
Thiago89 Campeonato Mundial de Voleibol Femenino Sub-20 de 2015
91.34.200.249 Friede von Schönbrunn
TuHan-Bot Trĩ vàng
Lowercase sigmabot III User talk:ErrantX
BattyBot Hans W. Jung
└────────────────────────┴────────────────────────────────────────────────────────┘
Retrieved 5 rows in 0.04s.
```
#### EXPLAIN PLAN FOR
By prepending `EXPLAIN PLAN FOR ` to a Druid SQL query, it is possible to see what native Druid queries a SQL query will plan into.
Using the TopN query above as an example:
`EXPLAIN PLAN FOR SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;`
```bash
dsql> EXPLAIN PLAN FOR SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
PLAN
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
DruidQueryRel(query=[{"queryType":"topN","dataSource":{"type":"table","name":"wikipedia"},"virtualColumns":[],"dimension":{"type":"default","dimension":"page","outputName":"d0","outputType":"STRING"},"metric":{"type":"numeric","metric":"a0"},"threshold":10,"intervals":{"type":"intervals","intervals":["2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.001Z"]},"filter":null,"granularity":{"type":"all"},"aggregations":[{"type":"count","name":"a0"}],"postAggregations":[],"context":{},"descending":false}], signature=[{d0:STRING, a0:LONG}])
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Retrieved 1 row in 0.03s.
```
## Further reading
The [Queries documentation](../querying/querying.html) has more information on Druid's native JSON queries.
The [Druid SQL documentation](../querying/sql.html) has more information on using Druid SQL queries.