We've already written about pairing R with RDruid, but Python has powerful and free open-source analysis tools too. Collectively, these are often referred to as the SciPy Stack. To pair SciPy's analytic power with the advantages of querying time-series data in Druid, we created the pydruid connector. This allows Python users to query Druid—and export the results to useful formats—in a way that makes sense to them.
pydruid should run with Python 2.x, and is known to run with Python 2.7.5.
Install pydruid in the same way as you'd install any other Python module on your system. The simplest way is:
pip install pydruid
You should also install Pandas to execute the simple examples below:
pip install pandas
When you import pydruid, it will try to load Pandas as well.
Download and unpack Druid. If you are not familiar with Druid, see this introductory tutorial.
From the Druid home directory, start the Druid Realtime node:
$DRUID_HOME/run_example_server.sh
When prompted, choose the “wikipedia” example. After the Druid realtime node is done starting up, messages should appear that start with the following:
2014-04-03 18:01:32,852 INFO [wikipedia-incremental-persist] ...
These messages confirm that the realtime node is ingesting data from the Wikipedia edit stream, and that data can be queried.
Let's say we want to see the top few languages for Wikipedia articles, in terms of number of edits. This is the query we could post directly to Druid:
{
  "queryType": "topN",
  "dataSource": "wikipedia",
  "dimension": "language",
  "threshold": 4,
  "metric": "edit_count",
  "granularity": "all",
  "filter": {
    "type": "selector",
    "dimension": "namespace",
    "value": "article"
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "edit_count",
      "fieldName": "count"
    }
  ],
  "intervals": ["2013-06-01T00:00/2020-01-01T00"]
}
The results should appear similar to the following:
[
  {
    "timestamp": "2014-04-03T17:59:00.000Z",
    "result": [
      { "language": "en", "edit_count": 4726 },
      { "language": "fr", "edit_count": 1273 },
      { "language": "de", "edit_count": 857 },
      { "language": "ja", "edit_count": 176 }
    ]
  }
]
NOTE: Due to limitations in the way the wikipedia example is set up, you may see a limited number of results appear.
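Since a direct Druid query returns plain JSON like the response above, the top languages can be pulled out with nothing but the standard library. Here is a minimal sketch, assuming a response shaped like the sample (the numbers are from that sample run):

```python
import json

# A topN response shaped like the example above (numbers from the sample run).
raw = '''
[ { "timestamp": "2014-04-03T17:59:00.000Z",
    "result": [ { "language": "en", "edit_count": 4726 },
                { "language": "fr", "edit_count": 1273 },
                { "language": "de", "edit_count": 857 },
                { "language": "ja", "edit_count": 176 } ] } ]
'''

response = json.loads(raw)
# topN returns one entry per granularity bucket; with "granularity": "all"
# there is exactly one bucket, whose "result" list is already sorted by the metric.
rows = response[0]['result']
top_language = rows[0]['language']
total_edits = sum(row['edit_count'] for row in rows)
print(top_language, total_edits)  # en 7032
```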
Here's that same query in Python:
from pydruid.client import *

query = PyDruid('http://localhost:8083', 'druid/v2/')

top_langs = query.topn(
    datasource="wikipedia",
    granularity="all",
    intervals="2013-06-01T00:00/2020-01-01T00",
    dimension="language",
    filter=Dimension("namespace") == "article",
    aggregations={"edit_count": longsum("count")},
    metric="edit_count",
    threshold=4
)

print top_langs  # Do this if you want to see the raw JSON
Let's break this query down:
- The query object is instantiated with the location of the Druid realtime node, and exposes various querying methods, including topn.
- The granularity could be, for example, minute or hour. We want to see the sum count across the entire interval, and so we choose all.
- The filter restricts results to edits where the namespace is article (therefore excluding edits to Wikipedia pages that aren't articles).
- The aggregation sums the count metric into a metric named edit_count.
- See the pydruid documentation for more information about queries.
Now that Druid is returning data, we'll pass that data to a Pandas dataframe, which allows us to analyze and visualize it:
from pydruid.client import *
from pylab import plt  # Need to have matplotlib installed

query = PyDruid('http://localhost:8083', 'druid/v2/')

top_langs = query.topn(
    datasource="wikipedia",
    granularity="all",
    intervals="2013-06-01T00:00/2020-01-01T00",
    dimension="language",
    filter=Dimension("namespace") == "article",
    aggregations={"edit_count": longsum("count")},
    metric="edit_count",
    threshold=4
)

print top_langs  # Do this if you want to see the raw JSON

df = query.export_pandas()  # Client will import Pandas, no need to do so separately.
df = df.drop('timestamp', axis=1)  # Don't need the timestamp column here
df.index = range(1, len(df)+1)  # Get a naturally numbered index
print df

df.plot(x='language', kind='bar')
plt.show()
Printing the results gives:
   edit_count language
1         834       en
2         256       de
3         185       fr
4          38       ja
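Once the results are in a DataFrame, all of Pandas' usual tools apply. Here is a small sketch that rebuilds a frame like the one printed above (the numbers come from that sample run, and Pandas is assumed to be installed) and derives each language's share of the edits:

```python
import pandas as pd

# Rebuild the frame printed above (numbers from that sample run).
df = pd.DataFrame({'edit_count': [834, 256, 185, 38],
                   'language': ['en', 'de', 'fr', 'ja']},
                  index=range(1, 5))

# Derive the share of total edits each language represents.
df['share'] = df['edit_count'] / float(df['edit_count'].sum())
print(df)
```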
The bar graph will look something like this:
If you were to repeat the query, you should see larger numbers under edit_count, since the Druid realtime node is continuing to ingest data from Wikipedia.
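Because the realtime node keeps ingesting, two snapshots of the same query taken some time apart can be diffed to estimate how quickly each language is accumulating edits. A sketch using plain dicts (the counts here are hypothetical):

```python
# Two hypothetical snapshots of the topN result, taken a minute apart.
earlier = {'en': 834, 'de': 256, 'fr': 185, 'ja': 38}
later = {'en': 901, 'de': 270, 'fr': 199, 'ja': 41}

# New edits per language between the two snapshots; a language absent from
# the earlier snapshot counts from zero.
delta = {lang: later[lang] - earlier.get(lang, 0) for lang in later}
print(sorted(delta.items(), key=lambda kv: -kv[1]))
```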
In this blog post, we showed how you can run ad-hoc queries against a data set that is being streamed into Druid. And while this is only a small example of pydruid and the power of Python, it serves as an effective introductory demonstration of the benefits of pairing Druid's ability to make data available in real time with SciPy's powerful analytics tools.