The HTTP storage plugin lets you query APIs over HTTP/REST. The plugin expects JSON responses.
The HTTP plugin is new in Drill 1.18 and is an Alpha feature. It works well, and we encourage you to use it and provide feedback. However, we reserve the right to change the plugin based on that feedback.
To configure the plugin, create a new storage plugin, and add the following configuration options which apply to ALL connections defined in this plugin:
{ "type": "http", "cacheResults": true, "connections": {}, "timeout": 0, "proxyHost": null, "proxyPort": 0, "proxyType": null, "proxyUsername": null, "proxyPassword": null, "enabled": true }
The required options are:
type
: This should be http.
cacheResults
: Enable caching of the HTTP responses. Defaults to false.
timeout
: Sets the response timeout in seconds. Defaults to 0, which is no timeout.
connections
: This field contains the details for individual connections. See the section Configuring API Connections for details.
You can configure Drill to work behind a corporate proxy. Details are listed below.
The HTTP storage plugin allows you to configure multiple APIs which you can query directly from this plugin. To do so, first add a connections parameter to the configuration. Next, give the connection a name, which will be used in queries. For instance, stockAPI or jira.
The connection property can accept the following options.
url
: The base URL which Drill will query.
Many APIs require parameters to be passed directly in the URL instead of as query arguments. For example, GitHub's API allows you to query an organization's repositories with the following URL: https://github.com/orgs/{org}/repos
As of Drill 1.20.0, you can simply set the URL in the connection using the curly braces. If your API includes URL parameters, you must include them in the WHERE clause in your query, or specify a default value in the configuration.
For example, to use the API above, you would query it as shown below:
SELECT * FROM api.github WHERE org = 'apache'
This query would replace the org in the URL with the value from the WHERE clause, in this case apache. You can specify a default value as follows: https://someapi.com/{param1}/{param2=default}. In this case, the default is used if and only if the query does not supply a value for the parameter.
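The substitution rules above can be sketched in a few lines of Python. This is an illustrative sketch of the described behavior, not Drill's actual implementation, and the function name is hypothetical:

```python
import re

def substitute_url_params(url_template, filters):
    """Replace {param} and {param=default} placeholders in a URL template
    with values taken from equality filters in the WHERE clause."""
    def replace(match):
        name, _, default = match.group(1).partition("=")
        if name in filters:
            return str(filters[name])  # the query value overrides any default
        if default:
            return default             # fall back to the configured default
        raise ValueError(f"no value supplied for URL parameter '{name}'")
    return re.sub(r"\{([^{}]+)\}", replace, url_template)

print(substitute_url_params("https://github.com/orgs/{org}/repos", {"org": "apache"}))
# https://github.com/orgs/apache/repos
print(substitute_url_params("https://someapi.com/{param1}/{param2=default}", {"param1": "a"}))
# https://someapi.com/a/default
```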
Note that if the query supplies multiple values for the same URL parameter, as in WHERE org='apache' OR org='linux', those parameters cannot be pushed down in the current state.
requireTail
: Set to true if the query must contain an additional part of the service URL as a table name, false if the URL needs no additional suffix other than that provided by WHERE clause filters. (See below.)
If your service requires parameters, you have three choices. Suppose your connection is called sunrise. First, you can include the parameters directly in your URL if they are fixed for a given service:
url: "https://api.sunrise-sunset.org/json?lat=36.7201600&lng=-4.4203400&date=2019-10-02", requireTail: false
Query your table like this:
SELECT * FROM api.sunrise;
Second, you can specify the base URL here and the full URL in your query. Use this form if the parameters define a table-like concept (the set of data to return):
url: "https://api.sunrise-sunset.org/json", requireTail: true
SQL query:
SELECT * FROM api.sunrise.`?lat=36.7201600&lng=-4.4203400&date=2019-10-02`
If the URL requires a tail, specify it as if it were a table name. (See example below.) Drill directly appends the “tail” to the base URL to create the final URL.
Third, you can use the params field below to specify the parameters as filters if the parameters specify which data sets to return:
url: "https://api.sunrise-sunset.org/json", requireTail: false, params: ["lat", "lng", "date"]
SQL query:
SELECT * FROM api.sunrise WHERE `lat` = 36.7201600 AND `lng` = -4.4203400 AND `date` = '2019-10-02'
In this case, Drill appends the parameters to the URL, adding a question mark to separate the two.
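The URL assembly described here can be sketched as follows. This is a hypothetical Python sketch of the described behavior, not Drill's implementation; it assumes parameters are emitted in configuration order (see the parameter-ordering note later in this document):

```python
from urllib.parse import urlencode

def build_url(base_url, param_order, filters):
    """Append pushed-down filter values to the base URL, adding a question
    mark to separate the two. Parameters appear in configuration order."""
    pairs = [(name, filters[name]) for name in param_order if name in filters]
    return base_url + "?" + urlencode(pairs) if pairs else base_url

print(build_url(
    "https://api.sunrise-sunset.org/json",
    ["lat", "lng", "date"],                # order from the plugin config
    {"date": "2019-10-02", "lat": "36.7201600", "lng": "-4.4203400"},
))
# https://api.sunrise-sunset.org/json?lat=36.7201600&lng=-4.4203400&date=2019-10-02
```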
method
: The request method. Must be GET or POST. Other methods are not allowed and will default to GET.
postBody
: Contains data, in the form of key/value pairs, which are sent during a POST request. The post body should be in the form of a block of text with key/value pairs:
postBody: "key1=value1 key2=value2"
postBodyLocation
: If the API uses the POST method, you can send parameters in several different ways:
query_string
: Parameters from the query are pushed down to the query string. Static parameters are pushed to the post body.
post_body
: Both static parameters and parameters from the query are pushed to the post body as key/value pairs.
json_body
: Both static parameters and parameters from the query are pushed to the post body as JSON.
headers
: Often APIs will require custom headers as part of the authentication. This field allows you to define key/value pairs which are submitted with the HTTP request. The format is:
headers: { "key1": "Value1", "key2": "Value2" }
params
: Allows you to map SQL WHERE clause conditions to query parameters.
url: "https://api.sunrise-sunset.org/json", params: ["lat", "lng", "date"]
SQL query:
SELECT * FROM api.sunrise WHERE `lat` = 36.7201600 AND `lng` = -4.4203400 AND `date` = '2019-10-02'
HTTP parameters are untyped; Drill converts any value you provide into a string. Drill allows you to use any data type which can convert unambiguously to a string: BIT, INT, BIGINT, FLOAT4, FLOAT8, VARDECIMAL, VARCHAR. The BIT type is translated to true and false. Note that none of the date or interval types are allowed: each of those requires formatting.
Note the need for back-tick quotes around the names; date is a reserved word. Notice also that the date is a string because of the formatting limitation mentioned above.
Only equality conditions can be translated to parameters. The above filters are translated to:
lat=36.7201600&lng=-4.4203400&date=2019-10-02
If your query contains other conditions (!=, <, etc.) then those conditions are applied in Drill after the REST service returns the data.
Only fields listed in the params config field will become parameters; all other expressions are handled within Drill as explained above.
At present, Drill requires the values to be literals (constants). Drill does not currently allow expressions. That is, the following will not become an HTTP parameter:
WHERE `lat` = 36 + 0.7201600
Drill will add parameters to the URL in the order listed in the config. Use this feature if the API is strict about parameter ordering.
At present Drill does not enforce that parameters are provided in the query: Drill assumes parameters are optional.
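Putting these pushdown rules together, the decision of which WHERE conditions become HTTP parameters can be sketched as follows. This is an illustrative Python sketch; the tuple representation of conditions is an assumption for the example, not Drill's internal form:

```python
def partition_filters(conditions, params):
    """Split WHERE-clause conditions into those pushed down as HTTP
    parameters and those Drill applies after the response returns.
    Each condition is a (field, operator, value, is_literal) tuple."""
    pushed, retained = {}, []
    for field, op, value, is_literal in conditions:
        # Only equality conditions on configured params with literal
        # (constant) values become HTTP parameters.
        if op == "=" and field in params and is_literal:
            pushed[field] = value
        else:
            retained.append((field, op, value, is_literal))
    return pushed, retained

conditions = [
    ("lat", "=", "36.7201600", True),
    ("date", "=", "today", True),
    ("day_length", ">", "9:00:00", True),  # non-equality: applied in Drill
]
pushed, retained = partition_filters(conditions, ["lat", "lng", "date"])
print(pushed)    # {'lat': '36.7201600', 'date': 'today'}
print(retained)  # [('day_length', '>', '9:00:00', True)]
```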
REST responses often have structure beyond the data you want to query. For example:
"results": { "sunrise":"7:27:02 AM", "sunset":"5:05:55 PM", "solar_noon":"12:16:28 PM", "day_length":"9:38:53", "civil_twilight_begin":"6:58:14 AM", "civil_twilight_end":"5:34:43 PM", "nautical_twilight_begin":"6:25:47 AM", "nautical_twilight_end":"6:07:10 PM", "astronomical_twilight_begin":"5:54:14 AM", "astronomical_twilight_end":"6:38:43 PM" }, "status":"OK" }
Drill can handle JSON structures such as the above; you can use SQL to obtain the results you want. However, the SQL will be simpler if we skip over the portions we don't want and simply read the results fields as our SQL fields. We do that with the dataPath configuration:
dataPath: "results"
The dataPath can contain any number of fields, for example: "response/content/data". Drill will ignore all JSON content outside of the data path.
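A sketch of how a slash-separated dataPath might be applied to a parsed response (illustrative only, not Drill's implementation):

```python
def apply_data_path(response, data_path):
    """Walk a slash-separated dataPath (e.g. "response/content/data")
    into a parsed JSON response, ignoring content outside the path."""
    data = response
    for field in data_path.split("/"):
        data = data[field]
    return data

response = {"results": {"sunrise": "7:27:02 AM", "sunset": "5:05:55 PM"},
            "status": "OK"}
print(apply_data_path(response, "results"))
# {'sunrise': '7:27:02 AM', 'sunset': '5:05:55 PM'}
```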
At present, there is no provision to check the status code in a response such as that shown above. Drill assumes that the server will use HTTP status codes to indicate a bad request or other error.
The REST plugin accepts three different types of input: json, csv and xml. The default is json. If you are using XML as a data type, there is an additional configuration option called xmlDataLevel which reduces the level of unneeded nesting found in XML files. You can find more information in the documentation for Drill's XML format plugin.
Read the documentation for configuring json options, including schema provisioning.
authType
: If your API requires authentication, specify the authentication type. Defaults to none. If the authType is set to basic, username and password must be set in the configuration as well.
username
: The username for basic authentication.
password
: The password for basic authentication.
If you have an HTTP plugin with multiple endpoints that all use the same credentials, you can set the authType to basic and set global credentials in the storage plugin configuration.
Simply add the following to the storage plugin configuration:
"credentialsProvider": { "credentialsProviderType": "PlainCredentialsProvider", "credentials": { "username": "user1", "password": "user1Pass" } }
Note that the authType still must be set to basic and that any endpoint credentials will override the global credentials.
Some APIs support a query parameter which is used to limit the number of results returned by the API. In this case you can set the limitQueryParam config variable to the query parameter name and Drill will automatically include this in your query. For instance, suppose you have an API which supports a limit query parameter called maxRecords and you set the above-mentioned config variable accordingly. If you then execute the following query:
SELECT <fields> FROM api.limitedApi LIMIT 10
Drill will send the following request to your API:
https://<api>?maxRecords=10
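The limit rewriting can be sketched as below; the function name and the example.com URL are hypothetical, and the sketch simply appends the configured parameter name with the query's LIMIT value:

```python
def apply_limit(url, limit_query_param, limit):
    """Rewrite the request URL so the API itself limits the result
    count, using the configured limit parameter name."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}{limit_query_param}={limit}"

print(apply_limit("https://example.com/api", "maxRecords", 10))
# https://example.com/api?maxRecords=10
```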
If the API which you are querying requires OAuth2.0 for authentication read the documentation for configuring Drill to use OAuth2.0.
If you want to use automatic pagination in Drill, click here to read the documentation for pagination.
errorOn400
: When a user makes HTTP calls, the response code will be from 100-599. 400-series error codes can contain useful information, and in some cases you would not want Drill to throw errors on them. This option allows you to define Drill's behavior on 400-series error codes. When set to true, Drill will throw an exception and halt execution on 400-series errors; when set to false, Drill will return an empty result set (with implicit fields populated).
verifySSLCert
: Default is true. When set to false, Drill will trust all SSL certificates. Useful for debugging or on internal corporate networks using self-signed certificates or private certificate authorities.
caseSensitiveFilters
: Some APIs are case sensitive with the fields which are pushed down. If the endpoint that you are working with is in fact case sensitive, simply set this option to true. Defaults to false.
This plugin is different from other plugins in that the table component of the FROM clause is different. In normal Drill queries, the FROM clause is constructed as follows:
FROM <storage plugin>.<schema>.<table>
For example, you might have:
FROM dfs.test.`somefile.csv` -- or FROM mongo.stats.sales_data
In the HTTP/REST plugin, the FROM clause enables you to pass arguments to your REST call if you set the requireTail property to true. The structure is:
FROM <plugin>.<connection>.<arguments>
-- Actual example:
FROM http.sunrise.`/json?lat=36.7201600&lng=-4.4203400&date=today`
Or, as explained above, you can have the URL act like a table and pass parameters using WHERE clause "filter" conditions.
Some users access HTTP services from behind a proxy firewall. Drill provides three ways to specify proxy configuration.
Drill recognizes the usual Linux proxy environment variables:
http_proxy, HTTP_PROXY
https_proxy, HTTPS_PROXY
all_proxy, ALL_PROXY
This technique works well if your system is already configured to handle proxies.
You can also specify proxy configuration in the drill-override.conf file. See drill-override-example.conf for a template. Use the boot configuration if the proxy is an attribute of your network environment. Doing so will ensure every Drillbit and every HTTP/HTTPS request uses the same proxy configuration.
First, you can use the same form of URL you would use with the environment variables:
drill.exec.net_proxy.http_url: "http://foo.com/1234"
There is one setting for HTTP, another for HTTPS.
Alternatively, you can specify each field separately:
drill.exec.net_proxy.http: {
  type: "none", # none, http, socks. Blank same as none.
  host: "",
  port: 80,
  user_name: "",
  password: ""
},
The valid proxy types are none, http and socks. Blank is the same as none.
Again, there is a parallel section for HTTPS.
The final way to configure proxy is in the HTTP storage plugin itself. The proxy applies to all connections defined in that plugin. Use this approach if the proxy applies only to some external services, or if each service has a different proxy (defined by creating a separate plugin config for each service.)
proxy_type: "direct", proxy_host: "", proxy_port: 80, proxy_user_name: "", proxy_password: ""
The valid proxy types are direct, http or socks. Blank is the same as direct.
The API sunrise-sunset.org returns data in the following format:
"results": { "sunrise":"7:27:02 AM", "sunset":"5:05:55 PM", "solar_noon":"12:16:28 PM", "day_length":"9:38:53", "civil_twilight_begin":"6:58:14 AM", "civil_twilight_end":"5:34:43 PM", "nautical_twilight_begin":"6:25:47 AM", "nautical_twilight_end":"6:07:10 PM", "astronomical_twilight_begin":"5:54:14 AM", "astronomical_twilight_end":"6:38:43 PM" }, "status":"OK" }
To query this API, set the configuration as follows:
{ "type": "http", "cacheResults": false, "enabled": true, "timeout": 5, "connections": { "sunrise": { "url": "https://api.sunrise-sunset.org/json", "requireTail": true, "method": "GET", "headers": null, "authType": "none", "userName": null, "password": null, "postBody": null, "inputType": "json", "errorOn400": true } }
Then, to execute a query:
SELECT api_results.results.sunrise AS sunrise,
       api_results.results.sunset AS sunset
FROM http.sunrise.`?lat=36.7201600&lng=-4.4203400&date=today` AS api_results;
Which yields the following results:
+------------+------------+
|  sunrise   |   sunset   |
+------------+------------+
| 7:17:46 AM | 5:01:33 PM |
+------------+------------+
1 row selected (0.632 seconds)
We can improve the above configuration to use WHERE clause filters and a dataPath to skip over the unwanted parts of the message body. Set the configuration as follows:
{ "type": "http", "cacheResults": false, "enabled": true, "timeout": 5, "connections": { "sunrise": { "url": "https://api.sunrise-sunset.org/json", "requireTail": false, "method": "GET", "dataPath": "results", "headers": null, "params": [ "lat", "lng", "date" ], "authType": "none", "userName": null, "password": null, "postBody": null, "errorOn400": true } }
Then, to execute a query:
SELECT sunrise, sunset FROM http.sunrise WHERE `lat` = 36.7201600 AND `lng` = -4.4203400 AND `date` = 'today'
Which yields the same results as before.
JIRA Cloud has a REST API which is documented here.
To connect Drill to JIRA Cloud, use the following configuration:
{ "type": "http", "cacheResults": false, "timeout": 5, "connections": { "jira": { "url": "https://<project>.atlassian.net/rest/api/3/", "method": "GET", "dataPath": "issues", "headers": { "Accept": "application/json" }, "authType": "basic", "userName": "<username>", "password": "<API Key>", "postBody": null } }, "enabled": true }
Once you've configured Drill to query the API, you can easily access any of your data in JIRA. The JIRA API returns highly nested data; however, with a little preparation, it is straightforward to transform it into a more useful table. For instance, consider the query below:
SELECT key,
       t.fields.issueType.name AS issueType,
       SUBSTR(t.fields.created, 1, 10) AS created,
       SUBSTR(t.fields.updated, 1, 10) AS updated,
       t.fields.assignee.displayName AS assignee,
       t.fields.creator.displayName AS creator,
       t.fields.summary AS summary,
       t.fields.status.name AS currentStatus,
       t.fields.priority.name AS priority,
       t.fields.labels AS labels,
       t.fields.subtasks AS subtasks
FROM http.jira.`search?jql=project%20%3D%20<project>&&maxResults=100` AS t
The query below counts the number of issues by priority:
SELECT t.fields.priority.name AS priority,
       COUNT(*) AS issue_count
FROM http.jira.`search?jql=project%20%3D%20<project>&&maxResults=100` AS t
GROUP BY priority
ORDER BY issue_count DESC
1. At this time, the plugin does not support any authentication other than basic authentication.
2. This plugin only reads JSON, CSV and XML responses.
3. POST bodies can only be in the format of key/value pairs. Some APIs accept JSON based POST bodies but this is not currently supported.
When using dataPath, the returned message should be a single JSON object. The field pointed to by the dataPath should contain a single JSON object or an array of objects.
When not using dataPath, the response should be a single JSON object, an array of JSON objects, or a series of line-delimited JSON objects (the so-called jsonlines format).
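The three accepted response shapes can be normalized to a list of row objects as sketched below (illustrative Python, not Drill's implementation):

```python
import json

def rows_from_response(body):
    """Normalize a JSON response body into a list of row objects:
    a single object, an array of objects, or line-delimited JSON
    (the jsonlines format) are all accepted."""
    text = body.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to jsonlines: one JSON object per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return parsed if isinstance(parsed, list) else [parsed]

print(rows_from_response('{"a": 1}'))              # [{'a': 1}]
print(rows_from_response('[{"a": 1}, {"a": 2}]'))  # [{'a': 1}, {'a': 2}]
print(rows_from_response('{"a": 1}\n{"a": 2}'))    # [{'a': 1}, {'a': 2}]
```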
Parameters are considered optional; no error will be given if a query omits parameters. An enhancement would be to mark parameters as required: all are required or just some. If parameters are required, but omitted, the report service will likely return an error.
If anything goes wrong, Drill will provide a detailed error message, including URL:
DATA_READ ERROR: Failed to read the HTTP response body

Error message: Read timed out
Connection: sunrise
Plugin: api
URL: https://api.sunrise-sunset.org/json?lat=36.7201600&lng=-4.4203400&date=today
Fragment: 0:0
If using a “tail” in the query, verify that the tail is quoted using back-ticks as shown in the examples.
Check that the URL is correct. If not, check the plugin configuration properties described above to understand how the pieces were assembled, and adjust them so the URL comes out as you want.
If the query works but delivers unexpected results, check the Drill log file. Drill logs a message like the following at the info level when opening the HTTP connection:
Connection: sunrise, Method: GET, URL: https://api.sunrise-sunset.org/json?lat=36.7201600&lng=-4.4203400&date=today
If the query runs, but produces odd results, try a simple SELECT * query. This may reveal if there is unexpected message content in addition to the data. Use the dataPath property to ignore the extra content.
The HTTP plugin includes four implicit fields which can be used for debugging. These fields do not appear in star queries. They are:
_response_code
: The response code from the HTTP request. This field is an INT.
_response_message
: The response message.
_response_protocol
: The response protocol.
_response_url
: The actual URL sent to the API.
There are some situations where a user might want to join data with an API result and the pushdowns prevent that from happening. The main situation where this happens is when an API has parameters which are part of the URL AND these parameters are dynamically populated via a join.
In this case, there are two functions, http_request and http_get, which you can use to facilitate these joins.
http_request('<storage_plugin_name>', <params>)
: This function accepts a storage plugin as input and an optional list of parameters to include in a URL.
http_get(<url>, <params>)
: This function works in the same way except that it does not pull any configuration information from existing storage plugins. The input url for the http_get function must be a valid URL.
Let's say that you have a storage plugin called github with an endpoint called repos which points to the url: https://github.com/orgs/{org}/repos. It is easy enough to write a query like this:
SELECT * FROM github.repos WHERE org='apache'
However, if you had a file with organizations and wanted to join this with the API, the query would fail. Using the functions listed above you could get this data as follows:
SELECT http_request('github.repos', `org`) FROM dfs.`some_data.csvh`
or
SELECT http_get('https://github.com/orgs/{org}/repos', `org`) FROM dfs.`some_data.csvh`
** WARNING: This functionality will execute an HTTP Request FOR EVERY ROW IN YOUR DATA. Use with caution. **