blob: 35105554fd1a1718dc55a407bb412add7a3c3759 [file] [log] [blame] [view]
# Generic REST API Storage Plugin
The HTTP storage plugin lets you query APIs over HTTP/REST. The plugin
expects JSON responses.
The HTTP plugin is new in Drill 1.18 and is an Alpha feature. It works well, and we
encourage you to use it and provide feedback. However, we reserve the right to change
the plugin based on that feedback.
## Configuration
To configure the plugin, create a new storage plugin, and add the following configuration options which apply to ALL connections defined in this plugin:
```json
{
"type": "http",
"cacheResults": true,
"connections": {},
"timeout": 0,
"proxyHost": null,
"proxyPort": 0,
"proxyType": null,
"proxyUsername": null,
"proxyPassword": null,
"enabled": true
}
```
The required options are:
* `type`: This should be `http`
* `cacheResults`: Enable caching of the HTTP responses. Defaults to `false`
* `timeout`: Sets the response timeout in seconds. Defaults to `0` which is no timeout.
* `connections`: This field contains the details for individual connections. See the section *Configuring API Connections for Details*.
You can configure Drill to work behind a corporate proxy. Details are listed below.
### Configuring the API Connections
The HTTP Storage plugin allows you to configure multiple APIS which you can query directly from this plugin. To do so, first add a `connections` parameter to the configuration
. Next give the connection a name, which will be used in queries. For instance `stockAPI` or `jira`.
The `connection` property can accept the following options.
#### URL
`url`: The base URL which Drill will query.
##### Parameters in the URL
Many APIs require parameters to be passed directly in the URL instead of as query arguments. For example, github's API allows you to query an organization's repositories with the following
URL: https://github.com/orgs/{org}/repos
As of Drill 1.20.0, you can simply set the URL in the connection using the curly braces. If your API includes URL parameters you must include them in the `WHERE` clause in your
query, or specify a default value in the configuration.
As an example, the API above, you would have to query as shown below:
```sql
SELECT *
FROM api.github
WHERE org = 'apache'
```
This query would replace the `org`in the URL with the value from the `WHERE` clause, in this case `apache`. You can specify a default value as follows: `https://someapi.com/
{param1}/{param2=default}`. In this case, the default would be used if and only if there isn't a parameter supplied in the query.
#### Limitations on URL Parameters
* Drill does not support boolean expressions of URL parameters in queries. For instance, for the above example, if you were to include `WHERE org='apache' OR org='linux'`,
these parameters could not be pushed down in the current state.
* All URL parameter clauses must be equality only.
### Passing Parameters in the Query
`requireTail`: Set to `true` if the query must contain an additional part of the service
URL as a table name, `false` if the URL needs no additional suffix other than that
provided by `WHERE` clause filters. (See below.)
If your service requires parameters, you have three choices. Suppose your connection is called
`sunrise`. First, can include them directly in your URL if the parameters a fixed for a given
service:
```json
url: "https://api.sunrise-sunset.org/json?lat=36.7201600&lng=-4.4203400&date=2019-10-02",
requireTail: false
```
Query your table like this:
```sql
SELECT * FROM api.sunrise;
```
Second, you can specify the base URL here and the full URL in your query. Use this form if the
parameters define a table-like concept (the set of data to return):
```json
url: "https://api.sunrise-sunset.org/json",
requireTail: true
```
SQL query:
```sql
SELECT * FROM api.sunrise.`?lat=36.7201600&lng=-4.4203400&date=2019-10-02`
```
If the URL requires a tail, specify it as if it were a table name. (See example
below.) Drill directly appends the "tail" to the base URL to create the final URL.
Third, you can use the `params` field below to specify the parameters as filters
if the parameters specify which data sets to return:
```json
url: "https://api.sunrise-sunset.org/json",
requireTail: false,
params: ["lat", "lng", "date"]
```
SQL query:
```sql
SELECT * FROM api.sunrise
WHERE `lat` = 36.7201600 AND `lng` = -4.4203400 AND `date` = '2019-10-02'
```
In this case, Drill appends the parameters to the URL, adding a question mark
to separate the two.
#### Method
`method`: The request method. Must be `GET` or `POST`. Other methods are not allowed and will default to `GET`.
`postBody`: Contains data, in the form of key value pairs, which are sent during a `POST` request.
The post body should be in the of a block of text with key/value pairs:
```json
postBody: "key1=value1
key2=value2"
```
#### Headers
`headers`: Often APIs will require custom headers as part of the authentication. This field allows
you to define key/value pairs which are submitted with the http request. The format is:
```json
headers: {
"key1": "Value1",
"key2": "Value2"
}
```
#### Query Parmeters as Filters
* `params`: Allows you to map SQL `WHERE` clause conditions to query parameters.
```json
url: "https://api.sunrise-sunset.org/json",
params: ["lat", "lng", "date"]
```
SQL query:
```sql
SELECT * FROM api.sunrise
WHERE `lat` = 36.7201600 AND `lng` = -4.4203400 AND `date` = '2019-10-02'
```
HTTP parameters are untyped; Drill converts any value you provide into a string.
Drill allows you to use any data type which can convert unambiguously to a string:
`BIT`, `INT`, `BIGINT`, `FLOAT4`, `FLOAT8`, `VARDECIMAL`, `VARCHAR`. The `BIT` type
is translated to `true` and `false`. Note that none of the date or interval types
are allowed: each of those requires formatting.
Note the need for back-tick quotes around the names;
`date` is a reserved word. Notice also that the date is a string because of
the formatting limitation mentioned above.
Only equality conditions can be translated to parameters. The above filters are
translated to:
```
lat=36.7201600&lng=-4.4203400&date=2019-10-02
```
If your query contains other conditions (`!=`, `<`, etc.) then those conditions are applied
in Drill after the REST service returns the data.
Only fields listed in the `params` config filed will become parameters, all other
expressions are handled within Drill as explained above.
At present, Drill requires the values to be literals (constants). Drill does not
currently allow expressions. That is, the following will not become an HTTP parameter:
```sql
WHERE `lat` = 36 + 0.7201600
```
Drill will add parameters to the URL in the order listed in the config. Use this
feature if the API is strict about parameter ordering.
At present Drill does not enforce that parameters are provided in the query: Drill
assumes parameters are optional.
### Data Path
REST responses often have structure beyond the data you want to query. For example:
```json
"results":
{
"sunrise":"7:27:02 AM",
"sunset":"5:05:55 PM",
"solar_noon":"12:16:28 PM",
"day_length":"9:38:53",
"civil_twilight_begin":"6:58:14 AM",
"civil_twilight_end":"5:34:43 PM",
"nautical_twilight_begin":"6:25:47 AM",
"nautical_twilight_end":"6:07:10 PM",
"astronomical_twilight_begin":"5:54:14 AM",
"astronomical_twilight_end":"6:38:43 PM"
},
"status":"OK"
}
```
Drill can handle JSON structures such as the above; you can use SQL to obtain the
results you want. However, the SQL will be simpler if we skip over the portions we
don't want and simply read the `results` fields as our SQL fields. We do that with
the `dataPath` configuration:
```json
dataPath: "results"
```
The `dataPath` can contain any number of fields, for example: `"response/content/data"`.
Drill will ignore all JSON content outside of the data path.
At present, there is no provision to check the `status` code in a response such
as that shown above. Drill assumes that the server will uses HTTP status codes to
indicate a bad request or other error.
#### Input Type
The REST plugin accepts three different types of input: `json`, `csv` and `xml`. The default is `json`. If you are using `XML` as a data type, there is an additional
configuration option called `xmlDataLevel` which reduces the level of unneeded nesting found in XML files. You can find more information in the documentation for Drill's XML
format plugin.
#### JSON Configuration
Drill has a collection of JSON configuration options to allow you to configure how Drill interprets JSON files. These are set at the global level, however the HTTP plugin
allows you to configure these options individually per connection and override the Drill defaults. The options are:
* `allowNanInf`: Configures the connection to interpret `NaN` and `Inf` values
* `allTextMode`: By default, Drill attempts to infer data types from JSON data. If the data is malformed, Drill may throw schema change exceptions. If your data is
inconsistent, you can enable `allTextMode` which when true, Drill will read all JSON values as strings, rather than try to infer the data type.
* `readNumbersAsDouble`: By default Drill will attempt to interpret integers, floating point number types and strings. One challenge is when data is consistent, Drill may
throw schema change exceptions. In addition to `allTextMode`, you can make Drill less sensitive by setting the `readNumbersAsDouble` to `true` which causes Drill to read all
numeric fields in JSON data as `double` data type rather than trying to distinguish between ints and doubles.
* `enableEscapeAnyChar`: Allows a user to escape any character with a \
All of these can be set by adding the `jsonOptions` to your connection configuration as shown below:
```json
"jsonOptions": {
"allTextMode": true,
"readNumbersAsDouble": true
}
```
#### Authorization
`authType`: If your API requires authentication, specify the authentication
type. Defaults to `none`.
If the `authType` is set to `basic`, `username` and `password` must be set in the configuration as well.
`username`: The username for basic authentication.
`password`: The password for basic authentication.
#### Limiting Results
Some APIs support a query parameter which is used to limit the number of results returned by the API. In this case you can set the `limitQueryParam` config variable to the query parameter name and Drill will automatically include this in your query. For instance, if you have an API which supports a limit query parameter called `maxRecords` and you set the abovementioned config variable then execute the following query:
```sql
SELECT <fields>
FROM api.limitedApi
LIMIT 10
```
Drill will send the following request to your API:
```
https://<api>?maxRecords=10
```
### OAuth2.0
If the API which you are querying requires OAuth2.0 for authentication [read the documentation for configuring Drill to use OAuth2.0](OAuth.md).
### Pagination
If you want to use automatic pagination in Drill, [click here to read the documentation for pagination](Pagination.md).
#### errorOn400
When a user makes HTTP calls, the response code will be from 100-599. 400 series error codes can contain useful information and in some cases you would not want Drill to throw
errors on 400 series errors. This option allows you to define Drill's behavior on 400 series error codes. When set to `true`, Drill will throw an exception and halt execution
on 400 series errors, `false` will return an empty result set (with implicit fields populated).
#### verifySSLCert
Default is `true`, but when set to false, Drill will trust all SSL certificates. Useful for debugging or on internal corporate networks using self-signed certificates or
private certificate authorities.
#### caseSensitiveFilters
Some APIs are case sensitive with the fields which are pushed down. If the endpoint that you are working with is in fact case sensitive, simply set this to `true`. Defaults to `false`.
## Usage
This plugin is different from other plugins in that it the table component of the `FROM` clause
is different. In normal Drill queries, the `FROM` clause is constructed as follows:
```sql
FROM <storage plugin>.<schema>.<table>
```
For example, you might have:
```sql
FROM dfs.test.`somefile.csv`
-- or
FROM mongo.stats.sales_data
```
The HTTP/REST plugin the `FROM` clause enables you to pass arguments to your REST call if you
set the `requireTail` property to `true`. The structure is:
```sql
FROM <plugin>.<connection>.<arguments>
--Actual example:
FROM http.sunrise.`/json?lat=36.7201600&lng=-4.4203400&date=today`
```
Or, as explained above, you can have the URL act like a table and pass parameters
using a `WHERE` clause "filter" conditions.
## Proxy Setup
Some users access HTTP services from behind a proxy firewall. Drill provides three ways specify proxy
configuration.
### Proxy Environment Variables
Drill recognizes the usual Linux proxy environment variables:
* `http_proxy`, `HTTP_PROXY`
* `https_proxy`, `HTTP_PROXY`
* `all_proxy`, `ALL_PROXY`
This technique works well if your system is already configured to
handle proxies.
### Boot Configuration
You can also specify proxy configuration in the `drill-override.conf` file.
See `drill-override-example.conf` for a template. Use the boot configuration
is an attribute of your network environment. Doing so will ensure every
Drillbit and every HTTP/HTTPS request uses the same proxy configuration.
First, you can use the same form of URL you would use with the environment
variables:
```
drill.exec.net_proxy.http_url: "http://foo.com/1234"
```
There is one setting for HTTP, another for HTTPS.
Alternatively, you can specify each field separately:
```
drill.exec.net_proxy.http: {
type: "none", # none, http, socks. Blank same as none.
host: "",
port: 80,
user_name: "",
password: ""
},
```
The valid proxy types are `none`, `http` and `socks`. Blank is the same
as `none`.
Again, there is a parallel section for HTTPS.
### In the HTTP Storage Plugin Config
The final way to configure proxy is in the HTTP storage plugin itself. The proxy
applies to all connections defined in that plugin. Use this approach if the proxy
applies only to some external services, or if each service has a different proxy
(defined by creating a separate plugin config for each service.)
```json
proxy_type: "direct",
proxy_host: "",
proxy_port: 80,
proxy_user_name: "",
proxy_password: ""
```
The valid proxy types are `direct`, `http` or `socks`. Blank is the same
as `direct`.
## Examples
### Example 1: Reference Data, A Sunrise/Sunset API
The API sunrise-sunset.org returns data in the following format:
```json
"results":
{
"sunrise":"7:27:02 AM",
"sunset":"5:05:55 PM",
"solar_noon":"12:16:28 PM",
"day_length":"9:38:53",
"civil_twilight_begin":"6:58:14 AM",
"civil_twilight_end":"5:34:43 PM",
"nautical_twilight_begin":"6:25:47 AM",
"nautical_twilight_end":"6:07:10 PM",
"astronomical_twilight_begin":"5:54:14 AM",
"astronomical_twilight_end":"6:38:43 PM"
},
"status":"OK"
}
```
To query this API, set the configuration as follows:
```json
{
"type": "http",
"cacheResults": false,
"enabled": true,
"timeout": 5,
"connections": {
"sunrise": {
"url": "https://api.sunrise-sunset.org/json",
"requireTail": true,
"method": "GET",
"headers": null,
"authType": "none",
"userName": null,
"password": null,
"postBody": null,
"inputType": "json",
"errorOn400": true
}
}
```
Then, to execute a query:
```sql
SELECT api_results.results.sunrise AS sunrise,
api_results.results.sunset AS sunset
FROM http.sunrise.`?lat=36.7201600&lng=-4.4203400&date=today` AS api_results;
```
Which yields the following results:
```
+------------+------------+
| sunrise | sunset |
+------------+------------+
| 7:17:46 AM | 5:01:33 PM |
+------------+------------+
1 row selected (0.632 seconds)
```
#### Using Parameters
We can improvide the above configuration to use `WHERE` clause filters and
a `dataPath` to skip over the unwanted parts of the message
body. Set the configuration as follows:
```json
{
"type": "http",
"cacheResults": false,
"enabled": true,
"timeout": 5,
"connections": {
"sunrise": {
"url": "https://api.sunrise-sunset.org/json",
"requireTail": false,
"method": "GET",
"dataPath": "results",
"headers": null,
"params": [ "lat", "lng", "date" ],
"authType": "none",
"userName": null,
"password": null,
"postBody": null,
"errorOn400": true
}
}
```
Then, to execute a query:
```sql
SELECT sunrise, sunset
FROM http.sunrise
WHERE `lat` = 36.7201600 AND `lng` = -4.4203400 AND `date` = 'today'
```
Which yields the same results as before.
### Example 2: JIRA
JIRA Cloud has a REST API which is
[documented here](https://developer.atlassian.com/cloud/jira/platform/rest/v3/?utm_source=%2Fcloud%2Fjira%2Fplatform%2Frest%2F&utm_medium=302).
To connect Drill to JIRA Cloud, use the following configuration:
```json
{
"type": "http",
"cacheResults": false,
"timeout": 5,
"connections": {
"jira": {
"url": "https://<project>.atlassian.net/rest/api/3/",
"method": "GET",
"dataPath": "issues",
"headers": {
"Accept": "application/json"
},
"authType": "basic",
"userName": "<username>",
"password": "<API Key>",
"postBody": null
}
},
"enabled": true
}
```
Once you've configured Drill to query the API, you can now easily access any of your data in JIRA.
The JIRA API returns highly nested data, however with a little preparation, it
is pretty straightforward to transform it into a more useful table. For instance, the
query below:
```sql
SELECT key,
t.fields.issueType.name AS issueType,
SUBSTR(t.fields.created, 1, 10) AS created,
SUBSTR(t.fields.updated, 1, 10) AS updated,
t.fields.assignee.displayName as assignee,
t.fields.creator.displayName as creator,
t.fields.summary AS summary,
t.fields.status.name AS currentStatus,
t.fields.priority.name AS priority,
t.fields.labels AS labels,
t.fields.subtasks AS subtasks
FROM http.jira.`search?jql=project%20%3D%20<project>&&maxResults=100 AS t`
```
The query below counts the number of issues by priority:
```sql
SELECT t.fields.priority.name AS priority,
COUNT(*) AS issue_count
FROM http.jira.`search?jql=project%20%3D%20<project>&&maxResults=100` AS t
GROUP BY priority
ORDER BY issue_count DESC
```
<img src="images/issue_count.png" alt="Issue Count by Priority"/>
## Limitations
1. The plugin is supposed to follow redirects, however if you are using authentication,
you may encounter errors or empty responses if you are counting on the endpoint for
redirection.
~~2. At this time, the plugin does not support any authentication other than basic authentication.~~
3. This plugin does not implement join filter pushdowns (only constant pushdowns are
supported). Join pushdown has the potential to improve performance if you use the HTTP service
joined to another table.
~~4. This plugin only reads JSON and CSV responses.~~
5. `POST` bodies can only be in the format of key/value pairs. Some APIs accept
JSON based `POST` bodies but this is not currently supported.
6. When using `dataPath`, the returned message should a single JSON object. The field
pointed to by the `dataPath` should contain a single JSON object or an array of objects.
7. When not using `dataPath`, the response should be a single JSON object, an array of
JSON objects, or a series of line-delimited JSON objects (the so-called
[jsonlines](http://jsonlines.org/) format.)
8. Parameters are considered optional; no error will be given if a query omits
parameters. An enhancement would be to mark parameters as required: all are required
or just some. If parameters are required, but omitted, the report service will
likely return an error.
## Troubleshooting
If anything goes wrong, Drill will provide a detailed error message, including URL:
```
DATA_READ ERROR: Failed to read the HTTP response body
Error message: Read timed out
Connection: sunrise
Plugin: api
URL: https://api.sunrise-sunset.org/json?lat=36.7201600&lng=-4.4203400&date=today
Fragment: 0:0
```
If using a "tail" in the query, verify that the tail is quoted using back-ticks
as shown in the examples.
Check that the URL is correct. If not, check the plugin configuration properties
described above to find out why the pieces were assembled as you want.
If the query works but delivers unexpected results, check the Drill log file.
Drill logs a message like the following at the info level when opening the HTTP connection:
```
Connection: sunrise, Method: GET,
URL: https://api.sunrise-sunset.org/json?lat=36.7201600&lng=-4.4203400&date=today
```
If the query runs, but produces odd results, try a simple `SELECT *` query. This may reveal
if there is unexpected message context in addition to the data. Use the `dataPath` property
to ignore the extra content.
## Implicit Fields
The HTTP plugin includes four implicit fields which can be used for debugging. These fields do not appear in star queries. They are:
* `_response_code`: The response code from the HTTP request. This field is an `INT`.
* `_response_message`: The response message.
* `_response_protocol`: The response protocol.
* `_response_url`: The actual URL sent to the API.