---
layout: doc_page
title: "Ingestion Spec"
---

# Ingestion Spec

A Druid ingestion spec consists of 3 components:

```json
{
  "dataSchema" : {...},
  "ioConfig" : {...},
  "tuningConfig" : {...}
}
```
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|dataSchema|JSON Object|Specifies the schema of the incoming data. All ingestion specs can share the same dataSchema.|yes|
|ioConfig|JSON Object|Specifies where the data is coming from and where the data is going. This object will vary with the ingestion method.|yes|
|tuningConfig|JSON Object|Specifies how to tune various ingestion parameters. This object will vary with the ingestion method.|no|

## DataSchema

An example dataSchema is shown below:

"dataSchema" : {
  "dataSource" : "wikipedia",
  "parser" : {
    "type" : "string",
    "parseSpec" : {
      "format" : "json",
      "timestampSpec" : {
        "column" : "timestamp",
        "format" : "auto"
      },
      "dimensionsSpec" : {
        "dimensions": [
          "page",
          "language",
          "user",
          "unpatrolled",
          "newPage",
          "robot",
          "anonymous",
          "namespace",
          "continent",
          "country",
          "region",
          "city",
          {
            "type": "long",
            "name": "countryNum"
          },
          {
            "type": "float",
            "name": "userLatitude"
          },
          {
            "type": "float",
            "name": "userLongitude"
          }
        ],
        "dimensionExclusions" : [],
        "spatialDimensions" : []
      }
    }
  },
  "metricsSpec" : [{
    "type" : "count",
    "name" : "count"
  }, {
    "type" : "doubleSum",
    "name" : "added",
    "fieldName" : "added"
  }, {
    "type" : "doubleSum",
    "name" : "deleted",
    "fieldName" : "deleted"
  }, {
    "type" : "doubleSum",
    "name" : "delta",
    "fieldName" : "delta"
  }],
  "granularitySpec" : {
    "segmentGranularity" : "DAY",
    "queryGranularity" : "NONE",
    "intervals" : [ "2013-08-31/2013-09-01" ]
  },
  "transformSpec" : null
}
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|dataSource|String|The name of the ingested datasource. Datasources can be thought of as tables.|yes|
|parser|JSON Object|Specifies how ingested data can be parsed.|yes|
|metricsSpec|JSON Object array|A list of aggregators.|yes|
|granularitySpec|JSON Object|Specifies how to create segments and roll up data.|yes|
|transformSpec|JSON Object|Specifies how to filter and transform input data. See transform specs.|no|

### Parser

If `type` is not included, the parser defaults to `string`. For additional data formats, please see our extensions list.

#### String Parser

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|This should say `string` in general, or `hadoopyString` when used in a Hadoop indexing job.|no|
|parseSpec|JSON Object|Specifies the format, timestamp, and dimensions of the data.|yes|

#### ParseSpec

ParseSpecs serve two purposes:

* The String Parser uses them to determine the format (e.g., JSON, CSV, TSV) of incoming rows.
* All Parsers use them to determine the timestamp and dimensions of incoming rows.

If `format` is not included, the parseSpec defaults to `tsv`.

#### JSON ParseSpec

Use this with the String Parser to load JSON.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|format|String|This should say `json`.|no|
|timestampSpec|JSON Object|Specifies the column and format of the timestamp.|yes|
|dimensionsSpec|JSON Object|Specifies the dimensions of the data.|yes|
|flattenSpec|JSON Object|Specifies flattening configuration for nested JSON data. See Flattening JSON for more info.|no|
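
For example, a minimal JSON parseSpec might look like this (it reuses column names from the `dataSchema` example above):

```json
"parseSpec" : {
  "format" : "json",
  "timestampSpec" : {
    "column" : "timestamp",
    "format" : "auto"
  },
  "dimensionsSpec" : {
    "dimensions" : ["page", "language", "user"]
  }
}
```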

#### JSON Lowercase ParseSpec

This is a special variation of the JSON ParseSpec that lowercases all the column names in the incoming JSON data. This parseSpec is required if you are updating to Druid 0.7.x from Druid 0.6.x, are directly ingesting JSON with mixed-case column names, do not have any ETL in place to lowercase those column names, and would like to make queries that include the data you created using 0.6.x and 0.7.x.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|format|String|This should say `jsonLowercase`.|yes|
|timestampSpec|JSON Object|Specifies the column and format of the timestamp.|yes|
|dimensionsSpec|JSON Object|Specifies the dimensions of the data.|yes|

#### CSV ParseSpec

Use this with the String Parser to load CSV. Strings are parsed using the `com.opencsv` library.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|format|String|This should say `csv`.|yes|
|timestampSpec|JSON Object|Specifies the column and format of the timestamp.|yes|
|dimensionsSpec|JSON Object|Specifies the dimensions of the data.|yes|
|listDelimiter|String|A custom delimiter for multi-value dimensions.|no (default == ctrl+A)|
|columns|JSON array|Specifies the columns of the data.|yes|
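
For example, a sketch of a CSV parseSpec; `columns` lists every field in the data in order, while the `dimensionsSpec` selects which of them to ingest as dimensions (column names are illustrative):

```json
"parseSpec" : {
  "format" : "csv",
  "timestampSpec" : {
    "column" : "timestamp",
    "format" : "auto"
  },
  "columns" : ["timestamp", "page", "language", "added"],
  "dimensionsSpec" : {
    "dimensions" : ["page", "language"]
  }
}
```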

#### TSV / Delimited ParseSpec

Use this with the String Parser to load any delimited text that does not require special escaping. By default, the delimiter is a tab, so this will load TSV.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|format|String|This should say `tsv`.|yes|
|timestampSpec|JSON Object|Specifies the column and format of the timestamp.|yes|
|dimensionsSpec|JSON Object|Specifies the dimensions of the data.|yes|
|delimiter|String|A custom delimiter for data values.|no (default == \t)|
|listDelimiter|String|A custom delimiter for multi-value dimensions.|no (default == ctrl+A)|
|columns|JSON String array|Specifies the columns of the data.|yes|
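
For example, a sketch of a delimited parseSpec for pipe-separated data (column names are illustrative):

```json
"parseSpec" : {
  "format" : "tsv",
  "delimiter" : "|",
  "timestampSpec" : {
    "column" : "timestamp",
    "format" : "auto"
  },
  "columns" : ["timestamp", "page", "language", "added"],
  "dimensionsSpec" : {
    "dimensions" : ["page", "language"]
  }
}
```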

#### TimeAndDims ParseSpec

Use this with non-String Parsers to provide them with timestamp and dimensions information. Non-String Parsers handle all formatting decisions on their own, without using the ParseSpec.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|format|String|This should say `timeAndDims`.|yes|
|timestampSpec|JSON Object|Specifies the column and format of the timestamp.|yes|
|dimensionsSpec|JSON Object|Specifies the dimensions of the data.|yes|
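
For example, a sketch of a timeAndDims parseSpec as it might appear inside a non-String Parser (column names are illustrative):

```json
"parseSpec" : {
  "format" : "timeAndDims",
  "timestampSpec" : {
    "column" : "timestamp",
    "format" : "auto"
  },
  "dimensionsSpec" : {
    "dimensions" : ["page", "language"]
  }
}
```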

### TimestampSpec

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|column|String|The column of the timestamp.|yes|
|format|String|`iso`, `posix`, `millis`, `micro`, `nano`, `auto` or any Joda time format.|no (default == 'auto')|
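
For example, a timestampSpec for timestamps such as 2013-08-31 01:02:33 could use an explicit Joda time format (the column name is illustrative):

```json
"timestampSpec" : {
  "column" : "ts",
  "format" : "yyyy-MM-dd HH:mm:ss"
}
```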

### DimensionsSpec

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|dimensions|JSON array|A list of dimension schema objects or dimension names. Providing a name is equivalent to providing a String-typed dimension schema with the given name. If this is an empty array, Druid will treat all columns that are not timestamp or metric columns as String-typed dimension columns.|yes|
|dimensionExclusions|JSON String array|The names of dimensions to exclude from ingestion.|no (default == [])|
|spatialDimensions|JSON Object array|An array of spatial dimensions.|no (default == [])|

#### Dimension Schema

A dimension schema specifies the type and name of a dimension to be ingested.

For string columns, the dimension schema can also be used to enable or disable bitmap indexing by setting the `createBitmapIndex` boolean. By default, bitmap indexes are enabled for all string columns. Only string columns can have bitmap indexes; they are not supported for numeric columns.

For example, the following `dimensionsSpec` section from a `dataSchema` ingests one column as Long (`countryNum`), two columns as Float (`userLatitude`, `userLongitude`), and the other columns as Strings, with bitmap indexes disabled for the `comment` column.

"dimensionsSpec" : {
  "dimensions": [
    "page",
    "language",
    "user",
    "unpatrolled",
    "newPage",
    "robot",
    "anonymous",
    "namespace",
    "continent",
    "country",
    "region",
    "city",
    {
      "type": "string",
      "name": "comment",
      "createBitmapIndex": false
    },
    {
      "type": "long",
      "name": "countryNum"
    },
    {
      "type": "float",
      "name": "userLatitude"
    },
    {
      "type": "float",
      "name": "userLongitude"
    }
  ],
  "dimensionExclusions" : [],
  "spatialDimensions" : []
}

### metricsSpec

The `metricsSpec` is a list of aggregators. If `rollup` is false in the granularity spec, the metricsSpec should be an empty list and all columns should be defined in the `dimensionsSpec` instead (without rollup, there is no real distinction between dimensions and metrics at ingestion time). This is optional, however.
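
For example, a sketch of the relevant `dataSchema` fragments with rollup disabled; the raw numeric columns would then be listed in the `dimensionsSpec`:

```json
"metricsSpec" : [],
"granularitySpec" : {
  "segmentGranularity" : "DAY",
  "queryGranularity" : "NONE",
  "rollup" : false
}
```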

### GranularitySpec

The default granularity spec is `uniform`, and can be changed by setting the `type` field. Currently, `uniform` and `arbitrary` types are supported.

#### Uniform Granularity Spec

This spec is used to generate segments with uniform intervals.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|segmentGranularity|string|The granularity to create segments at.|no (default == 'DAY')|
|queryGranularity|string|The minimum granularity to be able to query results at and the granularity of the data inside the segment. E.g. a value of "minute" will mean that data is aggregated at minutely granularity. That is, if there are collisions in the tuple (minute(timestamp), dimensions), then it will aggregate values together using the aggregators instead of storing individual rows. A granularity of 'NONE' means millisecond granularity.|no (default == 'NONE')|
|rollup|boolean|Whether to enable ingestion-time rollup.|no (default == true)|
|intervals|string|A list of intervals for the raw data being ingested. Ignored for real-time ingestion.|no. If specified, batch ingestion tasks may skip the determine-partitions phase, which results in faster ingestion.|
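
For example, a uniform granularitySpec producing daily segments with hourly rollup (the interval is illustrative):

```json
"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "DAY",
  "queryGranularity" : "HOUR",
  "rollup" : true,
  "intervals" : [ "2013-08-31/2013-09-01" ]
}
```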

#### Arbitrary Granularity Spec

This spec is used to generate segments with arbitrary intervals (it tries to create evenly sized segments). This spec is not supported for real-time processing.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|queryGranularity|string|The minimum granularity to be able to query results at and the granularity of the data inside the segment. E.g. a value of "minute" will mean that data is aggregated at minutely granularity. That is, if there are collisions in the tuple (minute(timestamp), dimensions), then it will aggregate values together using the aggregators instead of storing individual rows. A granularity of 'NONE' means millisecond granularity.|no (default == 'NONE')|
|rollup|boolean|Whether to enable ingestion-time rollup.|no (default == true)|
|intervals|string|A list of intervals for the raw data being ingested. Ignored for real-time ingestion.|no. If specified, batch ingestion tasks may skip the determine-partitions phase, which results in faster ingestion.|
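
For example, a minimal arbitrary granularitySpec; note that there is no `segmentGranularity` field, since segment intervals are derived from the data:

```json
"granularitySpec" : {
  "type" : "arbitrary",
  "queryGranularity" : "NONE",
  "rollup" : true
}
```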

### Transform Spec

Transform specs allow Druid to transform and filter input data during ingestion. See transform specs for details.
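
As a rough sketch, a transformSpec combines a list of transforms with an optional filter; the example below assumes the `expression` transform type and a `selector` filter, with illustrative column names:

```json
"transformSpec" : {
  "transforms" : [
    { "type" : "expression", "name" : "countryUpper", "expression" : "upper(country)" }
  ],
  "filter" : { "type" : "selector", "dimension" : "robot", "value" : "false" }
}
```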

## IO Config

The IOConfig spec differs based on the ingestion task type.

## Tuning Config

The TuningConfig spec differs based on the ingestion task type.

## Evaluating Timestamp, Dimensions and Metrics

Druid will interpret dimensions, dimension exclusions, and metrics in the following order (a short example follows the list):

* Any column listed in the list of dimensions is treated as a dimension.
* Any column listed in the list of dimension exclusions is excluded as a dimension.
* The timestamp column and any columns or fieldNames required by metrics are excluded by default.
* If a metric is also listed as a dimension, the metric must have a different name than the dimension name.
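
For example, given the following fragments from a spec (the `userId` column is illustrative), Druid would ingest every other input column as a String dimension:

```json
"dimensionsSpec" : {
  "dimensions": [],
  "dimensionExclusions" : ["userId"]
},
"timestampSpec" : {
  "column" : "timestamp",
  "format" : "auto"
},
"metricsSpec" : [{
  "type" : "doubleSum",
  "name" : "added",
  "fieldName" : "added"
}]
```

Here `userId` is excluded explicitly, while `timestamp` and `added` are excluded by default as the timestamp column and a metric input, respectively.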