Docs refactor of ingestion. Carries #11541 (#11576)

* Docs refactor of ingestion. Carries #11541

* Update docs/misc/math-expr.md

* add Apache license

* fix header, add topics to sidebar

* Update docs/ingestion/partitioning.md

* pick up changes to  and  md from c7fdf1d, #11479

Co-authored-by: Suneet Saldanha <suneet@apache.org>
Co-authored-by: Jihoon Son <jihoonson@apache.org>
diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index b724e3c..4ca0e8d 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -927,8 +927,8 @@
 |`maxBytesInMemory`|Used in determining when intermediate persists to disk should occur. Normally this is computed internally and user does not need to set it. This value represents number of bytes to aggregate in heap memory before persisting. This is based on a rough estimate of memory usage and not actual usage. The maximum heap memory usage for indexing is `maxBytesInMemory` * (2 + `maxPendingPersists`)|no (default = 1/6 of max JVM memory)|
 |`splitHintSpec`|Used to give a hint to control the amount of data that each first phase task reads. This hint could be ignored depending on the implementation of the input source. See [Split hint spec](../ingestion/native-batch.md#split-hint-spec) for more details.|no (default = size-based split hint spec)|
 |`partitionsSpec`|Defines how to partition data in each time chunk, see [`PartitionsSpec`](../ingestion/native-batch.md#partitionsspec)|no (default = `dynamic`)|
-|`indexSpec`|Defines segment storage format options to be used at indexing time, see [IndexSpec](../ingestion/index.md#indexspec)|no|
-|`indexSpecForIntermediatePersists`|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](../ingestion/index.md#indexspec) for possible values.|no|
+|`indexSpec`|Defines segment storage format options to be used at indexing time, see [IndexSpec](../ingestion/ingestion-spec.md#indexspec)|no|
+|`indexSpecForIntermediatePersists`|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](../ingestion/ingestion-spec.md#indexspec) for possible values.|no|
 |`maxPendingPersists`|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with `maxRowsInMemory` * (2 + `maxPendingPersists`).|no (default = 0, meaning one persist can be running concurrently with ingestion, and none can be queued up)|
 |`pushTimeout`|Milliseconds to wait for pushing segments. It must be >= 0, where 0 means to wait forever.|no (default = 0)|
 |`segmentWriteOutMediumFactory`|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](../ingestion/native-batch.md#segmentwriteoutmediumfactory).|no (default is the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used)|
diff --git a/docs/design/segments.md b/docs/design/segments.md
index e54353b..a358299 100644
--- a/docs/design/segments.md
+++ b/docs/design/segments.md
@@ -27,7 +27,7 @@
 time. In a basic setup, one segment file is created for each time
 interval, where the time interval is configurable in the
 `segmentGranularity` parameter of the
-[`granularitySpec`](../ingestion/index.md#granularityspec).  For Druid to
+[`granularitySpec`](../ingestion/ingestion-spec.md#granularityspec).  For Druid to
 operate well under heavy query load, it is important for the segment
 file size to be within the recommended range of 300MB-700MB. If your
 segment files are larger than this range, then consider either
@@ -187,7 +187,7 @@
 A ColumnDescriptor is essentially an object that allows us to use Jackson's polymorphic deserialization to add new and interesting methods of serialization with minimal impact to the code. It consists of some metadata about the column (what type is it, is it multi-value, etc.) and then a list of serialization/deserialization logic that can deserialize the rest of the binary.
 
 ### Compression
-Druid compresses blocks of values for string, long, float, and double columns, using [LZ4](https://github.com/lz4/lz4-java) by default, and bitmaps for string columns and numeric null values are compressed using [Roaring](https://github.com/RoaringBitmap/RoaringBitmap). We recommend sticking with these defaults unless experimental verification with your own data and query patterns suggest that non-default options will perform better in your specific case. For example, for bitmap in string columns, the differences between using Roaring and CONCISE are most pronounced for high cardinality columns. In this case, Roaring is substantially faster on filters that match a lot of values, but in some cases CONCISE can have a lower footprint due to the overhead of the Roaring format (but is still slower when lots of values are matched). Currently, compression is configured on at the segment level rather than individual columns, see [IndexSpec](../ingestion/index.md#indexspec) for more details.
+Druid compresses blocks of values for string, long, float, and double columns, using [LZ4](https://github.com/lz4/lz4-java) by default, and bitmaps for string columns and numeric null values are compressed using [Roaring](https://github.com/RoaringBitmap/RoaringBitmap). We recommend sticking with these defaults unless experimental verification with your own data and query patterns suggests that non-default options will perform better in your specific case. For example, for bitmaps in string columns, the differences between using Roaring and CONCISE are most pronounced for high cardinality columns. In this case, Roaring is substantially faster on filters that match a lot of values, but in some cases CONCISE can have a lower footprint due to the overhead of the Roaring format (but is still slower when lots of values are matched). Currently, compression is configured at the segment level rather than for individual columns. See [IndexSpec](../ingestion/ingestion-spec.md#indexspec) for more details.
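+
+For illustration only, here is a minimal `indexSpec` sketch (it goes inside an ingestion spec's `tuningConfig`) that spells out the defaults described above; treat it as an example to adapt, not a recommendation to change these values:
+
+```json
+"indexSpec": {
+  "bitmap": { "type": "roaring" },
+  "dimensionCompression": "lz4",
+  "metricCompression": "lz4",
+  "longEncoding": "longs"
+}
+```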
 
 ## Sharding Data to Create Segments
 
diff --git a/docs/development/extensions-core/datasketches-hll.md b/docs/development/extensions-core/datasketches-hll.md
index c359dc7..195c501 100644
--- a/docs/development/extensions-core/datasketches-hll.md
+++ b/docs/development/extensions-core/datasketches-hll.md
@@ -59,7 +59,7 @@
  }
 ```
 
-> It is very common to use `HLLSketchBuild` in combination with [rollup](../../ingestion/index.html#rollup) to create a [metric](../../ingestion/index.html#metricsspec) on high-cardinality columns.  In this example, a metric called `userid_hll` is included in the `metricsSpec`.  This will perform a HLL sketch on the `userid` field at ingestion time, allowing for highly-performant approximate `COUNT DISTINCT` query operations and improving roll-up ratios when `userid` is then left out of the `dimensionsSpec`.
+> It is very common to use `HLLSketchBuild` in combination with [rollup](../../ingestion/rollup.md) to create a [metric](../../ingestion/ingestion-spec.md#metricsspec) on high-cardinality columns.  In this example, a metric called `userid_hll` is included in the `metricsSpec`.  This will perform an HLL sketch on the `userid` field at ingestion time, allowing for highly-performant approximate `COUNT DISTINCT` query operations and improving roll-up ratios when `userid` is then left out of the `dimensionsSpec`.
 >
 > ```
 > :
diff --git a/docs/development/extensions-core/kafka-ingestion.md b/docs/development/extensions-core/kafka-ingestion.md
index 52b751e..0b03217 100644
--- a/docs/development/extensions-core/kafka-ingestion.md
+++ b/docs/development/extensions-core/kafka-ingestion.md
@@ -126,7 +126,7 @@
 |Field|Description|Required|
 |--------|-----------|---------|
 |`type`|The supervisor type, this should always be `kafka`.|yes|
-|`dataSchema`|The schema that will be used by the Kafka indexing task during ingestion. See [`dataSchema`](../../ingestion/index.md#dataschema) for details.|yes|
+|`dataSchema`|The schema that will be used by the Kafka indexing task during ingestion. See [`dataSchema`](../../ingestion/ingestion-spec.md#dataschema) for details.|yes|
 |`ioConfig`|A KafkaSupervisorIOConfig object for configuring Kafka connection and I/O-related settings for the supervisor and indexing task. See [KafkaSupervisorIOConfig](#kafkasupervisorioconfig) below.|yes|
 |`tuningConfig`|A KafkaSupervisorTuningConfig object for configuring performance-related settings for the supervisor and indexing tasks. See [KafkaSupervisorTuningConfig](#kafkasupervisortuningconfig) below.|no|
 
diff --git a/docs/development/extensions-core/kinesis-ingestion.md b/docs/development/extensions-core/kinesis-ingestion.md
index 855707e..cae9018 100644
--- a/docs/development/extensions-core/kinesis-ingestion.md
+++ b/docs/development/extensions-core/kinesis-ingestion.md
@@ -114,7 +114,7 @@
 |Field|Description|Required|
 |--------|-----------|---------|
 |`type`|The supervisor type, this should always be `kinesis`.|yes|
-|`dataSchema`|The schema that will be used by the Kinesis indexing task during ingestion. See [`dataSchema`](../../ingestion/index.md#dataschema).|yes|
+|`dataSchema`|The schema that will be used by the Kinesis indexing task during ingestion. See [`dataSchema`](../../ingestion/ingestion-spec.md#dataschema).|yes|
 |`ioConfig`|A KinesisSupervisorIOConfig object for configuring Kafka connection and I/O-related settings for the supervisor and indexing task. See [KinesisSupervisorIOConfig](#kinesissupervisorioconfig) below.|yes|
 |`tuningConfig`|A KinesisSupervisorTuningConfig object for configuring performance-related settings for the supervisor and indexing tasks. See [KinesisSupervisorTuningConfig](#kinesissupervisortuningconfig) below.|no|
 
diff --git a/docs/development/extensions-core/orc.md b/docs/development/extensions-core/orc.md
index 6856baf..e73583d 100644
--- a/docs/development/extensions-core/orc.md
+++ b/docs/development/extensions-core/orc.md
@@ -44,7 +44,7 @@
 * The 'contrib' extension supported a `typeString` property, which provided the schema of the
 ORC file, of which was essentially required to have the types correct, but notably _not_ the column names, which
 facilitated column renaming. In the 'core' extension, column renaming can be achieved with
-[`flattenSpec`](../../ingestion/index.md#flattenspec). For example, `"typeString":"struct<time:string,name:string>"`
+[`flattenSpec`](../../ingestion/ingestion-spec.md#flattenspec). For example, `"typeString":"struct<time:string,name:string>"`
 with the actual schema `struct<_col0:string,_col1:string>`, to preserve Druid schema would need replaced with:
 
 ```json
@@ -67,7 +67,7 @@
 
 * The 'contrib' extension supported a `mapFieldNameFormat` property, which provided a way to specify a dimension to
  flatten `OrcMap` columns with primitive types. This functionality has also been replaced with
- [`flattenSpec`](../../ingestion/index.md#flattenspec). For example: `"mapFieldNameFormat": "<PARENT>_<CHILD>"`
+ [`flattenSpec`](../../ingestion/ingestion-spec.md#flattenspec). For example: `"mapFieldNameFormat": "<PARENT>_<CHILD>"`
  for a dimension `nestedData_dim1`, to preserve Druid schema could be replaced with
 
  ```json
diff --git a/docs/ingestion/compaction.md b/docs/ingestion/compaction.md
index 68ce4df..8ccf9a9 100644
--- a/docs/ingestion/compaction.md
+++ b/docs/ingestion/compaction.md
@@ -36,7 +36,7 @@
 - Over time you don't need fine-grained granularity for older data so you want use compaction to change older segments to a coarser query granularity. This reduces the storage space required for older data. For example from `minute` to `hour`, or `hour` to `day`. You cannot go from coarser granularity to finer granularity.
 - You can change the dimension order to improve sorting and reduce segment size.
 - You can remove unused columns in compaction or implement an aggregation metric for older data.
-- You can change segment rollup from dynamic partitioning with best-effort rollup to hash or range partitioning with perfect rollup. For more information on rollup, see [perfect vs best-effort rollup](index.md#perfect-rollup-vs-best-effort-rollup).
+- You can change segment rollup from dynamic partitioning with best-effort rollup to hash or range partitioning with perfect rollup. For more information on rollup, see [perfect vs best-effort rollup](./rollup.md#perfect-rollup-vs-best-effort-rollup).
 
 Compaction does not improve performance in all situations. For example, if you rewrite your data with each ingestion task, you don't need to use compaction. See [Segment optimization](../operations/segment-optimization.md) for additional guidance to determine if compaction will help in your environment.
 
@@ -82,7 +82,7 @@
 
 ### Rollup
 Druid only rolls up the output segment when `rollup` is set for all input segments.
-See [Roll-up](../ingestion/index.md#rollup) for more details.
+See [Roll-up](../ingestion/rollup.md) for more details.
 You can check that your segments are rolled up or not by using [Segment Metadata Queries](../querying/segmentmetadataquery.md#analysistypes).
 
 ## Setting up manual compaction
diff --git a/docs/ingestion/data-formats.md b/docs/ingestion/data-formats.md
index cb1b3d2..df0f165 100644
--- a/docs/ingestion/data-formats.md
+++ b/docs/ingestion/data-formats.md
@@ -79,12 +79,19 @@
 Especially if you want to use the Hadoop ingestion, you still need to use the [Parser](#parser).
 If your data is formatted in some format not listed in this section, please consider using the Parser instead.
 
-All forms of Druid ingestion require some form of schema object. The format of the data to be ingested is specified using the `inputFormat` entry in your [`ioConfig`](index.md#ioconfig).
+All forms of Druid ingestion require some form of schema object. The format of the data to be ingested is specified using the `inputFormat` entry in your [`ioConfig`](ingestion-spec.md#ioconfig).
 
 ### JSON
 
-The `inputFormat` to load data of JSON format. An example is:
+Configure the JSON `inputFormat` to load JSON data as follows:
 
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | This should say `json`. | yes |
+| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See [`flattenSpec`](#flattenspec) for more info. | no |
+| featureSpec | JSON Object | [JSON parser features](https://github.com/FasterXML/jackson-core/wiki/JsonParser-Features) supported by Jackson library. Those features will be applied when parsing the input JSON data. | no |
+
+For example:
 ```json
 "ioConfig": {
   "inputFormat": {
@@ -94,18 +101,19 @@
 }
 ```
 
-The JSON `inputFormat` has the following components:
+### CSV
+
+Configure the CSV `inputFormat` to load CSV data as follows:
 
 | Field | Type | Description | Required |
 |-------|------|-------------|----------|
-| type | String | This should say `json`. | yes |
-| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See [`flattenSpec`](#flattenspec) for more info. | no |
-| featureSpec | JSON Object | [JSON parser features](https://github.com/FasterXML/jackson-core/wiki/JsonParser-Features) supported by Jackson library. Those features will be applied when parsing the input JSON data. | no |
+| type | String | This should say `csv`. | yes |
+| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default = ctrl+A) |
+| columns | JSON array | Specifies the columns of the data. The columns should be in the same order with the columns of your data. | yes if `findColumnsFromHeader` is false or missing |
+| findColumnsFromHeader | Boolean | If this is set, the task will find the column names from the header row. Note that `skipHeaderRows` will be applied before finding column names from the header. For example, if you set `skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip the first two lines and then extract column information from the third line. `columns` will be ignored if this is set to true. | no (default = false if `columns` is set; otherwise null) |
+| skipHeaderRows | Integer | If this is set, the task will skip the first `skipHeaderRows` rows. | no (default = 0) |
 
-### CSV
-
-The `inputFormat` to load data of the CSV format. An example is:
-
+For example:
 ```json
 "ioConfig": {
   "inputFormat": {
@@ -116,30 +124,9 @@
 }
 ```
 
-The CSV `inputFormat` has the following components:
-
-| Field | Type | Description | Required |
-|-------|------|-------------|----------|
-| type | String | This should say `csv`. | yes |
-| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default = ctrl+A) |
-| columns | JSON array | Specifies the columns of the data. The columns should be in the same order with the columns of your data. | yes if `findColumnsFromHeader` is false or missing |
-| findColumnsFromHeader | Boolean | If this is set, the task will find the column names from the header row. Note that `skipHeaderRows` will be applied before finding column names from the header. For example, if you set `skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip the first two lines and then extract column information from the third line. `columns` will be ignored if this is set to true. | no (default = false if `columns` is set; otherwise null) |
-| skipHeaderRows | Integer | If this is set, the task will skip the first `skipHeaderRows` rows. | no (default = 0) |
-
 ### TSV (Delimited)
 
-```json
-"ioConfig": {
-  "inputFormat": {
-    "type": "tsv",
-    "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
-    "delimiter":"|"
-  },
-  ...
-}
-```
-
-The `inputFormat` to load data of a delimited format. An example is:
+Configure the TSV `inputFormat` to load TSV data as follows:
 
 | Field | Type | Description | Required |
 |-------|------|-------------|----------|
@@ -152,15 +139,32 @@
 
 Be sure to change the `delimiter` to the appropriate delimiter for your data. Like CSV, you must specify the columns and which subset of the columns you want indexed.
 
+For example:
+```json
+"ioConfig": {
+  "inputFormat": {
+    "type": "tsv",
+    "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
+    "delimiter":"|"
+  },
+  ...
+}
+```
+
 ### ORC
 
-> You need to include the [`druid-orc-extensions`](../development/extensions-core/orc.md) as an extension to use the ORC input format.
+To use the ORC input format, load the Druid ORC extension ([`druid-orc-extensions`](../development/extensions-core/orc.md)).
+
+> To upgrade from versions earlier than 0.15.0 to 0.15.0 or newer, read [Migration from 'contrib' extension](../development/extensions-core/orc.md#migration-from-contrib-extension).
 
-> If you are considering upgrading from earlier than 0.15.0 to 0.15.0 or a higher version,
-> please read [Migration from 'contrib' extension](../development/extensions-core/orc.md#migration-from-contrib-extension) carefully.
+Configure the ORC `inputFormat` to load ORC data as follows:
 
-The `inputFormat` to load data of ORC format. An example is:
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | This should say `orc`. | yes |
+| flattenSpec | JSON Object | Specifies flattening configuration for nested ORC data. See [`flattenSpec`](#flattenspec) for more info. | no |
+| binaryAsString | Boolean | Specifies if the binary orc column which is not logically marked as a string should be treated as a UTF-8 encoded string. | no (default = false) |
 
+For example:
 ```json
 "ioConfig": {
   "inputFormat": {
@@ -181,20 +185,19 @@
 }
 ```
 
-The ORC `inputFormat` has the following components:
+### Parquet
+
+To use the Parquet input format, load the Druid Parquet extension ([`druid-parquet-extensions`](../development/extensions-core/parquet.md)).
+
+Configure the Parquet `inputFormat` to load Parquet data as follows:
 
 | Field | Type | Description | Required |
 |-------|------|-------------|----------|
-| type | String | This should say `orc`. | yes |
-| flattenSpec | JSON Object | Specifies flattening configuration for nested ORC data. See [`flattenSpec`](#flattenspec) for more info. | no |
-| binaryAsString | Boolean | Specifies if the binary orc column which is not logically marked as a string should be treated as a UTF-8 encoded string. | no (default = false) |
+|type| String| This should be set to `parquet` to read Parquet files| yes |
+|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Parquet file. Note that only 'path' expressions are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
+| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
 
-### Parquet
-
-> You need to include the [`druid-parquet-extensions`](../development/extensions-core/parquet.md) as an extension to use the Parquet input format.
-
-The `inputFormat` to load data of Parquet format. An example is:
-
+For example:
 ```json
 "ioConfig": {
   "inputFormat": {
@@ -215,21 +218,22 @@
 }
 ```
 
-The Parquet `inputFormat` has the following components:
+### Avro Stream
+
+To use the Avro Stream input format, load the Druid Avro extension ([`druid-avro-extensions`](../development/extensions-core/avro.md)).
+
+For more information on how Druid handles Avro types, see the [Avro Types](../development/extensions-core/avro.md#avro-types) section.
+
+Configure the Avro `inputFormat` to load Avro data as follows:
 
 | Field | Type | Description | Required |
 |-------|------|-------------|----------|
-|type| String| This should be set to `parquet` to read Parquet file| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Parquet file. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
-| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
+|type| String| This should be set to `avro_stream` to read Avro serialized data| yes |
+|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from an Avro record. Note that only 'path' expressions are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
+|`avroBytesDecoder`| JSON Object |Specifies how to decode bytes to Avro record. | yes |
+| binaryAsString | Boolean | Specifies if the bytes Avro column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
 
-### Avro Stream
-
-> You need to include the [`druid-avro-extensions`](../development/extensions-core/avro.md) as an extension to use the Avro Stream input format.
-
-> See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid
-
-The `inputFormat` to load data of Avro format in stream ingestion. An example is:
+For example:
 ```json
 "ioConfig": {
   "inputFormat": {
@@ -263,13 +267,6 @@
 }
 ```
 
-| Field | Type | Description | Required |
-|-------|------|-------------|----------|
-|type| String| This should be set to `avro_stream` to read Avro serialized data| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
-|`avroBytesDecoder`| JSON Object |Specifies how to decode bytes to Avro record. | yes |
-| binaryAsString | Boolean | Specifies if the bytes Avro column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
-
 ##### Avro Bytes Decoder
 
 If `type` is not included, the avroBytesDecoder defaults to `schema_repo`.
@@ -430,11 +427,20 @@
 
 ### Avro OCF
 
-> You need to include the [`druid-avro-extensions`](../development/extensions-core/avro.md) as an extension to use the Avro OCF input format.
+To use the Avro OCF input format, load the Druid Avro extension ([`druid-avro-extensions`](../development/extensions-core/avro.md)).
 
-> See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid
+See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid.
 
-The `inputFormat` to load data of Avro OCF format. An example is:
+Configure the Avro OCF `inputFormat` to load Avro OCF data as follows:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+|type| String| This should be set to `avro_ocf` to read Avro OCF files| yes |
+|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from Avro records. Note that only 'path' expressions are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
+|schema| JSON Object |Define a reader schema to be used when parsing Avro records. This is useful when parsing multiple versions of Avro OCF file data. | no (default will decode using the writer schema contained in the OCF file) |
+| binaryAsString | Boolean | Specifies if the bytes Avro column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
+
+For example:
 ```json
 "ioConfig": {
   "inputFormat": {
@@ -470,18 +476,19 @@
 }
 ```
 
-| Field | Type | Description | Required |
-|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro records. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
-|schema| JSON Object |Define a reader schema to be used when parsing Avro records, this is useful when parsing multiple versions of Avro OCF file data | no (default will decode using the writer schema contained in the OCF file) |
-| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
-
 ### Protobuf
 
 > You need to include the [`druid-protobuf-extensions`](../development/extensions-core/protobuf.md) as an extension to use the Protobuf input format.
 
-The `inputFormat` to load data of Protobuf format. An example is:
+Configure the Protobuf `inputFormat` to load Protobuf data as follows:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+|type| String| This should be set to `protobuf` to read Protobuf serialized data| yes |
+|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Protobuf record. Note that only 'path' expressions are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
+|`protoBytesDecoder`| JSON Object |Specifies how to decode bytes to Protobuf record. | yes |
+
+For example:
 ```json
 "ioConfig": {
   "inputFormat": {
@@ -506,18 +513,18 @@
 }
 ```
 
-| Field | Type | Description | Required |
-|-------|------|-------------|----------|
-|type| String| This should be set to `protobuf` to read Protobuf serialized data| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Protobuf record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
-|`protoBytesDecoder`| JSON Object |Specifies how to decode bytes to Protobuf record. | yes |
-
 ### FlattenSpec
 
-The `flattenSpec` is located in `inputFormat` → `flattenSpec` and is responsible for
-bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
-An example `flattenSpec` is:
+The `flattenSpec` bridges the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model. It is an object within the `inputFormat` object.
 
+Configure your `flattenSpec` as follows:
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| useFieldDiscovery | If true, interpret all root-level fields as available fields for usage by [`timestampSpec`](./ingestion-spec.md#timestampspec), [`transformSpec`](./ingestion-spec.md#transformspec), [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec), and [`metricsSpec`](./ingestion-spec.md#metricsspec).<br><br>If false, only explicitly specified fields (see `fields`) will be available for use. | `true` |
+| fields | Specifies the fields of interest and how they are accessed. See [Field flattening specifications](#field-flattening-specifications) for more detail. | `[]` |
+
+For example:
 ```json
 "flattenSpec": {
   "useFieldDiscovery": true,
@@ -528,21 +535,12 @@
   ]
 }
 ```
-> Conceptually, after input data records are read, the `flattenSpec` is applied first before
-> any other specs such as [`timestampSpec`](./index.md#timestampspec), [`transformSpec`](./index.md#transformspec),
-> [`dimensionsSpec`](./index.md#dimensionsspec), or [`metricsSpec`](./index.md#metricsspec). Keep this in mind when writing
-> your ingestion spec.
+After Druid reads the input data records, it applies the `flattenSpec` before applying any other specs such as [`timestampSpec`](./ingestion-spec.md#timestampspec), [`transformSpec`](./ingestion-spec.md#transformspec),
+[`dimensionsSpec`](./ingestion-spec.md#dimensionsspec), or [`metricsSpec`](./ingestion-spec.md#metricsspec). Keep this in mind when writing your ingestion spec.
 
 Flattening is only supported for [data formats](data-formats.md) that support nesting, including `avro`, `json`, `orc`,
 and `parquet`.
 
-A `flattenSpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| useFieldDiscovery | If true, interpret all root-level fields as available fields for usage by [`timestampSpec`](./index.md#timestampspec), [`transformSpec`](./index.md#transformspec), [`dimensionsSpec`](./index.md#dimensionsspec), and [`metricsSpec`](./index.md#metricsspec).<br><br>If false, only explicitly specified fields (see `fields`) will be available for use. | `true` |
-| fields | Specifies the fields of interest and how they are accessed. [See below for details.](#field-flattening-specifications) | `[]` |
-
 #### Field flattening specifications
 
 Each entry in the `fields` list can have the following components:
@@ -550,7 +548,7 @@
 | Field | Description | Default |
 |-------|-------------|---------|
 | type | Options are as follows:<br><br><ul><li>`root`, referring to a field at the root level of the record. Only really useful if `useFieldDiscovery` is false.</li><li>`path`, referring to a field using [JsonPath](https://github.com/jayway/JsonPath) notation. Supported by most data formats that offer nesting, including `avro`, `json`, `orc`, and `parquet`.</li><li>`jq`, referring to a field using [jackson-jq](https://github.com/eiiches/jackson-jq) notation. Only supported for the `json` format.</li></ul> | none (required) |
-| name | Name of the field after flattening. This name can be referred to by the [`timestampSpec`](./index.md#timestampspec), [`transformSpec`](./index.md#transformspec), [`dimensionsSpec`](./index.md#dimensionsspec), and [`metricsSpec`](./index.md#metricsspec).| none (required) |
+| name | Name of the field after flattening. This name can be referred to by the [`timestampSpec`](./ingestion-spec.md#timestampspec), [`transformSpec`](./ingestion-spec.md#transformspec), [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec), and [`metricsSpec`](./ingestion-spec.md#metricsspec).| none (required) |
 | expr | Expression for accessing the field while flattening. For type `path`, this should be [JsonPath](https://github.com/jayway/JsonPath). For type `jq`, this should be [jackson-jq](https://github.com/eiiches/jackson-jq) notation. For other types, this parameter is ignored. | none (required for types `path` and `jq`) |
 
 #### Notes on flattening
@@ -1157,7 +1155,7 @@
 |-------|------|-------------|----------|
 | type | String | This should say `protobuf`. | yes |
 | `protoBytesDecoder` | JSON Object | Specifies how to decode bytes to Protobuf record. | yes |
-| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data.  The format must be JSON. See [JSON ParseSpec](./index.md) for more configuration options.  Note that timeAndDims parseSpec is no longer supported. | yes |
+| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data.  The format must be JSON. See [JSON ParseSpec](#json-parsespec) for more configuration options.  Note that timeAndDims parseSpec is no longer supported. | yes |
 
 Sample spec:
 
diff --git a/docs/ingestion/data-model.md b/docs/ingestion/data-model.md
new file mode 100644
index 0000000..53141ba
--- /dev/null
+++ b/docs/ingestion/data-model.md
@@ -0,0 +1,58 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management system (RDBMS). Druid's data model shares similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data and to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. Regardless of the source field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.
+
+You can control other important operations that are based on the primary timestamp in the
+[`granularitySpec`](./ingestion-spec.md#granularityspec). If you have more than one timestamp column, you can store the others as
+[secondary timestamps](./schema-design.md#secondary-timestamps).
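+
+For illustration, a minimal `timestampSpec` sketch, assuming the source data carries an ISO 8601 timestamp in a field named `timestamp` (the field name is only an example):
+
+```json
+"timestampSpec": {
+  "column": "timestamp",
+  "format": "iso"
+}
+```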
+
+## Dimensions
+
+Dimensions are columns that Druid stores "as-is". You can use dimensions for any purpose. For example, you can group, filter, or apply aggregators to dimensions at query time when necessary.
+
+If you disable [rollup](./rollup.md), then Druid treats the set of
+dimensions like a set of columns to ingest. The dimensions behave exactly as you would expect from any database that does not support a rollup feature.
+
+At ingestion time, you configure dimensions in the [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec).
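+
+For illustration, a minimal `dimensionsSpec` sketch; the column names (`page`, `language`, `userId`) are placeholders for your own data:
+
+```json
+"dimensionsSpec": {
+  "dimensions": [
+    "page",
+    "language",
+    { "type": "long", "name": "userId" }
+  ]
+}
+```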
+
+## Metrics
+
+Metrics are columns that Druid stores in an aggregated form. Metrics are most useful when you enable [rollup](rollup.md). If you specify a metric, you can apply an aggregation function to each row during ingestion. This has the following benefits:
+
+- [Rollup](rollup.md) is a form of aggregation that combines multiple rows with the same timestamp value and dimension values, collapsing rows while retaining their summary information. For example, the [rollup tutorial](../tutorials/tutorial-rollup.md) demonstrates using rollup to collapse netflow data to a single row per `(minute, srcIP, dstIP)` tuple, while keeping aggregate information about total packet and byte counts.
+- Druid can compute some aggregators, especially approximate ones, more quickly at query time if they are partially computed at ingestion time, even on data that has not been rolled up.
+
+At ingestion time, you configure metrics in the [`metricsSpec`](./ingestion-spec.md#metricsspec).
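+
+For illustration, a minimal `metricsSpec` sketch that keeps a row count and sums a hypothetical `bytes` input column:
+
+```json
+"metricsSpec": [
+  { "type": "count", "name": "count" },
+  { "type": "longSum", "name": "bytes", "fieldName": "bytes" }
+]
+```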
diff --git a/docs/ingestion/hadoop.md b/docs/ingestion/hadoop.md
index 4fd4bcc..f349f2a 100644
--- a/docs/ingestion/hadoop.md
+++ b/docs/ingestion/hadoop.md
@@ -115,7 +115,7 @@
 
 ## `dataSchema`
 
-This field is required. See the [`dataSchema`](index.md#legacy-dataschema-spec) section of the main ingestion page for details on
+This field is required. See the [`dataSchema`](ingestion-spec.md#legacy-dataschema-spec) section of the main ingestion page for details on
 what it should contain.
 
 ## `ioConfig`
@@ -328,8 +328,8 @@
 |combineText|Boolean|Use CombineTextInputFormat to combine multiple files into a file split. This can speed up Hadoop jobs when processing a large number of small files.|no (default == false)|
 |useCombiner|Boolean|Use Hadoop combiner to merge rows at mapper if possible.|no (default == false)|
 |jobProperties|Object|A map of properties to add to the Hadoop job configuration, see below for details.|no (default == null)|
-|indexSpec|Object|Tune how data is indexed. See [`indexSpec`](index.md#indexspec) on the main ingestion page for more information.|no|
-|indexSpecForIntermediatePersists|Object|defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [`indexSpec`](index.md#indexspec) for possible values.|no (default = same as indexSpec)|
+|indexSpec|Object|Tune how data is indexed. See [`indexSpec`](ingestion-spec.md#indexspec) on the main ingestion page for more information.|no|
+|indexSpecForIntermediatePersists|Object|defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [`indexSpec`](ingestion-spec.md#indexspec) for possible values.|no (default = same as indexSpec)|
 |numBackgroundPersistThreads|Integer|The number of new background threads to use for incremental persists. Using this feature causes a notable increase in memory pressure and CPU usage but will make the job finish more quickly. If changing from the default of 0 (use current thread for persists), we recommend setting it to 1.|no (default == 0)|
 |forceExtendableShardSpecs|Boolean|Forces use of extendable shardSpecs. Hash-based partitioning always uses an extendable shardSpec. For single-dimension partitioning, this option should be set to true to use an extendable shardSpec. For partitioning, please check [Partitioning specification](#partitionsspec). This option can be useful when you need to append more data to existing dataSource.|no (default = false)|
 |useExplicitVersion|Boolean|Forces HadoopIndexTask to use version.|no (default = false)|
diff --git a/docs/ingestion/index.md b/docs/ingestion/index.md
index 6d36dd3..567d281 100644
--- a/docs/ingestion/index.md
+++ b/docs/ingestion/index.md
@@ -22,33 +22,24 @@
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain up to a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN, although MiddleManager or Indexer processes still start and monitor the Hadoop jobs.
 
-Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.md), they are loaded by Historical processes. 
-For more details on how this works, see the [Storage design](../design/architecture.md#storage-design) section 
-of Druid's design documentation.
+During ingestion, Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md). Historical nodes load the segments into memory to respond to queries. For streaming ingestion, MiddleManager and Indexer processes can respond to queries in real time with arriving data. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.
 
-## How to use this documentation
+This topic introduces streaming and batch ingestion methods. The following topics describe ingestion concepts and information that apply to all [ingestion methods](#ingestion-methods):
+- [Druid data model](./data-model.md) introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+- [Data rollup](./rollup.md) describes rollup as a concept and provides suggestions to maximize the benefits of rollup.
+- [Partitioning](./partitioning.md) describes time chunk and secondary partitioning in Druid.
+- [Ingestion spec reference](./ingestion-spec.md) provides a reference for the configuration options in the ingestion spec.
 
-This **page you are currently reading** provides information about universal Druid ingestion concepts, and about
-configurations that are common to all [ingestion methods](#ingestion-methods).
-
-The **individual pages for each ingestion method** provide additional information about concepts and configurations
-that are unique to each ingestion method.
-
-We recommend reading (or at least skimming) this universal page first, and then referring to the page for the
-ingestion method or methods that you have chosen.
+For additional information about concepts and configurations that are unique to each ingestion method, see the topic for the ingestion method.
 
 ## Ingestion methods
 
-The table below lists Druid's most common data ingestion methods, along with comparisons to help you choose
+The tables below list Druid's most common data ingestion methods, along with comparisons to help you choose
 the best one for your situation. Each ingestion method supports its own set of source systems to pull from. For details
 about how each method works, as well as configuration properties specific to that method, check out its documentation
 page.
@@ -59,6 +50,8 @@
 [Kafka indexing service](../development/extensions-core/kafka-ingestion.md) that reads directly from Kafka. Alternatively, the Kinesis
 indexing service works with Amazon Kinesis Data Streams.
 
+Streaming ingestion uses an ongoing process called a supervisor that reads from the data stream to ingest data into Druid.
+
 This table compares the options:
 
 | **Method** | [Kafka](../development/extensions-core/kafka-ingestion.md) | [Kinesis](../development/extensions-core/kinesis-ingestion.md) |
@@ -88,656 +81,6 @@
 | **External dependencies** | None. | Hadoop cluster (Druid submits Map/Reduce jobs). | None. |
 | **Input locations** | Any [`inputSource`](./native-batch.md#input-sources). | Any Hadoop FileSystem or Druid datasource. | Any [`inputSource`](./native-batch.md#input-sources). |
 | **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
-| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
-| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |
-
-<a name="data-model"></a>
-
-## Druid's data model
-
-### Datasources
-
-Druid data is stored in datasources, which are similar to tables in a traditional RDBMS. Druid
-offers a unique data modeling system that bears similarity to both relational and timeseries models.
-
-### Primary timestamp
-
-Druid schemas must always include a primary timestamp. The primary timestamp is used for
-[partitioning and sorting](#partitioning) your data. Druid queries are able to rapidly identify and retrieve data
-corresponding to time ranges of the primary timestamp column. Druid is also able to use the primary timestamp column
-for time-based [data management operations](data-management.md) such as dropping time chunks, overwriting time chunks,
-and time-based retention rules.
-
-The primary timestamp is parsed based on the [`timestampSpec`](#timestampspec). In addition, the
-[`granularitySpec`](#granularityspec) controls other important operations that are based on the primary timestamp.
-Regardless of which input field the primary timestamp is read from, it will always be stored as a column named `__time`
-in your Druid datasource.
-
-If you have more than one timestamp column, you can store the others as
-[secondary timestamps](schema-design.md#secondary-timestamps).
-
-### Dimensions
-
-Dimensions are columns that are stored as-is and can be used for any purpose. You can group, filter, or apply
-aggregators to dimensions at query time in an ad-hoc manner. If you run with [rollup](#rollup) disabled, then the set of
-dimensions is simply treated like a set of columns to ingest, and behaves exactly as you would expect from a typical
-database that does not support a rollup feature.
-
-Dimensions are configured through the [`dimensionsSpec`](#dimensionsspec).
-
-### Metrics
-
-Metrics are columns that are stored in an aggregated form. They are most useful when [rollup](#rollup) is enabled.
-Specifying a metric allows you to choose an aggregation function for Druid to apply to each row during ingestion. This
-has two benefits:
-
-1. If [rollup](#rollup) is enabled, multiple rows can be collapsed into one row even while retaining summary
-information. In the [rollup tutorial](../tutorials/tutorial-rollup.md), this is used to collapse netflow data to a
-single row per `(minute, srcIP, dstIP)` tuple, while retaining aggregate information about total packet and byte counts.
-2. Some aggregators, especially approximate ones, can be computed faster at query time even on non-rolled-up data if
-they are partially computed at ingestion time.
-
-Metrics are configured through the [`metricsSpec`](#metricsspec).
-
-## Rollup
-
-### What is rollup?
-
-Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. Rollup is
-a form of summarization or pre-aggregation. In practice, rolling up data can dramatically reduce the size of data that
-needs to be stored, reducing row counts by potentially orders of magnitude. This storage reduction does come at a cost:
-as we roll up data, we lose the ability to query individual events.
-
-When rollup is disabled, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar
-to what you would expect from a typical database that does not support a rollup feature.
-
-When rollup is enabled, then any rows that have identical [dimensions](#dimensions) and [timestamp](#primary-timestamp)
-to each other (after [`queryGranularity`-based truncation](#granularityspec)) can be collapsed, or _rolled up_, into a
-single row in Druid.
-
-By default, rollup is enabled.
-
-### Enabling or disabling rollup
-
-Rollup is controlled by the `rollup` setting in the [`granularitySpec`](#granularityspec). By default, it is `true`
-(enabled). Set this to `false` if you want Druid to store each record as-is, without any rollup summarization.
-
-### Example of rollup
-
-For an example of how to configure rollup, and of how the feature will modify your data, check out the
-[rollup tutorial](../tutorials/tutorial-rollup.md).
-
-### Maximizing rollup ratio
-
-You can measure the rollup ratio of a datasource by comparing the number of rows in Druid (`COUNT`) with the number of ingested
-events.  One way to do this is with a
-[Druid SQL](../querying/sql.md) query such as the following, where "count" refers to a `count`-type metric generated at ingestion time:
-
-```sql
-SELECT SUM("count") / (COUNT(*) * 1.0)
-FROM datasource
-```
-
-The higher this number is, the more benefit you are gaining from rollup.
-
-> See [Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about
-how counting works when rollup is enabled.
-
-Tips for maximizing rollup:
-
-- Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios
-you will achieve.
-- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios.
-- Adjusting `queryGranularity` at ingestion time (for example, using `PT5M` instead of `PT1M`) increases the
-likelihood of two rows in Druid having matching timestamps, and can improve your rollup ratios.
-- It can be beneficial to load the same data into more than one Druid datasource. Some users choose to create a "full"
-datasource that has rollup disabled (or enabled, but with a minimal rollup ratio) and an "abbreviated" datasource that
-has fewer dimensions and a higher rollup ratio. When queries only involve dimensions in the "abbreviated" set, using
-that datasource leads to much faster query times. This can often be done with just a small increase in storage
-footprint, since abbreviated datasources tend to be substantially smaller.
-- If you are using a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect
-rollup, you can potentially improve your rollup ratio by switching to a guaranteed perfect rollup option, or by
-[reindexing](data-management.md#reingesting-data) or [compacting](compaction.md) your data in the background after initial ingestion.
-
-### Perfect rollup vs Best-effort rollup
-
-Some Druid ingestion methods guarantee _perfect rollup_, meaning that input data are perfectly aggregated at ingestion
-time. Others offer _best-effort rollup_, meaning that input data might not be perfectly aggregated and thus there could
-be multiple segments holding rows with the same timestamp and dimension values.
-
-In general, ingestion methods that offer best-effort rollup do this because they are either parallelizing ingestion
-without a shuffling step (which would be required for perfect rollup), or because they are finalizing and publishing
-segments before all data for a time chunk has been received, which we call _incremental publishing_. In both of these
-cases, records that could theoretically be rolled up may end up in different segments. All types of streaming ingestion
-run in this mode.
-
-Ingestion methods that guarantee perfect rollup do it with an additional preprocessing step to determine intervals
-and partitioning before the actual data ingestion stage. This preprocessing step scans the entire input dataset, which
-generally increases the time required for ingestion, but provides information necessary for perfect rollup.
-
-The following table shows how each method handles rollup:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|`index_parallel` and `index` type may be either perfect or best-effort, based on configuration.|
-|[Hadoop](hadoop.md)|Always perfect.|
-|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Always best-effort.|
-|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Always best-effort.|
-
-## Partitioning
-
-### Why partition?
-
-Optimal partitioning and sorting of segments within your datasources can have substantial impact on footprint and
-performance.
-
-Druid datasources are always partitioned by time into _time chunks_, and each time chunk contains one or more segments.
-This partitioning happens for all ingestion methods, and is based on the `segmentGranularity` parameter of your
-ingestion spec's `dataSchema`.
-
-The segments within a particular time chunk may also be partitioned further, using options that vary based on the
-ingestion type you have chosen. In general, doing this secondary partitioning using a particular dimension will
-improve locality, meaning that rows with the same value for that dimension are stored together and can be accessed
-quickly.
-
-You will usually get the best performance and smallest overall footprint by partitioning your data on some "natural"
-dimension that you often filter by, if one exists. This will often improve compression - users have reported threefold
-storage size decreases - and it also tends to improve query performance as well.
-
-> Partitioning and sorting are best friends! If you do have a "natural" partitioning dimension, you should also consider
-> placing it first in the `dimensions` list of your `dimensionsSpec`, which tells Druid to sort rows within each segment
-> by that column. This will often improve compression even more, beyond the improvement gained by partitioning alone.
->
-> However, note that currently, Druid always sorts rows within a segment by timestamp first, even before the first
-> dimension listed in your `dimensionsSpec`. This can prevent dimension sorting from being maximally effective. If
-> necessary, you can work around this limitation by setting `queryGranularity` equal to `segmentGranularity` in your
-> [`granularitySpec`](#granularityspec), which will set all timestamps within the segment to the same value, and by saving
-> your "real" timestamp as a [secondary timestamp](schema-design.md#secondary-timestamps). This limitation may be removed
-> in a future version of Druid.
-
-### How to set up partitioning
-
-Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of
-flexibility. As of current Druid versions, If you are doing initial ingestion through a less-flexible method (like
-Kafka) then you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after it
-is initially ingested. This is a powerful technique: you can use it to ensure that any data older than a certain
-threshold is optimally partitioned, even as you continuously add new data from a stream.
-
-The following table shows how each ingestion method handles partitioning:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
-|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
-|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Partitioning in Druid is guided by how your Kafka topic is partitioned. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
-|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Partitioning in Druid is guided by how your Kinesis stream is sharded. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
-
-> Note that, of course, one way to partition data is to load it into separate datasources. This is a perfectly viable
-> approach and works very well when the number of datasources does not lead to excessive per-datasource overheads. If
-> you go with this approach, then you can ignore this section, since it is describing how to set up partitioning
-> _within a single datasource_.
->
-> For more details on splitting data up into separate datasources, and potential operational considerations, refer
-> to the [Multitenancy considerations](../querying/multitenancy.md) page.
-
-<a name="spec"></a>
-
-## Ingestion specs
-
-No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.md) or
-ongoing "supervisors" (which run and supervise a set of tasks over time). In any case, part of the task or supervisor
-definition is an _ingestion spec_.
-
-Ingestion specs consists of three main components:
-
-- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource),
-   [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed).
-- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the
-   documentation for each [ingestion method](#ingestion-methods).
-- [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each
-  [ingestion method](#ingestion-methods).
-
-Example ingestion spec for task type `index_parallel` (native batch):
-
-```
-{
-  "type": "index_parallel",
-  "spec": {
-    "dataSchema": {
-      "dataSource": "wikipedia",
-      "timestampSpec": {
-        "column": "timestamp",
-        "format": "auto"
-      },
-      "dimensionsSpec": {
-        "dimensions": [
-          "page",
-          "language",
-          { "type": "long", "name": "userId" }
-        ]
-      },
-      "metricsSpec": [
-        { "type": "count", "name": "count" },
-        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-      ],
-      "granularitySpec": {
-        "segmentGranularity": "day",
-        "queryGranularity": "none",
-        "intervals": [
-          "2013-08-31/2013-09-01"
-        ]
-      }
-    },
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "local",
-        "baseDir": "examples/indexing/",
-        "filter": "wikipedia_data.json"
-      },
-      "inputFormat": {
-        "type": "json",
-        "flattenSpec": {
-          "useFieldDiscovery": true,
-          "fields": [
-            { "type": "path", "name": "userId", "expr": "$.user.id" }
-          ]
-        }
-      }
-    },
-    "tuningConfig": {
-      "type": "index_parallel"
-    }
-  }
-}
-```
-
-The specific options supported by these sections will depend on the [ingestion method](#ingestion-methods) you have chosen.
-For more examples, refer to the documentation for each ingestion method.
-
-You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality
-available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports
-[Kafka](../development/extensions-core/kafka-ingestion.md),
-[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
-[native batch](native-batch.md) mode.
-
-## `dataSchema`
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
-except for _Hadoop_ ingestion. See the [Legacy `dataSchema` spec](#legacy-dataschema-spec) for the old spec.
-
-The `dataSchema` is a holder for the following components:
-
-- [datasource name](#datasource), [primary timestamp](#timestampspec),
-  [dimensions](#dimensionsspec), [metrics](#metricsspec), and 
-  [transforms and filters](#transformspec) (if needed).
-
-An example `dataSchema` is:
-
-```
-"dataSchema": {
-  "dataSource": "wikipedia",
-  "timestampSpec": {
-    "column": "timestamp",
-    "format": "auto"
-  },
-  "dimensionsSpec": {
-    "dimensions": [
-      "page",
-      "language",
-      { "type": "long", "name": "userId" }
-    ]
-  },
-  "metricsSpec": [
-    { "type": "count", "name": "count" },
-    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-  ],
-  "granularitySpec": {
-    "segmentGranularity": "day",
-    "queryGranularity": "none",
-    "intervals": [
-      "2013-08-31/2013-09-01"
-    ]
-  }
-}
-```
-
-### `dataSource`
-
-The `dataSource` is located in `dataSchema` → `dataSource` and is simply the name of the
-[datasource](../design/architecture.md#datasources-and-segments) that data will be written to. An example
-`dataSource` is:
-
-```
-"dataSource": "my-first-datasource"
-```
-
-### `timestampSpec`
-
-The `timestampSpec` is located in `dataSchema` → `timestampSpec` and is responsible for
-configuring the [primary timestamp](#primary-timestamp). An example `timestampSpec` is:
-
-```
-"timestampSpec": {
-  "column": "timestamp",
-  "format": "auto"
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `timestampSpec` can have the following components:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|column|Input row field to read the primary timestamp from.<br><br>Regardless of the name of this input field, the primary timestamp will always be stored as a column named `__time` in your Druid datasource.|timestamp|
-|format|Timestamp format. Options are: <ul><li>`iso`: ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"</li><li>`posix`: seconds since epoch</li><li>`millis`: milliseconds since epoch</li><li>`micro`: microseconds since epoch</li><li>`nano`: nanoseconds since epoch</li><li>`auto`: automatically detects ISO (either 'T' or space separator) or millis format</li><li>any [Joda DateTimeFormat string](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html)</li></ul>|auto|
-|missingValue|Timestamp to use for input records that have a null or missing timestamp `column`. Should be in ISO8601 format, like `"2000-01-01T01:02:03.456"`, even if you have specified something else for `format`. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. |none|
-
-### `dimensionsSpec`
-
-The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
-configuring [dimensions](#dimensions). An example `dimensionsSpec` is:
-
-```
-"dimensionsSpec" : {
-  "dimensions": [
-    "page",
-    "language",
-    { "type": "long", "name": "userId" }
-  ],
-  "dimensionExclusions" : [],
-  "spatialDimensions" : []
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `dimensionsSpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` |
-
-#### Dimension objects
-
-Each dimension in the `dimensions` list can either be a name or an object. Providing a name is equivalent to providing
-a `string` type dimension object with the given name, e.g. `"page"` is equivalent to `{"name": "page", "type": "string"}`.
-
-Dimension objects can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `string`, `long`, `float`, or `double`. | `string` |
-| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br><br>Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
-| createBitmapIndex | For `string` typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for `string` typed dimensions. | `true` |
-| multiValueHandling | Specify the type of handling for [multi-value fields](../querying/multi-value-dimensions.md). Possible values are `sorted_array`, `sorted_set`, and `array`. `sorted_array` and `sorted_set` order the array upon ingestion. `sorted_set` removes duplicates. `array` ingests data as-is | `sorted_array` |
-
-#### Inclusions and exclusions
-
-Druid will interpret a `dimensionsSpec` in two possible ways: _normal_ or _schemaless_.
-
-Normal interpretation occurs when either `dimensions` or `spatialDimensions` is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of `dimensionExclusions` will be ignored.
-
-Schemaless interpretation occurs when both `dimensions` and `spatialDimensions` are empty or null. In this case, the set of dimensions is determined in the following way:
-
-1. First, start from the set of all root-level fields from the input record, as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes all fields at the top level of a data structure, but does not included fields nested within maps or lists. To extract these, you must use a [`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
-2. If a [`flattenSpec`](./data-formats.md#flattenspec) is being used, the set of root-level fields includes any fields generated by the flattenSpec. The useFieldDiscovery parameter determines whether the original root-level fields will be retained or discarded.
-3. Any field listed in `dimensionExclusions` is excluded.
-4. The field listed as `column` in the [`timestampSpec`](#timestampspec) is excluded.
-5. Any field used as an input to an aggregator from the [metricsSpec](#metricsspec) is excluded.
-6. Any field with the same name as an aggregator from the [metricsSpec](#metricsspec) is excluded.
-7. All other fields are ingested as `string` typed dimensions with the [default settings](#dimension-objects).
-
-> Note: Fields generated by a [`transformSpec`](#transformspec) are not currently considered candidates for
-> schemaless dimension interpretation.
-
-### `metricsSpec`
-
-The `metricsSpec` is located in `dataSchema` → `metricsSpec` and is a list of [aggregators](../querying/aggregations.md)
-to apply at ingestion time. This is most useful when [rollup](#rollup) is enabled, since it's how you configure
-ingestion-time aggregation.
-
-An example `metricsSpec` is:
-
-```
-"metricsSpec": [
-  { "type": "count", "name": "count" },
-  { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-  { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-]
-```
-
-> Generally, when [rollup](#rollup) is disabled, you should have an empty `metricsSpec` (because without rollup,
-> Druid does not do any ingestion-time aggregation, so there is little reason to include an ingestion-time aggregator). However,
-> in some cases, it can still make sense to define metrics: for example, if you want to create a complex column as a way of
-> pre-computing part of an [approximate aggregation](../querying/aggregations.md#approximate-aggregations), this can only
-> be done by defining a metric in a `metricsSpec`.
-
-### `granularitySpec`
-
-The `granularitySpec` is located in `dataSchema` → `granularitySpec` and is responsible for configuring
-the following operations:
-
-1. Partitioning a datasource into [time chunks](../design/architecture.md#datasources-and-segments) (via `segmentGranularity`).
-2. Truncating the timestamp, if desired (via `queryGranularity`).
-3. Specifying which time chunks of segments should be created, for batch ingestion (via `intervals`).
-4. Specifying whether ingestion-time [rollup](#rollup) should be used or not (via `rollup`).
-
-Other than `rollup`, these operations are all based on the [primary timestamp](#primary-timestamp).
-
-An example `granularitySpec` is:
-
-```
-"granularitySpec": {
-  "segmentGranularity": "day",
-  "queryGranularity": "none",
-  "intervals": [
-    "2013-08-31/2013-09-01"
-  ],
-  "rollup": true
-}
-```
-
-A `granularitySpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `uniform` or `arbitrary`. In most cases you want to use `uniform`.| `uniform` |
-| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.<br><br>Ignored if `type` is set to `arbitrary`.| `day` |
-| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer, than `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).<br><br>Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` will be applied if it is set even when the `queryGranularity` is set to `none`. | `none` |
-| rollup | Whether to use ingestion-time [rollup](#rollup) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Your data will be rolled up if they have the exactly same timestamp. | `true` |
-| intervals | A list of intervals describing what time chunks of segments should be created. If `type` is set to `uniform`, this list will be broken up and rounded-off based on the `segmentGranularity`. If `type` is set to `arbitrary`, this list will be used as-is.<br><br>If `null` or not provided, batch ingestion tasks will generally determine which time chunks to output based on what timestamps are found in the input data.<br><br>If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.<br><br>Ignored for any form of streaming ingestion. | `null` |
-
-### `transformSpec`
-
-The `transformSpec` is located in `dataSchema` → `transformSpec` and is responsible for transforming and filtering
-records during ingestion time. It is optional. An example `transformSpec` is:
-
-```
-"transformSpec": {
-  "transforms": [
-    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
-  ],
-  "filter": {
-    "type": "selector",
-    "dimension": "country",
-    "value": "San Serriffe"
-  }
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Transforms
-
-The `transforms` list allows you to specify a set of expressions to evaluate on top of input data. Each transform has a
-"name" which can be referred to by your `dimensionsSpec`, `metricsSpec`, etc.
-
-If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that
-shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".
-
-Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular,
-they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field
-with another field containing all nulls, which will act similarly to removing the field.
-
-Transforms can refer to the [timestamp](#timestampspec) of an input row by referring to `__time` as part of the expression.
-They can also _replace_ the timestamp if you set their "name" to `__time`. In both cases, `__time` should be treated as
-a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
-`timestampSpec`.
-
-Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:
-
-```
-{
-  "type": "expression",
-  "name": "<output name>",
-  "expression": "<expr>"
-}
-```
-
-The `expression` is a [Druid query expression](../misc/math-expr.md).
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Filter
-
-The `filter` conditionally filters input rows during ingestion. Only rows that pass the filter will be
-ingested. Any of Druid's standard [query filters](../querying/filters.md) can be used. Note that within a
-`transformSpec`, the `transforms` are applied before the `filter`, so the filter can refer to a transform.
-
-### Legacy `dataSchema` spec
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
-except for _Hadoop_ ingestion. See [`dataSchema`](#dataschema) for the new spec.
-
-The legacy `dataSchema` spec has below two more components in addition to the ones listed in the [`dataSchema`](#dataschema) section above.
-
-- [input row parser](#parser-deprecated), [flattening of nested data](#flattenspec) (if needed)
-
-#### `parser` (Deprecated)
-
-In legacy `dataSchema`, the `parser` is located in the `dataSchema` → `parser` and is responsible for configuring a wide variety of
-items related to parsing input records. The `parser` is deprecated and it is highly recommended to use `inputFormat` instead.
-For details about `inputFormat` and supported `parser` types, see the ["Data formats" page](data-formats.md).
-
-For details about major components of the `parseSpec`, refer to their subsections:
-
-- [`timestampSpec`](#timestampspec), responsible for configuring the [primary timestamp](#primary-timestamp).
-- [`dimensionsSpec`](#dimensionsspec), responsible for configuring [dimensions](#dimensions).
-- [`flattenSpec`](#flattenspec), responsible for flattening nested data formats.
-
-An example `parser` is:
-
-```
-"parser": {
-  "type": "string",
-  "parseSpec": {
-    "format": "json",
-    "flattenSpec": {
-      "useFieldDiscovery": true,
-      "fields": [
-        { "type": "path", "name": "userId", "expr": "$.user.id" }
-      ]
-    },
-    "timestampSpec": {
-      "column": "timestamp",
-      "format": "auto"
-    },
-    "dimensionsSpec": {
-      "dimensions": [
-        "page",
-        "language",
-        { "type": "long", "name": "userId" }
-      ]
-    }
-  }
-}
-```
-
-#### `flattenSpec`
-
-In the legacy `dataSchema`, the `flattenSpec` is located in `dataSchema` → `parser` → `parseSpec` → `flattenSpec` and is responsible for
-bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
-See [Flatten spec](./data-formats.md#flattenspec) for more details.
-
-## `ioConfig`
-
-The `ioConfig` influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted
-filesystem, or any other supported source system. The `inputFormat` property applies to all
-[ingestion method](#ingestion-methods) except for Hadoop ingestion. The Hadoop ingestion still
-uses the [`parser`](#parser-deprecated) in the legacy `dataSchema`.
-The rest of `ioConfig` is specific to each individual ingestion method.
-An example `ioConfig` to read JSON data is:
-
-```json
-"ioConfig": {
-    "type": "<ingestion-method-specific type code>",
-    "inputFormat": {
-      "type": "json"
-    },
-    ...
-}
-```
-For more details, see the documentation provided by each [ingestion method](#ingestion-methods).
-
-## `tuningConfig`
-
-Tuning properties are specified in a `tuningConfig`, which goes at the top level of an ingestion spec. Some
-properties apply to all [ingestion methods](#ingestion-methods), but most are specific to each individual
-ingestion method. An example `tuningConfig` that sets all of the shared, common properties to their defaults
-is:
-
-```plaintext
-"tuningConfig": {
-  "type": "<ingestion-method-specific type code>",
-  "maxRowsInMemory": 1000000,
-  "maxBytesInMemory": <one-sixth of JVM memory>,
-  "indexSpec": {
-    "bitmap": { "type": "roaring" },
-    "dimensionCompression": "lz4",
-    "metricCompression": "lz4",
-    "longEncoding": "longs"
-  },
-  <other ingestion-method-specific properties>
-}
-```
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Each ingestion method has its own tuning type code. You must specify the type code that matches your ingestion method. Common options are `index`, `hadoop`, `kafka`, and `kinesis`.||
-|maxRowsInMemory|The maximum number of records to store in memory before persisting to disk. Note that this is the number of rows post-rollup, and so it may not be equal to the number of input records. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first).|`1000000`|
-|maxBytesInMemory|The maximum aggregate size of records, in bytes, to store in the JVM heap before persisting. This is based on a rough estimate of memory usage. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first). `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` until next persist will decreases, and task will fail when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.<br /><br />Setting maxBytesInMemory to -1 disables this check, meaning Druid will rely entirely on maxRowsInMemory to control memory usage. Setting it to zero means the default value will be used (one-sixth of JVM heap size).<br /><br />Note that the estimate of memory usage is designed to be an overestimate, and can be especially high when using complex ingest-time aggregators, including sketches. If this causes your indexing workloads to persist to disk too often, you can set maxBytesInMemory to -1 and rely on maxRowsInMemory instead.|One-sixth of max JVM heap size|
-|skipBytesInMemoryOverheadCheck|The calculation of maxBytesInMemory takes into account overhead objects created during ingestion and each intermediate persist. Setting this to true can exclude the bytes of these overhead objects from maxBytesInMemory check.|false|
-|indexSpec|Tune how data is indexed. See below for more information.|See table below|
-|Other properties|Each ingestion method has its own list of additional tuning properties. See the documentation for each method for a full list: [Kafka indexing service](../development/extensions-core/kafka-ingestion.md#tuningconfig), [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md#tuningconfig), [Native batch](native-batch.md#tuningconfig), and [Hadoop-based](hadoop.md#tuningconfig).||
-
-#### `indexSpec`
-
-The `indexSpec` object can include the following properties:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`. For type `roaring`, the boolean property `compressRunOnSerialization` (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient.|`{"type": "concise"}`|
-|dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, or `uncompressed`.|`lz4`|
-|metricCompression|Compression format for primitive type metric columns. Options are `lz4`, `lzf`, `uncompressed`, or `none` (which is more efficient than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
-|longEncoding|Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. `longs` stores the value as-is with 8 bytes each.|`longs`|
+| **[Rollup modes](./rollup.md)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
+| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [partitionsSpec](./native-batch.md#partitionsspec) for details.| Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [partitionsSpec](./native-batch.md#partitionsspec-1) for details. |
 
-Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each
-[ingestion method](#ingestion-methods) for details.
diff --git a/docs/ingestion/ingestion-spec.md b/docs/ingestion/ingestion-spec.md
new file mode 100644
index 0000000..d57f4f5
--- /dev/null
+++ b/docs/ingestion/ingestion-spec.md
@@ -0,0 +1,483 @@
+---
+id: ingestion-spec
+title: Ingestion spec reference
+sidebar_label: Ingestion spec
+description: Reference for the configuration options in the ingestion spec.
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+All ingestion methods use ingestion tasks to load data into Druid. Streaming ingestion uses ongoing supervisors that run and supervise a set of tasks over time. Native batch and Hadoop-based ingestion use a one-time [task](tasks.md). All types of ingestion use an _ingestion spec_ to configure ingestion.
+
+Ingestion specs consist of three main components:
+
+- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource),
+   [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed).
+- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the
+   documentation for each [ingestion method](./index.md#ingestion-methods).
+- [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each
+  [ingestion method](./index.md#ingestion-methods).
+
+Example ingestion spec for task type `index_parallel` (native batch):
+
+```
+{
+  "type": "index_parallel",
+  "spec": {
+    "dataSchema": {
+      "dataSource": "wikipedia",
+      "timestampSpec": {
+        "column": "timestamp",
+        "format": "auto"
+      },
+      "dimensionsSpec": {
+        "dimensions": [
+          "page",
+          "language",
+          { "type": "long", "name": "userId" }
+        ]
+      },
+      "metricsSpec": [
+        { "type": "count", "name": "count" },
+        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
+        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
+      ],
+      "granularitySpec": {
+        "segmentGranularity": "day",
+        "queryGranularity": "none",
+        "intervals": [
+          "2013-08-31/2013-09-01"
+        ]
+      }
+    },
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "local",
+        "baseDir": "examples/indexing/",
+        "filter": "wikipedia_data.json"
+      },
+      "inputFormat": {
+        "type": "json",
+        "flattenSpec": {
+          "useFieldDiscovery": true,
+          "fields": [
+            { "type": "path", "name": "userId", "expr": "$.user.id" }
+          ]
+        }
+      }
+    },
+    "tuningConfig": {
+      "type": "index_parallel"
+    }
+  }
+}
+```
+
+The specific options supported by these sections will depend on the [ingestion method](./index.md#ingestion-methods) you have chosen.
+For more examples, refer to the documentation for each ingestion method.
+
+You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality
+available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports
+[Kafka](../development/extensions-core/kafka-ingestion.md),
+[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
+[native batch](native-batch.md) mode.
+
+## `dataSchema`
+
+> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
+> except for _Hadoop_ ingestion. See the [Legacy `dataSchema` spec](#legacy-dataschema-spec) for the old spec.
+
+The `dataSchema` is a holder for the following components:
+
+- [datasource name](#datasource)
+- [primary timestamp](#timestampspec)
+- [dimensions](#dimensionsspec)
+- [metrics](#metricsspec)
+- [transforms and filters](#transformspec) (if needed).
+
+An example `dataSchema` is:
+
+```
+"dataSchema": {
+  "dataSource": "wikipedia",
+  "timestampSpec": {
+    "column": "timestamp",
+    "format": "auto"
+  },
+  "dimensionsSpec": {
+    "dimensions": [
+      "page",
+      "language",
+      { "type": "long", "name": "userId" }
+    ]
+  },
+  "metricsSpec": [
+    { "type": "count", "name": "count" },
+    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
+    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
+  ],
+  "granularitySpec": {
+    "segmentGranularity": "day",
+    "queryGranularity": "none",
+    "intervals": [
+      "2013-08-31/2013-09-01"
+    ]
+  }
+}
+```
+
+### `dataSource`
+
+The `dataSource` is located in `dataSchema` → `dataSource` and is simply the name of the
+[datasource](../design/architecture.md#datasources-and-segments) that data will be written to. An example
+`dataSource` is:
+
+```
+"dataSource": "my-first-datasource"
+```
+
+### `timestampSpec`
+
+The `timestampSpec` is located in `dataSchema` → `timestampSpec` and is responsible for
+configuring the [primary timestamp](./data-model.md#primary-timestamp). An example `timestampSpec` is:
+
+```
+"timestampSpec": {
+  "column": "timestamp",
+  "format": "auto"
+}
+```
+
+> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
+> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
+> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
+> your ingestion spec.
+
+A `timestampSpec` can have the following components:
+
+|Field|Description|Default|
+|-----|-----------|-------|
+|column|Input row field to read the primary timestamp from.<br><br>Regardless of the name of this input field, the primary timestamp will always be stored as a column named `__time` in your Druid datasource.|timestamp|
+|format|Timestamp format. Options are: <ul><li>`iso`: ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"</li><li>`posix`: seconds since epoch</li><li>`millis`: milliseconds since epoch</li><li>`micro`: microseconds since epoch</li><li>`nano`: nanoseconds since epoch</li><li>`auto`: automatically detects ISO (either 'T' or space separator) or millis format</li><li>any [Joda DateTimeFormat string](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html)</li></ul>|auto|
+|missingValue|Timestamp to use for input records that have a null or missing timestamp `column`. Should be in ISO8601 format, like `"2000-01-01T01:02:03.456"`, even if you have specified something else for `format`. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. |none|
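+
+For example, a `timestampSpec` like the following (the column name and fallback value are illustrative) reads a custom Joda-formatted timestamp and assigns a fixed time to rows that are missing one:
+
+```
+"timestampSpec": {
+  "column": "eventTime",
+  "format": "yyyy-MM-dd HH:mm:ss",
+  "missingValue": "2010-01-01T00:00:00.000Z"
+}
+```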
+
+### `dimensionsSpec`
+
+The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
+configuring [dimensions](./data-model.md#dimensions). An example `dimensionsSpec` is:
+
+```
+"dimensionsSpec" : {
+  "dimensions": [
+    "page",
+    "language",
+    { "type": "long", "name": "userId" }
+  ],
+  "dimensionExclusions" : [],
+  "spatialDimensions" : []
+}
+```
+
+> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
+> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
+> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
+> your ingestion spec.
+
+A `dimensionsSpec` can have the following components:
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
+| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
+| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` |
+
+#### Dimension objects
+
+Each dimension in the `dimensions` list can either be a name or an object. Providing a name is equivalent to providing
+a `string` type dimension object with the given name, e.g. `"page"` is equivalent to `{"name": "page", "type": "string"}`.
+
+Dimension objects can have the following components:
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| type | Either `string`, `long`, `float`, or `double`. | `string` |
+| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br><br>Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
+| createBitmapIndex | For `string` typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for `string` typed dimensions. | `true` |
+| multiValueHandling | Specify the type of handling for [multi-value fields](../querying/multi-value-dimensions.md). Possible values are `sorted_array`, `sorted_set`, and `array`. `sorted_array` and `sorted_set` order the array upon ingestion. `sorted_set` removes duplicates. `array` ingests data as-is. | `sorted_array` |
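+
+For example, a dimension object like the one below (the column name is illustrative) declares a string dimension, skips its bitmap index, and deduplicates multi-value rows:
+
+```
+{
+  "type": "string",
+  "name": "sessionId",
+  "createBitmapIndex": false,
+  "multiValueHandling": "sorted_set"
+}
+```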
+
+#### Inclusions and exclusions
+
+Druid will interpret a `dimensionsSpec` in two possible ways: _normal_ or _schemaless_.
+
+Normal interpretation occurs when either `dimensions` or `spatialDimensions` is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of `dimensionExclusions` will be ignored.
+
+Schemaless interpretation occurs when both `dimensions` and `spatialDimensions` are empty or null. In this case, the set of dimensions is determined in the following way:
+
+1. First, start from the set of all root-level fields from the input record, as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes all fields at the top level of a data structure, but does not include fields nested within maps or lists. To extract these, you must use a [`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
+2. If a [`flattenSpec`](./data-formats.md#flattenspec) is being used, the set of root-level fields includes any fields generated by the `flattenSpec`. The `useFieldDiscovery` parameter determines whether the original root-level fields will be retained or discarded.
+3. Any field listed in `dimensionExclusions` is excluded.
+4. The field listed as `column` in the [`timestampSpec`](#timestampspec) is excluded.
+5. Any field used as an input to an aggregator from the [metricsSpec](#metricsspec) is excluded.
+6. Any field with the same name as an aggregator from the [metricsSpec](#metricsspec) is excluded.
+7. All other fields are ingested as `string` typed dimensions with the [default settings](#dimension-objects).
+
+> Note: Fields generated by a [`transformSpec`](#transformspec) are not currently considered candidates for
+> schemaless dimension interpretation.
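+
+For example, the following `dimensionsSpec` (the excluded field names are illustrative) triggers schemaless interpretation: every remaining root-level field other than the timestamp column and metric inputs is ingested as a string dimension:
+
+```
+"dimensionsSpec": {
+  "dimensions": [],
+  "dimensionExclusions": ["debugFlags", "rawPayload"]
+}
+```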
+
+### `metricsSpec`
+
+The `metricsSpec` is located in `dataSchema` → `metricsSpec` and is a list of [aggregators](../querying/aggregations.md)
+to apply at ingestion time. This is most useful when [rollup](./rollup.md) is enabled, since it's how you configure
+ingestion-time aggregation.
+
+An example `metricsSpec` is:
+
+```
+"metricsSpec": [
+  { "type": "count", "name": "count" },
+  { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
+  { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
+]
+```
+
+> Generally, when [rollup](./rollup.md) is disabled, you should have an empty `metricsSpec` (because without rollup,
+> Druid does not do any ingestion-time aggregation, so there is little reason to include an ingestion-time aggregator). However,
+> in some cases, it can still make sense to define metrics: for example, if you want to create a complex column as a way of
+> pre-computing part of an [approximate aggregation](../querying/aggregations.md#approximate-aggregations), this can only
+> be done by defining a metric in a `metricsSpec`.
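+
+As a sketch of that pattern (assuming the `druid-datasketches` extension is loaded; column names are illustrative), the `metricsSpec` below pre-builds a Theta sketch at ingestion time for later approximate distinct counts:
+
+```
+"metricsSpec": [
+  { "type": "thetaSketch", "name": "userId_sketch", "fieldName": "userId" }
+]
+```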
+
+### `granularitySpec`
+
+The `granularitySpec` is located in `dataSchema` → `granularitySpec` and is responsible for configuring
+the following operations:
+
+1. Partitioning a datasource into [time chunks](../design/architecture.md#datasources-and-segments) (via `segmentGranularity`).
+2. Truncating the timestamp, if desired (via `queryGranularity`).
+3. Specifying which time chunks of segments should be created, for batch ingestion (via `intervals`).
+4. Specifying whether ingestion-time [rollup](./rollup.md) should be used or not (via `rollup`).
+
+Other than `rollup`, these operations are all based on the [primary timestamp](./data-model.md#primary-timestamp).
+
+An example `granularitySpec` is:
+
+```
+"granularitySpec": {
+  "segmentGranularity": "day",
+  "queryGranularity": "none",
+  "intervals": [
+    "2013-08-31/2013-09-01"
+  ],
+  "rollup": true
+}
+```
+
+A `granularitySpec` can have the following components:
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| type | Either `uniform` or `arbitrary`. In most cases you want to use `uniform`.| `uniform` |
+| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.<br><br>Ignored if `type` is set to `arbitrary`.| `day` |
+| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to or finer than `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc.).<br><br>Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` will be applied if it is set even when the `queryGranularity` is set to `none`. | `none` |
+| rollup | Whether to use ingestion-time [rollup](./rollup.md) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Rows are rolled up only when they have exactly the same timestamp and dimension values. | `true` |
+| intervals | A list of intervals describing what time chunks of segments should be created. If `type` is set to `uniform`, this list will be broken up and rounded-off based on the `segmentGranularity`. If `type` is set to `arbitrary`, this list will be used as-is.<br><br>If `null` or not provided, batch ingestion tasks will generally determine which time chunks to output based on what timestamps are found in the input data.<br><br>If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.<br><br>Ignored for any form of streaming ingestion. | `null` |
+
+### `transformSpec`
+
+The `transformSpec` is located in `dataSchema` → `transformSpec` and is responsible for transforming and filtering
+records during ingestion time. It is optional. An example `transformSpec` is:
+
+```
+"transformSpec": {
+  "transforms": [
+    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
+  ],
+  "filter": {
+    "type": "selector",
+    "dimension": "country",
+    "value": "San Serriffe"
+  }
+}
+```
+
+> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
+> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
+> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
+> your ingestion spec.
+
+#### Transforms
+
+The `transforms` list allows you to specify a set of expressions to evaluate on top of input data. Each transform has a
+"name" which can be referred to by your `dimensionsSpec`, `metricsSpec`, etc.
+
+If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that
+shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".
+
+Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular,
+they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field
+with another field containing all nulls, which will act similarly to removing the field.
+
+Transforms can refer to the [timestamp](#timestampspec) of an input row by referring to `__time` as part of the expression.
+They can also _replace_ the timestamp if you set their "name" to `__time`. In both cases, `__time` should be treated as
+a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
+`timestampSpec`.
+
+Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:
+
+```
+{
+  "type": "expression",
+  "name": "<output name>",
+  "expression": "<expr>"
+}
+```
+
+The `expression` is a [Druid query expression](../misc/math-expr.md).
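+
+For instance, the following transform (purely illustrative) replaces the primary timestamp by shifting it forward one hour, treating `__time` as a millisecond value:
+
+```
+{
+  "type": "expression",
+  "name": "__time",
+  "expression": "__time + 3600000"
+}
+```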
+
+> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
+> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
+> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
+> your ingestion spec.
+
+#### Filter
+
+The `filter` conditionally filters input rows during ingestion. Only rows that pass the filter will be
+ingested. Any of Druid's standard [query filters](../querying/filters.md) can be used. Note that within a
+`transformSpec`, the `transforms` are applied before the `filter`, so the filter can refer to a transform.
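+
+For example, because transforms run before the filter, a `transformSpec` like this illustrative variant of the earlier example can filter on a transform's output:
+
+```
+"transformSpec": {
+  "transforms": [
+    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
+  ],
+  "filter": {
+    "type": "selector",
+    "dimension": "countryUpper",
+    "value": "SAN SERRIFFE"
+  }
+}
+```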
+
+### Legacy `dataSchema` spec
+
+> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
+> except for _Hadoop_ ingestion. See [`dataSchema`](#dataschema) for the new spec.
+
+The legacy `dataSchema` spec has the following two components in addition to the ones listed in the [`dataSchema`](#dataschema) section above:
+
+- [input row parser](#parser-deprecated)
+- [flattening of nested data](#flattenspec) (if needed)
+
+#### `parser` (Deprecated)
+
+In the legacy `dataSchema`, the `parser` is located in `dataSchema` → `parser` and is responsible for configuring a wide variety of
+items related to parsing input records. The `parser` is deprecated; it is highly recommended to use `inputFormat` instead.
+For details about `inputFormat` and supported `parser` types, see the ["Data formats" page](data-formats.md).
+
+For details about major components of the `parseSpec`, refer to their subsections:
+
+- [`timestampSpec`](#timestampspec), responsible for configuring the [primary timestamp](./data-model.md#primary-timestamp).
+- [`dimensionsSpec`](#dimensionsspec), responsible for configuring [dimensions](./data-model.md#dimensions).
+- [`flattenSpec`](#flattenspec), responsible for flattening nested data formats.
+
+An example `parser` is:
+
+```
+"parser": {
+  "type": "string",
+  "parseSpec": {
+    "format": "json",
+    "flattenSpec": {
+      "useFieldDiscovery": true,
+      "fields": [
+        { "type": "path", "name": "userId", "expr": "$.user.id" }
+      ]
+    },
+    "timestampSpec": {
+      "column": "timestamp",
+      "format": "auto"
+    },
+    "dimensionsSpec": {
+      "dimensions": [
+        "page",
+        "language",
+        { "type": "long", "name": "userId" }
+      ]
+    }
+  }
+}
+```
+
+#### `flattenSpec`
+
+In the legacy `dataSchema`, the `flattenSpec` is located in `dataSchema` → `parser` → `parseSpec` → `flattenSpec` and is responsible for
+bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
+See [Flatten spec](./data-formats.md#flattenspec) for more details.
+
+## `ioConfig`
+
+The `ioConfig` influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted
+filesystem, or any other supported source system. The `inputFormat` property applies to all
+[ingestion methods](./index.md#ingestion-methods) except Hadoop ingestion, which still
+uses the [`parser`](#parser-deprecated) in the legacy `dataSchema`.
+The rest of `ioConfig` is specific to each individual ingestion method.
+An example `ioConfig` to read JSON data is:
+
+```json
+"ioConfig": {
+    "type": "<ingestion-method-specific type code>",
+    "inputFormat": {
+      "type": "json"
+    },
+    ...
+}
+```
+For more details, see the documentation provided by each [ingestion method](./index.md#ingestion-methods).
+
+## `tuningConfig`
+
+Tuning properties are specified in a `tuningConfig`, which goes at the top level of an ingestion spec. Some
+properties apply to all [ingestion methods](./index.md#ingestion-methods), but most are specific to each individual
+ingestion method. An example `tuningConfig` that sets all of the shared, common properties to their defaults
+is:
+
+```plaintext
+"tuningConfig": {
+  "type": "<ingestion-method-specific type code>",
+  "maxRowsInMemory": 1000000,
+  "maxBytesInMemory": <one-sixth of JVM memory>,
+  "indexSpec": {
+    "bitmap": { "type": "roaring" },
+    "dimensionCompression": "lz4",
+    "metricCompression": "lz4",
+    "longEncoding": "longs"
+  },
+  <other ingestion-method-specific properties>
+}
+```
+
+|Field|Description|Default|
+|-----|-----------|-------|
+|type|Each ingestion method has its own tuning type code. You must specify the type code that matches your ingestion method. Common options are `index`, `hadoop`, `kafka`, and `kinesis`.||
+|maxRowsInMemory|The maximum number of records to store in memory before persisting to disk. Note that this is the number of rows post-rollup, and so it may not be equal to the number of input records. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` is reached (whichever happens first).|`1000000`|
+|maxBytesInMemory|The maximum aggregate size of records, in bytes, to store in the JVM heap before persisting. This is based on a rough estimate of memory usage. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` is reached (whichever happens first). `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` available until the next persist decreases, and the task fails when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.<br /><br />Setting maxBytesInMemory to -1 disables this check, meaning Druid will rely entirely on maxRowsInMemory to control memory usage. Setting it to zero means the default value will be used (one-sixth of JVM heap size).<br /><br />Note that the estimate of memory usage is designed to be an overestimate, and can be especially high when using complex ingest-time aggregators, including sketches. If this causes your indexing workloads to persist to disk too often, you can set maxBytesInMemory to -1 and rely on maxRowsInMemory instead.|One-sixth of max JVM heap size|
+|skipBytesInMemoryOverheadCheck|The calculation of maxBytesInMemory takes into account overhead objects created during ingestion and each intermediate persist. Setting this to true can exclude the bytes of these overhead objects from the `maxBytesInMemory` check.|false|
+|indexSpec|Tune how data is indexed. See below for more information.|See table below|
+|Other properties|Each ingestion method has its own list of additional tuning properties. See the documentation for each method for a full list: [Kafka indexing service](../development/extensions-core/kafka-ingestion.md#tuningconfig), [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md#tuningconfig), [Native batch](native-batch.md#tuningconfig), and [Hadoop-based](hadoop.md#tuningconfig).||
+
+#### `indexSpec`
+
+The `indexSpec` object can include the following properties:
+
+|Field|Description|Default|
+|-----|-----------|-------|
+|bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`. For type `roaring`, the boolean property `compressRunOnSerialization` (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient.|`{"type": "concise"}`|
+|dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, or `uncompressed`.|`lz4`|
+|metricCompression|Compression format for primitive type metric columns. Options are `lz4`, `lzf`, `uncompressed`, or `none` (which is more efficient than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
+|longEncoding|Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` encodes the values using an offset or lookup table depending on column cardinality, and stores them with variable sizes. `longs` stores the values as-is, using 8 bytes each.|`longs`|
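+
+For example, an `indexSpec` such as the following (values chosen purely for illustration) selects Roaring bitmaps with run-length encoding and lets Druid choose a long encoding per column:
+
+```
+"indexSpec": {
+  "bitmap": { "type": "roaring", "compressRunOnSerialization": true },
+  "dimensionCompression": "lz4",
+  "metricCompression": "lz4",
+  "longEncoding": "auto"
+}
+```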
+
+Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each
+[ingestion method](./index.md#ingestion-methods) for details.
diff --git a/docs/ingestion/native-batch.md b/docs/ingestion/native-batch.md
index f477fab..9c1ef73 100644
--- a/docs/ingestion/native-batch.md
+++ b/docs/ingestion/native-batch.md
@@ -204,7 +204,7 @@
 
 This field is required.
 
-See [Ingestion Spec DataSchema](../ingestion/index.md#dataschema)
+See [Ingestion Spec DataSchema](../ingestion/ingestion-spec.md#dataschema)
 
 If you specify `intervals` explicitly in your dataSchema's `granularitySpec`, batch ingestion will lock the full intervals
 specified when it starts up, and you will learn quickly if the specified interval overlaps with locks held by other
@@ -238,10 +238,10 @@
 |numShards|Deprecated. Use `partitionsSpec` instead. Directly specify the number of shards to create when using a `hashed` `partitionsSpec`. If this is specified and `intervals` is specified in the `granularitySpec`, the index task can skip the determine intervals/partitions pass through the data. `numShards` cannot be specified if `maxRowsPerSegment` is set.|null|no|
 |splitHintSpec|Used to give a hint to control the amount of data that each first phase task reads. This hint could be ignored depending on the implementation of the input source. See [Split hint spec](#split-hint-spec) for more details.|size-based split hint spec|no|
 |partitionsSpec|Defines how to partition data in each timeChunk, see [PartitionsSpec](#partitionsspec)|`dynamic` if `forceGuaranteedRollup` = false, `hashed` or `single_dim` if `forceGuaranteedRollup` = true|no|
-|indexSpec|Defines segment storage format options to be used at indexing time, see [IndexSpec](index.md#indexspec)|null|no|
-|indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](index.md#indexspec) for possible values.|same as indexSpec|no|
+|indexSpec|Defines segment storage format options to be used at indexing time, see [IndexSpec](ingestion-spec.md#indexspec)|null|no|
+|indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into the final published segment. See [IndexSpec](ingestion-spec.md#indexspec) for possible values.|same as indexSpec|no|
 |maxPendingPersists|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists).|0 (meaning one persist can be running concurrently with ingestion, and none can be queued up)|no|
-|forceGuaranteedRollup|Forces guaranteeing the [perfect rollup](../ingestion/index.md#rollup). The perfect rollup optimizes the total size of generated segments and querying time while indexing time will be increased. If this is set to true, `intervals` in `granularitySpec` must be set and `hashed` or `single_dim` must be used for `partitionsSpec`. This flag cannot be used with `appendToExisting` of IOConfig. For more details, see the below __Segment pushing modes__ section.|false|no|
+|forceGuaranteedRollup|Forces guaranteeing the [perfect rollup](rollup.md). The perfect rollup optimizes the total size of generated segments and querying time while indexing time will be increased. If this is set to true, `intervals` in `granularitySpec` must be set and `hashed` or `single_dim` must be used for `partitionsSpec`. This flag cannot be used with `appendToExisting` of IOConfig. For more details, see the below __Segment pushing modes__ section.|false|no|
 |reportParseExceptions|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|false|no|
 |pushTimeout|Milliseconds to wait for pushing segments. It must be >= 0, where 0 means to wait forever.|0|no|
 |segmentWriteOutMediumFactory|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](#segmentwriteoutmediumfactory).|Not specified, the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used|no|
@@ -282,7 +282,7 @@
 ### `partitionsSpec`
 
 PartitionsSpec is used to describe the secondary partitioning method.
-You should use different partitionsSpec depending on the [rollup mode](../ingestion/index.md#rollup) you want.
+You should use different partitionsSpec depending on the [rollup mode](rollup.md) you want.
 For perfect rollup, you should use either `hashed` (partitioning based on the hash of dimensions in each row) or
 `single_dim` (based on ranges of a single dimension). For best-effort rollup, you should use `dynamic`.
 
@@ -291,13 +291,14 @@
 | PartitionsSpec | Ingestion speed | Partitioning method | Supported rollup mode | Secondary partition pruning at query time |
 |----------------|-----------------|---------------------|-----------------------|-------------------------------|
 | `dynamic` | Fastest  | [Dynamic partitioning](#dynamic-partitioning) based on the number of rows in a segment. | Best-effort rollup | N/A |
-| `hashed`  | Moderate | Multiple dimension [hash-based partitioning](#hash-based-partitioning) may reduce both your datasource size and query latency by improving data locality. See [Partitioning](./index.md#partitioning) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries. Since the broker knows how to hash `partitionDimensions` values to locate a segment, given a query including a filter on all the `partitionDimensions`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimensions` for query processing.<br/><br/>Note that `partitionDimensions` must be set at ingestion time to enable secondary partition pruning at query time.|
-| `single_dim` | Slowest | Single dimension [range partitioning](#single-dimension-range-partitioning) may reduce your datasource size and query latency by improving data locality. See [Partitioning](./index.md#partitioning) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries. Since the broker knows the range of `partitionDimension` values in each segment, given a query including a filter on the `partitionDimension`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimension` for query processing. |
+| `hashed`  | Moderate | Multiple dimension [hash-based partitioning](#hash-based-partitioning) may reduce both your datasource size and query latency by improving data locality. See [Partitioning](./partitioning.md) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries. Since the broker knows how to hash `partitionDimensions` values to locate a segment, given a query including a filter on all the `partitionDimensions`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimensions` for query processing.<br/><br/>Note that `partitionDimensions` must be set at ingestion time to enable secondary partition pruning at query time.|
+| `single_dim` | Slowest | Single dimension [range partitioning](#single-dimension-range-partitioning) may reduce your datasource size and query latency by improving data locality. See [Partitioning](./partitioning.md) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries. Since the broker knows the range of `partitionDimension` values in each segment, given a query including a filter on the `partitionDimension`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimension` for query processing. |
+
 
 The recommended use case for each partitionsSpec is:
 - If your data has a uniformly distributed column which is frequently used in your queries,
 consider using `single_dim` partitionsSpec to maximize the performance of most of your queries.
-- If your data doesn't have a uniformly distributed column, but is expected to have a [high rollup ratio](./index.md#maximizing-rollup-ratio)
+- If your data doesn't have a uniformly distributed column, but is expected to have a [high rollup ratio](./rollup.md#maximizing-rollup-ratio)
 when you roll up with some dimensions, consider using `hashed` partitionsSpec.
 It could reduce the size of datasource and query latency by improving data locality.
 - If the above two scenarios are not the case or you don't need to roll up your datasource,
@@ -741,7 +742,7 @@
 
 This field is required.
 
-See the [`dataSchema`](../ingestion/index.md#dataschema) section of the ingestion docs for details.
+See the [`dataSchema`](./ingestion-spec.md#dataschema) section of the ingestion docs for details.
 
 If you do not specify `intervals` explicitly in your dataSchema's granularitySpec, the Local Index Task will do an extra
 pass over the data to determine the range to lock when it starts up.  If you specify `intervals` explicitly, any rows
@@ -772,10 +773,10 @@
 |numShards|Deprecated. Use `partitionsSpec` instead. Directly specify the number of shards to create. If this is specified and `intervals` is specified in the `granularitySpec`, the index task can skip the determine intervals/partitions pass through the data. `numShards` cannot be specified if `maxRowsPerSegment` is set.|null|no|
 |partitionDimensions|Deprecated. Use `partitionsSpec` instead. The dimensions to partition on. Leave blank to select all dimensions. Only used with `forceGuaranteedRollup` = true, will be ignored otherwise.|null|no|
 |partitionsSpec|Defines how to partition data in each timeChunk, see [PartitionsSpec](#partitionsspec)|`dynamic` if `forceGuaranteedRollup` = false, `hashed` if `forceGuaranteedRollup` = true|no|
-|indexSpec|Defines segment storage format options to be used at indexing time, see [IndexSpec](index.md#indexspec)|null|no|
-|indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](index.md#indexspec) for possible values.|same as indexSpec|no|
+|indexSpec|Defines segment storage format options to be used at indexing time, see [IndexSpec](ingestion-spec.md#indexspec)|null|no|
+|indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into the final published segment. See [IndexSpec](ingestion-spec.md#indexspec) for possible values.|same as indexSpec|no|
 |maxPendingPersists|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists).|0 (meaning one persist can be running concurrently with ingestion, and none can be queued up)|no|
-|forceGuaranteedRollup|Forces guaranteeing the [perfect rollup](../ingestion/index.md#rollup). The perfect rollup optimizes the total size of generated segments and querying time while indexing time will be increased. If this is set to true, the index task will read the entire input data twice: one for finding the optimal number of partitions per time chunk and one for generating segments. Note that the result segments would be hash-partitioned. This flag cannot be used with `appendToExisting` of IOConfig. For more details, see the below __Segment pushing modes__ section.|false|no|
+|forceGuaranteedRollup|Forces guaranteeing the [perfect rollup](rollup.md). The perfect rollup optimizes the total size of generated segments and querying time while indexing time will be increased. If this is set to true, the index task will read the entire input data twice: once for finding the optimal number of partitions per time chunk and once for generating segments. Note that the result segments would be hash-partitioned. This flag cannot be used with `appendToExisting` of IOConfig. For more details, see the below __Segment pushing modes__ section.|false|no|
 |reportParseExceptions|DEPRECATED. If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped. Setting `reportParseExceptions` to true will override existing configurations for `maxParseExceptions` and `maxSavedParseExceptions`, setting `maxParseExceptions` to 0 and limiting `maxSavedParseExceptions` to no more than 1.|false|no|
 |pushTimeout|Milliseconds to wait for pushing segments. It must be >= 0, where 0 means to wait forever.|0|no|
 |segmentWriteOutMediumFactory|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](#segmentwriteoutmediumfactory).|Not specified, the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used|no|
@@ -786,7 +787,7 @@
 ### `partitionsSpec`
 
 PartitionsSpec is to describe the secondary partitioning method.
-You should use different partitionsSpec depending on the [rollup mode](../ingestion/index.md#rollup) you want.
+You should use different partitionsSpec depending on the [rollup mode](rollup.md) you want.
 For perfect rollup, you should use `hashed`.
 
 |property|description|default|required?|
@@ -815,7 +816,7 @@
 
 While ingesting data using the Index task, it creates segments from the input data and pushes them. For segment pushing,
 the Index task supports two segment pushing modes, i.e., _bulk pushing mode_ and _incremental pushing mode_ for
-[perfect rollup and best-effort rollup](../ingestion/index.md#rollup), respectively.
+[perfect rollup and best-effort rollup](rollup.md), respectively.
 
 In the bulk pushing mode, every segment is pushed at the very end of the index task. Until then, created segments
 are stored in the memory and local storage of the process running the index task. As a result, this mode might cause a
@@ -1376,8 +1377,8 @@
 The Druid input source can be used for a variety of purposes, including:
 
 - Creating new datasources that are rolled-up copies of existing datasources.
-- Changing the [partitioning or sorting](index.md#partitioning) of a datasource to improve performance.
-- Updating or removing rows using a [`transformSpec`](index.md#transformspec).
+- Changing the [partitioning or sorting](./partitioning.md) of a datasource to improve performance.
+- Updating or removing rows using a [`transformSpec`](./ingestion-spec.md#transformspec).
 
 When using the Druid input source, the timestamp column shows up as a numeric field named `__time` set to the number
 of milliseconds since the epoch (January 1, 1970 00:00:00 UTC). It is common to use this in the timestampSpec, if you
diff --git a/docs/ingestion/partitioning.md b/docs/ingestion/partitioning.md
new file mode 100644
index 0000000..ed3318e
--- /dev/null
+++ b/docs/ingestion/partitioning.md
@@ -0,0 +1,69 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+You can use segment partitioning and sorting within your Druid datasources to reduce the size of your data and increase performance.
+
+One way to partition is to load data into separate datasources. This is a perfectly viable approach that works very well when the number of datasources does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It does not cover how to use multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.
+
+## Time chunk partitioning
+
+Druid always partitions datasources by time into _time chunks_. Each time chunk contains one or more segments. This partitioning happens for all ingestion methods based on the `segmentGranularity` parameter in your ingestion spec `dataSchema` object.
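+
+For example, the following `granularitySpec` sketch (the value shown is illustrative) creates day-sized time chunks:
+
+```json
+"granularitySpec": {
+  "segmentGranularity": "day"
+}
+```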
+
+## Secondary partitioning
+
+Druid can further partition segments within a particular time chunk, using options that vary based on the ingestion type you have chosen. In general, secondary partitioning on a particular dimension improves locality. This means that rows with the same value for that dimension are stored together, decreasing access time.
+
+To achieve the best performance and smallest overall footprint, partition your data on a "natural"
+dimension that you often use as a filter when possible. Such partitioning often improves compression and query performance. For example, some cases have yielded threefold storage size decreases.
+
+## Partitioning and sorting
+
+Partitioning and sorting work well together. If you do have a "natural" partitioning dimension, consider placing it first in the `dimensions` list of your `dimensionsSpec`. This way Druid sorts rows within each segment by that column. This sorting configuration frequently improves compression more than using partitioning alone.
+
+> Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can limit the effectiveness of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all timestamps within the segment to the same value, letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.
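+
+As a sketch of the configuration described above, assuming a hypothetical natural dimension named `host`, you would list it first in the `dimensions` list of your `dimensionsSpec` (the other dimension names here are also hypothetical):
+
+```json
+"dimensionsSpec": {
+  "dimensions": [
+    "host",
+    "service",
+    "status"
+  ]
+}
+```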
+
+## How to configure partitioning
+
+Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. If you are doing initial ingestion through a less-flexible method like
+Kafka, you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after initial ingestion. This is a powerful technique you can use to optimally partition any data older than a certain time threshold while you continuously add new data from a stream.
+
+The following table shows how each ingestion method handles partitioning:
+
+|Method|How it works|
+|------|------------|
+|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
+|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
+|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Kafka topic partitioning defines how Druid partitions the datasource. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
+|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Kinesis stream sharding defines how Druid partitions the datasource. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
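+
+For example, with native batch you might request hash partitioning on a hypothetical dimension named `host` using a `partitionsSpec` like the following sketch (perfect rollup must be enabled for `hashed` partitioning, as described in [`partitionsSpec`](native-batch.md#partitionsspec)):
+
+```json
+"tuningConfig": {
+  "type": "index_parallel",
+  "forceGuaranteedRollup": true,
+  "partitionsSpec": {
+    "type": "hashed",
+    "partitionDimensions": ["host"]
+  }
+}
+```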
+
+
+## Learn more
+See the following topics for more information:
+* [`partitionsSpec`](native-batch.md#partitionsspec) for more detail on partitioning with Native Batch ingestion.
+* [Reindexing](data-management.md#reingesting-data) and [Compaction](compaction.md) for information on how to repartition existing data in Druid.
+
diff --git a/docs/ingestion/rollup.md b/docs/ingestion/rollup.md
new file mode 100644
index 0000000..9c849f3
--- /dev/null
+++ b/docs/ingestion/rollup.md
@@ -0,0 +1,79 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+Druid can roll up data at ingestion time to reduce the amount of raw data to store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade-off for the efficiency of rollup, you lose the ability to query individual events.
+
+At ingestion time, you control rollup with the `rollup` setting in the [`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by default. This means Druid combines into a single row any rows that have identical [dimension](./data-model.md#dimensions) values and [timestamp](./data-model.md#primary-timestamp) values after [`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec).
+
+When you disable rollup, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar to databases that do not support a rollup feature. Set `rollup` to `false` if you want Druid to store each record as-is, without any rollup summarization.
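+
+For example, a `granularitySpec` like the following sketch (values are illustrative) keeps rollup enabled and truncates timestamps to the minute before Druid combines identical rows:
+
+```json
+"granularitySpec": {
+  "segmentGranularity": "day",
+  "queryGranularity": "minute",
+  "rollup": true
+}
+```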
+
+## Maximizing rollup ratio
+
+To measure the rollup ratio of a datasource, compare the number of rows in Druid (`COUNT`) with the number of ingested events. For example, run a [Druid SQL](../querying/sql.md) query where "count" refers to a `count`-type metric generated at ingestion time, as follows:
+
+```sql
+SELECT SUM("cnt") / (COUNT(*) * 1.0) FROM datasource
+```
+The higher the result, the greater the benefit you gain from rollup. See [Counting the number of ingested events](schema-design.md#counting) for more details about how counting works when rollup is enabled.
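+
+In this query, `"cnt"` is assumed to be a `count`-type metric defined at ingestion time, for example:
+
+```json
+"metricsSpec": [
+  { "type": "count", "name": "cnt" }
+]
+```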
+
+Tips for maximizing rollup:
+
+- Design your schema with fewer dimensions and lower cardinality dimensions to yield better rollup ratios.
+- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which decrease rollup ratios.
+- Adjust your `queryGranularity` at ingestion time to increase the chances that multiple rows in Druid have matching timestamps. For example, use five minute query granularity (`PT5M`) instead of one minute (`PT1M`).
+- You can optionally load the same data into more than one Druid datasource. For example:
+    - Create a "full" datasource that has rollup disabled, or enabled, but with a minimal rollup ratio.
+    - Create a second "abbreviated" datasource with fewer dimensions and a higher rollup ratio.
+     When queries only involve dimensions in the "abbreviated" set, use the second datasource to reduce query times. Often, this method only requires a small increase in storage footprint because abbreviated datasources tend to be substantially smaller.
+- If you use a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect rollup, try one of the following:
+    - Switch to a guaranteed perfect rollup option.
+    - [Reindex](data-management.md#reingesting-data) or [compact](compaction.md) your data in the background after initial ingestion.
+
+## Perfect rollup vs best-effort rollup
+
+Depending on the ingestion method, Druid has the following rollup options:
+- Guaranteed _perfect rollup_: Druid perfectly aggregates input data at ingestion time.
+- _Best-effort rollup_: Druid may not perfectly aggregate input data. Therefore, multiple segments might contain rows with the same timestamp and dimension values.
+
+In general, ingestion methods that offer best-effort rollup do this for one of the following reasons:
+- The ingestion method parallelizes ingestion without the shuffling step required for perfect rollup.
+- The ingestion method uses _incremental publishing_, which means it finalizes and publishes segments before all data for a time chunk has been received.
+
+In both of these cases, records that could theoretically be rolled up may end up in different segments. All types of streaming ingestion run in this mode.
+
+Ingestion methods that guarantee perfect rollup use an additional preprocessing step to determine intervals and partitioning before data ingestion. This preprocessing step scans the entire input dataset. While this step increases the time required for ingestion, it provides information necessary for perfect rollup.
+
+The following table shows how each method handles rollup:
+
+|Method|How it works|
+|------|------------|
+|[Native batch](native-batch.md)|`index_parallel` and `index` tasks may be either perfect or best-effort, based on configuration.|
+|[Hadoop](hadoop.md)|Always perfect.|
+|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Always best-effort.|
+|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Always best-effort.|
+
+## Learn more
+See the following topic for more information:
+* [Rollup tutorial](../tutorials/tutorial-rollup.md) for an example of how to configure rollup, and of how the feature modifies your data.
diff --git a/docs/ingestion/schema-design.md b/docs/ingestion/schema-design.md
index a1a68e4..dc0ac9d 100644
--- a/docs/ingestion/schema-design.md
+++ b/docs/ingestion/schema-design.md
@@ -24,17 +24,17 @@
 
 ## Druid's data model
 
-For general information, check out the documentation on [Druid's data model](index.md#data-model) on the main
+For general information, check out the documentation on [Druid's data model](./data-model.md) on the main
 ingestion overview page. The rest of this page discusses tips for users coming from other kinds of systems, as well as
 general tips and common practices.
 
-* Druid data is stored in [datasources](index.md#datasources), which are similar to tables in a traditional RDBMS.
-* Druid datasources can be ingested with or without [rollup](#rollup). With rollup enabled, Druid partially aggregates your data during ingestion, potentially reducing its row count, decreasing storage footprint, and improving query performance. With rollup disabled, Druid stores one row for each row in your input data, without any pre-aggregation.
+* Druid data is stored in [datasources](./data-model.md), which are similar to tables in a traditional RDBMS.
+* Druid datasources can be ingested with or without [rollup](./rollup.md). With rollup enabled, Druid partially aggregates your data during ingestion, potentially reducing its row count, decreasing storage footprint, and improving query performance. With rollup disabled, Druid stores one row for each row in your input data, without any pre-aggregation.
 * Every row in Druid must have a timestamp. Data is always partitioned by time, and every query has a time filter. Query results can also be broken down by time buckets like minutes, hours, days, and so on.
 * All columns in Druid datasources, other than the timestamp column, are either dimensions or metrics. This follows the [standard naming convention](https://en.wikipedia.org/wiki/Online_analytical_processing#Overview_of_OLAP_systems) of OLAP data.
 * Typical production datasources have tens to hundreds of columns.
-* [Dimension columns](index.md#dimensions) are stored as-is, so they can be filtered on, grouped by, or aggregated at query time. They are always single Strings, [arrays of Strings](../querying/multi-value-dimensions.md), single Longs, single Doubles or single Floats.
-* [Metric columns](index.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approx). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled.
+* [Dimension columns](./data-model.md#dimensions) are stored as-is, so they can be filtered on, grouped by, or aggregated at query time. They are always single Strings, [arrays of Strings](../querying/multi-value-dimensions.md), single Longs, single Doubles or single Floats.
+* [Metric columns](./data-model.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approx). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled.
 
 
 ## If you're coming from a...
@@ -89,7 +89,7 @@
 non-timeseries data, even in the same datasource.
 
 To achieve best-case compression and query performance in Druid for timeseries data, it is important to partition and
-sort by metric name, like timeseries databases often do. See [Partitioning and sorting](index.md#partitioning) for more details.
+sort by metric name, like timeseries databases often do. See [Partitioning and sorting](./partitioning.md) for more details.
 
 Tips for modeling timeseries data in Druid:
 
@@ -98,13 +98,13 @@
 - Create a dimension that indicates the name of the series that a data point belongs to. This dimension is often called
 "metric" or "name". Do not get the dimension named "metric" confused with the concept of Druid metrics. Place this
 first in the list of dimensions in your "dimensionsSpec" for best performance (this helps because it improves locality;
-see [partitioning and sorting](index.md#partitioning) below for details).
+see [partitioning and sorting](./partitioning.md) below for details).
 - Create other dimensions for attributes attached to your data points. These are often called "tags" in timeseries
 database systems.
 - Create [metrics](../querying/aggregations.md) corresponding to the types of aggregations that you want to be able
 to query. Typically this includes "sum", "min", and "max" (in one of the long, float, or double flavors). If you want to
 be able to compute percentiles or quantiles, use Druid's [approximate aggregators](../querying/aggregations.md#approx).
-- Consider enabling [rollup](#rollup), which will allow Druid to potentially combine multiple points into one
+- Consider enabling [rollup](./rollup.md), which will allow Druid to potentially combine multiple points into one
 row in your Druid datasource. This can be useful if you want to store data at a different time granularity than it is
 naturally emitted. It is also useful if you want to combine timeseries and non-timeseries data in the same datasource.
 - If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
@@ -124,8 +124,8 @@
 
 - If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
 [automatic detection of dimension columns](#schema-less-dimensions).
-- If you have nested data, flatten it using a [`flattenSpec`](index.md#flattenspec).
-- Consider enabling [rollup](#rollup) if you have mainly analytical use cases for your log data. This will
+- If you have nested data, flatten it using a [`flattenSpec`](./ingestion-spec.md#flattenspec).
+- Consider enabling [rollup](./rollup.md) if you have mainly analytical use cases for your log data. This will
 mean you lose the ability to retrieve individual events from Druid, but you potentially gain substantial compression and
 query performance boosts.
 
@@ -134,13 +134,13 @@
 ### Rollup
 
 Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. This is a form
-of summarization or pre-aggregation. For more details, see the [Rollup](index.md#rollup) section of the ingestion
+of summarization or pre-aggregation. For more details, see the [Rollup](./rollup.md) section of the ingestion
 documentation.
 
 ### Partitioning and sorting
 
 Optimally partitioning and sorting your data can have substantial impact on footprint and performance. For more details,
-see the [Partitioning](index.md#partitioning) section of the ingestion documentation.
+see the [Partitioning](./partitioning.md) section of the ingestion documentation.
 
 <a name="sketches"></a>
 
@@ -180,19 +180,19 @@
 than string columns. But unlike string columns, numeric columns don't have indexes, so they can be slower to filter on.
 You may want to experiment to find the optimal choice for your use case.
 
-For details about how to configure numeric dimensions, see the [`dimensionsSpec`](index.md#dimensionsspec) documentation.
+For details about how to configure numeric dimensions, see the [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec) documentation.
 
 
 
 ### Secondary timestamps
 
 Druid schemas must always include a primary timestamp. The primary timestamp is used for
-[partitioning and sorting](index.md#partitioning) your data, so it should be the timestamp that you will most often filter on.
+[partitioning and sorting](./partitioning.md) your data, so it should be the timestamp that you will most often filter on.
 Druid is able to rapidly identify and retrieve data corresponding to time ranges of the primary timestamp column.
 
 If your data has more than one timestamp, you can ingest the others as secondary timestamps. The best way to do this
-is to ingest them as [long-typed dimensions](index.md#dimensionsspec) in milliseconds format.
-If necessary, you can get them into this format using a [`transformSpec`](index.md#transformspec) and
+is to ingest them as [long-typed dimensions](./ingestion-spec.md#dimensionsspec) in milliseconds format.
+If necessary, you can get them into this format using a [`transformSpec`](./ingestion-spec.md#transformspec) and
 [expressions](../misc/math-expr.md) like `timestamp_parse`, which returns millisecond timestamps.
 
 At query time, you can query secondary timestamps with [SQL time functions](../querying/sql.md#time-functions)
@@ -215,7 +215,7 @@
 ```
 
 Druid is capable of flattening JSON, Avro, or Parquet input data.
-Please read about [`flattenSpec`](index.md#flattenspec) for more details.
+Please read about [`flattenSpec`](./ingestion-spec.md#flattenspec) for more details.
 
 <a name="counting"></a>
 
diff --git a/docs/ingestion/tasks.md b/docs/ingestion/tasks.md
index 2e62bd2..f7e653d 100644
--- a/docs/ingestion/tasks.md
+++ b/docs/ingestion/tasks.md
@@ -164,7 +164,7 @@
 the `rowStats` map contains information about row counts. There is one entry for each ingestion phase. The definitions of the different row counts are shown below:
 * `processed`: Number of rows successfully ingested without parsing errors
 * `processedWithError`: Number of rows that were ingested, but contained a parsing error within one or more columns. This typically occurs where input rows have a parseable structure but invalid types for columns, such as passing in a non-numeric String value for a numeric column.
-* `thrownAway`: Number of rows skipped. This includes rows with timestamps that were outside of the ingestion task's defined time interval and rows that were filtered out with a [`transformSpec`](index.md#transformspec), but doesn't include the rows skipped by explicit user configurations. For example, the rows skipped by `skipHeaderRows` or `hasHeaderRow` in the CSV format are not counted.
+* `thrownAway`: Number of rows skipped. This includes rows with timestamps that were outside of the ingestion task's defined time interval and rows that were filtered out with a [`transformSpec`](./ingestion-spec.md#transformspec), but doesn't include the rows skipped by explicit user configurations. For example, the rows skipped by `skipHeaderRows` or `hasHeaderRow` in the CSV format are not counted.
 * `unparseable`: Number of rows that could not be parsed at all and were discarded. This tracks input rows without a parseable structure, such as passing in non-JSON data when using a JSON parser.
 
 The `errorMsg` field shows a message describing the error that caused a task to fail. It will be null if the task was successful.
diff --git a/docs/operations/segment-optimization.md b/docs/operations/segment-optimization.md
index 73d1212..92be3a3 100644
--- a/docs/operations/segment-optimization.md
+++ b/docs/operations/segment-optimization.md
@@ -25,7 +25,7 @@
 
 In Apache Druid, it's important to optimize the segment size because
 
-  1. Druid stores data in segments. If you're using the [best-effort roll-up](../ingestion/index.md#rollup) mode,
+  1. Druid stores data in segments. If you're using the [best-effort roll-up](../ingestion/rollup.md) mode,
   increasing the segment size might introduce further aggregation which reduces the dataSource size.
   2. When a query is submitted, that query is distributed to all Historicals and realtime tasks
   which hold the input segments of the query. Each process and task picks a thread from its own processing thread pool
diff --git a/docs/querying/granularities.md b/docs/querying/granularities.md
index b59a960..0b4ea53 100644
--- a/docs/querying/granularities.md
+++ b/docs/querying/granularities.md
@@ -181,7 +181,7 @@
 ```
 
 Having a query time `granularity` that is smaller than the `queryGranularity` parameter set at
-[ingestion time](../ingestion/index.md#granularityspec) is unreasonable because information about that
+[ingestion time](../ingestion/ingestion-spec.md#granularityspec) is unreasonable because information about that
 smaller granularity is not present in the indexed data. So, if the query time granularity is smaller than the ingestion
 time query granularity, Druid produces results that are equivalent to having set `granularity` to `queryGranularity`.
 
diff --git a/docs/querying/multi-value-dimensions.md b/docs/querying/multi-value-dimensions.md
index b5c23ec..b33a295 100644
--- a/docs/querying/multi-value-dimensions.md
+++ b/docs/querying/multi-value-dimensions.md
@@ -59,7 +59,7 @@
 * `SORTED_SET`: results in the removal of duplicate values
 * `ARRAY`: retains the original order of the values
 
-See [Dimension Objects](../ingestion/index.md#dimension-objects) for information on configuring multi-value handling.
+See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for information on configuring multi-value handling.
 
 
 ## Querying multi-value dimensions
diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md
index 749dccc..18fba89 100644
--- a/docs/tutorials/index.md
+++ b/docs/tutorials/index.md
@@ -176,11 +176,11 @@
    You do not need to adjust transformation or filtering settings, as applying ingestion time transforms and 
    filters are out of scope for this tutorial.
 
-8. The Configure schema settings are where you configure what [dimensions](../ingestion/index.md#dimensions) 
-   and [metrics](../ingestion/index.md#metrics) are ingested. The outcome of this configuration represents exactly how the 
+8. The Configure schema settings are where you configure what [dimensions](../ingestion/data-model.md#dimensions) 
+   and [metrics](../ingestion/data-model.md#metrics) are ingested. The outcome of this configuration represents exactly how the 
    data will appear in Druid after ingestion. 
 
-   Since our dataset is very small, you can turn off [rollup](../ingestion/index.md#rollup) 
+   Since our dataset is very small, you can turn off [rollup](../ingestion/rollup.md) 
    by unsetting the **Rollup** switch and confirming the change when prompted.
 
    ![Data loader schema](../assets/tutorial-batch-data-loader-05.png "Data loader schema")
diff --git a/docs/tutorials/tutorial-kafka.md b/docs/tutorials/tutorial-kafka.md
index 65a339f..c640f01 100644
--- a/docs/tutorials/tutorial-kafka.md
+++ b/docs/tutorials/tutorial-kafka.md
@@ -112,9 +112,9 @@
 
 ![Data loader schema](../assets/tutorial-kafka-data-loader-05.png "Data loader schema")
 
-In the `Configure schema` step, you can configure which [dimensions](../ingestion/index.md#dimensions) and [metrics](../ingestion/index.md#metrics) will be ingested into Druid.
+In the `Configure schema` step, you can configure which [dimensions](../ingestion/data-model.md#dimensions) and [metrics](../ingestion/data-model.md#metrics) will be ingested into Druid.
 This is exactly what the data will appear like in Druid once it is ingested.
-Since our dataset is very small, go ahead and turn off [`Rollup`](../ingestion/index.md#rollup) by clicking on the switch and confirming the change.
+Since our dataset is very small, go ahead and turn off [`Rollup`](../ingestion/rollup.md) by clicking on the switch and confirming the change.
 
 Once you are satisfied with the schema, click `Next` to go to the `Partition` step where you can fine tune how the data will be partitioned into segments.
 
diff --git a/website/i18n/en.json b/website/i18n/en.json
index a777e17..821796a 100644
--- a/website/i18n/en.json
+++ b/website/i18n/en.json
@@ -282,6 +282,10 @@
       "ingestion/data-management": {
         "title": "Data management"
       },
+      "ingestion/data-model": {
+        "title": "Druid data model",
+        "sidebar_label": "Data model"
+      },
       "ingestion/faq": {
         "title": "Ingestion troubleshooting FAQ",
         "sidebar_label": "Troubleshooting FAQ"
@@ -293,10 +297,22 @@
       "ingestion/index": {
         "title": "Ingestion"
       },
+      "ingestion/ingestion-spec": {
+        "title": "Ingestion spec reference",
+        "sidebar_label": "Ingestion spec"
+      },
       "ingestion/native-batch": {
         "title": "Native batch ingestion",
         "sidebar_label": "Native batch"
       },
+      "ingestion/partitioning": {
+        "title": "Partitioning",
+        "sidebar_label": "Partitioning"
+      },
+      "ingestion/rollup": {
+        "title": "Data rollup",
+        "sidebar_label": "Data rollup"
+      },
       "ingestion/schema-design": {
         "title": "Schema design tips"
       },
diff --git a/website/sidebars.json b/website/sidebars.json
index ba9d359..a8c2fe4 100644
--- a/website/sidebars.json
+++ b/website/sidebars.json
@@ -32,6 +32,10 @@
     "Ingestion": [
       "ingestion/index",
       "ingestion/data-formats",
+      "ingestion/data-model",
+      "ingestion/rollup",
+      "ingestion/partitioning",
+      "ingestion/ingestion-spec",
       "ingestion/schema-design",
       "ingestion/data-management",
       "ingestion/compaction",