---
id: tutorial-batch-native
title: "Load data with native batch ingestion"
sidebar_label: Load data with native batch ingestion
---

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements. See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership. The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License. You may obtain a copy of the License at
  ~
  ~ http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied. See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
-->

This topic shows you how to load and query data files in Apache Druid using its native batch ingestion feature.

## Prerequisites

Install Druid, start up Druid services, and open the web console as described in the [Druid quickstart](index.md).

## Load data

Ingestion specs define the schema of the data Druid reads and stores. You can write ingestion specs by hand or build them with the _data loader_, as we'll do here to perform batch file loading with Druid's native batch ingestion.
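
Under the hood, a native batch ingestion spec is a JSON document with three main parts: an `ioConfig` (where to read data from and in what format), a `dataSchema` (datasource name, timestamp, dimensions, and granularity), and a `tuningConfig`. A minimal sketch of the shape the data loader produces (`"example"` is a placeholder datasource name):

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel"
    },
    "dataSchema": {
      "dataSource": "example"
    },
    "tuningConfig": {
      "type": "index_parallel"
    }
  }
}
```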

The Druid distribution bundles sample data we can use. The sample data, located at `quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz` under the Druid root directory, represents Wikipedia page edits for a single day.

1. Click **Load data** in the web console header.

2. Select the **Local disk** tile and then click **Connect data**.

   ![Data loader init](../assets/tutorial-batch-data-loader-01.png "Data loader init")

3. Enter the following values:

   - **Base directory**: `quickstart/tutorial/`

   - **File filter**: `wikiticker-2015-09-12-sampled.json.gz`

   ![Data location](../assets/tutorial-batch-data-loader-015.png "Data location")

   Entering the base directory and [wildcard file filter](https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter.html) separately, as the UI allows, lets you specify multiple files for ingestion at once.
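
   In the spec that the data loader is building, these two fields map onto the `local` input source. The resulting `ioConfig` fragment looks roughly like this:

   ```json
   "inputSource": {
     "type": "local",
     "baseDir": "quickstart/tutorial/",
     "filter": "wikiticker-2015-09-12-sampled.json.gz"
   }
   ```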

4. Click **Apply**.

   The data loader displays the raw data, giving you a chance to verify that the data appears as expected.

   ![Data loader sample](../assets/tutorial-batch-data-loader-02.png "Data loader sample")

   Notice that your position in the sequence of steps to load data, **Connect** in our case, appears at the top of the console, as shown below. You can click other steps to move forward or backward in the sequence at any time.

   ![Load data](../assets/tutorial-batch-data-loader-12.png)

5. Click **Next: Parse data**.

   The data loader automatically tries to determine the appropriate parser for the data format. In this case, it identifies the data format as `json`, as shown in the **Input format** field at the bottom right.

   ![Data loader parse data](../assets/tutorial-batch-data-loader-03.png "Data loader parse data")

   Feel free to select other **Input format** options to get a sense of their configuration settings and how Druid parses other types of data.
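
   The selection here maps onto the spec's `inputFormat` object. For our JSON data it is simply:

   ```json
   "inputFormat": {
     "type": "json"
   }
   ```

   Other formats take additional fields; `csv`, for example, can carry a list of column names.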

6. With the JSON parser selected, click **Next: Parse time**. The **Parse time** settings are where you view and adjust the primary timestamp column for the data.

   ![Data loader parse time](../assets/tutorial-batch-data-loader-04.png "Data loader parse time")

   Druid requires data to have a primary timestamp column (internally stored in a column called `__time`). If you do not have a timestamp in your data, select `Constant value`. In our example, the data loader determines that the `time` column is the only candidate that can be used as the primary time column.
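
   In the spec, these settings become the `timestampSpec`. The sample data carries ISO 8601 timestamps, so the data loader produces roughly:

   ```json
   "timestampSpec": {
     "column": "time",
     "format": "iso"
   }
   ```

   Selecting `Constant value` instead records a fixed timestamp in the spec's `missingValue` field.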

7. Click **Next: Transform**, **Next: Filter**, and then **Next: Configure schema**, skipping a few steps.

   You do not need to adjust the transformation or filtering settings; applying ingestion-time transforms and filters is out of scope for this tutorial.

8. In the **Configure schema** settings, you configure which [dimensions](../ingestion/schema-model.md#dimensions) and [metrics](../ingestion/schema-model.md#metrics) Druid ingests. The outcome of this configuration determines exactly how the data appears in Druid after ingestion.

   Since our dataset is very small, you can turn off [rollup](../ingestion/rollup.md) by unsetting the **Rollup** switch and confirming the change when prompted.

   ![Data loader schema](../assets/tutorial-batch-data-loader-05.png "Data loader schema")

9. Click **Next: Partition** to configure how Druid splits the data into segments. In this case, choose `DAY` as the **Segment granularity**.

   ![Data loader partition](../assets/tutorial-batch-data-loader-06.png "Data loader partition")

   Since this small dataset spans a single day, selecting `DAY` as the segment granularity yields just one segment.
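
   This choice lands in the spec's `granularitySpec`, roughly:

   ```json
   "granularitySpec": {
     "type": "uniform",
     "segmentGranularity": "day",
     "queryGranularity": "none",
     "rollup": false
   }
   ```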

10. Click **Next: Tune** and **Next: Publish**.

11. In the **Publish** settings, you specify the datasource name in Druid. Let's change the default name from `wikiticker-2015-09-12-sampled` to `wikipedia`.

    ![Data loader publish](../assets/tutorial-batch-data-loader-07.png "Data loader publish")

12. Click **Next: Edit spec** to review the ingestion spec we've constructed with the data loader.

    ![Data loader spec](../assets/tutorial-batch-data-loader-08.png "Data loader spec")

    Feel free to go back and change settings from previous steps to see how doing so updates the spec. Similarly, you can edit the spec directly and see it reflected in the previous steps.

    For other ways to load ingestion specs in Druid, see [Tutorial: Loading a file](./tutorial-batch.md).
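
    If you followed the steps above, the spec in the editor should look roughly like the following sketch (the `dimensions` list is abbreviated here; the data loader enumerates every column):

    ```json
    {
      "type": "index_parallel",
      "spec": {
        "ioConfig": {
          "type": "index_parallel",
          "inputSource": {
            "type": "local",
            "baseDir": "quickstart/tutorial/",
            "filter": "wikiticker-2015-09-12-sampled.json.gz"
          },
          "inputFormat": {
            "type": "json"
          }
        },
        "dataSchema": {
          "dataSource": "wikipedia",
          "timestampSpec": {
            "column": "time",
            "format": "iso"
          },
          "dimensionsSpec": {
            "dimensions": [
              "channel",
              "cityName",
              "comment"
            ]
          },
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "day",
            "queryGranularity": "none",
            "rollup": false
          }
        },
        "tuningConfig": {
          "type": "index_parallel",
          "partitionsSpec": {
            "type": "dynamic"
          }
        }
      }
    }
    ```
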
13. Once you are satisfied with the spec, click **Submit**.

The new task for the `wikipedia` datasource now appears in the **Ingestion** view.

![Tasks view](../assets/tutorial-batch-data-loader-09.png "Tasks view")

The task may take a minute or two to complete. When done, the task status should be "SUCCESS", with the duration of the task indicated. Note that the view refreshes automatically, so you do not need to refresh the browser to see the status change.

A successful task means that one or more segments have been built and are now picked up by our data servers.

## Query the data

You can now see the data as a datasource in the console and try out a query, as follows:

1. Click **Datasources** in the console header.

   If the `wikipedia` datasource doesn't appear, wait a few moments for the segment to finish loading. A datasource is queryable once the **Availability** column shows it as "Fully available".
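
   If you prefer to check programmatically, segment state is also exposed through Druid's system tables; for example, a sketch against the built-in `sys.segments` table:

   ```sql
   SELECT "segment_id", "is_published", "is_available"
   FROM sys.segments
   WHERE "datasource" = 'wikipedia'
   ```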

2. When the datasource is available, open the **Actions** menu for that datasource and choose **Query with SQL**.

   ![Datasource view](../assets/tutorial-batch-data-loader-10.png "Datasource view")

   :::info
   Notice the other actions you can perform for a datasource, including configuring retention rules, compaction, and more.
   :::

3. Run the prepopulated query, `SELECT * FROM "wikipedia"`, to see the results.

   ![Query view](../assets/tutorial-batch-data-loader-11.png "Query view")