Apache Druid stores data and indexes in segment files partitioned by time. After Druid creates a segment, its contents can't be modified. You can either replace data for the whole segment, or, in some cases, overshadow a portion of the segment data.
In Druid, use time ranges to specify the data you want to update, as opposed to a primary key or dimensions often used in transactional databases. Data outside the specified replacement time range remains unaffected. You can use this Druid functionality to perform data updates, inserts, and deletes, similar to UPSERT functionality for transactional databases.
This tutorial shows you how to use the Druid SQL REPLACE function with the OVERWRITE clause to update existing data.
The tutorial walks you through the following use cases:
All examples use the multi-stage query (MSQ) task engine to executes SQL statements.
Before you follow the steps in this tutorial, download Druid as described in Quickstart (local) and have it running on your local machine. You don't need to load any data into the Druid cluster.
You should be familiar with data querying in Druid. If you haven't already, go through the Query data tutorial first.
Load a sample dataset using REPLACE and EXTERN functions. In Druid SQL, the REPLACE function can create a new datasource or update an existing datasource.
In the Druid web console, go to the Query view and run the following query:
REPLACE INTO "update_tutorial" OVERWRITE ALL WITH "ext" AS ( SELECT * FROM TABLE( EXTERN( '{"type":"inline","data":"{\"timestamp\":\"2024-01-01T07:01:35Z\",\"animal\":\"octopus\", \"number\":115}\n{\"timestamp\":\"2024-01-01T05:01:35Z\",\"animal\":\"mongoose\", \"number\":737}\n{\"timestamp\":\"2024-01-01T06:01:35Z\",\"animal\":\"snake\", \"number\":1234}\n{\"timestamp\":\"2024-01-01T01:01:35Z\",\"animal\":\"lion\", \"number\":300}\n{\"timestamp\":\"2024-01-02T07:01:35Z\",\"animal\":\"seahorse\", \"number\":115}\n{\"timestamp\":\"2024-01-02T05:01:35Z\",\"animal\":\"skunk\", \"number\":737}\n{\"timestamp\":\"2024-01-02T06:01:35Z\",\"animal\":\"iguana\", \"number\":1234}\n{\"timestamp\":\"2024-01-02T01:01:35Z\",\"animal\":\"opossum\", \"number\":300}"}', '{"type":"json"}' ) ) EXTEND ("timestamp" VARCHAR, "animal" VARCHAR, "number" BIGINT) ) SELECT TIME_PARSE("timestamp") AS "__time", "animal", "number" FROM "ext" PARTITIONED BY DAY
In the resulting update_tutorial
datasource, individual rows are uniquely identified by __time
, animal
, and number
. To view the results, open a new tab and run the following query:
SELECT * FROM "update_tutorial"
__time | animal | number |
---|---|---|
2024-01-01T01:01:35.000Z | lion | 300 |
2024-01-01T05:01:35.000Z | mongoose | 737 |
2024-01-01T06:01:35.000Z | snake | 1234 |
2024-01-01T07:01:35.000Z | octopus | 115 |
2024-01-02T01:01:35.000Z | opossum | 300 |
2024-01-02T05:01:35.000Z | skunk | 737 |
2024-01-02T06:01:35.000Z | iguana | 1234 |
2024-01-02T07:01:35.000Z | seahorse | 115 |
The results contain records for eight animals over two days.
You can use the REPLACE function with OVERWRITE ALL to replace the entire datasource with new data while dropping the old data.
In the web console, open a new tab and run the following query to overwrite timestamp data for the entire update_tutorial
datasource:
REPLACE INTO "update_tutorial" OVERWRITE ALL WITH "ext" AS (SELECT * FROM TABLE( EXTERN( '{"type":"inline","data":"{\"timestamp\":\"2024-01-02T07:01:35Z\",\"animal\":\"octopus\", \"number\":115}\n{\"timestamp\":\"2024-01-02T05:01:35Z\",\"animal\":\"mongoose\", \"number\":737}\n{\"timestamp\":\"2024-01-02T06:01:35Z\",\"animal\":\"snake\", \"number\":1234}\n{\"timestamp\":\"2024-01-02T01:01:35Z\",\"animal\":\"lion\", \"number\":300}\n{\"timestamp\":\"2024-01-03T07:01:35Z\",\"animal\":\"seahorse\", \"number\":115}\n{\"timestamp\":\"2024-01-03T05:01:35Z\",\"animal\":\"skunk\", \"number\":737}\n{\"timestamp\":\"2024-01-03T06:01:35Z\",\"animal\":\"iguana\", \"number\":1234}\n{\"timestamp\":\"2024-01-03T01:01:35Z\",\"animal\":\"opossum\", \"number\":300}"}', '{"type":"json"}' ) ) EXTEND ("timestamp" VARCHAR, "animal" VARCHAR, "number" BIGINT)) SELECT TIME_PARSE("timestamp") AS "__time", "animal", "number" FROM "ext" PARTITIONED BY DAY
__time | animal | number |
---|---|---|
2024-01-02T01:01:35.000Z | lion | 300 |
2024-01-02T05:01:35.000Z | mongoose | 737 |
2024-01-02T06:01:35.000Z | snake | 1234 |
2024-01-02T07:01:35.000Z | octopus | 115 |
2024-01-03T01:01:35.000Z | opossum | 300 |
2024-01-03T05:01:35.000Z | skunk | 737 |
2024-01-03T06:01:35.000Z | iguana | 1234 |
2024-01-03T07:01:35.000Z | seahorse | 115 |
Note that the values in the __time
column have changed to one day later.
You can use the REPLACE function to overwrite a specific time range of a datasource. When you overwrite a specific time range, that time range must align with the granularity specified in the PARTITIONED BY clause.
In the web console, open a new tab and run the following query to insert a new row and update specific rows. Note that the OVERWRITE WHERE clause tells the query to only update records for the date 2024-01-03.
REPLACE INTO "update_tutorial" OVERWRITE WHERE "__time" >= TIMESTAMP'2024-01-03 00:00:00' AND "__time" < TIMESTAMP'2024-01-04 00:00:00' WITH "ext" AS (SELECT * FROM TABLE( EXTERN( '{"type":"inline","data":"{\"timestamp\":\"2024-01-03T01:01:35Z\",\"animal\":\"tiger\", \"number\":300}\n{\"timestamp\":\"2024-01-03T07:01:35Z\",\"animal\":\"seahorse\", \"number\":500}\n{\"timestamp\":\"2024-01-03T05:01:35Z\",\"animal\":\"polecat\", \"number\":626}\n{\"timestamp\":\"2024-01-03T06:01:35Z\",\"animal\":\"iguana\", \"number\":300}\n{\"timestamp\":\"2024-01-03T01:01:35Z\",\"animal\":\"flamingo\", \"number\":999}"}', '{"type":"json"}' ) ) EXTEND ("timestamp" VARCHAR, "animal" VARCHAR, "number" BIGINT)) SELECT TIME_PARSE("timestamp") AS "__time", "animal", "number" FROM "ext" PARTITIONED BY DAY
__time | animal | number |
---|---|---|
2024-01-02T01:01:35.000Z | lion | 300 |
2024-01-02T05:01:35.000Z | mongoose | 737 |
2024-01-02T06:01:35.000Z | snake | 1234 |
2024-01-02T07:01:35.000Z | octopus | 115 |
2024-01-03T01:01:35.000Z | flamingo | 999 |
2024-01-03T01:01:35.000Z | tiger | 300 |
2024-01-03T05:01:35.000Z | polecat | 626 |
2024-01-03T06:01:35.000Z | iguana | 300 |
2024-01-03T07:01:35.000Z | seahorse | 500 |
Note the changes in the resulting datasource:
flamingo
.opossum
row has the value tiger
.skunk
row has the value polecat
.iguana
and seahorse
rows have different numbers.In Druid, you can overlay older data with newer data for the entire segment or portions of the segment within a particular partition. This capability is called overshadowing.
You can use partial overshadowing to update a single row by adding a smaller time granularity segment on top of the existing data. It's a less common variation on a more common approach where you replace the entire time chunk.
The following example demonstrates how update data using partial overshadowing with mixed segment granularity.
Note the following important points about the example:
number
row.REPLACE INTO "update_tutorial" OVERWRITE WHERE "__time" >= TIMESTAMP'2024-01-03 05:00:00' AND "__time" < TIMESTAMP'2024-01-03 06:00:00' SELECT "__time", "animal", CAST(486 AS BIGINT) AS "number" FROM "update_tutorial" WHERE TIME_IN_INTERVAL("__time", '2024-01-03T05:01:35Z/PT1S') PARTITIONED BY FLOOR(__time TO HOUR)
__time | animal | number |
---|---|---|
2024-01-02T01:01:35.000Z | lion | 300 |
2024-01-02T05:01:35.000Z | mongoose | 737 |
2024-01-02T06:01:35.000Z | snake | 1234 |
2024-01-02T07:01:35.000Z | octopus | 115 |
2024-01-03T01:01:35.000Z | flamingo | 999 |
2024-01-03T01:01:35.000Z | tiger | 300 |
2024-01-03T05:01:35.000Z | polecat | 486 |
2024-01-03T06:01:35.000Z | iguana | 300 |
2024-01-03T07:01:35.000Z | seahorse | 500 |
Note that the number
for polecat
has changed from 626 to 486.
When you perform partial segment overshadowing multiple times, you can create segment fragmentation that could affect query performance. Use compaction to correct any fragmentation.
See the following topics for more information: