In Apache Druid (incubating), it's important to optimize the segment size because each segment is processed by a single thread: segments that are too small add per-segment overhead to every query, while segments that are too large limit parallelism.
It would be best to optimize the segment size at ingestion time, but this is not always easy, especially for stream ingestion, because the amount of ingested data can vary over time. In that case, you can create segments with a sub-optimized size first and optimize them later via compaction.
You may need to consider the following factors to optimize your segments.
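As a rough illustration of these factors, the commonly cited targets are around 5 million rows and roughly 300-700 MB per segment. The helper below is hypothetical (not part of Druid) and simply flags intervals that fall well outside those ranges:

```python
# Sketch: flag intervals whose average segment rows/size fall outside the
# commonly recommended targets (~5M rows, ~300-700 MB per segment).
# The thresholds and this helper are illustrative, not part of Druid.

TARGET_ROWS = 5_000_000
TARGET_BYTES = (300 * 1024**2, 700 * 1024**2)  # 300 MB - 700 MB

def needs_compaction(avg_num_rows, avg_size):
    """Return True if a segment interval looks like a compaction candidate."""
    too_few_rows = avg_num_rows < TARGET_ROWS * 0.5  # well under ~5M rows
    too_small = avg_size < TARGET_BYTES[0]           # under ~300 MB
    return too_few_rows or too_small

# Example inputs in the shape returned by a sys.segments query:
print(needs_compaction(100_000, 20 * 1024**2))     # True: small segments
print(needs_compaction(5_200_000, 500 * 1024**2))  # False: healthy segments
```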
There are several ways to check whether compaction is necessary. One way is to use the System Schema, which provides several tables about the current system status, including the segments table. By running the query below, you can get the average number of rows and the average size of published segments.
SELECT "start", "end", version,
       COUNT(*) AS num_segments,
       AVG("num_rows") AS avg_num_rows,
       SUM("num_rows") AS total_num_rows,
       AVG("size") AS avg_size,
       SUM("size") AS total_size
FROM sys.segments
WHERE datasource = 'your_dataSource' AND is_published = 1
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3 DESC;
Please note that the query result might include overshadowed segments. In that case, you may want to see only the rows with the maximum version per interval (pair of start and end).
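That filtering can be sketched in plain SQL. Since the sys schema can't be queried offline, the example below simulates the idea with SQLite and a hypothetical segments table, keeping only rows whose version is the maximum for their (start, end) interval:

```python
import sqlite3

# Simulate filtering out overshadowed segments: keep only the max version
# per (start, end) interval. The table and data are hypothetical; against
# a real cluster the same idea applies to the sys.segments query.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE segments ("start" TEXT, "end" TEXT, version TEXT, size INTEGER)')
conn.executemany(
    "INSERT INTO segments VALUES (?, ?, ?, ?)",
    [
        ("2019-01-01", "2019-01-02", "v1", 100),  # overshadowed by v2
        ("2019-01-01", "2019-01-02", "v2", 120),
        ("2019-01-02", "2019-01-03", "v1", 90),
    ],
)
rows = conn.execute(
    """
    SELECT s."start", s."end", s.version, s.size
    FROM segments s
    JOIN (SELECT "start", "end", MAX(version) AS max_version
          FROM segments GROUP BY "start", "end") m
      ON s."start" = m."start" AND s."end" = m."end"
     AND s.version = m.max_version
    ORDER BY s."start"
    """
).fetchall()
print(rows)  # only v2 survives for the first interval
```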
Once you find that your segments need compaction, you can consider the two options below:

- Turning on the automatic compaction of Coordinators. The Coordinator periodically submits compaction tasks to re-index segments of small size.
- Running periodic Hadoop batch ingestion jobs and using a dataSource inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel. Details on how to do this can be found under 'Updating Existing Data'.
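For reference, a manual compaction task is submitted to the Overlord as JSON. The sketch below only builds such a payload; the dataSource name and interval are placeholders, and it does not talk to a cluster:

```python
import json

# Sketch of a Druid "compact" task payload. dataSource and interval are
# placeholder values; to actually run it, POST the JSON to the Overlord's
# task endpoint (/druid/indexer/v1/task).
payload = {
    "type": "compact",
    "dataSource": "your_dataSource",
    "interval": "2019-01-01/2019-01-02",
}
print(json.dumps(payload, indent=2))
```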