In Apache Druid, it's important to optimize the segment size because segment size affects query performance.
It is best to optimize the segment size at ingestion time, but this is not always easy, especially with stream ingestion, because the amount of data ingested might vary over time. In this case, you can first create segments with a suboptimal size and optimize them later using compaction.
Consider the following when optimizing your segments.
The above recommendation works in general, but the optimal setting can vary based on your workload. For example, if most of your queries are heavy and take a long time to process each row, you may want to make segments smaller so that query processing can be more parallelized. If you still see performance issues after optimizing segment size, you may need to find the optimal settings for your workload.
There are several ways to check whether compaction is necessary. One way is to use the System Schema. The system schema provides several tables about the current system status, including the segments table. By running the query below, you can get the average number of rows and the average size of published segments.
    SELECT
        "start",
        "end",
        version,
        COUNT(*) AS num_segments,
        AVG("num_rows") AS avg_num_rows,
        SUM("num_rows") AS total_num_rows,
        AVG("size") AS avg_size,
        SUM("size") AS total_size
    FROM
        sys.segments A
    WHERE
        datasource = 'your_dataSource' AND
        is_published = 1
    GROUP BY 1, 2, 3
    ORDER BY 1, 2, 3 DESC;
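As a rough illustration of how the result of this query might be used, the Python sketch below flags intervals whose average segment is much smaller than a target. The row layout (dicts keyed by the query's column aliases), the function name `intervals_needing_compaction`, and the `target_rows` and `min_ratio` parameters are all assumptions made for this example and are not part of Druid. Pick the target from whatever recommendation fits your workload.

```python
def intervals_needing_compaction(rows, target_rows, min_ratio=0.5):
    """Return (start, end, version) for result rows whose average segment is small.

    rows:        iterable of dicts using the query's aliases
                 ("start", "end", "version", "num_segments", "avg_num_rows").
    target_rows: hypothetical per-segment row target supplied by the caller.
    min_ratio:   hypothetical cutoff; a row is flagged when its average
                 row count is below target_rows * min_ratio.
    """
    flagged = []
    for row in rows:
        # Choice made for this sketch: skip intervals that already hold
        # a single segment, since there is nothing to merge within them.
        if row["num_segments"] < 2:
            continue
        if row["avg_num_rows"] < target_rows * min_ratio:
            flagged.append((row["start"], row["end"], row["version"]))
    return flagged
```

The function only inspects row counts. A similar check on `avg_size` could be added in the same way if byte size matters more for your workload.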
Note that the query result might include overshadowed segments. In this case, you may want to look only at the rows with the max version per interval (each pair of start and end).
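One way to keep only the max version per interval is to filter the query result on the client side. The following is a minimal Python sketch, assuming each result row is a dict with `start`, `end`, and `version` keys and that version strings are timestamp-like, so that plain string comparison orders them. The function name `latest_version_rows` is hypothetical.

```python
def latest_version_rows(rows):
    """Keep only the row with the highest version for each (start, end) pair."""
    latest = {}
    for row in rows:
        key = (row["start"], row["end"])
        # Assumption: versions are timestamp-like strings, so the
        # lexicographically greatest string is the newest version.
        if key not in latest or row["version"] > latest[key]["version"]:
            latest[key] = row
    # Dicts preserve insertion order, so intervals come back in the
    # order in which they first appeared in the input.
    return list(latest.values())
```

This only compares rows that share exactly the same start and end; it does not detect a segment that is overshadowed by a newer segment covering a different interval.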
Once you find that your segments need compaction, consider the following two options:
dataSource inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel. Details on how to do this can be found in the "Updating existing data" section of the data management page.