---
id: troubleshooting
title: "Troubleshooting query execution in Druid"
sidebar_label: "Troubleshooting"
---

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
-->

This topic describes issues that may affect query execution in Druid, how to identify those issues, and strategies to resolve them.

## Query fails due to internal communication timeout

In Druid's query processing, when the Broker sends a query to the data servers, the data servers process the query and push their intermediate results back to the Broker.
Because calls from the Broker to the data servers are synchronous, the Jetty server on a data server can time out in certain cases:

1. The data servers don't push any results to the Broker before the maximum idle time elapses.
2. The data servers start to push data but then pause for longer than the maximum idle time, for example, due to [Broker backpressure](../operations/basic-cluster-tuning.md#broker-backpressure).

When such a timeout occurs, the server interrupts the connection between the Broker and the data server, which causes the query to fail with a channel disconnection error. For example:

```json
{
  "error": {
    "error": "Unknown exception",
    "errorMessage": "Query[6eee73a6-a95f-4bdc-821d-981e99e39242] url[https://localhost:8283/druid/v2/] failed with exception msg [Channel disconnected] (through reference chain: org.apache.druid.query.scan.ScanResultValue[\"segmentId\"])",
    "errorClass": "com.fasterxml.jackson.databind.JsonMappingException",
    "host": "localhost:8283"
  }
}
```

Channel disconnection can occur for various reasons.
To verify that the error is due to a web server timeout, search for the query ID in the Historical logs.
The query ID in the example above is `6eee73a6-a95f-4bdc-821d-981e99e39242`.
The `"host"` field in the error message indicates the address of the Historical in question.
In the Historical logs, look for an exception indicating `Idle timeout expired`:

```text
2021-09-14T19:52:27,685 ERROR [qtp475526834-85[scan_[test_large_table]_6eee73a6-a95f-4bdc-821d-981e99e39242]] org.apache.druid.server.QueryResource - Unable to send query response. (java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 300000/300000 ms)
2021-09-14T19:52:27,685 ERROR [qtp475526834-85] org.apache.druid.server.QueryLifecycle - Exception while processing queryId [6eee73a6-a95f-4bdc-821d-981e99e39242] (java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 300000/300000 ms)
2021-09-14T19:52:27,686 WARN [qtp475526834-85] org.eclipse.jetty.server.HttpChannel - handleException /druid/v2/ java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 300000/300000 ms
```
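
If you have shell access to the data server, a plain-text search of the Historical log by query ID is enough to locate these entries. A minimal sketch, assuming the log is written to `var/sv/historical.log` (the actual path depends on your deployment):

```bash
# Search the Historical log for the failed query's ID.
# The query ID comes from the "errorMessage" field of the query error.
# The log path below is an illustrative example, not a fixed location.
grep "6eee73a6-a95f-4bdc-821d-981e99e39242" var/sv/historical.log
```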

To mitigate query failure due to web server timeout:
* Increase the max idle time for the web server, as shown in the snippet after this list.
Set the max idle time in the `druid.server.http.maxIdleTime` property in the `historical/runtime.properties` file.
You must restart the Druid cluster for this change to take effect.
See [Configuration reference](../configuration/index.md) for more information on configuring the server.
* If the timeout occurs because the data servers have not pushed any results to the Broker, consider optimizing data server performance. Significant slowdown in the data servers may result from spilling too much data to disk in [groupBy queries](groupbyquery.md#performance-tuning-for-groupby), large [`IN` filters](filters.md#in-filter) in the query, or an under-scaled cluster. Analyze your [Druid query metrics](../operations/metrics.md#query-metrics) to determine the bottleneck.
* If the timeout is caused by Broker backpressure, consider optimizing Broker performance. Check whether the connection is fast enough between the Broker and deep storage.
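
As an illustration of the first mitigation, the snippet below raises the max idle time on a Historical. The property name comes from this page; the `PT10M` value is only an example that doubles the five-minute timeout (`300000/300000 ms`) visible in the log output above, not a recommendation:

```properties
# historical/runtime.properties
# Raise the Jetty max idle time for the Historical's web server.
# PT10M (10 minutes) is an illustrative value; tune it to your workload.
# Restart the Druid cluster for the change to take effect.
druid.server.http.maxIdleTime=PT10M
```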