blob: 82b53b5d7c2b666623d85f9a6a8f566ab7095575 [file] [log] [blame]
// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
= How to Improve Queries With EXPLAIN Command
The SQL command is a powerful tool used to analyze and understand the execution plan of a query without actually executing it.
When you use the EXPLAIN command, it returns the query execution plan, which includes details such as:
- The order in which tables are accessed.
- The type of join operations used (for example, nested loops, hash joins, or merge joins).
- Any indexes that are used to speed up data retrieval.
- Estimated costs and row counts for different parts of the query.
This information is crucial for optimizing query performance, identifying bottlenecks, and making informed decisions about database schema design and indexing strategies.
SQL EXPLAIN output analysis can help you optimize slow query execution. You can avoid common bottlenecks in SQL execution by following these guidelines:
- Avoid scanning an entire table.
- Avoid scanning non-optimal indexes.
- Avoid suboptimal join ordering or join algorithm.
- Ensure optimal data colocation for your queries.
In the following sections, we will see some common issues with queries and ways to identify and fix them.
== Full Scan Instead of Index Scan
Suppose related sql execution flow looks like:
[source,sql]
----
CREATE TABLE t (id INT PRIMARY KEY, col1 VARCHAR);
CREATE INDEX t_col1_idx ON t(col1);
SELECT id FROM t WHERE col1 = '1';
----
And possible EXPLAIN output:
[source,sql]
----
TableScan
table: PUBLIC.T
predicate: =(COL1, _UTF-8'1')
fieldNames: [ID]
est: (rows=1)
----
NOTE: For simplicity, here and below, information that is not related to the example is omitted from the EXPLAIN output.
We can see a full scan (*TableScan* operator) with predicate.
The execution planner chooses which scan implementation (**TableScan** or **IndexScan**) to use.
If you expect that index scan is preferable, you can use the `FORCE_INDEX` hint to manually force `IndexScan` approach:
[source,sql]
----
SELECT /*+ FORCE_INDEX(t_col1_idx) */ id FROM t WHERE col1 = '1';
----
Will show a different plan, like:
[source,sql]
----
IndexScan
table: PUBLIC.T
index: T_COL1_IDX
type: SORTED
predicate: =(COL1, _UTF-8'1')
searchBounds: [ExactBounds [bound=_UTF-8'1']]
ieldNames: [ID]
collation: []
est: (rows=1)
----
== Suboptimal Indexes
Indexes with less prediction can be chosen, for example schema and query may look as follows:
[source,sql]
----
CREATE TABLE t (id INT PRIMARY KEY, col1 VARCHAR, col2 VARCHAR);
CREATE INDEX t_col1_col2_idx ON T(col1, col2);
CREATE INDEX t_col1_idx ON t(col1);
SELECT id FROM t WHERE col1 = '1' AND col2 = '2';
----
and a possible plan would be:
[source,sql]
----
IndexScan
table: PUBLIC.T
index: T_COL1_IDX
...
----
We can see that the execution uses the **T_COL1_IDX** index, through both predicates **COL1 = '1' AND COL2 = '2'** are involved and the **T_COL1_COL2_IDX** is preferable. In this case, the optimal plan would be:
[source,sql]
----
IndexScan
table: PUBLIC.T
index: T_COL1_COL2_IDX
...
----
You can also use the `FORCE_INDEX` hint to achieve this:
[source,sql]
----
SELECT /*+ FORCE_INDEX(t_col1_col2_idx) */ id FROM t WHERE col1 = '1' AND col2 = '2';
----
== Unexpected Sort Operation
By default, sorted indexes store their entries in ascending order.
You can adjust the ordering of a sorted index by including the options ASC or DESC.
Let's suppose the schema and related query look like this:
[source,sql]
----
CREATE TABLE t1 (id INT PRIMARY KEY, col1 VARCHAR);
CREATE TABLE t2 (id INT PRIMARY KEY, col1 VARCHAR);
CREATE INDEX t1_col1_desc_idx ON t1(col1 DESC);
CREATE INDEX t2_col1_idx ON t2(col1);
SELECT t1.id as t1id, t2.id as t2id FROM t1 JOIN t2 USING (col1);
----
And the possible execution plan looks like this:
[source,sql]
----
MergeJoin
...
Sort
collation: [COL1 ASC]
...
IndexScan
index: T2_COL1_IDX
...
----
In the example above, the planner adds the **Sort** operation before performing the **IndexScan**,
as the index used is sorted in descending order, while ascending order is required.
Extra **Sort** operations adds performance costs, and we can avoid it by creating an index with the appropriate sort ordering:
[source,sql]
----
CREATE INDEX t1_col1_idx ON t1(col1);
----
And plan will no longer display the **Sort** operation, improving query execution speed:
[source,sql]
----
MergeJoin
...
IndexScan
index: T1_COL1_IDX
...
IndexScan
index: T2_COL1_IDX
...
----
== Performance Impact of Correlated Subqueries
The SQL-99 standard allows for nested subqueries at nearly all places within a query, so Ignite 3 supports nested subqueries, both correlated and not.
Performance of certain complex correlated subqueries may be insufficient. Let`s consider a correlated query:
[source,sql]
----
CREATE TABLE emp(dept_id INTEGER PRIMARY KEY, name VARCHAR, salary INTEGER);
CREATE TABLE dept(id INTEGER PRIMARY KEY, name VARCHAR);
SELECT emp.name, (SELECT dept.name FROM dept WHERE emp.dept_id=dept.id)
FROM emp
WHERE emp.salary > 1000;
----
We can see nested correlated subquery here, lets check the plan:
[source,sql]
----
CorrelatedNestedLoopJoin
...
TableScan
table: PUBLIC.EMP
predicate: >(SALARY, 1000)
...
ColocatedHashAggregate
...
TableScan
table: PUBLIC.DEPT
predicate: =($cor1.DEPT_ID, ID)
...
----
The example above shows the slow **CorrelatedNestedLoopJoin** operation. Queries with this operation may cause a number of issues:
- Such subqueries may become bottlenecks.
- Queries can cause high CPU load.
- Certain queries may perform slower than expected.
If performance issues are found in similar queries, it would be more
efficient to rewrite the query without nested subqueries, for example:
[source,sql]
----
SELECT emp.name, dept.name
FROM emp, dept
WHERE emp.salary > 1000 AND emp.dept_id=dept.id;
----
And new plan becomes:
[source,sql]
----
HashJoin
predicate: =(DEPT_ID, ID)
...
TableScan
table: PUBLIC.EMP
predicate: >(SALARY, 1000)
...
TableScan
table: PUBLIC.DEPT
...
----
Without the **CorrelatedNestedLoopJoin** operation, the query should perform much better than the previous one.
== Excessive Sorting
Lets explain we have an index involved two columns one of them is participate in predicate and other in ordering, or in sql terms:
[source,sql]
----
CREATE TABLE emp(dept_id INTEGER PRIMARY KEY, name VARCHAR, salary INTEGER);
CREATE INDEX emp_salary_name_idx ON emp(salary, name);
SELECT dept_id FROM emp WHERE salary = 1 ORDER BY name;
----
Expectations:
- Index need to be used here.
- No additional sort is needed because index is ordered by **name** column is satisfies initial query ordering.
But the real execution plan shows a different result:
[source,sql]
----
Sort
collation: [NAME ASC]
...
TableScan
table: PUBLIC.EMP
predicate: =(SALARY, 1)
...
----
We can see a redundant **Sort** operator. A bit query refactoring can help to avoid excessive sorting:
[source,sql]
----
SELECT dept_id FROM emp WHERE salary = 1 ORDER BY salary, name;
----
And the plan becomes as follows:
[source,sql]
----
IndexScan
table: PUBLIC.EMP
index: EMP_SALARY_NAME_IDX
predicate: =(SALARY, 1)
...
----
== Select Count Optimization
Some queries can be optimized to use more optimal plans which brings performance speed up. For example, plan for:
[source,sql]
----
SELECT COUNT(*) FROM emp;
----
Can look like this:
[source,sql]
----
SelectCount
table: PUBLIC.EMP
est: (rows=43)
...
----
But there are numerous cases where such optimization is not applicable. In such a cases, a plan can be different and the execution may require more time.
The same query as above, but with explicit transaction may produce a different plan, for example:
[source,sql]
----
ReduceSortAggregate
...
MapSortAggregate
...
TableScan
table: PUBLIC.EMP
est: (rows=43)
...
----
== Index Scan Without Exact Search Bounds
Table scans are available in two implementations: direct table scan and scan through index.
Index scans contain predicate and search bounds.
Predicate provides final rows comparison. If search bounds are absentm the query degenerates into table scan through index scan (requiring an additional store look up), with further predicate comparison, that incurs additional performance overhead costs.
Let`s suppose we have schema and query like this:
[source,sql]
----
CREATE TABLE t (id INTEGER PRIMARY KEY, col1 DECIMAL(5, 3));
CREATE INDEX t_col1_idx ON t(col1);
SELECT id FROM t WHERE col1 = 43;
----
And possible plan would look like this:
[source,sql]
----
IndexScan
table: PUBLIC.T
index: T_COL1_IDX
predicate: =(CAST(COL1):DECIMAL(13, 3), 43.000)
...
----
We can see here - only **predicate** (and no **searchBounds**) which means that **all** rows from index will go through predicate and bring additional performance penalty.
Two type of solutions are possible here:
- You can prohibit suboptimal index usage;
- You can explicitly help the planner with type derivation.
=== Prohibit Index Usage
For the first approach, use the **NO_INDEX** hint to prohibit index usage:
[source,sql]
----
SELECT /*+ NO_INDEX */ id FROM t WHERE col1 = 43;
-- or with direct index mention:
SELECT /*+ NO_INDEX(t_col1_idx) */ id FROM t WHERE col1 = 43;
----
As a result, you will have a plan similar to this:
[source,sql]
----
TableScan
table: PUBLIC.T
predicate: =(CAST(COL1):DECIMAL(13, 3), 43.000)
...
----
=== Manual Type Casting
You can append additional cast to the same query to explicitly cast data as a specific type:
[source,sql]
----
SELECT id FROM t WHERE col1 = 43::DECIMAL(5, 3);
----
[source,sql]
----
IndexScan
table: PUBLIC.T
index: T_COL1_IDX
predicate: =(COL1, 43.000)
searchBounds: [ExactBounds [bound=43.000:DECIMAL(5, 3)]]
...
----
We can see here both **searchBounds** and **predicate** which means that only exact lookup through index will be involved.
The same case as above but for a bit complicated query:
[source,sql]
----
CREATE TABLE t (id INT PRIMARY KEY, col1 INT);
CREATE INDEX t_col1_asc_idx ON t (col1);
SELECT * FROM t WHERE col1::varchar = SUBSTR(CURRENT_DATE::varchar, 4);
----
Possible plan:
[source,sql]
----
IndexScan
table: PUBLIC.T
index: T_COL1_IDX
predicate: =(CAST(COL1):VARCHAR CHARACTER SET "UTF-8", SUBSTR(CAST(CURRENT_DATE):VARCHAR CHARACTER SET "UTF-8" NOT NULL, 4))
...
----
And we also can see that no **search bounds** are involved here.
Try to change it like:
[source,sql]
----
SELECT * FROM t WHERE col1 = SUBSTR(CURRENT_DATE::varchar, 4)::int;
----
And the possible plan will become:
[source,sql]
----
IndexScan
table: PUBLIC.T
index: T_COL1_ASC_IDX
predicate: =(COL1, CAST(SUBSTR(CAST(CURRENT_DATE):VARCHAR CHARACTER SET "UTF-8" NOT NULL, 4)):INTEGER NOT NULL)
searchBounds: [ExactBounds [bound=CAST(SUBSTR(CAST(CURRENT_DATE):VARCHAR CHARACTER SET "UTF-8" NOT NULL, 4)):INTEGER]]
...
----
We can see that **searchBounds** are present - thus more productive execution flow is expected here.
== Colocation Usage
As mentioned above, right colocated columns choice plays a significant role in query execution performance.
For example, if initially tables are created without any thoughts about further usage columns colocation you can have the following scenario:
[source,sql]
----
-- by default, the table is implicitly colocated by PRIMARY KEY
CREATE TABLE emp(dept_id INTEGER, name VARCHAR, salary INTEGER, PRIMARY KEY(dept_id, name));
-- implicitly colocated by PRIMARY KEY
CREATE TABLE dept(id INTEGER, name VARCHAR, PRIMARY KEY(name, id));
----
And query as follows:
[source,sql]
----
SELECT emp.name, dept.name FROM emp JOIN dept ON emp.dept_id = dept.id AND emp.salary > 1000;
----
Bring plan like:
[source,sql]
----
HashJoin
predicate: =(DEPT_ID, ID)
...
Exchange
...
TableScan
table: PUBLIC.EMP
...
Exchange
...
TableScan
table: PUBLIC.DEPT
...
----
We can see two **Exchange** operators, which means that all rows are transferred into a single node and then are joined.
This execution flow brings a performance cost and slows down query execution.
Let`s try to improve it by adding explicit colocation for the **dept** table by the **ID** column:
[source,sql]
----
-- implicitly colocated by PRIMARY KEY
CREATE TABLE emp(dept_id INTEGER, name VARCHAR, salary INTEGER, PRIMARY KEY(dept_id, name));
-- explicitly colocated by ID
CREATE TABLE dept(id INTEGER, name VARCHAR, PRIMARY KEY(name, id)) COLOCATE BY (id);
----
Now the dependent rows from **emp** table are transferred into the appropriate
node where **dept** holds the rows according to **DEPT.ID** distribution:
[source,sql]
----
HashJoin
predicate: =(DEPT_ID, ID)
...
Exchange
distribution: table PUBLIC.DEPT in zone "Default" by [DEPT_ID]
...
TableScan
table: PUBLIC.EMP
...
TableScan
table: PUBLIC.DEPT
...
----
Only one **Exchange** operator for now, which, once again, mean only rows transferring
from **emp** table to appropriate **dept** one.
And finally, both join predicate related columns are colocated:
NOTE: The following colocation example will only work if the **emp** and **dept** tables belong to the same distribution zone.
[source,sql]
----
-- explicitly colocated by DEPT_ID
CREATE TABLE emp(dept_id INTEGER, name VARCHAR, salary INTEGER, PRIMARY KEY(dept_id, name)) COLOCATE BY(dept_id);
-- explicitly colocated by ID
CREATE TABLE dept(id INTEGER, name VARCHAR, PRIMARY KEY(id, name)) COLOCATE BY(id);
----
Now, the **emp** and **dept** tables are both colocated.
And the final plan will look like this:
[source,sql]
----
HashJoin
predicate: =(DEPT_ID, ID)
...
TableScan
table: PUBLIC.EMP
...
TableScan
table: PUBLIC.DEPT
...
----
No **Exchange** operators are involved in the explanation, which means that no excessive rows transfer has occurred.