| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="performance"> |
| |
| <title>Tuning Impala for Performance</title> |
| <titlealts audience="PDF"><navtitle>Performance Tuning</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Performance"/> |
| <data name="Category" value="Databases"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Developers"/> |
| <!-- Like Impala Administration, this page has a fair bit of info already, but it could benefit from wiki-style embedded of intro text from those other pages. --> |
| <data name="Category" value="Stub Pages"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| The following sections explain the factors affecting the performance of Impala features, and procedures for |
| tuning, monitoring, and benchmarking Impala queries and other SQL operations. |
| </p> |
| |
| <p> |
| This section also describes techniques for maximizing Impala scalability. Scalability is tied to performance: |
| it means that performance remains high as the system workload increases. For example, reducing the disk I/O |
| performed by a query can speed up an individual query, and at the same time improve scalability by making it |
| practical to run more queries simultaneously. Sometimes, an optimization technique improves scalability more |
| than performance. For example, reducing memory usage for a query might not change the query performance much, |
| but might improve scalability by allowing more Impala queries or other kinds of jobs to run at the same time |
| without running out of memory. |
| </p> |
| |
| <note> |
| <p> |
| Before starting any performance tuning or benchmarking, make sure your system is configured with all the |
| recommended minimum hardware requirements from <xref href="impala_prereqs.xml#prereqs_hardware"/> and |
| software settings from <xref href="impala_config_performance.xml#config_performance"/>. |
| </p> |
| </note> |
| |
| <ul> |
| <li> |
| <xref href="impala_partitioning.xml#partitioning"/>. This technique physically divides the data based on |
| the different values in frequently queried columns, allowing queries to skip reading a large percentage of |
| the data in a table. |
| </li> |
| |
| <li> |
| <xref href="impala_perf_joins.xml#perf_joins"/>. Joins are the main class of queries that you can tune at |
| the SQL level, as opposed to changing physical factors such as the file format or the hardware |
| configuration. The related topics <xref href="impala_perf_stats.xml#perf_column_stats"/> and |
| <xref href="impala_perf_stats.xml#perf_table_stats"/> are also important primarily for join performance. |
| </li> |
| |
| <li> |
| <xref href="impala_perf_stats.xml#perf_table_stats"/> and |
| <xref href="impala_perf_stats.xml#perf_column_stats"/>. Gathering table and column statistics, using the |
| <codeph>COMPUTE STATS</codeph> statement, helps Impala automatically optimize the performance for join |
| queries, without requiring changes to SQL query statements. (This process is greatly simplified in Impala |
| 1.2.2 and higher, because the <codeph>COMPUTE STATS</codeph> statement gathers both kinds of statistics in |
| one operation, and does not require any setup and configuration as was previously necessary for the |
| <codeph>ANALYZE TABLE</codeph> statement in Hive.) |
| </li> |
| |
| <li> |
| <xref href="impala_perf_testing.xml#performance_testing"/>. Do some post-setup testing to ensure Impala is |
| using optimal settings for performance, before conducting any benchmark tests. |
| </li> |
| |
| <li> |
| <xref href="impala_perf_benchmarking.xml#perf_benchmarks"/>. The configuration and sample data that you use |
| for initial experiments with Impala is often not appropriate for doing performance tests. |
| </li> |
| |
| <li> |
| <xref href="impala_perf_resources.xml#mem_limits"/>. The more memory Impala can utilize, the better query |
| performance you can expect. In a cluster running other kinds of workloads as well, you must make tradeoffs |
| to make sure all Hadoop components have enough memory to perform well, so you might cap the memory that |
| Impala can use. |
| </li> |
| |
| <li rev="1.2" audience="hidden"> |
| <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>. Impala can use the HDFS caching feature to pin |
| frequently accessed data in memory, reducing disk I/O. |
| </li> |
| |
| <li rev="2.2.0"> |
| <xref href="impala_s3.xml#s3"/>. Queries against data stored in the Amazon Simple Storage Service (S3) |
| have different performance characteristics than when the data is stored in HDFS. |
| </li> |
| </ul> |
| |
| <p outputclass="toc"/> |
| |
| <p conref="../shared/impala_common.xml#common/cookbook_blurb"/> |
| |
| </conbody> |
| |
| <!-- Empty/hidden stub sections that might be worth expanding later. --> |
| |
| <concept id="perf_network" audience="hidden"> |
| |
| <title>Network Traffic</title> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="perf_partition_schema" audience="hidden"> |
| |
| <title>Designing Partitioned Tables</title> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="perf_partition_query" audience="hidden"> |
| |
| <title>Queries on Partitioned Tables</title> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="perf_monitoring" audience="hidden"> |
| |
| <title>Monitoring Performance through the Impala Web Interface</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Monitoring"/> |
| </metadata> |
| </prolog> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="perf_query_coord" audience="hidden"> |
| |
| <title>Query Coordination</title> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="perf_bottlenecks" audience="hidden"> |
| |
| <title>Performance Bottlenecks</title> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="perf_long_queries" audience="hidden"> |
| |
| <title>Managing Long-Running Queries</title> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="perf_load" audience="hidden"> |
| |
| <title>Performance Considerations for Loading Data</title> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="perf_file_formats" audience="hidden"> |
| |
| <title>Performance Considerations for File Formats</title> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="perf_compression" audience="hidden"> |
| |
| <title>Performance Considerations for Compression</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Compression"/> |
| </metadata> |
| </prolog> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="perf_codegen" audience="hidden"> |
| |
| <title>Native Code Generation</title> |
| |
| <conbody/> |
| </concept> |
| </concept> |