| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="prereqs"> |
| |
| <title>Impala Requirements</title> |
| |
| <titlealts audience="PDF"> |
| |
| <navtitle>Requirements</navtitle> |
| |
| </titlealts> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Requirements"/> |
| <data name="Category" value="Planning"/> |
| <data name="Category" value="Installing"/> |
| <data name="Category" value="Upgrading"/> |
| <data name="Category" value="Administrators"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| <!-- Another instance of a topic pulled into the map twice, resulting in a second HTML page with a *1.html filename. --> |
| <data name="Category" value="Duplicate Topics"/> |
| <!-- Using a separate category, 'Multimap', to flag those pages that are duplicate because of multiple DITA map references. --> |
| <data name="Category" value="Multimap"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| To perform as expected, Impala depends on the availability of the software, hardware, and |
| configurations described in the following sections. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| |
| <concept id="prereqs_os"> |
| |
| <title>Supported Operating Systems</title> |
| |
| <conbody> |
| |
| <p> |
| Apache Impala runs on Linux systems only. See the <filepath>README.md</filepath> file |
| for more information. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="prereqs_hive"> |
| |
| <title>Hive Metastore and Related Configuration</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Metastore"/> |
| <data name="Category" value="Hive"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| Impala can interoperate with data stored in Hive, and uses the same infrastructure as |
| Hive for tracking metadata about schema objects such as tables and columns. The |
| following components are prerequisites for Impala: |
| </p> |
| |
| <ul> |
| <li> |
| MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive. |
| <p> |
| Always configure a <b>Hive metastore service</b> rather than connecting directly to |
| the metastore database. The metastore service lets clients that use different levels |
| of the metastore APIs interoperate, if that is necessary in your environment, and |
| using it avoids known issues that arise from connecting directly to the metastore |
| database. |
| </p> |
| <p> |
| See below for a summary of the metastore installation process. |
| </p> |
| </li> |
| |
| <li> |
| Hive (optional). Although only the Hive metastore database is required for Impala to |
| function, you might install Hive on some client machines to create and load data into |
| tables that use certain file formats. See |
| <xref href="impala_file_formats.xml#file_formats"/> for details. Hive does not need to |
| be installed on the same DataNodes as Impala; it just needs access to the same |
| metastore database. |
| </li> |
| </ul> |
| |
| <p> |
| To install the metastore (an illustrative example follows these steps): |
| </p> |
| |
| <ol> |
| <li> |
| Install a MySQL or PostgreSQL database. Start the database service if it does not |
| start automatically after installation. |
| </li> |
| |
| <li> |
| Download the |
| <xref href="http://www.mysql.com/products/connector/" |
| scope="external" format="html">MySQL |
| connector</xref> or the |
| <xref |
| href="http://jdbc.postgresql.org/download.html" scope="external" |
| format="html">PostgreSQL |
| connector</xref> and place it in the <codeph>/usr/share/java/</codeph> directory. |
| </li> |
| |
| <li> |
| Use the appropriate command line tool for your database to create the metastore |
| database. |
| </li> |
| |
| <li> |
| Use the appropriate command line tool for your database to grant privileges for the |
| metastore database to the <codeph>hive</codeph> user. |
| </li> |
| |
| <li> |
| Modify <codeph>hive-site.xml</codeph> to include information matching your particular |
| database: its URL, username, and password. You will copy the |
| <codeph>hive-site.xml</codeph> file to the Impala Configuration Directory later in the |
| Impala installation process. |
| </li> |
| </ol> |
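| |
| <p> |
| The exact commands depend on which database you use. The following is a minimal sketch |
| assuming MySQL, with placeholder values for the metastore database name, the |
| <codeph>hive</codeph> user's password, and the host names; adjust them for your |
| environment: |
| </p> |
| |
| <codeblock>$ mysql -u root -p |
| mysql&gt; CREATE DATABASE metastore; |
| mysql&gt; CREATE USER 'hive'@'%' IDENTIFIED BY 'hive_password'; |
| mysql&gt; GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'%'; |
| mysql&gt; FLUSH PRIVILEGES;</codeblock> |
| |
| <p> |
| The matching <codeph>hive-site.xml</codeph> entries would then point at that database, |
| for example: |
| </p> |
| |
| <codeblock>&lt;property&gt; |
| &lt;name&gt;javax.jdo.option.ConnectionURL&lt;/name&gt; |
| &lt;value&gt;jdbc:mysql://metastore-host.example.com/metastore&lt;/value&gt; |
| &lt;/property&gt; |
| &lt;property&gt; |
| &lt;name&gt;javax.jdo.option.ConnectionUserName&lt;/name&gt; |
| &lt;value&gt;hive&lt;/value&gt; |
| &lt;/property&gt; |
| &lt;property&gt; |
| &lt;name&gt;javax.jdo.option.ConnectionPassword&lt;/name&gt; |
| &lt;value&gt;hive_password&lt;/value&gt; |
| &lt;/property&gt;</codeblock> |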
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="prereqs_java"> |
| |
| <title>Java Dependencies</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Java"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| Although Impala is primarily written in C++, it does use Java to communicate with |
| various Hadoop components: |
| </p> |
| |
| <ul> |
| <li> |
| The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause |
| issues, typically resulting in a failure at <cmdname>impalad</cmdname> startup. In |
| particular, the JamVM used by default on certain Ubuntu releases can cause |
| <cmdname>impalad</cmdname> to fail to start. |
| </li> |
| |
| <li> |
| Internally, the <cmdname>impalad</cmdname> daemon relies on the |
| <codeph>JAVA_HOME</codeph> environment variable to locate the system Java libraries. |
| Make sure the <cmdname>impalad</cmdname> service is not run from an environment with |
| an incorrect setting for this variable; a brief example follows this list. |
| </li> |
| |
| <li> |
| All Java dependencies are packaged in the <codeph>impala-dependencies.jar</codeph> |
| file, which is located at <codeph>/usr/lib/impala/lib/</codeph>. These map to |
| everything that is built under <codeph>fe/target/dependency</codeph>. |
| </li> |
| </ul> |
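| |
| <p> |
| For example, a minimal sketch of pointing <cmdname>impalad</cmdname> at a specific JDK; |
| the path shown is hypothetical and varies by system and JDK version: |
| </p> |
| |
| <codeblock># Set JAVA_HOME in the environment that starts the impalad service. |
| export JAVA_HOME=/usr/java/jdk1.8.0_141   # hypothetical JDK install path |
| echo $JAVA_HOME                           # verify the setting before starting impalad</codeblock> |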
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="prereqs_network"> |
| |
| <title>Networking Configuration Requirements</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Network"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| As part of ensuring best performance, Impala attempts to complete tasks on local data, |
| as opposed to using network connections to work with remote data. To support this goal, |
| Impala matches the <b>hostname</b> provided to each Impala daemon with the <b>IP |
| address</b> of each DataNode by resolving the hostname flag to an IP address. For |
| Impala to work with local data, use a single IP interface for the DataNode and the |
| Impala daemon on each machine. Ensure that the Impala daemon's hostname flag resolves to |
| the IP address of the DataNode. For single-homed machines, this is usually automatic, |
| but for multi-homed machines, ensure that the Impala daemon's hostname resolves to the |
| correct interface. Impala tries to detect the correct hostname at start-up, and prints |
| the derived hostname at the start of the log in a message of the form: |
| </p> |
| |
| <codeblock>Using hostname: impala-daemon-1.example.com</codeblock> |
| |
| <p> |
| In the majority of cases, this automatic detection works correctly. If you need to |
| explicitly set the hostname, do so by setting the <codeph>--hostname</codeph> flag. |
| </p> |
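| |
| <p> |
| For example, a minimal sketch of starting the daemon with an explicit hostname; the |
| host name shown is a placeholder for the interface you want Impala to use: |
| </p> |
| |
| <codeblock>impalad --hostname=impala-daemon-1.example.com</codeblock> |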
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="prereqs_hardware"> |
| |
| <title>Hardware Requirements</title> |
| |
| <conbody> |
| |
| <p> |
| The memory allocation should be consistent across Impala executor nodes. A single Impala |
| executor with a lower memory limit than the rest can easily become a bottleneck and lead |
| to suboptimal performance. |
| </p> |
| |
| <p> |
| This guideline does not apply to coordinator-only nodes. |
| </p> |
| |
| </conbody> |
| |
| <concept id="concept_dqy_n1w_zdb"> |
| |
| <title>Hardware Requirements for Optimal Join Performance</title> |
| |
| <conbody> |
| |
| <p> |
| During join operations, portions of data from each joined table are loaded into |
| memory. Data sets can be very large, so ensure your hardware has sufficient memory to |
| accommodate the joins you anticipate completing. |
| </p> |
| |
| <p> |
| While requirements vary according to data set size, the following is generally |
| recommended: |
| </p> |
| |
| <ul> |
| <li rev="2.0.0"> |
| CPU |
| <p> |
| Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in |
| newer processors. A quick way to verify SSSE3 support appears after this list. |
| </p> |
| |
| <note> |
| This required level of processor is the same as in Impala version 1.x. The Impala |
| 2.0 and 2.1 releases had a stricter requirement for the SSE4.1 instruction set, |
| which has now been relaxed. |
| </note> |
| <!-- |
| For best performance use: |
| <ul> |
| <li> |
| Intel - Nehalem (released 2008) or later processors. |
| </li> |
| |
| <li> |
| AMD - Bulldozer (released 2011) or later processors. |
| </li> |
| </ul> |
| --> |
| </li> |
| |
| <li rev="1.2"> |
| Memory |
| <p> |
| 128 GB or more recommended, ideally 256 GB or more. If the intermediate results |
| during query processing on a particular node exceed the amount of memory available |
| to Impala on that node, the query writes temporary work data to disk, which can |
| lead to long query times. Note that because the work is parallelized, and |
| intermediate results for aggregate queries are typically smaller than the original |
| data, Impala can query and join tables that are much larger than the memory |
| available on an individual node. |
| </p> |
| </li> |
| |
| <li> |
| JVM Heap Size for Catalog Server |
| <p> |
| 4 GB or more recommended, ideally 8 GB or more, to accommodate the number of tables, |
| partitions, and data files you plan to use with Impala. |
| </p> |
| </li> |
| |
| <li> |
| Storage |
| <p> |
| DataNodes with 12 or more disks each. Disk I/O speed is often the limiting factor for |
| Impala performance. Ensure that you have sufficient disk space to store the data that |
| Impala will be querying. |
| </p> |
| </li> |
| </ul> |
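| |
| <p> |
| To verify SSSE3 support, you can check the CPU flags reported by the operating system. |
| A minimal sketch for Linux: |
| </p> |
| |
| <codeblock># Prints "ssse3" once if the instruction set is available on this host. |
| grep -m1 -o ssse3 /proc/cpuinfo</codeblock> |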
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| <concept id="prereqs_account"> |
| |
| <title>User Account Requirements</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Users"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| Impala creates and uses a user and group named <codeph>impala</codeph>. Do not delete |
| this account or group and do not modify the account's or group's permissions and rights. |
| Ensure no existing systems obstruct the functioning of these accounts and groups. For |
| example, if you have scripts that delete user accounts not in a white-list, add these |
| accounts to the list of permitted accounts. |
| </p> |
| |
| <p> |
| For correct file deletion during <codeph>DROP TABLE</codeph> operations, Impala must be |
| able to move files to the HDFS trashcan. You might need to create an HDFS directory |
| <filepath>/user/impala</filepath>, writable by the <codeph>impala</codeph> user, so |
| that the trashcan can be created. Otherwise, data files might remain behind after a |
| <codeph>DROP TABLE</codeph> statement. |
| </p> |
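| |
| <p> |
| A minimal sketch of creating that directory, assuming you run the commands as a user |
| with HDFS superuser privileges: |
| </p> |
| |
| <codeblock>hdfs dfs -mkdir -p /user/impala |
| hdfs dfs -chown impala /user/impala</codeblock> |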
| |
| <p> |
| Impala should not run as root. Best Impala performance is achieved using direct reads, |
| but root is not permitted to use direct reads. Therefore, running Impala as root |
| negatively affects performance. |
| </p> |
| |
| <p> |
| By default, any user can connect to Impala and access all the associated databases and |
| tables. You can enable authorization and authentication based on the Linux OS user who |
| connects to the Impala server, and the associated groups for that user. See |
| <xref href="impala_security.xml#security"/> for details. These security features do not |
| change the underlying file permission requirements; the <codeph>impala</codeph> user |
| still needs to be able to access the data files. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |