| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="prereqs"> |
| |
| <title>Impala Requirements</title> |
| <titlealts audience="PDF"><navtitle>Requirements</navtitle></titlealts> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Requirements"/> |
| <data name="Category" value="Planning"/> |
| <data name="Category" value="Installing"/> |
| <data name="Category" value="Upgrading"/> |
| <data name="Category" value="Administrators"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| <!-- Another instance of a topic pulled into the map twice, resulting in a second HTML page with a *1.html filename. --> |
| <data name="Category" value="Duplicate Topics"/> |
| <!-- Using a separate category, 'Multimap', to flag those pages that are duplicate because of multiple DITA map references. --> |
| <data name="Category" value="Multimap"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="Cloudera">prerequisites</indexterm> |
| <indexterm audience="Cloudera">requirements</indexterm> |
| To perform as expected, Impala depends on the availability of the software, hardware, and configurations |
| described in the following sections. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| </conbody> |
| |
| <concept id="product_compatibility_matrix"> |
| |
| <title>Product Compatibility Matrix</title> |
| |
| <conbody> |
| |
| <p> The authoritative source for compatibility information across |
| versions of CDH, Cloudera Manager, and individual CDH components is the <ph |
| audience="integrated"><xref |
| href="rn_consolidated_pcm.xml" |
| >Product Compatibility Matrix for CDH and Cloudera |
| Manager</xref></ph><ph audience="standalone">online <xref |
| href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html" |
| format="html" scope="external">Product Compatibility |
| Matrix</xref></ph>. </p> |
| |
| <p> |
| For Impala, see the |
| <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/pcm_impala.html" scope="external" format="html">Impala |
| compatibility matrix page</xref>. |
| </p> |
| </conbody> |
| </concept> |
| |
| <concept id="prereqs_os"> |
| |
| <title>Supported Operating Systems</title> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="Cloudera">software requirements</indexterm> |
| <indexterm audience="Cloudera">Red Hat Enterprise Linux</indexterm> |
| <indexterm audience="Cloudera">RHEL</indexterm> |
| <indexterm audience="Cloudera">CentOS</indexterm> |
| <indexterm audience="Cloudera">SLES</indexterm> |
| <indexterm audience="Cloudera">Ubuntu</indexterm> |
| <indexterm audience="Cloudera">SUSE</indexterm> |
| <indexterm audience="Cloudera">Debian</indexterm> The relevant supported operating systems |
| and versions for Impala are the same as for the corresponding CDH 5 platforms. For |
| details, see the <cite>Supported Operating Systems</cite> page for |
| <ph audience="integrated"><xref href="rn_consolidated_pcm.xml#cdh_cm_supported_os">CDH |
| 5</xref></ph><ph audience="standalone"><xref |
| href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html#cdh_cm_supported_os" |
| scope="external" format="html">CDH 5</xref></ph>. </p> |
| </conbody> |
| </concept> |
| |
| <concept id="prereqs_hive"> |
| |
| <title>Hive Metastore and Related Configuration</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Metastore"/> |
| <data name="Category" value="Hive"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="Cloudera">Hive</indexterm> |
| <indexterm audience="Cloudera">MySQL</indexterm> |
| <indexterm audience="Cloudera">PostgreSQL</indexterm> |
| Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking |
| metadata about schema objects such as tables and columns. The following components are prerequisites for |
| Impala: |
| </p> |
| |
| <ul> |
| <li> |
| MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive. |
| <note> |
| <p> |
| Installing and configuring a Hive metastore is an Impala requirement. Impala does not work without |
| the metastore database. For the process of installing and configuring the metastore, see |
| <xref href="impala_install.xml#install"/>. |
| </p> |
| <p> |
| Always configure a <b>Hive metastore service</b> rather than connecting directly to the metastore |
| database. The Hive metastore service is required to interoperate between possibly different levels of |
| metastore APIs used by CDH and Impala, and avoids known issues with connecting directly to the |
| metastore database. The Hive metastore service is set up for you by default if you install through |
| Cloudera Manager 4.5 or higher. |
| </p> |
| <p> |
| A summary of the metastore installation process is as follows: |
| </p> |
| <ul> |
| <li> |
| Install a MySQL or PostgreSQL database. Start the database if it is not started after installation. |
| </li> |
| |
| <li> |
| Download the |
| <xref href="http://www.mysql.com/products/connector/" scope="external" format="html">MySQL |
| connector</xref> or the |
| <xref href="http://jdbc.postgresql.org/download.html" scope="external" format="html">PostgreSQL |
| connector</xref> and place it in the <codeph>/usr/share/java/</codeph> directory. |
| </li> |
| |
| <li> |
| Use the appropriate command line tool for your database to create the metastore database. |
| </li> |
| |
| <li> |
| Use the appropriate command line tool for your database to grant privileges for the metastore |
| database to the <codeph>hive</codeph> user. |
| </li> |
| |
| <li> |
| Modify <codeph>hive-site.xml</codeph> to include the connection details for your particular database: its |
| URL, username, and password. Later in the Impala installation process, you copy the <codeph>hive-site.xml</codeph> |
| file to the Impala configuration directory. |
| </li> |
| </ul> |
| </note> |
| </li> |
| |
| <li> |
| <b>Optional:</b> Hive. Although only the Hive metastore database is required for Impala to function, you |
| might install Hive on some client machines to create and load data into tables that use certain file |
| formats. See <xref href="impala_file_formats.xml#file_formats"/> for details. Hive does not need to be |
| installed on the same DataNodes as Impala; it just needs access to the same metastore database. |
| </li> |
| </ul> |
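| <p> |
| As an illustration of the <codeph>hive-site.xml</codeph> step in the summary above, the following Python |
| sketch renders the database URL, username, and password as metastore connection properties. It assumes the |
| standard Hive <codeph>javax.jdo.option.*</codeph> property names. The driver class property, the host name, |
| and the credentials are illustrative placeholders that do not come from this topic; substitute the values |
| for your own database. |
| </p> |
| <codeblock> |
```python
import xml.etree.ElementTree as ET


def metastore_properties(jdbc_url, driver, username, password):
    """Render metastore connection settings as a hive-site.xml style fragment."""
    root = ET.Element("configuration")
    settings = [
        ("javax.jdo.option.ConnectionURL", jdbc_url),
        # Assumption: the driver class is set alongside the URL and credentials.
        ("javax.jdo.option.ConnectionDriverName", driver),
        ("javax.jdo.option.ConnectionUserName", username),
        ("javax.jdo.option.ConnectionPassword", password),
    ]
    for name, value in settings:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values only; none of these come from a real deployment.
    print(metastore_properties(
        "jdbc:mysql://metastore-db.example.com:3306/metastore",
        "com.mysql.jdbc.Driver",
        "hive",
        "change-me",
    ))
```
| </codeblock> |
| <p> |
| The printed properties are meant to be merged by hand into an existing <codeph>hive-site.xml</codeph>, not |
| to replace the file. |
| </p> |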
| </conbody> |
| </concept> |
| |
| <concept id="prereqs_java"> |
| |
| <title>Java Dependencies</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Java"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="Cloudera">Java</indexterm> |
| <indexterm audience="Cloudera">impala-dependencies.jar</indexterm> |
| Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop |
| components: |
| </p> |
| |
| <ul> |
| <li> |
| The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause issues, typically |
| resulting in a failure at <cmdname>impalad</cmdname> startup. In particular, the JamVM used by default on |
| certain Ubuntu releases can cause <cmdname>impalad</cmdname> to fail to start. |
| <!-- To do: |
| Could say something here about JDK 6 vs. JDK 7 in CDH 5. Since we didn't specify the JDK version before, |
| don't know the impact from the user perspective so not calling it out at the moment. |
| --> |
| </li> |
| |
| <li> |
| Internally, the <cmdname>impalad</cmdname> daemon relies on the <codeph>JAVA_HOME</codeph> environment |
| variable to locate the system Java libraries. Make sure the <cmdname>impalad</cmdname> service is not run |
| from an environment with an incorrect setting for this variable. |
| </li> |
| |
| <li> |
| All Java dependencies are packaged in the <codeph>impala-dependencies.jar</codeph> file, which is located |
| at <codeph>/usr/lib/impala/lib/</codeph>. These map to everything that is built under |
| <codeph>fe/target/dependency</codeph>. |
| </li> |
| </ul> |
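| <p> |
| Because <cmdname>impalad</cmdname> relies on <codeph>JAVA_HOME</codeph>, it can help to check the variable |
| in the environment that starts the service. The following Python sketch is one possible check, not part of |
| Impala. The function name <codeph>check_java_home</codeph> is hypothetical, and the test for an executable |
| <filepath>bin/java</filepath> under <codeph>JAVA_HOME</codeph> is an assumption about a typical JVM layout. |
| </p> |
| <codeblock> |
```python
import os


def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME looks usable."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return False, "JAVA_HOME is not a directory: " + java_home
    # Assumption: a usable Java installation has an executable bin/java.
    java_bin = os.path.join(java_home, "bin", "java")
    if not os.access(java_bin, os.X_OK):
        return False, "no executable found at " + java_bin
    return True, "JAVA_HOME looks usable: " + java_home


if __name__ == "__main__":
    ok, message = check_java_home()
    print(("OK: " if ok else "PROBLEM: ") + message)
```
| </codeblock> |
| <p> |
| Run the check as the same user and from the same environment that launches <cmdname>impalad</cmdname>, |
| because an interactive shell can have a different <codeph>JAVA_HOME</codeph> setting. |
| </p> |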
| </conbody> |
| </concept> |
| |
| <concept id="prereqs_network"> |
| |
| <title>Networking Configuration Requirements</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Network"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="Cloudera">network configuration</indexterm> |
| As part of ensuring best performance, Impala attempts to complete tasks on local data, as opposed to using |
| network connections to work with remote data. To support this goal, Impala matches |
| the <b>hostname</b> provided to each Impala daemon with the <b>IP address</b> of each DataNode by |
| resolving the hostname flag to an IP address. For Impala to work with local data, use a single IP interface |
| for the DataNode and the Impala daemon on each machine. Ensure that the Impala daemon's hostname flag |
| resolves to the IP address of the DataNode. For single-homed machines, this is usually automatic, but for |
| multi-homed machines, ensure that the Impala daemon's hostname resolves to the correct interface. Impala |
| tries to detect the correct hostname at start-up, and prints the derived hostname at the start of the log |
| in a message of the form: |
| </p> |
| |
| <codeblock>Using hostname: impala-daemon-1.example.com</codeblock> |
| |
| <p> |
| In the majority of cases, this automatic detection works correctly. If you need to explicitly set the |
| hostname, do so by setting the <codeph>--hostname</codeph> flag. |
| </p> |
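| <p> |
| To confirm that a hostname resolves to the DataNode's address, you can perform the same kind of lookup |
| yourself. The following Python sketch uses the standard <codeph>socket</codeph> module. The function names |
| and the command-line arguments are hypothetical, and the lookup is only an approximation of the resolution |
| that Impala performs internally. |
| </p> |
| <codeblock> |
```python
import socket
import sys


def resolve_ipv4(hostname, resolver=socket.gethostbyname_ex):
    """Return the list of IPv4 addresses that hostname resolves to."""
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        # socket.gaierror and socket.herror are subclasses of OSError.
        return []
    return addresses


def hostname_matches_datanode(hostname, datanode_ip,
                              resolver=socket.gethostbyname_ex):
    """True if hostname resolves to the IP address used by the DataNode."""
    return datanode_ip in resolve_ipv4(hostname, resolver)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_host.py HOSTNAME DATANODE_IP
    host, ip = sys.argv[1], sys.argv[2]
    print(host, "resolves to", resolve_ipv4(host))
    print("matches DataNode address:", hostname_matches_datanode(host, ip))
```
| </codeblock> |
| <p> |
| On a multi-homed machine, a result of <codeph>False</codeph> suggests that the hostname resolves to a |
| different interface than the one the DataNode uses. |
| </p> |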
| </conbody> |
| </concept> |
| |
| <concept id="prereqs_hardware"> |
| |
| <title>Hardware Requirements</title> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="Cloudera">hardware requirements</indexterm> |
| <indexterm audience="Cloudera">capacity</indexterm> |
| <indexterm audience="Cloudera">RAM</indexterm> |
| <indexterm audience="Cloudera">memory</indexterm> |
| <indexterm audience="Cloudera">CPU</indexterm> |
| <indexterm audience="Cloudera">processor</indexterm> |
| <indexterm audience="Cloudera">Intel</indexterm> |
| <indexterm audience="Cloudera">AMD</indexterm> |
| During join operations, portions of data from each joined table are loaded into memory. Data sets can be |
| very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate |
| completing. |
| </p> |
| |
| <p> |
| While requirements vary according to data set size, the following is generally recommended: |
| </p> |
| |
| <ul> |
| <li rev="2.0.0"> |
| CPU - Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer processors. |
| <note> |
| This required level of processor is the same as in Impala version 1.x. The Impala 2.0 and 2.1 releases |
| had a stricter requirement for the SSE4.1 instruction set, which has now been relaxed. |
| </note> |
| <!-- |
| For best performance use: |
| <ul> |
| <li> |
| Intel - Nehalem (released 2008) or later processors. |
| </li> |
| |
| <li> |
| AMD - Bulldozer (released 2011) or later processors. |
| </li> |
| </ul> |
| --> |
| </li> |
| |
| <li rev="1.2"> |
| Memory - 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query |
| processing on a particular node exceed the amount of memory available to Impala on that node, the query |
| writes temporary work data to disk, which can lead to long query times. Note that because the work is |
| parallelized, and intermediate results for aggregate queries are typically smaller than the original |
| data, Impala can query and join tables that are much larger than the memory available on an individual |
| node. |
| </li> |
| |
| <li> |
| Storage - DataNodes with 12 or more disks each. Disk I/O speed is often the limiting factor for |
| Impala performance. Ensure that you have sufficient disk space to store the data that Impala |
| queries. |
| </li> |
| </ul> |
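| <p> |
| One way to check the CPU requirement on a host is to look for the <codeph>ssse3</codeph> flag that the |
| Linux kernel reports. The following Python sketch assumes an x86 Linux system where |
| <filepath>/proc/cpuinfo</filepath> contains a <codeph>flags</codeph> line. The function names are |
| hypothetical, and other platforms report CPU features differently. |
| </p> |
| <codeblock> |
```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_ssse3(cpuinfo_text):
    """True if the ssse3 flag appears among the reported CPU flags."""
    return "ssse3" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        # Not a Linux host, or /proc is unavailable.
        text = ""
    print("SSSE3 reported:", supports_ssse3(text))
```
| </codeblock> |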
| </conbody> |
| </concept> |
| |
| <concept id="prereqs_account"> |
| |
| <title>User Account Requirements</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Users"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="Cloudera">impala user</indexterm> |
| <indexterm audience="Cloudera">impala group</indexterm> |
| <indexterm audience="Cloudera">root user</indexterm> |
| Impala creates and uses a user and group named <codeph>impala</codeph>. Do not delete this account or group, |
| and do not modify the permissions and rights of either. Ensure that no existing systems obstruct the |
| functioning of this account and group. For example, if you have scripts that delete user accounts not in |
| a whitelist, add the <codeph>impala</codeph> account to the list of permitted accounts. |
| </p> |
| |
| <!-- Taking out because no longer applicable in CDH 5.5 and up. --> |
| <p id="impala_hdfs_group" rev="1.2" audience="Cloudera"> |
| For the resource management feature to work (in combination with CDH 5 and the YARN and Llama components), |
| the <codeph>impala</codeph> user must be a member of the <codeph>hdfs</codeph> group. This setup is |
| performed automatically during a new install, but not when upgrading from earlier Impala releases to Impala |
| 1.2. If you are upgrading a node to CDH 5 that already had Impala 1.1 or 1.0 installed, manually add the |
| <codeph>impala</codeph> user to the <codeph>hdfs</codeph> group. |
| </p> |
| |
| <p> |
| For correct file deletion during <codeph>DROP TABLE</codeph> operations, Impala must be able to move files |
| to the HDFS trashcan. You might need to create an HDFS directory <filepath>/user/impala</filepath>, |
| writeable by the <codeph>impala</codeph> user, so that the trashcan can be created. Otherwise, data files |
| might remain behind after a <codeph>DROP TABLE</codeph> statement. |
| </p> |
| |
| <p> |
| Impala should not run as root. Best Impala performance is achieved using direct reads, but root is not |
| permitted to use direct reads. Therefore, running Impala as root negatively affects performance. |
| </p> |
| |
| <p> |
| By default, any user can connect to Impala and access all the associated databases and tables. You can |
| enable authorization and authentication based on the Linux OS user who connects to the Impala server, and |
| the associated groups for that user. See <xref href="impala_security.xml#security"/> for details. These |
| security features do not change the underlying file permission requirements; the <codeph>impala</codeph> |
| user still needs to be able to access the data files. |
| </p> |
| </conbody> |
| </concept> |
| </concept> |