docs/topics/impala_prereqs.xml - impala - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="prereqs">

   <title>Impala Requirements</title>
   <titlealts audience="PDF"><navtitle>Requirements</navtitle></titlealts>

   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Requirements"/>
       <data name="Category" value="Planning"/>
       <data name="Category" value="Installing"/>
       <data name="Category" value="Upgrading"/>
       <data name="Category" value="Administrators"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Data Analysts"/>
       <!-- Another instance of a topic pulled into the map twice, resulting in a second HTML page with a *1.html filename. -->
       <data name="Category" value="Duplicate Topics"/>
       <!-- Using a separate category, 'Multimap', to flag those pages that are duplicate because of multiple DITA map references. -->
       <data name="Category" value="Multimap"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       <indexterm audience="Cloudera">prerequisites</indexterm>
       <indexterm audience="Cloudera">requirements</indexterm>
       To perform as expected, Impala depends on the availability of the software, hardware, and configurations
       described in the following sections.
     </p>

     <p outputclass="toc inpage"/>
   </conbody>

   <concept id="product_compatibility_matrix">

     <title>Product Compatibility Matrix</title>

     <conbody>

       <p> The ultimate source of truth about compatibility between various
         versions of CDH, Cloudera Manager, and various CDH components is the <ph
           audience="integrated"><xref
             href="rn_consolidated_pcm.xml"
             >Product Compatibility Matrix for CDH and Cloudera
           Manager</xref></ph><ph audience="standalone">online <xref
             href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html"
             format="html" scope="external">Product Compatibility
           Matrix</xref></ph>. </p>

       <p>
         For Impala, see the
         <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/pcm_impala.html" scope="external" format="html">Impala
         compatibility matrix page</xref>.
       </p>
     </conbody>
   </concept>

   <concept id="prereqs_os">

     <title>Supported Operating Systems</title>

     <conbody>

       <p>
         <indexterm audience="Cloudera">software requirements</indexterm>
         <indexterm audience="Cloudera">Red Hat Enterprise Linux</indexterm>
         <indexterm audience="Cloudera">RHEL</indexterm>
         <indexterm audience="Cloudera">CentOS</indexterm>
         <indexterm audience="Cloudera">SLES</indexterm>
         <indexterm audience="Cloudera">Ubuntu</indexterm>
         <indexterm audience="Cloudera">SUSE</indexterm>
         <indexterm audience="Cloudera">Debian</indexterm> The relevant supported operating systems
         and versions for Impala are the same as for the corresponding CDH 5 platforms. For
         details, see the <cite>Supported Operating Systems</cite> page for
         <ph audience="integrated"><xref href="rn_consolidated_pcm.xml#cdh_cm_supported_os">CDH
             5</xref></ph><ph audience="standalone"><xref
             href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html#cdh_cm_supported_os"
             scope="external" format="html">CDH 5</xref></ph>. </p>
     </conbody>
   </concept>

   <concept id="prereqs_hive">

     <title>Hive Metastore and Related Configuration</title>
   <prolog>
     <metadata>
       <data name="Category" value="Metastore"/>
       <data name="Category" value="Hive"/>
     </metadata>
   </prolog>

     <conbody>

       <p>
         <indexterm audience="Cloudera">Hive</indexterm>
         <indexterm audience="Cloudera">MySQL</indexterm>
         <indexterm audience="Cloudera">PostgreSQL</indexterm>
         Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking
         metadata about schema objects such as tables and columns. The following components are prerequisites for
         Impala:
       </p>

       <ul>
         <li>
           MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive.
           <note>
             <p>
               Installing and configuring a Hive metastore is an Impala requirement. Impala does not work without
               the metastore database. For the process of installing and configuring the metastore, see
               <xref href="impala_install.xml#install"/>.
             </p>
             <p>
               Always configure a <b>Hive metastore service</b> rather than connecting directly to the metastore
               database. The Hive metastore service is required to interoperate between possibly different levels of
               metastore APIs used by CDH and Impala, and avoids known issues with connecting directly to the
               metastore database. The Hive metastore service is set up for you by default if you install through
               Cloudera Manager 4.5 or higher.
             </p>
             <p>
               A summary of the metastore installation process is as follows:
             </p>
             <ul>
               <li>
                 Install a MySQL or PostgreSQL database. Start the database if it is not started after installation.
               </li>

               <li>
                 Download the
                 <xref href="http://www.mysql.com/products/connector/" scope="external" format="html">MySQL
                 connector</xref> or the
                 <xref href="http://jdbc.postgresql.org/download.html" scope="external" format="html">PostgreSQL
                 connector</xref> and place it in the <codeph>/usr/share/java/</codeph> directory.
               </li>

               <li>
                 Use the appropriate command line tool for your database to create the metastore database.
               </li>

               <li>
                 Use the appropriate command line tool for your database to grant privileges for the metastore
                 database to the <codeph>hive</codeph> user.
               </li>

               <li>
                 Modify <codeph>hive-site.xml</codeph> to include information matching your particular database: its
                 URL, username, and password. You will copy the <codeph>hive-site.xml</codeph> file to the Impala
                 Configuration Directory later in the Impala installation process.
               </li>
             </ul>
           </note>
         </li>

         <li>
           <b>Optional:</b> Hive. Although only the Hive metastore database is required for Impala to function, you
           might install Hive on some client machines to create and load data into tables that use certain file
           formats. See <xref href="impala_file_formats.xml#file_formats"/> for details. Hive does not need to be
           installed on the same DataNodes as Impala; it just needs access to the same metastore database.
         </li>
       </ul>
     </conbody>
   </concept>

   <concept id="prereqs_java">

     <title>Java Dependencies</title>
   <prolog>
     <metadata>
       <data name="Category" value="Java"/>
     </metadata>
   </prolog>

     <conbody>

       <p>
         <indexterm audience="Cloudera">Java</indexterm>
         <indexterm audience="Cloudera">impala-dependencies.jar</indexterm>
         Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop
         components:
       </p>

       <ul>
         <li>
           The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause issues, typically
           resulting in a failure at <cmdname>impalad</cmdname> startup. In particular, the JamVM used by default on
           certain levels of Ubuntu systems can cause <cmdname>impalad</cmdname> to fail to start.
           <!-- To do:
             Could say something here about JDK 6 vs. JDK 7 in CDH 5. Since we didn't specify the JDK version before,
             don't know the impact from the user perspective so not calling it out at the moment.
           -->
         </li>

         <li>
           Internally, the <cmdname>impalad</cmdname> daemon relies on the <codeph>JAVA_HOME</codeph> environment
           variable to locate the system Java libraries. Make sure the <cmdname>impalad</cmdname> service is not run
           from an environment with an incorrect setting for this variable.
         </li>

         <li>
           All Java dependencies are packaged in the <codeph>impala-dependencies.jar</codeph> file, which is located
           at <codeph>/usr/lib/impala/lib/</codeph>. These map to everything that is built under
           <codeph>fe/target/dependency</codeph>.
         </li>
       </ul>
     </conbody>
   </concept>

   <concept id="prereqs_network">

     <title>Networking Configuration Requirements</title>
   <prolog>
     <metadata>
       <data name="Category" value="Network"/>
     </metadata>
   </prolog>

     <conbody>

       <p>
         <indexterm audience="Cloudera">network configuration</indexterm>
         As part of ensuring best performance, Impala attempts to complete tasks on local data, as opposed to using
         network connections to work with remote data. To support this goal, Impala matches
         the <b>hostname</b> provided to each Impala daemon with the <b>IP address</b> of each DataNode by
         resolving the hostname flag to an IP address. For Impala to work with local data, use a single IP interface
         for the DataNode and the Impala daemon on each machine. Ensure that the Impala daemon's hostname flag
         resolves to the IP address of the DataNode. For single-homed machines, this is usually automatic, but for
         multi-homed machines, ensure that the Impala daemon's hostname resolves to the correct interface. Impala
         tries to detect the correct hostname at start-up, and prints the derived hostname at the start of the log
         in a message of the form:
       </p>

 <codeblock>Using hostname: impala-daemon-1.example.com</codeblock>

       <p>
         In the majority of cases, this automatic detection works correctly. If you need to explicitly set the
         hostname, do so by setting the <codeph>--hostname</codeph> flag.
       </p>
     </conbody>
   </concept>

   <concept id="prereqs_hardware">

     <title>Hardware Requirements</title>

     <conbody>

       <p>
         <indexterm audience="Cloudera">hardware requirements</indexterm>
         <indexterm audience="Cloudera">capacity</indexterm>
         <indexterm audience="Cloudera">RAM</indexterm>
         <indexterm audience="Cloudera">memory</indexterm>
         <indexterm audience="Cloudera">CPU</indexterm>
         <indexterm audience="Cloudera">processor</indexterm>
         <indexterm audience="Cloudera">Intel</indexterm>
         <indexterm audience="Cloudera">AMD</indexterm>
         During join operations, portions of data from each joined table are loaded into memory. Data sets can be
         very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate
         completing.
       </p>

       <p>
         While requirements vary according to data set size, the following is generally recommended:
       </p>

       <ul>
         <li rev="2.0.0">
           CPU - Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer processors.
           <note>
             This required level of processor is the same as in Impala version 1.x. The Impala 2.0 and 2.1 releases
             had a stricter requirement for the SSE4.1 instruction set, which has now been relaxed.
           </note>
 <!--
           For best performance use:
           <ul>
             <li>
               Intel - Nehalem (released 2008) or later processors.
             </li>

             <li>
               AMD - Bulldozer (released 2011) or later processors.
             </li>
           </ul>
 -->
         </li>

         <li rev="1.2">
           Memory - 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query
           processing on a particular node exceed the amount of memory available to Impala on that node, the query
           writes temporary work data to disk, which can lead to long query times. Note that because the work is
           parallelized, and intermediate results for aggregate queries are typically smaller than the original
           data, Impala can query and join tables that are much larger than the memory available on an individual
           node.
         </li>

         <li>
           Storage - DataNodes with 12 or more disks each. I/O speeds are often the limiting factor for disk
           performance with Impala. Ensure that you have sufficient disk space to store the data Impala will be
           querying.
         </li>
       </ul>
     </conbody>
   </concept>

   <concept id="prereqs_account">

     <title>User Account Requirements</title>
   <prolog>
     <metadata>
       <data name="Category" value="Users"/>
     </metadata>
   </prolog>

     <conbody>

       <p>
         <indexterm audience="Cloudera">impala user</indexterm>
         <indexterm audience="Cloudera">impala group</indexterm>
         <indexterm audience="Cloudera">root user</indexterm>
         Impala creates and uses a user and group named <codeph>impala</codeph>. Do not delete this account or group
         and do not modify the account's or group's permissions and rights. Ensure no existing systems obstruct the
         functioning of these accounts and groups. For example, if you have scripts that delete user accounts not in
         a white-list, add these accounts to the list of permitted accounts.
       </p>

 <!-- Taking out because no longer applicable in CDH 5.5 and up. -->
       <p id="impala_hdfs_group" rev="1.2" audience="Cloudera">
         For the resource management feature to work (in combination with CDH 5 and the YARN and Llama components),
         the <codeph>impala</codeph> user must be a member of the <codeph>hdfs</codeph> group. This setup is
         performed automatically during a new install, but not when upgrading from earlier Impala releases to Impala
         1.2. If you are upgrading a node to CDH 5 that already had Impala 1.1 or 1.0 installed, manually add the
         <codeph>impala</codeph> user to the <codeph>hdfs</codeph> group.
       </p>

       <p>
         For correct file deletion during <codeph>DROP TABLE</codeph> operations, Impala must be able to move files
         to the HDFS trashcan. You might need to create an HDFS directory <filepath>/user/impala</filepath>,
         writeable by the <codeph>impala</codeph> user, so that the trashcan can be created. Otherwise, data files
         might remain behind after a <codeph>DROP TABLE</codeph> statement.
       </p>

       <p>
         Impala should not run as root. Best Impala performance is achieved using direct reads, but root is not
         permitted to use direct reads. Therefore, running Impala as root negatively affects performance.
       </p>

       <p>
         By default, any user can connect to Impala and access all the associated databases and tables. You can
         enable authorization and authentication based on the Linux OS user who connects to the Impala server, and
         the associated groups for that user. <xref href="impala_security.xml#security"/> for details. These
         security features do not change the underlying file permission requirements; the <codeph>impala</codeph>
         user still needs to be able to access the data files.
       </p>
     </conbody>
   </concept>
 </concept>
	<?xml version="1.0" encoding="UTF-8"?>
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->
	<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
	<concept id="prereqs">

	<title>Impala Requirements</title>
	<titlealts audience="PDF"><navtitle>Requirements</navtitle></titlealts>

	<prolog>
	<metadata>
	<data name="Category" value="Impala"/>
	<data name="Category" value="Requirements"/>
	<data name="Category" value="Planning"/>
	<data name="Category" value="Installing"/>
	<data name="Category" value="Upgrading"/>
	<data name="Category" value="Administrators"/>
	<data name="Category" value="Developers"/>
	<data name="Category" value="Data Analysts"/>
	<!-- Another instance of a topic pulled into the map twice, resulting in a second HTML page with a *1.html filename. -->
	<data name="Category" value="Duplicate Topics"/>
	<!-- Using a separate category, 'Multimap', to flag those pages that are duplicate because of multiple DITA map references. -->
	<data name="Category" value="Multimap"/>
	</metadata>
	</prolog>

	<conbody>

	<p>
	<indexterm audience="Cloudera">prerequisites</indexterm>
	<indexterm audience="Cloudera">requirements</indexterm>
	To perform as expected, Impala depends on the availability of the software, hardware, and configurations
	described in the following sections.
	</p>

	<p outputclass="toc inpage"/>
	</conbody>

	<concept id="product_compatibility_matrix">

	<title>Product Compatibility Matrix</title>

	<conbody>

	<p> The ultimate source of truth about compatibility between various
	versions of CDH, Cloudera Manager, and various CDH components is the <ph
	audience="integrated"><xref
	href="rn_consolidated_pcm.xml"
	>Product Compatibility Matrix for CDH and Cloudera
	Manager</xref></ph><ph audience="standalone">online <xref
	href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html"
	format="html" scope="external">Product Compatibility
	Matrix</xref></ph>. </p>

	<p>
	For Impala, see the
	<xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/pcm_impala.html" scope="external" format="html">Impala
	compatibility matrix page</xref>.
	</p>
	</conbody>
	</concept>

	<concept id="prereqs_os">

	<title>Supported Operating Systems</title>

	<conbody>

	<p>
	<indexterm audience="Cloudera">software requirements</indexterm>
	<indexterm audience="Cloudera">Red Hat Enterprise Linux</indexterm>
	<indexterm audience="Cloudera">RHEL</indexterm>
	<indexterm audience="Cloudera">CentOS</indexterm>
	<indexterm audience="Cloudera">SLES</indexterm>
	<indexterm audience="Cloudera">Ubuntu</indexterm>
	<indexterm audience="Cloudera">SUSE</indexterm>
	<indexterm audience="Cloudera">Debian</indexterm> The relevant supported operating systems
	and versions for Impala are the same as for the corresponding CDH 5 platforms. For
	details, see the <cite>Supported Operating Systems</cite> page for
	<ph audience="integrated"><xref href="rn_consolidated_pcm.xml#cdh_cm_supported_os">CDH
	5</xref></ph><ph audience="standalone"><xref
	href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html#cdh_cm_supported_os"
	scope="external" format="html">CDH 5</xref></ph>. </p>
	</conbody>
	</concept>

	<concept id="prereqs_hive">

	<title>Hive Metastore and Related Configuration</title>
	<prolog>
	<metadata>
	<data name="Category" value="Metastore"/>
	<data name="Category" value="Hive"/>
	</metadata>
	</prolog>

	<conbody>

	<p>
	<indexterm audience="Cloudera">Hive</indexterm>
	<indexterm audience="Cloudera">MySQL</indexterm>
	<indexterm audience="Cloudera">PostgreSQL</indexterm>
	Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking
	metadata about schema objects such as tables and columns. The following components are prerequisites for
	Impala:
	</p>

	<ul>
	<li>
	MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive.
	<note>
	<p>
	Installing and configuring a Hive metastore is an Impala requirement. Impala does not work without
	the metastore database. For the process of installing and configuring the metastore, see
	<xref href="impala_install.xml#install"/>.
	</p>
	<p>
	Always configure a <b>Hive metastore service</b> rather than connecting directly to the metastore
	database. The Hive metastore service is required to interoperate between possibly different levels of
	metastore APIs used by CDH and Impala, and avoids known issues with connecting directly to the
	metastore database. The Hive metastore service is set up for you by default if you install through
	Cloudera Manager 4.5 or higher.
	</p>
	<p>
	A summary of the metastore installation process is as follows:
	</p>
	<ul>
	<li>
	Install a MySQL or PostgreSQL database. Start the database if it is not started after installation.
	</li>

	<li>
	Download the
	<xref href="http://www.mysql.com/products/connector/" scope="external" format="html">MySQL
	connector</xref> or the
	<xref href="http://jdbc.postgresql.org/download.html" scope="external" format="html">PostgreSQL
	connector</xref> and place it in the <codeph>/usr/share/java/</codeph> directory.
	</li>

	<li>
	Use the appropriate command line tool for your database to create the metastore database.
	</li>

	<li>
	Use the appropriate command line tool for your database to grant privileges for the metastore
	database to the <codeph>hive</codeph> user.
	</li>

	<li>
	Modify <codeph>hive-site.xml</codeph> to include information matching your particular database: its
	URL, username, and password. You will copy the <codeph>hive-site.xml</codeph> file to the Impala
	Configuration Directory later in the Impala installation process.
	</li>
	</ul>
	</note>
	</li>

	<li>
	<b>Optional:</b> Hive. Although only the Hive metastore database is required for Impala to function, you
	might install Hive on some client machines to create and load data into tables that use certain file
	formats. See <xref href="impala_file_formats.xml#file_formats"/> for details. Hive does not need to be
	installed on the same DataNodes as Impala; it just needs access to the same metastore database.
	</li>
	</ul>
	</conbody>
	</concept>

	<concept id="prereqs_java">

	<title>Java Dependencies</title>
	<prolog>
	<metadata>
	<data name="Category" value="Java"/>
	</metadata>
	</prolog>

	<conbody>

	<p>
	<indexterm audience="Cloudera">Java</indexterm>
	<indexterm audience="Cloudera">impala-dependencies.jar</indexterm>
	Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop
	components:
	</p>

	<ul>
	<li>
	The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause issues, typically
	resulting in a failure at <cmdname>impalad</cmdname> startup. In particular, the JamVM used by default on
	certain levels of Ubuntu systems can cause <cmdname>impalad</cmdname> to fail to start.
	<!-- To do:
	Could say something here about JDK 6 vs. JDK 7 in CDH 5. Since we didn't specify the JDK version before,
	don't know the impact from the user perspective so not calling it out at the moment.
	-->
	</li>

	<li>
	Internally, the <cmdname>impalad</cmdname> daemon relies on the <codeph>JAVA_HOME</codeph> environment
	variable to locate the system Java libraries. Make sure the <cmdname>impalad</cmdname> service is not run
	from an environment with an incorrect setting for this variable.
	</li>

	<li>
	All Java dependencies are packaged in the <codeph>impala-dependencies.jar</codeph> file, which is located
	at <codeph>/usr/lib/impala/lib/</codeph>. These map to everything that is built under
	<codeph>fe/target/dependency</codeph>.
	</li>
	</ul>
	</conbody>
	</concept>

	<concept id="prereqs_network">

	<title>Networking Configuration Requirements</title>
	<prolog>
	<metadata>
	<data name="Category" value="Network"/>
	</metadata>
	</prolog>

	<conbody>

	<p>
	<indexterm audience="Cloudera">network configuration</indexterm>
	As part of ensuring best performance, Impala attempts to complete tasks on local data, as opposed to using
	network connections to work with remote data. To support this goal, Impala matches
	the <b>hostname</b> provided to each Impala daemon with the <b>IP address</b> of each DataNode by
	resolving the hostname flag to an IP address. For Impala to work with local data, use a single IP interface
	for the DataNode and the Impala daemon on each machine. Ensure that the Impala daemon's hostname flag
	resolves to the IP address of the DataNode. For single-homed machines, this is usually automatic, but for
	multi-homed machines, ensure that the Impala daemon's hostname resolves to the correct interface. Impala
	tries to detect the correct hostname at start-up, and prints the derived hostname at the start of the log
	in a message of the form:
	</p>

	<codeblock>Using hostname: impala-daemon-1.example.com</codeblock>

	<p>
	In the majority of cases, this automatic detection works correctly. If you need to explicitly set the
	hostname, do so by setting the <codeph>--hostname</codeph> flag.
	</p>
	</conbody>
	</concept>

	<concept id="prereqs_hardware">

	<title>Hardware Requirements</title>

	<conbody>

	<p>
	<indexterm audience="Cloudera">hardware requirements</indexterm>
	<indexterm audience="Cloudera">capacity</indexterm>
	<indexterm audience="Cloudera">RAM</indexterm>
	<indexterm audience="Cloudera">memory</indexterm>
	<indexterm audience="Cloudera">CPU</indexterm>
	<indexterm audience="Cloudera">processor</indexterm>
	<indexterm audience="Cloudera">Intel</indexterm>
	<indexterm audience="Cloudera">AMD</indexterm>
	During join operations, portions of data from each joined table are loaded into memory. Data sets can be
	very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate
	completing.
	</p>

	<p>
	While requirements vary according to data set size, the following is generally recommended:
	</p>

	<ul>
	<li rev="2.0.0">
	CPU - Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer processors.
	<note>
	This required level of processor is the same as in Impala version 1.x. The Impala 2.0 and 2.1 releases
	had a stricter requirement for the SSE4.1 instruction set, which has now been relaxed.
	</note>
	<!--
	For best performance use:
	<ul>
	<li>
	Intel - Nehalem (released 2008) or later processors.
	</li>

	<li>
	AMD - Bulldozer (released 2011) or later processors.
	</li>
	</ul>
	-->
	</li>

	<li rev="1.2">
	Memory - 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query
	processing on a particular node exceed the amount of memory available to Impala on that node, the query
	writes temporary work data to disk, which can lead to long query times. Note that because the work is
	parallelized, and intermediate results for aggregate queries are typically smaller than the original
	data, Impala can query and join tables that are much larger than the memory available on an individual
	node.
	</li>

	<li>
	Storage - DataNodes with 12 or more disks each. I/O speeds are often the limiting factor for disk
	performance with Impala. Ensure that you have sufficient disk space to store the data Impala will be
	querying.
	</li>
	</ul>
	</conbody>
	</concept>

	<concept id="prereqs_account">

	<title>User Account Requirements</title>
	<prolog>
	<metadata>
	<data name="Category" value="Users"/>
	</metadata>
	</prolog>

	<conbody>

	<p>
	<indexterm audience="Cloudera">impala user</indexterm>
	<indexterm audience="Cloudera">impala group</indexterm>
	<indexterm audience="Cloudera">root user</indexterm>
	Impala creates and uses a user and group named <codeph>impala</codeph>. Do not delete this account or group
	and do not modify the account's or group's permissions and rights. Ensure no existing systems obstruct the
	functioning of these accounts and groups. For example, if you have scripts that delete user accounts not in
	a white-list, add these accounts to the list of permitted accounts.
	</p>

	<!-- Taking out because no longer applicable in CDH 5.5 and up. -->
	<p id="impala_hdfs_group" rev="1.2" audience="Cloudera">
	For the resource management feature to work (in combination with CDH 5 and the YARN and Llama components),
	the <codeph>impala</codeph> user must be a member of the <codeph>hdfs</codeph> group. This setup is
	performed automatically during a new install, but not when upgrading from earlier Impala releases to Impala
	1.2. If you are upgrading a node to CDH 5 that already had Impala 1.1 or 1.0 installed, manually add the
	<codeph>impala</codeph> user to the <codeph>hdfs</codeph> group.
	</p>

	<p>
	For correct file deletion during <codeph>DROP TABLE</codeph> operations, Impala must be able to move files
	to the HDFS trashcan. You might need to create an HDFS directory <filepath>/user/impala</filepath>,
	writeable by the <codeph>impala</codeph> user, so that the trashcan can be created. Otherwise, data files
	might remain behind after a <codeph>DROP TABLE</codeph> statement.
	</p>

	<p>
	Impala should not run as root. Best Impala performance is achieved using direct reads, but root is not
	permitted to use direct reads. Therefore, running Impala as root negatively affects performance.
	</p>

	<p>
	By default, any user can connect to Impala and access all the associated databases and tables. You can
	enable authorization and authentication based on the Linux OS user who connects to the Impala server, and
	the associated groups for that user. <xref href="impala_security.xml#security"/> for details. These
	security features do not change the underlying file permission requirements; the <codeph>impala</codeph>
	user still needs to be able to access the data files.
	</p>
	</conbody>
	</concept>
	</concept>