<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="config_performance">
<title>Post-Installation Configuration for Impala</title>
<prolog>
<metadata>
<data name="Category" value="Performance"/>
<data name="Category" value="Impala"/>
<data name="Category" value="Configuring"/>
<data name="Category" value="Administrators"/>
</metadata>
</prolog>
<conbody>
<p id="p_24">
This section describes the mandatory and recommended configuration settings for Impala. If Impala is
installed using cluster management software, some of these configurations might be completed automatically; you must still
configure short-circuit reads manually. If you want to customize your environment, consider making the changes described in this topic.
</p>
<ul>
<li>
You must enable short-circuit reads, whether or not Impala was installed with cluster
management software. This setting belongs in the Impala configuration files, not in the cluster-wide Hadoop configuration.
</li>
<li>
You must enable block location tracking, and you can optionally enable native checksumming for optimal performance.
</li>
</ul>
<section id="section_fhq_wyv_ls">
<title>Mandatory: Short-Circuit Reads</title>
<p> Enabling short-circuit reads allows Impala to read local data directly
from the file system. This removes the need to communicate through the
DataNodes, improving performance. This setting also minimizes the number
of additional copies of data. Short-circuit reads require
<codeph>libhadoop.so</codeph>
(the Hadoop Native Library) to be accessible to both the server and the
client. <codeph>libhadoop.so</codeph> is not available if you have
installed from a tarball. You must install from an
<codeph>.rpm</codeph>, <codeph>.deb</codeph>, or parcel to use
short-circuit local reads.
</p>
<p>
<b>To configure DataNodes for short-circuit reads:</b>
</p>
<ol id="ol_qlq_wyv_ls">
<li id="copy_config_files"> Copy the client
<codeph>core-site.xml</codeph> and <codeph>hdfs-site.xml</codeph>
configuration files from the Hadoop configuration directory to the
Impala configuration directory. The default Impala configuration
location is <codeph>/etc/impala/conf</codeph>. </li>
<li>
<indexterm audience="hidden">dfs.client.read.shortcircuit</indexterm>
<indexterm audience="hidden">dfs.domain.socket.path</indexterm>
<indexterm audience="hidden">dfs.client.file-block-storage-locations.timeout.millis</indexterm>
On all Impala nodes, configure the following properties in <!-- Exact timing is unclear, since we say farther down to copy /etc/hadoop/conf/hdfs-site.xml to /etc/impala/conf.
Which wouldn't work if we already modified the Impala version of the file here. Also on a 'managed'
cluster, these /etc files might not exist in those locations. -->
<!-- <codeph>/etc/impala/conf/hdfs-site.xml</codeph> as shown: -->
Impala's copy of <codeph>hdfs-site.xml</codeph> as shown: <codeblock>&lt;property&gt;
&lt;name&gt;dfs.client.read.shortcircuit&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;dfs.domain.socket.path&lt;/name&gt;
&lt;value&gt;/var/run/hdfs-sockets/dn&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;dfs.client.file-block-storage-locations.timeout.millis&lt;/name&gt;
&lt;value&gt;10000&lt;/value&gt;
&lt;/property&gt;</codeblock>
<!-- Former socket.path value: &lt;value&gt;/var/run/hadoop-hdfs/dn._PORT&lt;/value&gt; -->
<!--
<note>
The text <codeph>_PORT</codeph> appears just as shown; you do not need to
substitute a number.
</note>
-->
</li>
<li>
<p> If <codeph>/var/run/hadoop-hdfs/</codeph> is group-writable, make
sure its group is <codeph>root</codeph>. </p>
<note> If you are also going to enable block location tracking, you
            can skip copying configuration files and restarting DataNodes and go
            straight to <xref href="#config_performance/block_location_tracking"
              >Mandatory: Block Location Tracking</xref>.
            Configuring short-circuit reads and block location tracking requires
            the same process of copying files and restarting services, so you
            can complete that process once after you have made all of the
            configuration changes. Whether you copy files and restart services
            now or while configuring block location tracking, short-circuit
            reads are not enabled until you complete those final steps. A sample
            command sequence for copying the files and restarting the DataNodes
            follows this list. </note>
</li>
<li id="restart_all_datanodes"> After applying these changes, restart
all DataNodes. </li>
</ol>
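      <p>
        For example, on a cluster where the Hadoop configuration files live in
        <codeph>/etc/hadoop/conf</codeph> and the DataNodes run as system services
        (both assumptions; adjust the paths and service names for your environment),
        the copy and restart steps above might look like the following sketch:
      </p>
<codeblock># Copy the client Hadoop configuration files into Impala's configuration
# directory (assumes the default locations /etc/hadoop/conf and /etc/impala/conf).
sudo cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml /etc/impala/conf/

# Create the directory referenced by dfs.domain.socket.path, if it does not
# already exist, so the DataNode can create its domain socket there.
sudo mkdir -p /var/run/hdfs-sockets

# Restart each DataNode so it picks up the new settings. The service name
# varies by distribution and cluster manager; this is one common form.
sudo service hadoop-hdfs-datanode restart</codeblock>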
</section>
<section id="block_location_tracking">
<title>Mandatory: Block Location Tracking</title>
<p>
        Enabling block location metadata allows Impala to know which disks data blocks are located on, so it can
        make better use of the underlying disks. Impala will not start unless this setting is enabled.
</p>
<p>
<b>To enable block location tracking:</b>
</p>
<ol>
<li>
          For each DataNode, add the following to the <codeph>hdfs-site.xml</codeph> file (a quick verification sketch follows this list):
<codeblock>&lt;property&gt;
&lt;name&gt;dfs.datanode.hdfs-blocks-metadata.enabled&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt; </codeblock>
</li>
<li conref="#config_performance/copy_config_files"/>
<li conref="#config_performance/restart_all_datanodes"/>
</ol>
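      <p>
        As a quick sanity check (a sketch, not part of the required procedure), you can print the
        value of this property as seen by the HDFS client, assuming the <codeph>hdfs</codeph>
        command is on your path. Note that this reads the Hadoop configuration visible to the
        <codeph>hdfs</codeph> command, which is not necessarily the same file as Impala's copy:
      </p>
<codeblock># Print the effective value of the block-metadata property from the HDFS
# client configuration; "true" means the setting has been picked up there.
hdfs getconf -confKey dfs.datanode.hdfs-blocks-metadata.enabled</codeblock>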
</section>
<section id="native_checksumming">
<title>Optional: Native Checksumming</title>
<p>
Enabling native checksumming causes Impala to use an optimized native library for computing checksums, if
that library is available.
</p>
<p id="p_29">
<b>To enable native checksumming:</b>
</p>
<p>
        If you installed <keyword keyref="distro"/> from packages, the native checksumming library is installed and set up
        correctly, and no additional steps are required. However, if you installed by other means, such as from
        tarballs, native checksumming may not be available because the required shared objects are missing. The message
        "<codeph>Unable to load native-hadoop library for your platform... using builtin-java classes where
        applicable</codeph>" in the Impala logs indicates that native checksumming may be unavailable. To enable native
checksumming, you must build and install <codeph>libhadoop.so</codeph> (the
<!-- Another instance of stale link. -->
<!-- <xref href="http://hadoop.apache.org/docs/r0.19.1/native_libraries.html" scope="external" format="html">Hadoop Native Library</xref>). -->
Hadoop Native Library).
</p>
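      <p>
        One way to check whether the Hadoop Native Library, and therefore native checksumming, is
        available on a host is the <codeph>hadoop checknative</codeph> command. This is only a
        sketch; it assumes the <codeph>hadoop</codeph> binary is on your path, and the exact output
        varies by Hadoop version:
      </p>
<codeblock># Report which native Hadoop libraries can be loaded on this host.
# A line such as "hadoop: true /usr/lib/.../libhadoop.so" means the native
# library (and native checksumming) is available.
hadoop checknative -a</codeblock>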
</section>
</conbody>
</concept>