docs/topics/impala_proxy.xml - impala - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="proxy">

   <title>Using Impala through a Proxy for High Availability</title>
   <titlealts audience="PDF"><navtitle>Load-Balancing Proxy for HA</navtitle></titlealts>
   <prolog>
     <metadata>
       <data name="Category" value="High Availability"/>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Network"/>
       <data name="Category" value="Proxy"/>
       <data name="Category" value="Administrators"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Data Analysts"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       For most clusters that have multiple users and production availability requirements, you might set up a proxy
       server to relay requests to and from Impala.
     </p>

     <p>
       Currently, the Impala statestore mechanism does not include such proxying and load-balancing features. Set up
       a software package of your choice to perform these functions.
     </p>

     <note>
       <p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>
     </note>

     <p outputclass="toc inpage"/>

   </conbody>

   <concept id="proxy_overview">

     <title>Overview of Proxy Usage and Load Balancing for Impala</title>
   <prolog>
     <metadata>
       <data name="Category" value="Concepts"/>
     </metadata>
   </prolog>

     <conbody>

       <p>
         Using a load-balancing proxy server for Impala has the following advantages:
       </p>

       <ul>
         <li>
           Applications connect to a single well-known host and port, rather than keeping track of the hosts where
           the <cmdname>impalad</cmdname> daemon is running.
         </li>

         <li>
           If any host running the <cmdname>impalad</cmdname> daemon becomes unavailable, application connection
           requests still succeed because you always connect to the proxy server rather than a specific host running
           the <cmdname>impalad</cmdname> daemon.
         </li>

         <li>
           The coordinator node for each Impala query potentially requires more memory and CPU cycles than the other
           nodes that process the query. The proxy server can issue queries using round-robin scheduling, so that
           each connection uses a different coordinator node. This load-balancing technique lets the Impala nodes
           share this additional work, rather than concentrating it on a single machine.
         </li>
       </ul>

       <p>
         The following setup steps are a general outline that apply to any load-balancing proxy software:
       </p>

       <ol>
         <li>
           Select and download the load-balancing proxy software or other
           load-balancing hardware appliance. It should only need to be installed
           and configured on a single host, typically on an edge node. Pick a
           host other than the DataNodes where <cmdname>impalad</cmdname> is
           running, because the intention is to protect against the possibility
           of one or more of these DataNodes becoming unavailable.
         </li>

         <li>
           Configure the load balancer (typically by editing a configuration file).
           In particular:
           <ul>
             <li>
               Set up a port that the load balancer will listen on to relay
               Impala requests back and forth. </li>
             <li>
               See <xref href="#proxy_balancing" format="dita"/> for load
               balancing algorithm options.
             </li>
             <li>
               For Kerberized clusters, follow the instructions in <xref
                 href="impala_proxy.xml#proxy_kerberos"/>.
             </li>
           </ul>
         </li>

         <li>
           If you are using Hue or JDBC-based applications, you typically set
           up load balancing for both ports 21000 and 21050, because these client
           applications connect through port 21050 while the
             <cmdname>impala-shell</cmdname> command connects through port
           21000. See <xref href="impala_ports.xml#ports"/> for when to use port
           21000, 21050, or another value depending on what type of connections
           you are load balancing.
         </li>

         <li>
           Run the load-balancing proxy server, pointing it at the configuration file that you set up.
         </li>

         <li>
           For any scripts, jobs, or configuration settings for applications
           that formerly connected to a specific DataNode to run Impala SQL
           statements, change the connection information (such as the
             <codeph>-i</codeph> option in <cmdname>impala-shell</cmdname>) to
           point to the load balancer instead.
         </li>
       </ol>

       <note>
         The following sections use the HAProxy software as a representative example of a load balancer
         that you can use with Impala.
       </note>

     </conbody>

   </concept>

   <concept id="proxy_balancing" rev="">
     <title>Choosing the Load-Balancing Algorithm</title>
     <conbody>
       <p>
         Load-balancing software offers a number of algorithms to distribute requests.
         Each algorithm has its own characteristics that make it suitable in some situations
         but not others.
       </p>

       <dl>
         <dlentry>
           <dt>Leastconn</dt>
           <dd>
             Connects sessions to the coordinator with the fewest connections,
             to balance the load evenly. Typically used for workloads consisting
             of many independent, short-running queries. In configurations with
             only a few client machines, this setting can avoid having all
             requests go to only a small set of coordinators.
           </dd>
           <dd>
             Recommended for Impala with F5.
           </dd>
         </dlentry>
         <dlentry>
           <dt>Source IP Persistence</dt>
           <dd>
             <p>
               Sessions from the same IP address always go to the same
               coordinator. A good choice for Impala workloads containing a mix
               of queries and DDL statements, such as <codeph>CREATE TABLE</codeph>
               and <codeph>ALTER TABLE</codeph>. Because the metadata changes from
               a DDL statement take time to propagate across the cluster, prefer
               to use the Source IP Persistence in this case. If you are unable
               to choose Source IP Persistence, run the DDL and subsequent queries
               that depend on the results of the DDL through the same session,
               for example by running <codeph>impala-shell -f <varname>script_file</varname></codeph>
               to submit several statements through a single session.
             </p>
           </dd>
           <dd>
             <p>
               Required for setting up high availability with Hue.
             </p>
           </dd>
         </dlentry>
         <dlentry>
           <dt>Round-robin</dt>
           <dd>
             <p>
               Distributes connections to all coordinator nodes.
               Typically not recommended for Impala.
             </p>
           </dd>
         </dlentry>
       </dl>

       <p>
         You might need to perform benchmarks and load testing to determine
         which setting is optimal for your use case. Always set up using two
         load-balancing algorithms: Source IP Persistence for Hue and Leastconn
         for others.
       </p>

     </conbody>
   </concept>

   <concept id="proxy_kerberos">

     <title>Special Proxy Considerations for Clusters Using Kerberos</title>
   <prolog>
     <metadata>
       <data name="Category" value="Security"/>
       <data name="Category" value="Kerberos"/>
       <data name="Category" value="Authentication"/>
       <data name="Category" value="Proxy"/>
     </metadata>
   </prolog>

     <conbody>

       <p>
         In a cluster using Kerberos, applications check host credentials to
         verify that the host they are connecting to is the same one that is
         actually processing the request, to prevent man-in-the-middle attacks.
       </p>
       <p>
         In <keyword keyref="impala211_full">Impala 2.11</keyword> and lower
         versions, once you enable a proxy server in a Kerberized cluster, users
         will not be able to connect to individual impala daemons directly from
         impala-shell.
       </p>

       <p>
         In <keyword keyref="impala212_full">Impala 2.12</keyword> and higher,
         if you enable a proxy server in a Kerberized cluster, users have an
         option to connect to Impala daemons directly from
           <cmdname>impala-shell</cmdname> using the <codeph>-b</codeph> /
           <codeph>--kerberos_host_fqdn</codeph> option when you start
           <cmdname>impala-shell</cmdname>. This option can be used for testing or
         troubleshooting purposes, but not recommended for live production
         environments as it defeats the purpose of a load balancer/proxy.
       </p>

       <p>
         Example:
 <codeblock>
 impala-shell -i impalad-1.mydomain.com -k -b loadbalancer-1.mydomain.com
 </codeblock>
       </p>

       <p>
         Alternatively, with the fully qualified
         configurations:
 <codeblock>impala-shell --impalad=impalad-1.mydomain.com:21000 --kerberos --kerberos_host_fqdn=loadbalancer-1.mydomain.com</codeblock>
       </p>
       <p>
         See <xref href="impala_shell_options.xml#shell_options"/> for
         information about the option.
       </p>

       <p>
         To clarify that the load-balancing proxy server is legitimate, perform
         these extra Kerberos setup steps:
       </p>

       <ol>
         <li>
           This section assumes you are starting with a Kerberos-enabled cluster. See
           <xref href="impala_kerberos.xml#kerberos"/> for instructions for setting up Impala with Kerberos. See
           <xref keyref="cdh_sg_kerberos_prin_keytab_deploy"/> for general steps to set up Kerberos.
         </li>

         <li>
           Choose the host you will use for the proxy server. Based on the Kerberos setup procedure, it should
           already have an entry <codeph>impala/<varname>proxy_host</varname>@<varname>realm</varname></codeph> in
           its keytab. If not, go back over the initial Kerberos configuration steps for the keytab on each host
           running the <cmdname>impalad</cmdname> daemon.
         </li>

         <li>
           Copy the keytab file from the proxy host to all other hosts in the cluster that run the
           <cmdname>impalad</cmdname> daemon. (For optimal performance, <cmdname>impalad</cmdname> should be running
           on all DataNodes in the cluster.) Put the keytab file in a secure location on each of these other hosts.
         </li>

         <li>
           Add an entry <codeph>impala/<varname>actual_hostname</varname>@<varname>realm</varname></codeph> to the keytab on each
           host running the <cmdname>impalad</cmdname> daemon.
         </li>

         <li>

          For each impalad node, merge the existing keytab with the proxy’s keytab using
           <cmdname>ktutil</cmdname>, producing a new keytab file. For example:
           <codeblock>$ ktutil
   ktutil: read_kt proxy.keytab
   ktutil: read_kt impala.keytab
   ktutil: write_kt proxy_impala.keytab
   ktutil: quit</codeblock>

         </li>

         <li>

           To verify that the keytabs are merged, run the command:
 <codeblock>
 klist -k <varname>keytabfile</varname>
 </codeblock>
           which lists the credentials for both <codeph>principal</codeph> and <codeph>be_principal</codeph> on
           all nodes.
         </li>


         <li>

           Make sure that the <codeph>impala</codeph> user has permission to read this merged keytab file.

         </li>

         <li>
           Change the following configuration settings for each host in the cluster that participates
           in the load balancing:
           <ul>
             <li>
               In the <cmdname>impalad</cmdname> option definition, add:
 <codeblock>
 --principal=impala/<i>proxy_host@realm</i>
   --be_principal=impala/<i>actual_host@realm</i>
   --keytab_file=<i>path_to_merged_keytab</i>
 </codeblock>
               <note>
                 Every host has different <codeph>--be_principal</codeph> because the actual hostname
                 is different on each host.

                 Specify the fully qualified domain name (FQDN) for the proxy host, not the IP
                 address. Use the exact FQDN as returned by a reverse DNS lookup for the associated
                 IP address.

               </note>
             </li>

             <li>
               Modify the startup options. See <xref href="impala_config_options.xml#config_options"/> for the procedure to modify the startup
               options.
             </li>
           </ul>
         </li>

         <li>
           Restart Impala to make the changes take effect. Restart the <cmdname>impalad</cmdname> daemons on all
           hosts in the cluster, as well as the <cmdname>statestored</cmdname> and <cmdname>catalogd</cmdname>
           daemons.
         </li>

       </ol>

 <!--
 We basically want to merge the keytab from the proxy host to all the impalad host's keytab file. To merge two keytab files, we first need to ship the proxy keytab to all the impalad node, then merge keytab files using MIT Kerberos "ktutil" command line tool.

 <codeblock>$ ktutil
 ktutil: read_kt krb5.keytab
 ktutil: read_kt proxy-host.keytab
 ktutil: write_kt krb5.keytab
 ktutil: quit</codeblock>

 The setup of the -principal and -be_principal has to be set through safety valve.
 -->

     </conbody>

   </concept>

   <concept id="tut_proxy">

     <title>Example of Configuring HAProxy Load Balancer for Impala</title>
   <prolog>
     <metadata>
       <data name="Category" value="Configuring"/>
     </metadata>
   </prolog>

     <conbody>

       <p>
         If you are not already using a load-balancing proxy, you can experiment with
         <xref href="http://haproxy.1wt.eu/" scope="external" format="html">HAProxy</xref> a free, open source load
         balancer. This example shows how you might install and configure that load balancer on a Red Hat Enterprise
         Linux system.
       </p>

       <ul>
         <li>
           <p>
             Install the load balancer: <codeph>yum install haproxy</codeph>
           </p>
         </li>

         <li>
           <p>
             Set up the configuration file: <filepath>/etc/haproxy/haproxy.cfg</filepath>. See the following section
             for a sample configuration file.
           </p>
         </li>

         <li>
           <p>
             Run the load balancer (on a single host, preferably one not running <cmdname>impalad</cmdname>):
           </p>
 <codeblock>/usr/sbin/haproxy –f /etc/haproxy/haproxy.cfg</codeblock>
         </li>

         <li>
           <p>
             In <cmdname>impala-shell</cmdname>, JDBC applications, or ODBC applications, connect to the listener
             port of the proxy host, rather than port 21000 or 21050 on a host actually running <cmdname>impalad</cmdname>.
             The sample configuration file sets haproxy to listen on port 25003, therefore you would send all
             requests to <codeph><varname>haproxy_host</varname>:25003</codeph>.
           </p>
         </li>
       </ul>

       <p>
         This is the sample <filepath>haproxy.cfg</filepath> used in this example:
       </p>

 <codeblock>global
     # To have these messages end up in /var/log/haproxy.log you will
     # need to:
     #
     # 1) configure syslog to accept network log events.  This is done
     #    by adding the '-r' option to the SYSLOGD_OPTIONS in
     #    /etc/sysconfig/syslog
     #
     # 2) configure local2 events to go to the /var/log/haproxy.log
     #   file. A line like the following can be added to
     #   /etc/sysconfig/syslog
     #
     #    local2.*                       /var/log/haproxy.log
     #
     log         127.0.0.1 local0
     log         127.0.0.1 local1 notice
     chroot      /var/lib/haproxy
     pidfile     /var/run/haproxy.pid
     maxconn     4000
     user        haproxy
     group       haproxy
     daemon

     # turn on stats unix socket
     #stats socket /var/lib/haproxy/stats

 #---------------------------------------------------------------------
 # common defaults that all the 'listen' and 'backend' sections will
 # use if not designated in their block
 #
 # You might need to adjust timing values to prevent timeouts.
 #
 # The timeout values should be dependant on how you use the cluster
 # and how long your queries run.
 #---------------------------------------------------------------------
 defaults
     mode                    http
     log                     global
     option                  httplog
     option                  dontlognull
     option http-server-close
     option forwardfor       except 127.0.0.0/8
     option                  redispatch
     retries                 3
     maxconn                 3000
     timeout connect 5000
     timeout client 3600s
     timeout server 3600s

 #
 # This sets up the admin page for HA Proxy at port 25002.
 #
 listen stats :25002
     balance
     mode http
     stats enable
     stats auth <varname>username</varname>:<varname>password</varname>

 # This is the setup for Impala. Impala client connect to load_balancer_host:25003.
 # HAProxy will balance connections among the list of servers listed below.
 # The list of Impalad is listening at port 21000 for beeswax (impala-shell) or original ODBC driver.
 # For JDBC or ODBC version 2.x driver, use port 21050 instead of 21000.
 listen impala :25003
     mode tcp
     option tcplog
     balance leastconn

     server <varname>symbolic_name_1</varname> impala-host-1.example.com:21000
     server <varname>symbolic_name_2</varname> impala-host-2.example.com:21000
     server <varname>symbolic_name_3</varname> impala-host-3.example.com:21000
     server <varname>symbolic_name_4</varname> impala-host-4.example.com:21000

 # Setup for Hue or other JDBC-enabled applications.
 # In particular, Hue requires sticky sessions.
 # The application connects to load_balancer_host:21051, and HAProxy balances
 # connections to the associated hosts, where Impala listens for JDBC
 # requests on port 21050.
 listen impalajdbc :21051
     mode tcp
     option tcplog
     balance source
     server <varname>symbolic_name_5</varname> impala-host-1.example.com:21050 check
     server <varname>symbolic_name_6</varname> impala-host-2.example.com:21050 check
     server <varname>symbolic_name_7</varname> impala-host-3.example.com:21050 check
     server <varname>symbolic_name_8</varname> impala-host-4.example.com:21050 check
 </codeblock>
       <note type="important">
         Hue requires the <codeph>check</codeph> option at end of each line in
         the above file to ensure HAProxy can detect any unreachable Impalad
         server, and failover can be successful. Without the TCP check, you may hit
         an error when the <cmdname>impalad</cmdname> daemon to which Hue tries to
         connect is down.
       </note>

       <note conref="../shared/impala_common.xml#common/proxy_jdbc_caveat"/>

     </conbody>

   </concept>

 </concept>
	<?xml version="1.0" encoding="UTF-8"?>
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->
	<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
	<concept id="proxy">

	<title>Using Impala through a Proxy for High Availability</title>
	<titlealts audience="PDF"><navtitle>Load-Balancing Proxy for HA</navtitle></titlealts>
	<prolog>
	<metadata>
	<data name="Category" value="High Availability"/>
	<data name="Category" value="Impala"/>
	<data name="Category" value="Network"/>
	<data name="Category" value="Proxy"/>
	<data name="Category" value="Administrators"/>
	<data name="Category" value="Developers"/>
	<data name="Category" value="Data Analysts"/>
	</metadata>
	</prolog>

	<conbody>

	<p>
	For most clusters that have multiple users and production availability requirements, you might set up a proxy
	server to relay requests to and from Impala.
	</p>

	<p>
	Currently, the Impala statestore mechanism does not include such proxying and load-balancing features. Set up
	a software package of your choice to perform these functions.
	</p>

	<note>
	<p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>
	</note>

	<p outputclass="toc inpage"/>

	</conbody>

	<concept id="proxy_overview">

	<title>Overview of Proxy Usage and Load Balancing for Impala</title>
	<prolog>
	<metadata>
	<data name="Category" value="Concepts"/>
	</metadata>
	</prolog>

	<conbody>

	<p>
	Using a load-balancing proxy server for Impala has the following advantages:
	</p>

	<ul>
	<li>
	Applications connect to a single well-known host and port, rather than keeping track of the hosts where
	the <cmdname>impalad</cmdname> daemon is running.
	</li>

	<li>
	If any host running the <cmdname>impalad</cmdname> daemon becomes unavailable, application connection
	requests still succeed because you always connect to the proxy server rather than a specific host running
	the <cmdname>impalad</cmdname> daemon.
	</li>

	<li>
	The coordinator node for each Impala query potentially requires more memory and CPU cycles than the other
	nodes that process the query. The proxy server can issue queries using round-robin scheduling, so that
	each connection uses a different coordinator node. This load-balancing technique lets the Impala nodes
	share this additional work, rather than concentrating it on a single machine.
	</li>
	</ul>

	<p>
	The following setup steps are a general outline that apply to any load-balancing proxy software:
	</p>

	<ol>
	<li>
	Select and download the load-balancing proxy software or other
	load-balancing hardware appliance. It should only need to be installed
	and configured on a single host, typically on an edge node. Pick a
	host other than the DataNodes where <cmdname>impalad</cmdname> is
	running, because the intention is to protect against the possibility
	of one or more of these DataNodes becoming unavailable.
	</li>

	<li>
	Configure the load balancer (typically by editing a configuration file).
	In particular:
	<ul>
	<li>
	Set up a port that the load balancer will listen on to relay
	Impala requests back and forth. </li>
	<li>
	See <xref href="#proxy_balancing" format="dita"/> for load
	balancing algorithm options.
	</li>
	<li>
	For Kerberized clusters, follow the instructions in <xref
	href="impala_proxy.xml#proxy_kerberos"/>.
	</li>
	</ul>
	</li>

	<li>
	If you are using Hue or JDBC-based applications, you typically set
	up load balancing for both ports 21000 and 21050, because these client
	applications connect through port 21050 while the
	<cmdname>impala-shell</cmdname> command connects through port
	21000. See <xref href="impala_ports.xml#ports"/> for when to use port
	21000, 21050, or another value depending on what type of connections
	you are load balancing.
	</li>

	<li>
	Run the load-balancing proxy server, pointing it at the configuration file that you set up.
	</li>

	<li>
	For any scripts, jobs, or configuration settings for applications
	that formerly connected to a specific DataNode to run Impala SQL
	statements, change the connection information (such as the
	<codeph>-i</codeph> option in <cmdname>impala-shell</cmdname>) to
	point to the load balancer instead.
	</li>
	</ol>

	<note>
	The following sections use the HAProxy software as a representative example of a load balancer
	that you can use with Impala.
	</note>

	</conbody>

	</concept>

	<concept id="proxy_balancing" rev="">
	<title>Choosing the Load-Balancing Algorithm</title>
	<conbody>
	<p>
	Load-balancing software offers a number of algorithms to distribute requests.
	Each algorithm has its own characteristics that make it suitable in some situations
	but not others.
	</p>

	<dl>
	<dlentry>
	<dt>Leastconn</dt>
	<dd>
	Connects sessions to the coordinator with the fewest connections,
	to balance the load evenly. Typically used for workloads consisting
	of many independent, short-running queries. In configurations with
	only a few client machines, this setting can avoid having all
	requests go to only a small set of coordinators.
	</dd>
	<dd>
	Recommended for Impala with F5.
	</dd>
	</dlentry>
	<dlentry>
	<dt>Source IP Persistence</dt>
	<dd>
	<p>
	Sessions from the same IP address always go to the same
	coordinator. A good choice for Impala workloads containing a mix
	of queries and DDL statements, such as <codeph>CREATE TABLE</codeph>
	and <codeph>ALTER TABLE</codeph>. Because the metadata changes from
	a DDL statement take time to propagate across the cluster, prefer
	to use the Source IP Persistence in this case. If you are unable
	to choose Source IP Persistence, run the DDL and subsequent queries
	that depend on the results of the DDL through the same session,
	for example by running <codeph>impala-shell -f <varname>script_file</varname></codeph>
	to submit several statements through a single session.
	</p>
	</dd>
	<dd>
	<p>
	Required for setting up high availability with Hue.
	</p>
	</dd>
	</dlentry>
	<dlentry>
	<dt>Round-robin</dt>
	<dd>
	<p>
	Distributes connections to all coordinator nodes.
	Typically not recommended for Impala.
	</p>
	</dd>
	</dlentry>
	</dl>

	<p>
	You might need to perform benchmarks and load testing to determine
	which setting is optimal for your use case. Always set up using two
	load-balancing algorithms: Source IP Persistence for Hue and Leastconn
	for others.
	</p>

	</conbody>
	</concept>

	<concept id="proxy_kerberos">

	<title>Special Proxy Considerations for Clusters Using Kerberos</title>
	<prolog>
	<metadata>
	<data name="Category" value="Security"/>
	<data name="Category" value="Kerberos"/>
	<data name="Category" value="Authentication"/>
	<data name="Category" value="Proxy"/>
	</metadata>
	</prolog>

	<conbody>

	<p>
	In a cluster using Kerberos, applications check host credentials to
	verify that the host they are connecting to is the same one that is
	actually processing the request, to prevent man-in-the-middle attacks.
	</p>
	<p>
	In <keyword keyref="impala211_full">Impala 2.11</keyword> and lower
	versions, once you enable a proxy server in a Kerberized cluster, users
	will not be able to connect to individual impala daemons directly from
	impala-shell.
	</p>

	<p>
	In <keyword keyref="impala212_full">Impala 2.12</keyword> and higher,
	if you enable a proxy server in a Kerberized cluster, users have an
	option to connect to Impala daemons directly from
	<cmdname>impala-shell</cmdname> using the <codeph>-b</codeph> /
	<codeph>--kerberos_host_fqdn</codeph> option when you start
	<cmdname>impala-shell</cmdname>. This option can be used for testing or
	troubleshooting purposes, but not recommended for live production
	environments as it defeats the purpose of a load balancer/proxy.
	</p>

	<p>
	Example:
	<codeblock>
	impala-shell -i impalad-1.mydomain.com -k -b loadbalancer-1.mydomain.com
	</codeblock>
	</p>

	<p>
	Alternatively, with the fully qualified
	configurations:
	<codeblock>impala-shell --impalad=impalad-1.mydomain.com:21000 --kerberos --kerberos_host_fqdn=loadbalancer-1.mydomain.com</codeblock>
	</p>
	<p>
	See <xref href="impala_shell_options.xml#shell_options"/> for
	information about the option.
	</p>

	<p>
	To clarify that the load-balancing proxy server is legitimate, perform
	these extra Kerberos setup steps:
	</p>

	<ol>
	<li>
	This section assumes you are starting with a Kerberos-enabled cluster. See
	<xref href="impala_kerberos.xml#kerberos"/> for instructions for setting up Impala with Kerberos. See
	<xref keyref="cdh_sg_kerberos_prin_keytab_deploy"/> for general steps to set up Kerberos.
	</li>

	<li>
	Choose the host you will use for the proxy server. Based on the Kerberos setup procedure, it should
	already have an entry <codeph>impala/<varname>proxy_host</varname>@<varname>realm</varname></codeph> in
	its keytab. If not, go back over the initial Kerberos configuration steps for the keytab on each host
	running the <cmdname>impalad</cmdname> daemon.
	</li>

	<li>
	Copy the keytab file from the proxy host to all other hosts in the cluster that run the
	<cmdname>impalad</cmdname> daemon. (For optimal performance, <cmdname>impalad</cmdname> should be running
	on all DataNodes in the cluster.) Put the keytab file in a secure location on each of these other hosts.
	</li>

	<li>
	Add an entry <codeph>impala/<varname>actual_hostname</varname>@<varname>realm</varname></codeph> to the keytab on each
	host running the <cmdname>impalad</cmdname> daemon.
	</li>

	<li>

	For each impalad node, merge the existing keytab with the proxy’s keytab using
	<cmdname>ktutil</cmdname>, producing a new keytab file. For example:
	<codeblock>$ ktutil
	ktutil: read_kt proxy.keytab
	ktutil: read_kt impala.keytab
	ktutil: write_kt proxy_impala.keytab
	ktutil: quit</codeblock>

	</li>

	<li>

	To verify that the keytabs are merged, run the command:
	<codeblock>
	klist -k <varname>keytabfile</varname>
	</codeblock>
	which lists the credentials for both <codeph>principal</codeph> and <codeph>be_principal</codeph> on
	all nodes.
	</li>


	<li>

	Make sure that the <codeph>impala</codeph> user has permission to read this merged keytab file.

	</li>

	<li>
	Change the following configuration settings for each host in the cluster that participates
	in the load balancing:
	<ul>
	<li>
	In the <cmdname>impalad</cmdname> option definition, add:
	<codeblock>
	--principal=impala/<i>proxy_host@realm</i>
	--be_principal=impala/<i>actual_host@realm</i>
	--keytab_file=<i>path_to_merged_keytab</i>
	</codeblock>
	<note>
	Every host has different <codeph>--be_principal</codeph> because the actual hostname
	is different on each host.

	Specify the fully qualified domain name (FQDN) for the proxy host, not the IP
	address. Use the exact FQDN as returned by a reverse DNS lookup for the associated
	IP address.

	</note>
	</li>

	<li>
	Modify the startup options. See <xref href="impala_config_options.xml#config_options"/> for the procedure to modify the startup
	options.
	</li>
	</ul>
	</li>

	<li>
	Restart Impala to make the changes take effect. Restart the <cmdname>impalad</cmdname> daemons on all
	hosts in the cluster, as well as the <cmdname>statestored</cmdname> and <cmdname>catalogd</cmdname>
	daemons.
	</li>

	</ol>

	<!--
	We basically want to merge the keytab from the proxy host to all the impalad host's keytab file. To merge two keytab files, we first need to ship the proxy keytab to all the impalad node, then merge keytab files using MIT Kerberos "ktutil" command line tool.

	<codeblock>$ ktutil
	ktutil: read_kt krb5.keytab
	ktutil: read_kt proxy-host.keytab
	ktutil: write_kt krb5.keytab
	ktutil: quit</codeblock>

	The setup of the -principal and -be_principal has to be set through safety valve.
	-->

	</conbody>

	</concept>

	<concept id="tut_proxy">

	<title>Example of Configuring HAProxy Load Balancer for Impala</title>
	<prolog>
	<metadata>
	<data name="Category" value="Configuring"/>
	</metadata>
	</prolog>

	<conbody>

	<p>
	If you are not already using a load-balancing proxy, you can experiment with
	<xref href="http://haproxy.1wt.eu/" scope="external" format="html">HAProxy</xref> a free, open source load
	balancer. This example shows how you might install and configure that load balancer on a Red Hat Enterprise
	Linux system.
	</p>

	<ul>
	<li>
	<p>
	Install the load balancer: <codeph>yum install haproxy</codeph>
	</p>
	</li>

	<li>
	<p>
	Set up the configuration file: <filepath>/etc/haproxy/haproxy.cfg</filepath>. See the following section
	for a sample configuration file.
	</p>
	</li>

	<li>
	<p>
	Run the load balancer (on a single host, preferably one not running <cmdname>impalad</cmdname>):
	</p>
	<codeblock>/usr/sbin/haproxy –f /etc/haproxy/haproxy.cfg</codeblock>
	</li>

	<li>
	<p>
	In <cmdname>impala-shell</cmdname>, JDBC applications, or ODBC applications, connect to the listener
	port of the proxy host, rather than port 21000 or 21050 on a host actually running <cmdname>impalad</cmdname>.
	The sample configuration file sets haproxy to listen on port 25003, therefore you would send all
	requests to <codeph><varname>haproxy_host</varname>:25003</codeph>.
	</p>
	</li>
	</ul>

	<p>
	This is the sample <filepath>haproxy.cfg</filepath> used in this example:
	</p>

	<codeblock>global
	# To have these messages end up in /var/log/haproxy.log you will
	# need to:
	#
	# 1) configure syslog to accept network log events. This is done
	# by adding the '-r' option to the SYSLOGD_OPTIONS in
	# /etc/sysconfig/syslog
	#
	# 2) configure local2 events to go to the /var/log/haproxy.log
	# file. A line like the following can be added to
	# /etc/sysconfig/syslog
	#
	# local2.* /var/log/haproxy.log
	#
	log 127.0.0.1 local0
	log 127.0.0.1 local1 notice
	chroot /var/lib/haproxy
	pidfile /var/run/haproxy.pid
	maxconn 4000
	user haproxy
	group haproxy
	daemon

	# turn on stats unix socket
	#stats socket /var/lib/haproxy/stats

	#---------------------------------------------------------------------
	# common defaults that all the 'listen' and 'backend' sections will
	# use if not designated in their block
	#
	# You might need to adjust timing values to prevent timeouts.
	#
	# The timeout values should be dependant on how you use the cluster
	# and how long your queries run.
	#---------------------------------------------------------------------
	defaults
	mode http
	log global
	option httplog
	option dontlognull
	option http-server-close
	option forwardfor except 127.0.0.0/8
	option redispatch
	retries 3
	maxconn 3000
	timeout connect 5000
	timeout client 3600s
	timeout server 3600s

	#
	# This sets up the admin page for HA Proxy at port 25002.
	#
	listen stats :25002
	balance
	mode http
	stats enable
	stats auth <varname>username</varname>:<varname>password</varname>

	# This is the setup for Impala. Impala client connect to load_balancer_host:25003.
	# HAProxy will balance connections among the list of servers listed below.
	# The list of Impalad is listening at port 21000 for beeswax (impala-shell) or original ODBC driver.
	# For JDBC or ODBC version 2.x driver, use port 21050 instead of 21000.
	listen impala :25003
	mode tcp
	option tcplog
	balance leastconn

	server <varname>symbolic_name_1</varname> impala-host-1.example.com:21000
	server <varname>symbolic_name_2</varname> impala-host-2.example.com:21000
	server <varname>symbolic_name_3</varname> impala-host-3.example.com:21000
	server <varname>symbolic_name_4</varname> impala-host-4.example.com:21000

	# Setup for Hue or other JDBC-enabled applications.
	# In particular, Hue requires sticky sessions.
	# The application connects to load_balancer_host:21051, and HAProxy balances
	# connections to the associated hosts, where Impala listens for JDBC
	# requests on port 21050.
	listen impalajdbc :21051
	mode tcp
	option tcplog
	balance source
	server <varname>symbolic_name_5</varname> impala-host-1.example.com:21050 check
	server <varname>symbolic_name_6</varname> impala-host-2.example.com:21050 check
	server <varname>symbolic_name_7</varname> impala-host-3.example.com:21050 check
	server <varname>symbolic_name_8</varname> impala-host-4.example.com:21050 check
	</codeblock>
	<note type="important">
	Hue requires the <codeph>check</codeph> option at end of each line in
	the above file to ensure HAProxy can detect any unreachable Impalad
	server, and failover can be successful. Without the TCP check, you may hit
	an error when the <cmdname>impalad</cmdname> daemon to which Hue tries to
	connect is down.
	</note>

	<note conref="../shared/impala_common.xml#common/proxy_jdbc_caveat"/>

	</conbody>

	</concept>

	</concept>