| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="proxy"> |
| |
| <title>Using Impala through a Proxy for High Availability</title> |
| |
| <titlealts audience="PDF"> |
| |
| <navtitle>Load-Balancing Proxy for HA</navtitle> |
| |
| </titlealts> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="High Availability"/> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Network"/> |
| <data name="Category" value="Proxy"/> |
| <data name="Category" value="Administrators"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| For most clusters that have multiple users and production availability requirements, you |
| might set up a proxy server to relay requests to and from Impala. |
| </p> |
| |
| <p> |
| Set up a software package of your choice to perform these functions. |
| </p> |
| |
| <note> |
| <p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/> |
| </note> |
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| |
| <concept id="proxy_overview"> |
| |
| <title>Overview of Proxy Usage and Load Balancing for Impala</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Concepts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| Using a load-balancing proxy server for Impala has the following advantages: |
| </p> |
| |
| <ul> |
| <li> |
| Applications connect to a single well-known host and port, rather than keeping track |
| of the hosts where the <cmdname>impalad</cmdname> daemon is running. |
| </li> |
| |
| <li> |
| If any host running the <cmdname>impalad</cmdname> daemon becomes unavailable, |
| application connection requests still succeed because you always connect to the proxy |
| server rather than a specific host running the <cmdname>impalad</cmdname> daemon. |
| </li> |
| |
| <li> |
| The coordinator node for each Impala query potentially requires more memory and CPU |
| cycles than the other nodes that process the query. The proxy server can issue queries |
| so that each connection uses a different coordinator node. This load-balancing |
| technique lets the Impala nodes share this additional work, rather than concentrating |
| it on a single machine. |
| </li> |
| </ul> |
| |
| <p> |
| The following setup steps are a general outline that apply to any load-balancing proxy |
| software: |
| </p> |
| |
| <ol> |
| <li> |
| Select and download the load-balancing proxy software or other load-balancing hardware |
| appliance. It should only need to be installed and configured on a single host, |
| typically on an edge node. |
| </li> |
| |
| <li> |
| Configure the load balancer (typically by editing a configuration file). In |
| particular: |
| <ul> |
| <li> |
| To relay Impala requests back and forth, set up a port that the load balancer will |
| listen on. |
| </li> |
| |
| <li> |
| Select a load balancing algorithm. See |
| <xref |
| href="#proxy_balancing" format="dita"/> for load balancing |
| algorithm options. |
| </li> |
| |
| <li> |
| For Kerberized clusters, follow the instructions in |
| <xref |
| href="impala_proxy.xml#proxy_kerberos"/>. |
| </li> |
| </ul> |
| </li> |
| |
| <li> |
| If you are using Hue or JDBC-based applications, you typically set up load balancing |
| for both ports 21000 and 21050 because these client applications connect through port |
| 21050 while the <cmdname>impala-shell</cmdname> command connects through port 21000. |
| See <xref href="impala_ports.xml#ports"/> for when to use port 21000, 21050, or |
| another value depending on what type of connections you are load balancing. |
| </li> |
| |
| <li> |
| Run the load-balancing proxy server, pointing it at the configuration file that you |
| set up. |
| </li> |
| |
| <li> |
| For any scripts, jobs, or configuration settings for applications that formerly |
| connected to a specific <cmdname>impalad</cmdname> to run Impala SQL statements, |
| change the connection information (such as the <codeph>-i</codeph> option in |
| <cmdname>impala-shell</cmdname>) to point to the load balancer instead. |
| </li> |
| </ol> |
| |
| <note> |
| The following sections use the HAProxy software as a representative example of a load |
| balancer that you can use with Impala. |
| </note> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="proxy_balancing" rev=""> |
| |
| <title>Choosing the Load-Balancing Algorithm</title> |
| |
| <conbody> |
| |
| <p> |
| Load-balancing software offers a number of algorithms to distribute requests. Each |
| algorithm has its own characteristics that make it suitable in some situations but not |
| others. |
| </p> |
| |
| <dl> |
| <dlentry> |
| |
| <dt> |
| Leastconn |
| </dt> |
| |
| <dd> |
| Connects sessions to the coordinator with the fewest connections, to balance the |
| load evenly. Typically used for workloads consisting of many independent, |
| short-running queries. In configurations with only a few client machines, this |
| setting can avoid having all requests go to only a small set of coordinators. |
| </dd> |
| |
| <dd> |
| Recommended for Impala with F5. |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry> |
| |
| <dt> |
| Source IP Persistence |
| </dt> |
| |
| <dd> |
| <p> |
| Sessions from the same IP address always go to the same coordinator. A good choice |
| for Impala workloads containing a mix of queries and DDL statements, such as |
| <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>. Because the |
| metadata changes from a DDL statement take time to propagate across the cluster, |
| prefer to use the Source IP Persistence in this case. If you are unable to choose |
| Source IP Persistence, run the DDL and subsequent queries that depend on the |
| results of the DDL through the same session, for example by running |
| <codeph>impala-shell -f <varname>script_file</varname></codeph> to submit several |
| statements through a single session. |
| </p> |
| </dd> |
| |
| <dd> |
| <p> |
| Required for setting up high availability with Hue. |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry> |
| |
| <dt> |
| Round-robin |
| </dt> |
| |
| <dd> |
| Distributes connections to all coordinator nodes. Typically not recommended for |
| Impala. |
| </dd> |
| |
| </dlentry> |
| </dl> |
| |
| <p> |
| You might need to perform benchmarks and load testing to determine which setting is |
| optimal for your use case. Always set up using two load-balancing algorithms: Source IP |
| Persistence for Hue and Leastconn for others. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="proxy_kerberos"> |
| |
| <title>Special Proxy Considerations for Clusters Using Kerberos</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Security"/> |
| <data name="Category" value="Kerberos"/> |
| <data name="Category" value="Authentication"/> |
| <data name="Category" value="Proxy"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| In a cluster using Kerberos, applications check host credentials to verify that the host |
| they are connecting to is the same one that is actually processing the request. |
| </p> |
| |
| <p> |
| In <keyword keyref="impala211_full">Impala 2.11</keyword> and lower versions, once you |
| enable a proxy server in a Kerberized cluster, users will not be able to connect to |
| individual impala daemons directly from impala-shell. |
| </p> |
| |
| <p> |
| In <keyword keyref="impala212_full">Impala 2.12</keyword> and higher versions, when you |
| enable a proxy server in a Kerberized cluster, users have an option to connect to Impala |
| daemons directly from <cmdname>impala-shell</cmdname> using the <codeph>-b</codeph> / |
| <codeph>--kerberos_host_fqdn</codeph> <cmdname>impala-shell</cmdname> flag. This option |
| can be used for testing or troubleshooting purposes, but not recommended for live |
| production environments as it defeats the purpose of a load balancer/proxy. |
| </p> |
| |
| <p> |
| Example: |
| <codeblock> |
| impala-shell -i impalad-1.mydomain.com -k -b loadbalancer-1.mydomain.com |
| </codeblock> |
| </p> |
| |
| <p> |
| Alternatively, with the fully qualified configurations: |
| <codeblock>impala-shell --impalad=impalad-1.mydomain.com:21000 --kerberos --kerberos_host_fqdn=loadbalancer-1.mydomain.com</codeblock> |
| </p> |
| |
| <p> |
| See <xref href="impala_shell_options.xml#shell_options"/> for information about the |
| option. |
| </p> |
| |
| <p> |
| To validate the load-balancing proxy server, perform these extra Kerberos setup steps: |
| </p> |
| |
| <ol> |
| <li> |
| This section assumes you are starting with a Kerberos-enabled cluster. See |
| <xref href="impala_kerberos.xml#kerberos"/> for instructions for setting up Impala |
| with Kerberos. See <xref keyref="cdh_sg_kerberos_prin_keytab_deploy"/> for general |
| steps to set up Kerberos. |
| </li> |
| |
| <li> |
| Choose the host you will use for the proxy server. Based on the Kerberos setup |
| procedure, it should already have an entry |
| <codeph>impala/<varname>proxy_host</varname>@<varname>realm</varname></codeph> in its |
| <filepath>keytab</filepath>. If not, go back over the initial Kerberos configuration |
| steps for the <filepath>keytab</filepath> on each host running the |
| <cmdname>impalad</cmdname> daemon. |
| </li> |
| |
| <li> |
| Copy the <filepath>keytab</filepath> file from the proxy host to all other hosts in |
| the cluster that run the <cmdname>impalad</cmdname> daemon. Put the |
| <filepath>keytab</filepath> file in a secure location on each of these other hosts. |
| </li> |
| |
| <li> |
| Add an entry |
| <codeph>impala/<varname>actual_hostname</varname>@<varname>realm</varname></codeph> to |
| the <filepath>keytab</filepath> on each host running the <cmdname>impalad</cmdname> |
| daemon. |
| </li> |
| |
| <li> |
| For each <cmdname>impalad</cmdname> node, merge the existing |
| <filepath>keytab</filepath> with the proxy’s <filepath>keytab</filepath> using |
| <cmdname>ktutil</cmdname>, producing a new <filepath>keytab</filepath> file. For |
| example: |
| <codeblock>$ ktutil |
| ktutil: read_kt proxy.keytab |
| ktutil: read_kt impala.keytab |
| ktutil: write_kt proxy_impala.keytab |
| ktutil: quit</codeblock> |
| </li> |
| |
| <li> |
| To verify that the <filepath>keytabs</filepath> are merged, run the command: |
| <codeblock> |
| klist -k <varname>keytabfile</varname> |
| </codeblock> |
| The command lists the credentials for both <codeph>principal</codeph> and |
| <codeph>be_principal</codeph> on all nodes. |
| </li> |
| |
| <li> |
| Make sure that the <codeph>impala</codeph> user has the permission to read this merged |
| <filepath>keytab</filepath> file. |
| </li> |
| |
| <li> |
| For each coordinator <codeph>impalad</codeph> host in the cluster that participates in |
| the load balancing, add the following configuration options to receive client |
| connections coming through the load balancer proxy server: |
| <codeblock> |
| --principal=impala/<varname>proxy_host@realm</varname> |
| --be_principal=impala/<varname>actual_host@realm</varname> |
| --keytab_file=<varname>path_to_merged_keytab</varname> |
| </codeblock> |
| <p> |
| The <codeph>--principal</codeph> setting prevents a client from connecting to a |
| coordinator <codeph>impalad</codeph> using a principal other than one specified. |
| </p> |
| |
| <note> |
| Every host has different <codeph>--be_principal</codeph> because the actual host |
| name is different on each host. Specify the fully qualified domain name (FQDN) for |
| the proxy host, not the IP address. Use the exact FQDN as returned by a reverse DNS |
| lookup for the associated IP address. |
| </note> |
| </li> |
| |
| <li> |
| Restart Impala to make the changes take effect. Restart the <cmdname>impalad</cmdname> |
| daemons on all hosts in the cluster, as well as the <cmdname>statestored</cmdname> and |
| <cmdname>catalogd</cmdname> daemons. |
| </li> |
| </ol> |
| |
| <section id="section_fjz_mfn_yjb"> |
| |
| <title>Client Connection to Proxy Server in Kerberized Clusters</title> |
| |
| <p> |
| When a client connect to Impala, the service principal specified by the client must |
| match the <codeph>-principal</codeph> setting of the Impala proxy server. And the |
| client should connect to the proxy server port. |
| </p> |
| |
| <p> |
| In <filepath>hue.ini</filepath>, set the following for to configure Hue to |
| automatically connect to the proxy server: |
| </p> |
| |
| <codeblock>[impala] |
| server_host=<varname>proxy_host</varname> |
| impala_principal=impala/<varname>proxy_host</varname></codeblock> |
| |
| <p> |
| The following are the JDBC connection string formats when connecting through the load |
| balancer with the load balancer's host name in the principal: |
| </p> |
| |
| <codeblock>jdbc:hive2://<varname>proxy_host</varname>:<varname>load_balancer_port</varname>/;principal=impala/_HOST@<varname>realm</varname> |
| jdbc:hive2://<varname>proxy_host</varname>:<varname>load_balancer_port</varname>/;principal=impala/<varname>proxy_host</varname>@<varname>realm</varname></codeblock> |
| |
| <p> |
| When starting <cmdname>impala-shell</cmdname>, specify the service principal via the |
| <codeph>-b</codeph> or <codeph>--kerberos_host_fqdn</codeph> flag. |
| </p> |
| |
| </section> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="proxy_tls"> |
| |
| <title>Special Proxy Considerations for TLS/SSL Enabled Clusters</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Security"/> |
| <data name="Category" value="TLS"/> |
| <data name="Category" value="Authentication"/> |
| <data name="Category" value="Proxy"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| When TLS/SSL is enabled for Impala, the client application, whether impala-shell, Hue, |
| or something else, expects the certificate common name (CN) to match the hostname that |
| it is connected to. With no load balancing proxy server, the hostname and certificate CN |
| are both that of the <codeph>impalad</codeph> instance. However, with a proxy server, |
| the certificate presented by the <codeph>impalad</codeph> instance does not match the |
| load balancing proxy server hostname. If you try to load-balance a TLS/SSL-enabled |
| Impala installation without additional configuration, you see a certificate mismatch |
| error when a client attempts to connect to the load balancing proxy host. |
| </p> |
| |
| <p> |
| You can configure a proxy server in several ways to load balance TLS/SSL enabled Impala: |
| </p> |
| |
| <dl> |
| <dlentry> |
| |
| <dt> |
| Client/Server SSL |
| </dt> |
| |
| <dd> |
| In this configuration, the proxy server presents an SSL certificate to the client, |
| decrypts the client request, then re-encrypts the request before sending it to the |
| backend <codeph>impalad</codeph>. The client and server certificates can be managed |
| separately. The request or resulting payload is encrypted in transit at all times. |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry> |
| |
| <dt> |
| TLS/SSL Passthrough |
| </dt> |
| |
| <dd> |
| In this configuration, traffic passes through to the backend |
| <codeph>impalad</codeph> instance with no interaction from the load balancing proxy |
| server. Traffic is still encrypted end-to-end. |
| </dd> |
| |
| <dd> |
| The same server certificate, utilizing either wildcard or Subject Alternate Name |
| (SAN), must be installed on each <codeph>impalad</codeph> instance. |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry> |
| |
| <dt> |
| TLS/SSL Offload |
| </dt> |
| |
| <dd> |
| In this configuration, all traffic is decrypted on the load balancing proxy server, |
| and traffic between the backend <codeph>impalad</codeph> instances is unencrypted. |
| This configuration presumes that cluster hosts reside on a trusted network and only |
| external client-facing communication need to be encrypted in-transit. |
| </dd> |
| |
| </dlentry> |
| </dl> |
| |
| <p> |
| Refer to your load balancer documentation for the steps to set up Impala and the load |
| balancer using one of the options above. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="tut_proxy"> |
| |
| <title>Example of Configuring HAProxy Load Balancer for Impala</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Configuring"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| If you are not already using a load-balancing proxy, you can experiment with |
| <xref href="http://haproxy.1wt.eu/" scope="external" format="html">HAProxy</xref> a |
| free, open source load balancer. This example shows how you might install and configure |
| that load balancer on a Red Hat Enterprise Linux system. |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| Install the load balancer: |
| </p> |
| <codeblock>yum install haproxy</codeblock> |
| </li> |
| |
| <li> |
| <p> |
| Set up the configuration file: <filepath>/etc/haproxy/haproxy.cfg</filepath>. See |
| the following section for a sample configuration file. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Run the load balancer (on a single host, preferably one not running |
| <cmdname>impalad</cmdname>): |
| </p> |
| <codeblock>/usr/sbin/haproxy –f /etc/haproxy/haproxy.cfg</codeblock> |
| </li> |
| |
| <li> |
| <p> |
| In <cmdname>impala-shell</cmdname>, JDBC applications, or ODBC applications, connect |
| to the listener port of the proxy host, rather than port 21000 or 21050 on a host |
| actually running <cmdname>impalad</cmdname>. The sample configuration file sets |
| haproxy to listen on port 25003, therefore you would send all requests to |
| <codeph><varname>haproxy_host</varname>:25003</codeph>. |
| </p> |
| </li> |
| </ul> |
| |
| <p> |
| This is the sample <filepath>haproxy.cfg</filepath> used in this example: |
| </p> |
| |
| <codeblock>global |
| # To have these messages end up in /var/log/haproxy.log you will |
| # need to: |
| # |
| # 1) configure syslog to accept network log events. This is done |
| # by adding the '-r' option to the SYSLOGD_OPTIONS in |
| # /etc/sysconfig/syslog |
| # |
| # 2) configure local2 events to go to the /var/log/haproxy.log |
| # file. A line like the following can be added to |
| # /etc/sysconfig/syslog |
| # |
| # local2.* /var/log/haproxy.log |
| # |
| log 127.0.0.1 local0 |
| log 127.0.0.1 local1 notice |
| chroot /var/lib/haproxy |
| pidfile /var/run/haproxy.pid |
| maxconn 4000 |
| user haproxy |
| group haproxy |
| daemon |
| |
| # turn on stats unix socket |
| #stats socket /var/lib/haproxy/stats |
| |
| #--------------------------------------------------------------------- |
| # common defaults that all the 'listen' and 'backend' sections will |
| # use if not designated in their block |
| # |
| # You might need to adjust timing values to prevent timeouts. |
| # |
| # The timeout values should be dependant on how you use the cluster |
| # and how long your queries run. |
| #--------------------------------------------------------------------- |
| defaults |
| mode http |
| log global |
| option httplog |
| option dontlognull |
| option http-server-close |
| option forwardfor except 127.0.0.0/8 |
| option redispatch |
| retries 3 |
| maxconn 3000 |
| timeout connect 5000 |
| timeout client 3600s |
| timeout server 3600s |
| |
| # |
| # This sets up the admin page for HA Proxy at port 25002. |
| # |
| listen stats :25002 |
| balance |
| mode http |
| stats enable |
| stats auth <varname>username</varname>:<varname>password</varname> |
| |
| # Setup for Impala. |
| # Impala client connect to load_balancer_host:25003. |
| # HAProxy will balance connections among the list of servers listed below. |
| # The list of Impalad is listening at port 21000 for beeswax (impala-shell) or original ODBC driver. |
| # For JDBC or ODBC version 2.x driver, use port 21050 instead of 21000. |
| listen impala :25003 |
| mode tcp |
| option tcplog |
| balance leastconn |
| |
| server <varname>symbolic_name_1</varname> impala-host-1.example.com:21000 check |
| server <varname>symbolic_name_2</varname> impala-host-2.example.com:21000 check |
| server <varname>symbolic_name_3</varname> impala-host-3.example.com:21000 check |
| server <varname>symbolic_name_4</varname> impala-host-4.example.com:21000 check |
| |
| # Setup for Hue or other JDBC-enabled applications. |
| # In particular, Hue requires sticky sessions. |
| # The application connects to load_balancer_host:21051, and HAProxy balances |
| # connections to the associated hosts, where Impala listens for |
| # JDBC requests at port 21050. |
| listen impalajdbc :21051 |
| mode tcp |
| option tcplog |
| balance source |
| |
| server <varname>symbolic_name_5</varname> impala-host-1.example.com:21050 check |
| server <varname>symbolic_name_6</varname> impala-host-2.example.com:21050 check |
| server <varname>symbolic_name_7</varname> impala-host-3.example.com:21050 check |
| server <varname>symbolic_name_8</varname> impala-host-4.example.com:21050 check |
| </codeblock> |
| |
| <note type="important"> |
| Hue requires the <codeph>check</codeph> option at end of each line in the above file to |
| ensure HAProxy can detect any unreachable <cmdname>Impalad</cmdname> server, and |
| failover can be successful. Without the TCP check, you may hit an error when the |
| <cmdname>impalad</cmdname> daemon to which Hue tries to connect is down. |
| </note> |
| |
| <note conref="../shared/impala_common.xml#common/proxy_jdbc_caveat"/> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |