This section provides instructions for installing a HAWQ system.
Note: Install HAWQ from the command line only if you do not use Ambari to install and manage HDFS. If you use Ambari for HDFS and management, follow the instructions in Install HAWQ using Ambari instead.
Install a compatible HDFS cluster before you attempt to install HAWQ.
Configure operating system parameters on each host machine before you begin to install HAWQ.
Use a text editor to edit the /etc/sysctl.conf file. Add or edit each of the following parameter definitions to set the required value:
```
kernel.shmmax = 1000000000
kernel.shmmni = 4096
kernel.shmall = 4000000000
kernel.sem = 250 512000 100 2048
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
net.ipv4.tcp_syncookies = 0
net.ipv4.ip_forward = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 200000
net.ipv4.conf.all.arp_filter = 1
net.ipv4.ip_local_port_range = 1281 65535
net.core.netdev_max_backlog = 200000
vm.overcommit_memory = 2
fs.nr_open = 3000000
kernel.threads-max = 798720
kernel.pid_max = 798720
# increase network
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
```
Save the file after making the required changes.
Execute the following command to apply your updated /etc/sysctl.conf file to the operating system configuration:
$ sysctl -p
Use a text editor to edit the /etc/security/limits.conf file. Add the following definitions in the exact order that they are listed:
```
* soft nofile 2900000
* hard nofile 2900000
* soft nproc 131072
* hard nproc 131072
```
Save the file after making the required changes.
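Before applying the changes, it can help to confirm that no required key was accidentally dropped while editing. The following sketch is an illustration, not part of the official procedure; it scans an abbreviated sysctl.conf-style sample (a stand-in for the real /etc/sysctl.conf) for a few of the required keys:

```shell
# Sketch: scan a sysctl.conf-style file for required keys before running
# sysctl -p. The file contents and key list are abbreviated examples.
conf=$(mktemp)
cat > "$conf" <<'EOF'
kernel.shmmax = 1000000000
kernel.shmmni = 4096
vm.overcommit_memory = 2
EOF

missing=0
for key in kernel.shmmax kernel.shmmni vm.overcommit_memory kernel.sem; do
    if ! grep -q "^${key}[[:space:]]*=" "$conf"; then
        echo "missing: $key"
        missing=$((missing + 1))
    fi
done
rm -f "$conf"
echo "missing parameters: $missing"
```

Run against the real file, a nonzero missing count means the file needs another pass before you apply it.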
Use a text editor to edit the hdfs-site.xml file. Ensure that the following HDFS parameters are set to the correct values:
Property | Setting |
---|---|
dfs.allow.truncate | true |
dfs.block.access.token.enable | false for an unsecured HDFS cluster, or true for a secure cluster |
dfs.block.local-path-access.user | gpadmin |
dfs.client.read.shortcircuit | true |
dfs.client.socket-timeout | 300000000 |
dfs.client.use.legacy.blockreader.local | false |
dfs.datanode.data.dir.perm | 750 |
dfs.datanode.handler.count | 60 |
dfs.datanode.max.transfer.threads | 40960 |
dfs.datanode.socket.write.timeout | 7200000 |
dfs.namenode.accesstime.precision | 0 |
dfs.namenode.handler.count | 600 |
dfs.support.append | true |
Use a text editor to edit the core-site.xml file. Ensure that the following HDFS parameters are set to the correct values:
Property | Setting |
---|---|
ipc.client.connection.maxidletime | 3600000 |
ipc.client.connect.timeout | 300000 |
ipc.server.listen.queue.size | 3300 |
Restart HDFS to apply your configuration changes.
Ensure that the /etc/hosts file on each cluster node contains the hostname of every other member of the cluster. Consider creating a single, master /etc/hosts file and either copying it or referencing it on every host that will take part in the cluster.
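One quick way to verify the /etc/hosts coverage described above is to compare a list of cluster node names against the hosts file. This sketch uses inline sample files as stand-ins for the real /etc/hosts and your own node list:

```shell
# Sketch: report any node names that have no entry in an /etc/hosts-style
# file. The sample contents below are illustrative stand-ins.
hosts_file=$(mktemp)
node_list=$(mktemp)
cat > "$hosts_file" <<'EOF'
192.168.1.10 mdw
192.168.1.11 smdw
192.168.1.21 sdw1
EOF
printf '%s\n' mdw smdw sdw1 sdw2 > "$node_list"

# Collect any node names missing from the hosts file.
unresolved=$(while read -r node; do
    grep -qw "$node" "$hosts_file" || echo "$node"
done < "$node_list")
echo "unresolved: ${unresolved:-none}"
rm -f "$hosts_file" "$node_list"
```

In this sample, sdw2 is reported as unresolved; on a correctly prepared cluster the check should report none.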
Follow this procedure to install the HAWQ cluster on multiple host machines or VMs.
Note: If you want to install a cluster on a single host machine or VM, follow the instructions in Install the HAWQ Cluster on a Single Machine instead.
Login to the target master host machine as the root user. If you are logged in as a different user, switch to the root account:
$ su - root
Download the Apache HAWQ tarball distribution (hdb-2.0.0.0-18407.tar.gz) to the local machine.
Unzip the tarball file:
$ tar xzf hdb-2.0.0.0-18407.tar.gz
The files are uncompressed into a hdb-2.0.0.0 subdirectory.
Execute the RPM installer:
```
$ cd hdb-2.0.0.0
$ rpm -ivh hawq-2.0.0.0-18407.x86_64.rpm
```
Source the greenplum_path.sh file to set your environment for HAWQ. For RPM installations, enter:
$ source /usr/local/hawq/greenplum_path.sh
If you downloaded the tarball, substitute the path to the extracted greenplum_path.sh file (for example /opt/hawq-2.0.0.0/greenplum_path.sh).
Use a text editor to create a file (for example, hostfile) that lists the hostname of the master node, the standby master node, and each segment node used in the cluster. Specify one hostname per line, as in the following example that defines a cluster with a master node, standby master node, and three segments:
```
mdw
smdw
sdw1
sdw2
sdw3
```
Create a second text file (for example, seg_hosts) that specifies only the HAWQ segment hosts in your cluster. For example:
```
sdw1
sdw2
sdw3
```
You will use the hostfile and seg_hosts files in subsequent steps to perform common configuration tasks on multiple nodes of the cluster.
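Because seg_hosts is simply hostfile minus the master and standby master entries, it can also be derived rather than maintained by hand. A sketch, assuming the example hostnames used above (mdw and smdw for the masters):

```shell
# Sketch: derive seg_hosts from hostfile by filtering out the master and
# standby master lines. Hostnames follow the examples in this section.
hostfile=$(mktemp)
printf '%s\n' mdw smdw sdw1 sdw2 sdw3 > "$hostfile"

# Drop the master and standby master; what remains are the segment hosts.
seg_hosts=$(grep -v -E '^(mdw|smdw)$' "$hostfile")
echo "$seg_hosts"
rm -f "$hostfile"
```

Redirect the output to a seg_hosts file to keep the two lists in sync whenever the cluster membership changes.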
HAWQ requires passwordless ssh access to all cluster nodes. Execute the following hawq command to set up passwordless ssh on the nodes defined in your hostfile:
$ hawq ssh-exkeys -f hostfile
Execute the following hawq command to copy the installation RPM to all nodes defined in your hostfile:
$ hawq scp -f hostfile hawq-2.0.0.0-18407.x86_64.rpm =:~/
Execute this command to initiate the RPM installation on all cluster hosts:
$ hawq ssh -f hostfile -e "rpm -ivh ~/hawq-*.rpm"
Create the gpadmin user on all hosts in the cluster:
```
$ hawq ssh -f hostfile -e '/usr/sbin/useradd gpadmin'
$ hawq ssh -f hostfile -e 'echo -e "changeme\nchangeme" | passwd gpadmin'
```
Switch to the gpadmin user, and source the greenplum_path.sh file once again:
```
$ su - gpadmin
$ source /usr/local/hawq/greenplum_path.sh
```
Execute the following hawq command a second time to set up passwordless ssh for the gpadmin user:
$ hawq ssh-exkeys -f hostfile
Execute this command to confirm that HAWQ was installed on all of the hosts:
$ hawq ssh -f hostfile -e "ls -l $GPHOME"
Create a directory to use for storing the master catalog data. This directory must have sufficient disk space to store data, and it must be owned by the gpadmin user. For example:
```
$ mkdir -p /data/master
$ chown -R gpadmin /data/master
```
If you are using a standby master node, create the same directory on the standby host. For example:
```
$ hawq ssh -h smdw -e 'mkdir /data/master'
$ hawq ssh -h smdw -e 'chown -R gpadmin /data/master'
```
Replace smdw with the name of the standby host.
Use the following hawq commands with the seg_hosts file you created earlier to create the data directory on all of the HAWQ segment hosts:
```
$ hawq ssh -f seg_hosts -e 'mkdir /data/segment'
$ hawq ssh -f seg_hosts -e 'chown gpadmin /data/segment'
```
HAWQ temporary directories are used for spill files. Create the HAWQ temporary directories on all HAWQ hosts in your cluster. As a best practice, use a directory from each mounted drive available on your machines to load balance writing for temporary files. These drives are typically the same drives that are used by your DataNode service. For example, if you have two drives mounted on /data1 and /data2, you could use /data1/tmp and /data2/tmp for storing temporary files. If you do not specify master or segment temporary directories, temporary files are stored in /tmp.
The following example commands use two disks with the paths /data1/tmp and /data2/tmp:
```
$ dirs="/data1/tmp /data2/tmp"
$ mkdir -p $dirs
$ chown -R gpadmin $dirs
$ hawq ssh -h smdw -e "mkdir -p $dirs"
$ hawq ssh -h smdw -e "chown -R gpadmin $dirs"
$ hawq ssh -f seg_hosts -e "mkdir -p $dirs"
$ hawq ssh -f seg_hosts -e "chown -R gpadmin $dirs"
```
If you configure too few temp directories, or you place multiple temp directories on the same disk, you increase the risk of disk contention or running out of disk space when multiple virtual segments target the same disk. Each HAWQ segment node can have 6 virtual segments.
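When you later set hawq_master_temp_directory and hawq_segment_temp_directory, the value is a comma-separated list of these directories. A sketch that builds that list from a set of mounted drives (the mount points are assumptions matching the example above):

```shell
# Sketch: build the comma-separated temp-directory list for the
# hawq_*_temp_directory properties from a list of mounted data drives.
drives="/data1 /data2"
tmp_dirs=""
for d in $drives; do
    # Append one tmp directory per drive, comma-separating the entries.
    tmp_dirs="${tmp_dirs:+$tmp_dirs,}$d/tmp"
done
echo "$tmp_dirs"
```

For the two example drives this produces /data1/tmp,/data2/tmp, which is the form used in the property table later in this section.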
Login to the master host as the gpadmin user. Create a customized $GPHOME/etc/hawq-site.xml file using the template $GPHOME/etc/template-hawq-site.xml file for a multi-node cluster. Your custom hawq-site.xml should include the following modifications:
Change the hawq_dfs_url property definition to use the actual NameNode port number as well as the HAWQ data directory:
<property>
    <name>hawq_dfs_url</name>
    <value>localhost:8020/hawq_default</value>
    <description>URL for accessing HDFS.</description>
</property>
If HDFS is configured with NameNode high availability (HA), hawq_dfs_url should instead include the service ID that you configured. For example, if you configured HA with the service name hdpcluster, the entry would be similar to:
<property>
    <name>hawq_dfs_url</name>
    <value>hdpcluster/hawq_default</value>
    <description>URL for accessing HDFS.</description>
</property>
Also set gpadmin as the owner of the parent HDFS directory that you specify. For example:
$ hdfs dfs -chown gpadmin /
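To spot-check the hawq_dfs_url edit without reopening the file, the value can be extracted with standard text tools. This sketch uses an inline sample in place of the real $GPHOME/etc/hawq-site.xml:

```shell
# Sketch: pull the hawq_dfs_url value out of a hawq-site.xml-style file.
# The inline sample stands in for $GPHOME/etc/hawq-site.xml.
site=$(mktemp)
cat > "$site" <<'EOF'
<property>
    <name>hawq_dfs_url</name>
    <value>localhost:8020/hawq_default</value>
    <description>URL for accessing HDFS.</description>
</property>
EOF

# Find the property by name, then extract the value on the following line.
dfs_url=$(grep -A 1 '<name>hawq_dfs_url</name>' "$site" |
    sed -n 's|.*<value>\(.*\)</value>.*|\1|p')
echo "$dfs_url"
rm -f "$site"
```

This relies on the value element following directly after the name element, which matches the layout shown in this guide; an XML-aware tool would be more robust for arbitrary formatting.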
Configure additional properties to specify the host names, port numbers, and directories used in your system. For example:
Property | Example Value |
---|---|
hawq_master_address_host | mdw |
hawq_master_address_port | 20000 |
hawq_standby_address_host | smdw |
hawq_segment_address_port | 40000 |
hawq_master_directory | /data/master |
hawq_segment_directory | /data/segment |
hawq_master_temp_directory | /data1/tmp,/data2/tmp |
hawq_segment_temp_directory | /data1/tmp,/data2/tmp |
hawq_global_rm_type | none |
Caution: If you are installing HAWQ in secure mode (Kerberos-enabled), then set hawq_global_rm_type to standalone mode (none) to avoid encountering a known installation issue. You can enable YARN mode after installation if you want to use YARN resource management in HAWQ.
If you wish to use YARN mode for HAWQ resource management, configure YARN properties for HAWQ. For example, in $GPHOME/etc/hawq-site.xml:
Property | Example Value | Comment |
---|---|---|
hawq_global_rm_type | yarn | When set to yarn, HAWQ requires that you also configure the additional YARN parameters (hawq_rm_yarn_address and hawq_rm_yarn_scheduler_address). |
hawq_rm_yarn_address | mdw:9980 | This property must match the yarn.resourcemanager.address value in yarn-site.xml . |
hawq_rm_yarn_scheduler_address | mdw:9981 | This property must match the yarn.resourcemanager.scheduler.address value in yarn-site.xml . |
If you have high availability enabled for YARN resource managers, then you must also configure the HA parameters in $GPHOME/etc/yarn-client.xml. For example:
Property | Example Value | Comment |
---|---|---|
yarn.resourcemanager.ha | rm1.example.com:8032, rm2.example.com:8032 | Comma-delimited list of resource manager hosts. When high availability is enabled, YARN ignores the value in hawq_rm_yarn_address and uses this property's value instead. |
yarn.resourcemanager.scheduler.ha | rm1.example.com:8030, rm2.example.com:8030 | Comma-delimited list of scheduler hosts. When high availability is enabled, YARN ignores the value in hawq_rm_yarn_scheduler_address and uses this property's value instead. |
Replace the example hostnames with the fully qualified domain names of your resource manager host machines.
Edit the $GPHOME/etc/slaves file to list all of the segment host names for your cluster. For example:
```
sdw1
sdw2
sdw3
```
If your HDFS cluster is configured with NameNode high availability (HA), edit the ${GPHOME}/etc/hdfs-client.xml file on each segment and add the following NameNode properties:
<property>
    <name>dfs.ha.namenodes.hdpcluster</name>
    <value>nn1,nn2</value>
</property>
<property>
    <name>dfs.namenode.http-address.hdpcluster.nn1</name>
    <value>ip-address-1.mycompany.com:50070</value>
</property>
<property>
    <name>dfs.namenode.http-address.hdpcluster.nn2</name>
    <value>ip-address-2.mycompany.com:50070</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.hdpcluster.nn1</name>
    <value>ip-address-1.mycompany.com:8020</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.hdpcluster.nn2</name>
    <value>ip-address-2.mycompany.com:8020</value>
</property>
<property>
    <name>dfs.nameservices</name>
    <value>hdpcluster</value>
</property>
In the listing above:

- Replace hdpcluster with the actual service ID that is configured in HDFS.
- Replace ip-address-1.mycompany.com:50070 and ip-address-2.mycompany.com:50070 with the actual NameNode HTTP hosts and port numbers that are configured in HDFS.
- Replace ip-address-1.mycompany.com:8020 and ip-address-2.mycompany.com:8020 with the actual NameNode RPC hosts and port numbers that are configured in HDFS.

The order of the NameNodes listed in dfs.ha.namenodes.hdpcluster is important for performance, especially when running secure HDFS. The first entry (nn1 in the example above) should correspond to the active NameNode.

Synchronize the customized hawq-site.xml and slaves files to all cluster nodes:
$ hawq scp -f hostfile hawq-site.xml slaves =:$GPHOME/etc/
Finally, initialize and start the new HAWQ cluster using the command:
$ hawq init cluster
After the cluster starts, you can follow the instructions in Validate the Installation.
Follow this procedure to install HAWQ software on a single host machine.
Login to the target machine as the root user. If you are logged in as a different user, switch to the root account:
$ su - root
Download the HAWQ tarball distribution (hdb-2.0.0.0-18407.tar.gz) to the local machine.
Unzip the tarball file:
$ tar xzf hdb-2.0.0.0-18407.tar.gz
The files are uncompressed into a hdb-2.0.0.0 subdirectory.
Execute the RPM installer:
```
$ cd hdb-2.0.0.0
$ rpm -ivh hawq-2.0.0.0-18407.x86_64.rpm
```
Switch to the gpadmin user:
$ su - gpadmin
Source the greenplum_path.sh file to set your environment for HAWQ. For RPM installations, enter:
$ source /usr/local/hawq/greenplum_path.sh
If you downloaded the tarball, substitute the path to the extracted greenplum_path.sh file (for example /opt/hawq-2.0.0.0/greenplum_path.sh).
HAWQ requires passwordless ssh access to all cluster nodes, even on a single-node cluster installation. Execute the following hawq command to exchange keys and enable passwordless SSH to localhost:
$ hawq ssh-exkeys -h localhost
If your HDFS NameNode does not use the default port, 8020, then open the $GPHOME/etc/hawq-site.xml file with a text editor and modify the following property definition to use the actual NameNode port number:
<property>
    <name>hawq_dfs_url</name>
    <value>localhost:8020/hawq_default</value>
    <description>URL for accessing HDFS.</description>
</property>
Also ensure that the gpadmin user has both read and write access to the parent directory specified with hawq_dfs_url. For example:
$ hdfs dfs -chown gpadmin hdfs://localhost:8020/
Finally, initialize and start the new HAWQ cluster using the command:
$ hawq init cluster
After the cluster starts, you can follow the instructions in Validate the Installation.
The following additional steps are necessary only if you manage your system manually (without using Ambari), and you enabled Kerberos security for HDFS.
For manual installations, perform these additional steps after you complete the previous procedure:
Ensure that the HDFS parameter dfs.block.access.token.enable is set to true for the secure HDFS cluster. This property can be set within Ambari via Services > HDFS > Configs > Advanced hdfs-site > dfs.block.access.token.enable. After modifying this parameter, you must restart HDFS.
Login to the Kerberos Key Distribution Center (KDC) server as the root user.
Use kadmin.local to create a new principal for the postgres user. This principal will be used by the HAWQ master segment host:
$ kadmin.local -q "addprinc -randkey postgres@LOCAL.DOMAIN"
Use kadmin.local to generate a Kerberos service principal for all other hosts that will run a HAWQ segment with the PXF service. The service principal should be of the form name/role@REALM, where:

- name is the service name (for example, pxf). Use the same name for each HAWQ host.
- role is the fully qualified domain name of the host (the output of the hostname -f command).
- REALM is the Kerberos realm (for example, LOCAL.DOMAIN).

You can generate all principals on the KDC server. For example, these commands add service principals for three HAWQ nodes on the hosts host1.example.com, host2.example.com, and host3.example.com:
```
$ kadmin.local -q "addprinc -randkey pxf/host1.example.com@LOCAL.DOMAIN"
$ kadmin.local -q "addprinc -randkey pxf/host2.example.com@LOCAL.DOMAIN"
$ kadmin.local -q "addprinc -randkey pxf/host3.example.com@LOCAL.DOMAIN"
```
Repeat this step for each host that runs a PXF service. Substitute the fully-qualified hostname of each host machine and the realm name used in your configuration.
Note: As an alternative, if you have a hosts file that lists the fully qualified domain name of each cluster host (one host per line), then you can generate principals using the command:

```
$ for HOST in $(cat hosts) ; do sudo kadmin.local -q "addprinc -randkey pxf/$HOST@LOCAL.DOMAIN" ; done
```
Generate a keytab file for each principal that you created (for the postgres principal and for each pxf service principal). You can store the keytab files in any convenient location (this example uses the directory /etc/security/keytabs). You will deploy the service principal keytab files to their respective HAWQ host machines in a later step:
```
$ kadmin.local -q "xst -k /etc/security/keytabs/hawq.service.keytab postgres@LOCAL.DOMAIN"
$ kadmin.local -q "xst -k /etc/security/keytabs/pxf-host1.service.keytab pxf/host1.example.com@LOCAL.DOMAIN"
$ kadmin.local -q "xst -k /etc/security/keytabs/pxf-host2.service.keytab pxf/host2.example.com@LOCAL.DOMAIN"
$ kadmin.local -q "xst -k /etc/security/keytabs/pxf-host3.service.keytab pxf/host3.example.com@LOCAL.DOMAIN"
$ kadmin.local -q "listprincs"
```
Repeat the xst command as necessary to generate a keytab for each HAWQ service principal that you created in the previous step.
The HAWQ master server also requires a hdfs.headless.keytab file for the HDFS principal to be available under /etc/security/keytabs. If this file does not already exist, generate it using the command:
```
$ kadmin.local -q "addprinc -randkey hdfs@LOCAL.DOMAIN"
$ kadmin.local -q "xst -k /etc/security/keytabs/hdfs.headless.keytab hdfs@LOCAL.DOMAIN"
```
Note: HAWQ initialization will fail if the above keytab file is not available.
Copy the HAWQ service keytab file (and the hdfs.headless.keytab file if you created one) to the HAWQ master segment host:
```
$ scp /etc/security/keytabs/hawq.service.keytab hawq_master_fqdn:/etc/security/keytabs/hawq.service.keytab
$ scp /etc/security/keytabs/hdfs.headless.keytab hawq_master_fqdn:/etc/security/keytabs/hdfs.headless.keytab
```
Change the ownership and permissions on hawq.service.keytab (and on hdfs.headless.keytab if you copied it) as follows:
```
$ ssh hawq_master_fqdn chown gpadmin:gpadmin /etc/security/keytabs/hawq.service.keytab
$ ssh hawq_master_fqdn chmod 400 /etc/security/keytabs/hawq.service.keytab
$ ssh hawq_master_fqdn chown hdfs:hdfs /etc/security/keytabs/hdfs.headless.keytab
$ ssh hawq_master_fqdn chmod 440 /etc/security/keytabs/hdfs.headless.keytab
```
Copy the keytab file for each service principal to its respective HAWQ host:
```
$ scp /etc/security/keytabs/pxf-host1.service.keytab host1.example.com:/etc/security/keytabs/pxf.service.keytab
$ scp /etc/security/keytabs/pxf-host2.service.keytab host2.example.com:/etc/security/keytabs/pxf.service.keytab
$ scp /etc/security/keytabs/pxf-host3.service.keytab host3.example.com:/etc/security/keytabs/pxf.service.keytab
```
Note: Repeat this step for all of the keytab files that you created. Record the full path of the keytab file on each host machine. During installation, you will need to provide the path to finish configuring Kerberos for HAWQ.
Change the ownership and permissions on the pxf.service.keytab files:
```
$ ssh host1.example.com chown pxf:pxf /etc/security/keytabs/pxf.service.keytab
$ ssh host1.example.com chmod 400 /etc/security/keytabs/pxf.service.keytab
$ ssh host2.example.com chown pxf:pxf /etc/security/keytabs/pxf.service.keytab
$ ssh host2.example.com chmod 400 /etc/security/keytabs/pxf.service.keytab
$ ssh host3.example.com chown pxf:pxf /etc/security/keytabs/pxf.service.keytab
$ ssh host3.example.com chmod 400 /etc/security/keytabs/pxf.service.keytab
```
Note: Repeat this step for all of the keytab files that you copied.
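The per-host copy and permission commands above follow a fixed pattern, so for larger clusters they can be generated rather than typed. This sketch only prints the commands (a dry run) for the example hostnames used in this section; review the output before executing it, for example by piping it to sh:

```shell
# Sketch: generate (but do not run) the keytab distribution commands for
# a list of PXF hosts. Hostnames and paths follow the examples above.
hosts="host1.example.com host2.example.com host3.example.com"
cmds=$(for h in $hosts; do
    short=${h%%.*}   # e.g. host1.example.com -> host1
    echo "scp /etc/security/keytabs/pxf-${short}.service.keytab ${h}:/etc/security/keytabs/pxf.service.keytab"
    echo "ssh ${h} chown pxf:pxf /etc/security/keytabs/pxf.service.keytab"
    echo "ssh ${h} chmod 400 /etc/security/keytabs/pxf.service.keytab"
done)
echo "$cmds"
```

For the three example hosts this prints nine commands, one scp plus one chown and one chmod per host.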
After you have created and copied all required keytab files to their respective hosts, continue to install the HAWQ and PXF services if necessary.
If you already installed HAWQ and PXF and you enabled HDFS security without using Ambari, follow the steps in (Optional) Enable Kerberos to finish enabling HAWQ and PXF security.
On each PXF node, edit the /etc/gphd/pxf/conf/pxf-site.xml file to specify the local keytab file and security principal. Add or uncomment the following properties:
<property>
    <name>pxf.service.kerberos.keytab</name>
    <value>/etc/security/phd/keytabs/pxf.service.keytab</value>
    <description>path to keytab file owned by pxf service with permissions 0400</description>
</property>
<property>
    <name>pxf.service.kerberos.principal</name>
    <value>pxf/_HOST@PHD.LOCAL</value>
    <description>Kerberos principal pxf service should use. _HOST is replaced automatically with hostnames FQDN</description>
</property>
Also uncomment the KDC section in yarn-client.xml if you want to configure YARN for security.
Perform the remaining steps on the HAWQ master node as the gpadmin user:
Login to the HAWQ database master server as the gpadmin user:
$ ssh hawq_master_fqdn
Run the following command to set your environment variables:
$ source /usr/local/hawq/greenplum_path.sh
Note: Substitute the correct value of MASTER_DATA_DIRECTORY for your configuration.
Run the following commands to enable security and configure the keytab file:
```
$ hawq config -c enable_secure_filesystem -v ON
$ hawq config -c krb_server_keyfile -v /etc/security/keytabs/hawq.service.keytab
```
Start the HAWQ service:
$ hawq start cluster -a
Obtain a Kerberos ticket and change the ownership and permissions of the HAWQ HDFS data directory:
```
$ sudo -u hdfs kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs
$ sudo -u hdfs hdfs dfs -chown -R gpadmin:gpadmin /hawq_data
```
Substitute the actual HDFS data directory name for your system.
On the HAWQ master node and on all segment server nodes, edit the /usr/local/hawq/etc/hdfs-client.xml file to enable Kerberos security and assign the HDFS NameNode principal. Add or uncomment the following property in each file:
<property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
</property>
If you are using YARN for resource management, edit the yarn-client.xml file to enable Kerberos security. Add or uncomment the following property in the yarn-client.xml file on each HAWQ node:
<property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
</property>
Restart HAWQ as the gpadmin user:
$ hawq restart cluster -a -M fast
Perform these basic commands to ensure that the new cluster is functional.
Ensure that you have initialized and started the new cluster by using the hawq init cluster command, as described in the previous procedures.
Source the greenplum_path.sh file to set your environment for HAWQ. For RPM installations, enter:
$ source /usr/local/hawq/greenplum_path.sh
If you downloaded the tarball, substitute the path to the extracted greenplum_path.sh file (for example /opt/hawq-2.0.0.0/greenplum_path.sh).
If you use a custom HAWQ master port number, set it in your environment. For example:
$ export PGPORT=5432
Start the psql interactive utility, connecting to the postgres database:
```
$ psql -d postgres
psql (8.2.15)
Type "help" for help.

postgres=#
```
Create a new database and connect to it:
```
postgres=# create database mytest;
CREATE DATABASE
postgres=# \c mytest
You are now connected to database "mytest" as user "username".
```
Create a new table and insert sample data:
```
mytest=# create table t (i int);
CREATE TABLE
mytest=# insert into t select generate_series(1,100);
```
Activate timing and perform a simple query:
```
mytest=# \timing
Timing is on.
mytest=# select count(*) from t;
 count
-------
   100
(1 row)

Time: 7.266 ms
```
If you plan on accessing data in external systems such as HDFS files, Hive, or HBase:
Follow the instructions in Post-Install Procedure for Hive and HBase on HDP to complete the PXF configuration.
Install the appropriate PXF plugin for the external system. We recommend installing the PXF plugin for the desired external system on all nodes in your cluster. See Installing PXF Plugins.