****************** FailMon Quick Start Guide ***********************

This document is a guide to quickly setting up and running FailMon.
For more information and details please see the FailMon User Manual.

***** Building FailMon *****

Normally, FailMon lies under <hadoop-dir>/src/contrib/failmon, where
<hadoop-dir> is the Hadoop project root folder. To compile it, one
can either run ant for the whole Hadoop project, i.e.:

$ cd <hadoop-dir>
$ ant

or run ant only for FailMon:

$ cd <hadoop-dir>/src/contrib/failmon
$ ant

The above will compile FailMon and place all class files under
<hadoop-dir>/build/contrib/failmon/classes.

By invoking:

$ cd <hadoop-dir>/src/contrib/failmon
$ ant tar

FailMon is packaged as a standalone application in
<hadoop-dir>/src/contrib/failmon/failmon.tar.gz.


***** Deploying FailMon *****

There are two ways FailMon can be deployed in a cluster:

a) Within Hadoop, in which case the whole Hadoop package is uploaded
to the cluster nodes, and nothing else needs to be done on individual
nodes.

b) Independently of the Hadoop deployment, i.e., by uploading
failmon.tar.gz to all nodes and uncompressing it. In that case, the
bin/failmon.sh script needs to be edited: the environment variable
HADOOPDIR should point to the root directory of the Hadoop
distribution. Also, the location of the Hadoop configuration files
should be specified in the 'hadoop.conf.path' property in the file
conf/failmon.properties. Note that these configuration files refer
to the HDFS in which we want to store the FailMon data (which can
potentially be different from the one on the cluster we are
monitoring). An example of these edits is given below.

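As an illustration only, the two edits for a standalone deployment
could look as follows. The paths are examples that depend on the
local installation, and the exact form of the assignment in
bin/failmon.sh may differ slightly.

In bin/failmon.sh:

  HADOOPDIR=/usr/local/hadoop

In conf/failmon.properties:

  hadoop.conf.path=/usr/local/hadoop/conf
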
Either way, we assume that FailMon is placed in the same directory
on all nodes, which is typical for most clusters. If this is not
feasible, one should create on every node a symbolic link with the
same name, pointing to that node's FailMon directory, for example as
shown below.

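For instance, if a node keeps FailMon under a different local path,
a link with the agreed-upon name can point to it (both paths below
are illustrative):

$ ln -s /mnt/data/failmon /opt/failmon
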
One should also edit the conf/failmon.properties file on each node
to set site-specific property values. However, the default values
are expected to serve most practical cases. Refer to the FailMon
User Manual for details on the various properties and configuration
parameters.


***** Running FailMon *****

In order to run FailMon using a node to do the ad-hoc scheduling of
monitoring jobs, one needs to edit the hosts.list file to specify the
list of machine hostnames on which FailMon is to be run. Also, in the
file conf/global.config, the username used to connect to the machines
(passwordless SSH is assumed) has to be specified in the property
'ssh.username', and the path to the FailMon folder in the property
'failmon.dir' (it is assumed to be the same on all machines in the
cluster); an example is given after the command below. Then one only
needs to invoke the command:

$ cd <hadoop-dir>
$ bin/scheduler.py

to start the system.

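For illustration, a minimal setup could look as follows. The
hostnames, username, and path are only examples, and the exact
syntax of conf/global.config is described in the FailMon User
Manual.

In hosts.list, one hostname per line:

  node01.example.com
  node02.example.com

In conf/global.config:

  ssh.username=hadoop
  failmon.dir=/opt/failmon

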
***** Merging HDFS files *****

For the purpose of merging the files created on HDFS by FailMon, the
following command can be used:

$ cd <hadoop-dir>
$ bin/failmon.sh --mergeFiles

This will concatenate all files in the HDFS folder (pointed to by
the 'hdfs.upload.dir' property in the conf/failmon.properties file)
into a single file, which will be placed in the same folder. As with
deployment, the location of the Hadoop configuration files should be
specified in the 'hadoop.conf.path' property in the file
conf/failmon.properties. Note that these configuration files refer
to the HDFS in which we have stored the FailMon data (which can
potentially be different from the one on the cluster we are
monitoring). Also, the scheduler.py script can be set up to merge
the HDFS files when their number surpasses a configurable limit (see
the conf/global.config file).

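To inspect the result of a merge, one can list the upload folder
with the standard HDFS shell. The path below assumes that
'hdfs.upload.dir' is set to /failmon; substitute the value
configured in conf/failmon.properties:

$ cd <hadoop-dir>
$ bin/hadoop fs -ls /failmon
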
Please refer to the FailMon User Manual for more details.