| # Copyright 2008 The Apache Software Foundation Licensed under the |
| # Apache License, Version 2.0 (the "License"); you may not use this |
| # file except in compliance with the License. You may obtain a copy |
| # of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless |
| # required by applicable law or agreed to in writing, software |
| # distributed under the License is distributed on an "AS IS" BASIS, |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or |
| # implied. See the License for the specific language governing |
| # permissions and limitations under the License. |
| |
| This package implements a Distributed Raid File System. It is used alongwith |
| an instance of the Hadoop Distributed File System (HDFS). It can be used to |
| provide better protection against data corruption. It can also be used to |
| reduce the total storage requirements of HDFS. |
| |
| Distributed Raid File System consists of two main software components. The first component |
| is the RaidNode, a daemon that creates parity files from specified HDFS files. |
| The second component "raidfs" is a software that is layered over a HDFS client and it |
| intercepts all calls that an application makes to the HDFS client. If HDFS encounters |
| corrupted data while reading a file, the raidfs client detects it; it uses the |
| relevant parity blocks to recover the corrupted data (if possible) and returns |
| the data to the application. The application is completely transparent to the |
| fact that parity data was used to satisfy it's read request. |
| |
| The primary use of this feature is to save disk space for HDFS files. |
| HDFS typically stores data in triplicate. |
| The Distributed Raid File System can be configured in such a way that a set of |
| data blocks of a file are combined together to form one or more parity blocks. |
| This allows one to reduce the replication factor of a HDFS file from 3 to 2 |
| while keeping the failure probabilty relatively same as before. This typically |
| results in saving 25% to 30% of storage space in a HDFS cluster. |
| |
| -------------------------------------------------------------------------------- |
| |
| BUILDING: |
| |
| In HADOOP_HOME, run ant package to build Hadoop and its contrib packages. |
| |
| -------------------------------------------------------------------------------- |
| |
| INSTALLING and CONFIGURING: |
| |
| The entire code is packaged in the form of a single jar file hadoop-*-raid.jar. |
| To use HDFS Raid, you need to put the above mentioned jar file on |
| the CLASSPATH. The easiest way is to copy the hadoop-*-raid.jar |
| from HADOOP_HOME/build/contrib/raid to HADOOP_HOME/lib. Alternatively |
| you can modify HADOOP_CLASSPATH to include this jar, in conf/hadoop-env.sh. |
| |
| There is a single configuration file named raid.xml that describes the HDFS |
| path(s) that you want to raid. A sample of this file can be found in |
| sc/contrib/raid/conf/raid.xml. Please edit the entries in this file to list the |
| path(s) that you want to raid. Then, edit the hdfs-site.xml file for |
| your installation to include a reference to this raid.xml. You can add the |
| following to your hdfs-site.xml |
| <property> |
| <name>raid.config.file</name> |
| <value>/mnt/hdfs/DFS/conf/raid.xml</value> |
| <description>This is needed by the RaidNode </description> |
| </property> |
| |
| Please add an entry to your hdfs-site.xml to enable hdfs clients to use the |
| parity bits to recover corrupted data. |
| |
| <property> |
| <name>fs.hdfs.impl</name> |
| <value>org.apache.hadoop.dfs.DistributedRaidFileSystem</value> |
| <description>The FileSystem for hdfs: uris.</description> |
| </property> |
| |
| |
| -------------------------------------------------------------------------------- |
| |
| OPTIONAL CONFIGIURATION: |
| |
| The following properties can be set in hdfs-site.xml to further tune you configuration: |
| |
| Specifies the location where parity files are located. |
| <property> |
| <name>hdfs.raid.locations</name> |
| <value>hdfs://newdfs.data:8000/raid</value> |
| <description>The location for parity files. If this is |
| is not defined, then defaults to /raid. |
| </descrition> |
| </property> |
| |
| Specify the parity stripe length |
| <property> |
| <name>hdfs.raid.stripeLength</name> |
| <value>10</value> |
| <description>The number of blocks in a file to be combined into |
| a single raid parity block. The default value is 5. The lower |
| the number the greater is the disk space you will save when you |
| enable raid. |
| </description> |
| </property> |
| |
| Specify the size of HAR part-files |
| <property> |
| <name>raid.har.partfile.size</name> |
| <value>4294967296</value> |
| <description>The size of HAR part files that store raid parity |
| files. The default is 4GB. The higher the number the fewer the |
| number of files used to store the HAR archive. |
| </description> |
| </property> |
| |
| Specify which implementation of RaidNode to use. |
| <property> |
| <name>raid.classname</name> |
| <value>org.apache.hadoop.raid.DistRaidNode</value> |
| <description>Specify which implementation of RaidNode to use |
| (class name). |
| </description> |
| </property> |
| |
| |
| Specify the periodicy at which the RaidNode re-calculates (if necessary) |
| the parity blocks |
| <property> |
| <name>raid.policy.rescan.interval</name> |
| <value>5000</value> |
| <description>Specify the periodicity in milliseconds after which |
| all source paths are rescanned and parity blocks recomputed if |
| necessary. By default, this value is 1 hour. |
| </description> |
| </property> |
| |
| By default, the DistributedRaidFileSystem assumes that the underlying file |
| system is the DistributedFileSystem. If you want to layer the DistributedRaidFileSystem |
| over some other file system, then define a property named fs.raid.underlyingfs.impl |
| that specifies the name of the underlying class. For example, if you want to layer |
| The DistributedRaidFileSystem over an instance of the NewFileSystem, then |
| <property> |
| <name>fs.raid.underlyingfs.impl</name> |
| <value>org.apche.hadoop.new.NewFileSystem</value> |
| <description>Specify the filesystem that is layered immediately below the |
| DistributedRaidFileSystem. By default, this value is DistributedFileSystem. |
| </description> |
| |
| |
| -------------------------------------------------------------------------------- |
| |
| ADMINISTRATION: |
| |
| The Distributed Raid File System provides support for administration at runtime without |
| any downtime to cluster services. It is possible to add/delete new paths to be raided without |
| interrupting any load on the cluster. If you change raid.xml, its contents will be |
| reload within seconds and the new contents will take effect immediately. |
| |
| Designate one machine in your cluster to run the RaidNode software. You can run this daemon |
| on any machine irrespective of whether that machine is running any other hadoop daemon or not. |
| You can start the RaidNode by running the following on the selected machine: |
| nohup $HADOOP_HOME/bin/hadoop org.apache.hadoop.raid.RaidNode >> /xxx/logs/hadoop-root-raidnode-hadoop.xxx.com.log & |
| |
| Optionally, we provide two scripts to start and stop the RaidNode. Copy the scripts |
| start-raidnode.sh and stop-raidnode.sh to the directory $HADOOP_HOME/bin in the machine |
| you would like to deploy the daemon. You can start or stop the RaidNode by directly |
| callying the scripts from that machine. If you want to deploy the RaidNode remotely, |
| copy start-raidnode-remote.sh and stop-raidnode-remote.sh to $HADOOP_HOME/bin at |
| the machine from which you want to trigger the remote deployment and create a text |
| file $HADOOP_HOME/conf/raidnode at the same machine containing the name of the server |
| where the RaidNode should run. These scripts run ssh to the specified machine and |
| invoke start/stop-raidnode.sh there. As an example, you might want to change |
| start-mapred.sh in the JobTracker machine so that it automatically calls |
| start-raidnode-remote.sh (and do the equivalent thing for stop-mapred.sh and |
| stop-raidnode-remote.sh). |
| |
| To validate the integrity of a file system, run RaidFSCK as follows: |
| $HADOOP_HOME/bin/hadoop org.apache.hadoop.raid.RaidShell -fsck [path] |
| |
| This will print a list of corrupt files (i.e., files which have lost too many |
| blocks and can no longer be fixed by Raid). |
| |
| |
| -------------------------------------------------------------------------------- |
| |
| IMPLEMENTATION: |
| |
| The RaidNode periodically scans all the specified paths in the configuration |
| file. For each path, it recursively scans all files that have more than 2 blocks |
| and that has not been modified during the last few hours (default is 24 hours). |
| It picks the specified number of blocks (as specified by the stripe size), |
| from the file, generates a parity block by combining them and |
| stores the results as another HDFS file in the specified destination |
| directory. There is a one-to-one mapping between a HDFS |
| file and its parity file. The RaidNode also periodically finds parity files |
| that are orphaned and deletes them. |
| |
| The Distributed Raid FileSystem is layered over a DistributedFileSystem |
| instance intercepts all calls that go into HDFS. HDFS throws a ChecksumException |
| or a BlocMissingException when a file read encounters bad data. The layered |
| Distributed Raid FileSystem catches these exceptions, locates the corresponding |
| parity file, extract the original data from the parity files and feeds the |
| extracted data back to the application in a completely transparent way. |
| |
| The layered Distributed Raid FileSystem does not fix the data-loss that it |
| encounters while serving data. It merely make the application transparently |
| use the parity blocks to re-create the original data. A command line tool |
| "fsckraid" is currently under development that will fix the corrupted files |
| by extracting the data from the associated parity files. An adminstrator |
| can run "fsckraid" manually as and when needed. |