
 ~~ Licensed under the Apache License, Version 2.0 (the "License");
 ~~ you may not use this file except in compliance with the License.
 ~~ You may obtain a copy of the License at
 ~~
 ~~   http://www.apache.org/licenses/LICENSE-2.0
 ~~
 ~~ Unless required by applicable law or agreed to in writing, software
 ~~ distributed under the License is distributed on an "AS IS" BASIS,
 ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 ~~ See the License for the specific language governing permissions and
 ~~ limitations under the License. See accompanying LICENSE file.

   ---
   Fault Injection Framework and Development Guide
   ---
   ---
   ${maven.build.timestamp}

 Fault Injection Framework and Development Guide

 %{toc|section=1|fromDepth=0}

 * Introduction

    This guide provides an overview of the Hadoop Fault Injection (FI)
    framework for those who will be developing their own faults (aspects).

    The idea of fault injection is fairly simple: errors and exceptions
    are infused into an application's logic to achieve higher coverage
    and fault tolerance of the system. Different implementations of this
    idea are available today. Hadoop's FI framework is built on top of
    Aspect-Oriented Programming (AOP) as implemented by the AspectJ
    toolkit.

 * Assumptions

    The current implementation of the FI framework assumes that the faults
    it emulates are non-deterministic. That is, the moment at which a
    fault happens isn't known in advance and is decided by a coin flip.

 * Architecture of the Fault Injection Framework

    Components layout

 ** Configuration Management

    This piece of the FI framework allows you to set expectations for
    faults to happen. The settings can be applied either statically (in
    advance) or at runtime. The desired level of faults in the framework
    can be configured in two ways:

      * editing the src/aop/fi-site.xml configuration file. This file is
        similar to Hadoop's other configuration files

      * setting JVM system properties through VM startup parameters or in
        the build.properties file

 ** Probability Model

    This is fundamentally a coin flipper. The methods of this class get a
    random number between 0.0 and 1.0 and then check whether that number
    falls in the range between 0.0 and the level configured for the fault
    in question. If it does, the fault occurs.

    Thus, to guarantee that a fault happens, set its level to 1.0. To
    completely prevent a fault from happening, set its probability level
    to 0.0.

    Note: The default probability level is 0 (zero) unless the level is
    changed explicitly through the configuration file or at runtime. The
    name of the default level's configuration parameter is fi.*
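
    The coin flip described above can be sketched in plain Java as
    follows. This is a minimal illustration, not the source of the
    framework's ProbabilityModel class: the class name CoinFlipSketch and
    its methods are hypothetical, and the fallback order (the fault's own
    fi.<name> system property, then fi.*, then zero) is an assumption
    based on the description in this section.

```java
import java.util.Random;

/** Hypothetical sketch of a coin-flip check; not Hadoop's ProbabilityModel. */
class CoinFlipSketch {
    // Prefix and default-level key as named in this guide.
    static final String PREFIX = "fi.";
    static final String DEFAULT_KEY = "fi.*";
    private static final Random RANDOM = new Random();

    /**
     * Returns the level for a fault: its own system property if set,
     * otherwise the fi.* default, otherwise 0. Unparsable values are
     * treated as 0 and the result is clamped to the range 0.0 to 1.0.
     */
    static float levelFor(String faultName) {
        String value = System.getProperty(PREFIX + faultName,
                System.getProperty(DEFAULT_KEY, "0.0"));
        float level;
        try {
            level = Float.parseFloat(value);
        } catch (NumberFormatException e) {
            return 0f;
        }
        return Math.max(0f, Math.min(1f, level));
    }

    /**
     * Flips the coin. Random.nextFloat() is in [0.0, 1.0), so a level of
     * 1.0 always injects and a level of 0.0 never does.
     */
    static boolean shouldInject(String faultName) {
        return RANDOM.nextFloat() < levelFor(faultName);
    }

    public static void main(String[] args) {
        System.setProperty("fi.demo.fault", "1.0");
        System.out.println(shouldInject("demo.fault"));
    }
}
```

    With a level strictly between 0.0 and 1.0 the outcome of each call
    varies, which is why tests that rely on a fault actually firing
    usually set the level to 1.0.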

 ** Fault Injection Mechanism: AOP and AspectJ

    The foundation of Hadoop's FI framework is the notion of a
    cross-cutting concern as implemented by AspectJ. The following basic
    terms are important to remember:

      * A cross-cutting concern is behavior, and often data, that is used
        across the scope of a piece of software

      * In AOP, aspects provide a mechanism by which a cross-cutting
        concern can be specified in a modular way

      * Advice is the code that is executed when an aspect is invoked

      * A join point is a specific point within the application that may
        or may not invoke some advice; a pointcut is an expression that
        selects join points

 ** Existing Join Points

    The following readily available join points are provided by AspectJ:

      * Join when a method is called

      * Join during a method's execution

      * Join when a constructor is invoked

      * Join during a constructor's execution

      * Join during aspect advice execution

      * Join before an object is initialized

      * Join during object initialization

      * Join during static initializer execution

      * Join when a class's field is referenced

      * Join when a class's field is assigned

      * Join when a handler is executed

 * Aspect Example

 ----
     package org.apache.hadoop.hdfs.server.datanode;

     import org.apache.commons.logging.Log;
     import org.apache.commons.logging.LogFactory;
     import org.apache.hadoop.fi.ProbabilityModel;
     import org.apache.hadoop.hdfs.server.datanode.DataNode;
     import org.apache.hadoop.util.DiskChecker.*;

     import java.io.IOException;
     import java.io.OutputStream;
     import java.io.DataOutputStream;

     /**
      * This aspect takes care of faults injected into the
      * datanode.BlockReceiver class.
      */
     public aspect BlockReceiverAspects {
       public static final Log LOG = LogFactory.getLog(BlockReceiverAspects.class);

       public static final String BLOCK_RECEIVER_FAULT = "hdfs.datanode.BlockReceiver";

       pointcut callReceivePacket() : call (* OutputStream.write(..))
         && withincode (* BlockReceiver.receivePacket(..))
         // to further limit the application of this aspect, a very narrow
         // 'target' can be used as follows:
         // && target(DataOutputStream)
         && !within(BlockReceiverAspects +);

       before () throws IOException : callReceivePacket () {
         if (ProbabilityModel.injectCriteria(BLOCK_RECEIVER_FAULT)) {
           LOG.info("Before the injection point");
           Thread.dumpStack();
           throw new DiskOutOfSpaceException ("FI: injected fault point at " +
           thisJoinPoint.getStaticPart( ).getSourceLocation());
         }
       }
     }
 ----

    The aspect has two main parts:

      * The pointcut callReceivePacket(), which serves as an
        identification mark of a specific point (in control and/or data
        flow) in the life of an application.

      * The advice - before () throws IOException :
        callReceivePacket() - which will be injected (see Putting It All
        Together) before that specific spot of the application's code.

    The pointcut identifies an invocation of the write() method of class
    java.io.OutputStream with any number of parameters and any return
    type. This invocation must take place within the body of the method
    receivePacket() of class BlockReceiver. That method can have any
    parameters and any return type. Invocations of write() happening
    anywhere within the aspect BlockReceiverAspects or its subtypes are
    ignored.

    Note 1: This short example doesn't illustrate the fact that you can
    have more than a single injection point per class. In such a case the
    names of the faults have to be different if a developer wants to
    trigger them separately.

    Note 2: After the injection step (see Putting It All Together) you can
    verify that the faults were properly injected by searching for ajc
    keywords in a disassembled class file.

 * Fault Naming Convention and Namespaces

    For the sake of a unified naming convention the following two types of
    names are recommended when developing new aspects:

      * Activity-specific notation (when we don't care about the
        particular location where a fault happens). In this case the name
        of the fault is rather abstract: fi.hdfs.DiskError

      * Location-specific notation. Here, the fault's name is mnemonic as
        in: fi.hdfs.datanode.BlockReceiver[optional location details]
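
    Note that the names above carry the fi. prefix, while the constant in
    the aspect example (hdfs.datanode.BlockReceiver) does not. The small
    sketch below shows one way to keep the two forms consistent. The
    class FaultNameSketch and its methods are hypothetical helpers
    written for this guide, not part of the framework.

```java
/** Hypothetical helper for converting between fault names and property keys. */
class FaultNameSketch {
    // Prefix used by the property names shown in this guide.
    static final String PREFIX = "fi.";

    /** Returns the property key for a fault name, adding "fi." if it is missing. */
    static String propertyKey(String faultName) {
        if (faultName == null || faultName.isEmpty()) {
            throw new IllegalArgumentException("fault name must not be empty");
        }
        return faultName.startsWith(PREFIX) ? faultName : PREFIX + faultName;
    }

    /** Returns the bare fault name, dropping a leading "fi." if present. */
    static String faultName(String propertyKey) {
        return propertyKey.startsWith(PREFIX)
                ? propertyKey.substring(PREFIX.length())
                : propertyKey;
    }

    public static void main(String[] args) {
        System.out.println(propertyKey("hdfs.datanode.BlockReceiver"));
        System.out.println(faultName("fi.hdfs.DiskError"));
    }
}
```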

 * Development Tools

      * The Eclipse AspectJ Development Toolkit may help you when
        developing aspects

      * IntelliJ IDEA provides AspectJ weaver and Spring-AOP plugins

 * Putting It All Together

    Faults (aspects) have to be injected (or woven) together before they
    can be used. Follow these instructions:

      * To weave aspects in place use:

 ----
     % ant injectfaults
 ----

      * If you misidentified the join point of your aspect you will see a
        warning (similar to the one shown here) when the 'injectfaults'
        target completes:

 ----
     [iajc] warning at
     src/test/aop/org/apache/hadoop/hdfs/server/datanode/ \
               BlockReceiverAspects.aj:44::0
     advice defined in org.apache.hadoop.hdfs.server.datanode.BlockReceiverAspects
     has not been applied [Xlint:adviceDidNotMatch]
 ----

        This isn't an error, so the build will still report success.

      * To prepare the dev.jar file with all your faults woven in place
        (HDFS-475 pending) use:

 ----
     % ant jar-fault-inject
 ----

      * To create test jars use:

 ----
     % ant jar-test-fault-inject
 ----

      * To run HDFS tests with faults injected use:

 ----
     % ant run-test-hdfs-fault-inject
 ----

 ** How to Use the Fault Injection Framework

    Faults can be triggered as follows:

      * At runtime, from the command line:

 ----
     % ant run-test-hdfs -Dfi.hdfs.datanode.BlockReceiver=0.12
 ----

        To set a certain level, for example 25%, of all injected faults
        use:

 ----
     % ant run-test-hdfs-fault-inject -Dfi.*=0.25
 ----

      * From a program:

 ----
     package org.apache.hadoop.fs;

     import org.junit.After;
     import org.junit.Before;
     import org.junit.Test;

     public class DemoFiTest {
       public static final String BLOCK_RECEIVER_FAULT = "hdfs.datanode.BlockReceiver";

       @Before
       public void setUp() {
         // Set up the test's environment as required
       }

       @Test
       public void testFI() {
         // Trigger the fault, assuming that there's one called 'hdfs.datanode.BlockReceiver'
         System.setProperty("fi." + BLOCK_RECEIVER_FAULT, "0.12");
         //
         // The main logic of your tests goes here
         //
         // Now set the level back to 0 (zero) to prevent this fault from happening again
         System.setProperty("fi." + BLOCK_RECEIVER_FAULT, "0.0");
         // or delete its trigger completely
         System.getProperties().remove("fi." + BLOCK_RECEIVER_FAULT);
       }

       @After
       public void tearDown() {
         // Clean up the test environment
       }
     }
 ----

    As you can see above, these two methods do the same thing: both set
    the probability level of <<<hdfs.datanode.BlockReceiver>>> to 12%. The
    difference, however, is that the programmatic approach is more
    flexible and allows you to turn a fault off when a test no longer
    needs it.
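
    One way to make sure a fault is always turned off again, even when
    the test body throws, is to wrap the set and restore steps in a
    try/finally block. The sketch below is only an illustration: the
    class ScopedFaultSketch and its method are hypothetical, and it
    assumes, as the test above implies, that the framework re-reads the
    fi.<name> system property each time a fault is checked.

```java
/** Hypothetical helper that enables a fault only for the duration of a task. */
class ScopedFaultSketch {
    /**
     * Sets the fi.<faultName> system property to the given level, runs the
     * task, and then restores the previous value (or removes the property
     * if it was not set before), even if the task throws.
     */
    static void withFaultLevel(String faultName, String level, Runnable task) {
        String key = "fi." + faultName;
        String previous = System.getProperty(key);
        System.setProperty(key, level);
        try {
            task.run();
        } finally {
            if (previous == null) {
                System.clearProperty(key);
            } else {
                System.setProperty(key, previous);
            }
        }
    }

    public static void main(String[] args) {
        withFaultLevel("hdfs.datanode.BlockReceiver", "0.12", () ->
            System.out.println("level inside: "
                + System.getProperty("fi.hdfs.datanode.BlockReceiver")));
        System.out.println("level after: "
            + System.getProperty("fi.hdfs.datanode.BlockReceiver"));
    }
}
```

    Restoring the previous value rather than always removing the
    property keeps a level passed on the command line with -D intact
    for the tests that run afterwards.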

 * Additional Information and Contacts

    These two sources of information are particularly interesting and worth
    reading:

      * {{http://www.eclipse.org/aspectj/doc/next/devguide/}}

      * AspectJ Cookbook (ISBN-13: 978-0-596-00654-9)

    If you have additional comments or questions for the author, see
    {{{https://issues.apache.org/jira/browse/HDFS-435}HDFS-435}}.