gpcontrib/gp_inject_fault/README - cloudberry - Git at Google

 Fault Injection Framework
 =========================

 Fault is defined as a point of interest in the source code with an
 associated action to be taken when that point is hit during execution.
 Fault points are defined in the instrumented code using
 SIMPLE_FAULT_INJECTOR() macro.  A fault point is identifed by a name.
 This module provides an interface to inject a predefined fault point
 into a running Apache Cloudberry cluster by associating an action
 with the fault point.  Action can be error, panic, sleep, skip,
 infinite_loop, etc.

 Basic examples
 --------------

    select gp_inject_fault('checkpoint', 'error', dbid) from
    gp_segment_configuration where content=1 and role='p';

 The above command causes the next checkpoint to fail with elog(ERROR)
 on the segment that is acting as primary and has contentid = 1.  Once
 the action associated with the fault (error in this case) is taken, it
 is not hit again.  Checkpoints will finish without error after the
 fault is hit once.  The 'checkpoint' fault is defined in
 CreateCheckPoint() function in xlog.c.

    select gp_inject_fault('checkpoint', 'status', dbid) from
    gp_segment_configuration where content=1 and role='p';

 The above command checks the status of the fault.  It reports the
 number of times the fault has been hit during execution and whether it
 has completed.

    select gp_wait_until_triggered_fault('checkpoint', 1, dbid) from
    gp_segment_configuration where content=1 and role='p';

 The above command blocks until the checkpoint fault is triggered
 exactly once.  If the fault has already been triggered, the command
 will not block at all.

    select gp_inject_fault('checkpoint', 'reset', dbid) from
    gp_segment_configuration where content=1 and role='p';

 The above command removes the fault, such that no action will be taken
 when the fault point is reached during execution.  A fault can be set
 to trigger more than once.  For example:

    select gp_inject_fault_infinite('checkpoint', 'error', dbid) from
    gp_segment_configuration where content=1 and role='p';

 This command causes checkpoints to fail until the fault is removed on
 the segment 1 primary.

 More detailed interface
 -----------------------

 A more detailed version of the fault injector interface accepts
 several more parameters.  For example:

    select gp_inject_fault('heap_insert', 'error',
           '' /* DDL */, '' /* database name */,
           'my_table' /* table name */,
           1 /* start occurrence */, 10 /* end occurrence */,
           0 /* */, dbid) from
    gp_segment_configuration where content=1 and role='p';

 The above command sets heap_insert fault such that the inserting
 transaction will abort with elog(ERROR) when the code reaches the
 fault point, only if the relation being inserted to has the name
 'my_table'.  Moreover, after the fault has been hit 10 times, it will
 stop triggering.  The 11th transaction to insert into my_table will
 continue the insert as if no fault was injected.

 Fault actions
 -------------

 Fault action is specified as the type parameter in gp_inject_fault()
 interface.  The following types are supported.

 error
    elog(ERROR)

 fatal
    elog(FATAL)

 panic
    elog(PANIC)

 sleep
    sleep for specified amount of time

 infinite_loop
    loop until query cancel or terminate signal is received

 suspend
    block until the fault is removed, without checking for interrupts

 resume
    resume backend processes that are blocked due to a suspend fault

 skip
    used to implement custom logic that is not supported by
    predefined actions, e.g.

    if (SIMPLE_FAULT_INJECTOR("fts_probe") == FaultInjectorTypeSkip)
    {
        //custom code
    }

 reset
    remove a previously injected fault

 segv
    crash the backend process due to SIGSEGV

 interrupt
    simulate cancel interrupt arrival, such that the next
    interrupt processing cycle will cancel the query

 finish_pending
    similar to interrupt, sets the QueryFinishPending global flag

 status
    return a text datum with details of how many times a fault has been
    triggered, the state it is currently in.  Fault states are as follows:

       "set" injected but the fault point has not been reached during
       execution yet.

       "triggered" the fault point has been reached at least once during
       execution.

       "completed" the action associated with the fault point will no
       longer be taken because the fault point has been reached maximum
       number of times during execution.

 NOTES
 -----

 * A fault point applies to ALL callers of a function.  So special care
 must be used if you are setting fault points in code shared amongst
 many backends, such as in parallel regress tests.

 * Be sure to "reset" all fault points at the end of a test that
 associates actions with them.

 * Supplementary details
 ==========================================================

 When we injected PANIC fault for Segment node, FTS will promote Mirror
 node to Primary node so that causes a change in the cluster topology
 and tests have side effects. To prevent this kind of situation from
 happening, when injecting PANIC faults into Segment nodes. If the FTS
 function is enabled, we need to print a warning message to prompt the
 user to prevent such operations.

 The current implementation is achieved by calling the _PG_init
 function to register the callback function when the dynamic library is
 loaded. But this implementation is problematic. This is because we
 connect the Master, the newly generated QE node will not actively load
 the gp_fault_inject dynamic library into the QE process.

 To achieve this, we will explicitly load the gp_fault_inject dynamic
 library in the QE node processing function HandleFaultMessage to
 inject our pre-written warning callback function. In addition, since
 the warning callback function needs to determine whether the FTS
 process is enabled on the Segment node, we need to synchronize the
 information on whether the FTS on the Master is enabled to the
 corresponding QE process.
 ================================
 * We let some background process ignore all but a few faults. If one wants
 to test fault injection in background processese, add the exception in
 checkBgProcessSkipFault().
	Fault Injection Framework
	=========================

	Fault is defined as a point of interest in the source code with an
	associated action to be taken when that point is hit during execution.
	Fault points are defined in the instrumented code using
	SIMPLE_FAULT_INJECTOR() macro. A fault point is identifed by a name.
	This module provides an interface to inject a predefined fault point
	into a running Apache Cloudberry cluster by associating an action
	with the fault point. Action can be error, panic, sleep, skip,
	infinite_loop, etc.

	Basic examples
	--------------

	select gp_inject_fault('checkpoint', 'error', dbid) from
	gp_segment_configuration where content=1 and role='p';

	The above command causes the next checkpoint to fail with elog(ERROR)
	on the segment that is acting as primary and has contentid = 1. Once
	the action associated with the fault (error in this case) is taken, it
	is not hit again. Checkpoints will finish without error after the
	fault is hit once. The 'checkpoint' fault is defined in
	CreateCheckPoint() function in xlog.c.

	select gp_inject_fault('checkpoint', 'status', dbid) from
	gp_segment_configuration where content=1 and role='p';

	The above command checks the status of the fault. It reports the
	number of times the fault has been hit during execution and whether it
	has completed.

	select gp_wait_until_triggered_fault('checkpoint', 1, dbid) from
	gp_segment_configuration where content=1 and role='p';

	The above command blocks until the checkpoint fault is triggered
	exactly once. If the fault has already been triggered, the command
	will not block at all.

	select gp_inject_fault('checkpoint', 'reset', dbid) from
	gp_segment_configuration where content=1 and role='p';

	The above command removes the fault, such that no action will be taken
	when the fault point is reached during execution. A fault can be set
	to trigger more than once. For example:

	select gp_inject_fault_infinite('checkpoint', 'error', dbid) from
	gp_segment_configuration where content=1 and role='p';

	This command causes checkpoints to fail until the fault is removed on
	the segment 1 primary.

	More detailed interface
	-----------------------

	A more detailed version of the fault injector interface accepts
	several more parameters. For example:

	select gp_inject_fault('heap_insert', 'error',
	'' /* DDL /, '' / database name */,
	'my_table' /* table name */,
	1 /* start occurrence /, 10 / end occurrence */,
	0 /* */, dbid) from
	gp_segment_configuration where content=1 and role='p';

	The above command sets heap_insert fault such that the inserting
	transaction will abort with elog(ERROR) when the code reaches the
	fault point, only if the relation being inserted to has the name
	'my_table'. Moreover, after the fault has been hit 10 times, it will
	stop triggering. The 11th transaction to insert into my_table will
	continue the insert as if no fault was injected.

	Fault actions
	-------------

	Fault action is specified as the type parameter in gp_inject_fault()
	interface. The following types are supported.

	error
	elog(ERROR)

	fatal
	elog(FATAL)

	panic
	elog(PANIC)

	sleep
	sleep for specified amount of time

	infinite_loop
	loop until query cancel or terminate signal is received

	suspend
	block until the fault is removed, without checking for interrupts

	resume
	resume backend processes that are blocked due to a suspend fault

	skip
	used to implement custom logic that is not supported by
	predefined actions, e.g.

	if (SIMPLE_FAULT_INJECTOR("fts_probe") == FaultInjectorTypeSkip)
	{
	//custom code
	}

	reset
	remove a previously injected fault

	segv
	crash the backend process due to SIGSEGV

	interrupt
	simulate cancel interrupt arrival, such that the next
	interrupt processing cycle will cancel the query

	finish_pending
	similar to interrupt, sets the QueryFinishPending global flag

	status
	return a text datum with details of how many times a fault has been
	triggered, the state it is currently in. Fault states are as follows:

	"set" injected but the fault point has not been reached during
	execution yet.

	"triggered" the fault point has been reached at least once during
	execution.

	"completed" the action associated with the fault point will no
	longer be taken because the fault point has been reached maximum
	number of times during execution.

	NOTES
	-----

	* A fault point applies to ALL callers of a function. So special care
	must be used if you are setting fault points in code shared amongst
	many backends, such as in parallel regress tests.

	* Be sure to "reset" all fault points at the end of a test that
	associates actions with them.

	* Supplementary details
	==========================================================

	When we injected PANIC fault for Segment node, FTS will promote Mirror
	node to Primary node so that causes a change in the cluster topology
	and tests have side effects. To prevent this kind of situation from
	happening, when injecting PANIC faults into Segment nodes. If the FTS
	function is enabled, we need to print a warning message to prompt the
	user to prevent such operations.

	The current implementation is achieved by calling the _PG_init
	function to register the callback function when the dynamic library is
	loaded. But this implementation is problematic. This is because we
	connect the Master, the newly generated QE node will not actively load
	the gp_fault_inject dynamic library into the QE process.

	To achieve this, we will explicitly load the gp_fault_inject dynamic
	library in the QE node processing function HandleFaultMessage to
	inject our pre-written warning callback function. In addition, since
	the warning callback function needs to determine whether the FTS
	process is enabled on the Segment node, we need to synchronize the
	information on whether the FTS on the Master is enabled to the
	corresponding QE process.
	================================
	* We let some background process ignore all but a few faults. If one wants
	to test fault injection in background processese, add the exception in
	checkBgProcessSkipFault().