Fault Injection Framework
=========================
A fault is a point of interest in the source code with an associated
action to be taken when that point is hit during execution. Fault
points are defined in the instrumented code using the
SIMPLE_FAULT_INJECTOR() macro. A fault point is identified by a name.
This module provides an interface to inject a predefined fault point
into a running Apache Cloudberry cluster by associating an action
with the fault point. The action can be error, panic, sleep, skip,
infinite_loop, etc.
Basic examples
--------------
    select gp_inject_fault('checkpoint', 'error', dbid) from
    gp_segment_configuration where content=1 and role='p';
The above command causes the next checkpoint to fail with elog(ERROR)
on the segment that is acting as primary and has contentid = 1. Once
the action associated with the fault (error in this case) is taken, it
is not hit again. Checkpoints will finish without error after the
fault is hit once. The 'checkpoint' fault is defined in the
CreateCheckPoint() function in xlog.c.
    select gp_inject_fault('checkpoint', 'status', dbid) from
    gp_segment_configuration where content=1 and role='p';
The above command checks the status of the fault. It reports the
number of times the fault has been hit during execution and whether it
has completed.
    select gp_wait_until_triggered_fault('checkpoint', 1, dbid) from
    gp_segment_configuration where content=1 and role='p';
The above command blocks until the checkpoint fault is triggered
exactly once. If the fault has already been triggered, the command
will not block at all.
    select gp_inject_fault('checkpoint', 'reset', dbid) from
    gp_segment_configuration where content=1 and role='p';
The above command removes the fault, such that no action will be taken
when the fault point is reached during execution. A fault can be set
to trigger more than once. For example:
    select gp_inject_fault_infinite('checkpoint', 'error', dbid) from
    gp_segment_configuration where content=1 and role='p';
This command causes checkpoints to fail until the fault is removed on
the segment 1 primary.
More detailed interface
-----------------------
A more detailed version of the fault injector interface accepts
several more parameters. For example:
    select gp_inject_fault('heap_insert', 'error',
           '' /* DDL */, '' /* database name */,
           'my_table' /* table name */,
           1 /* start occurrence */, 10 /* end occurrence */,
           0 /* */, dbid) from
           gp_segment_configuration where content=1 and role='p';
The above command sets the heap_insert fault such that the inserting
transaction will abort with elog(ERROR) when the code reaches the
fault point, but only if the relation being inserted into is named
'my_table'. Moreover, after the fault has been hit 10 times, it will
stop triggering. The 11th transaction to insert into my_table will
continue the insert as if no fault was injected.
Fault actions
-------------
The fault action is specified as the type parameter of the
gp_inject_fault() interface. The following types are supported:
error
    elog(ERROR)
fatal
    elog(FATAL)
panic
    elog(PANIC)
sleep
    sleep for the specified amount of time
infinite_loop
    loop until a query cancel or terminate signal is received
suspend
    block until the fault is removed, without checking for interrupts
resume
    resume backend processes that are blocked due to a suspend fault
skip
    used to implement custom logic that is not supported by
    predefined actions, e.g.
        if (SIMPLE_FAULT_INJECTOR("fts_probe") == FaultInjectorTypeSkip)
        {
            //custom code
        }
reset
    remove a previously injected fault
segv
    crash the backend process due to SIGSEGV
interrupt
    simulate cancel interrupt arrival, such that the next
    interrupt processing cycle will cancel the query
finish_pending
    similar to interrupt, sets the QueryFinishPending global flag
status
    return a text datum with details of how many times a fault has been
    triggered and the state it is currently in. Fault states are as
    follows:
        "set"       injected, but the fault point has not been reached
                    during execution yet.
        "triggered" the fault point has been reached at least once
                    during execution.
        "completed" the action associated with the fault point will no
                    longer be taken because the fault point has been
                    reached the maximum number of times during
                    execution.
NOTES
-----
* A fault point applies to ALL callers of a function. So special care
  must be used if you are setting fault points in code shared amongst
  many backends, such as in parallel regress tests.
* Be sure to "reset" all fault points at the end of a test that
  associates actions with them.
* Supplementary details on PANIC faults and FTS:
  When a PANIC fault is injected into a segment, FTS promotes the
  mirror to primary. This changes the cluster topology, so the test
  has side effects. To discourage such operations, a warning message
  is printed when a PANIC fault is injected into a segment while FTS
  is enabled.
  The warning is implemented as a callback function that _PG_init
  registers when the gp_fault_inject dynamic library is loaded. On its
  own this is not sufficient: when we connect to the master, a newly
  created QE process does not load the gp_fault_inject dynamic library
  by itself.
  To address this, the gp_fault_inject dynamic library is loaded
  explicitly in HandleFaultMessage, the function that handles fault
  messages on the QE, so that the warning callback is installed. In
  addition, because the warning callback needs to determine whether
  FTS is enabled, the information on whether FTS is enabled on the
  master is synchronized to the corresponding QE process.
* Some background processes ignore all but a few faults. To test
  fault injection in a background process, add an exception in
  checkBgProcessSkipFault().