| Fault Injection Framework |
| ========================= |
| |
| A fault is defined as a point of interest in the source code with an |
| associated action to be taken when that point is hit during execution. |
| Fault points are defined in the instrumented code using the |
| SIMPLE_FAULT_INJECTOR() macro. A fault point is identified by a name. |
| This module provides an interface to inject a predefined fault point |
| into a running Apache Cloudberry cluster by associating an action |
| with the fault point. The action can be error, panic, sleep, skip, |
| infinite_loop, etc. |
| |
| Basic examples |
| -------------- |
| |
| select gp_inject_fault('checkpoint', 'error', dbid) from |
| gp_segment_configuration where content=1 and role='p'; |
| |
| The above command causes the next checkpoint to fail with elog(ERROR) |
| on the segment that is acting as primary and has contentid = 1. Once |
| the action associated with the fault (error in this case) is taken, |
| the fault does not fire again: checkpoints finish without error after |
| the fault is hit once. The 'checkpoint' fault is defined in the |
| CreateCheckPoint() function in xlog.c. |
| |
| select gp_inject_fault('checkpoint', 'status', dbid) from |
| gp_segment_configuration where content=1 and role='p'; |
| |
| The above command checks the status of the fault. It reports the |
| number of times the fault has been hit during execution and whether it |
| has completed. |
| |
| select gp_wait_until_triggered_fault('checkpoint', 1, dbid) from |
| gp_segment_configuration where content=1 and role='p'; |
| |
| The above command blocks until the checkpoint fault has been |
| triggered once (the count given as the second argument). If the fault |
| has already been triggered, the command will not block at all. |
| |
| select gp_inject_fault('checkpoint', 'reset', dbid) from |
| gp_segment_configuration where content=1 and role='p'; |
| |
| The above command removes the fault, such that no action will be taken |
| when the fault point is reached during execution. A fault can be set |
| to trigger more than once. For example: |
| |
| select gp_inject_fault_infinite('checkpoint', 'error', dbid) from |
| gp_segment_configuration where content=1 and role='p'; |
| |
| This command causes checkpoints on the content 1 primary to fail |
| until the fault is removed. |
| |
| More detailed interface |
| ----------------------- |
| |
| A more detailed version of the fault injector interface accepts |
| several more parameters. For example: |
| |
| select gp_inject_fault('heap_insert', 'error', |
| '' /* DDL */, '' /* database name */, |
| 'my_table' /* table name */, |
| 1 /* start occurrence */, 10 /* end occurrence */, |
| 0 /* */, dbid) from |
| gp_segment_configuration where content=1 and role='p'; |
| |
| The above command sets the heap_insert fault such that the inserting |
| transaction will abort with elog(ERROR) when the code reaches the |
| fault point, but only if the relation being inserted into is named |
| 'my_table'. Moreover, after the fault has been hit 10 times, it will |
| stop triggering. The 11th transaction to insert into my_table will |
| continue the insert as if no fault was injected. |
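| |
| The occurrence window and table-name filter described above can be |
| pictured with a small standalone C model. This is only a sketch: |
| ToyFault and toy_fault_hit() are hypothetical names, not part of the |
| fault injector, and three behaviors are assumptions made here: an |
| empty table name matches every relation, hits on other tables are |
| not counted, and hits before the start occurrence are counted but |
| take no action. |
| |

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for an injected fault; not the framework's struct. */
typedef struct ToyFault
{
    const char *table_name;       /* "" means no table filter (assumption) */
    int         start_occurrence; /* first matching hit that takes the action */
    int         end_occurrence;   /* last matching hit that takes the action */
    int         num_hits;         /* matching hits seen so far */
} ToyFault;

/*
 * Record one arrival at the fault point for relation_name.
 * Returns true if the action would be taken on this hit.
 */
static bool
toy_fault_hit(ToyFault *fault, const char *relation_name)
{
    /* Assumption: hits on other tables are ignored and not counted. */
    if (fault->table_name[0] != '\0' &&
        strcmp(fault->table_name, relation_name) != 0)
        return false;

    fault->num_hits++;

    /* Assumption: hits before start_occurrence count but take no action. */
    return fault->num_hits >= fault->start_occurrence &&
           fault->num_hits <= fault->end_occurrence;
}
```

| |
| With a fault initialized as {"my_table", 1, 10, 0}, this model takes |
| the action on the first 10 calls for my_table and returns false on |
| the 11th, which mirrors the example above. |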
| |
| Fault actions |
| ------------- |
| |
| The fault action is specified as the type parameter of the |
| gp_inject_fault() interface. The following types are supported. |
| |
| error |
| elog(ERROR) |
| |
| fatal |
| elog(FATAL) |
| |
| panic |
| elog(PANIC) |
| |
| sleep |
| sleep for specified amount of time |
| |
| infinite_loop |
| loop until query cancel or terminate signal is received |
| |
| suspend |
| block until the fault is removed, without checking for interrupts |
| |
| resume |
| resume backend processes that are blocked due to a suspend fault |
| |
| skip |
| used to implement custom logic that is not supported by |
| predefined actions, e.g. |
| |
| if (SIMPLE_FAULT_INJECTOR("fts_probe") == FaultInjectorTypeSkip) |
| { |
|     /* custom code */ |
| } |
| |
| reset |
| remove a previously injected fault |
| |
| segv |
| crash the backend process due to SIGSEGV |
| |
| interrupt |
| simulate cancel interrupt arrival, such that the next |
| interrupt processing cycle will cancel the query |
| |
| finish_pending |
| similar to interrupt, sets the QueryFinishPending global flag |
| |
| status |
| return a text datum with details of how many times a fault has been |
| triggered and the state it is currently in. Fault states are as follows: |
| |
| "set" injected but the fault point has not been reached during |
| execution yet. |
| |
| "triggered" the fault point has been reached at least once during |
| execution. |
| |
| "completed" the action associated with the fault point will no |
| longer be taken because the fault point has been reached the maximum |
| number of times during execution. |
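| |
| The three states can be summarized as a function of the hit count. |
| The following is a minimal sketch, assuming the state depends only on |
| how many times the fault point has been reached and on the maximum |
| number of times the action may be taken. The Toy* names are |
| hypothetical, and a non-positive maximum stands for "no limit" in |
| this model only. |
| |

```c
/* Hypothetical model of the states listed above; not the framework's code. */
typedef enum ToyFaultState
{
    TOY_FAULT_SET,
    TOY_FAULT_TRIGGERED,
    TOY_FAULT_COMPLETED
} ToyFaultState;

/*
 * Derive the state from the number of times the fault point was reached
 * (num_hits) and the maximum number of times the action may be taken
 * (max_hits). Assumption: max_hits <= 0 means the fault never completes.
 */
static ToyFaultState
toy_fault_state(int num_hits, int max_hits)
{
    if (num_hits <= 0)
        return TOY_FAULT_SET;        /* injected, not reached yet */
    if (max_hits > 0 && num_hits >= max_hits)
        return TOY_FAULT_COMPLETED;  /* reached the maximum number of times */
    return TOY_FAULT_TRIGGERED;      /* reached at least once */
}

/* Map a state to the name used in the list above. */
static const char *
toy_fault_state_name(ToyFaultState state)
{
    switch (state)
    {
        case TOY_FAULT_SET:
            return "set";
        case TOY_FAULT_TRIGGERED:
            return "triggered";
        case TOY_FAULT_COMPLETED:
            return "completed";
    }
    return "unknown";
}
```

| |
| In this model, a fault that may take its action only once goes from |
| "set" straight to "completed" on the first hit, while a fault with a |
| larger maximum stays in "triggered" until the maximum is reached. |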
| |
| NOTES |
| ----- |
| |
| * A fault point applies to ALL callers of a function. Take special |
| care when setting fault points in code shared among many backends, |
| for example in parallel regress tests. |
| |
| * Be sure to "reset" all fault points at the end of a test that |
| associates actions with them. |
| |
| * Supplementary details |
| ========================================================== |
| |
| When a PANIC fault is injected into a Segment node, FTS promotes the |
| Mirror node to Primary. This changes the cluster topology and gives |
| tests side effects. To guard against this, when a PANIC fault is |
| injected into a Segment node while FTS is enabled, we need to print a |
| warning message that prompts the user to avoid such operations. |
| |
| The current implementation registers the callback function from the |
| _PG_init function, which is called when the dynamic library is |
| loaded. This is problematic: when we connect to the Master, a newly |
| created QE process does not load the gp_fault_inject dynamic library |
| on its own. |
| |
| To work around this, we explicitly load the gp_fault_inject dynamic |
| library in HandleFaultMessage, the processing function on the QE |
| node, so that our pre-written warning callback function is |
| registered. In addition, since the warning callback function needs to |
| determine whether FTS is enabled for the Segment node, we need to |
| synchronize the information on whether FTS is enabled on the Master |
| to the corresponding QE process. |
| ================================ |
| * Some background processes ignore all but a few faults. To test |
| fault injection in background processes, add an exception in |
| checkBgProcessSkipFault(). |