Fault Tolerance Service (FTS)
=============================

This document illustrates the mechanism of a GPDB component called the
Fault Tolerance Service (FTS), which also works for Apache Cloudberry:

1. This section explains how the FTS probe process is started. The FTS
   probe process runs on the coordinator node only. It starts as a
   background worker process managed by the BackgroundWorker structure
   (see src/include/postmaster/bgworker.h). Greenplum sets up a group of
   GP background processes through an array structure, PMAuxProcList. An
   entry in that array represents a GP background process.

   Two function pointers are important members of the
   BackgroundWorker structure. One points to the main entry function of
   the GP background process. The other points to the function that
   determines whether the process should be started or not. For FTS,
   these two functions are FtsProbeMain() and FtsProbeStartRule(),
   respectively. This is hard-coded in postmaster.c:

       static BackgroundWorker PMAuxProcList[MaxPMAuxProc]

   The FtsProbeStartRule() function specifies the following condition
   under which the FTS probe process should be started by postmaster:

       Gp_role == GP_ROLE_DISPATCH

   That is, the FTS probe process should only be started on the
   coordinator node in normal cluster mode.

   In the initialization phase, we register one BackgroundWorker entry
   for each GP background process into postmaster's private structure
   BackgroundWorkerList. When we do this, the above condition is
   checked to decide whether FTS should be registered there or not. The
   reader may want to check load_auxiliary_libraries() for more
   detail.

   Later, the postmaster tries to start the processes that have been
   registered in the BackgroundWorkerList, which includes the FTS
   probe process. If the first attempt to start a particular process
   fails, or a process goes down for some reason and needs to be
   brought up again, the postmaster restarts it in its main loop. On
   every iteration, it checks the status of these processes and acts
   accordingly. A simplified sketch of this arrangement follows.
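
   Below is a minimal, self-contained sketch of the idea, not the
   actual Greenplum types: the "Sketch" names, the simplified struct
   layout, and the registration loop are all hypothetical stand-ins for
   PMAuxProcList, BackgroundWorker, FtsProbeMain(), and
   FtsProbeStartRule().

       #include <stdbool.h>
       #include <stddef.h>
       #include <stdio.h>

       /* Stand-in for Gp_role and its values. */
       typedef enum { GP_ROLE_DISPATCH, GP_ROLE_EXECUTE, GP_ROLE_UTILITY } GpRoleSketch;

       /* Stand-in for the two key members of BackgroundWorker. */
       typedef struct BackgroundWorkerSketch
       {
           const char *bgw_name;
           void      (*bgw_main)(void);                /* main entry function */
           bool      (*bgw_start_rule)(GpRoleSketch);  /* should it start?    */
       } BackgroundWorkerSketch;

       static void
       FtsProbeMainSketch(void)
       {
           /* The real FtsProbeMain() runs the probe loop forever. */
           printf("ftsprobe process running\n");
       }

       static bool
       FtsProbeStartRuleSketch(GpRoleSketch role)
       {
           /* FTS runs only on the coordinator (dispatch) node. */
           return role == GP_ROLE_DISPATCH;
       }

       /* Stand-in for PMAuxProcList: one entry per GP background process. */
       static BackgroundWorkerSketch PMAuxProcListSketch[] =
       {
           { "ftsprobe process", FtsProbeMainSketch, FtsProbeStartRuleSketch },
       };

       int
       main(void)
       {
           GpRoleSketch role = GP_ROLE_DISPATCH;
           size_t       i;

           /* Registration-style loop: consult each start rule before
            * launching the worker, as the postmaster does. */
           for (i = 0; i < sizeof(PMAuxProcListSketch) / sizeof(PMAuxProcListSketch[0]); i++)
               if (PMAuxProcListSketch[i].bgw_start_rule(role))
                   PMAuxProcListSketch[i].bgw_main();
           return 0;
       }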

2. This section explains how FTS probes are initiated. Probes are
   either triggered at a regular interval (tunable via the
   gp_fts_probe_interval GUC) or triggered on the fly when required by
   certain internal components, tests, or the user via the FTS probe
   triggering function.

   The FTS probe process runs in an infinite loop, doing a round of
   polling at each iteration to get the health status of all
   segments. At each iteration, it waits on a latch with a timeout to
   block itself for a while. Thus, two types of events can trigger
   the polling: one is a timeout on the latch it is waiting on, and the
   other is that someone sets the latch.

   Certain components running on the coordinator node may interrupt FTS
   from its wait to trigger a probe immediately. This is referred to as
   notifying FTS. The dispatcher is one such component. As an example,
   it can notify FTS if it encounters an error while creating a gang.
   The reader may check FtsNotifyProber() to find more cases. A small
   model of this wait-or-notify pattern follows.
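
   Below is a self-contained model of that "timeout or notification"
   wait, written with plain pthreads rather than the real PostgreSQL
   latch API (WaitLatch/SetLatch/ResetLatch); the function names and
   the 5-second interval are illustrative only.

       #include <pthread.h>
       #include <stdbool.h>
       #include <stdio.h>
       #include <time.h>
       #include <unistd.h>

       static pthread_mutex_t latch_mu  = PTHREAD_MUTEX_INITIALIZER;
       static pthread_cond_t  latch_cv  = PTHREAD_COND_INITIALIZER;
       static bool            latch_set = false;

       /* Model of SetLatch(): wake the prober immediately. */
       static void
       notify_prober(void)
       {
           pthread_mutex_lock(&latch_mu);
           latch_set = true;
           pthread_cond_signal(&latch_cv);
           pthread_mutex_unlock(&latch_mu);
       }

       /* Model of WaitLatch() with a timeout: returns true if woken by
        * a notification, false if the timeout fired. */
       static bool
       wait_latch(int timeout_sec)
       {
           struct timespec deadline;
           bool            woken;

           clock_gettime(CLOCK_REALTIME, &deadline);
           deadline.tv_sec += timeout_sec;

           pthread_mutex_lock(&latch_mu);
           while (!latch_set)
               if (pthread_cond_timedwait(&latch_cv, &latch_mu, &deadline) != 0)
                   break;                  /* timed out */
           woken = latch_set;
           latch_set = false;              /* model of ResetLatch() */
           pthread_mutex_unlock(&latch_mu);
           return woken;
       }

       /* A component (e.g. the dispatcher) notifying FTS after a delay. */
       static void *
       dispatcher(void *arg)
       {
           (void) arg;
           sleep(1);
           notify_prober();
           return NULL;
       }

       int
       main(void)
       {
           pthread_t t;
           int       i;

           pthread_create(&t, NULL, dispatcher, NULL);
           for (i = 0; i < 2; i++)
           {
               bool notified = wait_latch(5);   /* probe interval model */
               printf("probe round %d (%s)\n", i,
                      notified ? "notified" : "timeout");
               /* ... a real round would poll all segments here ... */
           }
           pthread_join(t, NULL);
           return 0;
       }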

3. On the coordinator node, the FTS probe process gets the configuration
   from the catalog table gp_segment_configuration, which describes the
   status of each segment and also reflects whether any of them has a
   mirror. For each unique content (or segindex) value, one will see a
   primary segment and may see a mirror segment. The two make a pair:
   they have the same content (or segindex) value but different
   dbids.

   FTS probes only the primary segments. Primary segments provide
   their own status as well as their mirror's status in response. When
   a primary segment is found to be down, FTS promotes its mirror, but
   only if the mirror was in sync with the primary. If the mirror is
   out-of-sync, this is considered a "double failure" and FTS does
   nothing. The cluster is unusable in this case.

   If FTS, upon probing the segments, finds any change, it updates the
   segment configuration. The dispatcher then uses the new
   configuration to create gangs.

   So FTS both reads and writes the catalog table. A sketch of the
   per-content failover decision follows.
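
   The following is a hypothetical, self-contained sketch of that
   decision; the struct, field names, and encodings are illustrative
   and do not match the actual gp_segment_configuration columns.

       #include <stdbool.h>
       #include <stddef.h>
       #include <stdio.h>

       /* One primary/mirror pair, keyed by content (segindex). */
       typedef struct SegPairSketch
       {
           int  content;
           bool primary_up;       /* did the primary answer the probe?    */
           bool has_mirror;
           bool mirror_in_sync;   /* was the mirror in sync with primary? */
       } SegPairSketch;

       /* Should FTS promote the mirror for this content? */
       static bool
       should_promote(const SegPairSketch *p)
       {
           if (p->primary_up)
               return false;      /* primary healthy: nothing to do */
           if (p->has_mirror && p->mirror_in_sync)
               return true;       /* promote the in-sync mirror */
           /* No mirror, or mirror out of sync: "double failure".
            * FTS does nothing and the cluster is unusable. */
           return false;
       }

       int
       main(void)
       {
           SegPairSketch pairs[] = {
               { 0, true,  true,  true  },   /* healthy            */
               { 1, false, true,  true  },   /* promote the mirror */
               { 2, false, true,  false },   /* double failure     */
           };
           size_t i;

           for (i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++)
               printf("content %d: %s\n", pairs[i].content,
                      should_promote(&pairs[i]) ? "promote mirror"
                                                : "no promotion");
           return 0;
       }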

4. On the coordinator node, each round of the polling is done in a chain
   of calls:

       ftsConnect()
       ftsPoll()
       ftsSend()
       ftsReceive()
       processRetry()
       processResponse()

   The FTS probe process connects to each primary segment node (or the
   mirror segment when failover occurs) through TCP/IP. It sends a
   request to each segment and waits for the responses. Once a response
   is received, it updates the catalog tables gp_segment_configuration
   and gp_configuration_history, and also the relevant memory
   structures accordingly. A skeleton of one such round follows.
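
   Here is a self-contained skeleton of one round, driving every
   segment through connect/send/receive with retries. The names mirror
   the call chain above but are stand-ins; the real code multiplexes
   non-blocking connections with a poll loop, and the failure injection
   for dbid 3 below is purely illustrative.

       #include <stdbool.h>
       #include <stddef.h>
       #include <stdio.h>

       #define MAX_RETRIES 5

       typedef enum { SEG_CONNECT, SEG_SEND, SEG_RECV, SEG_DONE, SEG_FAILED } SegState;

       typedef struct
       {
           int      dbid;
           SegState state;
           int      retries;
       } SegProbe;

       /* One non-blocking step for a segment: advance to the next state.
        * To exercise the retry path, pretend dbid 3 fails its first two
        * receive attempts. */
       static void
       step(SegProbe *s)
       {
           if (s->dbid == 3 && s->retries < 2 && s->state == SEG_RECV)
           {
               s->state = SEG_FAILED;
               return;
           }
           switch (s->state)
           {
               case SEG_CONNECT: s->state = SEG_SEND; break;  /* ftsConnect() */
               case SEG_SEND:    s->state = SEG_RECV; break;  /* ftsSend()    */
               case SEG_RECV:    s->state = SEG_DONE; break;  /* ftsReceive() */
               default: break;
           }
       }

       int
       main(void)
       {
           SegProbe segs[] = { {1, SEG_CONNECT, 0}, {2, SEG_CONNECT, 0},
                               {3, SEG_CONNECT, 0} };
           size_t   nsegs = sizeof(segs) / sizeof(segs[0]);
           bool     pending = true;
           size_t   i;

           while (pending)
           {
               pending = false;
               for (i = 0; i < nsegs; i++)
               {
                   if (segs[i].state == SEG_FAILED)       /* processRetry() */
                   {
                       if (segs[i].retries >= MAX_RETRIES)
                           continue;                      /* give up: mark down */
                       segs[i].retries++;
                       segs[i].state = SEG_CONNECT;       /* start over */
                   }
                   if (segs[i].state != SEG_DONE)
                   {
                       step(&segs[i]);
                       pending = true;
                   }
               }
           }

           for (i = 0; i < nsegs; i++)                    /* processResponse() */
               printf("dbid %d: %s after %d retries\n", segs[i].dbid,
                      segs[i].state == SEG_DONE ? "ok" : "down",
                      segs[i].retries);
           return 0;
       }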

5. On the segment node, in the main loop of PostgresMain(), the
   requests from the coordinator's FTS probe process are
   received. ProcessStartupPacket() is called first to make sure this
   dialog is for FTS requests, and thus the Postgres process spawned
   for it will be an FTS handler (am_ftshandler = true). It then
   accepts the request and processes the 'Q'-type message using
   HandleFtsMessage(). This function deals with three kinds of
   requests (a dispatch sketch follows the list):

   - Probe
   - Sync
   - Promote
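
   Below is a hypothetical sketch of that dispatch. The request
   strings, handler names, and the behavior of each handler are
   illustrative; see HandleFtsMessage() for the real message formats
   (the "Sync" request, for example, relates to adjusting synchronous
   replication when a mirror is unavailable).

       #include <stdio.h>
       #include <string.h>

       static void HandleProbeSketch(void)   { printf("report my status and my mirror's\n"); }
       static void HandleSyncSketch(void)    { printf("adjust synchronous replication\n"); }
       static void HandlePromoteSketch(void) { printf("promote this mirror to primary\n"); }

       /* Stand-in for HandleFtsMessage(): dispatch on the text of the
        * 'Q' message sent by the coordinator's FTS probe process. */
       static void
       handle_fts_message(const char *query)
       {
           if (strncmp(query, "PROBE", 5) == 0)
               HandleProbeSketch();
           else if (strncmp(query, "SYNC", 4) == 0)
               HandleSyncSketch();
           else if (strncmp(query, "PROMOTE", 7) == 0)
               HandlePromoteSketch();
           else
               printf("unknown FTS request: %s\n", query);
       }

       int
       main(void)
       {
           handle_fts_message("PROBE");
           handle_fts_message("PROMOTE");
           return 0;
       }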

6. SIGUSR2 is now ignored by FTS; as with the other background workers,
   the postmaster uses SIGTERM to stop FTS.

FTS Probe Request
=================
Currently there are three ways to trigger an FTS probe - two internal and one
external:
1. An internal regular FTS probe that is configurable with gp_fts_probe_interval
2. An internal FTS probe triggered by the query dispatcher
3. An external manual FTS probe from gp_request_fts_probe_scan()

The following diagram illustrates the fts loop process. The upper portion of the
loop represents a probe currently in progress, and the lower portion represents a
completed probe awaiting a trigger, including the gp_fts_probe_interval timeout.
This loop can be probed at any time for results via any of the above three
mechanisms.


                  poll segments
             +---------<--------+
             |                  | <-----+ request4
             |      upper       |
             |                  |
             |                  ^
         done|                  |start
             |                  |
             v      lower       |
             |                  |
             |                  | <-----+ request1, request2, request3
             +----------->------+
                  waitLatch


Two main scenarios to consider:
1) Allowing multiple probes, both internal and external, to reuse the same
results when appropriate (i.e., piggybacking on previous results). This is
depicted as requests 1, 2, and 3, which should share the same results since
they arrive after the results of the previous probe and before the start of a
new fts loop - that is, in the lower portion.

2) Ensuring fresh results for an external probe. This is depicted as request
4, incoming while a probe is in progress. This request should get fresh
results rather than reusing the current results (i.e., no "piggybacking").

Our implementation addresses these concerns with a probe start tick and a probe
end tick. We send a signal requesting fts results, then wait for a new loop to
start, and then wait for that loop to finish. A small model of this handshake
follows.
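
The following is a self-contained model of that handshake. The tick
variables, function names, and sleep-based polling are illustrative
stand-ins, not the actual shared-memory fields or signaling used by FTS.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_int start_tick;      /* bumped when a probe loop starts */
    static atomic_int done_tick;       /* bumped when that loop completes */
    static atomic_int shutdown_flag;

    /* Model of the FTS probe process: upper portion probes, lower
     * portion waits for the next trigger. */
    static void *
    prober(void *arg)
    {
        (void) arg;
        while (!atomic_load(&shutdown_flag))
        {
            atomic_fetch_add(&start_tick, 1);   /* probe begins      */
            usleep(100 * 1000);                 /* ... poll segments */
            atomic_fetch_add(&done_tick, 1);    /* probe ends        */
            usleep(100 * 1000);                 /* ... waitLatch     */
        }
        return NULL;
    }

    /* Model of the external-probe path: never piggyback on a probe
     * that was already in flight when the request arrived. */
    static void
    request_fresh_probe(void)
    {
        int started = atomic_load(&start_tick);
        int target;

        /* (the real code sets the FTS latch here to cut the wait short) */

        while (atomic_load(&start_tick) <= started)
            usleep(1000);                       /* wait for a NEW loop   */
        target = atomic_load(&start_tick);
        while (atomic_load(&done_tick) < target)
            usleep(1000);                       /* wait for it to finish */
        printf("fresh probe results available\n");
    }

    int
    main(void)
    {
        pthread_t t;

        pthread_create(&t, NULL, prober, NULL);
        request_fresh_probe();
        atomic_store(&shutdown_flag, 1);
        pthread_join(t, NULL);
        return 0;
    }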