Fault Tolerance Service (FTS)
=============================

This document illustrates the mechanism of a GPDB component called the
Fault Tolerance Service (FTS), which also works for Apache Cloudberry:

1. This section explains how the FTS probe process is started. The FTS
   probe process runs on the coordinator node only. It starts as a
   background worker process managed by the BackgroundWorker structure
   (see src/include/postmaster/bgworker.h). Greenplum sets up a group of
   GP background processes through an array structure, PMAuxProcList. An
   entry in that array represents a GP background process.

   Two function pointers are important members of the
   BackgroundWorker structure. One points to the main entry function of
   the GP background process. The other points to the function that
   determines whether the process should be started or not. For FTS,
   these two functions are FtsProbeMain() and FtsProbeStartRule(),
   respectively. This is hard-coded in postmaster.c:

       static BackgroundWorker PMAuxProcList[MaxPMAuxProc]

   The FtsProbeStartRule() function specifies the following condition
   under which the FTS probe process should be started by postmaster:

       Gp_role == GP_ROLE_DISPATCH

   That is, the FTS probe process should only be started on the
   coordinator node in normal cluster mode.

   In the initialization phase, we register one BackgroundWorker entry
   for each GP background process into postmaster's private structure
   BackgroundWorkerList. When we do this, the above condition is
   checked to decide whether FTS should be registered there or not. The
   reader may want to check load_auxiliary_libraries() for more
   detail.

   Later, the postmaster tries to start the processes that have been
   registered in the BackgroundWorkerList, which includes the FTS
   probe process. If the first attempt to start a particular process
   fails, or a process goes down for some reason and needs to be
   brought up again, the postmaster restarts it in its main loop. On
   every iteration, it checks the status of these processes and acts
   accordingly. A simplified sketch of this arrangement follows.
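
   Below is a minimal, self-contained sketch of the idea, not the
   actual Greenplum types: the "Sketch" names, the simplified struct
   layout, and the registration loop are all hypothetical stand-ins for
   PMAuxProcList, BackgroundWorker, FtsProbeMain(), and
   FtsProbeStartRule().

       #include <stdbool.h>
       #include <stddef.h>
       #include <stdio.h>

       /* Stand-in for Gp_role and its values. */
       typedef enum { GP_ROLE_DISPATCH, GP_ROLE_EXECUTE, GP_ROLE_UTILITY } GpRoleSketch;

       /* Stand-in for the two key members of BackgroundWorker. */
       typedef struct BackgroundWorkerSketch
       {
           const char *bgw_name;
           void      (*bgw_main)(void);                /* main entry function */
           bool      (*bgw_start_rule)(GpRoleSketch);  /* should it start?    */
       } BackgroundWorkerSketch;

       static void
       FtsProbeMainSketch(void)
       {
           /* The real FtsProbeMain() runs the probe loop forever. */
           printf("ftsprobe process running\n");
       }

       static bool
       FtsProbeStartRuleSketch(GpRoleSketch role)
       {
           /* FTS runs only on the coordinator (dispatch) node. */
           return role == GP_ROLE_DISPATCH;
       }

       /* Stand-in for PMAuxProcList: one entry per GP background process. */
       static BackgroundWorkerSketch PMAuxProcListSketch[] =
       {
           { "ftsprobe process", FtsProbeMainSketch, FtsProbeStartRuleSketch },
       };

       int
       main(void)
       {
           GpRoleSketch role = GP_ROLE_DISPATCH;
           size_t       i;

           /* Registration-style loop: consult each start rule before
            * launching the worker, as the postmaster does. */
           for (i = 0; i < sizeof(PMAuxProcListSketch) / sizeof(PMAuxProcListSketch[0]); i++)
               if (PMAuxProcListSketch[i].bgw_start_rule(role))
                   PMAuxProcListSketch[i].bgw_main();
           return 0;
       }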

2. This section explains how FTS probes are initiated. Probes are
   either triggered at a regular interval (tunable via the
   gp_fts_probe_interval GUC) or triggered on the fly when required by
   certain internal components, tests, or the user via the FTS probe
   triggering function.

   The FTS probe process runs in an infinite loop, doing a round of
   polling at each iteration to get the health status of all
   segments. At each iteration, it waits on a latch with a timeout to
   block itself for a while. Thus, two types of events can trigger
   the polling: one is a timeout on the latch it is waiting on, and the
   other is that someone sets the latch.

   Certain components running on the coordinator node may interrupt FTS
   from its wait to trigger a probe immediately. This is referred to as
   notifying FTS. The dispatcher is one such component. As an example,
   it can notify FTS if it encounters an error while creating a gang.
   The reader may check FtsNotifyProber() to find more cases. A small
   model of this wait-or-notify pattern follows.
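
   Below is a self-contained model of that "timeout or notification"
   wait, written with plain pthreads rather than the real PostgreSQL
   latch API (WaitLatch/SetLatch/ResetLatch); the function names and
   the 5-second interval are illustrative only.

       #include <pthread.h>
       #include <stdbool.h>
       #include <stdio.h>
       #include <time.h>
       #include <unistd.h>

       static pthread_mutex_t latch_mu  = PTHREAD_MUTEX_INITIALIZER;
       static pthread_cond_t  latch_cv  = PTHREAD_COND_INITIALIZER;
       static bool            latch_set = false;

       /* Model of SetLatch(): wake the prober immediately. */
       static void
       notify_prober(void)
       {
           pthread_mutex_lock(&latch_mu);
           latch_set = true;
           pthread_cond_signal(&latch_cv);
           pthread_mutex_unlock(&latch_mu);
       }

       /* Model of WaitLatch() with a timeout: returns true if woken by
        * a notification, false if the timeout fired. */
       static bool
       wait_latch(int timeout_sec)
       {
           struct timespec deadline;
           bool            woken;

           clock_gettime(CLOCK_REALTIME, &deadline);
           deadline.tv_sec += timeout_sec;

           pthread_mutex_lock(&latch_mu);
           while (!latch_set)
               if (pthread_cond_timedwait(&latch_cv, &latch_mu, &deadline) != 0)
                   break;                  /* timed out */
           woken = latch_set;
           latch_set = false;              /* model of ResetLatch() */
           pthread_mutex_unlock(&latch_mu);
           return woken;
       }

       /* A component (e.g. the dispatcher) notifying FTS after a delay. */
       static void *
       dispatcher(void *arg)
       {
           (void) arg;
           sleep(1);
           notify_prober();
           return NULL;
       }

       int
       main(void)
       {
           pthread_t t;
           int       i;

           pthread_create(&t, NULL, dispatcher, NULL);
           for (i = 0; i < 2; i++)
           {
               bool notified = wait_latch(5);   /* probe interval model */
               printf("probe round %d (%s)\n", i,
                      notified ? "notified" : "timeout");
               /* ... a real round would poll all segments here ... */
           }
           pthread_join(t, NULL);
           return 0;
       }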

3. On the coordinator node, the FTS probe process gets the configuration
   from the catalog table gp_segment_configuration, which describes the
   status of each segment and also reflects whether any of them has a
   mirror. For each unique content (or segindex) value, one will see a
   primary segment and may see a mirror segment. The two make a pair:
   they have the same content (or segindex) value but different
   dbids.

   FTS probes only the primary segments. Primary segments provide
   their own status as well as their mirror's status in response. When
   a primary segment is found to be down, FTS promotes its mirror, but
   only if the mirror was in sync with the primary. If the mirror is
   out-of-sync, this is considered a "double failure" and FTS does
   nothing. The cluster is unusable in this case.

   If FTS, upon probing the segments, finds any change, it updates the
   segment configuration. The dispatcher then uses the new
   configuration to create gangs.

   So FTS both reads and writes the catalog table. A sketch of the
   per-content failover decision follows.
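
   The following is a hypothetical, self-contained sketch of that
   decision; the struct, field names, and encodings are illustrative
   and do not match the actual gp_segment_configuration columns.

       #include <stdbool.h>
       #include <stddef.h>
       #include <stdio.h>

       /* One primary/mirror pair, keyed by content (segindex). */
       typedef struct SegPairSketch
       {
           int  content;
           bool primary_up;       /* did the primary answer the probe?    */
           bool has_mirror;
           bool mirror_in_sync;   /* was the mirror in sync with primary? */
       } SegPairSketch;

       /* Should FTS promote the mirror for this content? */
       static bool
       should_promote(const SegPairSketch *p)
       {
           if (p->primary_up)
               return false;      /* primary healthy: nothing to do */
           if (p->has_mirror && p->mirror_in_sync)
               return true;       /* promote the in-sync mirror */
           /* No mirror, or mirror out of sync: "double failure".
            * FTS does nothing and the cluster is unusable. */
           return false;
       }

       int
       main(void)
       {
           SegPairSketch pairs[] = {
               { 0, true,  true,  true  },   /* healthy            */
               { 1, false, true,  true  },   /* promote the mirror */
               { 2, false, true,  false },   /* double failure     */
           };
           size_t i;

           for (i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++)
               printf("content %d: %s\n", pairs[i].content,
                      should_promote(&pairs[i]) ? "promote mirror"
                                                : "no promotion");
           return 0;
       }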

4. On the coordinator node, each round of the polling is done in a chain
   of calls:

       ftsConnect()
       ftsPoll()
       ftsSend()
       ftsReceive()
       processRetry()
       processResponse()

   The FTS probe process connects to each primary segment node (or the
   mirror segment when failover occurs) through TCP/IP. It sends a
   request to each segment and waits for the responses. Once a response
   is received, it updates the catalog tables gp_segment_configuration
   and gp_configuration_history, and also the relevant memory
   structures accordingly. A skeleton of one such round follows.
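
   Here is a self-contained skeleton of one round, driving every
   segment through connect/send/receive with retries. The names mirror
   the call chain above but are stand-ins; the real code multiplexes
   non-blocking connections with a poll loop, and the failure injection
   for dbid 3 below is purely illustrative.

       #include <stdbool.h>
       #include <stddef.h>
       #include <stdio.h>

       #define MAX_RETRIES 5

       typedef enum { SEG_CONNECT, SEG_SEND, SEG_RECV, SEG_DONE, SEG_FAILED } SegState;

       typedef struct
       {
           int      dbid;
           SegState state;
           int      retries;
       } SegProbe;

       /* One non-blocking step for a segment: advance to the next state.
        * To exercise the retry path, pretend dbid 3 fails its first two
        * receive attempts. */
       static void
       step(SegProbe *s)
       {
           if (s->dbid == 3 && s->retries < 2 && s->state == SEG_RECV)
           {
               s->state = SEG_FAILED;
               return;
           }
           switch (s->state)
           {
               case SEG_CONNECT: s->state = SEG_SEND; break;  /* ftsConnect() */
               case SEG_SEND:    s->state = SEG_RECV; break;  /* ftsSend()    */
               case SEG_RECV:    s->state = SEG_DONE; break;  /* ftsReceive() */
               default: break;
           }
       }

       int
       main(void)
       {
           SegProbe segs[] = { {1, SEG_CONNECT, 0}, {2, SEG_CONNECT, 0},
                               {3, SEG_CONNECT, 0} };
           size_t   nsegs = sizeof(segs) / sizeof(segs[0]);
           bool     pending = true;
           size_t   i;

           while (pending)
           {
               pending = false;
               for (i = 0; i < nsegs; i++)
               {
                   if (segs[i].state == SEG_FAILED)       /* processRetry() */
                   {
                       if (segs[i].retries >= MAX_RETRIES)
                           continue;                      /* give up: mark down */
                       segs[i].retries++;
                       segs[i].state = SEG_CONNECT;       /* start over */
                   }
                   if (segs[i].state != SEG_DONE)
                   {
                       step(&segs[i]);
                       pending = true;
                   }
               }
           }

           for (i = 0; i < nsegs; i++)                    /* processResponse() */
               printf("dbid %d: %s after %d retries\n", segs[i].dbid,
                      segs[i].state == SEG_DONE ? "ok" : "down",
                      segs[i].retries);
           return 0;
       }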

5. On the segment node, in the main loop of PostgresMain(), the
   requests from the coordinator's FTS probe process are
   received. ProcessStartupPacket() is called first to make sure this
   dialog is for FTS requests, and thus the Postgres process spawned
   for it will be an FTS handler (am_ftshandler = true). It then
   accepts the request and processes the 'Q'-type message using
   HandleFtsMessage(). This function deals with three kinds of
   requests (a dispatch sketch follows the list):

   - Probe
   - Sync
   - Promote
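
   Below is a hypothetical sketch of that dispatch. The request
   strings, handler names, and the behavior of each handler are
   illustrative; see HandleFtsMessage() for the real message formats
   (the "Sync" request, for example, relates to adjusting synchronous
   replication when a mirror is unavailable).

       #include <stdio.h>
       #include <string.h>

       static void HandleProbeSketch(void)   { printf("report my status and my mirror's\n"); }
       static void HandleSyncSketch(void)    { printf("adjust synchronous replication\n"); }
       static void HandlePromoteSketch(void) { printf("promote this mirror to primary\n"); }

       /* Stand-in for HandleFtsMessage(): dispatch on the text of the
        * 'Q' message sent by the coordinator's FTS probe process. */
       static void
       handle_fts_message(const char *query)
       {
           if (strncmp(query, "PROBE", 5) == 0)
               HandleProbeSketch();
           else if (strncmp(query, "SYNC", 4) == 0)
               HandleSyncSketch();
           else if (strncmp(query, "PROMOTE", 7) == 0)
               HandlePromoteSketch();
           else
               printf("unknown FTS request: %s\n", query);
       }

       int
       main(void)
       {
           handle_fts_message("PROBE");
           handle_fts_message("PROMOTE");
           return 0;
       }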

6. SIGUSR2 is now ignored by FTS; as with the other background workers,
   the postmaster uses SIGTERM to stop FTS.

FTS Probe Request
=================
Currently there are three ways to trigger an FTS probe - two internal and one
external:
1. An internal regular FTS probe that is configurable with gp_fts_probe_interval
2. An internal FTS probe triggered by the query dispatcher
3. An external manual FTS probe from gp_request_fts_probe_scan()

The following diagram illustrates the fts loop process. The upper portion of the
loop represents a probe currently in progress, and the lower portion represents a
completed probe awaiting a trigger, including the gp_fts_probe_interval timeout.
This loop can be probed at any time for results via any of the above three
mechanisms.


                  poll segments
             +---------<--------+
             |                  | <-----+ request4
             |      upper       |
             |                  |
             |                  ^
         done|                  |start
             |                  |
             v      lower       |
             |                  |
             |                  | <-----+ request1, request2, request3
             +----------->------+
                  waitLatch


Two main scenarios to consider:
1) Allowing multiple probes, both internal and external, to reuse the same
results when appropriate (i.e., piggybacking on previous results). This is
depicted as requests 1, 2, and 3, which should share the same results since
they arrive after the results of the previous probe and before the start of a
new fts loop - that is, in the lower portion.

2) Ensuring fresh results for an external probe. This is depicted as request
4, incoming while a probe is in progress. This request should get fresh
results rather than reusing the current results (i.e., no "piggybacking").

Our implementation addresses these concerns with a probe start tick and a probe
end tick. We send a signal requesting fts results, then wait for a new loop to
start, and then wait for that loop to finish. A small model of this handshake
follows.
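
The following is a self-contained model of that handshake. The tick
variables, function names, and sleep-based polling are illustrative
stand-ins, not the actual shared-memory fields or signaling used by FTS.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_int start_tick;      /* bumped when a probe loop starts */
    static atomic_int done_tick;       /* bumped when that loop completes */
    static atomic_int shutdown_flag;

    /* Model of the FTS probe process: upper portion probes, lower
     * portion waits for the next trigger. */
    static void *
    prober(void *arg)
    {
        (void) arg;
        while (!atomic_load(&shutdown_flag))
        {
            atomic_fetch_add(&start_tick, 1);   /* probe begins      */
            usleep(100 * 1000);                 /* ... poll segments */
            atomic_fetch_add(&done_tick, 1);    /* probe ends        */
            usleep(100 * 1000);                 /* ... waitLatch     */
        }
        return NULL;
    }

    /* Model of the external-probe path: never piggyback on a probe
     * that was already in flight when the request arrived. */
    static void
    request_fresh_probe(void)
    {
        int started = atomic_load(&start_tick);
        int target;

        /* (the real code sets the FTS latch here to cut the wait short) */

        while (atomic_load(&start_tick) <= started)
            usleep(1000);                       /* wait for a NEW loop   */
        target = atomic_load(&start_tick);
        while (atomic_load(&done_tick) < target)
            usleep(1000);                       /* wait for it to finish */
        printf("fresh probe results available\n");
    }

    int
    main(void)
    {
        pthread_t t;

        pthread_create(&t, NULL, prober, NULL);
        request_fresh_probe();
        atomic_store(&shutdown_flag, 1);
        pthread_join(t, NULL);
        return 0;
    }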