src/main/java/com/sleepycat/je/log/Provisional.java - doris-thirdparty - Git at Google

 /*-
  * Copyright (C) 2002, 2018, Oracle and/or its affiliates. All rights reserved.
  *
  * This file was distributed by Oracle as part of a version of Oracle Berkeley
  * DB Java Edition made available at:
  *
  * http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/downloads/index.html
  *
  * Please see the LICENSE file included in the top-level directory of the
  * appropriate version of Oracle Berkeley DB Java Edition for a copy of the
  * license and additional information.
  */

 package com.sleepycat.je.log;

 import com.sleepycat.je.utilint.DbLsn;

 /**
  * Specifies whether to log an entry provisionally.
  *
  * Provisional log entries:
  *
  * What are provisional log entries?
  *
  *    Provisional log entries are those tagged with the provisional attribute
  *    in the log entry header. The provisional attribute can be applied to any
  *    type of log entry, and is implemented in
  *    com.sleepycat.je.log.LogEntryHeader as two stolen bits in the 8 bit
  *    version field.
  *
  * When is the provisional attribute used?
  *
  *    The provisional attribute is used only during recovery. It very simply
  *    indicates that recovery will ignore and skip over this log entry.
  *
  * When is the provisional attribute set?
  *
  *    The provisional attribute started out as a way to create atomicity among
  *    different log entries. Child pointers in the JE Btree are physical LSNs,
  *    so each Btree node's children must be logged before it in the log. On
  *    the other hand, one fundamental assumption of the JE log is that each
  *    Btree node found in the log can be replayed and placed in the in-memory
  *    tree. To do so, each Btree node must have a parent node that will house
  *    it. The grouping of multiple log entries into one atomic group is often
  *    used to fulfiil this requirement.
  *
  *     * Atomic log entries:
  *
  *           + When a btree is first created, we bootstrap tree creation by
  *           logging the first BIN provisionally, then creating a parent IN
  *           which is the Btree root IN, which points to this first BIN.
  *
  *           + When we split a Btree node, we create a new IN, which is the
  *           sibling of the split node. We log the old sibling and the new
  *           sibling provisionally, and then log the parent, so that any
  *           crashes in the middle of this triumvirate which result in the
  *           failure to log the parent will skip over the orphaned siblings.
  *
  *           + Splitting the Btree root is just a special case of a split.
  *
  *           + Creating a duplicate subtree to hang in the middle of a btree is
  *           just a special case of a split and btree first creation.
  *
  *     * Entries not meant to be recovered
  *
  *           Temp DBs are not meant to be recovered and we log their LN
  *           and IN nodes in a very lax fashion, purely as a way of evicting
  *           them out of the cache temporarily. There is no guarantee that a
  *           consistent set has been logged to disk. We skip over them for both
  *           recovery performance and the "each-node-must-have-a-parent" rule.
  *
  *     * Durable deferred-write entries
  *
  *           Deferred-write INs are logged provisionally for the same reasons
  *           as for temp DBs (above): for recovery performance and the
  *           "each-node-must-have-a-parent" rule.
  *
  *           Deferred-write LNs are logged non-provisionally to support
  *           obsolete LSN counting.  It would be nice to log them provisionally
  *           for recovery performance and to allow LN deletion without logging;
  *           however, that is not currently practical if obsolete counting is
  *           to be supported.  See [#16864].
  *
  *     * Checkpoint performance
  *
  *           When we flush a series of nodes, it's a waste to replay nodes
  *           which are referenced by higher levels. For example, if we
  *           checkpoint this btree:
  *
  *           {@literal
  *           INA -> INb -> BINc (dirty)-> LNd
  *           }
  *
  *           we log them in this order:
  *
  *           BINc
  *           INb
  *
  *           And there's no point to replaying BINc, because it's referenced by
  *           INb.  We skip over BINc, which we do by logging it provisionally.
  *
  *           In addition, BEFORE_CKPT_END is used to improve cleaner
  *           performance by keeping utilization information up-to-date during
  *           the checkpoint.  See below for details.
  *
  *     * Log cleaning - removing references to deleted files.
  *
  *       When we delete a file for log cleaning we guarantee that no active log
  *       entries refer to any log entry in the deleted file. Suppose our
  *       checkpoint looks like this:
  *
  *         5/100 LNa
  *         5/200 Ckpt start
  *         5/300 INs
  *         ...
  *         5/500 Ckpt end
  *         ...
  *         5/800 last entry in log
  *
  *       Because we do not delete a file until the Ckpt end after processing
  *       (cleaning) it, nothing from 5/500 to 5/800 can refer to a file deleted
  *       due to the Ckpt end in 5/500.
  *
  *       BEFORE_CKPT_END is motivated in part (see below for a complete
  *       description) by the fact that while log entries between 5/100
  *       (first active lsn) and 5/500 (ckpt end) will not in of themselves
  *       contain a LSN for a cleaned, deleted file, the act of processing them
  *       during recovery could require fetching a node from a deleted file. For
  *       example, the IN at 5/300 could have an in-memory parent which has a
  *       reference to an older, cleaned version of that IN.  Treating the span
  *       between 5/200 and 5/500 as provisional is both optimal, because only
  *       the high level INs need to be processed, and required, in order not to
  *       fetch from a cleaned file.
  *
  *       The correctness issue is described in [#16037] comment 151, where we
  *       attempted to log non-provisionally below maxFlushLevel.  It is
  *       repeated below.
  *
  *       IN-A
  *        \
  *        IN-B
  *         \
  *         IN-C
  *          \
  *          BIN-D
  *
  *       1/100 CkptStart
  *       1/200 BIN-D provisional
  *       1/300 IN-C non-provisional
  *       2/100 IN-B non-provisional
  *       2/200 IN-A non-provisional
  *       2/300 MapLN refers to IN-A
  *       2/400 CkptEnd
  *       5/100 cleaner processes file 1
  *             BIN-D and IN-C are dirty
  *       5/200 CkptStart
  *       5/300 BIN-D provisional
  *       5/400 IN-C non-provisional
  *       5/500 IN-B non-provisional (must log one extra level)
  *             IN-A is not logged
  *             MapLN still refers to IN-A at 2/200
  *       5/600 CkptEnd
  *             file 1 is deleted
  *       6/100 Start recovery
  *
  *       Note that only the bottom level BINs are logged provisionally because
  *       we're logging level 2 and up non-provisionally in this experiment.
  *
  *       Recovery replays IN-C at 5/400 because it is non-provisional.
  *
  *       When it does the tree lookup (getParentINForChildIN) it uses the root
  *       IN-A at 2/200. This search fetches IN-B at 2/100 and then fails
  *       fetching IN-C at 1/300 because file 1 has been deleted.
  *
  *       In reality we log provisionally below maxFlushLevel, so that IN-C at
  *       5/400 is not replayed. IN-B at 5/500 is at the maxFlushLevel and is
  *       non-provisional and is replayed.  The search succeeds because nothing
  *       in file 1 needs to be fetched to find the parent.
  *
  *       TODO: Could we instead replay INs in reverse order?
  *       Then IN-B at 5/500 would be replayed first. Unfortunately this would
  *       probably break something else. For example, the utilization tracking
  *       replay for INs currently depends on reading forward.  This is worth
  *       exploring, however, since reducing logging during checkpoints would be
  *       extremely beneficial.
  *
  * Provisional.BEFORE_CKPT_END
  * ---------------------------
  * This property was added to solve a specific problem that occurs in earlier
  * versions of JE:  When a long checkpoint runs, the BINs are not counted
  * obsolete until after the entire BIN level has been logged.  Specifically,
  * they are counted obsolete when their ancestor is logged non-provisionally.
  * Most INs logged by a checkpoint are BINs.  This means that during a very
  * long checkpoint, cleaning of the files containing those old BINs is delayed,
  * and more importantly the calculated utilization is much higher than it
  * actually is.  The correction in utilization does not occur until the end of
  * the checkpoint, when the higher level INs are logged.  This manifests as a
  * lull in cleaning during the checkpoint, because calculated utilization is
  * artificially high, and a spike in cleaning at the end of the checkpoint.  In
  * some cases, the cleaner cannot recover from the backlog that is created by
  * the spike.
  *
  * The provisional property effects obsolete counting as follows:
  *
  *  + If an IN is logged with Provisional.YES, the old version of the IN is not
  *    counted obsolete immediately.  Instead, the offset of the old version of
  *    the IN is added to a list in its parent IN.  The offsets migrate upward
  *    in the tree in this manner until an ancestor IN is logged
  *    non-provisionally.
  *
  *  + If an IN is logged with Provisional.NO or BEFORE_CKPT_END, the old
  *    version of the IN is counted obsolete immediately (and offsets
  *    accumulated from provisional child INs are counted).  This means
  *    that the obsolete offset is added to the UtilizationTracker, and may be
  *    flushed in a FileSummaryLN any time after that.  At the latest, it is
  *    flushed at the end of the checkpoint.
  *
  * Because subtree logging is now used for checkpoints and the parent IN of
  * each logged sub-tree is logged with BEFORE_CKPT_END, the prior version of
  * all INs in the sub-tree will be counted obsolete at that time.  This keeps
  * the calculated utilization accurate throughout the checkpoint, and prevents
  * the large per-checkpoint lull and spike in log cleaning.
  *
  * For the intermediate levels, Provisional.BEFORE_CKPT_END must be used rather
  * than Provisional.NO, which is reserved for the highest level only.  During
  * recovery, the Provisional values are treated as follows (this is from the
  * Provisional javadoc):
  *  + NO:  The entry is non-provisional and is always processed by recovery.
  *  + YES: The entry is provisional and is never processed by recovery.
  *  + BEFORE_CKPT_END: The entry is provisional (not processed by recovery) if
  *    it occurs before the CkptEnd in the recovery interval, or is
  *    non-provisional (is processed) if it occurs after CkptEnd.
  *
  * The key to BEFORE_CKPT_END is that it is treated as provisional if a CkptEnd
  * is logged, i.e., if we do not crash before completing the checkpoint.
  * Because the checkpoint completed, we may have deleted log files that
  * would be necessary to replay the IN.  So we cannot safely replay it.
  *
  * Note the difference in the treatment of BEFORE_CKPT_END for obsolete
  * counting and recovery:
  *  + For obsolete counting, BEFORE_CKPT_END is treated as non-provisional.
  *  + For recovery, when the IN occurs before CkptEnd, BEFORE_CKPT_END is
  *    treated as provisional.
  * This difference is the reason for the existence of BEFORE_CKPT_END.
  *
  * TODO: Improvement to tracking of obsolete data.
  * When we checkpoint INs, why can't we always count the previous version
  * obsolete immediately, irrespective of whether it is logged provisionally?
  * The previous version file can't be deleted until after a complete
  * checkpoint. If we do not complete the next checkpoint, recovery will
  * replay the INs logged with BEFORE_CKPT_END. So the previous version will be
  * obsolete. This would avoid storing a list of obsolete child LSNs in each
  * parent IN, and would make the utilization summary more up-to-date. The
  * motivation for sub-tree logging was to keep the utilization info up-to-date,
  * so we may be able to remove sub-tree logging as well. Additionally, I think
  * we can remove BEFORE_CKPT_END and log provisionally (YES) instead, because
  * recovery will replay the actions that dirtied the INs, and the ckpt at the
  * end of recovery will flush the dirty nodes, making the previous version
  * obsolete; however, this would duplicate the provisional INs, so perhaps it
  * is best to continue to use BEFORE_CKPT_END.
  */
 public enum Provisional {

     /**
      * The entry is non-provisional and is always processed by recovery.
      */
     NO,

     /**
      * The entry is provisional and is never processed by recovery.
      */
     YES,

     /**
      * The entry is provisional (not processed by recovery) if it occurs before
      * the CkptEnd in the recovery interval, or is non-provisional (is
      * processed) if it occurs after CkptEnd.
      */
     BEFORE_CKPT_END;

     /**
      * Determines whether a given log entry should be processed during
      * recovery.
      */
     public boolean isProvisional(long logEntryLsn, long ckptEndLsn) {
         assert logEntryLsn != DbLsn.NULL_LSN;
         switch (this) {
         case NO:
             return false;
         case YES:
             return true;
         case BEFORE_CKPT_END:
             return ckptEndLsn != DbLsn.NULL_LSN &&
                    DbLsn.compareTo(logEntryLsn, ckptEndLsn) < 0;
         default:
             assert false;
             return false;
         }
     }
 }
	/*-
	* Copyright (C) 2002, 2018, Oracle and/or its affiliates. All rights reserved.
	*
	* This file was distributed by Oracle as part of a version of Oracle Berkeley
	* DB Java Edition made available at:
	*
	* http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/downloads/index.html
	*
	* Please see the LICENSE file included in the top-level directory of the
	* appropriate version of Oracle Berkeley DB Java Edition for a copy of the
	* license and additional information.
	*/

	package com.sleepycat.je.log;

	import com.sleepycat.je.utilint.DbLsn;

	/**
	* Specifies whether to log an entry provisionally.
	*
	* Provisional log entries:
	*
	* What are provisional log entries?
	*
	* Provisional log entries are those tagged with the provisional attribute
	* in the log entry header. The provisional attribute can be applied to any
	* type of log entry, and is implemented in
	* com.sleepycat.je.log.LogEntryHeader as two stolen bits in the 8 bit
	* version field.
	*
	* When is the provisional attribute used?
	*
	* The provisional attribute is used only during recovery. It very simply
	* indicates that recovery will ignore and skip over this log entry.
	*
	* When is the provisional attribute set?
	*
	* The provisional attribute started out as a way to create atomicity among
	* different log entries. Child pointers in the JE Btree are physical LSNs,
	* so each Btree node's children must be logged before it in the log. On
	* the other hand, one fundamental assumption of the JE log is that each
	* Btree node found in the log can be replayed and placed in the in-memory
	* tree. To do so, each Btree node must have a parent node that will house
	* it. The grouping of multiple log entries into one atomic group is often
	* used to fulfiil this requirement.
	*
	* * Atomic log entries:
	*
	* + When a btree is first created, we bootstrap tree creation by
	* logging the first BIN provisionally, then creating a parent IN
	* which is the Btree root IN, which points to this first BIN.
	*
	* + When we split a Btree node, we create a new IN, which is the
	* sibling of the split node. We log the old sibling and the new
	* sibling provisionally, and then log the parent, so that any
	* crashes in the middle of this triumvirate which result in the
	* failure to log the parent will skip over the orphaned siblings.
	*
	* + Splitting the Btree root is just a special case of a split.
	*
	* + Creating a duplicate subtree to hang in the middle of a btree is
	* just a special case of a split and btree first creation.
	*
	* * Entries not meant to be recovered
	*
	* Temp DBs are not meant to be recovered and we log their LN
	* and IN nodes in a very lax fashion, purely as a way of evicting
	* them out of the cache temporarily. There is no guarantee that a
	* consistent set has been logged to disk. We skip over them for both
	* recovery performance and the "each-node-must-have-a-parent" rule.
	*
	* * Durable deferred-write entries
	*
	* Deferred-write INs are logged provisionally for the same reasons
	* as for temp DBs (above): for recovery performance and the
	* "each-node-must-have-a-parent" rule.
	*
	* Deferred-write LNs are logged non-provisionally to support
	* obsolete LSN counting. It would be nice to log them provisionally
	* for recovery performance and to allow LN deletion without logging;
	* however, that is not currently practical if obsolete counting is
	* to be supported. See [#16864].
	*
	* * Checkpoint performance
	*
	* When we flush a series of nodes, it's a waste to replay nodes
	* which are referenced by higher levels. For example, if we
	* checkpoint this btree:
	*
	* {@literal
	* INA -> INb -> BINc (dirty)-> LNd
	* }
	*
	* we log them in this order:
	*
	* BINc
	* INb
	*
	* And there's no point to replaying BINc, because it's referenced by
	* INb. We skip over BINc, which we do by logging it provisionally.
	*
	* In addition, BEFORE_CKPT_END is used to improve cleaner
	* performance by keeping utilization information up-to-date during
	* the checkpoint. See below for details.
	*
	* * Log cleaning - removing references to deleted files.
	*
	* When we delete a file for log cleaning we guarantee that no active log
	* entries refer to any log entry in the deleted file. Suppose our
	* checkpoint looks like this:
	*
	* 5/100 LNa
	* 5/200 Ckpt start
	* 5/300 INs
	* ...
	* 5/500 Ckpt end
	* ...
	* 5/800 last entry in log
	*
	* Because we do not delete a file until the Ckpt end after processing
	* (cleaning) it, nothing from 5/500 to 5/800 can refer to a file deleted
	* due to the Ckpt end in 5/500.
	*
	* BEFORE_CKPT_END is motivated in part (see below for a complete
	* description) by the fact that while log entries between 5/100
	* (first active lsn) and 5/500 (ckpt end) will not in of themselves
	* contain a LSN for a cleaned, deleted file, the act of processing them
	* during recovery could require fetching a node from a deleted file. For
	* example, the IN at 5/300 could have an in-memory parent which has a
	* reference to an older, cleaned version of that IN. Treating the span
	* between 5/200 and 5/500 as provisional is both optimal, because only
	* the high level INs need to be processed, and required, in order not to
	* fetch from a cleaned file.
	*
	* The correctness issue is described in [#16037] comment 151, where we
	* attempted to log non-provisionally below maxFlushLevel. It is
	* repeated below.
	*
	* IN-A
	* \
	* IN-B
	* \
	* IN-C
	* \
	* BIN-D
	*
	* 1/100 CkptStart
	* 1/200 BIN-D provisional
	* 1/300 IN-C non-provisional
	* 2/100 IN-B non-provisional
	* 2/200 IN-A non-provisional
	* 2/300 MapLN refers to IN-A
	* 2/400 CkptEnd
	* 5/100 cleaner processes file 1
	* BIN-D and IN-C are dirty
	* 5/200 CkptStart
	* 5/300 BIN-D provisional
	* 5/400 IN-C non-provisional
	* 5/500 IN-B non-provisional (must log one extra level)
	* IN-A is not logged
	* MapLN still refers to IN-A at 2/200
	* 5/600 CkptEnd
	* file 1 is deleted
	* 6/100 Start recovery
	*
	* Note that only the bottom level BINs are logged provisionally because
	* we're logging level 2 and up non-provisionally in this experiment.
	*
	* Recovery replays IN-C at 5/400 because it is non-provisional.
	*
	* When it does the tree lookup (getParentINForChildIN) it uses the root
	* IN-A at 2/200. This search fetches IN-B at 2/100 and then fails
	* fetching IN-C at 1/300 because file 1 has been deleted.
	*
	* In reality we log provisionally below maxFlushLevel, so that IN-C at
	* 5/400 is not replayed. IN-B at 5/500 is at the maxFlushLevel and is
	* non-provisional and is replayed. The search succeeds because nothing
	* in file 1 needs to be fetched to find the parent.
	*
	* TODO: Could we instead replay INs in reverse order?
	* Then IN-B at 5/500 would be replayed first. Unfortunately this would
	* probably break something else. For example, the utilization tracking
	* replay for INs currently depends on reading forward. This is worth
	* exploring, however, since reducing logging during checkpoints would be
	* extremely beneficial.
	*
	* Provisional.BEFORE_CKPT_END
	* ---------------------------
	* This property was added to solve a specific problem that occurs in earlier
	* versions of JE: When a long checkpoint runs, the BINs are not counted
	* obsolete until after the entire BIN level has been logged. Specifically,
	* they are counted obsolete when their ancestor is logged non-provisionally.
	* Most INs logged by a checkpoint are BINs. This means that during a very
	* long checkpoint, cleaning of the files containing those old BINs is delayed,
	* and more importantly the calculated utilization is much higher than it
	* actually is. The correction in utilization does not occur until the end of
	* the checkpoint, when the higher level INs are logged. This manifests as a
	* lull in cleaning during the checkpoint, because calculated utilization is
	* artificially high, and a spike in cleaning at the end of the checkpoint. In
	* some cases, the cleaner cannot recover from the backlog that is created by
	* the spike.
	*
	* The provisional property effects obsolete counting as follows:
	*
	* + If an IN is logged with Provisional.YES, the old version of the IN is not
	* counted obsolete immediately. Instead, the offset of the old version of
	* the IN is added to a list in its parent IN. The offsets migrate upward
	* in the tree in this manner until an ancestor IN is logged
	* non-provisionally.
	*
	* + If an IN is logged with Provisional.NO or BEFORE_CKPT_END, the old
	* version of the IN is counted obsolete immediately (and offsets
	* accumulated from provisional child INs are counted). This means
	* that the obsolete offset is added to the UtilizationTracker, and may be
	* flushed in a FileSummaryLN any time after that. At the latest, it is
	* flushed at the end of the checkpoint.
	*
	* Because subtree logging is now used for checkpoints and the parent IN of
	* each logged sub-tree is logged with BEFORE_CKPT_END, the prior version of
	* all INs in the sub-tree will be counted obsolete at that time. This keeps
	* the calculated utilization accurate throughout the checkpoint, and prevents
	* the large per-checkpoint lull and spike in log cleaning.
	*
	* For the intermediate levels, Provisional.BEFORE_CKPT_END must be used rather
	* than Provisional.NO, which is reserved for the highest level only. During
	* recovery, the Provisional values are treated as follows (this is from the
	* Provisional javadoc):
	* + NO: The entry is non-provisional and is always processed by recovery.
	* + YES: The entry is provisional and is never processed by recovery.
	* + BEFORE_CKPT_END: The entry is provisional (not processed by recovery) if
	* it occurs before the CkptEnd in the recovery interval, or is
	* non-provisional (is processed) if it occurs after CkptEnd.
	*
	* The key to BEFORE_CKPT_END is that it is treated as provisional if a CkptEnd
	* is logged, i.e., if we do not crash before completing the checkpoint.
	* Because the checkpoint completed, we may have deleted log files that
	* would be necessary to replay the IN. So we cannot safely replay it.
	*
	* Note the difference in the treatment of BEFORE_CKPT_END for obsolete
	* counting and recovery:
	* + For obsolete counting, BEFORE_CKPT_END is treated as non-provisional.
	* + For recovery, when the IN occurs before CkptEnd, BEFORE_CKPT_END is
	* treated as provisional.
	* This difference is the reason for the existence of BEFORE_CKPT_END.
	*
	* TODO: Improvement to tracking of obsolete data.
	* When we checkpoint INs, why can't we always count the previous version
	* obsolete immediately, irrespective of whether it is logged provisionally?
	* The previous version file can't be deleted until after a complete
	* checkpoint. If we do not complete the next checkpoint, recovery will
	* replay the INs logged with BEFORE_CKPT_END. So the previous version will be
	* obsolete. This would avoid storing a list of obsolete child LSNs in each
	* parent IN, and would make the utilization summary more up-to-date. The
	* motivation for sub-tree logging was to keep the utilization info up-to-date,
	* so we may be able to remove sub-tree logging as well. Additionally, I think
	* we can remove BEFORE_CKPT_END and log provisionally (YES) instead, because
	* recovery will replay the actions that dirtied the INs, and the ckpt at the
	* end of recovery will flush the dirty nodes, making the previous version
	* obsolete; however, this would duplicate the provisional INs, so perhaps it
	* is best to continue to use BEFORE_CKPT_END.
	*/
	public enum Provisional {

	/**
	* The entry is non-provisional and is always processed by recovery.
	*/
	NO,

	/**
	* The entry is provisional and is never processed by recovery.
	*/
	YES,

	/**
	* The entry is provisional (not processed by recovery) if it occurs before
	* the CkptEnd in the recovery interval, or is non-provisional (is
	* processed) if it occurs after CkptEnd.
	*/
	BEFORE_CKPT_END;

	/**
	* Determines whether a given log entry should be processed during
	* recovery.
	*/
	public boolean isProvisional(long logEntryLsn, long ckptEndLsn) {
	assert logEntryLsn != DbLsn.NULL_LSN;
	switch (this) {
	case NO:
	return false;
	case YES:
	return true;
	case BEFORE_CKPT_END:
	return ckptEndLsn != DbLsn.NULL_LSN &&
	DbLsn.compareTo(logEntryLsn, ckptEndLsn) < 0;
	default:
	assert false;
	return false;
	}
	}
	}