blob: 87bdd3bcab2bceb424410a3c485d06f4eb1ffcd2 [file] [log] [blame]
/*-
* Copyright (C) 2002, 2018, Oracle and/or its affiliates. All rights reserved.
*
* This file was distributed by Oracle as part of a version of Oracle Berkeley
* DB Java Edition made available at:
*
* http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/downloads/index.html
*
* Please see the LICENSE file included in the top-level directory of the
* appropriate version of Oracle Berkeley DB Java Edition for a copy of the
* license and additional information.
*/
package com.sleepycat.je.log;
import com.sleepycat.je.utilint.DbLsn;
/**
* Specifies whether to log an entry provisionally.
*
* Provisional log entries:
*
* What are provisional log entries?
*
* Provisional log entries are those tagged with the provisional attribute
* in the log entry header. The provisional attribute can be applied to any
* type of log entry, and is implemented in
* com.sleepycat.je.log.LogEntryHeader as two stolen bits in the 8 bit
* version field.
*
* When is the provisional attribute used?
*
* The provisional attribute is used only during recovery. It very simply
* indicates that recovery will ignore and skip over this log entry.
*
* When is the provisional attribute set?
*
* The provisional attribute started out as a way to create atomicity among
* different log entries. Child pointers in the JE Btree are physical LSNs,
* so each Btree node's children must be logged before it in the log. On
* the other hand, one fundamental assumption of the JE log is that each
* Btree node found in the log can be replayed and placed in the in-memory
* tree. To do so, each Btree node must have a parent node that will house
* it. The grouping of multiple log entries into one atomic group is often
* used to fulfiil this requirement.
*
* * Atomic log entries:
*
* + When a btree is first created, we bootstrap tree creation by
* logging the first BIN provisionally, then creating a parent IN
* which is the Btree root IN, which points to this first BIN.
*
* + When we split a Btree node, we create a new IN, which is the
* sibling of the split node. We log the old sibling and the new
* sibling provisionally, and then log the parent, so that any
* crashes in the middle of this triumvirate which result in the
* failure to log the parent will skip over the orphaned siblings.
*
* + Splitting the Btree root is just a special case of a split.
*
* + Creating a duplicate subtree to hang in the middle of a btree is
* just a special case of a split and btree first creation.
*
* * Entries not meant to be recovered
*
* Temp DBs are not meant to be recovered and we log their LN
* and IN nodes in a very lax fashion, purely as a way of evicting
* them out of the cache temporarily. There is no guarantee that a
* consistent set has been logged to disk. We skip over them for both
* recovery performance and the "each-node-must-have-a-parent" rule.
*
* * Durable deferred-write entries
*
* Deferred-write INs are logged provisionally for the same reasons
* as for temp DBs (above): for recovery performance and the
* "each-node-must-have-a-parent" rule.
*
* Deferred-write LNs are logged non-provisionally to support
* obsolete LSN counting. It would be nice to log them provisionally
* for recovery performance and to allow LN deletion without logging;
* however, that is not currently practical if obsolete counting is
* to be supported. See [#16864].
*
* * Checkpoint performance
*
* When we flush a series of nodes, it's a waste to replay nodes
* which are referenced by higher levels. For example, if we
* checkpoint this btree:
*
* {@literal
* INA -> INb -> BINc (dirty)-> LNd
* }
*
* we log them in this order:
*
* BINc
* INb
*
* And there's no point to replaying BINc, because it's referenced by
* INb. We skip over BINc, which we do by logging it provisionally.
*
* In addition, BEFORE_CKPT_END is used to improve cleaner
* performance by keeping utilization information up-to-date during
* the checkpoint. See below for details.
*
* * Log cleaning - removing references to deleted files.
*
* When we delete a file for log cleaning we guarantee that no active log
* entries refer to any log entry in the deleted file. Suppose our
* checkpoint looks like this:
*
* 5/100 LNa
* 5/200 Ckpt start
* 5/300 INs
* ...
* 5/500 Ckpt end
* ...
* 5/800 last entry in log
*
* Because we do not delete a file until the Ckpt end after processing
* (cleaning) it, nothing from 5/500 to 5/800 can refer to a file deleted
* due to the Ckpt end in 5/500.
*
* BEFORE_CKPT_END is motivated in part (see below for a complete
* description) by the fact that while log entries between 5/100
* (first active lsn) and 5/500 (ckpt end) will not in of themselves
* contain a LSN for a cleaned, deleted file, the act of processing them
* during recovery could require fetching a node from a deleted file. For
* example, the IN at 5/300 could have an in-memory parent which has a
* reference to an older, cleaned version of that IN. Treating the span
* between 5/200 and 5/500 as provisional is both optimal, because only
* the high level INs need to be processed, and required, in order not to
* fetch from a cleaned file.
*
* The correctness issue is described in [#16037] comment 151, where we
* attempted to log non-provisionally below maxFlushLevel. It is
* repeated below.
*
* IN-A
* \
* IN-B
* \
* IN-C
* \
* BIN-D
*
* 1/100 CkptStart
* 1/200 BIN-D provisional
* 1/300 IN-C non-provisional
* 2/100 IN-B non-provisional
* 2/200 IN-A non-provisional
* 2/300 MapLN refers to IN-A
* 2/400 CkptEnd
* 5/100 cleaner processes file 1
* BIN-D and IN-C are dirty
* 5/200 CkptStart
* 5/300 BIN-D provisional
* 5/400 IN-C non-provisional
* 5/500 IN-B non-provisional (must log one extra level)
* IN-A is not logged
* MapLN still refers to IN-A at 2/200
* 5/600 CkptEnd
* file 1 is deleted
* 6/100 Start recovery
*
* Note that only the bottom level BINs are logged provisionally because
* we're logging level 2 and up non-provisionally in this experiment.
*
* Recovery replays IN-C at 5/400 because it is non-provisional.
*
* When it does the tree lookup (getParentINForChildIN) it uses the root
* IN-A at 2/200. This search fetches IN-B at 2/100 and then fails
* fetching IN-C at 1/300 because file 1 has been deleted.
*
* In reality we log provisionally below maxFlushLevel, so that IN-C at
* 5/400 is not replayed. IN-B at 5/500 is at the maxFlushLevel and is
* non-provisional and is replayed. The search succeeds because nothing
* in file 1 needs to be fetched to find the parent.
*
* TODO: Could we instead replay INs in reverse order?
* Then IN-B at 5/500 would be replayed first. Unfortunately this would
* probably break something else. For example, the utilization tracking
* replay for INs currently depends on reading forward. This is worth
* exploring, however, since reducing logging during checkpoints would be
* extremely beneficial.
*
* Provisional.BEFORE_CKPT_END
* ---------------------------
* This property was added to solve a specific problem that occurs in earlier
* versions of JE: When a long checkpoint runs, the BINs are not counted
* obsolete until after the entire BIN level has been logged. Specifically,
* they are counted obsolete when their ancestor is logged non-provisionally.
* Most INs logged by a checkpoint are BINs. This means that during a very
* long checkpoint, cleaning of the files containing those old BINs is delayed,
* and more importantly the calculated utilization is much higher than it
* actually is. The correction in utilization does not occur until the end of
* the checkpoint, when the higher level INs are logged. This manifests as a
* lull in cleaning during the checkpoint, because calculated utilization is
* artificially high, and a spike in cleaning at the end of the checkpoint. In
* some cases, the cleaner cannot recover from the backlog that is created by
* the spike.
*
* The provisional property effects obsolete counting as follows:
*
* + If an IN is logged with Provisional.YES, the old version of the IN is not
* counted obsolete immediately. Instead, the offset of the old version of
* the IN is added to a list in its parent IN. The offsets migrate upward
* in the tree in this manner until an ancestor IN is logged
* non-provisionally.
*
* + If an IN is logged with Provisional.NO or BEFORE_CKPT_END, the old
* version of the IN is counted obsolete immediately (and offsets
* accumulated from provisional child INs are counted). This means
* that the obsolete offset is added to the UtilizationTracker, and may be
* flushed in a FileSummaryLN any time after that. At the latest, it is
* flushed at the end of the checkpoint.
*
* Because subtree logging is now used for checkpoints and the parent IN of
* each logged sub-tree is logged with BEFORE_CKPT_END, the prior version of
* all INs in the sub-tree will be counted obsolete at that time. This keeps
* the calculated utilization accurate throughout the checkpoint, and prevents
* the large per-checkpoint lull and spike in log cleaning.
*
* For the intermediate levels, Provisional.BEFORE_CKPT_END must be used rather
* than Provisional.NO, which is reserved for the highest level only. During
* recovery, the Provisional values are treated as follows (this is from the
* Provisional javadoc):
* + NO: The entry is non-provisional and is always processed by recovery.
* + YES: The entry is provisional and is never processed by recovery.
* + BEFORE_CKPT_END: The entry is provisional (not processed by recovery) if
* it occurs before the CkptEnd in the recovery interval, or is
* non-provisional (is processed) if it occurs after CkptEnd.
*
* The key to BEFORE_CKPT_END is that it is treated as provisional if a CkptEnd
* is logged, i.e., if we do not crash before completing the checkpoint.
* Because the checkpoint completed, we may have deleted log files that
* would be necessary to replay the IN. So we cannot safely replay it.
*
* Note the difference in the treatment of BEFORE_CKPT_END for obsolete
* counting and recovery:
* + For obsolete counting, BEFORE_CKPT_END is treated as non-provisional.
* + For recovery, when the IN occurs before CkptEnd, BEFORE_CKPT_END is
* treated as provisional.
* This difference is the reason for the existence of BEFORE_CKPT_END.
*
* TODO: Improvement to tracking of obsolete data.
* When we checkpoint INs, why can't we always count the previous version
* obsolete immediately, irrespective of whether it is logged provisionally?
* The previous version file can't be deleted until after a complete
* checkpoint. If we do not complete the next checkpoint, recovery will
* replay the INs logged with BEFORE_CKPT_END. So the previous version will be
* obsolete. This would avoid storing a list of obsolete child LSNs in each
* parent IN, and would make the utilization summary more up-to-date. The
* motivation for sub-tree logging was to keep the utilization info up-to-date,
* so we may be able to remove sub-tree logging as well. Additionally, I think
* we can remove BEFORE_CKPT_END and log provisionally (YES) instead, because
* recovery will replay the actions that dirtied the INs, and the ckpt at the
* end of recovery will flush the dirty nodes, making the previous version
* obsolete; however, this would duplicate the provisional INs, so perhaps it
* is best to continue to use BEFORE_CKPT_END.
*/
public enum Provisional {
/**
* The entry is non-provisional and is always processed by recovery.
*/
NO,
/**
* The entry is provisional and is never processed by recovery.
*/
YES,
/**
* The entry is provisional (not processed by recovery) if it occurs before
* the CkptEnd in the recovery interval, or is non-provisional (is
* processed) if it occurs after CkptEnd.
*/
BEFORE_CKPT_END;
/**
* Determines whether a given log entry should be processed during
* recovery.
*/
public boolean isProvisional(long logEntryLsn, long ckptEndLsn) {
assert logEntryLsn != DbLsn.NULL_LSN;
switch (this) {
case NO:
return false;
case YES:
return true;
case BEFORE_CKPT_END:
return ckptEndLsn != DbLsn.NULL_LSN &&
DbLsn.compareTo(logEntryLsn, ckptEndLsn) < 0;
default:
assert false;
return false;
}
}
}