| src/backend/storage/lmgr/README-SSI |
| |
| Serializable Snapshot Isolation (SSI) and Predicate Locking |
| =========================================================== |
| |
| This code is in the lmgr directory because about 90% of it is an |
| implementation of predicate locking, which is required for SSI, |
| rather than being directly related to SSI itself. When another use |
| for predicate locking justifies the effort to tease these two things |
| apart, this README file should probably be split. |
| |
| |
| Credits |
| ------- |
| |
| This feature was developed by Kevin Grittner and Dan R. K. Ports, |
| with review and suggestions from Joe Conway, Heikki Linnakangas, and |
| Jeff Davis. It is based on work published in these papers: |
| |
| Michael J. Cahill, Uwe Röhm, and Alan D. Fekete. 2008. |
| Serializable isolation for snapshot databases. |
| In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD |
| international conference on Management of data, |
| pages 729-738, New York, NY, USA. ACM. |
| http://doi.acm.org/10.1145/1376616.1376690 |
| |
| Michael James Cahill. 2009. |
| Serializable Isolation for Snapshot Databases. |
| Sydney Digital Theses. |
| University of Sydney, School of Information Technologies. |
| http://hdl.handle.net/2123/5353 |
| |
| |
| Overview |
| -------- |
| |
With true serializable transactions, if you can show that your
transaction will do the right thing when there are no concurrent
transactions, it will either do the right thing in any mix of
serializable transactions or be rolled back with a serialization
failure. This
| feature has been implemented in PostgreSQL using SSI. |
| |
| |
| Serializable and Snapshot Transaction Isolation Levels |
| ------------------------------------------------------ |
| |
| Serializable transaction isolation is attractive for shops with |
| active development by many programmers against a complex schema |
| because it guarantees data integrity with very little staff time -- |
| if a transaction can be shown to always do the right thing when it is |
| run alone (before or after any other transaction), it will always do |
| the right thing in any mix of concurrent serializable transactions. |
| Where conflicts with other transactions would result in an |
| inconsistent state within the database or an inconsistent view of |
| the data, a serializable transaction will block or roll back to |
| prevent the anomaly. The SQL standard provides a specific SQLSTATE |
| for errors generated when a transaction rolls back for this reason, |
| so that transactions can be retried automatically. |
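
Application code can exploit that SQLSTATE with a simple retry loop.
A minimal sketch in Python; `SerializationFailure` and
`run_with_retry` are hypothetical stand-ins for whatever a real
database driver raises and the application's own wrapper:

```python
class SerializationFailure(Exception):
    """Stand-in for a driver error carrying SQLSTATE 40001."""
    sqlstate = "40001"

def run_with_retry(transaction, max_attempts=5):
    """Re-run a transaction whenever it fails with SQLSTATE 40001."""
    for _ in range(max_attempts):
        try:
            return transaction()
        except SerializationFailure:
            continue  # safe to retry: nothing was committed
    raise RuntimeError("gave up after repeated serialization failures")

# Simulated transaction that hits serialization failures twice
# before succeeding.
attempts = []
def demo_transaction():
    attempts.append(1)
    if len(attempts) < 3:
        raise SerializationFailure()
    return "committed"

result = run_with_retry(demo_transaction)  # succeeds on the third try
```

The key property making this safe is that a transaction rolled back
with SQLSTATE 40001 has had no effect on the database, so retrying it
cannot double-apply its work.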
| |
| Before version 9.1, PostgreSQL did not support a full serializable |
| isolation level. A request for serializable transaction isolation |
actually provided snapshot isolation. This has well-known anomalies
which can allow data corruption or inconsistent views of the data
during concurrent transactions, although these anomalies only occur
| when certain patterns of read-write dependencies exist within a set |
| of concurrent transactions. Where these patterns exist, the anomalies |
| can be prevented by introducing conflicts through explicitly |
| programmed locks or otherwise unnecessary writes to the database. |
| Snapshot isolation is popular because performance is better than |
| serializable isolation and the integrity guarantees which it does |
| provide allow anomalies to be avoided or managed with reasonable |
| effort in many environments. |
| |
| |
| Serializable Isolation Implementation Strategies |
| ------------------------------------------------ |
| |
| Techniques for implementing full serializable isolation have been |
| published and in use in many database products for decades. The |
| primary technique which has been used is Strict Two-Phase Locking |
| (S2PL), which operates by blocking writes against data which has been |
| read by concurrent transactions and blocking any access (read or |
| write) against data which has been written by concurrent |
| transactions. A cycle in a graph of blocking indicates a deadlock, |
| requiring a rollback. Blocking and deadlocks under S2PL in high |
| contention workloads can be debilitating, crippling throughput and |
| response time. |
| |
| A new technique for implementing full serializable isolation in an |
| MVCC database appears in the literature beginning in 2008. This |
technique, known as Serializable Snapshot Isolation (SSI), has many of
| the advantages of snapshot isolation. In particular, reads don't |
| block anything and writes don't block reads. Essentially, it runs |
| snapshot isolation but monitors the read-write conflicts between |
| transactions to identify dangerous structures in the transaction |
| graph which indicate that a set of concurrent transactions might |
| produce an anomaly, and rolls back transactions to ensure that no |
| anomalies occur. It will produce some false positives (where a |
| transaction is rolled back even though there would not have been an |
| anomaly), but will never let an anomaly occur. In the two known |
| prototype implementations, performance for many workloads (even with |
| the need to restart transactions which are rolled back) is very close |
| to snapshot isolation and generally far better than an S2PL |
| implementation. |
| |
| |
| Apparent Serial Order of Execution |
| ---------------------------------- |
| |
| One way to understand when snapshot anomalies can occur, and to |
| visualize the difference between the serializable implementations |
| described above, is to consider that among transactions executing at |
| the serializable transaction isolation level, the results are |
| required to be consistent with some serial (one-at-a-time) execution |
| of the transactions [1]. How is that order determined in each? |
| |
| In S2PL, each transaction locks any data it accesses. It holds the |
| locks until committing, preventing other transactions from making |
| conflicting accesses to the same data in the interim. Some |
| transactions may have to be rolled back to prevent deadlock. But |
| successful transactions can always be viewed as having occurred |
| sequentially, in the order they committed. |
| |
| With snapshot isolation, reads never block writes, nor vice versa, so |
| more concurrency is possible. The order in which transactions appear |
| to have executed is determined by something more subtle than in S2PL: |
| read/write dependencies. If a transaction reads data, it appears to |
| execute after the transaction that wrote the data it is reading. |
| Similarly, if it updates data, it appears to execute after the |
| transaction that wrote the previous version. These dependencies, which |
| we call "wr-dependencies" and "ww-dependencies", are consistent with |
| the commit order, because the first transaction must have committed |
| before the second starts. However, there can also be dependencies |
| between two *concurrent* transactions, i.e. where one was running when |
| the other acquired its snapshot. These "rw-conflicts" occur when one |
| transaction attempts to read data which is not visible to it because |
| the transaction which wrote it (or will later write it) is |
| concurrent. The reading transaction appears to have executed first, |
| regardless of the actual sequence of transaction starts or commits, |
| because it sees a database state prior to that in which the other |
| transaction leaves it. |
| |
| Anomalies occur when a cycle is created in the graph of dependencies: |
| when a dependency or series of dependencies causes transaction A to |
| appear to have executed before transaction B, but another series of |
| dependencies causes B to appear before A. If that's the case, then |
| the results can't be consistent with any serial execution of the |
| transactions. |
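
That condition is an ordinary cycle search over the dependency graph.
A minimal illustration (edge labels are ignored, since any mix of
wr-, ww-, and rw-edges can close a cycle):

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as {node: set(successors)}.

    A cycle means the apparent execution order is contradictory, i.e.
    no serial order of the transactions is consistent with the results.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in edges}

    def visit(n):
        color[n] = GRAY
        for m in edges.get(n, ()):
            if color.get(m, WHITE) == GRAY:
                return True  # back edge: cycle found
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(edges))

# T1 appears before T2 (rw-conflict) and T2 before T1 (wr-dependency):
# the results can't match any serial order.
anomalous = has_cycle({"T1": {"T2"}, "T2": {"T1"}})
serial_ok = has_cycle({"T1": {"T2"}, "T2": set()})
```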
| |
| |
| SSI Algorithm |
| ------------- |
| |
| As of 9.1, serializable transactions in PostgreSQL are implemented using |
| Serializable Snapshot Isolation (SSI), based on the work of Cahill |
| et al. Fundamentally, this allows snapshot isolation to run as it |
| previously did, while monitoring for conditions which could create a |
| serialization anomaly. |
| |
| SSI is based on the observation [2] that each snapshot isolation |
| anomaly corresponds to a cycle that contains a "dangerous structure" |
| of two adjacent rw-conflict edges: |
| |
| Tin ------> Tpivot ------> Tout |
| rw rw |
| |
| SSI works by watching for this dangerous structure, and rolling |
| back a transaction when needed to prevent any anomaly. This means it |
| only needs to track rw-conflicts between concurrent transactions, not |
| wr- and ww-dependencies. It also means there is a risk of false |
| positives, because not every dangerous structure is embedded in an |
| actual cycle. The number of false positives is low in practice, so |
| this represents an acceptable tradeoff for keeping the detection |
| overhead low. |
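
Restricted to rw-conflicts, the dangerous-structure test reduces to
asking whether some transaction is a pivot: one with both an
rw-conflict in and an rw-conflict out. A minimal sketch (the real
implementation tracks this incrementally with flags and conflict
lists, not by scanning an edge set):

```python
def find_pivots(rw_edges):
    """Return transactions that have both an incoming and an outgoing
    rw-conflict edge -- the Tpivot of a potential dangerous structure.

    rw_edges is a set of (reader, writer) pairs, each meaning the
    reader appears to execute before the writer (reader -rw-> writer).
    """
    outgoing = {reader for reader, writer in rw_edges}
    incoming = {writer for reader, writer in rw_edges}
    return outgoing & incoming

# Tin -rw-> Tpivot -rw-> Tout: Tpivot is flagged as a pivot.
pivots = find_pivots({("Tin", "Tpivot"), ("Tpivot", "Tout")})
```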
| |
| The PostgreSQL implementation uses two additional optimizations: |
| |
| * Tout must commit before any other transaction in the cycle |
| (see proof of Theorem 2.1 of [2]). We only roll back a transaction |
| if Tout commits before Tpivot and Tin. |
| |
| * if Tin is read-only, there can only be an anomaly if Tout committed |
| before Tin takes its snapshot. This optimization is an original |
| one. Proof: |
| |
| - Because there is a cycle, there must be some transaction T0 that |
| precedes Tin in the cycle. (T0 might be the same as Tout.) |
| |
| - The edge between T0 and Tin can't be a rw-conflict or ww-dependency, |
| because Tin was read-only, so it must be a wr-dependency. |
| Those can only occur if T0 committed before Tin took its snapshot, |
| else Tin would have ignored T0's output. |
| |
| - Because Tout must commit before any other transaction in the |
| cycle, it must commit before T0 commits -- and thus before Tin |
| starts. |
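
Given (assumed) event timestamps for the three transactions, the two
optimizations can be sketched as pure ordering checks; this is an
illustration only, since the real implementation records these facts
as flags on in-memory transaction objects rather than timestamps:

```python
def must_abort(tout_commit, tpivot_commit, tin_snapshot,
               tin_commit, tin_read_only):
    """Decide whether a dangerous structure Tin -rw-> Tpivot -rw-> Tout
    could be part of a real cycle.  None means "hasn't happened yet".
    A sketch of the two commit-ordering optimizations only.
    """
    # Optimization 1: in any real cycle, Tout commits first.
    if tout_commit is None:
        return False
    if tpivot_commit is not None and tout_commit > tpivot_commit:
        return False
    if tin_commit is not None and tout_commit > tin_commit:
        return False
    # Optimization 2: with a read-only Tin, an anomaly also requires
    # that Tout committed before Tin took its snapshot.
    if tin_read_only and tout_commit > tin_snapshot:
        return False
    return True

# Read-only Tin whose snapshot predates Tout's commit: no abort needed.
safe = must_abort(tout_commit=30, tpivot_commit=None, tin_snapshot=10,
                  tin_commit=None, tin_read_only=True)
# Tout committed before Tin's snapshot: an anomaly is possible.
risky = must_abort(tout_commit=5, tpivot_commit=None, tin_snapshot=10,
                   tin_commit=None, tin_read_only=True)
```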
| |
| |
| PostgreSQL Implementation |
| ------------------------- |
| |
| * Since this technique is based on Snapshot Isolation (SI), those |
| areas in PostgreSQL which don't use SI can't be brought under SSI. |
| This includes system tables, temporary tables, sequences, hint bit |
rewrites, etc. SSI cannot eliminate existing anomalies in these
| areas. |
| |
| * Any transaction which is run at a transaction isolation level |
| other than SERIALIZABLE will not be affected by SSI. If you want to |
| enforce business rules through SSI, all transactions should be run at |
| the SERIALIZABLE transaction isolation level, and that should |
| probably be set as the default. |
| |
| * If all transactions are run at the SERIALIZABLE transaction |
| isolation level, business rules can be enforced in triggers or |
| application code without ever having a need to acquire an explicit |
| lock or to use SELECT FOR SHARE or SELECT FOR UPDATE. |
| |
| * Those who want to continue to use snapshot isolation without |
| the additional protections of SSI (and the associated costs of |
| enforcing those protections), can use the REPEATABLE READ transaction |
| isolation level. This level retains its legacy behavior, which |
| is identical to the old SERIALIZABLE implementation and fully |
| consistent with the standard's requirements for the REPEATABLE READ |
| transaction isolation level. |
| |
| * Performance under this SSI implementation will be significantly |
| improved if transactions which don't modify permanent tables are |
| declared to be READ ONLY before they begin reading data. |
| |
| * Performance under SSI will tend to degrade more rapidly with a |
| large number of active database transactions than under less strict |
| isolation levels. Limiting the number of active transactions through |
| use of a connection pool or similar techniques may be necessary to |
| maintain good performance. |
| |
| * Any transaction which must be rolled back to prevent |
| serialization anomalies will fail with SQLSTATE 40001, which has a |
| standard meaning of "serialization failure". |
| |
| * This SSI implementation makes an effort to choose the |
| transaction to be canceled such that an immediate retry of the |
| transaction will not fail due to conflicts with exactly the same |
| transactions. Pursuant to this goal, no transaction is canceled |
| until one of the other transactions in the set of conflicts which |
| could generate an anomaly has successfully committed. This is |
| conceptually similar to how write conflicts are handled. To fully |
| implement this guarantee there needs to be a way to roll back the |
| active transaction for another process with a serialization failure |
| SQLSTATE, even if it is "idle in transaction". |
| |
| |
| Predicate Locking |
| ----------------- |
| |
| Both S2PL and SSI require some form of predicate locking to handle |
| situations where reads conflict with later inserts or with later |
| updates which move data into the selected range. PostgreSQL didn't |
| already have predicate locking, so it needed to be added to support |
| full serializable transactions under either strategy. Practical |
| implementations of predicate locking generally involve acquiring |
| locks against data as it is accessed, using multiple granularities |
| (tuple, page, table, etc.) with escalation as needed to keep the lock |
| count to a number which can be tracked within RAM structures. This |
| approach was used in PostgreSQL. Coarse granularities can cause some |
| false positive indications of conflict. The number of false positives |
| can be influenced by plan choice. |
| |
| |
| Implementation overview |
| ----------------------- |
| |
| New RAM structures, inspired by those used to track traditional locks |
| in PostgreSQL, but tailored to the needs of SIREAD predicate locking, |
| are used. These refer to physical objects actually accessed in the |
| course of executing the query, to model the predicates through |
| inference. Anyone interested in this subject should review the |
| Hellerstein, Stonebraker and Hamilton paper [3], along with the |
| locking papers referenced from that and the Cahill papers. |
| |
| Because the SIREAD locks don't block, traditional locking techniques |
| have to be modified. Intent locking (locking higher level objects |
| before locking lower level objects) doesn't work with non-blocking |
| "locks" (which are, in some respects, more like flags than locks). |
| |
| A configurable amount of shared memory is reserved at postmaster |
| start-up to track predicate locks. This size cannot be changed |
| without a restart. |
| |
| To prevent resource exhaustion, multiple fine-grained locks may |
| be promoted to a single coarser-grained lock as needed. |
| |
| An attempt to acquire an SIREAD lock on a tuple when the same |
| transaction already holds an SIREAD lock on the page or the relation |
| will be ignored. Likewise, an attempt to lock a page when the |
| relation is locked will be ignored, and the acquisition of a coarser |
| lock will result in the automatic release of all finer-grained locks |
| it covers. |
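
The promotion logic can be sketched as below; the threshold and data
structures are illustrative only (the real code keeps these tables in
shared memory and uses different, configurable limits):

```python
# Illustrative granularity-promotion sketch.
PROMOTE_THRESHOLD = 3  # hypothetical: promote after this many tuple locks

class LocalLockTable:
    def __init__(self):
        self.tuple_locks = {}    # page -> set of tuple ids locked on it
        self.page_locks = set()  # pages locked as a whole
        self.rel_locked = False

    def lock_tuple(self, page, tid):
        if self.rel_locked or page in self.page_locks:
            return  # already covered by a coarser lock: ignored
        held = self.tuple_locks.setdefault(page, set())
        held.add(tid)
        if len(held) >= PROMOTE_THRESHOLD:
            # Promote: one page lock replaces the finer tuple locks,
            # which are automatically released.
            del self.tuple_locks[page]
            self.page_locks.add(page)

t = LocalLockTable()
for tid in (1, 2):
    t.lock_tuple(page=7, tid=tid)
before_promotion = len(t.tuple_locks.get(7, ()))  # two tuple locks held
t.lock_tuple(page=7, tid=3)  # third tuple lock triggers promotion
```

After promotion, further tuple-lock requests on page 7 are ignored
because the page lock already covers them.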
| |
| |
| Heap locking |
| ------------ |
| |
| Predicate locks will be acquired for the heap based on the following: |
| |
| * For a table scan, the entire relation will be locked. |
| |
| * Each tuple read which is visible to the reading transaction |
| will be locked, whether or not it meets selection criteria; except |
| that there is no need to acquire an SIREAD lock on a tuple when the |
| transaction already holds a write lock on any tuple representing the |
| row, since a rw-conflict would also create a ww-dependency which |
| has more aggressive enforcement and thus will prevent any anomaly. |
| |
| * Modifying a heap tuple creates a rw-conflict with any transaction |
that holds an SIREAD lock on that tuple, or on the page or relation
| that contains it. |
| |
| * Inserting a new tuple creates a rw-conflict with any transaction |
holding an SIREAD lock on the entire relation. It doesn't conflict with
| page-level locks, because page-level locks are only used to aggregate |
| tuple locks. Unlike index page locks, they don't lock "gaps" on the page. |
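
The conflict checks for a heap write can be sketched as a lookup
against all three granularities; the structures here are illustrative
(the real locks live in a shared-memory hash table keyed by lock
target):

```python
def conflicting_readers(siread_locks, rel, page, tid):
    """Return transactions whose SIREAD locks conflict with a write to
    the given heap tuple.  siread_locks maps a transaction to the set
    of lock targets it holds; a target is a (rel,), (rel, page), or
    (rel, page, tid) tuple, from coarsest to finest granularity.
    """
    targets = {(rel,), (rel, page), (rel, page, tid)}
    return {xact for xact, locks in siread_locks.items()
            if locks & targets}

locks = {
    "T1": {("accounts", 4, 2)},   # tuple-level SIREAD lock
    "T2": {("accounts",)},        # relation-level lock (e.g. a seqscan)
    "T3": {("orders", 1)},        # lock on a different table
}
# Modifying tuple 2 on page 4 of "accounts" conflicts with T1 and T2.
readers = conflicting_readers(locks, rel="accounts", page=4, tid=2)
# An insert checks only relation-level locks, because page-level locks
# merely aggregate tuple locks and don't cover "gaps":
insert_readers = {x for x, ls in locks.items() if ("accounts",) in ls}
```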
| |
| |
| Index AM implementations |
| ------------------------ |
| |
| Since predicate locks only exist to detect writes which conflict with |
| earlier reads, and heap tuple locks are acquired to cover all heap |
| tuples actually read, including those read through indexes, the index |
| tuples which were actually scanned are not of interest in themselves; |
| we only care about their "new neighbors" -- later inserts into the |
| index which would have been included in the scan had they existed at |
| the time. Conceptually, we want to lock the gaps between and |
| surrounding index entries within the scanned range. |
| |
| Correctness requires that any insert into an index generates a |
| rw-conflict with a concurrent serializable transaction if, after that |
| insert, re-execution of any index scan of the other transaction would |
| access the heap for a row not accessed during the previous execution. |
| Note that a non-HOT update which expires an old index entry covered |
| by the scan and adds a new entry for the modified row's new tuple |
| need not generate a conflict, although an update which "moves" a row |
| into the scan must generate a conflict. While correctness allows |
| false positives, they should be minimized for performance reasons. |
| |
| Several optimizations are possible, though not all are implemented yet: |
| |
| * An index scan which is just finding the right position for an |
| index insertion or deletion need not acquire a predicate lock. |
| |
| * An index scan which is comparing for equality on the entire key |
| for a unique index need not acquire a predicate lock as long as a key |
| is found corresponding to a visible tuple which has not been modified |
| by another transaction -- there are no "between or around" gaps to |
| cover. |
| |
| * As long as built-in foreign key enforcement continues to use |
| its current "special tricks" to deal with MVCC issues, predicate |
| locks should not be needed for scans done by enforcement code. |
| |
| * If a search determines that no rows can be found regardless of |
| index contents because the search conditions are contradictory (e.g., |
| x = 1 AND x = 2), then no predicate lock is needed. |
| |
| Other index AM implementation considerations: |
| |
| * For an index AM that doesn't have support for predicate locking, |
| we just acquire a predicate lock on the whole index for any search. |
| |
| * B-tree index searches acquire predicate locks only on the |
| index *leaf* pages needed to lock the appropriate index range. If, |
| however, a search discovers that no root page has yet been created, a |
| predicate lock on the index relation is required. |
| |
* As with B-trees, GIN searches acquire predicate locks only on the
leaf pages of the entry tree. When performing an equality scan, if an
entry has a posting tree, the posting tree root is locked instead, to
lock only that key value. However, fastupdate=on postpones the
insertion of tuples into the index structure by temporarily storing
them in the pending list. That makes it impossible to detect
rw-conflicts using page-level locks, so insertions into the pending
list conflict with all scans.
| |
* GiST searches can determine that there are no matches at any
level of the index, so we acquire a predicate lock at each index
level during a GiST search. An index insert at the leaf level can
then be trusted to ripple up to all levels and locations where
conflicting predicate locks may exist. In case of a page split, we
need to copy predicate locks from the original page to all the new
pages.
| |
* A hash index search acquires a predicate lock on the primary page
of a bucket, and locks both the old and new buckets for scans that
happen concurrently with page splits. During a bucket split, a
predicate lock is copied from the primary page of the old bucket to
the primary page of the new bucket.
| |
| * The effects of page splits, overflows, consolidations, and |
| removals must be carefully reviewed to ensure that predicate locks |
| aren't "lost" during those operations, or kept with pages which could |
| get re-used for different parts of the index. |
| |
| |
| Innovations |
| ----------- |
| |
| The PostgreSQL implementation of Serializable Snapshot Isolation |
| differs from what is described in the cited papers for several |
| reasons: |
| |
| 1. PostgreSQL didn't have any existing predicate locking. It had |
| to be added from scratch. |
| |
| 2. The existing in-memory lock structures were not suitable for |
| tracking SIREAD locks. |
| * In PostgreSQL, tuple level locks are not held in RAM for |
| any length of time; lock information is written to the tuples |
| involved in the transactions. |
| * In PostgreSQL, existing lock structures have pointers to |
| memory which is related to a session. SIREAD locks need to persist |
| past the end of the originating transaction and even the session |
| which ran it. |
| * PostgreSQL needs to be able to tolerate a large number of |
| transactions executing while one long-running transaction stays open |
| -- the in-RAM techniques discussed in the papers wouldn't support |
| that. |
| |
| 3. Unlike the database products used for the prototypes described |
| in the papers, PostgreSQL didn't already have a true serializable |
| isolation level distinct from snapshot isolation. |
| |
| 4. PostgreSQL supports subtransactions -- an issue not mentioned |
| in the papers. |
| |
| 5. PostgreSQL doesn't assign a transaction number to a database |
| transaction until and unless necessary (normally, when the transaction |
| attempts to modify data). |
| |
| 6. PostgreSQL has pluggable data types with user-definable |
| operators, as well as pluggable index types, not all of which are |
| based around data types which support ordering. |
| |
| 7. Some possible optimizations became apparent during development |
| and testing. |
| |
| Differences from the implementation described in the papers are |
| listed below. |
| |
| * New structures needed to be created in shared memory to track |
| the proper information for serializable transactions and their SIREAD |
| locks. |
| |
| * Because PostgreSQL does not have the same concept of an "oldest |
| transaction ID" for all serializable transactions as assumed in the |
| Cahill thesis, we track the oldest snapshot xmin among serializable |
| transactions, and a count of how many active transactions use that |
| xmin. When the count hits zero we find the new oldest xmin and run a |
| clean-up based on that. |
| |
| * Because reads in a subtransaction may cause that subtransaction |
| to roll back, thereby affecting what is written by the top level |
| transaction, predicate locks must survive a subtransaction rollback. |
| As a consequence, all xid usage in SSI, including predicate locking, |
| is based on the top level xid. When looking at an xid that comes |
| from a tuple's xmin or xmax, for example, we always call |
| SubTransGetTopmostTransaction() before doing much else with it. |
| |
| * PostgreSQL does not use "update in place" with a rollback log |
| for its MVCC implementation. Where possible it uses "HOT" updates on |
| the same page (if there is room and no indexed value is changed). |
| For non-HOT updates the old tuple is expired in place and a new tuple |
| is inserted at a new location. Because of this difference, a tuple |
| lock in PostgreSQL doesn't automatically lock any other versions of a |
| row. We don't try to copy or expand a tuple lock to any other |
| versions of the row, based on the following proof that any additional |
| serialization failures we would get from that would be false |
| positives: |
| |
| o If transaction T1 reads a row version (thus acquiring a |
| predicate lock on it) and a second transaction T2 updates that row |
| version (thus creating a rw-conflict graph edge from T1 to T2), must a |
| third transaction T3 which re-updates the new version of the row also |
| have a rw-conflict in from T1 to prevent anomalies? In other words, |
| does it matter whether we recognize the edge T1 -> T3? |
| |
| o If T1 has a conflict in, it certainly doesn't. Adding the |
| edge T1 -> T3 would create a dangerous structure, but we already had |
| one from the edge T1 -> T2, so we would have aborted something anyway. |
| (T2 has already committed, else T3 could not have updated its output; |
| but we would have aborted either T1 or T1's predecessor(s). Hence |
| no cycle involving T1 and T3 can survive.) |
| |
| o Now let's consider the case where T1 doesn't have a |
| rw-conflict in. If that's the case, for this edge T1 -> T3 to make a |
| difference, T3 must have a rw-conflict out that induces a cycle in the |
| dependency graph, i.e. a conflict out to some transaction preceding T1 |
| in the graph. (A conflict out to T1 itself would be problematic too, |
| but that would mean T1 has a conflict in, the case we already |
| eliminated.) |
| |
| o So now we're trying to figure out if there can be an |
| rw-conflict edge T3 -> T0, where T0 is some transaction that precedes |
| T1. For T0 to precede T1, there has to be some edge, or sequence of |
| edges, from T0 to T1. At least the last edge has to be a wr-dependency |
| or ww-dependency rather than a rw-conflict, because T1 doesn't have a |
| rw-conflict in. And that gives us enough information about the order |
| of transactions to see that T3 can't have a rw-conflict to T0: |
| - T0 committed before T1 started (the wr/ww-dependency implies this) |
| - T1 started before T2 committed (the T1->T2 rw-conflict implies this) |
| - T2 committed before T3 started (otherwise, T3 would get aborted |
| because of an update conflict) |
| |
| o That means T0 committed before T3 started, and therefore |
| there can't be a rw-conflict from T3 to T0. |
| |
| o So in all cases, we don't need the T1 -> T3 edge to |
| recognize cycles. Therefore it's not necessary for T1's SIREAD lock |
| on the original tuple version to cover later versions as well. |
| |
| * Predicate locking in PostgreSQL starts at the tuple level |
| when possible. Multiple fine-grained locks are promoted to a single |
| coarser-granularity lock as needed to avoid resource exhaustion. The |
| amount of memory used for these structures is configurable, to balance |
| RAM usage against SIREAD lock granularity. |
| |
| * Each backend keeps a process-local table of the locks it holds. |
| To support granularity promotion decisions with low CPU and locking |
| overhead, this table also includes the coarser covering locks and the |
| number of finer-granularity locks they cover. |
| |
| * Conflicts are identified by looking for predicate locks |
| when tuples are written, and by looking at the MVCC information when |
| tuples are read. There is no matching between two RAM-based locks. |
| |
| * Because write locks are stored in the heap tuples rather than a |
| RAM-based lock table, the optimization described in the Cahill thesis |
| which eliminates an SIREAD lock where there is a write lock is |
| implemented by the following: |
| 1. When checking a heap write for conflicts against existing |
| predicate locks, a tuple lock on the tuple being written is removed. |
| 2. When acquiring a predicate lock on a heap tuple, we |
| return quickly without doing anything if it is a tuple written by the |
| reading transaction. |
| |
| * Rather than using conflictIn and conflictOut pointers which use |
| NULL to indicate no conflict and a self-reference to indicate |
| multiple conflicts or conflicts with committed transactions, we use a |
| list of rw-conflicts. With the more complete information, false |
| positives are reduced and we have sufficient data for more aggressive |
| clean-up and other optimizations: |
| |
| o We can avoid ever rolling back a transaction until and |
| unless there is a pivot where a transaction on the conflict *out* |
| side of the pivot committed before either of the other transactions. |
| |
| o We can avoid ever rolling back a transaction when the |
| transaction on the conflict *in* side of the pivot is explicitly or |
| implicitly READ ONLY unless the transaction on the conflict *out* |
| side of the pivot committed before the READ ONLY transaction acquired |
| its snapshot. (An implicit READ ONLY transaction is one which |
| committed without writing, even though it was not explicitly declared |
| to be READ ONLY.) |
| |
| o We can more aggressively clean up conflicts, predicate |
| locks, and SSI transaction information. |
| |
| * We allow a READ ONLY transaction to "opt out" of SSI if there are |
| no READ WRITE transactions which could cause the READ ONLY |
| transaction to ever become part of a "dangerous structure" of |
| overlapping transaction dependencies. |
| |
| * We allow the user to request that a READ ONLY transaction wait |
| until the conditions are right for it to start in the "opt out" state |
described above. We add a DEFERRABLE state to transactions, which is
specified and maintained in a way similar to READ ONLY. It is
ignored for transactions that are not both SERIALIZABLE and READ ONLY.
| |
| * When a transaction must be rolled back, we pick among the |
| active transactions such that an immediate retry will not fail again |
| on conflicts with the same transactions. |
| |
| * We use the PostgreSQL SLRU system to hold summarized |
| information about older committed transactions to put an upper bound |
| on RAM used. Beyond that limit, information spills to disk. |
| Performance can degrade in a pessimal situation, but it should be |
| tolerable, and transactions won't need to be canceled or blocked |
| from starting. |
| |
| |
| R&D Issues |
| ---------- |
| |
| This is intended to be the place to record specific issues which need |
| more detailed review or analysis. |
| |
| * WAL file replay. While serializable implementations using S2PL |
| can guarantee that the write-ahead log contains commits in a sequence |
| consistent with some serial execution of serializable transactions, |
| SSI cannot make that guarantee. While the WAL replay is no less |
| consistent than under snapshot isolation, it is possible that under |
| PITR recovery or hot standby a database could reach a readable state |
| where some transactions appear before other transactions which would |
| have had to precede them to maintain serializable consistency. In |
| essence, if we do nothing, WAL replay will be at snapshot isolation |
| even for serializable transactions. Is this OK? If not, how do we |
| address it? |
| |
| * External replication. Look at how this impacts external |
| replication solutions, like Postgres-R, Slony, pgpool, HS/SR, etc. |
| This is related to the "WAL file replay" issue. |
| |
| * UNIQUE btree search for equality on all columns. Since a search |
| of a UNIQUE index using equality tests on all columns will lock the |
| heap tuple if an entry is found, it appears that there is no need to |
| get a predicate lock on the index in that case. A predicate lock is |
| still needed for such a search if a matching index entry which points |
| to a visible tuple is not found. |
| |
| * Minimize touching of shared memory. Should lists in shared |
| memory push entries which have just been returned to the front of the |
| available list, so they will be popped back off soon and some memory |
| might never be touched, or should we keep adding returned items to |
| the end of the available list? |
| |
| |
| References |
| ---------- |
| |
| [1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt |
| Search for serial execution to find the relevant section. |
| |
| [2] A. Fekete et al. Making Snapshot Isolation Serializable. In ACM |
| Transactions on Database Systems 30:2, Jun. 2005. |
| http://dx.doi.org/10.1145/1071610.1071615 |
| |
| [3] Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007. |
| Architecture of a Database System. Foundations and Trends(R) in |
| Databases Vol. 1, No. 2 (2007) 141-259. |
| http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf |
| Of particular interest: |
| * 6.1 A Note on ACID |
| * 6.2 A Brief Review of Serializability |
| * 6.3 Locking and Latching |
| * 6.3.1 Transaction Isolation Levels |
| * 6.5.3 Next-Key Locking: Physical Surrogates for Logical Properties |