| Design for Hybrid SQL/CLFS-Based Store in Qpid | |
| ============================================== | |
| CLFS (Common Log File System) is a new facility in recent Windows versions. | |
| CLFS is an ARIES-compliant log intended to support high performance and | |
| transactional applications. CLFS is available in Windows Server 2003R2 and | |
| higher, as well as Windows Vista and Windows 7. | |
| There is currently an all-SQL store in Qpid. The new hybrid SQL-CLFS store | |
| moves the message, messages-mapping to queues, and transaction aspects | |
| of the SQL store into CLFS logs. Records of queues, exchanges, bindings, | |
| and configurations will remain in SQL. The main goal of this change is | |
| to yield higher performance on the time-critical messaging operations. | |
| CLFS and, therefore, the new hybrid store, is not available on Windows XP | |
| and Windows Server prior to 2003R2; these platforms will need to run the | |
| all-SQL store. | |
| Note for future consideration: it is possible to maintain all durable | |
| objects in CLFS, which would remove the need for SQL completely. It would | |
| require added log handling as well as the logic to ensure referential | |
| integrity between exchanges and queues via bindings as SQL does today. | |
| Also, the CLFS store counts on the SQL-stored queue records being correct | |
| when recovering messages; if a message operation in the log refers to a queue | |
| ID that's unknown, the CLFS store assumes the queue was deleted in the | |
| previous broker session and the log wasn't updated. That sort of assumption | |
| would need to be revisited if all content moves to a log. | |
| CLFS Capabilities | |
| ----------------- | |
| This section explains some of the key CLFS concepts that are important | |
| in order to understand the designed use of CLFS for the store. It is | |
| not a complete explanation and is not feature-complete. Please see the | |
| CLFS documentation at MSDN for complete details | |
| (http://msdn.microsoft.com/en-us/library/bb986747%28v=VS.85%29.aspx). | |
| CLFS provides logs; each log can be dedicated or multiplexed. A multiplexed | |
| log has multiple streams of independent log records; a dedicated log has | |
| only one stream. Each log uses containers to hold the actual data; a log | |
| requires a minimum of two containers, each of which must be at least 512KB. | |
| Thus, the smallest log possible is 1MB. They can, of course, be larger, but | |
| with 1 MB as minimum size for a log, they shouldn't be used willy-nilly. | |
| The maximum number of streams per log is approximately 100. | |
| As records are written to the log CLFS assigns Log Sequence Numbers (LSNs). | |
| The first valid LSN in a log stream is called the Base, or Tail. CLFS | |
| can automatically reclaim and reuse container space for the log as the | |
| base LSN is moved when records are no longer needed. When a log is multiplexed, | |
| a stream which doesn't move its tail can prevent CLFS from reclaiming space | |
| and cause the log to grow indefinitely. Thus, mixing streams which don't | |
| update (and, thus, move their tails) with streams that are very dynamic in | |
| a single log will probably cause the log to continue to expand even though | |
| much of the space will be unused. | |
| CLFS provides three LSN types that are used to chain records together: | |
| - Next: This is a forward sequence maintained by CLFS itself by the order | |
| records are put into the stream. | |
| - Undo-next, Undo-prev: These are backward-looking chains that are used | |
| to link a new record to some previous record(s) in the same stream. | |
| Also note that although log files are simply located in the file system, | |
| easily locatable, streams within a log are not easily known or listable | |
| outside of some application-specific recording of the stream names somewhere. | |
| Log Usage | |
| --------- | |
| There are two logs in use. | |
| - Message: Each message will be represented by a chain of log records. All | |
| messages will be intermixed in the same dedicated stream. Each portion of | |
| a message content (sometimes they are written in multiple chunks) as well | |
| as each operation involving a message (enqueue, dequeue, etc.) will be | |
| in a log record chained to the others related to the same message. | |
| - Transaction: Each transaction, local and distributed, will be represented | |
| by a chain of log records. The record content will denote the transaction | |
| as local or distributed. | |
| Both transaction and message logs use the LSN of the first record for a | |
| given object (message or transaction) as the persistence ID for that object. | |
| The LSN is a CLFS-maintained, always-increasing value that is 64 bits long, | |
| the same as a persistence ID. | |
| Log records that relate to a transaction or message previously logged use the | |
| log record undo-prev LSN to indicate which transaction/message the record | |
| relates to. | |
| Message Log Records | |
| ------------------- | |
| Message log records will be one of the following types: | |
| - Message-Start: the first (and possibly only) section of message content | |
| - Message-Chunk: second and succeeding message content chunks | |
| - Message-Delete: marks the end of the message's lifetime | |
| - Message-Enqueue: records the message's placement on a queue | |
| - Message-Dequeue: records the message's removal from a queue | |
| The LSN of the Message-Start record is the persistence ID for the message. | |
| The log record undo-prev LSN is used to link each subsequent record for that | |
| message to the Message-Start record. | |
| A message's sequence of log records is extended for each operation on that | |
| message, until the message is deleted whereupon a Message-Delete record is | |
| written. When the Message-Delete is written, the log's base LSN can be moved | |
| up to the next earliest message if the deleted one opens up a set of | |
| records at the tail of the log that are no longer needed. To help maintain | |
| the order and know when the base can be moved, the store keeps message | |
| information in a STL map whose key is the message ID (Message-Start LSN). | |
| Thus, the first entry in the map is the earliest ID/LSN in use. | |
| During recovery, messages still residing in the log can be ignored when the | |
| record sequence for the message ends with Message-Delete. Similarly, there | |
| may be log records for messages that are deleted; in this case the previous | |
| LSN won't be one that's still within the log and, therefore, there won't have | |
| been a Message Start record recovered and the record can be ignored. | |
| Transaction Log Records | |
| ----------------------- | |
| Transaction log records will be one of the following types: | |
| - Dtx-Start: Start of a distributed transaction | |
| - Tx-Start: Start of a local transaction | |
| - End: End of the transaction | |
| - Rollback: Marks that the transaction is rolled back | |
| - Prepare: Marks the dtx as prepared | |
| - Commit: Marks the transaction as committed | |
| - Delete: Notes that the transaction is no longer valid | |
| Transactions are also identified by the LSN of the start (Dtx-Start or | |
| Tx-Start) record. Successive records associated with the same transaction | |
| are linked backwards using the undo-prev LSN. | |
| The association between messages and transactions is maintained in the | |
| message log; if the message enqueue/dequeue operation is part of a transaction, | |
| the operation includes a transaction ID. The transaction log maintains the | |
| state of the transaction itself. Thus, each operation (enqueue, dequeue, | |
| prepare, rollback, commit) is a single log record. | |
| A few notes: | |
| - The transactions need to be recovered and sorted out prior to recovering | |
| the messages. The message recovery needs to know if a enqueue/dequeue | |
| associated with a transaction can be discarded or should be acted on. | |
| - Transaction IDs need to remain valid as long as any messages exist that | |
| refer to them. This prevents the problem of trying to recover a message | |
| with a transaction ID that doesn't exist - was it finalized? was it aborted? | |
| Reference to a missing transaction ID can be ignored with assurance that | |
| the message was deleted further along or the transaction would still be there. | |
| - Transaction IDs needing to be valid requires that a refcount be kept on each | |
| transaction at run time. As messages are deleted, the transaction set can | |
| be notified that the message is gone. To enforce this, Message objects have | |
| a boost::shared_ptr to each Transaction they're associated with. When the | |
| Message is destroyed, refs to Transactions go down too. When Transaction is | |
| destroyed, it's done so write its delete to the log. | |
| In-Memory Objects | |
| ----------------- | |
| The store holds the message and transaction relationships in memory. CLFS is | |
| a backing store for that information so it can be reliably reconstructed in | |
| the event of a failure. This is a change from the SQL-only store where all | |
| of the information is maintained in SQL and none is kept in memory. The | |
| CLFS-using store is designed for high-throughput operation where it is assumed | |
| that messages will transit the broker (and, therefore, the store) quickly. | |
| - Message list: this is a map of persistence ID (message LSN) to a list of | |
| queues where the message is located and an indication that there is | |
| (or isn't) a transaction involved and in which direction (enqueue/dequeue) | |
| so a dequeued message doesn't get deleted while a transacted enqueue is | |
| pending. | |
| - Transaction list: also probably a map of id/LSN to a transaction object. | |
| The transaction object needs to keep a list of messages/queues that are | |
| impacted as well as the transaction state and Xid (for dtx). | |
| - Right now log records are written as need with no preallocation or | |
| reservation. It may be better to pre-reserve records in some cases, such | |
| as a transaction prepare where the space for commit or rollback may be | |
| reserved at the same time. This may be the only case where losing a | |
| record may be an issue - needs some more thought. | |
| Recovery | |
| -------- | |
| During recovery, need to verify recovered messages' queues exist; if there's a | |
| failure after a queue's deletion is final but before the messages are recorded | |
| as dequeued (and possibly deleted) the remainder of those dequeues (and | |
| possibly deleting the message) needs to be handled during recovery by not | |
| restoring them for the broker, and also logging their deletion. Could also | |
| skip the logging of deletion and let the normal tail-maintenance eventually | |
| move up over the old message entries. Since the invalid messages won't be | |
| kept in the message map, their IDs won't be taken into account when maintaining | |
| the tail - the tail will move up over them as soon as enough messages come | |
| and go. | |
| Plugin Options | |
| -------------- | |
| The command-line options added by the CLFS plugin are; | |
| --connect The SQL connect string for the SQL parts; same as the | |
| SQL plugin. | |
| --catalog The SQL database (catalog) name; same as the SQL plugin. | |
| --store-dir The directory to store the logs in. Defaults to the | |
| broker --data-dir value. If --no-data-dir specified, | |
| --store-dir must be. | |
| --container-size The size of each container in the log, in bytes. The | |
| minimum size is 512K (smaller sizes will be rounded up). | |
| Additionally, the size will be rounded up to a multiple | |
| of the sector size on the disk holding the log. Once | |
| the log is created, each newly added container will | |
| be the same size as the initial container(s). Default | |
| is 1MB. | |
| --initial-containers The number of containers to populate a new log with | |
| if a new log is created. Ignored if the log exists. | |
| Default is 2. | |
| --max-write-buffers The maximum number of write buffers that the plugin can | |
| use before CLFS automatically flushes the log to disk. | |
| Lower values flush more often; higher values have | |
| higher performance. Default is 10. | |
| Maybe need an option to hold messages of a certain size in memory? I think | |
| maybe the broker proper holds the message content, so the store need not. | |
| Testing | |
| ------- | |
| More tests will need to be written to stress the log container extension | |
| capability and ensure that moving the base LSN works properly and the store | |
| doesn't continually grow the log without bounds. | |
| Note that running "qpid-perftest --durable yes" stresses the log extension | |
| and tail maintenance. It doesn't get run as a normal regression test but should | |
| be run when playing with the container/tail maintenance logic to ensure it's | |
| not broken. |