 #
 # Licensed to the Apache Software Foundation (ASF) under one
 # or more contributor license agreements.  See the NOTICE file
 # distributed with this work for additional information
 # regarding copyright ownership.  The ASF licenses this file
 # to you under the Apache License, Version 2.0 (the
 # "License"); you may not use this file except in compliance
 # with the License.  You may obtain a copy of the License at
 #
 #   http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing,
 # software distributed under the License is distributed on an
 # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
 #

 Design for Hybrid SQL/CLFS-Based Store in Qpid
 ==============================================

 CLFS (Common Log File System) is a new facility in recent Windows versions.
 CLFS is an ARIES-compliant log intended to support high performance and
 transactional applications. CLFS is available in Windows Server 2003R2 and
 higher, as well as Windows Vista and Windows 7.

 There is currently an all-SQL store in Qpid. The new hybrid SQL-CLFS store
 moves the messages, the message-to-queue mappings, and the transaction
 aspects of the SQL store into CLFS logs. Records of queues, exchanges,
 bindings, and configurations will remain in SQL. The main goal of this
 change is to yield higher performance on the time-critical messaging
 operations. CLFS and, therefore, the new hybrid store are not available on
 Windows XP and Windows Server prior to 2003R2; these platforms will need to
 run the all-SQL store.

 Note for future consideration: it is possible to maintain all durable
 objects in CLFS, which would remove the need for SQL completely. It would
 require added log handling as well as the logic to ensure referential
 integrity between exchanges and queues via bindings as SQL does today.
 Also, the CLFS store counts on the SQL-stored queue records being correct
 when recovering messages; if a message operation in the log refers to a queue
 ID that's unknown, the CLFS store assumes the queue was deleted in the
 previous broker session and the log wasn't updated. That sort of assumption
 would need to be revisited if all content moves to a log.

 CLFS Capabilities
 -----------------

 This section explains some of the key CLFS concepts that are important
 in order to understand the designed use of CLFS for the store. It is
 not a complete explanation and is not feature-complete. Please see the
 CLFS documentation at MSDN for complete details
 (http://msdn.microsoft.com/en-us/library/bb986747%28v=VS.85%29.aspx).

 CLFS provides logs; each log can be dedicated or multiplexed. A multiplexed
 log has multiple streams of independent log records; a dedicated log has
 only one stream. Each log uses containers to hold the actual data; a log
 requires a minimum of two containers, each of which must be at least 512KB.
 Thus, the smallest log possible is 1MB. Logs can, of course, be larger, but
 with 1MB as the minimum size for a log, they shouldn't be created casually.
 The maximum number of streams per log is approximately 100.

 As records are written to the log, CLFS assigns Log Sequence Numbers (LSNs).
 The first valid LSN in a log stream is called the Base, or Tail. CLFS
 can automatically reclaim and reuse container space for the log as the
 base LSN is moved when records are no longer needed. When a log is multiplexed,
 a stream which doesn't move its tail can prevent CLFS from reclaiming space
 and cause the log to grow indefinitely. Thus, mixing streams that rarely
 update (and, thus, rarely move their tails) with streams that are very
 dynamic in a single log will probably cause the log to continue to expand
 even though much of the space will be unused.

 CLFS provides three LSN types that are used to chain records together:

 - Next: This is a forward sequence maintained by CLFS itself, in the order
   in which records are put into the stream.
 - Undo-next, Undo-prev: These are backward-looking chains that are used
   to link a new record to some previous record(s) in the same stream.

 Also note that although log files are ordinary files that are easy to locate
 in the file system, the streams within a log cannot easily be discovered or
 listed unless the application records the stream names somewhere itself.

 Log Usage
 ---------

 There are two logs in use.

 - Message: Each message will be represented by a chain of log records. All
   messages will be intermixed in the same dedicated stream. Each portion of
   a message's content (content is sometimes written in multiple chunks) as
   well as each operation involving a message (enqueue, dequeue, etc.) will
   be in a log record chained to the others related to the same message.

 - Transaction: Each transaction, local and distributed, will be represented
   by a chain of log records. The record content will denote the transaction
   as local or distributed.

 Both transaction and message logs use the LSN of the first record for a
 given object (message or transaction) as the persistence ID for that object.
 The LSN is a CLFS-maintained, always-increasing value that is 64 bits long,
 the same as a persistence ID.

 Log records that relate to a transaction or message previously logged use the
 log record undo-prev LSN to indicate which transaction/message the record
 relates to.

 Message Log Records
 -------------------

 Message log records will be one of the following types:

 - Message-Start: the first (and possibly only) section of message content
 - Message-Chunk: second and succeeding message content chunks
 - Message-Delete: marks the end of the message's lifetime
 - Message-Enqueue: records the message's placement on a queue
 - Message-Dequeue: records the message's removal from a queue

 The LSN of the Message-Start record is the persistence ID for the message.
 The log record undo-prev LSN is used to link each subsequent record for that
 message to the Message-Start record.
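
 As an illustration only, the record types and the undo-prev linkage could
 be modeled as below. The names MsgRecordType, MsgRecord, and persistenceIdOf
 are hypothetical and are not taken from the store's source; LSNs are modeled
 as plain 64-bit integers rather than CLFS LSN values, and using 0 as the
 undo-prev value of a Message-Start record is an assumption.

```cpp
#include <cstdint>

// Hypothetical record types mirroring the list above.
enum class MsgRecordType : std::uint32_t {
    Start,    // Message-Start
    Chunk,    // Message-Chunk
    Delete,   // Message-Delete
    Enqueue,  // Message-Enqueue
    Dequeue   // Message-Dequeue
};

// Hypothetical in-memory view of one message log record.
struct MsgRecord {
    MsgRecordType type;
    std::uint64_t lsn;       // LSN assigned by the log when the record was appended
    std::uint64_t undoPrev;  // LSN of the Message-Start record (0 for Start itself; assumption)
    std::uint64_t queueId;   // meaningful only for Enqueue and Dequeue
};

// The persistence ID of the message a record belongs to: the record's own
// LSN for Message-Start, otherwise the LSN it links back to.
inline std::uint64_t persistenceIdOf(const MsgRecord& r) {
    return r.type == MsgRecordType::Start ? r.lsn : r.undoPrev;
}
```

 With this model, every record of a message maps to the same persistence ID
 without a separate ID field in the record body.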

 A message's sequence of log records is extended for each operation on that
 message until the message is deleted, whereupon a Message-Delete record is
 written. When the Message-Delete is written, the log's base LSN can be moved
 up to the next earliest message if the deleted one opens up a set of
 records at the tail of the log that are no longer needed. To help maintain
 the order and know when the base can be moved, the store keeps message
 information in an STL map whose key is the message ID (Message-Start LSN).
 Thus, the first entry in the map is the earliest ID/LSN in use.

 During recovery, messages still residing in the log can be ignored when the
 record sequence for the message ends with Message-Delete. Similarly, the
 log may still hold records for deleted messages whose Message-Start record
 has already fallen behind the base; in this case the undo-prev LSN won't be
 one that's still within the log, no Message-Start record will have been
 recovered for it, and the record can be ignored.
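
 The map-based tail maintenance described above can be sketched as follows.
 This is a minimal model, not the store's code: MessageMap, MessageInfo, and
 remove() are hypothetical names, LSNs are plain 64-bit integers, and using
 the Message-Delete record's own LSN as the new base when no messages remain
 is an assumption. Actually advancing the base (for example with the CLFS
 AdvanceLogBase call) is left out.

```cpp
#include <cstdint>
#include <map>
#include <set>

// Hypothetical per-message bookkeeping; the real store keeps more state.
struct MessageInfo {
    std::set<std::uint64_t> queues;  // queues the message is currently on
};

class MessageMap {
public:
    // Track a message by the LSN of its Message-Start record.
    void add(std::uint64_t startLsn) { messages_[startLsn]; }

    // Forget a deleted message. Returns true and sets newBase only when the
    // erased message was the earliest one, i.e. when the base LSN can move.
    // deleteLsn is the LSN of the Message-Delete record just written; it is
    // used as the new base when the map becomes empty (assumption).
    bool remove(std::uint64_t startLsn, std::uint64_t deleteLsn,
                std::uint64_t& newBase) {
        auto it = messages_.find(startLsn);
        if (it == messages_.end()) return false;
        const bool wasEarliest = (it == messages_.begin());
        messages_.erase(it);
        if (!wasEarliest) return false;
        // std::map is ordered by key, so begin() is the earliest live LSN.
        newBase = messages_.empty() ? deleteLsn : messages_.begin()->first;
        return true;
    }

private:
    std::map<std::uint64_t, MessageInfo> messages_;
};
```

 Deleting a message in the middle of the map never moves the base; only
 deleting the earliest live message does, which is why one long-lived
 message can hold the tail in place.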

 Transaction Log Records
 -----------------------

 Transaction log records will be one of the following types:

 - Dtx-Start: Start of a distributed transaction
 - Tx-Start: Start of a local transaction
 - End: End of the transaction
 - Rollback: Marks that the transaction is rolled back
 - Prepare: Marks the dtx as prepared
 - Commit: Marks the transaction as committed
 - Delete: Notes that the transaction is no longer valid

 Transactions are also identified by the LSN of the start (Dtx-Start or
 Tx-Start) record. Successive records associated with the same transaction
 are linked backwards using the undo-prev LSN.

 The association between messages and transactions is maintained in the
 message log; if the message enqueue/dequeue operation is part of a transaction,
 the operation includes a transaction ID. The transaction log maintains the
 state of the transaction itself. Thus, each operation (enqueue, dequeue,
 prepare, rollback, commit) is a single log record.
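
 To illustrate how the state kept in the transaction log might be combined
 with a transacted operation found in the message log, here is a small
 sketch. The enum names and the resolve() function are hypothetical, and the
 policy shown (an unprepared, unfinished transaction is treated like a
 rollback; a prepared one is left in doubt) is an assumption, not something
 this design specifies.

```cpp
// Hypothetical summary of a transaction's last known state.
enum class TxState { Open, Prepared, Committed, RolledBack };

// The transacted operation recorded in the message log.
enum class MsgOp { Enqueue, Dequeue };

// What recovery concludes about the message/queue pairing.
enum class Outcome { OnQueue, OffQueue, InDoubt };

inline Outcome resolve(MsgOp op, TxState state) {
    switch (state) {
    case TxState::Committed:
        // The operation took effect.
        return op == MsgOp::Enqueue ? Outcome::OnQueue : Outcome::OffQueue;
    case TxState::Prepared:
        // Must wait for the coordinator's commit or rollback.
        return Outcome::InDoubt;
    case TxState::Open:
    case TxState::RolledBack:
        break;
    }
    // The operation is undone: an enqueue never happened, and a dequeue
    // leaves the message where it was.
    return op == MsgOp::Enqueue ? Outcome::OffQueue : Outcome::OnQueue;
}
```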

 A few notes:
 - The transactions need to be recovered and sorted out prior to recovering
   the messages. The message recovery needs to know if an enqueue/dequeue
   associated with a transaction can be discarded or should be acted on.

 - Transaction IDs need to remain valid as long as any messages exist that
   refer to them. This prevents the problem of trying to recover a message
   with a transaction ID that doesn't exist - was it finalized? was it aborted?
   A reference to a missing transaction ID can then be ignored with assurance
   that the message was deleted further along; otherwise the transaction
   would still be there.

 - Keeping transaction IDs valid requires that a refcount be kept on each
   transaction at run time. As messages are deleted, the transaction set can
   be notified that the message is gone. To enforce this, Message objects have
   a boost::shared_ptr to each Transaction they're associated with. When the
   Message is destroyed, the refcounts on its Transactions go down too. When
   a Transaction is destroyed, it is finished, so its Delete record is
   written to the log.
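
 The reference-counting idea in the last note can be sketched as follows.
 This is a simplified model: std::shared_ptr stands in for boost::shared_ptr
 so the example needs no external library, TxLog is a hypothetical stand-in
 for the transaction log, and the Transaction and Message classes here carry
 only what the example needs.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for the transaction log; it only remembers which
// transaction IDs have had a Delete record written.
struct TxLog {
    std::vector<std::uint64_t> deleted;
    void writeDelete(std::uint64_t txId) { deleted.push_back(txId); }
};

class Transaction {
public:
    Transaction(std::uint64_t id, TxLog& log) : id_(id), log_(log) {}
    // The last reference is gone, so nothing refers to this ID any more:
    // record the transaction's Delete in the log.
    ~Transaction() { log_.writeDelete(id_); }
    Transaction(const Transaction&) = delete;
    Transaction& operator=(const Transaction&) = delete;
    std::uint64_t id() const { return id_; }

private:
    std::uint64_t id_;
    TxLog& log_;
};

// A message holds a shared reference to every transaction it is part of.
struct Message {
    std::vector<std::shared_ptr<Transaction>> transactions;
};
```

 Because the Delete is written from the destructor, no explicit bookkeeping
 is needed to decide when a transaction ID may disappear from the log; the
 log object must simply outlive every Transaction that refers to it.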

 In-Memory Objects
 -----------------

 The store holds the message and transaction relationships in memory. CLFS is
 a backing store for that information so it can be reliably reconstructed in
 the event of a failure. This is a change from the SQL-only store where all
 of the information is maintained in SQL and none is kept in memory. The
 CLFS-using store is designed for high-throughput operation where it is assumed
 that messages will transit the broker (and, therefore, the store) quickly.

 - Message list: this is a map of persistence ID (message LSN) to a list of
   queues where the message is located and an indication that there is
   (or isn't) a transaction involved and in which direction (enqueue/dequeue)
   so a dequeued message doesn't get deleted while a transacted enqueue is
   pending.

 - Transaction list: also probably a map of id/LSN to a transaction object.
   The transaction object needs to keep a list of messages/queues that are
   impacted as well as the transaction state and Xid (for dtx).

 - Right now log records are written as needed with no preallocation or
   reservation. It may be better to pre-reserve records in some cases, such
   as a transaction prepare where the space for commit or rollback may be
   reserved at the same time. This may be the only case where losing a
   record may be an issue - needs some more thought.

 Recovery
 --------

 During recovery, the store needs to verify that the queues of recovered
 messages exist. If there's a failure after a queue's deletion is final but
 before all of its messages are recorded as dequeued (and possibly deleted),
 the remaining dequeues (and possible message deletions) need to be handled
 during recovery by not restoring those messages for the broker and by
 logging their deletion. Alternatively, recovery could skip the logging of
 deletion and let the normal tail maintenance eventually move up over the
 old message entries. Since the invalid messages won't be kept in the
 message map, their IDs won't be taken into account when maintaining the
 tail - the tail will move up over them as soon as enough messages come
 and go.
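
 A sketch of the queue check described above, under stated assumptions:
 RecoveredMessages and pruneUnknownQueues are hypothetical names, the set of
 known queue IDs is assumed to have been loaded from SQL beforehand, and the
 sketch takes the second approach (drop the message from the in-memory map
 and leave the log to tail maintenance) rather than logging a deletion.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>

// Hypothetical shape of recovered state: persistence ID -> IDs of the
// queues the message log says the message is on.
using RecoveredMessages = std::map<std::uint64_t, std::set<std::uint64_t>>;

// Remove references to queues that are not in knownQueues. A message that
// loses its last queue this way is dropped from the map so that it no
// longer holds back the tail. Returns the number of messages dropped.
std::size_t pruneUnknownQueues(RecoveredMessages& msgs,
                               const std::set<std::uint64_t>& knownQueues) {
    std::size_t dropped = 0;
    for (auto m = msgs.begin(); m != msgs.end(); ) {
        std::set<std::uint64_t>& queues = m->second;
        bool removedAny = false;
        for (auto q = queues.begin(); q != queues.end(); ) {
            if (knownQueues.count(*q) == 0) {
                q = queues.erase(q);  // queue was deleted before the failure
                removedAny = true;
            } else {
                ++q;
            }
        }
        if (removedAny && queues.empty()) {
            m = msgs.erase(m);
            ++dropped;
        } else {
            ++m;
        }
    }
    return dropped;
}
```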

 Plugin Options
 --------------

 The command-line options added by the CLFS plugin are:

   --connect             The SQL connect string for the SQL parts; same as the
                         SQL plugin.
   --catalog             The SQL database (catalog) name; same as the SQL plugin.
   --store-dir           The directory to store the logs in. Defaults to the
                         broker --data-dir value. If --no-data-dir is
                         specified, --store-dir must be specified too.
   --container-size      The size of each container in the log, in bytes. The
                         minimum size is 512K (smaller sizes will be rounded up).
                         Additionally, the size will be rounded up to a multiple
                         of the sector size on the disk holding the log. Once
                         the log is created, each newly added container will
                         be the same size as the initial container(s). Default
                         is 1MB.
   --initial-containers  The number of containers to populate a new log with
                         if a new log is created. Ignored if the log exists.
                         Default is 2.
   --max-write-buffers   The maximum number of write buffers that the plugin can
                         use before CLFS automatically flushes the log to disk.
                         Lower values flush more often; higher values have
                         higher performance. Default is 10.

   Maybe need an option to hold messages of a certain size in memory? I think
   maybe the broker proper holds the message content, so the store need not.
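
 The --container-size rounding rules listed above can be expressed as a
 small calculation. The helper name effectiveContainerSize and the constant
 kMinContainerSize are hypothetical; the sector size is passed in as a
 parameter here, whereas a real plugin would have to query it from the
 volume holding the log.

```cpp
#include <cstdint>

// Minimum container size from the option description (512K).
constexpr std::uint64_t kMinContainerSize = 512 * 1024;

// Apply the documented rounding to a requested container size:
// first raise it to the minimum, then round up to a whole number of
// sectors. sectorSize must be non-zero.
inline std::uint64_t effectiveContainerSize(std::uint64_t requested,
                                            std::uint64_t sectorSize) {
    std::uint64_t size =
        requested < kMinContainerSize ? kMinContainerSize : requested;
    const std::uint64_t remainder = size % sectorSize;
    if (remainder != 0) {
        size += sectorSize - remainder;
    }
    return size;
}
```

 For example, a request of 1000 bytes on a disk with 512-byte sectors yields
 524288 bytes (the 512K minimum), and the 1MB default is unchanged on a disk
 with 4096-byte sectors.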

 Testing
 -------

 More tests will need to be written to stress the log container extension
 capability and ensure that moving the base LSN works properly and the store
 doesn't continually grow the log without bounds.

 Note that running "qpid-perftest --durable yes" stresses the log extension
 and tail maintenance. It doesn't get run as a normal regression test but should
 be run when playing with the container/tail maintenance logic to ensure it's
 not broken.