#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
Design for Hybrid SQL/CLFS-Based Store in Qpid | |
============================================== | |
CLFS (Common Log File System) is a new facility in recent Windows versions. | |
CLFS is an ARIES-compliant log intended to support high performance and | |
transactional applications. CLFS is available in Windows Server 2003R2 and | |
higher, as well as Windows Vista and Windows 7. | |
There is currently an all-SQL store in Qpid. The new hybrid SQL-CLFS store
moves the message, message-to-queue mapping, and transaction aspects
of the SQL store into CLFS logs. Records of queues, exchanges, bindings,
and configurations will remain in SQL. The main goal of this change is
to yield higher performance on the time-critical messaging operations.
CLFS and, therefore, the new hybrid store are not available on Windows XP
and Windows Server versions prior to 2003R2; these platforms will need to run
the all-SQL store.
Note for future consideration: it is possible to maintain all durable | |
objects in CLFS, which would remove the need for SQL completely. It would | |
require added log handling as well as the logic to ensure referential | |
integrity between exchanges and queues via bindings as SQL does today. | |
Also, the CLFS store counts on the SQL-stored queue records being correct | |
when recovering messages; if a message operation in the log refers to a queue | |
ID that's unknown, the CLFS store assumes the queue was deleted in the | |
previous broker session and the log wasn't updated. That sort of assumption | |
would need to be revisited if all content moves to a log. | |
CLFS Capabilities | |
----------------- | |
This section explains the key CLFS concepts needed to understand how the
store uses CLFS. It is not a complete description of CLFS. Please see the
CLFS documentation at MSDN for complete details
(http://msdn.microsoft.com/en-us/library/bb986747%28v=VS.85%29.aspx).
CLFS provides logs; each log can be dedicated or multiplexed. A multiplexed
log has multiple streams of independent log records; a dedicated log has
only one stream. Each log uses containers to hold the actual data; a log
requires a minimum of two containers, each of which must be at least 512KB.
Thus, the smallest possible log is 1MB. Logs can, of course, be larger, but
with a 1MB minimum size, logs should not be created casually.
The maximum number of streams per log is approximately 100.
As records are written to the log, CLFS assigns Log Sequence Numbers (LSNs).
The first valid LSN in a log stream is called the Base, or Tail. CLFS
can automatically reclaim and reuse container space for the log as the
base LSN is moved when records are no longer needed. When a log is multiplexed,
a stream which doesn't move its tail can prevent CLFS from reclaiming space
and cause the log to grow indefinitely. Thus, mixing streams which rarely
update (and, thus, rarely move their tails) with streams that are very dynamic
in a single log will probably cause the log to continue to expand even though
much of the space will be unused.
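The effect of a stalled stream can be illustrated with a simplified model.
This is not the CLFS API; it assumes only that space can be reused below the
lowest tail of any stream in the log. The names StreamTails and
reclaimableBelow are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Simplified model (an assumption, not the CLFS API): each stream in a
// multiplexed log has a tail LSN, and container space can only be reused
// below the lowest tail of any stream.
typedef std::map<std::string, std::uint64_t> StreamTails;

// Returns the LSN below which space could be reclaimed, or 0 if the log
// has no streams.
std::uint64_t reclaimableBelow(const StreamTails& tails) {
    if (tails.empty()) return 0;
    std::uint64_t lowest = tails.begin()->second;
    for (StreamTails::const_iterator i = tails.begin(); i != tails.end(); ++i)
        lowest = std::min(lowest, i->second);
    return lowest;
}
```

In this model a busy stream can advance its tail as far as it likes; the
reclaimable point stays pinned at the tail of the stream that never moves.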
CLFS provides three LSN types that are used to chain records together: | |
- Next: This is a forward sequence maintained by CLFS itself by the order | |
records are put into the stream. | |
- Undo-next, Undo-prev: These are backward-looking chains that are used | |
to link a new record to some previous record(s) in the same stream. | |
Also note that although log files are ordinary files that are easy to locate
in the file system, the streams within a log cannot easily be discovered or
listed unless the application records the stream names somewhere itself.
Log Usage | |
--------- | |
There are two logs in use. | |
- Message: Each message will be represented by a chain of log records. All
messages will be intermixed in the same dedicated stream. Each portion of
a message's content (content is sometimes written in multiple chunks) as well
as each operation involving a message (enqueue, dequeue, etc.) will be
in a log record chained to the others related to the same message.
- Transaction: Each transaction, local and distributed, will be represented | |
by a chain of log records. The record content will denote the transaction | |
as local or distributed. | |
Both transaction and message logs use the LSN of the first record for a | |
given object (message or transaction) as the persistence ID for that object. | |
The LSN is a CLFS-maintained, always-increasing value that is 64 bits long, | |
the same as a persistence ID. | |
Log records that relate to a transaction or message previously logged use the | |
log record undo-prev LSN to indicate which transaction/message the record | |
relates to. | |
Message Log Records | |
------------------- | |
Message log records will be one of the following types: | |
- Message-Start: the first (and possibly only) section of message content | |
- Message-Chunk: second and succeeding message content chunks | |
- Message-Delete: marks the end of the message's lifetime | |
- Message-Enqueue: records the message's placement on a queue | |
- Message-Dequeue: records the message's removal from a queue | |
The LSN of the Message-Start record is the persistence ID for the message. | |
The log record undo-prev LSN is used to link each subsequent record for that | |
message to the Message-Start record. | |
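As a minimal sketch of how the persistence ID follows from this linking,
assuming a hypothetical in-memory view of a record (the type and field names
below are illustrative and are not taken from the Qpid source or the CLFS
API):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record types mirroring the list above.
enum MessageRecordType {
    MessageStart, MessageChunk, MessageDelete, MessageEnqueue, MessageDequeue
};

// Hypothetical in-memory view of one message log record.
struct MessageRecord {
    MessageRecordType type;
    std::uint64_t lsn;       // LSN assigned by CLFS when the record was written
    std::uint64_t undoPrev;  // LSN of the Message-Start record; unused on
                             // the Message-Start record itself
};

// The persistence ID is the LSN of the Message-Start record: the record's
// own LSN for Message-Start, and the undo-prev link for every other type.
std::uint64_t persistenceId(const MessageRecord& r) {
    return r.type == MessageStart ? r.lsn : r.undoPrev;
}
```

With this layout, any record read from the log can be mapped to its message
without following a chain of intermediate records.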
A message's sequence of log records is extended for each operation on that | |
message, until the message is deleted whereupon a Message-Delete record is | |
written. When the Message-Delete is written, the log's base LSN can be moved | |
up to the next earliest message if the deleted one opens up a set of | |
records at the tail of the log that are no longer needed. To help maintain | |
the order and know when the base can be moved, the store keeps message | |
information in an STL map whose key is the message ID (Message-Start LSN).
Thus, the first entry in the map is the earliest ID/LSN in use. | |
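The map-based bookkeeping can be sketched as follows. This is an
illustration under assumptions, not the store's actual class: MessageMap,
MessageInfo and the method names are hypothetical, and moving the base LSN
in CLFS itself is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical per-message state; the real store would track queues and
// transactions here.
struct MessageInfo {};

class MessageMap {
    typedef std::map<std::uint64_t, MessageInfo> Map;
    Map messages;

public:
    // Track a new message under its persistence ID (Message-Start LSN).
    void add(std::uint64_t id) { messages[id] = MessageInfo(); }

    // Forget a deleted message. Returns true when the removed entry was the
    // earliest one, which is when the log's base LSN may be able to move up.
    bool remove(std::uint64_t id) {
        Map::iterator i = messages.find(id);
        if (i == messages.end()) return false;
        const bool wasEarliest = (i == messages.begin());
        messages.erase(i);
        return wasEarliest;
    }

    // Stores the earliest LSN still in use in lsn. Returns false if no
    // messages are tracked.
    bool earliest(std::uint64_t& lsn) const {
        if (messages.empty()) return false;
        lsn = messages.begin()->first;
        return true;
    }
};
```

Because std::map keeps its keys ordered, finding the earliest LSN still in
use is a lookup of the first element rather than a scan.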
During recovery, messages still residing in the log can be ignored when the
record sequence for the message ends with Message-Delete. Similarly, there
may be log records left for deleted messages whose Message-Start record is no
longer in the log; in this case the record's undo-prev LSN refers to a point
before the log's base, so no Message-Start record will have been recovered
for it and the record can be ignored.
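The two cases can be sketched as a single pass over the records in LSN
order. This is a simplified model under assumptions: records are already
read into memory, and Rec, RecoveryResult and recover are hypothetical
names rather than anything from the store's source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

enum RecType { Start, Chunk, Delete, Enqueue, Dequeue };

// Hypothetical in-memory view of one message log record.
struct Rec {
    RecType type;
    std::uint64_t lsn;       // this record's LSN
    std::uint64_t undoPrev;  // LSN of the message's Message-Start record
};

struct RecoveryResult {
    std::set<std::uint64_t> live;  // persistence IDs to restore
    std::size_t ignored;           // records with no recovered Message-Start
};

// Walks the records in LSN order and works out which messages survive.
RecoveryResult recover(const std::vector<Rec>& log) {
    RecoveryResult r;
    r.ignored = 0;
    for (std::size_t i = 0; i < log.size(); ++i) {
        const Rec& rec = log[i];
        if (rec.type == Start) {
            r.live.insert(rec.lsn);
        } else if (r.live.count(rec.undoPrev) == 0) {
            ++r.ignored;  // Message-Start not recovered: skip the record
        } else if (rec.type == Delete) {
            r.live.erase(rec.undoPrev);  // chain ends with Message-Delete
        }
    }
    return r;
}
```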
Transaction Log Records | |
----------------------- | |
Transaction log records will be one of the following types: | |
- Dtx-Start: Start of a distributed transaction | |
- Tx-Start: Start of a local transaction | |
- End: End of the transaction | |
- Rollback: Marks that the transaction is rolled back | |
- Prepare: Marks the dtx as prepared | |
- Commit: Marks the transaction as committed | |
- Delete: Notes that the transaction is no longer valid | |
Transactions are also identified by the LSN of the start (Dtx-Start or | |
Tx-Start) record. Successive records associated with the same transaction | |
are linked backwards using the undo-prev LSN. | |
The association between messages and transactions is maintained in the | |
message log; if the message enqueue/dequeue operation is part of a transaction, | |
the operation includes a transaction ID. The transaction log maintains the | |
state of the transaction itself. Thus, each operation (enqueue, dequeue, | |
prepare, rollback, commit) is a single log record. | |
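As an illustration of how a transaction's state might be derived from its
chain of records, the sketch below folds the record types into an outcome.
The enum names and the rule that the last Prepare, Commit or Rollback record
wins are assumptions made for the example, not the store's documented
behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Record types from the list above.
enum TxRecordType { DtxStart, TxStart, End, Rollback, Prepare, Commit, Delete };

// Hypothetical summary of where a transaction stands.
enum TxOutcome { Open, Prepared, Committed, RolledBack };

// Derives the outcome from a transaction's records, oldest first.
// Assumption: only Prepare, Commit and Rollback change the outcome here.
TxOutcome outcome(const std::vector<TxRecordType>& chain) {
    TxOutcome o = Open;
    for (std::size_t i = 0; i < chain.size(); ++i) {
        switch (chain[i]) {
            case Prepare:  o = Prepared;   break;
            case Commit:   o = Committed;  break;
            case Rollback: o = RolledBack; break;
            default: break;  // start, End and Delete leave the outcome as is
        }
    }
    return o;
}
```

Message recovery could then consult the outcome for the transaction ID
carried by a transacted enqueue or dequeue record.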
A few notes: | |
- The transactions need to be recovered and sorted out prior to recovering
the messages. The message recovery needs to know if an enqueue/dequeue
associated with a transaction can be discarded or should be acted on.
- Transaction IDs need to remain valid as long as any messages exist that | |
refer to them. This prevents the problem of trying to recover a message | |
with a transaction ID that doesn't exist - was it finalized? was it aborted? | |
Reference to a missing transaction ID can be ignored with assurance that | |
the message was deleted further along or the transaction would still be there. | |
- Keeping transaction IDs valid requires that a refcount be kept on each
transaction at run time. As messages are deleted, the transaction set can
be notified that the message is gone. To enforce this, Message objects have
a boost::shared_ptr to each Transaction they're associated with. When the
Message is destroyed, its references to Transactions are released. When a
Transaction is destroyed, it is finished with, so its Delete record is
written to the log.
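The reference-counting scheme in the last note can be sketched as below.
To keep the example self-contained it uses std::shared_ptr in place of
boost::shared_ptr, and TxLog is a hypothetical stand-in for the transaction
log that only remembers which IDs had a Delete written.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for the transaction log.
struct TxLog {
    std::vector<std::uint64_t> deleted;  // IDs with a Delete record written
};

class Transaction {
    std::uint64_t id;
    TxLog& log;

public:
    Transaction(std::uint64_t txId, TxLog& txLog) : id(txId), log(txLog) {}

    // Runs when the last reference is released: no message refers to this
    // transaction any more, so its Delete record can be logged.
    ~Transaction() { log.deleted.push_back(id); }
};

// A message keeps every transaction it is involved in alive.
struct Message {
    std::vector<std::shared_ptr<Transaction> > transactions;
};
```

The Delete record is written exactly once, when the last Message holding the
transaction goes away, without any explicit counting in the store itself.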
In-Memory Objects | |
----------------- | |
The store holds the message and transaction relationships in memory. CLFS is | |
a backing store for that information so it can be reliably reconstructed in | |
the event of a failure. This is a change from the SQL-only store where all | |
of the information is maintained in SQL and none is kept in memory. The | |
CLFS-using store is designed for high-throughput operation where it is assumed | |
that messages will transit the broker (and, therefore, the store) quickly. | |
- Message list: this is a map of persistence ID (message LSN) to a list of | |
queues where the message is located and an indication that there is | |
(or isn't) a transaction involved and in which direction (enqueue/dequeue) | |
so a dequeued message doesn't get deleted while a transacted enqueue is | |
pending. | |
- Transaction list: also probably a map of id/LSN to a transaction object. | |
The transaction object needs to keep a list of messages/queues that are | |
impacted as well as the transaction state and Xid (for dtx). | |
- Right now log records are written as needed with no preallocation or
reservation. It may be better to pre-reserve records in some cases, such | |
as a transaction prepare where the space for commit or rollback may be | |
reserved at the same time. This may be the only case where losing a | |
record may be an issue - needs some more thought. | |
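One possible shape for the message list described above is sketched here.
The type names, the TxOp values and the rule used by deletable are
assumptions chosen to illustrate why the transaction direction is kept per
queue; they are not taken from the store's source.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Whether a queue placement is part of a pending transaction, and in which
// direction.
enum TxOp { NoTx, TxEnqueue, TxDequeue };

struct QueueEntry {
    std::uint64_t queueId;
    TxOp pending;
};

// Persistence ID (message LSN) -> queues the message is on.
typedef std::map<std::uint64_t, std::vector<QueueEntry> > MessageList;

// Removes the message's entry for one queue (a non-transactional dequeue).
void dequeue(MessageList& list, std::uint64_t id, std::uint64_t queueId) {
    MessageList::iterator m = list.find(id);
    if (m == list.end()) return;
    std::vector<QueueEntry>& q = m->second;
    for (std::vector<QueueEntry>::iterator e = q.begin(); e != q.end(); ++e) {
        if (e->queueId == queueId) { q.erase(e); break; }
    }
}

// A message may be deleted only when it is on no queue at all, including
// queues where its enqueue is still pending in a transaction.
bool deletable(const MessageList& list, std::uint64_t id) {
    MessageList::const_iterator m = list.find(id);
    return m == list.end() || m->second.empty();
}
```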
Recovery | |
-------- | |
During recovery, the store needs to verify that recovered messages' queues
exist. If there's a failure after a queue's deletion is final but before the
affected messages are recorded as dequeued (and possibly deleted), the
remaining dequeues (and possible message deletions) need to be handled during
recovery by not restoring those messages for the broker, and also logging
their deletion. The store could also skip the logging of deletion and let the
normal tail-maintenance eventually move up over the old message entries.
Since the invalid messages won't be kept in the message map, their IDs won't
be taken into account when maintaining the tail; the tail will move up over
them as soon as enough messages come and go.
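The queue check can be sketched as a filter over the recovered messages,
assuming the set of queue IDs recovered from SQL is available. MessageQueues
and restorable are hypothetical names, and this sketch takes the second
approach above: it drops invalid messages without logging their deletion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Persistence ID -> IDs of the queues the log says the message is on.
typedef std::map<std::uint64_t, std::vector<std::uint64_t> > MessageQueues;

// Drops references to queues that no longer exist. A message left on no
// queue is not returned, so it is not restored to the broker.
MessageQueues restorable(const MessageQueues& recovered,
                         const std::set<std::uint64_t>& knownQueues) {
    MessageQueues result;
    for (MessageQueues::const_iterator m = recovered.begin();
         m != recovered.end(); ++m) {
        std::vector<std::uint64_t> kept;
        for (std::size_t i = 0; i < m->second.size(); ++i) {
            if (knownQueues.count(m->second[i]) != 0)
                kept.push_back(m->second[i]);
        }
        if (!kept.empty()) result[m->first] = kept;
    }
    return result;
}
```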
Plugin Options | |
-------------- | |
The command-line options added by the CLFS plugin are:
  --connect             The SQL connect string for the SQL parts; same as the
                        SQL plugin.
  --catalog             The SQL database (catalog) name; same as the SQL
                        plugin.
  --store-dir           The directory to store the logs in. Defaults to the
                        broker --data-dir value. If --no-data-dir is
                        specified, --store-dir must also be specified.
  --container-size      The size of each container in the log, in bytes. The
                        minimum size is 512KB (smaller sizes will be rounded
                        up). Additionally, the size will be rounded up to a
                        multiple of the sector size on the disk holding the
                        log. Once the log is created, each newly added
                        container will be the same size as the initial
                        container(s). Default is 1MB.
  --initial-containers  The number of containers to populate a new log with
                        if a new log is created. Ignored if the log exists.
                        Default is 2.
  --max-write-buffers   The maximum number of write buffers that the plugin
                        can use before CLFS automatically flushes the log to
                        disk. Lower values flush more often; higher values
                        have higher performance. Default is 10.
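The rounding rules for --container-size can be worked through with a short
sketch. The function name effectiveContainerSize is hypothetical and the
sector size is passed in rather than queried from the disk; the caller is
assumed to supply a nonzero sector size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Minimum container size stated above: 512KB.
const std::uint64_t MinContainerSize = 512 * 1024;

// Applies both rules: raise the request to the minimum, then round up to a
// multiple of the sector size. sectorSize must be nonzero.
std::uint64_t effectiveContainerSize(std::uint64_t requested,
                                     std::uint64_t sectorSize) {
    const std::uint64_t size = std::max(requested, MinContainerSize);
    const std::uint64_t rem = size % sectorSize;
    return rem == 0 ? size : size + (sectorSize - rem);
}
```

For example, a request of 600000 bytes on a disk with 4096-byte sectors
becomes 602112 bytes (147 sectors), while a request of 1000 bytes is raised
to the 524288-byte minimum.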
Maybe need an option to hold messages of a certain size in memory? I think | |
maybe the broker proper holds the message content, so the store need not. | |
Testing | |
------- | |
More tests will need to be written to stress the log container extension | |
capability and ensure that moving the base LSN works properly and the store | |
doesn't continually grow the log without bounds. | |
Note that running "qpid-perftest --durable yes" stresses the log extension | |
and tail maintenance. It doesn't get run as a normal regression test but should | |
be run when playing with the container/tail maintenance logic to ensure it's | |
not broken. |