RocksDB is a well-proven open source persistent key-value store, optimized for fast storage. It scales with the number of CPUs and storage IOPS to support IO-bound, in-memory and write-once workloads and, most importantly, is flexible enough to allow for innovation.
As the Microsoft Bing team, we have been continuously pushing hard to improve the scalability and efficiency of our platform and, ultimately, Bing end-user satisfaction. We would like to explore the opportunity to embrace open source, RocksDB in this case: to use, enhance and customize it for our needs, and also to contribute back to the RocksDB community. Herein, we are pleased to offer this RocksDB port for the Windows platform.
These notes describe some decisions and changes we had to make with regard to porting RocksDB to Windows. We hope this will help both reviewers and users of the Windows port. We are open to comments and improvements.
All of the porting, testing and benchmarking was done on Windows Server 2012 R2 Datacenter 64-bit, but to the best of our knowledge none of the APIs we used during porting is unsupported on Windows versions after Vista.
We strive to achieve the following goals:
We have chosen CMake as a widely accepted build system to build the Windows port. It is very fast and convenient.
At the same time it generates Visual Studio projects that are both usable from a command line and IDE.
The top-level CMakeLists.txt file contains a description of all targets and build rules. It also provides brief instructions on how to build the software for Windows. One more build-related file is thirdparty.inc, which also resides at the top level. This file must be edited to point to the actual locations of the third-party libraries. We think that it would be beneficial to merge the existing make-based build system and the new cmake-based build system into a single one to use on all platforms.
All building and testing was done for 64-bit. We have not conducted any testing for 32-bit and early reports indicate that it will not run on 32-bit.
We had to make some minimal changes within the portable files that account either for OS differences or for the shortcomings of C++11 support in the current version of the MS compiler. Most or all of them are expected to be fixed in the upcoming compiler releases.
We plan to use this port for our business purposes here at Bing, and this provided the business justification for this port. This also means that, at present, we are not free to choose the compiler version at will.
#ifndef OS_WIN in a few places (
port/dirent.h (very few places) with the implementation of the relevant interfaces within
port/sys_time.h (few places) implemented equivalents within
printf %z specification is not supported on Windows. To imitate existing standards we came up with a string macro ROCKSDB_PRIszt which expands to %z on POSIX systems and to Iu on Windows.
constexpr is not supported. We had to replace std::numeric_limits<>::max/min() with the corresponding C macros for constants. Sometimes we had to make class members static const and place a definition within a .cc file.
constexpr for functions was replaced with a template specialization (1 place)
char in one place along with bug fixes (spatial experimental feature)
std::chrono lacks nanoseconds support (fixed in the upcoming release of the STL) and we had to use
std::once to mitigate within WinEnv.
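The format-specifier workaround described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the port's actual header: the macro name SKETCH_PRIszt and the helper FormatSize are hypothetical, and the expansions ("Iu" for the Microsoft runtime, "zu" elsewhere) are inferred from the description above.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a macro in the spirit of ROCKSDB_PRIszt.
// Assumption: the Microsoft runtime wants "Iu" for size_t, other
// platforms accept the C99 "zu" spelling.
#if defined(_MSC_VER)
#define SKETCH_PRIszt "Iu"
#else
#define SKETCH_PRIszt "zu"
#endif

// Formats a size_t with the platform-appropriate length modifier.
// Note the space before the macro: without it, C++11 would parse the
// identifier as a user-defined literal suffix.
inline std::string FormatSize(std::size_t n) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "size=%" SKETCH_PRIszt, n);
  return std::string(buf);
}
```

Call sites then stay identical on every platform, for example FormatSize(file_size), and only the macro definition differs.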
We endeavored to make it functionally on par with posix_env. This means we replicated the functionality of the thread pool and other things as precisely as possible, including:
use_os_buffer=false to disable OS disk buffering for WinWritableFile and WinRandomAccessFile.
SetFileInformationByHandle to compensate for the absence of
Even though Windows provides its own efficient thread-pool implementation, we chose to replicate the posix logic using std::thread primitives. This allows anyone to quickly detect any changes within the posix source code and replicate them within the Windows env. This has proven to work very well. At the same time, anyone who wishes to replace the built-in thread pool can do so using RocksDB stackable environments.
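To give a feel for what a pool built only on std::thread primitives looks like, here is a minimal sketch. It is not the port's actual implementation: the class name SketchThreadPool and its Schedule method are hypothetical, and features such as priorities or resizing the pool are left out.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool built only on portable C++11 primitives.
class SketchThreadPool {
 public:
  explicit SketchThreadPool(int num_threads) {
    for (int i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] { WorkerLoop(); });
    }
  }

  // Lets the workers drain the remaining queue, then joins them.
  ~SketchThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      exiting_ = true;
    }
    cv_.notify_all();
    for (auto& t : workers_) t.join();
  }

  SketchThreadPool(const SketchThreadPool&) = delete;
  SketchThreadPool& operator=(const SketchThreadPool&) = delete;

  // Queues a job and wakes one idle worker.
  void Schedule(std::function<void()> job) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(std::move(job));
    }
    cv_.notify_one();
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return exiting_ || !queue_.empty(); });
        if (queue_.empty()) return;  // reached only when exiting_ is set
        job = std::move(queue_.front());
        queue_.pop_front();
      }
      job();  // run outside the lock so other workers can proceed
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool exiting_ = false;
  std::vector<std::thread> workers_;
};
```

Because nothing here is OS-specific, the same logic compiles on both platforms, which is what makes it easy to track changes in the posix code.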
For disk access we implemented all of the functionality present within the posix_env, which includes memory-mapped files, random access, rate-limiter support, etc. The use_os_buffer flag on POSIX platforms currently denotes disabling read-ahead log via the fadvise mechanism. Windows does not have an fadvise system call. What is more, it implements its disk cache in a way that differs greatly from Linux. It is not an uncommon practice on Windows to perform un-buffered disk access to gain control of memory consumption. We think that in our use case this may also be a good configuration option, at the expense of disk throughput. To compensate, one may increase the configured in-memory cache size instead. Thus we have chosen use_os_buffer=false to disable OS disk buffering for WinRandomAccessFile. The OS imposes restrictions on the alignment of the disk offsets, the buffers used and the amount of data that is read or written when accessing files in un-buffered mode. When the option is true, the classes behave in a standard way. This makes it possible to perform reads and writes in cases where un-buffered access does not make sense, such as WAL and MANIFEST.
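The alignment restrictions mentioned above mean that an arbitrary (offset, length) request has to be widened to an aligned window before it can be issued in un-buffered mode. The following sketch shows that arithmetic only. The names AlignedWindow, AlignDown, AlignUp and ComputeWindow are hypothetical, the alignment value is assumed to be a power of two such as the volume sector size, and allocating a suitably aligned buffer is not covered.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical description of the aligned region that covers a request.
struct AlignedWindow {
  std::uint64_t offset;  // request offset rounded down to the alignment
  std::size_t length;    // window size, a multiple of the alignment
  std::size_t skip;      // bytes between window start and requested offset
};

// Both helpers assume `alignment` is a power of two (e.g. 512 or 4096).
inline std::uint64_t AlignDown(std::uint64_t v, std::uint64_t alignment) {
  return v & ~(alignment - 1);
}

inline std::uint64_t AlignUp(std::uint64_t v, std::uint64_t alignment) {
  return (v + alignment - 1) & ~(alignment - 1);
}

// Widens [offset, offset + n) to the smallest aligned window containing it.
inline AlignedWindow ComputeWindow(std::uint64_t offset, std::size_t n,
                                   std::size_t alignment) {
  const std::uint64_t start = AlignDown(offset, alignment);
  const std::uint64_t end = AlignUp(offset + n, alignment);
  return {start, static_cast<std::size_t>(end - start),
          static_cast<std::size_t>(offset - start)};
}
```

A reader would issue the I/O for the whole window and then copy n bytes starting at skip into the caller's buffer. The extra bytes transferred are part of the throughput cost noted above.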
We have replaced
OVERLAPPED structure so we can atomically seek to the position of the disk operation but still perform the operation synchronously. Thus we are able to emulate the functionality of pread/pwrite reasonably well. The only difference is that the file pointer is not returned to its original position, but that hardly matters given the random nature of access.
We use SetFileInformationByHandle both to truncate files after writing a full final page to disk and to pre-allocate disk space for faster I/O, thus compensating for the absence of fallocate, although some differences remain. For example, the pre-allocated space is not filled with zeros as it is on Linux; on a positive note, the end-of-file position is also not modified after pre-allocation.
RocksDB renames, copies and deletes files at will even though they may be opened with another handle at the same time. We had to relax the file-sharing restrictions and allow nearly all possible concurrent access permissions.
Thread-local storage plays a significant role in RocksDB performance. Rather than creating a separate implementation we chose to create inline wrappers that forward pthread_specific calls to Windows Tls interfaces within the rocksdb::port namespace. This leaves the existing meat of the logic intact, unchanged and just as maintainable.
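A rough sketch of such forwarding wrappers is shown below. It is an illustration under stated assumptions rather than the port's actual code: the namespace port_sketch and the function names KeyCreate, SetSpecific, GetSpecific and KeyDelete are hypothetical, and the destructor callback that pthread_key_create accepts is deliberately left out because the Windows Tls functions used here have no equivalent parameter.

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h>
#endif

namespace port_sketch {

#ifdef _WIN32
// Windows branch: forward to the Tls* family.
using tls_key_t = DWORD;

inline int KeyCreate(tls_key_t* key) {
  *key = TlsAlloc();
  return *key == TLS_OUT_OF_INDEXES ? -1 : 0;
}
inline int SetSpecific(tls_key_t key, void* value) {
  return TlsSetValue(key, value) ? 0 : -1;
}
inline void* GetSpecific(tls_key_t key) { return TlsGetValue(key); }
inline int KeyDelete(tls_key_t key) { return TlsFree(key) ? 0 : -1; }
#else
// POSIX branch: thin pass-through so callers look the same everywhere.
using tls_key_t = pthread_key_t;

inline int KeyCreate(tls_key_t* key) {
  return pthread_key_create(key, nullptr);  // no destructor in this sketch
}
inline int SetSpecific(tls_key_t key, void* value) {
  return pthread_setspecific(key, value);
}
inline void* GetSpecific(tls_key_t key) { return pthread_getspecific(key); }
inline int KeyDelete(tls_key_t key) { return pthread_key_delete(key); }
#endif

}  // namespace port_sketch
```

Because the wrappers are inline and share one signature, the calling code does not need any platform conditionals of its own.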
To mitigate the lack of thread-local storage cleanup on thread exit we added a limited amount of Windows-specific code within the same thread_local.cc file that injects a cleanup callback into a "__tls" structure within the ".CRT$XLB" data segment. This approach guarantees that the callback is invoked regardless of whether RocksDB is used within an executable, a standalone DLL or within another DLL.
When RocksDB is used with Jemalloc, the latter needs to be initialized before any of the C++ globals or statics. To accomplish that we injected an initialization routine into ".CRT$XCT" that is automatically invoked by the runtime before initializing static objects. je-uninit is queued to
The jemalloc redirecting new/delete global operators are used by the linker provided certain conditions are met. See the build section in these notes.
We decided not to implement these two features because the hosting program as a rule has these two things in it. We experienced no inconveniences debugging issues in the debugger or analyzing process dumps if need be and thus we did not see this as a priority.
All of the benchmarks are run on the same set of machines. Here are the details of the test setup:
We think that there is still considerable room to improve performance, which will be an ongoing effort for us.