Scalability and performance are key features for Mesos. Some users already run production clusters with more than 35,000 agents and 100,000 active tasks. However, there remains a lot of room for improvement across a variety of areas of the system.
The performance working group was created to focus on some of these areas. The group's charter is to improve scalability, throughput, and latency across the system; to measure our improvements and prevent performance regressions, we write benchmarks and automate them.
In the past few months, we've focused on making improvements to the following areas:
Before we dive into the master failover improvements, I would like to recognize and thank the following contributors:
Our first area of focus was to improve the time it takes for a master failover to complete, where completion is defined as all of the agents successfully reregistering. Mesos is architected around a centralized master, with standby masters that participate in a quorum for high availability. For scalability reasons, the leading master stores the state of the cluster in memory. During a master failover, the leading master therefore needs to rebuild the in-memory state from all of the agents that reregister. During this time, the master is available to process other requests, but it exposes only partial state to API consumers.
Rebuilding the master's in-memory state can be expensive for larger clusters, so the focus of this effort was to make it more efficient. Improvements were made in several areas; only the highest-impact changes are listed below:
We upgraded to protobuf 3.5.0 in order to gain move support. When we profiled the master, we found that it spent a lot of time copying protobuf messages during agent reregistration. Move support allowed us to eliminate these copies while retaining value semantics.
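As a rough illustration (not Mesos code), the sketch below shows how a protobuf 3.5+ generated message can be moved rather than deep-copied; `google::protobuf::Any` simply stands in for the much larger agent reregistration messages:

```cpp
#include <string>
#include <utility>

#include <google/protobuf/any.pb.h>

// With protobuf 3.5+, generated messages have move constructors and move
// assignment, so ownership of large payloads can be transferred cheaply
// while keeping value semantics.
google::protobuf::Any makeMessage()
{
  google::protobuf::Any message;
  message.set_value(std::string(1024 * 1024, 'x'));
  return message; // Moved (or elided), not deep-copied.
}

int main()
{
  google::protobuf::Any a = makeMessage();
  google::protobuf::Any b = std::move(a); // Steals the internal buffers.
  (void)b;
  return 0;
}
```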
Libprocess provides several primitives for message passing (a usage sketch follows the list):

* `dispatch`: Provides the ability to post a message to a local actor.
* `defer`: Provides a deferred `dispatch`, i.e. a function object that when invoked will issue a `dispatch`.
* `install`: Installs a handler for receiving a protobuf message.
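To make these concrete, here is a minimal, hypothetical sketch of `dispatch` and `defer` against a toy actor (the `CounterProcess` name and its `add` method are invented for illustration; `install` is omitted since it requires generated protobuf messages):

```cpp
#include <process/defer.hpp>
#include <process/dispatch.hpp>
#include <process/process.hpp>

using namespace process;

// A toy actor: its methods are only ever executed from its own queue.
class CounterProcess : public Process<CounterProcess>
{
public:
  void add(long amount) { total += amount; }

private:
  long total = 0;
};

int main()
{
  CounterProcess counter;
  spawn(counter);

  // `dispatch` posts a "message": asynchronously invoke `add(1)` on the actor.
  dispatch(counter, &CounterProcess::add, 1);

  // `defer` returns a function object; invoking it issues the dispatch.
  auto deferred = defer(counter, &CounterProcess::add, 1);
  deferred();

  terminate(counter);
  wait(counter);
  return 0;
}
```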
These primitives did not have move support, as they were originally added prior to the addition of C++11 support to the code-base. In order to eliminate copies, we enhanced these primitives to support moving arguments in and out.
This required introducing a new C++ utility, because `defer` takes on the same API as `std::bind` (e.g., placeholders). Specifically, the function object returned by `std::bind` does not move the bound arguments into the stored callable. In order to enable this, `defer` now uses a utility we introduced called `lambda::partial` rather than `std::bind`. `lambda::partial` performs partial function application similar to `std::bind`, except that the returned function object moves the bound arguments into the stored callable if the invocation is performed on an r-value function object.
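The following sketch (plain C++14, not the actual `lambda::partial` implementation) illustrates the difference: `std::bind` hands the stored argument to the callable as an lvalue, whereas a move-aware partial application can move it when the function object itself is invoked as an r-value:

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical sink taking its argument by value; with moves the caller can
// transfer ownership of the buffer instead of copying it.
void consume(std::string s)
{
  std::cout << "consumed " << s.size() << " bytes" << std::endl;
}

int main()
{
  std::string payload(1024 * 1024, 'x');

  // `std::bind` stores the bound argument, but passes it to `consume` as an
  // lvalue on invocation, so the stored string is *copied* into `s`.
  auto bound = std::bind(consume, std::move(payload));
  bound();

  // A move-aware partial application (the idea behind `lambda::partial`)
  // moves the stored argument into the callable instead. Emulated here with
  // a C++14 init-capture, invoked as an r-value to mirror that contract.
  std::string payload2(1024 * 1024, 'x');
  auto partial = [arg = std::move(payload2)]() mutable {
    consume(std::move(arg)); // Moved, not copied.
  };
  std::move(partial)();

  return 0;
}
```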
With these previous enhancements in place, we were able to eliminate many of the expensive copies of protobuf messages performed by the master.
We wrote a synthetic benchmark to simulate a master failover. This benchmark prepares all the messages that would be sent to the master by the agents that need to reregister:
The benchmark has a few caveats:
This was tested on a 2015 MacBook Pro with a 2.8 GHz Intel Core i7 processor. Mesos was built using Apple LLVM version 9.0.0 (clang-900.0.38), with `-O2` enabled in 1.5.0.
The first results represent a cluster with 10 active tasks per agent across 5 frameworks, and no completed tasks. The results below range from 1,000 to 40,000 agents, with 10,000 to 400,000 active tasks:
There was a reduction in time-to-completion of ~80%, due to a 450-500% improvement in throughput from 1.3.0 to 1.5.0.
The second results add task history: each agent now also contains 100 completed tasks across 5 completed frameworks. The results from 1,000 to 40,000 agents, with 10,000 to 400,000 active tasks and 100,000 to 4,000,000 completed tasks, are shown below:
This represents a reduction in time-to-completion of ~85%, due to a 550-700% improvement in throughput from 1.3.0 to 1.5.0.
We're currently targeting the following areas for improvement:
If you are a user and would like to suggest some areas for performance improvement, please let us know by emailing