| Performance Notes |
| ================= |
| The python driver for Cassandra offers several methods for executing queries. |
| You can synchronously block for queries to complete using |
| :meth:`.Session.execute()`, you can use a future-like interface through |
| :meth:`.Session.execute_async()`, or you can attach a callback to the future |
| with :meth:`.ResponseFuture.add_callback()`. Each of these methods has |
| different performance characteristics and behaves differently when |
| multiple threads are used. |
| |
| Benchmark Notes |
| --------------- |
| All benchmarks were executed using the |
| `benchmark scripts <https://github.com/datastax/python-driver/tree/master/benchmarks>`_ |
| in the driver repository. They were executed on a laptop with 16 GiB of RAM, an SSD, |
| and a 2 GHz, four core CPU with hyperthreading. The Cassandra cluster was a three |
| node `ccm <https://github.com/pcmanus/ccm>`_ cluster running on the same laptop |
| with version 1.2.13 of Cassandra. I suggest testing these benchmarks against your |
| own cluster when tuning the driver for optimal throughput or latency. |
| |
| The 1.0.0 version of the driver was used with all default settings. For these |
| benchmarks, the driver was configured to use the ``libev`` reactor. You can also run |
| the benchmarks using the ``asyncore`` event loop (:class:`~.AsyncoreConnection`) |
| by using the ``--asyncore-only`` command line option. |
| |
| Each benchmark completes 100,000 small inserts. The replication factor for the |
| keyspace was three, so all nodes were replicas for the inserted rows. |
| |
| Synchronous Execution (`sync.py <https://github.com/datastax/python-driver/blob/master/benchmarks/sync.py>`_) |
| ------------------------------------------------------------------------------------------------------------- |
| Although this is the simplest way to make queries, it has low throughput |
| in single threaded environments. This is basically what the benchmark |
| is doing: |
| |
| .. code-block:: python |
| |
| from cassandra.cluster import Cluster |
| |
| cluster = Cluster([127.0.0.1, 127.0.0.2, 127.0.0.3]) |
| session = cluster.connect() |
| |
| for i in range(100000): |
| session.execute("INSERT INTO mykeyspace.mytable (key, b, c) VALUES (a, 'b', 'c')") |
| |
| .. code-block:: bash |
| |
| ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1 |
| Average throughput: 434.08/sec |
| |
| |
| This technique does scale reasonably well as we add more threads: |
| |
| .. code-block:: bash |
| |
| ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2 |
| Average throughput: 830.49/sec |
| ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4 |
| Average throughput: 1078.27/sec |
| ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8 |
| Average throughput: 1275.20/sec |
| ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=16 |
| Average throughput: 1345.56/sec |
| |
| |
| In my environment, throughput is maximized at about 20 threads. |
| |
| |
| Batched Futures (`future_batches.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_batches.py>`_) |
| --------------------------------------------------------------------------------------------------------------------------- |
| This is a simple way to work with futures for higher throughput. Essentially, |
| we start 120 queries asynchronously at the same time and then wait for them |
| all to complete. We then repeat this process until all 100,000 operations |
| have completed: |
| |
| .. code-block:: python |
| |
| futures = Queue.Queue(maxsize=121) |
| for i in range(100000): |
| if i % 120 == 0: |
| # clear the existing queue |
| while True: |
| try: |
| futures.get_nowait().result() |
| except Queue.Empty: |
| break |
| |
| future = session.execute_async(query) |
| futures.put_nowait(future) |
| |
| As expected, this improves throughput in a single-threaded environment: |
| |
| .. code-block:: bash |
| |
| ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1 |
| Average throughput: 3477.56/sec |
| |
| However, adding more threads may actually harm throughput: |
| |
| .. code-block:: bash |
| |
| ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2 |
| Average throughput: 2360.52/sec |
| ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4 |
| Average throughput: 2293.21/sec |
| ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8 |
| Average throughput: 2244.85/sec |
| |
| |
| Queued Futures (`future_full_pipeline.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_full_pipeline.py>`_) |
| -------------------------------------------------------------------------------------------------------------------------------------- |
| This pattern is similar to batched futures. The main difference is that |
| every time we put a future on the queue, we pull the oldest future out |
| and wait for it to complete: |
| |
| .. code-block:: python |
| |
| futures = Queue.Queue(maxsize=121) |
| for i in range(100000): |
| if i >= 120: |
| old_future = futures.get_nowait() |
| old_future.result() |
| |
| future = session.execute_async(query) |
| futures.put_nowait(future) |
| |
| This gets slightly better throughput than the Batched Futures pattern: |
| |
| .. code-block:: bash |
| |
| ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1 |
| Average throughput: 3635.76/sec |
| |
| But this has the same throughput issues when multiple threads are used: |
| |
| .. code-block:: bash |
| |
| ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2 |
| Average throughput: 2213.62/sec |
| ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4 |
| Average throughput: 2707.62/sec |
| ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8 |
| Average throughput: 2462.42/sec |
| |
| Unthrottled Futures (`future_full_throttle.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_full_throttle.py>`_) |
| ------------------------------------------------------------------------------------------------------------------------------------------- |
| What happens if we don't throttle our async requests at all? |
| |
| .. code-block:: python |
| |
| futures = [] |
| for i in range(100000): |
| future = session.execute_async(query) |
| futures.append(future) |
| |
| for future in futures: |
| future.result() |
| |
| Throughput is about the same as the previous pattern, but a lot of memory will |
| be consumed by the list of Futures: |
| |
| .. code-block:: bash |
| |
| ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1 |
| Average throughput: 3474.11/sec |
| ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2 |
| Average throughput: 2389.61/sec |
| ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4 |
| Average throughput: 2371.75/sec |
| ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8 |
| Average throughput: 2165.29/sec |
| |
| Callback Chaining (`callback_full_pipeline.py <https://github.com/datastax/python-driver/blob/master/benchmarks/callback_full_pipeline.py>`_) |
| ----------------------------------------------------------------------------------------------------------------------------------------------- |
| This pattern is very different from the previous patterns. Here we're taking |
| advantage of the :meth:`.ResponseFuture.add_callback()` function to start |
| another request as soon as one finishes. Futhermore, we're starting 120 |
| of these callback chains, so we've always got about 120 operations in |
| flight at any time: |
| |
| .. code-block:: python |
| |
| from itertools import count |
| from threading import Event |
| |
| sentinel = object() |
| num_queries = 100000 |
| num_started = count() |
| num_finished = count() |
| finished_event = Event() |
| |
| def insert_next(previous_result=sentinel): |
| if previous_result is not sentinel: |
| if isinstance(previous_result, BaseException): |
| log.error("Error on insert: %r", previous_result) |
| if num_finished.next() >= num_queries: |
| finished_event.set() |
| |
| if num_started.next() <= num_queries: |
| future = session.execute_async(query) |
| # NOTE: this callback also handles errors |
| future.add_callbacks(insert_next, insert_next) |
| |
| for i in range(min(120, num_queries)): |
| insert_next() |
| |
| finished_event.wait() |
| |
| This is a more complex pattern, but the throughput is excellent: |
| |
| .. code-block:: bash |
| |
| ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1 |
| Average throughput: 7647.30/sec |
| |
| Part of the reason why performance is so good is that everything is running on |
| single thread: the internal event loop thread that powers the driver. The |
| downside to this is that adding more threads doesn't improve anything: |
| |
| .. code-block:: bash |
| |
| ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2 |
| Average throughput: 7704.58/sec |
| |
| |
| What happens if we have more than 120 callback chains running? |
| |
| With 250 chains: |
| |
| .. code-block:: bash |
| |
| ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1 |
| Average throughput: 7794.22/sec |
| |
| Things look pretty good with 250 chains. If we try 500 chains, we start to max out |
| all of the connections in the connection pools. The problem is that the current |
| version of the driver isn't very good at throttling these callback chains, so |
| a lot of time gets spent waiting for new connections and performance drops |
| dramatically: |
| |
| .. code-block:: bash |
| |
| ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1 |
| Average throughput: 679.61/sec |
| |
| Until this is improved, you should limit the number of callback chains you run. |
| |
| PyPy |
| ---- |
| Almost all of these patterns become CPU-bound pretty quickly with CPython, the |
| normal implementation of python. `PyPy <http://pypy.org>`_ is an alternative |
| implementation of Python (written in Python) which uses a JIT compiler to |
| reduce CPU consumption. This leads to a huge improvement in the driver |
| performance: |
| |
| .. code-block:: bash |
| |
| ~/python-driver $ pypy benchmarks/callback_full_pipeline.py -n 500000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --asyncore-only --threads=1 |
| Average throughput: 18782.00/sec |
| |
| Eventually the driver may add C extensions to reduce CPU consumption, which |
| would probably narrow the gap between the performance of CPython and PyPy. |
| |
| multiprocessing |
| --------------- |
| All of the patterns here may be used over multiple processes using the |
| `multiprocessing <http://docs.python.org/2/library/multiprocessing.html>`_ |
| module. Multiple processes will scale significantly better than multiple |
| threads will, so if high throughput is your goal, consider this option. |
| |
| Just be sure to **never share any** :class:`~.Cluster`, :class:`~.Session`, |
| **or** :class:`~.ResponseFuture` **objects across multiple processes**. These |
| objects should all be created after forking the process, not before. |