docs/performance.rst - cassandra-python-driver - Git at Google

 Performance Notes
 =================
 The python driver for Cassandra offers several methods for executing queries.
 You can synchronously block for queries to complete using
 :meth:`.Session.execute()`, you can use a future-like interface through
 :meth:`.Session.execute_async()`, or you can attach a callback to the future
 with :meth:`.ResponseFuture.add_callback()`.  Each of these methods has
 different performance characteristics and behaves differently when
 multiple threads are used.

 Benchmark Notes
 ---------------
 All benchmarks were executed using the
 `benchmark scripts <https://github.com/datastax/python-driver/tree/master/benchmarks>`_
 in the driver repository.  They were executed on a laptop with 16 GiB of RAM, an SSD,
 and a 2 GHz, four core CPU with hyperthreading.  The Cassandra cluster was a three
 node `ccm <https://github.com/pcmanus/ccm>`_ cluster running on the same laptop
 with version 1.2.13 of Cassandra. I suggest testing these benchmarks against your
 own cluster when tuning the driver for optimal throughput or latency.

 The 1.0.0 version of the driver was used with all default settings.  For these
 benchmarks, the driver was configured to use the ``libev`` reactor.  You can also run
 the benchmarks using the ``asyncore`` event loop (:class:`~.AsyncoreConnection`)
 by using the ``--asyncore-only`` command line option.

 Each benchmark completes 100,000 small inserts. The replication factor for the
 keyspace was three, so all nodes were replicas for the inserted rows.

 Synchronous Execution (`sync.py <https://github.com/datastax/python-driver/blob/master/benchmarks/sync.py>`_)
 -------------------------------------------------------------------------------------------------------------
 Although this is the simplest way to make queries, it has low throughput
 in single threaded environments.  This is basically what the benchmark
 is doing:

 .. code-block:: python

     from cassandra.cluster import Cluster

     cluster = Cluster([127.0.0.1, 127.0.0.2, 127.0.0.3])
     session = cluster.connect()

     for i in range(100000):
         session.execute("INSERT INTO mykeyspace.mytable (key, b, c) VALUES (a, 'b', 'c')")

 .. code-block:: bash

     ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
     Average throughput: 434.08/sec


 This technique does scale reasonably well as we add more threads:

 .. code-block:: bash

     ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
     Average throughput: 830.49/sec
     ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
     Average throughput: 1078.27/sec
     ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
     Average throughput: 1275.20/sec
     ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=16
     Average throughput: 1345.56/sec


 In my environment, throughput is maximized at about 20 threads.


 Batched Futures (`future_batches.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_batches.py>`_)
 ---------------------------------------------------------------------------------------------------------------------------
 This is a simple way to work with futures for higher throughput.  Essentially,
 we start 120 queries asynchronously at the same time and then wait for them
 all to complete. We then repeat this process until all 100,000 operations
 have completed:

 .. code-block:: python

     futures = Queue.Queue(maxsize=121)
     for i in range(100000):
         if i % 120 == 0:
             # clear the existing queue
             while True:
                 try:
                     futures.get_nowait().result()
                 except Queue.Empty:
                     break

         future = session.execute_async(query)
         futures.put_nowait(future)

 As expected, this improves throughput in a single-threaded environment:

 .. code-block:: bash

     ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
     Average throughput: 3477.56/sec

 However, adding more threads may actually harm throughput:

 .. code-block:: bash

     ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
     Average throughput: 2360.52/sec
     ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
     Average throughput: 2293.21/sec
     ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
     Average throughput: 2244.85/sec


 Queued Futures (`future_full_pipeline.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_full_pipeline.py>`_)
 --------------------------------------------------------------------------------------------------------------------------------------
 This pattern is similar to batched futures.  The main difference is that
 every time we put a future on the queue, we pull the oldest future out
 and wait for it to complete:

 .. code-block:: python

         futures = Queue.Queue(maxsize=121)
         for i in range(100000):
             if i >= 120:
                 old_future = futures.get_nowait()
                 old_future.result()

             future = session.execute_async(query)
             futures.put_nowait(future)

 This gets slightly better throughput than the Batched Futures pattern:

 .. code-block:: bash

     ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
     Average throughput: 3635.76/sec

 But this has the same throughput issues when multiple threads are used:

 .. code-block:: bash

     ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
     Average throughput: 2213.62/sec
     ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
     Average throughput: 2707.62/sec
     ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
     Average throughput: 2462.42/sec

 Unthrottled Futures (`future_full_throttle.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_full_throttle.py>`_)
 -------------------------------------------------------------------------------------------------------------------------------------------
 What happens if we don't throttle our async requests at all?

 .. code-block:: python

     futures = []
     for i in range(100000):
         future = session.execute_async(query)
         futures.append(future)

     for future in futures:
         future.result()

 Throughput is about the same as the previous pattern, but a lot of memory will
 be consumed by the list of Futures:

 .. code-block:: bash

     ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
     Average throughput: 3474.11/sec
     ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
     Average throughput: 2389.61/sec
     ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
     Average throughput: 2371.75/sec
     ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
     Average throughput: 2165.29/sec

 Callback Chaining (`callback_full_pipeline.py <https://github.com/datastax/python-driver/blob/master/benchmarks/callback_full_pipeline.py>`_)
 -----------------------------------------------------------------------------------------------------------------------------------------------
 This pattern is very different from the previous patterns.  Here we're taking
 advantage of the :meth:`.ResponseFuture.add_callback()` function to start
 another request as soon as one finishes.  Futhermore, we're starting 120
 of these callback chains, so we've always got about 120 operations in
 flight at any time:

 .. code-block:: python

     from itertools import count
     from threading import Event

     sentinel = object()
     num_queries = 100000
     num_started = count()
     num_finished = count()
     finished_event = Event()

     def insert_next(previous_result=sentinel):
         if previous_result is not sentinel:
             if isinstance(previous_result, BaseException):
                 log.error("Error on insert: %r", previous_result)
             if num_finished.next() >= num_queries:
                 finished_event.set()

         if num_started.next() <= num_queries:
             future = session.execute_async(query)
             # NOTE: this callback also handles errors
             future.add_callbacks(insert_next, insert_next)

     for i in range(min(120, num_queries)):
         insert_next()

     finished_event.wait()

 This is a more complex pattern, but the throughput is excellent:

 .. code-block:: bash

     ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
     Average throughput: 7647.30/sec

 Part of the reason why performance is so good is that everything is running on
 single thread: the internal event loop thread that powers the driver.  The
 downside to this is that adding more threads doesn't improve anything:

 .. code-block:: bash

     ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
     Average throughput: 7704.58/sec


 What happens if we have more than 120 callback chains running?

 With 250 chains:

 .. code-block:: bash

     ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
     Average throughput: 7794.22/sec

 Things look pretty good with 250 chains.  If we try 500 chains, we start to max out
 all of the connections in the connection pools.  The problem is that the current
 version of the driver isn't very good at throttling these callback chains, so
 a lot of time gets spent waiting for new connections and performance drops
 dramatically:

 .. code-block:: bash

     ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
     Average throughput: 679.61/sec

 Until this is improved, you should limit the number of callback chains you run.

 PyPy
 ----
 Almost all of these patterns become CPU-bound pretty quickly with CPython, the
 normal implementation of python. `PyPy <http://pypy.org>`_ is an alternative
 implementation of Python (written in Python) which uses a JIT compiler to
 reduce CPU consumption.  This leads to a huge improvement in the driver
 performance:

 .. code-block:: bash

     ~/python-driver $ pypy benchmarks/callback_full_pipeline.py -n 500000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --asyncore-only --threads=1
     Average throughput: 18782.00/sec

 Eventually the driver may add C extensions to reduce CPU consumption, which
 would probably narrow the gap between the performance of CPython and PyPy.

 multiprocessing
 ---------------
 All of the patterns here may be used over multiple processes using the
 `multiprocessing <http://docs.python.org/2/library/multiprocessing.html>`_
 module.  Multiple processes will scale significantly better than multiple
 threads will, so if high throughput is your goal, consider this option.

 Just be sure to **never share any** :class:`~.Cluster`, :class:`~.Session`,
 **or** :class:`~.ResponseFuture` **objects across multiple processes**. These
 objects should all be created after forking the process, not before.
	Performance Notes
	=================
	The python driver for Cassandra offers several methods for executing queries.
	You can synchronously block for queries to complete using
	:meth:`.Session.execute()`, you can use a future-like interface through
	:meth:`.Session.execute_async()`, or you can attach a callback to the future
	with :meth:`.ResponseFuture.add_callback()`. Each of these methods has
	different performance characteristics and behaves differently when
	multiple threads are used.

	Benchmark Notes
	---------------
	All benchmarks were executed using the
	`benchmark scripts <https://github.com/datastax/python-driver/tree/master/benchmarks>`_
	in the driver repository. They were executed on a laptop with 16 GiB of RAM, an SSD,
	and a 2 GHz, four core CPU with hyperthreading. The Cassandra cluster was a three
	node `ccm <https://github.com/pcmanus/ccm>`_ cluster running on the same laptop
	with version 1.2.13 of Cassandra. I suggest testing these benchmarks against your
	own cluster when tuning the driver for optimal throughput or latency.

	The 1.0.0 version of the driver was used with all default settings. For these
	benchmarks, the driver was configured to use the ``libev`` reactor. You can also run
	the benchmarks using the ``asyncore`` event loop (:class:`~.AsyncoreConnection`)
	by using the ``--asyncore-only`` command line option.

	Each benchmark completes 100,000 small inserts. The replication factor for the
	keyspace was three, so all nodes were replicas for the inserted rows.

	Synchronous Execution (`sync.py <https://github.com/datastax/python-driver/blob/master/benchmarks/sync.py>`_)
	-------------------------------------------------------------------------------------------------------------
	Although this is the simplest way to make queries, it has low throughput
	in single threaded environments. This is basically what the benchmark
	is doing:

	.. code-block:: python

	from cassandra.cluster import Cluster

	cluster = Cluster([127.0.0.1, 127.0.0.2, 127.0.0.3])
	session = cluster.connect()

	for i in range(100000):
	session.execute("INSERT INTO mykeyspace.mytable (key, b, c) VALUES (a, 'b', 'c')")

	.. code-block:: bash

	~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
	Average throughput: 434.08/sec


	This technique does scale reasonably well as we add more threads:

	.. code-block:: bash

	~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
	Average throughput: 830.49/sec
	~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
	Average throughput: 1078.27/sec
	~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
	Average throughput: 1275.20/sec
	~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=16
	Average throughput: 1345.56/sec


	In my environment, throughput is maximized at about 20 threads.


	Batched Futures (`future_batches.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_batches.py>`_)
	---------------------------------------------------------------------------------------------------------------------------
	This is a simple way to work with futures for higher throughput. Essentially,
	we start 120 queries asynchronously at the same time and then wait for them
	all to complete. We then repeat this process until all 100,000 operations
	have completed:

	.. code-block:: python

	futures = Queue.Queue(maxsize=121)
	for i in range(100000):
	if i % 120 == 0:
	# clear the existing queue
	while True:
	try:
	futures.get_nowait().result()
	except Queue.Empty:
	break

	future = session.execute_async(query)
	futures.put_nowait(future)

	As expected, this improves throughput in a single-threaded environment:

	.. code-block:: bash

	~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
	Average throughput: 3477.56/sec

	However, adding more threads may actually harm throughput:

	.. code-block:: bash

	~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
	Average throughput: 2360.52/sec
	~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
	Average throughput: 2293.21/sec
	~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
	Average throughput: 2244.85/sec


	Queued Futures (`future_full_pipeline.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_full_pipeline.py>`_)
	--------------------------------------------------------------------------------------------------------------------------------------
	This pattern is similar to batched futures. The main difference is that
	every time we put a future on the queue, we pull the oldest future out
	and wait for it to complete:

	.. code-block:: python

	futures = Queue.Queue(maxsize=121)
	for i in range(100000):
	if i >= 120:
	old_future = futures.get_nowait()
	old_future.result()

	future = session.execute_async(query)
	futures.put_nowait(future)

	This gets slightly better throughput than the Batched Futures pattern:

	.. code-block:: bash

	~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
	Average throughput: 3635.76/sec

	But this has the same throughput issues when multiple threads are used:

	.. code-block:: bash

	~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
	Average throughput: 2213.62/sec
	~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
	Average throughput: 2707.62/sec
	~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
	Average throughput: 2462.42/sec

	Unthrottled Futures (`future_full_throttle.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_full_throttle.py>`_)
	-------------------------------------------------------------------------------------------------------------------------------------------
	What happens if we don't throttle our async requests at all?

	.. code-block:: python

	futures = []
	for i in range(100000):
	future = session.execute_async(query)
	futures.append(future)

	for future in futures:
	future.result()

	Throughput is about the same as the previous pattern, but a lot of memory will
	be consumed by the list of Futures:

	.. code-block:: bash

	~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
	Average throughput: 3474.11/sec
	~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
	Average throughput: 2389.61/sec
	~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
	Average throughput: 2371.75/sec
	~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
	Average throughput: 2165.29/sec

	Callback Chaining (`callback_full_pipeline.py <https://github.com/datastax/python-driver/blob/master/benchmarks/callback_full_pipeline.py>`_)
	-----------------------------------------------------------------------------------------------------------------------------------------------
	This pattern is very different from the previous patterns. Here we're taking
	advantage of the :meth:`.ResponseFuture.add_callback()` function to start
	another request as soon as one finishes. Futhermore, we're starting 120
	of these callback chains, so we've always got about 120 operations in
	flight at any time:

	.. code-block:: python

	from itertools import count
	from threading import Event

	sentinel = object()
	num_queries = 100000
	num_started = count()
	num_finished = count()
	finished_event = Event()

	def insert_next(previous_result=sentinel):
	if previous_result is not sentinel:
	if isinstance(previous_result, BaseException):
	log.error("Error on insert: %r", previous_result)
	if num_finished.next() >= num_queries:
	finished_event.set()

	if num_started.next() <= num_queries:
	future = session.execute_async(query)
	# NOTE: this callback also handles errors
	future.add_callbacks(insert_next, insert_next)

	for i in range(min(120, num_queries)):
	insert_next()

	finished_event.wait()

	This is a more complex pattern, but the throughput is excellent:

	.. code-block:: bash

	~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
	Average throughput: 7647.30/sec

	Part of the reason why performance is so good is that everything is running on
	single thread: the internal event loop thread that powers the driver. The
	downside to this is that adding more threads doesn't improve anything:

	.. code-block:: bash

	~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
	Average throughput: 7704.58/sec


	What happens if we have more than 120 callback chains running?

	With 250 chains:

	.. code-block:: bash

	~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
	Average throughput: 7794.22/sec

	Things look pretty good with 250 chains. If we try 500 chains, we start to max out
	all of the connections in the connection pools. The problem is that the current
	version of the driver isn't very good at throttling these callback chains, so
	a lot of time gets spent waiting for new connections and performance drops
	dramatically:

	.. code-block:: bash

	~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
	Average throughput: 679.61/sec

	Until this is improved, you should limit the number of callback chains you run.

	PyPy
	----
	Almost all of these patterns become CPU-bound pretty quickly with CPython, the
	normal implementation of python. `PyPy <http://pypy.org>`_ is an alternative
	implementation of Python (written in Python) which uses a JIT compiler to
	reduce CPU consumption. This leads to a huge improvement in the driver
	performance:

	.. code-block:: bash

	~/python-driver $ pypy benchmarks/callback_full_pipeline.py -n 500000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --asyncore-only --threads=1
	Average throughput: 18782.00/sec

	Eventually the driver may add C extensions to reduce CPU consumption, which
	would probably narrow the gap between the performance of CPython and PyPy.

	multiprocessing
	---------------
	All of the patterns here may be used over multiple processes using the
	`multiprocessing <http://docs.python.org/2/library/multiprocessing.html>`_
	module. Multiple processes will scale significantly better than multiple
	threads will, so if high throughput is your goal, consider this option.

	Just be sure to never share any :class:`~.Cluster`, :class:`~.Session`,
	or :class:`~.ResponseFuture` objects across multiple processes. These
	objects should all be created after forking the process, not before.