Python users who upgrade to recently released pyarrow
0.12 may find that their applications use significantly less memory when converting Arrow string data to pandas format. This includes using pyarrow.parquet.read_table
and pandas.read_parquet
. This article details some of what is going on under the hood, and why Python applications dealing with large amounts of strings are prone to memory use problems.
Let‘s start with some possibly surprising facts. I’m going to create an empty bytes
object and an empty str
(unicode) object in Python 3.7:
In [1]: val = b'' In [2]: unicode_val = u''
The sys.getsizeof
function accurately reports the number of bytes used by built-in Python objects. You might be surprised to find that:
In [4]: import sys In [5]: sys.getsizeof(val) Out[5]: 33 In [6]: sys.getsizeof(unicode_val) Out[6]: 49
Since strings in Python are nul-terminated, we can infer that a bytes object has 32 bytes of overhead while unicode has 48 bytes. One must also account for PyObject*
pointer references to the objects, so the actual overhead is 40 and 56 bytes, respectively. With large strings and text, this overhead may not matter much, but when you have a lot of small strings, such as those arising from reading a CSV or Apache Parquet file, they can take up an unexpected amount of memory. pandas represents strings in NumPy arrays of PyObject*
pointers, so the total memory used by a unique unicode string is
8 (PyObject*) + 48 (Python C struct) + string_length + 1
Suppose that we read a CSV file with
On disk this file would take approximately 10MB. Read into memory, however, it could take up over 60MB, as a 10 character string object takes up 67 bytes in a pandas.Series
.
While a Python unicode string can have 57 bytes of overhead, a string in the Arrow columnar format has only 4 (32 bits) or 4.125 (33 bits) bytes of overhead. 32-bit integer offsets encodes the position and size of a string value in a contiguous chunk of memory:
When you call table.to_pandas()
or array.to_pandas()
with pyarrow
, we have to convert this compact string representation back to pandas's Python-based strings. This can use a huge amount of memory when we have a large number of small strings. It is a quite common occurrence when working with web analytics data, which compresses to a compact size when stored in the Parquet columnar file format.
Note that the Arrow string memory format has other benefits beyond memory use. It is also much more efficient for analytics due to the guarantee of data locality; all strings are next to each other in memory. In the case of pandas and Python strings, the string data can be located anywhere in the process heap. Arrow PMC member Uwe Korn did some work to extend pandas with Arrow string arrays for improved performance and memory use.
For many years, the pandas.read_csv
function has relied on a trick to limit the amount of string memory allocated. Because pandas uses arrays of PyObject*
pointers to refer to objects in the Python heap, we can avoid creating multiple strings with the same value, instead reusing existing objects and incrementing their reference counts.
Schematically, we have the following:
In pyarrow
0.12, we have implemented this when calling to_pandas
. It requires using a hash table to deduplicate the Arrow string data as it's being converted to pandas. Hashing data is not free, but counterintuitively it can be faster in addition to being vastly more memory efficient in the common case in analytics where we have table columns with many instances of the same string values.
We can use the memory_profiler
Python package to easily get process memory usage within a running Python application.
import memory_profiler def mem(): return memory_profiler.memory_usage()[0]
In a new application I have:
In [7]: mem() Out[7]: 86.21875
I will generate approximate 1 gigabyte of string data represented as Python strings with length 10. The pandas.util.testing
module has a handy rands
function for generating random strings. Here is the data generation function:
from pandas.util.testing import rands def generate_strings(length, nunique, string_length=10): unique_values = [rands(string_length) for i in range(nunique)] values = unique_values * (length // nunique) return values
This generates a certain number of unique strings, then duplicates then to yield the desired number of total strings. So I'm going to create 100 million strings with only 10000 unique values:
In [8]: values = generate_strings(100000000, 10000) In [9]: mem() Out[9]: 852.140625
100 million PyObject*
values is only 745 MB, so this increase of a little over 770 MB is consistent with what we know so far. Now I'm going to convert this to Arrow format:
In [11]: arr = pa.array(values) In [12]: mem() Out[12]: 2276.9609375
Since pyarrow
exactly accounts for all of its memory allocations, we also check that
In [13]: pa.total_allocated_bytes() Out[13]: 1416777280
Since each string takes about 14 bytes (10 bytes plus 4 bytes of overhead), this is what we expect.
Now, converting arr
back to pandas is where things get tricky. The minimum amount of memory that pandas can use is a little under 800 MB as above as we need 100 million PyObject*
values, which are 8 bytes each.
In [14]: arr_as_pandas = arr.to_pandas() In [15]: mem() Out[15]: 3041.78125
Doing the math, we used 765 MB which seems right. We can disable the string deduplication logic by passing deduplicate_objects=False
to to_pandas
:
In [16]: arr_as_pandas_no_dedup = arr.to_pandas(deduplicate_objects=False) In [17]: mem() Out[17]: 10006.95703125
Without object deduplication, we use 6965 megabytes, or an average of 73 bytes per value. This is a little bit higher than the theoretical size of 67 bytes computed above.
One of the more surprising results is that the new behavior is about twice as fast:
In [18]: %time arr_as_pandas_time = arr.to_pandas() CPU times: user 2.94 s, sys: 213 ms, total: 3.15 s Wall time: 3.14 s In [19]: %time arr_as_pandas_no_dedup_time = arr.to_pandas(deduplicate_objects=False) CPU times: user 4.19 s, sys: 2.04 s, total: 6.23 s Wall time: 6.21 s
The reason for this is that creating so many Python objects is more expensive than hashing the 10 byte values and looking them up in a hash table.
Note that when you convert Arrow data with mostly unique values back to pandas, the memory use benefits here won't have as much of an impact.
In Apache Arrow, our goal is to develop computational tools to operate natively on the cache- and SIMD-friendly efficient Arrow columnar format. In the meantime, though, we recognize that users have legacy applications using the native memory layout of pandas or other analytics tools. We will do our best to provide fast and memory-efficient interoperability with pandas and other popular libraries.