| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| .. currentmodule:: pyarrow |
| .. highlight:: python |
| |
| .. _io: |
| |
| ======================== |
| Memory and IO Interfaces |
| ======================== |
| |
| This section will introduce you to the major concepts in PyArrow's memory |
| management and IO systems: |
| |
| * Buffers |
| * Memory pools |
| * File-like and stream-like objects |
| |
| Referencing and Allocating Memory |
| ================================= |
| |
| pyarrow.Buffer |
| -------------- |
| |
| The :class:`Buffer` object wraps the C++ :cpp:class:`arrow::Buffer` type |
| which is the primary tool for memory management in Apache Arrow in C++. It permits |
| higher-level array classes to safely interact with memory which they may or may |
| not own. ``arrow::Buffer`` can be zero-copy sliced to permit Buffers to cheaply |
| reference other Buffers, while preserving memory lifetime and clean |
| parent-child relationships. |
| |
| There are many implementations of ``arrow::Buffer``, but they all provide a |
| standard interface: a data pointer and length. This is similar to Python's |
| built-in `buffer protocol` and ``memoryview`` objects. |
| |
| A :class:`Buffer` can be created from any Python object implementing |
| the buffer protocol by calling the :func:`py_buffer` function. Let's consider |
| a bytes object: |
| |
| .. ipython:: python |
| |
| import pyarrow as pa |
| |
| data = b'abcdefghijklmnopqrstuvwxyz' |
| buf = pa.py_buffer(data) |
| buf |
| buf.size |
| |
| Creating a Buffer in this way does not allocate any memory; it is a zero-copy |
| view on the memory exported from the ``data`` bytes object. |
| |
| External memory, under the form of a raw pointer and size, can also be |
| referenced using the :func:`foreign_buffer` function. |
| |
| Buffers can be used in circumstances where a Python buffer or memoryview is |
| required, and such conversions are zero-copy: |
| |
| .. ipython:: python |
| |
| memoryview(buf) |
| |
| The Buffer's :meth:`~Buffer.to_pybytes` method converts the Buffer's data to a |
| Python bytestring (thus making a copy of the data): |
| |
| .. ipython:: python |
| |
| buf.to_pybytes() |
| |
| Memory Pools |
| ------------ |
| |
| All memory allocations and deallocations (like ``malloc`` and ``free`` in C) |
| are tracked in an instance of :class:`MemoryPool`. This means that we can |
| then precisely track amount of memory that has been allocated: |
| |
| .. ipython:: python |
| |
| pa.total_allocated_bytes() |
| |
| Let's allocate a resizable :class:`Buffer` from the default pool: |
| |
| .. ipython:: python |
| |
| buf = pa.allocate_buffer(1024, resizable=True) |
| pa.total_allocated_bytes() |
| buf.resize(2048) |
| pa.total_allocated_bytes() |
| |
| The default allocator requests memory in a minimum increment of 64 bytes. If |
| the buffer is garbaged-collected, all of the memory is freed: |
| |
| .. ipython:: python |
| |
| buf = None |
| pa.total_allocated_bytes() |
| |
| Besides the default built-in memory pool, there may be additional memory pools |
| to choose (such as `mimalloc <https://github.com/microsoft/mimalloc>`_) |
| from depending on how Arrow was built. One can get the backend |
| name for a memory pool:: |
| |
| >>> pa.default_memory_pool().backend_name |
| 'jemalloc' |
| |
| .. seealso:: |
| :ref:`API documentation for memory pools <api.memory_pool>`. |
| |
| .. seealso:: |
| On-GPU buffers using Arrow's optional :doc:`CUDA integration <cuda>`. |
| |
| |
| Input and Output |
| ================ |
| |
| .. _io.native_file: |
| |
| The Arrow C++ libraries have several abstract interfaces for different kinds of |
| IO objects: |
| |
| * Read-only streams |
| * Read-only files supporting random access |
| * Write-only streams |
| * Write-only files supporting random access |
| * File supporting reads, writes, and random access |
| |
| In the interest of making these objects behave more like Python's built-in |
| ``file`` objects, we have defined a :class:`~pyarrow.NativeFile` base class |
| which implements the same API as regular Python file objects. |
| |
| :class:`~pyarrow.NativeFile` has some important features which make it |
| preferable to using Python files with PyArrow where possible: |
| |
| * Other Arrow classes can access the internal C++ IO objects natively, and do |
| not need to acquire the Python GIL |
| * Native C++ IO may be able to do zero-copy IO, such as with memory maps |
| |
| There are several kinds of :class:`~pyarrow.NativeFile` options available: |
| |
| * :class:`~pyarrow.OSFile`, a native file that uses your operating system's |
| file descriptors |
| * :class:`~pyarrow.MemoryMappedFile`, for reading (zero-copy) and writing with |
| memory maps |
| * :class:`~pyarrow.BufferReader`, for reading :class:`~pyarrow.Buffer` objects |
| as a file |
| * :class:`~pyarrow.BufferOutputStream`, for writing data in-memory, producing a |
| Buffer at the end |
| * :class:`~pyarrow.FixedSizeBufferWriter`, for writing data into an already |
| allocated Buffer |
| * :class:`~pyarrow.HdfsFile`, for reading and writing data to the Hadoop Filesystem |
| * :class:`~pyarrow.PythonFile`, for interfacing with Python file objects in C++ |
| * :class:`~pyarrow.CompressedInputStream` and |
| :class:`~pyarrow.CompressedOutputStream`, for on-the-fly compression or |
| decompression to/from another stream |
| |
| There are also high-level APIs to make instantiating common kinds of streams |
| easier. |
| |
| High-Level API |
| -------------- |
| |
| Input Streams |
| ~~~~~~~~~~~~~ |
| |
| The :func:`~pyarrow.input_stream` function allows creating a readable |
| :class:`~pyarrow.NativeFile` from various kinds of sources. |
| |
| * If passed a :class:`~pyarrow.Buffer` or a ``memoryview`` object, a |
| :class:`~pyarrow.BufferReader` will be returned: |
| |
| .. ipython:: python |
| |
| buf = memoryview(b"some data") |
| stream = pa.input_stream(buf) |
| stream.read(4) |
| |
| * If passed a string or file path, it will open the given file on disk |
| for reading, creating a :class:`~pyarrow.OSFile`. Optionally, the file |
| can be compressed: if its filename ends with a recognized extension |
| such as ``.gz``, its contents will automatically be decompressed on |
| reading. |
| |
| .. ipython:: python |
| |
| import gzip |
| with gzip.open('example.gz', 'wb') as f: |
| f.write(b'some data\n' * 3) |
| |
| stream = pa.input_stream('example.gz') |
| stream.read() |
| |
| * If passed a Python file object, it will wrapped in a :class:`PythonFile` |
| such that the Arrow C++ libraries can read data from it (at the expense |
| of a slight overhead). |
| |
| Output Streams |
| ~~~~~~~~~~~~~~ |
| |
| :func:`~pyarrow.output_stream` is the equivalent function for output streams |
| and allows creating a writable :class:`~pyarrow.NativeFile`. It has the same |
| features as explained above for :func:`~pyarrow.input_stream`, such as being |
| able to write to buffers or do on-the-fly compression. |
| |
| .. ipython:: python |
| |
| with pa.output_stream('example1.dat') as stream: |
| stream.write(b'some data') |
| |
| f = open('example1.dat', 'rb') |
| f.read() |
| |
| |
| On-Disk and Memory Mapped Files |
| ------------------------------- |
| |
| PyArrow includes two ways to interact with data on disk: standard operating |
| system-level file APIs, and memory-mapped files. In regular Python we can |
| write: |
| |
| .. ipython:: python |
| |
| with open('example2.dat', 'wb') as f: |
| f.write(b'some example data') |
| |
| Using pyarrow's :class:`~pyarrow.OSFile` class, you can write: |
| |
| .. ipython:: python |
| |
| with pa.OSFile('example3.dat', 'wb') as f: |
| f.write(b'some example data') |
| |
| For reading files, you can use :class:`~pyarrow.OSFile` or |
| :class:`~pyarrow.MemoryMappedFile`. The difference between these is that |
| :class:`~pyarrow.OSFile` allocates new memory on each read, like Python file |
| objects. In reads from memory maps, the library constructs a buffer referencing |
| the mapped memory without any memory allocation or copying: |
| |
| .. ipython:: python |
| |
| file_obj = pa.OSFile('example2.dat') |
| mmap = pa.memory_map('example3.dat') |
| file_obj.read(4) |
| mmap.read(4) |
| |
| The ``read`` method implements the standard Python file ``read`` API. To read |
| into Arrow Buffer objects, use ``read_buffer``: |
| |
| .. ipython:: python |
| |
| mmap.seek(0) |
| buf = mmap.read_buffer(4) |
| print(buf) |
| buf.to_pybytes() |
| |
| Many tools in PyArrow, particular the Apache Parquet interface and the file and |
| stream messaging tools, are more efficient when used with these ``NativeFile`` |
| types than with normal Python file objects. |
| |
| .. ipython:: python |
| :suppress: |
| |
| buf = mmap = file_obj = None |
| !rm example.dat |
| !rm example2.dat |
| |
| In-Memory Reading and Writing |
| ----------------------------- |
| |
| To assist with serialization and deserialization of in-memory data, we have |
| file interfaces that can read and write to Arrow Buffers. |
| |
| .. ipython:: python |
| |
| writer = pa.BufferOutputStream() |
| writer.write(b'hello, friends') |
| |
| buf = writer.getvalue() |
| buf |
| buf.size |
| reader = pa.BufferReader(buf) |
| reader.seek(7) |
| reader.read(7) |
| |
| These have similar semantics to Python's built-in ``io.BytesIO``. |