commit: 6da06adfa82eda8d647060632115e75a35634b87
author: GayathriSrividya <gayathrirajavarapu7@gmail.com>
Thu Jun 18 12:24:10 2026 +0530
committer: GitHub <noreply@github.com>
Thu Jun 18 08:54:10 2026 +0200
tree: a5cf047ec7ac795d6e8288152ffc584bdb71a061
parent: a1e12ad7d9ac253e9c134fc53feb2c7cd1031079 [diff]

feat: add `dictionary_columns` to Arrow scans (#3461)

Closes #3170

## Rationale

Columns that contain large or frequently repeated string values (e.g. JSON blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them as plain string arrays. PyArrow's Parquet reader natively supports dictionary-encoded reads via its `dictionary_columns` kwarg, which deduplicates values and can dramatically reduce peak memory usage.

This was previously discussed in #3168 and a prior implementation (#3234) was closed as stale.

## Changes

- Added `dictionary_columns: tuple[str, ...] = ()` to `Table.scan()`, `TableScan.__init__`, and `StagedTable.scan()`.
- Forwarded through `DataScan.to_arrow()` and `to_arrow_batch_reader()` → `ArrowScan.__init__` → `_task_to_record_batches` → `_get_file_format()`.
- Only applied when `task.file.file_format == FileFormat.PARQUET`; silently ignored for ORC (which does not support this kwarg).

## Usage

```python
# Read the "payload" column as dictionary-encoded to save memory
df = table.scan(dictionary_columns=("payload",)).to_arrow()
```

## Verification

- Added `test_dictionary_columns_produces_dict_encoded_output`: confirms the requested column is dict-encoded, non-requested columns are plain, and values are identical.
- `make lint` ✓
- `pytest tests/table/ tests/io/test_pyarrow.py` ✓

---------

Co-authored-by: Gayathri Srividya Rajavarapu <gayathrir@Gayathris-MacBook-Air.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
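
To illustrate why a dictionary-encoded read can save memory for repeated values, here is a minimal conceptual sketch in plain Python. It is not PyArrow's or PyIceberg's implementation; the helper names `dictionary_encode` and `dictionary_decode` and the sample `payload` values are hypothetical and exist only for this illustration.

```python
def dictionary_encode(values):
    """Split values into (dictionary, indices) so that
    dictionary[indices[i]] == values[i] for every position i."""
    dictionary = []   # each distinct value stored once, in first-seen order
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one small integer per original row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original sequence from its dictionary-encoded form."""
    return [dictionary[i] for i in indices]


# Hypothetical low-cardinality column: four rows, two distinct strings.
payload = [
    '{"status": "ok"}',
    '{"status": "error"}',
    '{"status": "ok"}',
    '{"status": "ok"}',
]

dictionary, indices = dictionary_encode(payload)
restored = dictionary_decode(dictionary, indices)
```

Each distinct string is kept once and every row holds only an integer index, so the saving grows with the number of repeats and the length of the repeated values. A column whose values are mostly unique gains little from this layout, which is presumably why the option is opt-in per column.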

README.md

Iceberg Python

PyIceberg is a Python library for programmatic access to Iceberg table metadata as well as to table data in Iceberg format. It is a Python implementation of the Iceberg table spec.

The documentation is available at https://py.iceberg.apache.org/.

Get in Touch

Iceberg community