feat: add `dictionary_columns` to Arrow scans (#3461)

Closes #3170

## Rationale

Columns that contain large or frequently repeated string values (e.g.
JSON blobs, low-cardinality categoricals) can exhaust memory when
PyArrow loads them as plain string arrays. PyArrow's Parquet reader
natively supports dictionary-encoded reads via its `dictionary_columns`
kwarg, which stores each distinct value once and refers to it by index.
For columns with many repeated values this can substantially reduce
peak memory usage.
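
As a rough illustration of why dictionary encoding helps, the sketch
below encodes a list of repeated strings into a list of distinct values
plus integer indices. It is pure Python and only mirrors the idea; the
helper names `dictionary_encode` and `dictionary_decode` are
hypothetical and are not part of PyArrow or PyIceberg.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) with dictionary[indices[i]] == values[i]."""
    dictionary = []   # each distinct value, stored once
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one small integer per input row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original list of values from the encoded form."""
    return [dictionary[i] for i in indices]


# Illustrative data: three of the four rows share the same JSON payload.
payloads = [
    '{"status": "ok"}',
    '{"status": "error"}',
    '{"status": "ok"}',
    '{"status": "ok"}',
]
dictionary, indices = dictionary_encode(payloads)
```

Here four rows collapse to two stored strings, and the gain grows with
the length and repetition of the values.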

This was previously discussed in #3168, and a prior implementation
(#3234) was closed as stale.

## Changes

- Added `dictionary_columns: tuple[str, ...] = ()` to `Table.scan()`,
`TableScan.__init__`, and `StagedTable.scan()`.
- Forwarded the value through `DataScan.to_arrow()` and
`to_arrow_batch_reader()` → `ArrowScan.__init__` →
`_task_to_record_batches` → `_get_file_format()`.
- Only applied when `task.file.file_format == FileFormat.PARQUET`;
silently ignored for ORC (which does not support this kwarg).
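
The format gating described in the last bullet could be sketched
roughly as follows. This is not the code from this change: the
`FileFormat` enum here is a local stand-in and `reader_kwargs` is a
hypothetical helper. It also assumes that an empty tuple means "no
dictionary columns requested".

```python
from enum import Enum


class FileFormat(Enum):
    """Local stand-in for the file format enum (assumption, not imported)."""
    PARQUET = "PARQUET"
    ORC = "ORC"


def reader_kwargs(file_format, dictionary_columns=()):
    """Build reader kwargs, passing dictionary_columns only for Parquet."""
    kwargs = {}
    if file_format == FileFormat.PARQUET and dictionary_columns:
        # Only the Parquet reader is given this kwarg.
        kwargs["dictionary_columns"] = dictionary_columns
    # For ORC the argument is dropped without raising an error.
    return kwargs


parquet_kwargs = reader_kwargs(FileFormat.PARQUET, ("payload",))
orc_kwargs = reader_kwargs(FileFormat.ORC, ("payload",))
```

With this shape, a scan over a table that mixes Parquet and ORC files
can pass the same argument for every task.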

## Usage

```python
# Read the "payload" column as dictionary-encoded to save memory
df = table.scan(dictionary_columns=("payload",)).to_arrow()
```

## Verification

- Added `test_dictionary_columns_produces_dict_encoded_output` —
confirms the requested column is dict-encoded, non-requested columns are
plain, and values are identical.
- `make lint` ✓
- `pytest tests/table/ tests/io/test_pyarrow.py` ✓

---------

Co-authored-by: Gayathri Srividya Rajavarapu <gayathrir@Gayathris-MacBook-Air.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
3 files changed
tree: a5cf047ec7ac795d6e8288152ffc584bdb71a061
README.md

Iceberg Python

PyIceberg is a Python library for programmatic access to Iceberg table metadata as well as to table data in Iceberg format. It is a Python implementation of the Iceberg table spec.

The documentation is available at https://py.iceberg.apache.org/.

Get in Touch