feat: add `dictionary_columns` to Arrow scans (#3461)
Closes #3170
## Rationale
Columns that contain large or frequently repeated string values (e.g.
JSON blobs, low-cardinality categoricals) can exhaust memory when
PyArrow loads them as plain string arrays, because every row carries
its own copy of the value. PyArrow's Parquet reader natively supports
dictionary-encoded reads via its `dictionary_columns` kwarg, which
stores each distinct value once alongside integer indices and can
substantially reduce peak memory usage for repetitive columns.
This feature was previously discussed in #3168; a prior implementation
(#3234) was closed as stale.
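
As a rough illustration of why dictionary encoding helps with repetitive columns, the sketch below encodes a list of strings into a list of distinct values plus integer indices. It is a conceptual, pure-Python model only; the helper names `dictionary_encode` and `dictionary_decode` and the sample payloads are made up for this example and do not reflect how PyArrow implements its dictionary arrays.

```python
from __future__ import annotations

from typing import Sequence


def dictionary_encode(values: Sequence[str]) -> tuple[list[str], list[int]]:
    """Return (dictionary, indices); each distinct value is stored once."""
    dictionary: list[str] = []
    positions: dict[str, int] = {}
    indices: list[int] = []
    for value in values:
        if value not in positions:
            # First occurrence: append to the dictionary and remember its slot.
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary: Sequence[str], indices: Sequence[int]) -> list[str]:
    """Rebuild the original values from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


# Hypothetical sample data: two distinct JSON blobs repeated many times.
payloads = ['{"status": "ok"}', '{"status": "error"}'] * 500
dictionary, indices = dictionary_encode(payloads)

# Characters stored as plain strings vs. characters stored in the dictionary.
plain_chars = sum(len(p) for p in payloads)
encoded_chars = sum(len(d) for d in dictionary)
print(f"distinct values: {len(dictionary)}, rows: {len(indices)}")
print(f"string characters stored: plain={plain_chars}, dictionary={encoded_chars}")
```

The indices still cost one integer per row, so the saving depends on how long and how repetitive the values are; a column of mostly unique strings gains little.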
## Changes
- Added `dictionary_columns: tuple[str, ...] = ()` to `Table.scan()`,
`TableScan.__init__`, and `StagedTable.scan()`.
- Forwarded through `DataScan.to_arrow()` and `to_arrow_batch_reader()`
→ `ArrowScan.__init__` → `_task_to_record_batches` →
`_get_file_format()`.
- Only applied when `task.file.file_format == FileFormat.PARQUET`;
silently ignored for ORC (which does not support this kwarg).
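
The Parquet-only rule above can be sketched as a small gating helper. This is an assumption-laden illustration, not the code in this change: `reader_kwargs` is a hypothetical name, the `FileFormat` enum is a local stand-in for PyIceberg's, and skipping the kwarg when the tuple is empty is an assumed detail.

```python
from __future__ import annotations

from enum import Enum
from typing import Any


class FileFormat(Enum):
    """Local stand-in for PyIceberg's FileFormat enum (illustration only)."""

    PARQUET = "PARQUET"
    ORC = "ORC"


def reader_kwargs(
    file_format: FileFormat,
    dictionary_columns: tuple[str, ...] = (),
) -> dict[str, Any]:
    """Hypothetical helper: build extra kwargs for the file-format reader."""
    kwargs: dict[str, Any] = {}
    # Only the Parquet reader understands dictionary_columns; for any other
    # format the option is dropped without raising an error.
    if file_format == FileFormat.PARQUET and dictionary_columns:
        kwargs["dictionary_columns"] = dictionary_columns
    return kwargs


print(reader_kwargs(FileFormat.PARQUET, ("payload",)))
print(reader_kwargs(FileFormat.ORC, ("payload",)))
```

Dropping the option silently keeps scans over mixed-format tables working, at the cost of giving no signal to a caller who expected dictionary-encoded output from an ORC file.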
## Usage
```python
# Read the "payload" column as dictionary-encoded to save memory.
# to_arrow() returns a pyarrow.Table.
arrow_table = table.scan(dictionary_columns=("payload",)).to_arrow()
```
## Verification
- Added `test_dictionary_columns_produces_dict_encoded_output`, which
confirms that the requested column is dictionary-encoded, non-requested
columns stay plain, and the values are identical.
- `make lint` ✓
- `pytest tests/table/ tests/io/test_pyarrow.py` ✓
---------
Co-authored-by: Gayathri Srividya Rajavarapu <gayathrir@Gayathris-MacBook-Air.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

PyIceberg is a Python library for programmatic access to Iceberg table metadata as well as to table data in Iceberg format. It is a Python implementation of the Iceberg table spec.
The documentation is available at https://py.iceberg.apache.org/.