<div id="main" class="col-md-9" role="main">
# Define Partitioning for a Dataset
<div class="ref-description section level2">
Pass a `Partitioning` object to a
[FileSystemDatasetFactory](https://arrow.apache.org/docs/r/reference/Dataset.md)'s
`$create()` method to indicate how the file paths should be
interpreted to define partitioning.
`DirectoryPartitioning` describes how to interpret raw path segments, in
order. For example, `schema(year = int16(), month = int8())` would
define partitions for file paths like "2019/01/file.parquet",
"2019/02/file.parquet", etc. In this scheme, `NULL` values are
skipped. In the previous example, if the month were `NA` (or `NULL`)
when writing a dataset, the files would be placed in "2019/file.parquet".
When reading, the rows in "2019/file.parquet" would have `NA` in
the month column. An error is raised if an outer directory is
`NULL` and an inner directory is not.
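
As a minimal sketch of the scheme described above, the following writes a small dataset with bare directory names and reads it back with a `DirectoryPartitioning`. The data frame, its `value` column, and the temporary directory are hypothetical and exist only for illustration.

```r
library(arrow)

# Hypothetical example data: two partition columns plus one value column.
df <- data.frame(
  year = c(2019L, 2019L),
  month = c(1L, 2L),
  value = c(1.5, 2.5)
)

# Write with bare directory names (no "key=" prefixes) into a temp directory.
root <- tempfile("dir_partitioned_")
write_dataset(df, root, partitioning = c("year", "month"), hive_style = FALSE)
files <- list.files(root, recursive = TRUE)

# Interpret the first path segment as year and the second as month.
part <- DirectoryPartitioning$create(schema(year = int16(), month = int8()))
ds <- open_dataset(root, partitioning = part)
field_names <- ds$schema$names
```

Printing `files` shows the relative paths that were written, and `field_names` lists the partition fields alongside the columns stored in the files.
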
`HivePartitioning` is for Hive-style partitioning, which embeds field
names and values in path segments, such as
"/year=2019/month=2/data.parquet". Because fields are named in the path
segments, order does not matter. This partitioning scheme allows `NULL`
values. When writing, they are replaced by a configurable `null_fallback`,
which defaults to the string `"__HIVE_DEFAULT_PARTITION__"`. When
reading, the `null_fallback` string is replaced with `NA`s as
appropriate.
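
The round trip of missing values through Hive-style paths can be sketched as follows. The data frame and temporary directory are hypothetical, and the sketch assumes `write_dataset()` is left at its default Hive-style layout and default `null_fallback`.

```r
library(arrow)

# Hypothetical example data with one missing month.
df <- data.frame(
  year = c(2019L, 2019L, 2019L),
  month = c(1L, 2L, NA),
  value = c(1.5, 2.5, 3.5)
)

# Hive-style directories ("year=.../month=...") are the default layout.
root <- tempfile("hive_partitioned_")
write_dataset(df, root, partitioning = c("year", "month"))
files <- list.files(root, recursive = TRUE)

# Read back with an explicit Hive partitioning schema.
part <- hive_partition(year = int16(), month = int8())
ds <- open_dataset(root, partitioning = part)

# Materialize the dataset so the partition columns can be inspected.
out <- as.data.frame(Scanner$create(ds)$ToTable())
```

Inspecting `files` shows where the row with the missing month was written, and `out$month` shows how that path segment is read back.
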
`PartitioningFactory` subclasses instruct the `DatasetFactory` to detect
partition features from the file paths.
</div>
<div class="section level2">
## Factory
The `DirectoryPartitioning$create()` and `HivePartitioning$create()`
methods both take a
[Schema](https://arrow.apache.org/docs/r/reference/Schema-class.md) as a
single input argument. The helper function `hive_partition(...)` is
shorthand for `HivePartitioning$create(schema(...))`.
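
A brief sketch of the constructors described above, using the `year`/`month` schema from the earlier example. The variable names are illustrative only.

```r
library(arrow)

# One schema shared by both explicit constructors.
sch <- schema(year = int16(), month = int8())

dir_part <- DirectoryPartitioning$create(sch)
hive_explicit <- HivePartitioning$create(sch)

# Shorthand: the fields are passed directly instead of a Schema object.
hive_short <- hive_partition(year = int16(), month = int8())
```

Either Hive object can then be supplied wherever a `Partitioning` is expected.
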
With `DirectoryPartitioningFactory$create()`, you can provide just the
names of the path segments (in our example, `c("year", "month")`), and
the `DatasetFactory` will infer the data types for those partition
variables. `HivePartitioningFactory$create()` takes no arguments: both
variable names and their types can be inferred from the file paths.
`hive_partition()` with no arguments returns a
`HivePartitioningFactory`.
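
The factory variants can be sketched like this, letting the types (and, for Hive-style paths, the names) be inferred when the dataset is opened. The example data and temporary directories are hypothetical, and passing a factory to `open_dataset()` through its `partitioning` argument is one possible way to use it.

```r
library(arrow)

# Hypothetical example data written in both layouts.
df <- data.frame(
  year = c(2019L, 2019L),
  month = c(1L, 2L),
  value = c(1.5, 2.5)
)
dir_root <- tempfile("dir_")
hive_root <- tempfile("hive_")
write_dataset(df, dir_root, partitioning = c("year", "month"), hive_style = FALSE)
write_dataset(df, hive_root, partitioning = c("year", "month"))

# Directory layout: only the segment names are given; types are inferred.
dir_factory <- DirectoryPartitioningFactory$create(c("year", "month"))
dir_ds <- open_dataset(dir_root, partitioning = dir_factory)

# Hive layout: names and types both come from the "key=value" segments.
hive_factory <- hive_partition()
hive_ds <- open_dataset(hive_root, partitioning = hive_factory)

dir_fields <- dir_ds$schema$names
hive_fields <- hive_ds$schema$names
```

Printing `dir_ds$schema` and `hive_ds$schema` shows which types were inferred for the partition fields.
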
</div>
</div>