<div id="main" class="col-md-9" role="main">
# Dataset file formats
<div class="ref-description section level2">
A `FileFormat` holds information about how to read and parse the files
included in a `Dataset`. There are subclasses corresponding to the
supported file formats (`ParquetFileFormat` and `IpcFileFormat`).
</div>
<div class="section level2">
## Factory
`FileFormat$create()` takes the following arguments:
- `format`: A string identifier of the file format. Currently
supported values:
- "parquet"
- "ipc"/"arrow"/"feather", all aliases for each other; for
Feather, note that only version 2 files are supported
- "csv"/"text", aliases for the same thing (because comma is the
default delimiter for text files)
- "tsv", equivalent to passing `format = "text", delimiter = "\t"`
- `...`: Additional format-specific options
`format = "parquet"`:
- `dict_columns`: Names of columns which should be read as
dictionaries.
- Any Parquet options from
[FragmentScanOptions](https://arrow.apache.org/docs/r/reference/FragmentScanOptions.md).
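As a rough illustration of `dict_columns`, the sketch below writes a small Parquet file with a string column and opens it with a format that asks for that column to be read as a dictionary. The data frame, the column names `id` and `grp`, and the file name are made up for the example, and it assumes an `arrow` build with dataset and Parquet support.

``` r
library(arrow)

# Hypothetical example data: one integer column and one string column
tf <- tempfile()
dir.create(tf)
df <- data.frame(id = 1:4, grp = c("a", "b", "a", "b"))
write_parquet(df, file.path(tf, "file1.parquet"))

# Ask for the string column to be read as a dictionary
format <- FileFormat$create("parquet", dict_columns = "grp")
ds <- open_dataset(tf, format = format)

# Inspect how the columns were typed
ds$schema

unlink(tf, recursive = TRUE)
```

Printing `ds$schema` is a quick way to check how `grp` was typed once the option is applied.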
`format = "text"`: see
[CsvParseOptions](https://arrow.apache.org/docs/r/reference/CsvReadOptions.md).
Note that you can specify them either with the Arrow C++ library
naming ("delimiter", "quoting", etc.) or the `readr`-style naming
used in `read_csv_arrow()` ("delim", "quote", etc.). Not all `readr`
options are currently supported; please file an issue if you
encounter one that `arrow` should support. The following options are
also supported.
From
[CsvReadOptions](https://arrow.apache.org/docs/r/reference/CsvReadOptions.md):
- `skip_rows`
- `column_names`. Note that if a
[Schema](https://arrow.apache.org/docs/r/reference/Schema-class.md)
is specified, `column_names` must match those specified in the
schema.
- `autogenerate_column_names`
From
[CsvFragmentScanOptions](https://arrow.apache.org/docs/r/reference/FragmentScanOptions.md)
(these values can be overridden at scan time):
- `convert_options`: a
[CsvConvertOptions](https://arrow.apache.org/docs/r/reference/CsvReadOptions.md)
- `block_size`
- `convert_options`: a
[CsvConvertOptions](https://arrow.apache.org/docs/r/reference/CsvReadOptions.md)
- `block_size`
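The two naming styles for text options can be compared with a small sketch like the one below. It creates one format with the Arrow C++ name (`delimiter`) and one with the `readr`-style name (`delim`), then opens the same pipe-delimited file with each. The file name and the choice of `|` as the delimiter are assumptions for the example.

``` r
library(arrow)

# Write a pipe-delimited copy of mtcars into a temporary directory
tf <- tempfile()
dir.create(tf)
write.table(mtcars, file.path(tf, "file1.txt"), sep = "|", row.names = FALSE)

# Arrow C++ naming
fmt_cpp <- FileFormat$create("text", delimiter = "|")
# readr-style naming
fmt_readr <- FileFormat$create("text", delim = "|")

ds_cpp <- open_dataset(tf, format = fmt_cpp)
ds_readr <- open_dataset(tf, format = fmt_readr)

# Both formats should detect the same columns
identical(ds_cpp$schema$names, ds_readr$schema$names)

unlink(tf, recursive = TRUE)
```

Use one style per call; the sketch deliberately avoids mixing the two in a single `FileFormat$create()`.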
`FileFormat$create()` returns the appropriate subclass of `FileFormat`
(e.g. `ParquetFileFormat`).
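To see which subclass the factory hands back, a minimal check along these lines may help. It only inspects the class of each returned object and assumes an `arrow` build with dataset support; the variable names are illustrative.

``` r
library(arrow)

# Each identifier is dispatched to a format-specific subclass
parquet_format <- FileFormat$create("parquet")
ipc_format <- FileFormat$create("feather") # alias of "ipc" and "arrow"
tsv_format <- FileFormat$create("tsv")

# Look at the most specific class of each object
class(parquet_format)[1]
class(ipc_format)[1]
class(tsv_format)[1]

# All of them are still FileFormat objects
inherits(tsv_format, "FileFormat")
```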
</div>
<div class="section level2">
## Examples
<div class="sourceCode">
``` r
## Semi-colon delimited files
# Set up directory for examples
tf <- tempfile()
dir.create(tf)
# Remove the temporary directory and its contents on exit
on.exit(unlink(tf, recursive = TRUE))
write.table(mtcars, file.path(tf, "file1.txt"), sep = ";", row.names = FALSE)
# Create FileFormat object
format <- FileFormat$create(format = "text", delimiter = ";")
open_dataset(tf, format = format)
#> FileSystemDataset with 1 csv file
#> 11 columns
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64
```
</div>
</div>
</div>