docs/dev/r/reference/FileFormat.md - arrow-site - Git at Google

 <div id="main" class="col-md-9" role="main">

 # Dataset file formats

 <div class="ref-description section level2">

 A `FileFormat` holds information about how to read and parse the files
 included in a `Dataset`. There are subclasses corresponding to the
 supported file formats (`ParquetFileFormat` and `IpcFileFormat`).

 </div>

 <div class="section level2">

 ## Factory

 `FileFormat$create()` takes the following arguments:

 -   `format`: A string identifier of the file format. Currently
     supported values:

     -   "parquet"

     -   "ipc"/"arrow"/"feather", all aliases for each other; for
         Feather, note that only version 2 files are supported

     -   "csv"/"text", aliases for the same thing (because comma is the
         default delimiter for text files

     -   "tsv", equivalent to passing `format = "text", delimiter = "\t"`

 -   `...`: Additional format-specific options

     `format = "parquet"`:

     -   `dict_columns`: Names of columns which should be read as
         dictionaries.

     -   Any Parquet options from
         [FragmentScanOptions](https://arrow.apache.org/docs/r/reference/FragmentScanOptions.md).

     `format = "text"`: see
     [CsvParseOptions](https://arrow.apache.org/docs/r/reference/CsvReadOptions.md).
     Note that you can specify them either with the Arrow C++ library
     naming ("delimiter", "quoting", etc.) or the `readr`-style naming
     used in `read_csv_arrow()` ("delim", "quote", etc.). Not all `readr`
     options are currently supported; please file an issue if you
     encounter one that `arrow` should support. Also, the following
     options are supported. From
     [CsvReadOptions](https://arrow.apache.org/docs/r/reference/CsvReadOptions.md):

     -   `skip_rows`

     -   `column_names`. Note that if a
         [Schema](https://arrow.apache.org/docs/r/reference/Schema-class.md)
         is specified, `column_names` must match those specified in the
         schema.

     -   `autogenerate_column_names` From
         [CsvFragmentScanOptions](https://arrow.apache.org/docs/r/reference/FragmentScanOptions.md)
         (these values can be overridden at scan time):

     -   `convert_options`: a
         [CsvConvertOptions](https://arrow.apache.org/docs/r/reference/CsvReadOptions.md)

     -   `block_size`

 It returns the appropriate subclass of `FileFormat` (e.g.
 `ParquetFileFormat`)

 </div>

 <div class="section level2">

 ## Examples

 <div class="sourceCode">

 ``` r
 ## Semi-colon delimited files
 # Set up directory for examples
 tf <- tempfile()
 dir.create(tf)
 on.exit(unlink(tf))
 write.table(mtcars, file.path(tf, "file1.txt"), sep = ";", row.names = FALSE)

 # Create FileFormat object
 format <- FileFormat$create(format = "text", delimiter = ";")

 open_dataset(tf, format = format)
 #> FileSystemDataset with 1 csv file
 #> 11 columns
 #> mpg: double
 #> cyl: int64
 #> disp: double
 #> hp: int64
 #> drat: double
 #> wt: double
 #> qsec: double
 #> vs: int64
 #> am: int64
 #> gear: int64
 #> carb: int64
 ```

 </div>

 </div>

 </div>
	<div id="main" class="col-md-9" role="main">

	# Dataset file formats

	<div class="ref-description section level2">

	A `FileFormat` holds information about how to read and parse the files
	included in a `Dataset`. There are subclasses corresponding to the
	supported file formats (`ParquetFileFormat` and `IpcFileFormat`).

	</div>

	<div class="section level2">

	## Factory

	`FileFormat$create()` takes the following arguments:

	- `format`: A string identifier of the file format. Currently
	supported values:

	- "parquet"

	- "ipc"/"arrow"/"feather", all aliases for each other; for
	Feather, note that only version 2 files are supported

	- "csv"/"text", aliases for the same thing (because comma is the
	default delimiter for text files

	- "tsv", equivalent to passing `format = "text", delimiter = "\t"`

	- `...`: Additional format-specific options

	`format = "parquet"`:

	- `dict_columns`: Names of columns which should be read as
	dictionaries.

	- Any Parquet options from
	[FragmentScanOptions](https://arrow.apache.org/docs/r/reference/FragmentScanOptions.md).

	`format = "text"`: see
	[CsvParseOptions](https://arrow.apache.org/docs/r/reference/CsvReadOptions.md).
	Note that you can specify them either with the Arrow C++ library
	naming ("delimiter", "quoting", etc.) or the `readr`-style naming
	used in `read_csv_arrow()` ("delim", "quote", etc.). Not all `readr`
	options are currently supported; please file an issue if you
	encounter one that `arrow` should support. Also, the following
	options are supported. From
	[CsvReadOptions](https://arrow.apache.org/docs/r/reference/CsvReadOptions.md):

	- `skip_rows`

	- `column_names`. Note that if a
	[Schema](https://arrow.apache.org/docs/r/reference/Schema-class.md)
	is specified, `column_names` must match those specified in the
	schema.

	- `autogenerate_column_names` From
	[CsvFragmentScanOptions](https://arrow.apache.org/docs/r/reference/FragmentScanOptions.md)
	(these values can be overridden at scan time):

	- `convert_options`: a
	[CsvConvertOptions](https://arrow.apache.org/docs/r/reference/CsvReadOptions.md)

	- `block_size`

	It returns the appropriate subclass of `FileFormat` (e.g.
	`ParquetFileFormat`)

	</div>

	<div class="section level2">

	## Examples

	<div class="sourceCode">

	``` r
	## Semi-colon delimited files
	# Set up directory for examples
	tf <- tempfile()
	dir.create(tf)
	on.exit(unlink(tf))
	write.table(mtcars, file.path(tf, "file1.txt"), sep = ";", row.names = FALSE)

	# Create FileFormat object
	format <- FileFormat$create(format = "text", delimiter = ";")

	open_dataset(tf, format = format)
	#> FileSystemDataset with 1 csv file
	#> 11 columns
	#> mpg: double
	#> cyl: int64
	#> disp: double
	#> hp: int64
	#> drat: double
	#> wt: double
	#> qsec: double
	#> vs: int64
	#> am: int64
	#> gear: int64
	#> carb: int64
	```

	</div>

	</div>

	</div>