r/man/dataset_factory.Rd - arrow - Git at Google

 % Generated by roxygen2: do not edit by hand
 % Please edit documentation in R/dataset-factory.R
 \name{dataset_factory}
 \alias{dataset_factory}
 \title{Create a DatasetFactory}
 \usage{
 dataset_factory(
   x,
   filesystem = NULL,
   format = c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text", "json"),
   partitioning = NULL,
   hive_style = NA,
   factory_options = list(),
   ...
 )
 }
 \arguments{
 \item{x}{A string path to a directory containing data files, a vector of one
 one or more string paths to data files, or a list of \code{DatasetFactory} objects
 whose datasets should be combined. If this argument is specified it will be
 used to construct a \code{UnionDatasetFactory} and other arguments will be
 ignored.}

 \item{filesystem}{A \link{FileSystem} object; if omitted, the \code{FileSystem} will
 be detected from \code{x}}

 \item{format}{A \link{FileFormat} object, or a string identifier of the format of
 the files in \code{x}. Currently supported values:
 \itemize{
 \item "parquet"
 \item "ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that
 only version 2 files are supported
 \item "csv"/"text", aliases for the same thing (because comma is the default
 delimiter for text files
 \item "tsv", equivalent to passing \verb{format = "text", delimiter = "\\t"}
 }

 Default is "parquet", unless a \code{delimiter} is also specified, in which case
 it is assumed to be "text".}

 \item{partitioning}{One of
 \itemize{
 \item A \code{Schema}, in which case the file paths relative to \code{sources} will be
 parsed, and path segments will be matched with the schema fields. For
 example, \code{schema(year = int16(), month = int8())} would create partitions
 for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.
 \item A character vector that defines the field names corresponding to those
 path segments (that is, you're providing the names that would correspond
 to a \code{Schema} but the types will be autodetected)
 \item A \code{HivePartitioning} or \code{HivePartitioningFactory}, as returned
 by \code{\link[=hive_partition]{hive_partition()}} which parses explicit or autodetected fields from
 Hive-style path segments
 \item \code{NULL} for no partitioning
 }}

 \item{hive_style}{Logical: if \code{partitioning} is a character vector or a
 \code{Schema}, should it be interpreted as specifying Hive-style partitioning?
 Default is \code{NA}, which means to inspect the file paths for Hive-style
 partitioning and behave accordingly.}

 \item{factory_options}{list of optional FileSystemFactoryOptions:
 \itemize{
 \item \code{partition_base_dir}: string path segment prefix to ignore when
 discovering partition information with DirectoryPartitioning. Not
 meaningful (ignored with a warning) for HivePartitioning, nor is it
 valid when providing a vector of file paths.
 \item \code{exclude_invalid_files}: logical: should files that are not valid data
 files be excluded? Default is \code{FALSE} because checking all files up
 front incurs I/O and thus will be slower, especially on remote
 filesystems. If false and there are invalid files, there will be an
 error at scan time. This is the only FileSystemFactoryOption that is
 valid for both when providing a directory path in which to discover
 files and when providing a vector of file paths.
 \item \code{selector_ignore_prefixes}: character vector of file prefixes to ignore
 when discovering files in a directory. If invalid files can be excluded
 by a common filename prefix this way, you can avoid the I/O cost of
 \code{exclude_invalid_files}. Not valid when providing a vector of file paths
 (but if you're providing the file list, you can filter invalid files
 yourself).
 }}

 \item{...}{Additional format-specific options, passed to
 \code{\link[=FileFormat]{FileFormat$create()}}. For CSV options, note that you can specify them either
 with the Arrow C++ library naming ("delimiter", "quoting", etc.) or the
 \code{readr}-style naming used in \code{\link[=read_csv_arrow]{read_csv_arrow()}} ("delim", "quote", etc.).
 Not all \code{readr} options are currently supported; please file an issue if you
 encounter one that \code{arrow} should support.}
 }
 \value{
 A \code{DatasetFactory} object. Pass this to \code{\link[=open_dataset]{open_dataset()}},
 in a list potentially with other \code{DatasetFactory} objects, to create
 a \code{Dataset}.
 }
 \description{
 A \link{Dataset} can constructed using one or more \link{DatasetFactory}s.
 This function helps you construct a \code{DatasetFactory} that you can pass to
 \code{\link[=open_dataset]{open_dataset()}}.
 }
 \details{
 If you would only have a single \code{DatasetFactory} (for example, you have a
 single directory containing Parquet files), you can call \code{open_dataset()}
 directly. Use \code{dataset_factory()} when you
 want to combine different directories, file systems, or file formats.
 }
	% Generated by roxygen2: do not edit by hand
	% Please edit documentation in R/dataset-factory.R
	\name{dataset_factory}
	\alias{dataset_factory}
	\title{Create a DatasetFactory}
	\usage{
	dataset_factory(
	x,
	filesystem = NULL,
	format = c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text", "json"),
	partitioning = NULL,
	hive_style = NA,
	factory_options = list(),
	...
	)
	}
	\arguments{
	\item{x}{A string path to a directory containing data files, a vector of one
	one or more string paths to data files, or a list of \code{DatasetFactory} objects
	whose datasets should be combined. If this argument is specified it will be
	used to construct a \code{UnionDatasetFactory} and other arguments will be
	ignored.}

	\item{filesystem}{A \link{FileSystem} object; if omitted, the \code{FileSystem} will
	be detected from \code{x}}

	\item{format}{A \link{FileFormat} object, or a string identifier of the format of
	the files in \code{x}. Currently supported values:
	\itemize{
	\item "parquet"
	\item "ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that
	only version 2 files are supported
	\item "csv"/"text", aliases for the same thing (because comma is the default
	delimiter for text files
	\item "tsv", equivalent to passing \verb{format = "text", delimiter = "\\t"}
	}

	Default is "parquet", unless a \code{delimiter} is also specified, in which case
	it is assumed to be "text".}

	\item{partitioning}{One of
	\itemize{
	\item A \code{Schema}, in which case the file paths relative to \code{sources} will be
	parsed, and path segments will be matched with the schema fields. For
	example, \code{schema(year = int16(), month = int8())} would create partitions
	for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.
	\item A character vector that defines the field names corresponding to those
	path segments (that is, you're providing the names that would correspond
	to a \code{Schema} but the types will be autodetected)
	\item A \code{HivePartitioning} or \code{HivePartitioningFactory}, as returned
	by \code{\link[=hive_partition]{hive_partition()}} which parses explicit or autodetected fields from
	Hive-style path segments
	\item \code{NULL} for no partitioning
	}}

	\item{hive_style}{Logical: if \code{partitioning} is a character vector or a
	\code{Schema}, should it be interpreted as specifying Hive-style partitioning?
	Default is \code{NA}, which means to inspect the file paths for Hive-style
	partitioning and behave accordingly.}

	\item{factory_options}{list of optional FileSystemFactoryOptions:
	\itemize{
	\item \code{partition_base_dir}: string path segment prefix to ignore when
	discovering partition information with DirectoryPartitioning. Not
	meaningful (ignored with a warning) for HivePartitioning, nor is it
	valid when providing a vector of file paths.
	\item \code{exclude_invalid_files}: logical: should files that are not valid data
	files be excluded? Default is \code{FALSE} because checking all files up
	front incurs I/O and thus will be slower, especially on remote
	filesystems. If false and there are invalid files, there will be an
	error at scan time. This is the only FileSystemFactoryOption that is
	valid for both when providing a directory path in which to discover
	files and when providing a vector of file paths.
	\item \code{selector_ignore_prefixes}: character vector of file prefixes to ignore
	when discovering files in a directory. If invalid files can be excluded
	by a common filename prefix this way, you can avoid the I/O cost of
	\code{exclude_invalid_files}. Not valid when providing a vector of file paths
	(but if you're providing the file list, you can filter invalid files
	yourself).
	}}

	\item{...}{Additional format-specific options, passed to
	\code{\link[=FileFormat]{FileFormat$create()}}. For CSV options, note that you can specify them either
	with the Arrow C++ library naming ("delimiter", "quoting", etc.) or the
	\code{readr}-style naming used in \code{\link[=read_csv_arrow]{read_csv_arrow()}} ("delim", "quote", etc.).
	Not all \code{readr} options are currently supported; please file an issue if you
	encounter one that \code{arrow} should support.}
	}
	\value{
	A \code{DatasetFactory} object. Pass this to \code{\link[=open_dataset]{open_dataset()}},
	in a list potentially with other \code{DatasetFactory} objects, to create
	a \code{Dataset}.
	}
	\description{
	A \link{Dataset} can constructed using one or more \link{DatasetFactory}s.
	This function helps you construct a \code{DatasetFactory} that you can pass to
	\code{\link[=open_dataset]{open_dataset()}}.
	}
	\details{
	If you would only have a single \code{DatasetFactory} (for example, you have a
	single directory containing Parquet files), you can call \code{open_dataset()}
	directly. Use \code{dataset_factory()} when you
	want to combine different directories, file systems, or file formats.
	}