| % Generated by roxygen2: do not edit by hand |
| % Please edit documentation in R/dataset.R, R/dataset-factory.R |
| \name{Dataset} |
| \alias{Dataset} |
| \alias{FileSystemDataset} |
| \alias{UnionDataset} |
| \alias{InMemoryDataset} |
| \alias{DatasetFactory} |
| \alias{FileSystemDatasetFactory} |
| \title{Multi-file datasets} |
| \description{ |
| Arrow Datasets allow you to query data that has been split across |
| multiple files. This sharding of data may indicate partitioning, which |
| can accelerate queries that touch only some partitions (files). |
| |
| A \code{Dataset} contains one or more \code{Fragments}, such as files, of potentially |
| differing type and partitioning. |
| |
| For \code{Dataset$create()}, see \code{\link[=open_dataset]{open_dataset()}}, which is an alias for it. |
| |
| \code{DatasetFactory} is used to provide finer control over the creation of \code{Dataset}s. |
| } |
| \section{Factory}{ |
| |
| \code{DatasetFactory} is used to create a \code{Dataset}, inspect the \link{Schema} of the |
| fragments contained in it, and declare a partitioning. |
| \code{FileSystemDatasetFactory} is a subclass of \code{DatasetFactory} for |
| discovering files in the local file system, the only currently supported |
| file system. |
| |
| For the \code{DatasetFactory$create()} factory method, see \code{\link[=dataset_factory]{dataset_factory()}}, an |
| alias for it. A \code{DatasetFactory} has the following methods: |
| \itemize{ |
| \item \verb{$Inspect(unify_schemas)}: If \code{unify_schemas} is \code{TRUE}, all fragments |
| will be scanned and a unified \link{Schema} will be created from them; if \code{FALSE} |
| (default), only the first fragment will be inspected for its schema. Use this |
| fast path when you know and trust that all fragments have an identical schema. |
| \item \verb{$Finish(schema, unify_schemas)}: Returns a \code{Dataset}. If \code{schema} is provided, |
| it will be used for the \code{Dataset}; if omitted, a \code{Schema} will be created from |
| inspecting the fragments (files) in the dataset, following \code{unify_schemas} |
| as described above. |
| } |
| |
| \code{FileSystemDatasetFactory$create()} is a lower-level factory method and |
| takes the following arguments: |
| \itemize{ |
| \item \code{filesystem}: A \link{FileSystem} |
| \item \code{selector}: Either a \link{FileSelector} or \code{NULL} |
| \item \code{paths}: Either a character vector of file paths or \code{NULL} |
| \item \code{format}: A \link{FileFormat} |
| \item \code{partitioning}: Either \code{Partitioning}, \code{PartitioningFactory}, or \code{NULL} |
| } |
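
The factory workflow described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes an arrow build with Parquet support and with \code{write_dataset()} and \code{hive_partition()} available, and the temporary directory and the \code{mtcars} data are used purely for demonstration.

```r
library(arrow)

# Write a small Hive-style partitioned dataset to a temporary directory
# (illustrative data only; write_dataset() is assumed to be available).
tmp <- tempfile()
write_dataset(mtcars, tmp, partitioning = "cyl")

# Create a factory that discovers the files and detects Hive-style partitions.
factory <- dataset_factory(tmp, partitioning = hive_partition())

# Scan all fragments and build a unified Schema from them.
sch <- factory$Inspect(unify_schemas = TRUE)

# Finish the factory with that schema to obtain a Dataset.
ds <- factory$Finish(schema = sch)
```

Passing \code{unify_schemas = TRUE} trades discovery speed for safety; when all files are known to share one schema, the default of inspecting only the first fragment avoids scanning every file.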
| } |
| |
| \section{Methods}{ |
| |
| |
| A \code{Dataset} has the following methods: |
| \itemize{ |
| \item \verb{$NewScan()}: Returns a \link{ScannerBuilder} for building a query |
| \item \verb{$schema}: Active binding that returns the \link{Schema} of the Dataset; you |
| may also replace the dataset's schema by using \code{ds$schema <- new_schema}. |
| Replacement currently supports only adding, removing, or reordering |
| fields in the schema: you cannot alter or cast the field types. |
| } |
| |
| \code{FileSystemDataset} has the following methods: |
| \itemize{ |
| \item \verb{$files}: Active binding, returns the files of the \code{FileSystemDataset} |
| \item \verb{$format}: Active binding, returns the \link{FileFormat} of the \code{FileSystemDataset} |
| } |
| |
| \code{UnionDataset} has the following methods: |
| \itemize{ |
| \item \verb{$children}: Active binding, returns all child \code{Dataset}s. |
| } |
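
A short sketch of how these methods and active bindings might be used together. It assumes an arrow build with Parquet support and with \code{write_dataset()} available; the \code{mtcars} data and the temporary path are illustrative only, and the \code{$Project()}, \code{$Finish()}, and \code{$ToTable()} calls belong to \link{ScannerBuilder} and \code{Scanner}, not to \code{Dataset} itself.

```r
library(arrow)

# Build a small partitioned dataset to inspect (illustrative data only).
tmp <- tempfile()
write_dataset(mtcars, tmp, partitioning = "cyl")
ds <- open_dataset(tmp)

# Active bindings: the schema, the underlying files, and the file format.
sch <- ds$schema
files <- ds$files
fmt <- ds$format

# Build a query: select two columns, then materialize the result as a Table.
builder <- ds$NewScan()
builder$Project(c("mpg", "cyl"))
tab <- builder$Finish()$ToTable()
```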
| } |
| |
| \seealso{ |
| \code{\link[=open_dataset]{open_dataset()}} for a simple interface for creating a \code{Dataset} |
| } |