| % Generated by roxygen2: do not edit by hand |
| % Please edit documentation in R/dataset.R |
| \name{open_dataset} |
| \alias{open_dataset} |
| \title{Open a multi-file dataset} |
| \usage{ |
| open_dataset( |
| sources, |
| schema = NULL, |
| partitioning = hive_partition(), |
| unify_schemas = NULL, |
| ... |
| ) |
| } |
| \arguments{ |
| \item{sources}{One of: |
| \itemize{ |
| \item a string path or URI to a directory containing data files |
| \item a string path or URI to a single file |
| \item a character vector of paths or URIs to individual data files |
| \item a list of \code{Dataset} objects as created by this function |
| \item a list of \code{DatasetFactory} objects as created by \code{\link[=dataset_factory]{dataset_factory()}}. |
| } |
| |
| When \code{sources} is a vector of file URIs, they must all use the same protocol, |
| point to files in the same file system, and have the same |
| format.} |
| |
| \item{schema}{\link{Schema} for the \code{Dataset}. If \code{NULL} (the default), the schema |
| will be inferred from the data sources.} |
| |
| \item{partitioning}{When \code{sources} is a directory path/URI, one of: |
| \itemize{ |
| \item a \code{Schema}, in which case the file paths relative to \code{sources} will be |
| parsed, and path segments will be matched with the schema fields. For |
| example, \code{schema(year = int16(), month = int8())} would create partitions |
| for file paths like \code{"2019/01/file.parquet"}, \code{"2019/02/file.parquet"}, |
| etc. |
| \item a character vector that defines the field names corresponding to those |
| path segments (that is, you provide the names that would appear in a |
| \code{Schema}, but the types are autodetected) |
| \item a \code{HivePartitioning} or \code{HivePartitioningFactory}, as returned |
| by \code{\link[=hive_partition]{hive_partition()}}, which parses explicit or autodetected fields from |
| Hive-style path segments |
| \item \code{NULL} for no partitioning |
| } |
| |
| The default is to autodetect Hive-style partitions. When \code{sources} is not a |
| directory path/URI, \code{partitioning} is ignored.} |
| |
| \item{unify_schemas}{logical: should all data fragments (files, \code{Dataset}s) |
| be scanned in order to create a unified schema from them? If \code{FALSE}, only |
| the first fragment will be inspected for its schema. Use this fast path |
| when you know and trust that all fragments have an identical schema. |
| The default is \code{FALSE} when creating a dataset from a directory path/URI or |
| vector of file paths/URIs (because there may be many files and scanning may |
| be slow) but \code{TRUE} when \code{sources} is a list of \code{Dataset}s (because there |
| should be few \code{Dataset}s in the list and their \code{Schema}s are already in |
| memory).} |
| |
| \item{...}{additional arguments passed to \code{dataset_factory()} when \code{sources} |
| is a directory path/URI or vector of file paths/URIs, otherwise ignored. |
| These may include \code{format} to indicate the file format, or other |
| format-specific options.} |
| } |
| \value{ |
| A \link{Dataset} R6 object. Use \code{dplyr} methods on it to query the data, |
| or call \code{\link[=Scanner]{$NewScan()}} to construct a query directly. |
| } |
| \description{ |
| Arrow Datasets allow you to query data that has been split across |
| multiple files. This sharding of data may indicate partitioning, which |
| can accelerate queries that touch only some partitions (files). Call |
| \code{open_dataset()} to point to a directory of data files and return a |
| \code{Dataset}, then use \code{dplyr} methods to query it. |
| } |
| \seealso{ |
| \code{vignette("dataset", package = "arrow")} |
| } |
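| \examples{ |
| \dontrun{ |
| # A minimal sketch, not a definitive recipe. It assumes the arrow and dplyr |
| # packages are installed and that arrow was built with dataset and Parquet |
| # support. The temporary directories and the use of mtcars are illustrative |
| # choices, not part of the documented interface. |
| library(arrow) |
| library(dplyr) |
|  |
| # Write a small Hive-partitioned dataset (directories such as cyl=4/) |
| hive_dir <- tempfile() |
| write_dataset(mtcars, hive_dir, partitioning = "cyl") |
|  |
| # Default partitioning: Hive-style path segments are autodetected |
| ds <- open_dataset(hive_dir) |
| ds |
|  |
| # Query with dplyr verbs, then collect() the result into R |
| four_cyl <- filter(ds, cyl == 4) |
| collect(select(four_cyl, mpg, cyl)) |
|  |
| # Write the same data without Hive-style names (directories such as 4/3/) |
| plain_dir <- tempfile() |
| write_dataset( |
|   mtcars, |
|   plain_dir, |
|   partitioning = c("cyl", "gear"), |
|   hive_style = FALSE |
| ) |
|  |
| # Supply only the field names; the types are autodetected |
| open_dataset(plain_dir, partitioning = c("cyl", "gear")) |
|  |
| # Supply a Schema to fix both the names and the types of the path segments |
| open_dataset(plain_dir, partitioning = schema(cyl = int32(), gear = int32())) |
|  |
| # Scan every file to build a unified schema, and state the format explicitly |
| # (format is passed through ... to dataset_factory()) |
| open_dataset(hive_dir, unify_schemas = TRUE, format = "parquet") |
| } |
| } |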