| % Generated by roxygen2: do not edit by hand |
| % Please edit documentation in R/dataset.R |
| \name{open_dataset} |
| \alias{open_dataset} |
| \title{Open a multi-file dataset} |
| \usage{ |
| open_dataset( |
| sources, |
| schema = NULL, |
| partitioning = hive_partition(), |
| unify_schemas = NULL, |
| ... |
| ) |
| } |
| \arguments{ |
| \item{sources}{One of: |
| \itemize{ |
| \item a string path to a directory containing data files |
| \item a list of \code{Dataset} objects as created by this function |
| \item a list of \code{DatasetFactory} objects as created by \code{\link[=dataset_factory]{dataset_factory()}}. |
| }} |
| |
| \item{schema}{\link{Schema} for the dataset. If \code{NULL} (the default), the schema |
| will be inferred from the data sources.} |
| |
| \item{partitioning}{When \code{sources} is a file path, one of |
| \itemize{ |
| \item a \code{Schema}, in which case the file paths relative to \code{sources} will be |
| parsed, and path segments will be matched with the schema fields. For |
| example, \code{schema(year = int16(), month = int8())} would create partitions |
| for file paths like \code{"2019/01/file.parquet"}, \code{"2019/02/file.parquet"}, etc. |
| \item a character vector that defines the field names corresponding to those |
| path segments (that is, you provide the names that a \code{Schema} would |
| contain, and the types are autodetected) |
| \item a \code{HivePartitioning} or \code{HivePartitioningFactory}, as returned |
| by \code{\link[=hive_partition]{hive_partition()}}, which parses explicit or autodetected fields from |
| Hive-style path segments (for example, \code{"year=2019/month=01/file.parquet"}) |
| \item \code{NULL} for no partitioning |
| } |
| |
| The default is to autodetect Hive-style partitions.} |
| |
| \item{unify_schemas}{logical: should all data fragments (files, \code{Dataset}s) |
| be scanned in order to create a unified schema from them? If \code{FALSE}, only |
| the first fragment will be inspected for its schema. Use this |
| fast path when you know and trust that all fragments have an identical schema. |
| The default is \code{FALSE} when creating a dataset from a file path (because |
| there may be many files and scanning may be slow) but \code{TRUE} when \code{sources} |
| is a list of \code{Dataset}s (because there should be few \code{Dataset}s in the list |
| and their \code{Schema}s are already in memory).} |
| |
| \item{...}{additional arguments passed to \code{dataset_factory()} when |
| \code{sources} is a file path; otherwise they are ignored. These may include \code{format} to |
| indicate the file format, or other format-specific options.} |
| } |
| \value{ |
| A \link{Dataset} R6 object. Use \code{dplyr} methods on it to query the data, |
| or call \code{\link[=Scanner]{$NewScan()}} to construct a query directly. |
| } |
| \description{ |
| Arrow Datasets allow you to query data that has been split across |
| multiple files. Such sharding often reflects partitioning, which |
| can accelerate queries that touch only some partitions (files). Call |
| \code{open_dataset()} to point to a directory of data files and return a |
| \code{Dataset}, then use \code{dplyr} methods to query it. |
| } |
| \seealso{ |
| \code{vignette("dataset", package = "arrow")} |
| } |
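| \examples{ |
| \dontrun{ |
| # A minimal sketch, not taken from the package's own examples. It assumes the |
| # arrow package was built with Parquet and dataset support and that dplyr is |
| # installed. The temporary directory layout and the use of mtcars are |
| # illustrative choices only. |
| library(arrow) |
| library(dplyr) |
|  |
| # Build a small Hive-style directory tree: <root>/cyl=<value>/part.parquet |
| root <- tempfile() |
| for (v in unique(mtcars$cyl)) { |
|   part_dir <- file.path(root, paste0("cyl=", v)) |
|   dir.create(part_dir, recursive = TRUE) |
|   # The partition column lives in the path, so drop it from the file itself |
|   rows <- mtcars[mtcars$cyl == v, names(mtcars) != "cyl"] |
|   write_parquet(rows, file.path(part_dir, "part.parquet")) |
| } |
|  |
| # The default partitioning autodetects the Hive-style "cyl=<value>" segments |
| ds <- open_dataset(root) |
|  |
| # Query with dplyr verbs; collect() pulls the result into an R data frame |
| collect(select(filter(ds, cyl == 4), mpg, hp, cyl)) |
|  |
| # Inspect every file when building the schema instead of only the first one |
| ds_unified <- open_dataset(root, unify_schemas = TRUE) |
|  |
| unlink(root, recursive = TRUE) |
| } |
| } |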