<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html lang="en-US"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><title>Multi-file datasets — Dataset • Arrow R Package</title><!-- favicons --><link rel="icon" type="image/png" sizes="96x96" href="../favicon-96x96.png"><link rel="icon" type="image/svg+xml" href="../favicon.svg"><link rel="apple-touch-icon" sizes="180x180" href="../apple-touch-icon.png"><link rel="icon" sizes="any" href="../favicon.ico"><link rel="manifest" href="../site.webmanifest"><script src="../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><link href="../deps/bootstrap-5.3.1/bootstrap.min.css" rel="stylesheet"><script src="../deps/bootstrap-5.3.1/bootstrap.bundle.min.js"></script><link href="../deps/font-awesome-6.5.2/css/all.min.css" rel="stylesheet"><link href="../deps/font-awesome-6.5.2/css/v4-shims.min.css" rel="stylesheet"><script src="../deps/headroom-0.11.0/headroom.min.js"></script><script src="../deps/headroom-0.11.0/jQuery.headroom.min.js"></script><script src="../deps/bootstrap-toc-1.0.1/bootstrap-toc.min.js"></script><script src="../deps/clipboard.js-2.0.11/clipboard.min.js"></script><script src="../deps/search-1.0.0/autocomplete.jquery.min.js"></script><script src="../deps/search-1.0.0/fuse.min.js"></script><script src="../deps/search-1.0.0/mark.min.js"></script><!-- pkgdown --><script src="../pkgdown.js"></script><link href="../extra.css" rel="stylesheet"><meta property="og:title" content="Multi-file datasets — Dataset"><meta name="description" content="Arrow Datasets allow you to query against data that has been split across
multiple files. This sharding of data may indicate partitioning, which
can accelerate queries that only touch some partitions (files).
A Dataset contains one or more Fragments, such as files, of potentially
differing type and partitioning.
For Dataset$create(), see open_dataset(), which is an alias for it.
DatasetFactory is used to provide finer control over the creation of Datasets."><meta property="og:description" content="Arrow Datasets allow you to query against data that has been split across
multiple files. This sharding of data may indicate partitioning, which
can accelerate queries that only touch some partitions (files).
A Dataset contains one or more Fragments, such as files, of potentially
differing type and partitioning.
For Dataset$create(), see open_dataset(), which is an alias for it.
DatasetFactory is used to provide finer control over the creation of Datasets."><meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png"><meta property="og:image:alt" content="Apache Arrow logo, displaying the triple chevron image adjacent to the text"><!-- Matomo --><script>
var _paq = window._paq = window._paq || [];
/* tracker methods like "setCustomDimension" should be called before "trackPageView" */
/* We explicitly disable cookie tracking to avoid privacy issues */
_paq.push(['disableCookies']);
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="https://analytics.apache.org/";
_paq.push(['setTrackerUrl', u+'matomo.php']);
_paq.push(['setSiteId', '20']);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
})();
</script><!-- End Matomo Code --><!-- Kapa AI --><script async src="https://widget.kapa.ai/kapa-widget.bundle.js" data-website-id="9db461d5-ac77-4b3f-a5c5-75efa78339d2" data-project-name="Apache Arrow" data-project-color="#000000" data-project-logo="https://arrow.apache.org/img/arrow-logo_chevrons_white-txt_black-bg.png" data-modal-disclaimer="This is a custom LLM with access to all of [Arrow documentation](https://arrow.apache.org/docs/). If you want an R-specific answer, please mention this in your question." data-consent-required="true" data-user-analytics-cookie-enabled="false" data-consent-screen-disclaimer="By clicking &quot;I agree, let's chat&quot;, you consent to the use of the AI assistant in accordance with kapa.ai's [Privacy Policy](https://www.kapa.ai/content/privacy-policy). This service uses reCAPTCHA, which requires your consent to Google's [Privacy Policy](https://policies.google.com/privacy) and [Terms of Service](https://policies.google.com/terms). By proceeding, you explicitly agree to both kapa.ai's and Google's privacy policies."></script><!-- End Kapa AI --></head><body>
<a href="#main" class="visually-hidden-focusable">Skip to contents</a>
<nav class="navbar fixed-top navbar-dark navbar-expand-lg bg-black"><div class="container">
<a class="navbar-brand me-2" href="../index.html">Arrow R Package</a>
<span class="version">
<small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="">22.0.0.9000</small>
</span>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div id="navbar" class="collapse navbar-collapse ms-3">
<ul class="navbar-nav me-auto"><li class="nav-item"><a class="nav-link" href="../articles/arrow.html">Get started</a></li>
<li class="active nav-item"><a class="nav-link" href="../reference/index.html">Reference</a></li>
<li class="nav-item dropdown">
<button class="nav-link dropdown-toggle" type="button" id="dropdown-articles" data-bs-toggle="dropdown" aria-expanded="false" aria-haspopup="true">Articles</button>
<ul class="dropdown-menu" aria-labelledby="dropdown-articles"><li><hr class="dropdown-divider"></li>
<li><h6 class="dropdown-header" data-toc-skip>Using the package</h6></li>
<li><a class="dropdown-item" href="../articles/read_write.html">Reading and writing data files</a></li>
<li><a class="dropdown-item" href="../articles/data_wrangling.html">Data analysis with dplyr syntax</a></li>
<li><a class="dropdown-item" href="../articles/dataset.html">Working with multi-file data sets</a></li>
<li><a class="dropdown-item" href="../articles/python.html">Integrating Arrow, Python, and R</a></li>
<li><a class="dropdown-item" href="../articles/fs.html">Using cloud storage (S3, GCS)</a></li>
<li><a class="dropdown-item" href="../articles/flight.html">Connecting to a Flight server</a></li>
<li><hr class="dropdown-divider"></li>
<li><h6 class="dropdown-header" data-toc-skip>Arrow concepts</h6></li>
<li><a class="dropdown-item" href="../articles/data_objects.html">Data objects</a></li>
<li><a class="dropdown-item" href="../articles/data_types.html">Data types</a></li>
<li><a class="dropdown-item" href="../articles/metadata.html">Metadata</a></li>
<li><hr class="dropdown-divider"></li>
<li><h6 class="dropdown-header" data-toc-skip>Installation</h6></li>
<li><a class="dropdown-item" href="../articles/install.html">Installing on Linux</a></li>
<li><a class="dropdown-item" href="../articles/install_nightly.html">Installing development versions</a></li>
<li><hr class="dropdown-divider"></li>
<li><a class="dropdown-item" href="../articles/index.html">More articles...</a></li>
</ul></li>
<li class="nav-item"><a class="nav-link" href="../news/index.html">Changelog</a></li>
</ul><form class="form-inline my-2 my-lg-0" role="search">
<input type="search" class="form-control me-sm-2" aria-label="Search" name="search-input" data-search-index="../search.json" id="search-input" placeholder="" autocomplete="off"></form>
<ul class="navbar-nav"><li class="nav-item"><a class="external-link nav-link" href="https://github.com/apache/arrow/" aria-label="GitHub"><span class="fa fab fa-github fa-lg"></span></a></li>
</ul></div>
</div>
</nav><div class="container template-reference-topic">
<div class="row">
<main id="main" class="col-md-9"><div class="page-header">
<h1>Multi-file datasets</h1>
<small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/main/r/R/dataset.R" class="external-link"><code>R/dataset.R</code></a>, <a href="https://github.com/apache/arrow/blob/main/r/R/dataset-factory.R" class="external-link"><code>R/dataset-factory.R</code></a></small>
<div class="d-none name"><code>Dataset.Rd</code></div>
</div>
<div class="ref-description section level2">
<p>Arrow Datasets allow you to query against data that has been split across
multiple files. This sharding of data may indicate partitioning, which
can accelerate queries that only touch some partitions (files).</p>
<p>A <code>Dataset</code> contains one or more <code>Fragments</code>, such as files, of potentially
differing type and partitioning.</p>
<p>For <code>Dataset$create()</code>, see <code><a href="open_dataset.html">open_dataset()</a></code>, which is an alias for it.</p>
<p><code>DatasetFactory</code> is used to provide finer control over the creation of <code>Dataset</code>s.</p>
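<p>As a rough illustration of the description above, the following sketch writes a small partitioned dataset to a temporary directory and then queries a single partition. The example data frame, the temporary path, and the use of dplyr verbs (which assumes the dplyr package is installed) are assumptions made for this example and are not part of the reference text.</p>
<div class="sourceCode"><pre class="sourceCode r"><code>library(arrow)
library(dplyr)

# Hypothetical example data and output directory (illustration only)
dir &lt;- tempfile("ds_example_")
df &lt;- data.frame(
  year = rep(c(2020L, 2021L), each = 3),
  value = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5)
)

# Write one directory of Parquet files per value of `year`
# (Hive-style layout: year=2020/, year=2021/)
write_dataset(df, dir, format = "parquet", partitioning = "year")

# Open all files under `dir` as a single Dataset
ds &lt;- open_dataset(dir)

# The filter on the partition column lets Arrow skip files
# that belong to other partitions
result &lt;- ds %&gt;%
  filter(year == 2021) %&gt;%
  summarise(total = sum(value)) %&gt;%
  collect()
</code></pre></div>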
</div>
<div class="section level2">
<h2 id="factory">Factory<a class="anchor" aria-label="anchor" href="#factory"></a></h2>
<p><code>DatasetFactory</code> is used to create a <code>Dataset</code>, inspect the <a href="Schema-class.html">Schema</a> of the
fragments contained in it, and declare a partitioning.
<code>FileSystemDatasetFactory</code> is a subclass of <code>DatasetFactory</code> for
discovering files in a file system, such as the local file system or
cloud storage (S3, GCS).</p>
<p>For the <code>DatasetFactory$create()</code> factory method, see <code><a href="dataset_factory.html">dataset_factory()</a></code>, an
alias for it. A <code>DatasetFactory</code> has:</p><ul><li><p><code>$Inspect(unify_schemas)</code>: If <code>unify_schemas</code> is <code>TRUE</code>, all fragments
will be scanned and a unified <a href="Schema-class.html">Schema</a> will be created from them; if <code>FALSE</code>
(default), only the first fragment will be inspected for its schema. Use this
fast path when you know and trust that all fragments have an identical schema.</p></li>
<li><p><code>$Finish(schema, unify_schemas)</code>: Returns a <code>Dataset</code>. If <code>schema</code> is provided,
it will be used for the <code>Dataset</code>; if omitted, a <code>Schema</code> will be created from
inspecting the fragments (files) in the dataset, following <code>unify_schemas</code>
as described above.</p></li>
</ul><p><code>FileSystemDatasetFactory$create()</code> is a lower-level factory method and
takes the following arguments:</p><ul><li><p><code>filesystem</code>: A <a href="FileSystem.html">FileSystem</a></p></li>
<li><p><code>selector</code>: Either a <a href="FileSelector.html">FileSelector</a> or <code>NULL</code></p></li>
<li><p><code>paths</code>: Either a character vector of file paths or <code>NULL</code></p></li>
<li><p><code>format</code>: A <a href="FileFormat.html">FileFormat</a></p></li>
<li><p><code>partitioning</code>: Either <code>Partitioning</code>, <code>PartitioningFactory</code>, or <code>NULL</code></p></li>
</ul></div>
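<div class="section level2">
<p>The factory workflow described above can be sketched roughly as follows. This is a minimal example under stated assumptions: the temporary directory and the example data are hypothetical, and passing <code>hive_partition()</code> as the partitioning is one possible choice that matches the Hive-style directories written here. The second half shows the lower-level <code>FileSystemDatasetFactory$create()</code> with a local file system.</p>
<div class="sourceCode"><pre class="sourceCode r"><code>library(arrow)

# Hypothetical example data written as a Hive-partitioned dataset
dir &lt;- tempfile("factory_example_")
write_dataset(
  data.frame(g = c("a", "a", "b"), x = 1:3),
  dir,
  format = "parquet",
  partitioning = "g"
)

# Higher-level route: dataset_factory() is the alias for DatasetFactory$create()
factory &lt;- dataset_factory(dir, format = "parquet", partitioning = hive_partition())

# Scan all fragments and build a unified Schema
sch &lt;- factory$Inspect(unify_schemas = TRUE)

# Build the Dataset using the inspected Schema
ds &lt;- factory$Finish(schema = sch)

# Lower-level route: supply the file system, selector, format,
# and partitioning explicitly
fs &lt;- LocalFileSystem$create()
selector &lt;- FileSelector$create(dir, recursive = TRUE)
low_level &lt;- FileSystemDatasetFactory$create(
  filesystem = fs,
  selector = selector,
  format = FileFormat$create("parquet"),
  partitioning = hive_partition()
)
ds2 &lt;- low_level$Finish()
</code></pre></div>
<p>Calling <code>$Inspect()</code> before <code>$Finish()</code> is optional; it is useful mainly when you want to examine or adjust the schema before the <code>Dataset</code> is created.</p>
</div>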
<div class="section level2">
<h2 id="methods">Methods<a class="anchor" aria-label="anchor" href="#methods"></a></h2>
<p>A <code>Dataset</code> has the following methods:</p><ul><li><p><code>$NewScan()</code>: Returns a <a href="Scanner.html">ScannerBuilder</a> for building a query</p></li>
<li><p><code>$WithSchema()</code>: Returns a new Dataset with the specified schema.
This method currently supports only adding, removing, or reordering
fields in the schema: you cannot alter or cast the field types.</p></li>
<li><p><code>$schema</code>: Active binding that returns the <a href="Schema-class.html">Schema</a> of the Dataset; you
may also replace the dataset's schema by using <code>ds$schema &lt;- new_schema</code>.</p></li>
</ul><p><code>FileSystemDataset</code> has the following methods:</p><ul><li><p><code>$files</code>: Active binding, returns the files of the <code>FileSystemDataset</code></p></li>
<li><p><code>$format</code>: Active binding, returns the <a href="FileFormat.html">FileFormat</a> of the <code>FileSystemDataset</code></p></li>
</ul><p><code>UnionDataset</code> has the following methods:</p><ul><li><p><code>$children</code>: Active binding, returns all child <code>Dataset</code>s.</p></li>
</ul></div>
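<div class="section level2">
<p>The methods and active bindings listed above might be used as in the sketch below. The example data and temporary directory are hypothetical, and combining the same dataset with itself is only a convenient way to obtain a <code>UnionDataset</code> for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code>library(arrow)

# Hypothetical example data written as unpartitioned Parquet
dir &lt;- tempfile("methods_example_")
write_dataset(
  data.frame(x = 1:4, y = c("a", "b", "c", "d")),
  dir,
  format = "parquet"
)
ds &lt;- open_dataset(dir)

# Active bindings on the (FileSystem)Dataset
ds$schema   # Schema of the Dataset
ds$files    # paths of the files backing the Dataset
ds$format   # FileFormat of the Dataset

# $NewScan() returns a ScannerBuilder: select a column,
# then finish the scan and read the result into a Table
builder &lt;- ds$NewScan()
builder$Project("x")
tab &lt;- builder$Finish()$ToTable()

# Passing a list of Datasets to open_dataset() yields a UnionDataset
both &lt;- open_dataset(list(ds, ds))
n_children &lt;- length(both$children)
</code></pre></div>
<p>For most queries the dplyr interface is more convenient than building a scan by hand; the <code>ScannerBuilder</code> route is shown here only because it is what <code>$NewScan()</code> returns.</p>
</div>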
<div class="section level2">
<h2 id="see-also">See also<a class="anchor" aria-label="anchor" href="#see-also"></a></h2>
<div class="dont-index"><p><code><a href="open_dataset.html">open_dataset()</a></code> for a simple interface to creating a <code>Dataset</code></p></div>
</div>
</main><aside class="col-md-3"><nav id="toc" aria-label="Table of contents"><h2>On this page</h2>
</nav></aside></div>
<footer><div class="pkgdown-footer-left">
<p><a href="https://arrow.apache.org/docs/r/versions.html">Older versions of these docs</a></p>
</div>
<div class="pkgdown-footer-right">
<p>Site built with <a href="https://pkgdown.r-lib.org/" class="external-link">pkgdown</a> 2.1.3.</p>
</div>
</footer></div>
</body></html>