<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html lang="en-US"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><title>Create a DatasetFactory — dataset_factory • Arrow R Package</title><!-- favicons --><link rel="icon" type="image/png" sizes="96x96" href="../favicon-96x96.png"><link rel="icon" type="image/svg+xml" href="../favicon.svg"><link rel="apple-touch-icon" sizes="180x180" href="../apple-touch-icon.png"><link rel="icon" sizes="any" href="../favicon.ico"><link rel="manifest" href="../site.webmanifest"><script src="../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><link href="../deps/bootstrap-5.3.1/bootstrap.min.css" rel="stylesheet"><script src="../deps/bootstrap-5.3.1/bootstrap.bundle.min.js"></script><link href="../deps/font-awesome-6.5.2/css/all.min.css" rel="stylesheet"><link href="../deps/font-awesome-6.5.2/css/v4-shims.min.css" rel="stylesheet"><script src="../deps/headroom-0.11.0/headroom.min.js"></script><script src="../deps/headroom-0.11.0/jQuery.headroom.min.js"></script><script src="../deps/bootstrap-toc-1.0.1/bootstrap-toc.min.js"></script><script src="../deps/clipboard.js-2.0.11/clipboard.min.js"></script><script src="../deps/search-1.0.0/autocomplete.jquery.min.js"></script><script src="../deps/search-1.0.0/fuse.min.js"></script><script src="../deps/search-1.0.0/mark.min.js"></script><!-- pkgdown --><script src="../pkgdown.js"></script><link href="../extra.css" rel="stylesheet"><meta property="og:title" content="Create a DatasetFactory — dataset_factory"><meta name="description" content="A Dataset can be constructed using one or more DatasetFactorys.
This function helps you construct a DatasetFactory that you can pass to
open_dataset()."><meta property="og:description" content="A Dataset can be constructed using one or more DatasetFactorys.
This function helps you construct a DatasetFactory that you can pass to
open_dataset()."><meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png"><meta property="og:image:alt" content="Apache Arrow logo, displaying the triple chevron image adjacent to the text"><!-- Matomo --><script>
var _paq = window._paq = window._paq || [];
/* tracker methods like "setCustomDimension" should be called before "trackPageView" */
/* We explicitly disable cookie tracking to avoid privacy issues */
_paq.push(['disableCookies']);
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="https://analytics.apache.org/";
_paq.push(['setTrackerUrl', u+'matomo.php']);
_paq.push(['setSiteId', '20']);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
})();
</script><!-- End Matomo Code --><!-- Kapa AI --><script async src="https://widget.kapa.ai/kapa-widget.bundle.js" data-website-id="9db461d5-ac77-4b3f-a5c5-75efa78339d2" data-project-name="Apache Arrow" data-project-color="#000000" data-project-logo="https://arrow.apache.org/img/arrow-logo_chevrons_white-txt_black-bg.png" data-modal-disclaimer="This is a custom LLM with access to all of [Arrow documentation](https://arrow.apache.org/docs/). If you want an R-specific answer, please mention this in your question." data-consent-required="true" data-user-analytics-cookie-enabled="false" data-consent-screen-disclaimer="By clicking &quot;I agree, let's chat&quot;, you consent to the use of the AI assistant in accordance with kapa.ai's [Privacy Policy](https://www.kapa.ai/content/privacy-policy). This service uses reCAPTCHA, which requires your consent to Google's [Privacy Policy](https://policies.google.com/privacy) and [Terms of Service](https://policies.google.com/terms). By proceeding, you explicitly agree to both kapa.ai's and Google's privacy policies."></script><!-- End Kapa AI --></head><body>
<a href="#main" class="visually-hidden-focusable">Skip to contents</a>
<nav class="navbar fixed-top navbar-dark navbar-expand-lg bg-black"><div class="container">
<a class="navbar-brand me-2" href="../index.html">Arrow R Package</a>
<span class="version">
<small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="">22.0.0.9000</small>
</span>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div id="navbar" class="collapse navbar-collapse ms-3">
<ul class="navbar-nav me-auto"><li class="nav-item"><a class="nav-link" href="../articles/arrow.html">Get started</a></li>
<li class="active nav-item"><a class="nav-link" href="../reference/index.html">Reference</a></li>
<li class="nav-item dropdown">
<button class="nav-link dropdown-toggle" type="button" id="dropdown-articles" data-bs-toggle="dropdown" aria-expanded="false" aria-haspopup="true">Articles</button>
<ul class="dropdown-menu" aria-labelledby="dropdown-articles"><li><hr class="dropdown-divider"></li>
<li><h6 class="dropdown-header" data-toc-skip>Using the package</h6></li>
<li><a class="dropdown-item" href="../articles/read_write.html">Reading and writing data files</a></li>
<li><a class="dropdown-item" href="../articles/data_wrangling.html">Data analysis with dplyr syntax</a></li>
<li><a class="dropdown-item" href="../articles/dataset.html">Working with multi-file data sets</a></li>
<li><a class="dropdown-item" href="../articles/python.html">Integrating Arrow, Python, and R</a></li>
<li><a class="dropdown-item" href="../articles/fs.html">Using cloud storage (S3, GCS)</a></li>
<li><a class="dropdown-item" href="../articles/flight.html">Connecting to a Flight server</a></li>
<li><hr class="dropdown-divider"></li>
<li><h6 class="dropdown-header" data-toc-skip>Arrow concepts</h6></li>
<li><a class="dropdown-item" href="../articles/data_objects.html">Data objects</a></li>
<li><a class="dropdown-item" href="../articles/data_types.html">Data types</a></li>
<li><a class="dropdown-item" href="../articles/metadata.html">Metadata</a></li>
<li><hr class="dropdown-divider"></li>
<li><h6 class="dropdown-header" data-toc-skip>Installation</h6></li>
<li><a class="dropdown-item" href="../articles/install.html">Installing on Linux</a></li>
<li><a class="dropdown-item" href="../articles/install_nightly.html">Installing development versions</a></li>
<li><hr class="dropdown-divider"></li>
<li><a class="dropdown-item" href="../articles/index.html">More articles...</a></li>
</ul></li>
<li class="nav-item"><a class="nav-link" href="../news/index.html">Changelog</a></li>
</ul><form class="form-inline my-2 my-lg-0" role="search">
<input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../search.json" id="search-input" placeholder="" autocomplete="off"></form>
<ul class="navbar-nav"><li class="nav-item"><a class="external-link nav-link" href="https://github.com/apache/arrow/" aria-label="GitHub"><span class="fa fab fa-github fa-lg"></span></a></li>
</ul></div>
</div>
</nav><div class="container template-reference-topic">
<div class="row">
<main id="main" class="col-md-9"><div class="page-header">
<h1>Create a DatasetFactory</h1>
<small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/main/r/R/dataset-factory.R" class="external-link"><code>R/dataset-factory.R</code></a></small>
<div class="d-none name"><code>dataset_factory.Rd</code></div>
</div>
<div class="ref-description section level2">
<p>A <a href="Dataset.html">Dataset</a> can be constructed using one or more <a href="Dataset.html">DatasetFactory</a>s.
This function helps you construct a <code>DatasetFactory</code> that you can pass to
<code><a href="open_dataset.html">open_dataset()</a></code>.</p>
</div>
<div class="section level2">
<h2 id="ref-usage">Usage<a class="anchor" aria-label="anchor" href="#ref-usage"></a></h2>
<div class="sourceCode"><pre class="sourceCode r"><code><span><span class="fu">dataset_factory</span><span class="op">(</span></span>
<span> <span class="va">x</span>,</span>
<span> filesystem <span class="op">=</span> <span class="cn">NULL</span>,</span>
<span> format <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"parquet"</span>, <span class="st">"arrow"</span>, <span class="st">"ipc"</span>, <span class="st">"feather"</span>, <span class="st">"csv"</span>, <span class="st">"tsv"</span>, <span class="st">"text"</span>, <span class="st">"json"</span><span class="op">)</span>,</span>
<span> partitioning <span class="op">=</span> <span class="cn">NULL</span>,</span>
<span> hive_style <span class="op">=</span> <span class="cn">NA</span>,</span>
<span> factory_options <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/list.html" class="external-link">list</a></span><span class="op">(</span><span class="op">)</span>,</span>
<span> <span class="va">...</span></span>
<span><span class="op">)</span></span></code></pre></div>
</div>
<div class="section level2">
<h2 id="arguments">Arguments<a class="anchor" aria-label="anchor" href="#arguments"></a></h2>
<dl><dt id="arg-x">x<a class="anchor" aria-label="anchor" href="#arg-x"></a></dt>
<dd><p>A string path to a directory containing data files, a vector of
one or more string paths to data files, or a list of <code>DatasetFactory</code> objects
whose datasets should be combined. If a list of <code>DatasetFactory</code> objects is
provided, it will be used to construct a <code>UnionDatasetFactory</code> and the other
arguments will be ignored.</p></dd>
<dt id="arg-filesystem">filesystem<a class="anchor" aria-label="anchor" href="#arg-filesystem"></a></dt>
<dd><p>A <a href="FileSystem.html">FileSystem</a> object; if omitted, the <code>FileSystem</code> will
be detected from <code>x</code>.</p></dd>
<dt id="arg-format">format<a class="anchor" aria-label="anchor" href="#arg-format"></a></dt>
<dd><p>A <a href="FileFormat.html">FileFormat</a> object, or a string identifier of the format of
the files in <code>x</code>. Currently supported values:</p><ul><li><p>"parquet"</p></li>
<li><p>"ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that
only version 2 files are supported</p></li>
<li><p>"csv"/"text", aliases for the same thing (because comma is the default
delimiter for text files)</p></li>
<li><p>"tsv", equivalent to passing <code>format = "text", delimiter = "\t"</code></p></li>
</ul><p>Default is "parquet", unless a <code>delimiter</code> is also specified, in which case
it is assumed to be "text".</p></dd>
<dt id="arg-partitioning">partitioning<a class="anchor" aria-label="anchor" href="#arg-partitioning"></a></dt>
<dd><p>One of</p><ul><li><p>A <code>Schema</code>, in which case the file paths relative to <code>x</code> will be
parsed, and path segments will be matched with the schema fields. For
example, <code>schema(year = int16(), month = int8())</code> would create partitions
for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.</p></li>
<li><p>A character vector that defines the field names corresponding to those
path segments (that is, you're providing the names that would correspond
to a <code>Schema</code> but the types will be autodetected)</p></li>
<li><p>A <code>HivePartitioning</code> or <code>HivePartitioningFactory</code>, as returned
by <code><a href="hive_partition.html">hive_partition()</a></code>, which parses explicit or autodetected fields from
Hive-style path segments</p></li>
<li><p><code>NULL</code> for no partitioning</p></li>
</ul></dd>
<dt id="arg-hive-style">hive_style<a class="anchor" aria-label="anchor" href="#arg-hive-style"></a></dt>
<dd><p>Logical: if <code>partitioning</code> is a character vector or a
<code>Schema</code>, should it be interpreted as specifying Hive-style partitioning?
Default is <code>NA</code>, which means to inspect the file paths for Hive-style
partitioning and behave accordingly.</p></dd>
<dt id="arg-factory-options">factory_options<a class="anchor" aria-label="anchor" href="#arg-factory-options"></a></dt>
<dd><p>list of optional FileSystemFactoryOptions:</p><ul><li><p><code>partition_base_dir</code>: string path segment prefix to ignore when
discovering partition information with DirectoryPartitioning. Not
meaningful (ignored with a warning) for HivePartitioning, nor is it
valid when providing a vector of file paths.</p></li>
<li><p><code>exclude_invalid_files</code>: logical: should files that are not valid data
files be excluded? Default is <code>FALSE</code> because checking all files up
front incurs I/O and thus will be slower, especially on remote
filesystems. If <code>FALSE</code> and there are invalid files, there will be an
error at scan time. This is the only FileSystemFactoryOption that is
valid both when providing a directory path in which to discover
files and when providing a vector of file paths.</p></li>
<li><p><code>selector_ignore_prefixes</code>: character vector of file prefixes to ignore
when discovering files in a directory. If invalid files can be excluded
by a common filename prefix this way, you can avoid the I/O cost of
<code>exclude_invalid_files</code>. Not valid when providing a vector of file paths
(but if you're providing the file list, you can filter invalid files
yourself).</p></li>
</ul></dd>
<dt id="arg--">...<a class="anchor" aria-label="anchor" href="#arg--"></a></dt>
<dd><p>Additional format-specific options, passed to
<code><a href="FileFormat.html">FileFormat$create()</a></code>. For CSV options, note that you can specify them either
with the Arrow C++ library naming ("delimiter", "quoting", etc.) or the
<code>readr</code>-style naming used in <code><a href="read_delim_arrow.html">read_csv_arrow()</a></code> ("delim", "quote", etc.).
Not all <code>readr</code> options are currently supported; please file an issue if you
encounter one that <code>arrow</code> should support.</p></dd>
</dl></div>
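<div class="section level2">
<p>As a minimal sketch of the <code>partitioning</code> and <code>factory_options</code> arguments, the code below builds a small directory-partitioned layout in a temporary directory and creates a factory over it. The directory layout, the column name <code>x</code>, and the <code>tmp_</code> prefix are made up for illustration, and the sketch assumes an <code>arrow</code> build with dataset and Parquet support.</p>
<div class="sourceCode"><pre class="sourceCode r"><code>library(arrow)

# Hypothetical layout: &lt;root&gt;/2019/01/file.parquet and &lt;root&gt;/2019/02/file.parquet
root &lt;- tempfile()
for (month in c("01", "02")) {
  dir.create(file.path(root, "2019", month), recursive = TRUE)
  write_parquet(
    data.frame(x = 1:3),
    file.path(root, "2019", month, "file.parquet")
  )
}

# A stray non-data file that should be skipped during file discovery
writeLines("scratch", file.path(root, "tmp_notes.txt"))

factory &lt;- dataset_factory(
  root,
  format = "parquet",
  # Path segments are matched to these fields in order: year, then month
  partitioning = schema(year = int16(), month = int8()),
  # Skip files by filename prefix instead of validating every file up front
  factory_options = list(selector_ignore_prefixes = c(".", "_", "tmp_"))
)

# Pass the factory to open_dataset() in a list to create the Dataset
ds &lt;- open_dataset(list(factory))
ds$schema
</code></pre></div>
<p>The resulting schema should contain the file column <code>x</code> together with the partition fields <code>year</code> and <code>month</code> parsed from the paths.</p>
</div>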
<div class="section level2">
<h2 id="value">Value<a class="anchor" aria-label="anchor" href="#value"></a></h2>
<p>A <code>DatasetFactory</code> object. Pass this to <code><a href="open_dataset.html">open_dataset()</a></code>,
in a list potentially with other <code>DatasetFactory</code> objects, to create
a <code>Dataset</code>.</p>
</div>
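<div class="section level2">
<p>For illustration, here is a small sketch of creating a factory and turning it into a <code>Dataset</code>. It writes a tab-separated file to a temporary directory with base R; the file name and columns are hypothetical, and an <code>arrow</code> build with dataset support is assumed.</p>
<div class="sourceCode"><pre class="sourceCode r"><code>library(arrow)

# Hypothetical directory holding a single tab-separated file
dir_tsv &lt;- tempfile()
dir.create(dir_tsv)
write.table(
  data.frame(id = 1:3, label = c("a", "b", "c")),
  file.path(dir_tsv, "part-0.tsv"),
  sep = "\t", row.names = FALSE, quote = FALSE
)

# "tsv" is equivalent to format = "text" with a tab delimiter
factory &lt;- dataset_factory(dir_tsv, format = "tsv")
inherits(factory, "DatasetFactory")

# Wrap the factory in a list when passing it to open_dataset()
ds &lt;- open_dataset(list(factory))
ds$schema
</code></pre></div>
</div>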
<div class="section level2">
<h2 id="details">Details<a class="anchor" aria-label="anchor" href="#details"></a></h2>
<p>If you only need a single <code>DatasetFactory</code> (for example, you have a
single directory containing Parquet files), you can call <code><a href="open_dataset.html">open_dataset()</a></code>
directly. Use <code>dataset_factory()</code> when you
want to combine different directories, file systems, or file formats.</p>
</div>
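<div class="section level2">
<p>The sketch below shows one way to combine two directories that hold different file formats into a single <code>Dataset</code>. The directory and file names are hypothetical, both files are written with the same column names and types so that their schemas agree, and an <code>arrow</code> build with dataset and Parquet support is assumed.</p>
<div class="sourceCode"><pre class="sourceCode r"><code>library(arrow)

# Two hypothetical directories, one per file format
dir_parquet &lt;- tempfile()
dir_feather &lt;- tempfile()
dir.create(dir_parquet)
dir.create(dir_feather)

write_parquet(
  data.frame(x = 1:3, y = c("a", "b", "c")),
  file.path(dir_parquet, "part-0.parquet")
)
write_feather(
  data.frame(x = 4:6, y = c("d", "e", "f")),
  file.path(dir_feather, "part-0.feather")
)

# One factory per directory, each with its own format
ds &lt;- open_dataset(list(
  dataset_factory(dir_parquet, format = "parquet"),
  dataset_factory(dir_feather, format = "feather")
))
ds$schema
</code></pre></div>
<p>Each factory carries its own format and path, so the same pattern should extend to factories that point at different file systems.</p>
</div>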
</main><aside class="col-md-3"><nav id="toc" aria-label="Table of contents"><h2>On this page</h2>
</nav></aside></div>
<footer><div class="pkgdown-footer-left">
<p><a href="https://arrow.apache.org/docs/r/versions.html">Older versions of these docs</a></p>
</div>
<div class="pkgdown-footer-right">
<p>Site built with <a href="https://pkgdown.r-lib.org/" class="external-link">pkgdown</a> 2.1.3.</p>
</div>
</footer></div>
</body></html>