<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><meta name="description" content="The write_*_dataset() are a family of wrappers around write_dataset to allow for easy switching
between functions for writing datasets."><title>Write a dataset into partitioned flat files. — write_delim_dataset • Arrow R Package</title><!-- favicons --><link rel="icon" type="image/png" sizes="16x16" href="../favicon-16x16.png"><link rel="icon" type="image/png" sizes="32x32" href="../favicon-32x32.png"><link rel="apple-touch-icon" type="image/png" sizes="180x180" href="../apple-touch-icon.png"><link rel="apple-touch-icon" type="image/png" sizes="120x120" href="../apple-touch-icon-120x120.png"><link rel="apple-touch-icon" type="image/png" sizes="76x76" href="../apple-touch-icon-76x76.png"><link rel="apple-touch-icon" type="image/png" sizes="60x60" href="../apple-touch-icon-60x60.png"><script src="../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><link href="../deps/bootstrap-5.3.1/bootstrap.min.css" rel="stylesheet"><script src="../deps/bootstrap-5.3.1/bootstrap.bundle.min.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous"><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous"><!-- bootstrap-toc --><script src="https://cdn.jsdelivr.net/gh/afeld/bootstrap-toc@v1.0.1/dist/bootstrap-toc.min.js" integrity="sha256-4veVQbu7//Lk5TSmc7YV48MxtMy98e26cf5MrgZYnwo=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" 
crossorigin="anonymous"></script><!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.11/clipboard.min.js" integrity="sha512-7O5pXpc0oCRrxk8RUfDYFgn0nO1t+jLuIOQdOMRp4APB7uZ4vSjspzp5y6YDtDs4VzUSTbWzBFZ/LKJhnyFOKw==" crossorigin="anonymous" referrerpolicy="no-referrer"></script><!-- search --><script src="https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js" integrity="sha512-zv6Ywkjyktsohkbp9bb45V6tEMoWhzFzXis+LrMehmJZZSys19Yxf1dopHx7WzIKxr5tK2dVcYmaCk2uqdjF4A==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.jquery.min.js" integrity="sha512-GU9ayf+66Xx2TmpxqJpliWbT5PiGYxpaG8rfnBEk1LL8l1KGkRShhngwdXK1UgqhAzWpZHSiYPc09/NwDQIGyg==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mark.js/8.11.1/mark.min.js" integrity="sha512-5CYOlHXGh6QpOFA/TeTylKLWfB3ftPsde7AnmhuitiTX4K5SqCLBeKro6sPS8ilsz1Q4NRx3v8Ko2IBiszzdww==" crossorigin="anonymous"></script><!-- pkgdown --><script src="../pkgdown.js"></script><link href="../extra.css" rel="stylesheet"><meta property="og:title" content="Write a dataset into partitioned flat files. — write_delim_dataset"><meta property="og:description" content="The write_*_dataset() are a family of wrappers around write_dataset to allow for easy switching
between functions for writing datasets."><meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png"><meta property="og:image:alt" content="Apache Arrow logo, displaying the triple chevron image adjacent to the text"><meta name="twitter:card" content="summary_large_image"><meta name="twitter:creator" content="@apachearrow"><meta name="twitter:site" content="@apachearrow"><!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]--><!-- Matomo --><script>
var _paq = window._paq = window._paq || [];
/* tracker methods like "setCustomDimension" should be called before "trackPageView" */
/* We explicitly disable cookie tracking to avoid privacy issues */
_paq.push(['disableCookies']);
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="https://analytics.apache.org/";
_paq.push(['setTrackerUrl', u+'matomo.php']);
_paq.push(['setSiteId', '20']);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
})();
</script><!-- End Matomo Code --></head><body>
<a href="#main" class="visually-hidden-focusable">Skip to contents</a>
<nav class="navbar fixed-top navbar-dark navbar-expand-lg bg-black"><div class="container">
<a class="navbar-brand me-2" href="../index.html">Arrow R Package</a>
<span class="version">
<small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="">16.0.0.9000</small>
</span>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div id="navbar" class="collapse navbar-collapse ms-3">
<ul class="navbar-nav me-auto"><li class="nav-item">
<a class="nav-link" href="../articles/arrow.html">Get started</a>
</li>
<li class="active nav-item">
<a class="nav-link" href="../reference/index.html">Reference</a>
</li>
<li class="nav-item dropdown">
<a href="#" class="nav-link dropdown-toggle" data-bs-toggle="dropdown" role="button" aria-expanded="false" aria-haspopup="true" id="dropdown-articles">Articles</a>
<div class="dropdown-menu" aria-labelledby="dropdown-articles">
<h6 class="dropdown-header" data-toc-skip>Using the package</h6>
<a class="dropdown-item" href="../articles/read_write.html">Reading and writing data files</a>
<a class="dropdown-item" href="../articles/data_wrangling.html">Data analysis with dplyr syntax</a>
<a class="dropdown-item" href="../articles/dataset.html">Working with multi-file data sets</a>
<a class="dropdown-item" href="../articles/python.html">Integrating Arrow, Python, and R</a>
<a class="dropdown-item" href="../articles/fs.html">Using cloud storage (S3, GCS)</a>
<a class="dropdown-item" href="../articles/flight.html">Connecting to a Flight server</a>
<div class="dropdown-divider"></div>
<h6 class="dropdown-header" data-toc-skip>Arrow concepts</h6>
<a class="dropdown-item" href="../articles/data_objects.html">Data objects</a>
<a class="dropdown-item" href="../articles/data_types.html">Data types</a>
<a class="dropdown-item" href="../articles/metadata.html">Metadata</a>
<div class="dropdown-divider"></div>
<h6 class="dropdown-header" data-toc-skip>Installation</h6>
<a class="dropdown-item" href="../articles/install.html">Installing on Linux</a>
<a class="dropdown-item" href="../articles/install_nightly.html">Installing development versions</a>
<div class="dropdown-divider"></div>
<a class="dropdown-item" href="../articles/index.html">More articles...</a>
</div>
</li>
<li class="nav-item">
<a class="nav-link" href="../news/index.html">Changelog</a>
</li>
</ul><form class="form-inline my-2 my-lg-0" role="search">
<input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../search.json" id="search-input" placeholder="Search for" autocomplete="off"></form>
<ul class="navbar-nav"><li class="nav-item">
<a class="external-link nav-link" href="https://github.com/apache/arrow/" aria-label="github">
<span class="fab fa fab fa-github fa-lg"></span>
</a>
</li>
</ul></div>
</div>
</nav><div class="container template-reference-topic">
<div class="row">
<main id="main" class="col-md-9"><div class="page-header">
<img src="" class="logo" alt=""><h1>Write a dataset into partitioned flat files.</h1>
<small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/main/r/R/dataset-write.R" class="external-link"><code>R/dataset-write.R</code></a></small>
<div class="d-none name"><code>write_delim_dataset.Rd</code></div>
</div>
<div class="ref-description section level2">
<p>The <code>write_*_dataset()</code> functions are a family of wrappers around <a href="write_dataset.html">write_dataset</a> that make it easy to switch
between functions for writing datasets.</p>
</div>
<div class="section level2">
<h2 id="ref-usage">Usage<a class="anchor" aria-label="anchor" href="#ref-usage"></a></h2>
<div class="sourceCode"><pre class="sourceCode r"><code><span><span class="fu">write_delim_dataset</span><span class="op">(</span></span>
<span> <span class="va">dataset</span>,</span>
<span> <span class="va">path</span>,</span>
<span> partitioning <span class="op">=</span> <span class="fu">dplyr</span><span class="fu">::</span><span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_data.html" class="external-link">group_vars</a></span><span class="op">(</span><span class="va">dataset</span><span class="op">)</span>,</span>
<span> basename_template <span class="op">=</span> <span class="st">"part-{i}.txt"</span>,</span>
<span> hive_style <span class="op">=</span> <span class="cn">TRUE</span>,</span>
<span> existing_data_behavior <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"overwrite"</span>, <span class="st">"error"</span>, <span class="st">"delete_matching"</span><span class="op">)</span>,</span>
<span> max_partitions <span class="op">=</span> <span class="fl">1024L</span>,</span>
<span> max_open_files <span class="op">=</span> <span class="fl">900L</span>,</span>
<span> max_rows_per_file <span class="op">=</span> <span class="fl">0L</span>,</span>
<span> min_rows_per_group <span class="op">=</span> <span class="fl">0L</span>,</span>
<span> max_rows_per_group <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/bitwise.html" class="external-link">bitwShiftL</a></span><span class="op">(</span><span class="fl">1</span>, <span class="fl">20</span><span class="op">)</span>,</span>
<span> col_names <span class="op">=</span> <span class="cn">TRUE</span>,</span>
<span> batch_size <span class="op">=</span> <span class="fl">1024L</span>,</span>
<span> delim <span class="op">=</span> <span class="st">","</span>,</span>
<span> na <span class="op">=</span> <span class="st">""</span>,</span>
<span> eol <span class="op">=</span> <span class="st">"\n"</span>,</span>
<span> quote <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"needed"</span>, <span class="st">"all"</span>, <span class="st">"none"</span><span class="op">)</span></span>
<span><span class="op">)</span></span>
<span></span>
<span><span class="fu">write_csv_dataset</span><span class="op">(</span></span>
<span> <span class="va">dataset</span>,</span>
<span> <span class="va">path</span>,</span>
<span> partitioning <span class="op">=</span> <span class="fu">dplyr</span><span class="fu">::</span><span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_data.html" class="external-link">group_vars</a></span><span class="op">(</span><span class="va">dataset</span><span class="op">)</span>,</span>
<span> basename_template <span class="op">=</span> <span class="st">"part-{i}.csv"</span>,</span>
<span> hive_style <span class="op">=</span> <span class="cn">TRUE</span>,</span>
<span> existing_data_behavior <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"overwrite"</span>, <span class="st">"error"</span>, <span class="st">"delete_matching"</span><span class="op">)</span>,</span>
<span> max_partitions <span class="op">=</span> <span class="fl">1024L</span>,</span>
<span> max_open_files <span class="op">=</span> <span class="fl">900L</span>,</span>
<span> max_rows_per_file <span class="op">=</span> <span class="fl">0L</span>,</span>
<span> min_rows_per_group <span class="op">=</span> <span class="fl">0L</span>,</span>
<span> max_rows_per_group <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/bitwise.html" class="external-link">bitwShiftL</a></span><span class="op">(</span><span class="fl">1</span>, <span class="fl">20</span><span class="op">)</span>,</span>
<span> col_names <span class="op">=</span> <span class="cn">TRUE</span>,</span>
<span> batch_size <span class="op">=</span> <span class="fl">1024L</span>,</span>
<span> delim <span class="op">=</span> <span class="st">","</span>,</span>
<span> na <span class="op">=</span> <span class="st">""</span>,</span>
<span> eol <span class="op">=</span> <span class="st">"\n"</span>,</span>
<span> quote <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"needed"</span>, <span class="st">"all"</span>, <span class="st">"none"</span><span class="op">)</span></span>
<span><span class="op">)</span></span>
<span></span>
<span><span class="fu">write_tsv_dataset</span><span class="op">(</span></span>
<span> <span class="va">dataset</span>,</span>
<span> <span class="va">path</span>,</span>
<span> partitioning <span class="op">=</span> <span class="fu">dplyr</span><span class="fu">::</span><span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_data.html" class="external-link">group_vars</a></span><span class="op">(</span><span class="va">dataset</span><span class="op">)</span>,</span>
<span> basename_template <span class="op">=</span> <span class="st">"part-{i}.tsv"</span>,</span>
<span> hive_style <span class="op">=</span> <span class="cn">TRUE</span>,</span>
<span> existing_data_behavior <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"overwrite"</span>, <span class="st">"error"</span>, <span class="st">"delete_matching"</span><span class="op">)</span>,</span>
<span> max_partitions <span class="op">=</span> <span class="fl">1024L</span>,</span>
<span> max_open_files <span class="op">=</span> <span class="fl">900L</span>,</span>
<span> max_rows_per_file <span class="op">=</span> <span class="fl">0L</span>,</span>
<span> min_rows_per_group <span class="op">=</span> <span class="fl">0L</span>,</span>
<span> max_rows_per_group <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/bitwise.html" class="external-link">bitwShiftL</a></span><span class="op">(</span><span class="fl">1</span>, <span class="fl">20</span><span class="op">)</span>,</span>
<span> col_names <span class="op">=</span> <span class="cn">TRUE</span>,</span>
<span> batch_size <span class="op">=</span> <span class="fl">1024L</span>,</span>
<span> na <span class="op">=</span> <span class="st">""</span>,</span>
<span> eol <span class="op">=</span> <span class="st">"\n"</span>,</span>
<span> quote <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"needed"</span>, <span class="st">"all"</span>, <span class="st">"none"</span><span class="op">)</span></span>
<span><span class="op">)</span></span></code></pre></div>
</div>
<div class="section level2">
<h2 id="arguments">Arguments<a class="anchor" aria-label="anchor" href="#arguments"></a></h2>
<dl><dt>dataset</dt>
<dd><p><a href="Dataset.html">Dataset</a>, <a href="RecordBatch-class.html">RecordBatch</a>, <a href="Table-class.html">Table</a>, <code>arrow_dplyr_query</code>, or
<code>data.frame</code>. If an <code>arrow_dplyr_query</code>, the query will be evaluated and
the result will be written. This means that you can <code><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate()</a></code>,
etc. to transform the data before it is written if you need to.</p></dd>
<dt>path</dt>
<dd><p>string path, URI, or <code>SubTreeFileSystem</code> referencing a directory
to write to (directory will be created if it does not exist)</p></dd>
<dt>partitioning</dt>
<dd><p><code>Partitioning</code> or a character vector of columns to
use as partition keys (to be written as path segments). Default is to
use the current <code><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by()</a></code> columns.</p></dd>
<dt>basename_template</dt>
<dd><p>string template for the names of files to be written.
Must contain <code>"{i}"</code>, which will be replaced with an auto-incremented
integer to generate the basenames of data files. For example, <code>"part-{i}.arrow"</code>
will yield <code>"part-0.arrow", ...</code>.
If not specified, it defaults to <code>"part-{i}.&lt;default extension&gt;"</code>.</p></dd>
<dt>hive_style</dt>
<dd><p>logical: write partition segments as Hive-style
(<code>key1=value1/key2=value2/file.ext</code>) or as just bare values. Default is <code>TRUE</code>.</p></dd>
<dt>existing_data_behavior</dt>
<dd><p>The behavior to use when there is already data
in the destination directory. Must be one of "overwrite", "error", or
"delete_matching".</p><ul><li><p>"overwrite" (the default): any new files created will overwrite
existing files.</p></li>
<li><p>"error": the operation will fail if the destination directory is not
empty.</p></li>
<li><p>"delete_matching": the writer will delete any existing partitions
to which data is going to be written and will leave alone
partitions to which no data is written.</p></li>
</ul></dd>
<dt>max_partitions</dt>
<dd><p>maximum number of partitions any batch may be
written into. Default is 1024L.</p></dd>
<dt>max_open_files</dt>
<dd><p>maximum number of files that can be left open
during a write operation. If greater than 0, this limits the
number of files that can be open at once. If an attempt is made to open
too many files, the least recently used file will be closed.
If this value is set too low, you may end up fragmenting your data
into many small files. The default is 900, which also allows some number of files to be
open by the scanner before hitting the default Linux limit of 1024.</p></dd>
<dt>max_rows_per_file</dt>
<dd><p>maximum number of rows per file.
If greater than 0 then this will limit how many rows are placed in any single file.
Default is 0L.</p></dd>
<dt>min_rows_per_group</dt>
<dd><p>write the row groups to disk once this number of
rows has accumulated. Default is 0L.</p></dd>
<dt>max_rows_per_group</dt>
<dd><p>maximum number of rows allowed in a single
group. When this number of rows is exceeded, the group is split and the remaining
rows are written to the next group. This value must be
greater than <code>min_rows_per_group</code>. Default is 1024 * 1024.</p></dd>
<dt>col_names</dt>
<dd><p>Whether to write an initial header line with column names.</p></dd>
<dt>batch_size</dt>
<dd><p>Maximum number of rows processed at a time. Default is 1024L.</p></dd>
<dt>delim</dt>
<dd><p>Delimiter used to separate values. Defaults to <code>","</code> for <code>write_delim_dataset()</code> and
<code>write_csv_dataset()</code>, and <code>"\t"</code> for <code>write_tsv_dataset()</code>. Cannot be changed for <code>write_tsv_dataset()</code>.</p></dd>
<dt>na</dt>
<dd><p>the string to write for missing values. Quotes are not allowed in this string.
The default is an empty string <code>""</code>.</p></dd>
<dt>eol</dt>
<dd><p>the end of line character to use for ending rows. The default is <code>"\n"</code>.</p></dd>
<dt>quote</dt>
<dd><p>How to handle fields which contain characters that need to be quoted.</p><ul><li><p><code>needed</code> - Enclose in quotes only those string and binary values that need them, because their CSV rendering can
itself contain quotes (the default)</p></li>
<li><p><code>all</code> - Enclose all valid values in quotes. Nulls are not quoted. May cause readers to
interpret all values as strings if the schema is inferred.</p></li>
<li><p><code>none</code> - Do not enclose any values in quotes. Values must not contain quotes ("),
cell delimiters (,) or line endings (\r, \n), following RFC 4180. If values
contain any of these characters, an error is raised when attempting to write.</p></li>
</ul></dd>
</dl></div>
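<div class="section level2">
<p>The sketch below illustrates how a few of the arguments above interact. It is not taken from the package documentation: the data frame <code>df</code>, the directory <code>out_dir</code>, and the variable <code>second_write</code> are hypothetical names, and it assumes that the arrow and dplyr packages are installed.</p>
<div class="sourceCode"><pre class="sourceCode r"><code>library(arrow)
library(dplyr)

# Illustrative data only; `df` and `out_dir` are hypothetical names.
df &lt;- data.frame(
  group = c("a", "a", "b"),
  label = c("plain", "has,comma", NA),
  value = c(1.5, 2.5, 3.5)
)
out_dir &lt;- tempfile("delim_dataset_")

# `partitioning` defaults to the group_by() columns, so one directory is
# written per value of `group`. With hive_style = FALSE the directories are
# named by the bare values rather than "group=&lt;value&gt;".
write_csv_dataset(
  group_by(df, group),
  out_dir,
  hive_style = FALSE,
  na = "NA",     # string written for missing values
  quote = "all"  # quote every non-null value
)
list.files(out_dir, recursive = TRUE)

# With existing_data_behavior = "error", writing into the now non-empty
# directory is expected to fail, so the error is caught here.
second_write &lt;- tryCatch(
  write_csv_dataset(df, out_dir, existing_data_behavior = "error"),
  error = function(e) "failed"
)
second_write</code></pre></div>
<p>Passing a grouped data frame keeps the partition keys next to the data transformation; supplying <code>partitioning = "group"</code> explicitly should give the same layout.</p>
</div>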
<div class="section level2">
<h2 id="value">Value<a class="anchor" aria-label="anchor" href="#value"></a></h2>
<p>The input <code>dataset</code>, invisibly.</p>
</div>
<div class="section level2">
<h2 id="see-also">See also<a class="anchor" aria-label="anchor" href="#see-also"></a></h2>
<div class="dont-index"><p><code><a href="write_dataset.html">write_dataset()</a></code></p></div>
</div>
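<div class="section level2">
<h2 id="ref-examples">Examples<a class="anchor" aria-label="anchor" href="#ref-examples"></a></h2>
<p>A minimal usage sketch of the three wrappers, assuming the arrow package is installed. The built-in <code>mtcars</code> data frame and the temporary directory names (<code>csv_dir</code>, <code>tsv_dir</code>, <code>txt_dir</code>) are chosen for illustration only.</p>
<div class="sourceCode"><pre class="sourceCode r"><code>library(arrow)

# Write mtcars as CSV files, partitioned by the `cyl` column.
# tempfile() paths keep all output inside the session's temporary directory.
csv_dir &lt;- tempfile("csv_dataset_")
write_csv_dataset(mtcars, csv_dir, partitioning = "cyl")
list.files(csv_dir, recursive = TRUE)

# Tab-separated output; the delimiter is fixed for write_tsv_dataset().
tsv_dir &lt;- tempfile("tsv_dataset_")
write_tsv_dataset(mtcars, tsv_dir, partitioning = "cyl")
list.files(tsv_dir, recursive = TRUE)

# A custom delimiter with write_delim_dataset(), without partitioning.
txt_dir &lt;- tempfile("txt_dataset_")
write_delim_dataset(mtcars, txt_dir, delim = ";")
list.files(txt_dir, recursive = TRUE)

# Read the partitioned CSV dataset back in.
ds &lt;- open_dataset(csv_dir, format = "csv")
ds</code></pre></div>
</div>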
</main><aside class="col-md-3"><nav id="toc"><h2>On this page</h2>
</nav></aside></div>
<footer><div class="pkgdown-footer-left">
<p><a href="https://arrow.apache.org/docs/r/versions.html">Older versions of these docs</a></p>
</div>
<div class="pkgdown-footer-right">
<p>Site built with <a href="https://pkgdown.r-lib.org/" class="external-link">pkgdown</a> 2.0.9.</p>
</div>
</footer></div>
</body></html>