<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta name="description" content="Learn how to work with data sets stored in an Amazon S3 bucket or on Google Cloud Storage
">
<title>Using cloud storage (S3, GCS) • Arrow R Package</title>
<!-- favicons --><link rel="icon" type="image/png" sizes="16x16" href="../favicon-16x16.png">
<link rel="icon" type="image/png" sizes="32x32" href="../favicon-32x32.png">
<link rel="apple-touch-icon" type="image/png" sizes="180x180" href="../apple-touch-icon.png">
<link rel="apple-touch-icon" type="image/png" sizes="120x120" href="../apple-touch-icon-120x120.png">
<link rel="apple-touch-icon" type="image/png" sizes="76x76" href="../apple-touch-icon-76x76.png">
<link rel="apple-touch-icon" type="image/png" sizes="60x60" href="../apple-touch-icon-60x60.png">
<script src="../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link href="../deps/bootstrap-5.2.2/bootstrap.min.css" rel="stylesheet">
<script src="../deps/bootstrap-5.2.2/bootstrap.bundle.min.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous">
<!-- bootstrap-toc --><script src="https://cdn.jsdelivr.net/gh/afeld/bootstrap-toc@v1.0.1/dist/bootstrap-toc.min.js" integrity="sha256-4veVQbu7//Lk5TSmc7YV48MxtMy98e26cf5MrgZYnwo=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.6/clipboard.min.js" integrity="sha256-inc5kl9MA1hkeYUt+EC3BhlIgyp/2jDIyBLS6k3UxPI=" crossorigin="anonymous"></script><!-- search --><script src="https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js" integrity="sha512-zv6Ywkjyktsohkbp9bb45V6tEMoWhzFzXis+LrMehmJZZSys19Yxf1dopHx7WzIKxr5tK2dVcYmaCk2uqdjF4A==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.jquery.min.js" integrity="sha512-GU9ayf+66Xx2TmpxqJpliWbT5PiGYxpaG8rfnBEk1LL8l1KGkRShhngwdXK1UgqhAzWpZHSiYPc09/NwDQIGyg==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mark.js/8.11.1/mark.min.js" integrity="sha512-5CYOlHXGh6QpOFA/TeTylKLWfB3ftPsde7AnmhuitiTX4K5SqCLBeKro6sPS8ilsz1Q4NRx3v8Ko2IBiszzdww==" crossorigin="anonymous"></script><!-- pkgdown --><script src="../pkgdown.js"></script><meta property="og:title" content="Using cloud storage (S3, GCS)">
<meta property="og:description" content="Learn how to work with data sets stored in an Amazon S3 bucket or on Google Cloud Storage
">
<meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png">
<meta property="og:image:alt" content="Apache Arrow logo, displaying the triple chevron image adjacent to the text">
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:creator" content="@apachearrow">
<meta name="twitter:site" content="@apachearrow">
<!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]--><!-- Matomo --><script>
var _paq = window._paq = window._paq || [];
/* tracker methods like "setCustomDimension" should be called before "trackPageView" */
/* We explicitly disable cookie tracking to avoid privacy issues */
_paq.push(['disableCookies']);
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="https://analytics.apache.org/";
_paq.push(['setTrackerUrl', u+'matomo.php']);
_paq.push(['setSiteId', '20']);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
})();
</script><!-- End Matomo Code -->
</head>
<body>
<a href="#main" class="visually-hidden-focusable">Skip to contents</a>
<nav class="navbar fixed-top navbar-dark navbar-expand-lg bg-black"><div class="container">
<a class="navbar-brand me-2" href="../index.html">Arrow R Package</a>
<span class="version">
<small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="">13.0.0</small>
</span>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div id="navbar" class="collapse navbar-collapse ms-3">
<ul class="navbar-nav me-auto">
<li class="nav-item">
<a class="nav-link" href="../articles/arrow.html">Get started</a>
</li>
<li class="nav-item">
<a class="nav-link" href="../reference/index.html">Reference</a>
</li>
<li class="active nav-item dropdown">
<a href="#" class="nav-link dropdown-toggle" data-bs-toggle="dropdown" role="button" aria-expanded="false" aria-haspopup="true" id="dropdown-articles">Articles</a>
<div class="dropdown-menu" aria-labelledby="dropdown-articles">
<h6 class="dropdown-header" data-toc-skip>Using the package</h6>
<a class="dropdown-item" href="../articles/read_write.html">Reading and writing data files</a>
<a class="dropdown-item" href="../articles/data_wrangling.html">Data analysis with dplyr syntax</a>
<a class="dropdown-item" href="../articles/dataset.html">Working with multi-file data sets</a>
<a class="dropdown-item" href="../articles/python.html">Integrating Arrow, Python, and R</a>
<a class="dropdown-item" href="../articles/fs.html">Using cloud storage (S3, GCS)</a>
<a class="dropdown-item" href="../articles/flight.html">Connecting to a Flight server</a>
<div class="dropdown-divider"></div>
<h6 class="dropdown-header" data-toc-skip>Arrow concepts</h6>
<a class="dropdown-item" href="../articles/data_objects.html">Data objects</a>
<a class="dropdown-item" href="../articles/data_types.html">Data types</a>
<a class="dropdown-item" href="../articles/metadata.html">Metadata</a>
<div class="dropdown-divider"></div>
<h6 class="dropdown-header" data-toc-skip>Installation</h6>
<a class="dropdown-item" href="../articles/install.html">Installing on Linux</a>
<a class="dropdown-item" href="../articles/install_nightly.html">Installing development versions</a>
<div class="dropdown-divider"></div>
<a class="dropdown-item" href="../articles/index.html">More articles...</a>
</div>
</li>
<li class="nav-item">
<a class="nav-link" href="../news/index.html">Changelog</a>
</li>
</ul>
<form class="form-inline my-2 my-lg-0" role="search">
<input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../search.json" id="search-input" placeholder="Search for" autocomplete="off">
</form>
<ul class="navbar-nav">
<li class="nav-item">
<a class="external-link nav-link" href="https://github.com/apache/arrow/" aria-label="github">
<span class="fab fa fab fa-github fa-lg"></span>
</a>
</li>
</ul>
</div>
</div>
</nav><div class="container template-article">
<div class="row">
<main id="main" class="col-md-9"><div class="page-header">
<img src="" class="logo" alt=""><h1>Using cloud storage (S3, GCS)</h1>
<small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/main/r/vignettes/fs.Rmd" class="external-link"><code>vignettes/fs.Rmd</code></a></small>
<div class="d-none name"><code>fs.Rmd</code></div>
</div>
<p>Working with data stored in cloud storage systems like <a href="https://docs.aws.amazon.com/s3/" class="external-link">Amazon Simple Storage Service</a> (S3) and <a href="https://cloud.google.com/storage/docs" class="external-link">Google Cloud Storage</a> (GCS) is a very common task. Because of this, the Arrow C++ library provides a toolkit that aims to make it as simple to work with cloud storage as it is to work with the local filesystem.</p>
<p>To make this work, the Arrow C++ library contains a general-purpose interface for file systems, and the arrow package exposes this interface to R users. For instance, if you want to, you can create a <code>LocalFileSystem</code> object that allows you to interact with the local file system in the usual ways: copying, moving, and deleting files, obtaining information about files and folders, and so on (see <code><a href="../reference/FileSystem.html">help("FileSystem", package = "arrow")</a></code> for details). In general you probably don’t need this functionality because you already have tools for working with your local file system, but this interface becomes much more useful in the context of remote file systems. Currently there is a specific implementation for Amazon S3 provided by the <code>S3FileSystem</code> class, and another one for Google Cloud Storage provided by <code>GcsFileSystem</code>.</p>
<p>This article provides an overview of working with both S3 and GCS data using the Arrow toolkit.</p>
<div class="section level2">
<h2 id="s3-and-gcs-support-on-linux">S3 and GCS support on Linux<a class="anchor" aria-label="anchor" href="#s3-and-gcs-support-on-linux"></a>
</h2>
<p>Before you start, make sure that your arrow install has support for S3 and/or GCS enabled. For most users this will be true by default, because the Windows and macOS binary packages hosted on CRAN include S3 and GCS support. You can check whether support is enabled via helper functions:</p>
<div class="sourceCode" id="cb1"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="../reference/arrow_info.html">arrow_with_s3</a></span><span class="op">(</span><span class="op">)</span></span>
<span><span class="fu"><a href="../reference/arrow_info.html">arrow_with_gcs</a></span><span class="op">(</span><span class="op">)</span></span></code></pre></div>
<p>If these return <code>TRUE</code> then the relevant support is enabled.</p>
<p>In some cases you may find that your system does not have support enabled. The most common case for this occurs on Linux when installing arrow from source. In this situation S3 and GCS support is not always enabled by default, and there are additional system requirements involved. See the <a href="./install.html">installation article</a> for details on how to resolve this.</p>
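<p>As a minimal sketch (assuming the arrow package is installed), the two checks can be combined to report which backends are missing before any cloud code runs. The <code>support</code> variable name is illustrative only:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R">library(arrow)

# Collect both checks in a named logical vector
support &lt;- c(s3 = arrow_with_s3(), gcs = arrow_with_gcs())
print(support)

# Name any backend that this build of arrow cannot use
if (!all(support)) {
  message(
    "This arrow build lacks support for: ",
    paste(names(support)[!support], collapse = ", ")
  )
}</code></pre></div>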
</div>
<div class="section level2">
<h2 id="connecting-to-cloud-storage">Connecting to cloud storage<a class="anchor" aria-label="anchor" href="#connecting-to-cloud-storage"></a>
</h2>
<p>One way of working with filesystems is to create <code><a href="../reference/FileSystem.html">?FileSystem</a></code> objects. <code><a href="../reference/FileSystem.html">?S3FileSystem</a></code> objects can be created with the <code><a href="../reference/s3_bucket.html">s3_bucket()</a></code> function, which automatically detects the bucket’s AWS region. Similarly, <code><a href="../reference/FileSystem.html">?GcsFileSystem</a></code> objects can be created with the <code><a href="../reference/gs_bucket.html">gs_bucket()</a></code> function. The resulting <code>FileSystem</code> will consider paths relative to the bucket’s path (so for example you don’t need to prefix the bucket path when listing a directory).</p>
<p>With a <code>FileSystem</code> object, you can point to specific files in it with the <code>$path()</code> method and pass the result to file readers and writers (<code><a href="../reference/read_parquet.html">read_parquet()</a></code>, <code><a href="../reference/write_feather.html">write_feather()</a></code>, et al.).</p>
<p>Often the reason users work with cloud storage in real world analysis is to access large data sets. An example of this is discussed in the <a href="./dataset.html">datasets article</a>, but new users may prefer to work with a much smaller data set while learning how the arrow cloud storage interface works. To that end, the examples in this article rely on a multi-file Parquet dataset that stores a copy of the <code>diamonds</code> data made available through the <a href="https://ggplot2.tidyverse.org/" class="external-link"><code>ggplot2</code></a> package, documented in <code>help("diamonds", package = "ggplot2")</code>. The cloud storage version of this data set consists of 5 Parquet files totaling less than 1MB in size.</p>
<p>The diamonds data set is hosted on both S3 and GCS, in a bucket named <code>voltrondata-labs-datasets</code>. To create an S3FileSystem object that refers to that bucket, use the following command:</p>
<div class="sourceCode" id="cb2"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">bucket</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/s3_bucket.html">s3_bucket</a></span><span class="op">(</span><span class="st">"voltrondata-labs-datasets"</span><span class="op">)</span></span></code></pre></div>
<p>To do this for the GCS version of the data, the command is as follows:</p>
<div class="sourceCode" id="cb3"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">bucket</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/gs_bucket.html">gs_bucket</a></span><span class="op">(</span><span class="st">"voltrondata-labs-datasets"</span>, anonymous <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span></code></pre></div>
<p>Note that <code>anonymous = TRUE</code> is required for GCS if credentials have not been configured.</p>
<!-- TODO: update GCS note above if ARROW-17097 is addressed -->
<p>Within this bucket there is a folder called <code>diamonds</code>. We can call <code>bucket$ls("diamonds")</code> to list the files stored in this folder, or <code>bucket$ls("diamonds", recursive = TRUE)</code> to recursively search subfolders. Note that on GCS, you should always set <code>recursive = TRUE</code> because directories often don’t appear in the results.</p>
<p>Here’s what we get when we list the files stored in the GCS bucket:</p>
<div class="sourceCode" id="cb4"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">bucket</span><span class="op">$</span><span class="fu">ls</span><span class="op">(</span><span class="st">"diamonds"</span>, recursive <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span></code></pre></div>
<div class="sourceCode" id="cb5"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="co">## [1] "diamonds/cut=Fair/part-0.parquet" </span></span>
<span><span class="co">## [2] "diamonds/cut=Good/part-0.parquet" </span></span>
<span><span class="co">## [3] "diamonds/cut=Ideal/part-0.parquet" </span></span>
<span><span class="co">## [4] "diamonds/cut=Premium/part-0.parquet" </span></span>
<span><span class="co">## [5] "diamonds/cut=Very Good/part-0.parquet"</span></span></code></pre></div>
<p>There are 5 Parquet files here, one corresponding to each of the “cut” categories in the <code>diamonds</code> data set. We can specify the path to a specific file by calling <code>bucket$path()</code>:</p>
<div class="sourceCode" id="cb6"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">parquet_good</span> <span class="op">&lt;-</span> <span class="va">bucket</span><span class="op">$</span><span class="fu">path</span><span class="op">(</span><span class="st">"diamonds/cut=Good/part-0.parquet"</span><span class="op">)</span></span></code></pre></div>
<p>We can use <code><a href="../reference/read_parquet.html">read_parquet()</a></code> to read from this path directly into R:</p>
<div class="sourceCode" id="cb7"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">diamonds_good</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/read_parquet.html">read_parquet</a></span><span class="op">(</span><span class="va">parquet_good</span><span class="op">)</span></span>
<span><span class="va">diamonds_good</span></span></code></pre></div>
<div class="sourceCode" id="cb8"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="co">## # A tibble: 4,906 × 9</span></span>
<span><span class="co">## carat color clarity depth table price x y z</span></span>
<span><span class="co">## &lt;dbl&gt; &lt;ord&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;</span></span>
<span><span class="co">## 1 0.23 E VS1 56.9 65 327 4.05 4.07 2.31</span></span>
<span><span class="co">## 2 0.31 J SI2 63.3 58 335 4.34 4.35 2.75</span></span>
<span><span class="co">## 3 0.3 J SI1 64 55 339 4.25 4.28 2.73</span></span>
<span><span class="co">## 4 0.3 J SI1 63.4 54 351 4.23 4.29 2.7 </span></span>
<span><span class="co">## 5 0.3 J SI1 63.8 56 351 4.23 4.26 2.71</span></span>
<span><span class="co">## 6 0.3 I SI2 63.3 56 351 4.26 4.3 2.71</span></span>
<span><span class="co">## 7 0.23 F VS1 58.2 59 402 4.06 4.08 2.37</span></span>
<span><span class="co">## 8 0.23 E VS1 64.1 59 402 3.83 3.85 2.46</span></span>
<span><span class="co">## 9 0.31 H SI1 64 54 402 4.29 4.31 2.75</span></span>
<span><span class="co">## 10 0.26 D VS2 65.2 56 403 3.99 4.02 2.61</span></span>
<span><span class="co">## # … with 4,896 more rows</span></span>
<span><span class="co">## # ℹ Use `print(n = ...)` to see more rows</span></span></code></pre></div>
<p>Note that this will be slower to read than if the file were local.</p>
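<p>Rather than reading one file at a time, it may be more convenient to treat the whole <code>diamonds</code> folder as a single dataset. The following is a sketch only: it assumes the dplyr package is installed, that the public bucket is reachable, and that the <code>cut=...</code> folder names are detected as a partition column. The names <code>diamonds_ds</code>, <code>n</code> and <code>mean_price</code> are illustrative:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R">library(arrow)
library(dplyr)

bucket &lt;- s3_bucket("voltrondata-labs-datasets")

# Open the folder lazily; no data is downloaded at this point
diamonds_ds &lt;- open_dataset(bucket$path("diamonds"))

# Only the files needed to answer the query are read when collect() runs
diamonds_ds %&gt;%
  filter(cut == "Good") %&gt;%
  summarise(n = n(), mean_price = mean(price)) %&gt;%
  collect()</code></pre></div>
<p>See the <a href="./dataset.html">datasets article</a> for a fuller treatment of this workflow.</p>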
<!-- though if you're running on a machine in the same AWS region as the file in S3,
the cost of reading the data over the network should be much lower. -->
<!--
See `help(FileSystem)` for a list of options that `s3_bucket()`/`S3FileSystem$create()`
and `gs_bucket()`/`GcsFileSystem$create()` can take.
The object that `s3_bucket()` and `gs_bucket()` return is technically a `SubTreeFileSystem`,
which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be
useful for holding a reference to a subdirectory somewhere (on S3, GCS, or elsewhere).
One way to get a subtree is to call the `$cd()` method on a `FileSystem`
```r
june2019 <- bucket$cd("nyc-taxi/year=2019/month=6")
df <- read_parquet(june2019$path("part-0.parquet"))
```
`SubTreeFileSystem` can also be made from a URI:
```r
june2019 <- SubTreeFileSystem$create("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6")
```
-->
</div>
<div class="section level2">
<h2 id="connecting-directly-with-a-uri">Connecting directly with a URI<a class="anchor" aria-label="anchor" href="#connecting-directly-with-a-uri"></a>
</h2>
<p>In most use cases, the easiest and most natural way to connect to cloud storage in arrow is to use the FileSystem objects returned by <code><a href="../reference/s3_bucket.html">s3_bucket()</a></code> and <code><a href="../reference/gs_bucket.html">gs_bucket()</a></code>, especially when multiple file operations are required. However, in some cases you may want to download a file directly by specifying the URI. This is permitted by arrow, and functions like <code><a href="../reference/read_parquet.html">read_parquet()</a></code>, <code><a href="../reference/write_feather.html">write_feather()</a></code>, <code><a href="../reference/open_dataset.html">open_dataset()</a></code>, etc. will all accept URIs to cloud resources hosted on S3 or GCS. The format of an S3 URI is as follows:</p>
<pre><code>s3://[access_key:secret_key@]bucket/path[?region=]</code></pre>
<p>For GCS, the URI format looks like this:</p>
<pre><code>gs://[access_key:secret_key@]bucket/path
gs://anonymous@bucket/path</code></pre>
<p>For example, the Parquet file storing the “good cut” diamonds that we downloaded earlier in the article is available on both S3 and GCS. The relevant URIs are as follows:</p>
<div class="sourceCode" id="cb11"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">uri</span> <span class="op">&lt;-</span> <span class="st">"s3://voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"</span></span>
<span><span class="va">uri</span> <span class="op">&lt;-</span> <span class="st">"gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"</span></span></code></pre></div>
<p>Note that “anonymous” is required on GCS for public buckets. Regardless of which version you use, you can pass this URI to <code><a href="../reference/read_parquet.html">read_parquet()</a></code> as if the file were stored locally:</p>
<div class="sourceCode" id="cb12"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">df</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/read_parquet.html">read_parquet</a></span><span class="op">(</span><span class="va">uri</span><span class="op">)</span></span></code></pre></div>
<p>URIs accept additional options in the query parameters (the part after the <code>?</code>) that are passed down to configure the underlying file system. They are separated by <code>&amp;</code>. For example,</p>
<pre><code>s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&amp;allow_bucket_creation=true</code></pre>
<p>is equivalent to:</p>
<div class="sourceCode" id="cb14"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">bucket</span> <span class="op">&lt;-</span> <span class="va">S3FileSystem</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span></span>
<span> endpoint_override<span class="op">=</span><span class="st">"https://storage.googleapis.com"</span>,</span>
<span> allow_bucket_creation<span class="op">=</span><span class="cn">TRUE</span></span>
<span><span class="op">)</span></span>
<span><span class="va">bucket</span><span class="op">$</span><span class="fu">path</span><span class="op">(</span><span class="st">"voltrondata-labs-datasets/"</span><span class="op">)</span></span></code></pre></div>
<p>Both tell the <code>S3FileSystem</code> object that it should allow the creation of new buckets and talk to Google Storage instead of S3. The latter works because GCS implements an S3-compatible API – see <a href="#file-systems-that-emulate-s3">File systems that emulate S3</a> below – but if you want better support for GCS you should use a <code>GcsFileSystem</code> by using a URI that starts with <code>gs://</code>.</p>
<p>Also note that parameters in the URI need to be <a href="https://en.wikipedia.org/wiki/Percent-encoding" class="external-link">percent encoded</a>, which is why <code>://</code> is written as <code>%3A%2F%2F</code>.</p>
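<p>The percent-encoding step can be done in base R with <code>URLencode(..., reserved = TRUE)</code>. The sketch below rebuilds the example URI shown earlier from a named list of options; <code>build_query()</code> is a hypothetical helper written for this illustration, not part of arrow:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R"># Hypothetical helper: turn a named list of options into a query string,
# percent-encoding each value so that characters like ":" and "/" are escaped
build_query &lt;- function(options) {
  values &lt;- vapply(
    options,
    function(x) utils::URLencode(as.character(x), reserved = TRUE),
    character(1)
  )
  paste(names(options), values, sep = "=", collapse = "&amp;")
}

query &lt;- build_query(list(
  endpoint_override = "https://storage.googleapis.com",
  allow_bucket_creation = "true"
))

# Append the query string after the "?" in the URI
uri &lt;- paste0("s3://voltrondata-labs-datasets/?", query)
uri</code></pre></div>
<p>Encoding each value separately, rather than the whole URI, keeps the <code>?</code>, <code>=</code> and <code>&amp;</code> separators intact.</p>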
<p>For S3, only the following options can be included in the URI as query parameters: <code>region</code>, <code>scheme</code>, <code>endpoint_override</code>, <code>access_key</code>, <code>secret_key</code>, <code>allow_bucket_creation</code>, and <code>allow_bucket_deletion</code>. For GCS, the supported parameters are <code>scheme</code>, <code>endpoint_override</code>, and <code>retry_limit_seconds</code>.</p>
<p>In GCS, a useful option is <code>retry_limit_seconds</code>, which sets the number of seconds a request may spend retrying before returning an error. The current default is 15 minutes, so in many interactive contexts it’s nice to set a lower value:</p>
<pre><code>gs://anonymous@voltrondata-labs-datasets/diamonds/?retry_limit_seconds=10</code></pre>
</div>
<div class="section level2">
<h2 id="authentication">Authentication<a class="anchor" aria-label="anchor" href="#authentication"></a>
</h2>
<div class="section level3">
<h3 id="s3-authentication">S3 Authentication<a class="anchor" aria-label="anchor" href="#s3-authentication"></a>
</h3>
<p>To access private S3 buckets, you typically need two secret parameters: an <code>access_key</code>, which is like a user id, and a <code>secret_key</code>, which is like a token or password. There are a few options for passing these credentials:</p>
<ul>
<li><p>Include them in the URI, like <code>s3://access_key:secret_key@bucket-name/path/to/file</code>. Be sure to <a href="https://en.wikipedia.org/wiki/Percent-encoding" class="external-link">URL-encode</a> your secrets if they contain special characters like “/” (e.g., <code>URLencode("123/456", reserved = TRUE)</code>).</p></li>
<li><p>Pass them as <code>access_key</code> and <code>secret_key</code> to <code>S3FileSystem$create()</code> or <code><a href="../reference/s3_bucket.html">s3_bucket()</a></code></p></li>
<li><p>Set them as environment variables named <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code>, respectively.</p></li>
<li><p>Define them in a <code>~/.aws/credentials</code> file, according to the <a href="https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/credentials.html" class="external-link">AWS documentation</a>.</p></li>
<li><p>Use an <a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html" class="external-link">AccessRole</a> for temporary access by passing the <code>role_arn</code> identifier to <code>S3FileSystem$create()</code> or <code><a href="../reference/s3_bucket.html">s3_bucket()</a></code>.</p></li>
</ul>
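<p>As a rough sketch of the explicit-argument and role-based options listed above: the bucket name, the role ARN, and the <code>MY_S3_ACCESS_KEY</code> / <code>MY_S3_SECRET_KEY</code> variable names below are all placeholders, so this will only run once they are replaced with real values. Reading secrets from the environment, rather than typing them into a script, is a design choice of this example and not something arrow requires:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R">library(arrow)

# Pass credentials explicitly (placeholder bucket and variable names)
bucket &lt;- s3_bucket(
  "my-private-bucket",
  access_key = Sys.getenv("MY_S3_ACCESS_KEY"),
  secret_key = Sys.getenv("MY_S3_SECRET_KEY")
)

# Alternatively, request temporary access through a role (placeholder ARN)
bucket &lt;- s3_bucket(
  "my-private-bucket",
  role_arn = "arn:aws:iam::123456789012:role/my-role"
)</code></pre></div>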
</div>
<div class="section level3">
<h3 id="gcs-authentication">GCS Authentication<a class="anchor" aria-label="anchor" href="#gcs-authentication"></a>
</h3>
<p>The simplest way to authenticate with GCS is to run the <a href="https://cloud.google.com/sdk/docs/" class="external-link">gcloud</a> command to set up application default credentials:</p>
<pre><code>gcloud auth application-default login</code></pre>
<p>To manually configure credentials, you can pass either <code>access_token</code> and <code>expiration</code>, for using temporary tokens generated elsewhere, or <code>json_credentials</code>, to reference a downloaded credentials file.</p>
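<p>The two manual options might look like the sketch below. The bucket name, the <code>MY_GCS_TOKEN</code> variable, the one-hour lifetime, and the credentials path are placeholders, and the example assumes that <code>expiration</code> takes a date-time value and that <code>json_credentials</code> accepts a path to the downloaded file; check <code>help("FileSystem", package = "arrow")</code> for the exact argument types in your version:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R">library(arrow)

# Option 1: a temporary token generated elsewhere, with its expiry time
# (placeholder token source and lifetime)
fs &lt;- GcsFileSystem$create(
  access_token = Sys.getenv("MY_GCS_TOKEN"),
  expiration = Sys.time() + 3600
)

# Option 2: a downloaded credentials file (placeholder bucket and path)
bucket &lt;- gs_bucket(
  "my-private-bucket",
  json_credentials = "path/to/credentials.json"
)</code></pre></div>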
<p>If you haven’t configured credentials, then to access <em>public</em> buckets, you must pass <code>anonymous = TRUE</code> or <code>anonymous</code> as the user in a URI:</p>
<div class="sourceCode" id="cb17"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">bucket</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/gs_bucket.html">gs_bucket</a></span><span class="op">(</span><span class="st">"voltrondata-labs-datasets"</span>, anonymous <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span>
<span><span class="va">fs</span> <span class="op">&lt;-</span> <span class="va">GcsFileSystem</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span>anonymous <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span>
<span><span class="va">df</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/read_parquet.html">read_parquet</a></span><span class="op">(</span><span class="st">"gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"</span><span class="op">)</span></span></code></pre></div>
<!-- TODO(ARROW-16880): Describe what credentials to use for particular use cases
and how to integrate with gargle library. -->
</div>
</div>
<div class="section level2">
<h2 id="using-a-proxy-server">Using a proxy server<a class="anchor" aria-label="anchor" href="#using-a-proxy-server"></a>
</h2>
<p>If you need to use a proxy server to connect to an S3 bucket, you can provide a URI in the form <code>http://user:password@host:port</code> to <code>proxy_options</code>. For example, a local proxy server running on port 1316 can be used like this:</p>
<div class="sourceCode" id="cb18"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">bucket</span> <span class="op">&lt;-</span> <span class="fu"><a href="../reference/s3_bucket.html">s3_bucket</a></span><span class="op">(</span></span>
<span> bucket <span class="op">=</span> <span class="st">"voltrondata-labs-datasets"</span>, </span>
<span> proxy_options <span class="op">=</span> <span class="st">"http://localhost:1316"</span></span>
<span><span class="op">)</span></span></code></pre></div>
</div>
<div class="section level2">
<h2 id="file-systems-that-emulate-s3">File systems that emulate S3<a class="anchor" aria-label="anchor" href="#file-systems-that-emulate-s3"></a>
</h2>
<p>The <code>S3FileSystem</code> machinery enables you to work with any file system that provides an S3-compatible interface. For example, <a href="https://min.io/" class="external-link">MinIO</a> is an object-storage server that emulates the S3 API. If you were to run <code>minio server</code> locally with its default settings, you could connect to it with arrow using <code>S3FileSystem</code> like this:</p>
<div class="sourceCode" id="cb19"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">minio</span> <span class="op">&lt;-</span> <span class="va">S3FileSystem</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span></span>
<span> access_key <span class="op">=</span> <span class="st">"minioadmin"</span>,</span>
<span> secret_key <span class="op">=</span> <span class="st">"minioadmin"</span>,</span>
<span> scheme <span class="op">=</span> <span class="st">"http"</span>,</span>
<span> endpoint_override <span class="op">=</span> <span class="st">"localhost:9000"</span></span>
<span><span class="op">)</span></span></code></pre></div>
<p>or, as a URI, it would be</p>
<pre><code>s3://minioadmin:minioadmin@?scheme=http&amp;endpoint_override=localhost%3A9000</code></pre>
<p>(Note the URL escaping of the <code>:</code> in <code>endpoint_override</code>).</p>
<p>Among other applications, this can be useful for testing out code locally before running on a remote S3 bucket.</p>
</div>
<div class="section level2">
<h2 id="disabling-environment-variables">Disabling environment variables<a class="anchor" aria-label="anchor" href="#disabling-environment-variables"></a>
</h2>
<p>As mentioned above, it is possible to make use of environment variables to configure access. However, if you wish to pass in connection details via a URI or alternative methods but also have existing AWS environment variables defined, these may interfere with your session. For example, you may see an error message like:</p>
<div class="sourceCode" id="cb21"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb21-1" title="1">Error<span class="op">:</span><span class="st"> </span>IOError<span class="op">:</span><span class="st"> </span>When resolving region <span class="cf">for</span> bucket <span class="st">'analysis'</span><span class="op">:</span><span class="st"> </span>AWS Error [code <span class="dv">99</span>]<span class="op">:</span><span class="st"> </span>curlCode<span class="op">:</span><span class="st"> </span><span class="dv">6</span>, Couldn<span class="st">'t resolve host name </span></a></code></pre></div>
<p>You can unset these environment variables using <code><a href="https://rdrr.io/r/base/Sys.setenv.html" class="external-link">Sys.unsetenv()</a></code>, for example:</p>
<div class="sourceCode" id="cb22"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/Sys.setenv.html" class="external-link">Sys.unsetenv</a></span><span class="op">(</span><span class="st">"AWS_DEFAULT_REGION"</span><span class="op">)</span></span>
<span><span class="fu"><a href="https://rdrr.io/r/base/Sys.setenv.html" class="external-link">Sys.unsetenv</a></span><span class="op">(</span><span class="st">"AWS_S3_ENDPOINT"</span><span class="op">)</span></span></code></pre></div>
<p>By default, the AWS SDK tries to retrieve metadata about user configuration, which can cause conflicts when passing in connection details via URI (for example when accessing a MinIO bucket). To disable the use of AWS environment variables, you can set environment variable <code>AWS_EC2_METADATA_DISABLED</code> to <code>TRUE</code>.</p>
<div class="sourceCode" id="cb23"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/Sys.setenv.html" class="external-link">Sys.setenv</a></span><span class="op">(</span>AWS_EC2_METADATA_DISABLED <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span></code></pre></div>
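<p>Unsetting variables with <code>Sys.unsetenv()</code> affects the rest of the R session. If you only want to hide them for a single call, one possible approach is a small wrapper that restores the previous values afterwards. <code>without_env()</code> below is a hypothetical base R helper written for this illustration, not part of arrow, and the region value is only an example:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R"># Hypothetical helper: evaluate `code` with the given environment variables
# unset, then restore whichever of them had a value beforehand
without_env &lt;- function(vars, code) {
  old &lt;- Sys.getenv(vars, unset = NA, names = TRUE)
  on.exit({
    previously_set &lt;- old[!is.na(old)]
    if (length(previously_set) &gt; 0) {
      do.call(Sys.setenv, as.list(previously_set))
    }
  }, add = TRUE)
  Sys.unsetenv(vars)
  force(code)
}

# Demonstration: the variable is hidden inside the call and restored afterwards
Sys.setenv(AWS_DEFAULT_REGION = "us-east-1")
inside &lt;- without_env("AWS_DEFAULT_REGION", Sys.getenv("AWS_DEFAULT_REGION"))
after &lt;- Sys.getenv("AWS_DEFAULT_REGION")</code></pre></div>
<p>In practice the second argument would be the cloud call itself, for example a <code>read_parquet()</code> call on a URI.</p>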
</div>
<div class="section level2">
<h2 id="further-reading">Further reading<a class="anchor" aria-label="anchor" href="#further-reading"></a>
</h2>
<ul>
<li>To learn more about <code>FileSystem</code> classes, including <code>S3FileSystem</code> and <code>GcsFileSystem</code>, see <code><a href="../reference/FileSystem.html">help("FileSystem", package = "arrow")</a></code>.</li>
<li>To see a data analysis example that relies on data hosted on cloud storage, see the <a href="./dataset.html">dataset article</a>.</li>
</ul>
</div>
</main><aside class="col-md-3"><nav id="toc"><h2>On this page</h2>
</nav></aside>
</div>
<footer><div class="pkgdown-footer-left">
<p></p>
<p><a href="https://arrow.apache.org/docs/r/versions.html">Older versions of these docs</a></p>
</div>
<div class="pkgdown-footer-right">
<p></p>
<p>Site built with <a href="https://pkgdown.r-lib.org/" class="external-link">pkgdown</a> 2.0.7.</p>
</div>
</footer>
</div>
</body>
</html>