| <!DOCTYPE html> |
| <!-- Generated by pkgdown: do not edit by hand --><html lang="en-US"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge"> |
| <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> |
| <title>Using cloud storage (S3, GCS) • Arrow R Package</title> |
| <!-- favicons --><link rel="icon" type="image/png" sizes="96x96" href="../favicon-96x96.png"> |
| <link rel="icon" type="image/svg+xml" href="../favicon.svg"> |
| <link rel="apple-touch-icon" sizes="180x180" href="../apple-touch-icon.png"> |
| <link rel="icon" sizes="any" href="../favicon.ico"> |
| <link rel="manifest" href="../site.webmanifest"> |
| <script src="../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script> |
| <link href="../deps/bootstrap-5.3.1/bootstrap.min.css" rel="stylesheet"> |
| <script src="../deps/bootstrap-5.3.1/bootstrap.bundle.min.js"></script><link href="../deps/font-awesome-6.5.2/css/all.min.css" rel="stylesheet"> |
| <link href="../deps/font-awesome-6.5.2/css/v4-shims.min.css" rel="stylesheet"> |
| <script src="../deps/headroom-0.11.0/headroom.min.js"></script><script src="../deps/headroom-0.11.0/jQuery.headroom.min.js"></script><script src="../deps/bootstrap-toc-1.0.1/bootstrap-toc.min.js"></script><script src="../deps/clipboard.js-2.0.11/clipboard.min.js"></script><script src="../deps/search-1.0.0/autocomplete.jquery.min.js"></script><script src="../deps/search-1.0.0/fuse.min.js"></script><script src="../deps/search-1.0.0/mark.min.js"></script><!-- pkgdown --><script src="../pkgdown.js"></script><link href="../extra.css" rel="stylesheet"> |
| <meta property="og:title" content="Using cloud storage (S3, GCS)"> |
| <meta name="description" content="Learn how to work with data sets stored in an Amazon S3 bucket or on Google Cloud Storage |
| "> |
| <meta property="og:description" content="Learn how to work with data sets stored in an Amazon S3 bucket or on Google Cloud Storage |
| "> |
| <meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png"> |
| <meta property="og:image:alt" content="Apache Arrow logo, displaying the triple chevron image adjacent to the text"> |
| <!-- Matomo --><script> |
| var _paq = window._paq = window._paq || []; |
| /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ |
| /* We explicitly disable cookie tracking to avoid privacy issues */ |
| _paq.push(['disableCookies']); |
| _paq.push(['trackPageView']); |
| _paq.push(['enableLinkTracking']); |
| (function() { |
| var u="https://analytics.apache.org/"; |
| _paq.push(['setTrackerUrl', u+'matomo.php']); |
| _paq.push(['setSiteId', '20']); |
| var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; |
| g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); |
| })(); |
| </script><!-- End Matomo Code --><!-- Kapa AI --><script async src="https://widget.kapa.ai/kapa-widget.bundle.js" data-website-id="9db461d5-ac77-4b3f-a5c5-75efa78339d2" data-project-name="Apache Arrow" data-project-color="#000000" data-project-logo="https://arrow.apache.org/img/arrow-logo_chevrons_white-txt_black-bg.png" data-modal-disclaimer="This is a custom LLM with access to all of [Arrow documentation](https://arrow.apache.org/docs/). If you want an R-specific answer, please mention this in your question." data-consent-required="true" data-user-analytics-cookie-enabled="false" data-consent-screen-disclaimer="By clicking &quot;I agree, let's chat&quot;, you consent to the use of the AI assistant in accordance with kapa.ai's [Privacy Policy](https://www.kapa.ai/content/privacy-policy). This service uses reCAPTCHA, which requires your consent to Google's [Privacy Policy](https://policies.google.com/privacy) and [Terms of Service](https://policies.google.com/terms). By proceeding, you explicitly agree to both kapa.ai's and Google's privacy policies."></script><!-- End Kapa AI --> |
| </head> |
| <body> |
| <a href="#main" class="visually-hidden-focusable">Skip to contents</a> |
| |
| |
| <nav class="navbar fixed-top navbar-dark navbar-expand-lg bg-black"><div class="container"> |
| |
| <a class="navbar-brand me-2" href="../index.html">Arrow R Package</a> |
| |
| <span class="version"> |
| <small class="nav-text text-muted me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="">22.0.0.9000</small> |
| </span> |
| |
| |
| <button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation"> |
| <span class="navbar-toggler-icon"></span> |
| </button> |
| |
| <div id="navbar" class="collapse navbar-collapse ms-3"> |
| <ul class="navbar-nav me-auto"> |
| <li class="nav-item"><a class="nav-link" href="../articles/arrow.html">Get started</a></li> |
| <li class="nav-item"><a class="nav-link" href="../reference/index.html">Reference</a></li> |
| <li class="active nav-item dropdown"> |
| <button class="nav-link dropdown-toggle" type="button" id="dropdown-articles" data-bs-toggle="dropdown" aria-expanded="false" aria-haspopup="true">Articles</button> |
| <ul class="dropdown-menu" aria-labelledby="dropdown-articles"> |
| <li><hr class="dropdown-divider"></li> |
| <li><h6 class="dropdown-header" data-toc-skip>Using the package</h6></li> |
| <li><a class="dropdown-item" href="../articles/read_write.html">Reading and writing data files</a></li> |
| <li><a class="dropdown-item" href="../articles/data_wrangling.html">Data analysis with dplyr syntax</a></li> |
| <li><a class="dropdown-item" href="../articles/dataset.html">Working with multi-file data sets</a></li> |
| <li><a class="dropdown-item" href="../articles/python.html">Integrating Arrow, Python, and R</a></li> |
| <li><a class="dropdown-item" href="../articles/fs.html">Using cloud storage (S3, GCS)</a></li> |
| <li><a class="dropdown-item" href="../articles/flight.html">Connecting to a Flight server</a></li> |
| <li><hr class="dropdown-divider"></li> |
| <li><h6 class="dropdown-header" data-toc-skip>Arrow concepts</h6></li> |
| <li><a class="dropdown-item" href="../articles/data_objects.html">Data objects</a></li> |
| <li><a class="dropdown-item" href="../articles/data_types.html">Data types</a></li> |
| <li><a class="dropdown-item" href="../articles/metadata.html">Metadata</a></li> |
| <li><hr class="dropdown-divider"></li> |
| <li><h6 class="dropdown-header" data-toc-skip>Installation</h6></li> |
| <li><a class="dropdown-item" href="../articles/install.html">Installing on Linux</a></li> |
| <li><a class="dropdown-item" href="../articles/install_nightly.html">Installing development versions</a></li> |
| <li><hr class="dropdown-divider"></li> |
| <li><a class="dropdown-item" href="../articles/index.html">More articles...</a></li> |
| </ul> |
| </li> |
| <li class="nav-item"><a class="nav-link" href="../news/index.html">Changelog</a></li> |
| </ul> |
| <form class="form-inline my-2 my-lg-0" role="search"> |
| <input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../search.json" id="search-input" placeholder="" autocomplete="off"> |
| </form> |
| |
| <ul class="navbar-nav"> |
| <li class="nav-item"><a class="external-link nav-link" href="https://github.com/apache/arrow/" aria-label="GitHub"><span class="fa fab fa-github fa-lg"></span></a></li> |
| </ul> |
| </div> |
| |
| |
| </div> |
| </nav><div class="container template-article"> |
| |
| |
| |
| |
| <div class="row"> |
| <main id="main" class="col-md-9"><div class="page-header"> |
| |
| <h1>Using cloud storage (S3, GCS)</h1> |
| |
| |
| <small class="dont-index">Source: <a href="https://github.com/apache/arrow/blob/main/r/vignettes/fs.Rmd" class="external-link"><code>vignettes/fs.Rmd</code></a></small> |
| <div class="d-none name"><code>fs.Rmd</code></div> |
| </div> |
| |
| |
| |
| <p>Working with data stored in cloud storage systems like <a href="https://docs.aws.amazon.com/s3/" class="external-link">Amazon Simple Storage Service</a> |
| (S3) and <a href="https://cloud.google.com/storage/docs" class="external-link">Google Cloud |
| Storage</a> (GCS) is a very common task. For this reason, the Arrow C++ |
| library provides a toolkit that aims to make working with cloud storage |
| as simple as working with the local filesystem.</p> |
| <p>To make this work, the Arrow C++ library contains a general-purpose |
| interface for file systems, and the arrow package exposes this interface |
| to R users. For instance, you can create a |
| <code>LocalFileSystem</code> object that allows you to interact with the |
| local file system in the usual ways: copying, moving, and deleting |
| files, obtaining information about files and folders, and so on (see |
| <code><a href="../reference/FileSystem.html">help("FileSystem", package = "arrow")</a></code> for details). In |
| general you probably don’t need this functionality because you already |
| have tools for working with your local file system, but this interface |
| becomes much more useful in the context of remote file systems. |
| Currently there is a specific implementation for Amazon S3 provided by |
| the <code>S3FileSystem</code> class, and another one for Google Cloud |
| Storage provided by <code>GcsFileSystem</code>.</p> |
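| <p>As a minimal sketch of the local interface described above, the following creates a <code>LocalFileSystem</code>, copies a file through it, and inspects the result. The temporary directory and the file names are illustrative choices rather than anything the package prescribes:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| |
| # Work inside a temporary directory (forward slashes keep paths portable) |
| demo_dir <- file.path(normalizePath(tempdir(), winslash = "/"), "arrow-fs-demo") |
| local_fs <- LocalFileSystem$create() |
| local_fs$CreateDir(demo_dir) |
| |
| # Write a small Parquet file, then copy it through the FileSystem interface |
| src <- file.path(demo_dir, "original.parquet") |
| dest <- file.path(demo_dir, "copy.parquet") |
| write_parquet(data.frame(x = 1:3), src) |
| local_fs$CopyFile(src, dest) |
| |
| # List the directory and look up metadata for the copy |
| local_fs$ls(demo_dir) |
| info <- local_fs$GetFileInfo(dest)[[1]] |
| info$size</code></pre></div> |
| <p>The same methods are available on the remote file system objects discussed in the rest of this article.</p> |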
| <p>This article provides an overview of working with both S3 and GCS |
| data using the Arrow toolkit.</p> |
| <div class="section level2"> |
| <h2 id="s3-and-gcs-support-on-linux">S3 and GCS support on Linux<a class="anchor" aria-label="anchor" href="#s3-and-gcs-support-on-linux"></a> |
| </h2> |
| <p>Before you start, make sure that your arrow install has support for |
| S3 and/or GCS enabled. For most users this will be true by default, |
| because the Windows and macOS binary packages hosted on CRAN include S3 |
| and GCS support. You can check whether support is enabled via helper |
| functions:</p> |
| <div class="sourceCode" id="cb1"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="../reference/arrow_info.html">arrow_with_s3</a></span><span class="op">(</span><span class="op">)</span></span> |
| <span><span class="fu"><a href="../reference/arrow_info.html">arrow_with_gcs</a></span><span class="op">(</span><span class="op">)</span></span></code></pre></div> |
| <p>If these return <code>TRUE</code> then the relevant support is |
| enabled.</p> |
| <p>In some cases you may find that your system does not have support |
| enabled. The most common case for this occurs on Linux when installing |
| arrow from source. In this situation S3 and GCS support is not always |
| enabled by default, and there are additional system requirements |
| involved. See the <a href="./install.html">installation article</a> for |
| details on how to resolve this.</p> |
| </div> |
| <div class="section level2"> |
| <h2 id="connecting-to-cloud-storage">Connecting to cloud storage<a class="anchor" aria-label="anchor" href="#connecting-to-cloud-storage"></a> |
| </h2> |
| <p>One way of working with filesystems is to create |
| <code><a href="../reference/FileSystem.html">?FileSystem</a></code> objects. <code><a href="../reference/FileSystem.html">?S3FileSystem</a></code> objects can |
| be created with the <code><a href="../reference/s3_bucket.html">s3_bucket()</a></code> function, which |
| automatically detects the bucket’s AWS region. Similarly, |
| <code><a href="../reference/FileSystem.html">?GcsFileSystem</a></code> objects can be created with the |
| <code><a href="../reference/gs_bucket.html">gs_bucket()</a></code> function. The resulting <code>FileSystem</code> |
| will consider paths relative to the bucket’s path (so for example you |
| don’t need to prefix the bucket path when listing a directory).</p> |
| <p>With a <code>FileSystem</code> object, you can point to specific |
| files in it with the <code>$path()</code> method and pass the result to |
| file readers and writers (<code><a href="../reference/read_parquet.html">read_parquet()</a></code>, |
| <code><a href="../reference/write_feather.html">write_feather()</a></code>, et al.).</p> |
| <p>Often the reason users work with cloud storage in real-world analysis |
| is to access large data sets. An example of this is discussed in the <a href="./dataset.html">datasets article</a>, but new users may prefer to |
| work with a much smaller data set while learning how the arrow cloud |
| storage interface works. To that end, the examples in this article rely |
| on a multi-file Parquet dataset that stores a copy of the |
| <code>diamonds</code> data made available through the <a href="https://ggplot2.tidyverse.org/" class="external-link"><code>ggplot2</code></a> package, |
| documented in <code>help("diamonds", package = "ggplot2")</code>. The |
| cloud storage version of this data set consists of 5 Parquet files |
| totaling less than 1 MB in size.</p> |
| <p>The diamonds data set is hosted on both S3 and GCS, in a bucket named |
| <code>voltrondata-labs-datasets</code>. To create an S3FileSystem object |
| that refers to that bucket, use the following command:</p> |
| <div class="sourceCode" id="cb2"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">bucket</span> <span class="op"><-</span> <span class="fu"><a href="../reference/s3_bucket.html">s3_bucket</a></span><span class="op">(</span><span class="st">"voltrondata-labs-datasets"</span><span class="op">)</span></span></code></pre></div> |
| <p>To do this for the GCS version of the data, the command is as |
| follows:</p> |
| <div class="sourceCode" id="cb3"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">bucket</span> <span class="op"><-</span> <span class="fu"><a href="../reference/gs_bucket.html">gs_bucket</a></span><span class="op">(</span><span class="st">"voltrondata-labs-datasets"</span>, anonymous <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span></code></pre></div> |
| <p>Note that <code>anonymous = TRUE</code> is required for GCS if |
| credentials have not been configured.</p> |
| <!-- TODO: update GCS note above if ARROW-17097 is addressed --> |
| <p>Within this bucket there is a folder called <code>diamonds</code>. We |
| can call <code>bucket$ls("diamonds")</code> to list the files stored in |
| this folder, or <code>bucket$ls("diamonds", recursive = TRUE)</code> to |
| recursively search subfolders. Note that on GCS, you should always set |
| <code>recursive = TRUE</code> because directories often don’t appear in |
| the results.</p> |
| <p>Here’s what we get when we list the files stored in the GCS |
| bucket:</p> |
| <div class="sourceCode" id="cb4"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">bucket</span><span class="op">$</span><span class="fu">ls</span><span class="op">(</span><span class="st">"diamonds"</span>, recursive <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span></code></pre></div> |
| <div class="sourceCode" id="cb5"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="co">## [1] "diamonds/cut=Fair/part-0.parquet" </span></span> |
| <span><span class="co">## [2] "diamonds/cut=Good/part-0.parquet" </span></span> |
| <span><span class="co">## [3] "diamonds/cut=Ideal/part-0.parquet" </span></span> |
| <span><span class="co">## [4] "diamonds/cut=Premium/part-0.parquet" </span></span> |
| <span><span class="co">## [5] "diamonds/cut=Very Good/part-0.parquet"</span></span></code></pre></div> |
| <p>There are 5 Parquet files here, one corresponding to each of the |
| “cut” categories in the <code>diamonds</code> data set. We can specify |
| the path to a specific file by calling <code>bucket$path()</code>:</p> |
| <div class="sourceCode" id="cb6"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">parquet_good</span> <span class="op"><-</span> <span class="va">bucket</span><span class="op">$</span><span class="fu">path</span><span class="op">(</span><span class="st">"diamonds/cut=Good/part-0.parquet"</span><span class="op">)</span></span></code></pre></div> |
| <p>We can use <code><a href="../reference/read_parquet.html">read_parquet()</a></code> to read from this path |
| directly into R:</p> |
| <div class="sourceCode" id="cb7"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">diamonds_good</span> <span class="op"><-</span> <span class="fu"><a href="../reference/read_parquet.html">read_parquet</a></span><span class="op">(</span><span class="va">parquet_good</span><span class="op">)</span></span> |
| <span><span class="va">diamonds_good</span></span></code></pre></div> |
| <div class="sourceCode" id="cb8"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="co">## # A tibble: 4,906 × 9</span></span> |
| <span><span class="co">## carat color clarity depth table price x y z</span></span> |
| <span><span class="co">## <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl></span></span> |
| <span><span class="co">## 1 0.23 E VS1 56.9 65 327 4.05 4.07 2.31</span></span> |
| <span><span class="co">## 2 0.31 J SI2 63.3 58 335 4.34 4.35 2.75</span></span> |
| <span><span class="co">## 3 0.3 J SI1 64 55 339 4.25 4.28 2.73</span></span> |
| <span><span class="co">## 4 0.3 J SI1 63.4 54 351 4.23 4.29 2.7 </span></span> |
| <span><span class="co">## 5 0.3 J SI1 63.8 56 351 4.23 4.26 2.71</span></span> |
| <span><span class="co">## 6 0.3 I SI2 63.3 56 351 4.26 4.3 2.71</span></span> |
| <span><span class="co">## 7 0.23 F VS1 58.2 59 402 4.06 4.08 2.37</span></span> |
| <span><span class="co">## 8 0.23 E VS1 64.1 59 402 3.83 3.85 2.46</span></span> |
| <span><span class="co">## 9 0.31 H SI1 64 54 402 4.29 4.31 2.75</span></span> |
| <span><span class="co">## 10 0.26 D VS2 65.2 56 403 3.99 4.02 2.61</span></span> |
| <span><span class="co">## # … with 4,896 more rows</span></span> |
| <span><span class="co">## # ℹ Use `print(n = ...)` to see more rows</span></span></code></pre></div> |
| <p>Note that this will be slower to read than if the file were |
| local.</p> |
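| <p>If you expect to read the same files repeatedly, one option is to copy them to local disk once and read from there afterwards. The sketch below uses <code>copy_files()</code> with the S3 URI of the diamonds folder; the destination under <code>tempdir()</code> is an arbitrary choice for illustration:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| |
| # Copy the remote diamonds folder into a local directory (location is arbitrary) |
| local_dir <- file.path(tempdir(), "diamonds") |
| copy_files("s3://voltrondata-labs-datasets/diamonds", local_dir) |
| |
| # Subsequent reads hit the local disk rather than the network |
| diamonds_good <- read_parquet(file.path(local_dir, "cut=Good", "part-0.parquet"))</code></pre></div> |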
| <!-- though if you're running on a machine in the same AWS region as the file in S3, |
| the cost of reading the data over the network should be much lower. --> |
| <!-- |
| See `help(FileSystem)` for a list of options that `s3_bucket()`/`S3FileSystem$create()` |
| and `gs_bucket()`/`GcsFileSystem$create()` can take. |
| |
| The object that `s3_bucket()` and `gs_bucket()` return is technically a `SubTreeFileSystem`, |
| which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be |
| useful for holding a reference to a subdirectory somewhere (on S3, GCS, or elsewhere). |
| |
| One way to get a subtree is to call the `$cd()` method on a `FileSystem` |
| |
| ```r |
| june2019 <- bucket$cd("nyc-taxi/year=2019/month=6") |
| df <- read_parquet(june2019$path("part-0.parquet")) |
| ``` |
| |
| `SubTreeFileSystem` can also be made from a URI: |
| |
| ```r |
| june2019 <- SubTreeFileSystem$create("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6") |
| ``` |
| --> |
| </div> |
| <div class="section level2"> |
| <h2 id="connecting-directly-with-a-uri">Connecting directly with a URI<a class="anchor" aria-label="anchor" href="#connecting-directly-with-a-uri"></a> |
| </h2> |
| <p>In most use cases, the easiest and most natural way to connect to |
| cloud storage in arrow is to use the FileSystem objects returned by |
| <code><a href="../reference/s3_bucket.html">s3_bucket()</a></code> and <code><a href="../reference/gs_bucket.html">gs_bucket()</a></code>, especially when |
| multiple file operations are required. However, in some cases you may |
| want to download a file directly by specifying the URI. This is |
| permitted by arrow, and functions like <code><a href="../reference/read_parquet.html">read_parquet()</a></code>, |
| <code><a href="../reference/write_feather.html">write_feather()</a></code>, <code><a href="../reference/open_dataset.html">open_dataset()</a></code>, and others all |
| accept URIs to cloud resources hosted on S3 or GCS. The format of an S3 |
| URI is as follows:</p> |
| <pre><code>s3://[access_key:secret_key@]bucket/path[?region=]</code></pre> |
| <p>For GCS, the URI format looks like this:</p> |
| <pre><code>gs://[access_key:secret_key@]bucket/path |
| gs://anonymous@bucket/path</code></pre> |
| <p>For example, the Parquet file storing the “good cut” diamonds that we |
| downloaded earlier in the article is available on both S3 and GCS. The |
| relevant URIs are as follows:</p> |
| <div class="sourceCode" id="cb11"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">uri</span> <span class="op"><-</span> <span class="st">"s3://voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"</span></span> |
| <span><span class="va">uri</span> <span class="op"><-</span> <span class="st">"gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"</span></span></code></pre></div> |
| <p>Note that “anonymous” is required on GCS for public buckets. |
| Regardless of which version you use, you can pass this URI to |
| <code><a href="../reference/read_parquet.html">read_parquet()</a></code> as if the file were stored locally:</p> |
| <div class="sourceCode" id="cb12"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">df</span> <span class="op"><-</span> <span class="fu"><a href="../reference/read_parquet.html">read_parquet</a></span><span class="op">(</span><span class="va">uri</span><span class="op">)</span></span></code></pre></div> |
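| <p>The same approach should work for a whole multi-file dataset. As a sketch, assuming the dplyr package is installed, the S3 URI of the <code>diamonds</code> folder can be passed to <code>open_dataset()</code> and queried without downloading every file first. The summary computed here is only an illustration:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| library(dplyr) |
| |
| # Open the multi-file dataset from its URI; the "cut" column is inferred |
| # from the Hive-style directory names (cut=Fair, cut=Good, ...) |
| diamonds_ds <- open_dataset("s3://voltrondata-labs-datasets/diamonds") |
| |
| # Filter and summarise lazily, then pull the result into R |
| diamonds_ds %>% |
|   filter(cut == "Good") %>% |
|   summarise(mean_price = mean(price)) %>% |
|   collect()</code></pre></div> |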
| <p>URIs accept additional options in the query parameters (the part |
| after the <code>?</code>) that are passed down to configure the |
| underlying file system. They are separated by <code>&</code>. For |
| example,</p> |
| <pre><code>s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true</code></pre> |
| <p>is equivalent to:</p> |
| <div class="sourceCode" id="cb14"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">bucket</span> <span class="op"><-</span> <span class="va">S3FileSystem</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span></span> |
| <span> endpoint_override<span class="op">=</span><span class="st">"https://storage.googleapis.com"</span>,</span> |
| <span> allow_bucket_creation<span class="op">=</span><span class="cn">TRUE</span></span> |
| <span><span class="op">)</span></span> |
| <span><span class="va">bucket</span><span class="op">$</span><span class="fu">path</span><span class="op">(</span><span class="st">"voltrondata-labs-datasets/"</span><span class="op">)</span></span></code></pre></div> |
| <p>Both tell the <code>S3FileSystem</code> object to allow the creation |
| of new buckets and to talk to Google Cloud Storage instead of S3. |
| The latter works because GCS implements an S3-compatible API – see <a href="#file-systems-that-emulate-s3">File systems that emulate S3</a> |
| below – but if you want better support for GCS you should use a |
| <code>GcsFileSystem</code> instead, by using a URI that starts with |
| <code>gs://</code>.</p> |
| <p>Also note that parameters in the URI need to be <a href="https://en.wikipedia.org/wiki/Percent-encoding" class="external-link">percent |
| encoded</a>, which is why <code>://</code> is written as |
| <code>%3A%2F%2F</code>.</p> |
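| <p>Rather than encoding values by hand, you can let base R do it. The following sketch uses <code>URLencode()</code> with <code>reserved = TRUE</code> to rebuild the example URI shown above; the variable names are arbitrary:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"># Percent-encode the option value |
| endpoint <- URLencode("https://storage.googleapis.com", reserved = TRUE) |
| endpoint |
| ## [1] "https%3A%2F%2Fstorage.googleapis.com" |
| |
| # Assemble the full URI, joining the options with "&" |
| uri <- paste0( |
|   "s3://voltrondata-labs-datasets/", |
|   "?endpoint_override=", endpoint, |
|   "&allow_bucket_creation=true" |
| )</code></pre></div> |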
| <p>For S3, the only options that can be included in the URI as query |
| parameters are <code>region</code>, <code>scheme</code>, |
| <code>endpoint_override</code>, <code>access_key</code>, |
| <code>secret_key</code>, <code>allow_bucket_creation</code>, |
| <code>allow_bucket_deletion</code> and |
| <code>check_directory_existence_before_creation</code>. For GCS, the |
| supported parameters are <code>scheme</code>, |
| <code>endpoint_override</code>, and |
| <code>retry_limit_seconds</code>.</p> |
| <p>In GCS, a useful option is <code>retry_limit_seconds</code>, which |
| sets the number of seconds a request may spend retrying before returning |
| an error. The current default is 15 minutes, so in many interactive |
| contexts it’s nice to set a lower value:</p> |
| <pre><code>gs://anonymous@voltrondata-labs-datasets/diamonds/?retry_limit_seconds=10</code></pre> |
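| <p>As a sketch of how this might look in practice, the option can be appended to the URI of a single file, or passed as an argument when creating the bucket object. The 10-second value is only an example:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| |
| # Pass the option as a query parameter on the file's URI |
| uri <- paste0( |
|   "gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet", |
|   "?retry_limit_seconds=10" |
| ) |
| df <- read_parquet(uri) |
| |
| # Or set it when creating the bucket object |
| bucket <- gs_bucket( |
|   "voltrondata-labs-datasets", |
|   anonymous = TRUE, |
|   retry_limit_seconds = 10 |
| )</code></pre></div> |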
| </div> |
| <div class="section level2"> |
| <h2 id="authentication">Authentication<a class="anchor" aria-label="anchor" href="#authentication"></a> |
| </h2> |
| <div class="section level3"> |
| <h3 id="s3-authentication">S3 Authentication<a class="anchor" aria-label="anchor" href="#s3-authentication"></a> |
| </h3> |
| <p>To access private S3 buckets, you typically need two secret |
| parameters: an <code>access_key</code>, which is like a user id, and a |
| <code>secret_key</code>, which is like a token or password. There are a |
| few options for passing these credentials:</p> |
| <ul> |
| <li><p>Include them in the URI, like |
| <code>s3://access_key:secret_key@bucket-name/path/to/file</code>. Be |
| sure to <a href="https://en.wikipedia.org/wiki/Percent-encoding" class="external-link">URL-encode</a> |
| your secrets if they contain special characters like “/” (e.g., |
| <code>URLencode("123/456", reserved = TRUE)</code>).</p></li> |
| <li><p>Pass them as <code>access_key</code> and <code>secret_key</code> |
| to <code>S3FileSystem$create()</code> or |
| <code><a href="../reference/s3_bucket.html">s3_bucket()</a></code>.</p></li> |
| <li><p>Set them as environment variables named |
| <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code>, |
| respectively.</p></li> |
| <li><p>Define them in a <code>~/.aws/credentials</code> file, according |
| to the <a href="https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/credentials.html" class="external-link">AWS |
| documentation</a>.</p></li> |
| <li><p>Use an <a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html" class="external-link">AccessRole</a> |
| for temporary access by passing the <code>role_arn</code> identifier to |
| <code>S3FileSystem$create()</code> or <code><a href="../reference/s3_bucket.html">s3_bucket()</a></code>.</p></li> |
| </ul> |
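| <p>As an illustration of two of these options, the sketch below first sets the environment variables and then passes the same values explicitly. The bucket name and both credential strings are placeholders, so the calls will only succeed once you substitute real values:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| |
| # Placeholder credentials: replace with your own and never commit real secrets |
| Sys.setenv( |
|   AWS_ACCESS_KEY_ID = "my-access-key", |
|   AWS_SECRET_ACCESS_KEY = "my-secret-key" |
| ) |
| bucket <- s3_bucket("my-private-bucket") |
| |
| # Alternatively, pass the same values as arguments |
| bucket <- s3_bucket( |
|   "my-private-bucket", |
|   access_key = "my-access-key", |
|   secret_key = "my-secret-key" |
| )</code></pre></div> |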
| </div> |
| <div class="section level3"> |
| <h3 id="gcs-authentication">GCS Authentication<a class="anchor" aria-label="anchor" href="#gcs-authentication"></a> |
| </h3> |
| <p>The simplest way to authenticate with GCS is to run the <a href="https://cloud.google.com/sdk/docs/" class="external-link">gcloud</a> command to set up |
| application default credentials:</p> |
| <pre><code>gcloud auth application-default login</code></pre> |
| <p>To manually configure credentials, you can pass either |
| <code>access_token</code> and <code>expiration</code>, for using |
| temporary tokens generated elsewhere, or <code>json_credentials</code>, |
| to reference a downloaded credentials file.</p> |
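| <p>A sketch of both manual options follows. The bucket name, the credentials path, and the token are hypothetical placeholders, and the one-hour expiration is an arbitrary example:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| |
| # Option 1: a credentials file downloaded from Google Cloud (hypothetical path) |
| bucket <- gs_bucket( |
|   "my-private-bucket", |
|   json_credentials = "path/to/service-account.json" |
| ) |
| |
| # Option 2: a temporary token generated elsewhere (placeholder values) |
| bucket <- gs_bucket( |
|   "my-private-bucket", |
|   access_token = "my-temporary-token", |
|   expiration = Sys.time() + 3600 |
| )</code></pre></div> |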
| <p>If you haven’t configured credentials, then to access <em>public</em> |
| buckets, you must pass <code>anonymous = TRUE</code> or |
| <code>anonymous</code> as the user in a URI:</p> |
| <div class="sourceCode" id="cb17"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">bucket</span> <span class="op"><-</span> <span class="fu"><a href="../reference/gs_bucket.html">gs_bucket</a></span><span class="op">(</span><span class="st">"voltrondata-labs-datasets"</span>, anonymous <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span> |
| <span><span class="va">fs</span> <span class="op"><-</span> <span class="va">GcsFileSystem</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span>anonymous <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span> |
| <span><span class="va">df</span> <span class="op"><-</span> <span class="fu"><a href="../reference/read_parquet.html">read_parquet</a></span><span class="op">(</span><span class="st">"gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"</span><span class="op">)</span></span></code></pre></div> |
| <!-- TODO(ARROW-16880): Describe what credentials to use for particular use cases |
| and how to integrate with gargle library. --> |
| </div> |
| </div> |
| <div class="section level2"> |
| <h2 id="using-a-proxy-server">Using a proxy server<a class="anchor" aria-label="anchor" href="#using-a-proxy-server"></a> |
| </h2> |
| <p>If you need to use a proxy server to connect to an S3 bucket, you can |
| provide a URI in the form <code>http://user:password@host:port</code> to |
| <code>proxy_options</code>. For example, a local proxy server running on |
| port 1316 can be used like this:</p> |
| <div class="sourceCode" id="cb18"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">bucket</span> <span class="op"><-</span> <span class="fu"><a href="../reference/s3_bucket.html">s3_bucket</a></span><span class="op">(</span></span> |
| <span> bucket <span class="op">=</span> <span class="st">"voltrondata-labs-datasets"</span>, </span> |
| <span> proxy_options <span class="op">=</span> <span class="st">"http://localhost:1316"</span></span> |
| <span><span class="op">)</span></span></code></pre></div> |
| </div> |
| <div class="section level2"> |
| <h2 id="file-systems-that-emulate-s3">File systems that emulate S3<a class="anchor" aria-label="anchor" href="#file-systems-that-emulate-s3"></a> |
| </h2> |
| <p>The <code>S3FileSystem</code> machinery enables you to work with any |
| file system that provides an S3-compatible interface. For example, <a href="https://min.io/" class="external-link">MinIO</a> is an object-storage server that |
| emulates the S3 API. If you were to run <code>minio server</code> |
| locally with its default settings, you could connect to it with arrow |
| using <code>S3FileSystem</code> like this:</p> |
| <div class="sourceCode" id="cb19"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="va">minio</span> <span class="op"><-</span> <span class="va">S3FileSystem</span><span class="op">$</span><span class="fu">create</span><span class="op">(</span></span> |
| <span> access_key <span class="op">=</span> <span class="st">"minioadmin"</span>,</span> |
| <span> secret_key <span class="op">=</span> <span class="st">"minioadmin"</span>,</span> |
| <span> scheme <span class="op">=</span> <span class="st">"http"</span>,</span> |
| <span> endpoint_override <span class="op">=</span> <span class="st">"localhost:9000"</span></span> |
| <span><span class="op">)</span></span></code></pre></div> |
| <p>or, as a URI, it would be</p> |
| <pre><code>s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000</code></pre> |
| <p>(Note the URL escaping of the <code>:</code> in |
| <code>endpoint_override</code>).</p> |
| <p>Among other applications, this can be useful for testing out code |
| locally before running on a remote S3 bucket.</p> |
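| <p>As a sketch of such a local test, assuming a MinIO server is already running on <code>localhost:9000</code> with its default credentials, you could create a bucket and round-trip a small file. The bucket and file names are made up for this example, and <code>allow_bucket_creation = TRUE</code> is added so that the bucket can be created:</p> |
| <div class="sourceCode"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R">library(arrow) |
| |
| minio <- S3FileSystem$create( |
|   access_key = "minioadmin", |
|   secret_key = "minioadmin", |
|   scheme = "http", |
|   endpoint_override = "localhost:9000", |
|   allow_bucket_creation = TRUE |
| ) |
| |
| # Top-level directories correspond to buckets; "test-bucket" is a made-up name |
| minio$CreateDir("test-bucket") |
| |
| # Write a small Parquet file to the bucket and read it back |
| write_parquet(data.frame(x = 1:3), minio$path("test-bucket/demo.parquet")) |
| read_parquet(minio$path("test-bucket/demo.parquet"))</code></pre></div> |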
| </div> |
| <div class="section level2"> |
| <h2 id="disabling-environment-variables">Disabling environment variables<a class="anchor" aria-label="anchor" href="#disabling-environment-variables"></a> |
| </h2> |
| <p>As mentioned above, it is possible to make use of environment |
| variables to configure access. However, if you wish to pass in |
| connection details via a URI or alternative methods but also have |
| existing AWS environment variables defined, these may interfere with |
| your session. For example, you may see an error message like:</p> |
| <div class="sourceCode" id="cb21"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a>Error: IOError: When resolving region for bucket 'analysis': AWS Error [code 99]: curlCode: 6, Couldn't resolve host name</span></code></pre></div> |
| <p>You can unset these environment variables using |
| <code><a href="https://rdrr.io/r/base/Sys.setenv.html" class="external-link">Sys.unsetenv()</a></code>, for example:</p> |
| <div class="sourceCode" id="cb22"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/Sys.setenv.html" class="external-link">Sys.unsetenv</a></span><span class="op">(</span><span class="st">"AWS_DEFAULT_REGION"</span><span class="op">)</span></span> |
| <span><span class="fu"><a href="https://rdrr.io/r/base/Sys.setenv.html" class="external-link">Sys.unsetenv</a></span><span class="op">(</span><span class="st">"AWS_S3_ENDPOINT"</span><span class="op">)</span></span></code></pre></div> |
| <p>By default, the AWS SDK tries to retrieve metadata about user |
| configuration, which can cause conflicts when passing in connection |
| details via URI (for example when accessing a MinIO bucket). To disable |
| this metadata lookup, you can set the environment variable |
| <code>AWS_EC2_METADATA_DISABLED</code> to <code>TRUE</code>.</p> |
| <div class="sourceCode" id="cb23"><pre class="downlit sourceCode r"> |
| <code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/Sys.setenv.html" class="external-link">Sys.setenv</a></span><span class="op">(</span>AWS_EC2_METADATA_DISABLED <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span></code></pre></div> |
| </div> |
| <div class="section level2"> |
| <h2 id="further-reading">Further reading<a class="anchor" aria-label="anchor" href="#further-reading"></a> |
| </h2> |
| <ul> |
| <li>To learn more about <code>FileSystem</code> classes, including |
| <code>S3FileSystem</code> and <code>GcsFileSystem</code>, see |
| <code><a href="../reference/FileSystem.html">help("FileSystem", package = "arrow")</a></code>.</li> |
| <li>To see a data analysis example that relies on data hosted on cloud |
| storage, see the <a href="./dataset.html">dataset article</a>.</li> |
| </ul> |
| </div> |
| </main><aside class="col-md-3"><nav id="toc" aria-label="Table of contents"><h2>On this page</h2> |
| </nav></aside> |
| </div> |
| |
| |
| |
| <footer><div class="pkgdown-footer-left"> |
| <p><a href="https://arrow.apache.org/docs/r/versions.html">Older versions of these docs</a></p> |
| </div> |
| |
| <div class="pkgdown-footer-right"> |
| <p>Site built with <a href="https://pkgdown.r-lib.org/" class="external-link">pkgdown</a> 2.1.3.</p> |
| </div> |
| |
| </footer> |
| </div> |
| |
| |
| |
| |
| |
| </body> |
| </html> |