---
title: "Working with Cloud Storage (S3)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The Arrow C++ library includes a generic filesystem interface and specific
implementations for some cloud storage systems. This setup allows various
parts of the project to read and write data with different storage
backends. In the `arrow` R package, support has been enabled for AWS S3.
This vignette provides an overview of working with S3 data using Arrow.

> In Windows and macOS binary packages, S3 support is included. On Linux when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details.

## URIs

File readers and writers (`read_parquet()`, `write_feather()`, et al.)
accept an S3 URI as the source or destination file,
as do `open_dataset()` and `write_dataset()`.
An S3 URI looks like:

```
s3://[access_key:secret_key@]bucket/path[?region=]
```

For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at

```
s3://ursa-labs-taxi-data/2019/06/data.parquet
```

Given this URI, we can pass it to `read_parquet()` just as if it were a local file path:

```r
df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
```

Note that this will be slower to read than if the file were local,
though if you're running on a machine in the same AWS region as the file in S3,
the cost of reading the data over the network should be much lower.

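The same kind of URI works with the dataset functions and the file writers. As a minimal sketch (the `open_dataset()` call mirrors the one in `vignette("dataset", package = "arrow")`; the destination bucket `my-bucket` is hypothetical and you would need write access to it):

```r
# Open the whole taxi dataset in the bucket, partitioned by year and month
ds <- open_dataset("s3://ursa-labs-taxi-data", partitioning = c("year", "month"))

# Write a data frame to a bucket you control ("my-bucket" is hypothetical)
write_parquet(df, "s3://my-bucket/nyc-taxi/2019/06/data.parquet")
```
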
## Creating a FileSystem object

Another way to connect to S3 is to create a `FileSystem` object once and pass
that to the read/write functions.
`S3FileSystem` objects can be created with the `s3_bucket()` function, which
automatically detects the bucket's AWS region. Additionally, the resulting
`FileSystem` will consider paths relative to the bucket's path (so for example
you don't need to prefix the bucket path when listing a directory).
This may be convenient when dealing with
long URIs, and it's necessary for some options and authentication methods
that aren't supported in the URI format.

With a `FileSystem` object, we can point to specific files in it with the `$path()` method.
In the previous example, this would look like:

```r
bucket <- s3_bucket("ursa-labs-taxi-data")
df <- read_parquet(bucket$path("2019/06/data.parquet"))
```
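
Because paths are interpreted relative to the bucket, you can also list its contents without repeating the bucket name. A quick sketch (the exact listing depends on the bucket's current contents):

```r
# List the top-level contents of the bucket
bucket$ls()

# List the files under one year's subdirectory
bucket$ls("2019")
```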

See the help for `FileSystem` for a list of options that `s3_bucket()` and `S3FileSystem$create()`
can take. `region`, `scheme`, and `endpoint_override` can be encoded as query
parameters in the URI (though `region` will be auto-detected in `s3_bucket()` or from the URI if omitted).
`access_key` and `secret_key` can also be included,
but other options are not supported in the URI.
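
For example, a URI that sets the region explicitly (with a hypothetical bucket name) would look like:

```
s3://my-bucket/path/to/data.parquet?region=us-east-2
```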

The object that `s3_bucket()` returns is technically a `SubTreeFileSystem`, which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be useful for holding a reference to a subdirectory somewhere, on S3 or elsewhere.

One way to get a subtree is to call the `$cd()` method on a `FileSystem`:

```r
june2019 <- bucket$cd("2019/06")
df <- read_parquet(june2019$path("data.parquet"))
```

`SubTreeFileSystem` can also be made from a URI:

```r
june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
```

## Authentication

To access private S3 buckets, you typically need two secret parameters:
an `access_key`, which is like a user id,
and a `secret_key`, which is like a token.
There are a few options for passing these credentials:

1. Include them in the URI, like `s3://access_key:secret_key@bucket-name/path/to/file`. Be sure to [URL-encode](https://en.wikipedia.org/wiki/Percent-encoding) your secrets if they contain special characters like "/".

2. Pass them as `access_key` and `secret_key` to `S3FileSystem$create()` or `s3_bucket()`.

3. Set them as environment variables named `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, respectively.

4. Define them in a `~/.aws/credentials` file, according to the [AWS documentation](https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/credentials.html).

You can also use an [AccessRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html)
for temporary access by passing the `role_arn` identifier to `S3FileSystem$create()` or `s3_bucket()`.
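
For example, a minimal sketch of passing credentials or a role directly (the bucket name, keys, and ARN below are hypothetical placeholders):

```r
# Pass credentials directly (the values shown are placeholders, not real keys)
bucket <- s3_bucket(
  "my-private-bucket",
  access_key = "AKIAIOSFODNN7EXAMPLE",
  secret_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
)

# Or assume an IAM role for temporary access (hypothetical ARN)
bucket <- s3_bucket(
  "my-private-bucket",
  role_arn = "arn:aws:iam::123456789012:role/my-read-only-role"
)
```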

## File systems that emulate S3

The `S3FileSystem` machinery enables you to work with any file system that
provides an S3-compatible interface. For example, [MinIO](https://min.io/) is
an object-storage server that emulates the S3 API. If you were to
run `minio server` locally with its default settings, you could connect to
it with `arrow` using `S3FileSystem` like this:

```r
minio <- S3FileSystem$create(
  access_key = "minioadmin",
  secret_key = "minioadmin",
  scheme = "http",
  endpoint_override = "localhost:9000"
)
```

or, as a URI, it would be

```
s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
```

(note the URL escaping of the `:` in `endpoint_override`).

Among other applications, this can be useful for testing out code locally before
running on a remote S3 bucket.
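
For example, a quick round trip against the local MinIO server might look like this (a sketch; it assumes you can create a bucket named `test-bucket` on the server, and the names here are placeholders):

```r
# On S3-compatible stores, a top-level "directory" corresponds to a bucket
minio$CreateDir("test-bucket")

# Write a data frame to the local MinIO server and read it back
write_parquet(mtcars, minio$path("test-bucket/mtcars.parquet"))
df <- read_parquet(minio$path("test-bucket/mtcars.parquet"))
```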