---
title: "Working with Cloud Storage (S3)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
The Arrow C++ library includes a generic filesystem interface and specific
implementations for some cloud storage systems. This setup allows various
parts of the project to read and write data with different storage
backends. In the `arrow` R package, support has been enabled for AWS S3.
This vignette provides an overview of working with S3 data using Arrow.
> S3 support is included in the Windows and macOS binary packages. On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details.
## URIs
File readers and writers (`read_parquet()`, `write_feather()`, et al.)
accept an S3 URI as the source or destination file,
as do `open_dataset()` and `write_dataset()`.
An S3 URI looks like:
```
s3://[access_key:secret_key@]bucket/path[?region=]
```
For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
```
s3://ursa-labs-taxi-data/2019/06/data.parquet
```
Given this URI, we can pass it to `read_parquet()` just as if it were a local file path:
```r
df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
```
Note that this will be slower to read than if the file were local,
though if you're running on a machine in the same AWS region as the file in S3,
the cost of reading the data over the network should be much lower.
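`open_dataset()` accepts an S3 URI in the same way. As a sketch, pointing it at the bucket root opens the whole multi-file dataset without downloading anything up front (the partition names follow the layout described in `vignette("dataset", package = "arrow")`):

```r
# Open the full taxi dataset as a Dataset; files are only read
# when the data is actually queried
ds <- open_dataset("s3://ursa-labs-taxi-data", partitioning = c("year", "month"))
```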
## Creating a FileSystem object
Another way to connect to S3 is to create a `FileSystem` object once and pass
that to the read/write functions.
`S3FileSystem` objects can be created with the `s3_bucket()` function, which
automatically detects the bucket's AWS region. Additionally, the resulting
`FileSystem` will consider paths relative to the bucket's path (so for example
you don't need to prefix the bucket path when listing a directory).
This may be convenient when dealing with
long URIs, and it's necessary for some options and authentication methods
that aren't supported in the URI format.
With a `FileSystem` object, we can point to specific files in it with the `$path()` method.
In the previous example, this would look like:
```r
bucket <- s3_bucket("ursa-labs-taxi-data")
df <- read_parquet(bucket$path("2019/06/data.parquet"))
```
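Because paths are resolved relative to the bucket, listing a directory requires no bucket prefix. A minimal sketch, using the `$ls()` method described in the `FileSystem` help:

```r
# List the contents of the "2019" directory; note there is no
# "ursa-labs-taxi-data/" prefix on the path
bucket$ls("2019")
```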
See the help for `FileSystem` for a list of options that `s3_bucket()` and `S3FileSystem$create()`
can take. `region`, `scheme`, and `endpoint_override` can be encoded as query
parameters in the URI (though `region` is auto-detected if omitted, both by `s3_bucket()` and when reading from a URI).
`access_key` and `secret_key` can also be included,
but other options are not supported in the URI.
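For example, the region can be supplied either way; a sketch (the region shown here is an assumption, since it is auto-detected when omitted):

```r
# Pass the region as an argument...
bucket <- s3_bucket("ursa-labs-taxi-data", region = "us-east-2")

# ...or encode it as a query parameter in the URI
df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet?region=us-east-2")
```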
The object that `s3_bucket()` returns is technically a `SubTreeFileSystem`, which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be useful for holding a reference to a subdirectory somewhere, on S3 or elsewhere.
One way to get a subtree is to call the `$cd()` method on a `FileSystem`:
```r
june2019 <- bucket$cd("2019/06")
df <- read_parquet(june2019$path("data.parquet"))
```
A `SubTreeFileSystem` can also be made from a URI:
```r
june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
```
## Authentication
To access private S3 buckets, you typically need two secret parameters:
an `access_key`, which functions like a user ID,
and a `secret_key`, which functions like a token.
There are a few options for passing these credentials:
1. Include them in the URI, like `s3://access_key:secret_key@bucket-name/path/to/file`. Be sure to [URL-encode](https://en.wikipedia.org/wiki/Percent-encoding) your secrets if they contain special characters like "/".
2. Pass them as `access_key` and `secret_key` to `S3FileSystem$create()` or `s3_bucket()` (see the sketch after this list).
3. Set them as environment variables named `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, respectively.
4. Define them in a `~/.aws/credentials` file, according to the [AWS documentation](https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/credentials.html).
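As a sketch of options 2 and 3, using placeholder credentials and a hypothetical bucket name:

```r
# Option 2: pass credentials explicitly (placeholder values)
bucket <- s3_bucket(
  "my-private-bucket",
  access_key = "AKIAXXXXXXXXXXXXXXXX",
  secret_key = "SECRETKEYXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
)

# Option 3: set environment variables before creating the filesystem
Sys.setenv(
  AWS_ACCESS_KEY_ID = "AKIAXXXXXXXXXXXXXXXX",
  AWS_SECRET_ACCESS_KEY = "SECRETKEYXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
)
bucket <- s3_bucket("my-private-bucket")
```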
You can also use an [assumed role](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html)
for temporary access by passing the `role_arn` identifier to `S3FileSystem$create()` or `s3_bucket()`.
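A sketch, with a hypothetical role ARN and bucket name:

```r
# Assume a role for temporary access; the ARN below is a placeholder
bucket <- s3_bucket(
  "my-private-bucket",
  role_arn = "arn:aws:iam::123456789012:role/my-read-only-role"
)
```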
## File systems that emulate S3
The `S3FileSystem` machinery enables you to work with any file system that
provides an S3-compatible interface. For example, [MinIO](https://min.io/) is
an object-storage server that emulates the S3 API. If you were to
run `minio server` locally with its default settings, you could connect to
it with `arrow` using `S3FileSystem` like this:
```r
minio <- S3FileSystem$create(
  access_key = "minioadmin",
  secret_key = "minioadmin",
  scheme = "http",
  endpoint_override = "localhost:9000"
)
```
or, as a URI, it would be
```
s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
```
(note the URL escaping of the `:` in `endpoint_override`).
Among other applications, this can be useful for testing out code locally before
running on a remote S3 bucket.
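For example, you could round-trip a file through the local MinIO server; a sketch assuming a bucket named `test-bucket` has already been created there:

```r
# Write a data frame to MinIO and read it back
# ("test-bucket" is assumed to exist on the server)
write_parquet(mtcars, minio$path("test-bucket/mtcars.parquet"))
df <- read_parquet(minio$path("test-bucket/mtcars.parquet"))
```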