<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Defining Data Types
## Introduction
As discussed in previous chapters, Arrow automatically infers the most
appropriate data type when reading in data or converting R objects to Arrow
objects. However, you might want to manually tell Arrow which data types to
use, for example, to ensure interoperability with databases and data warehouse
systems. This chapter includes recipes for:

* changing the data types of existing Arrow objects
* defining data types during the process of creating Arrow objects

A table showing the default mappings between R and Arrow data types can be found
in [R data type to Arrow data type mappings](https://arrow.apache.org/docs/r/articles/arrow.html#r-to-arrow).
A table of Arrow data types and their R equivalents can be found in
[Arrow data type to R data type mapping](https://arrow.apache.org/docs/r/articles/arrow.html#arrow-to-r).
## Update data type of an existing Arrow Array
You want to change the data type of an existing Arrow Array.
### Solution
```{r, cast_array}
# Create an Array to cast
integer_arr <- Array$create(1:5)
# Cast to an unsigned int8 type
uint_arr <- integer_arr$cast(target_type = uint8())
uint_arr
```
```{r, test_cast_array, opts.label = "test"}
test_that("cast_array works as expected", {
  expect_equal(
    uint_arr$type,
    uint8()
  )
})
```
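
To confirm that a cast did what you expected, it can help to inspect the type
before and after. The sketch below assumes the arrow package is installed. It
repeats the cast from the Solution, prints each type via the `$type` field, and
converts the result back to an R vector with `as.vector()`.

```{r, cast_array_inspect}
library(arrow)

# Integer vectors in R map to int32 by default
integer_arr <- Array$create(1:5)
integer_arr$type

# After the cast, the Array reports the new type
uint_arr <- integer_arr$cast(target_type = uint8())
uint_arr$type

# Converting back to R gives an ordinary integer vector
as.vector(uint_arr)
```
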
### Discussion
Not every pair of data types is compatible. Casting between incompatible data
types raises an error.
```{r, incompat, eval = FALSE}
int_arr <- Array$create(1:5)
int_arr$cast(target_type = binary())
```
```{r}
## Error: NotImplemented: Unsupported cast from int32 to binary using function cast_binary
```
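```{r, test_incompat, opts.label = "test"}
test_that("test_incompat works as expected", {
  # Define the Array here because the chunk above is not evaluated
  int_arr <- Array$create(1:5)
  expect_error(
    int_arr$cast(target_type = binary()),
    "Unsupported cast from int32 to binary"
  )
})
```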
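
If a script needs to keep running when a cast is not possible, the error can be
caught with base R's `tryCatch()`. The following is a minimal sketch rather than
an official pattern: `try_cast()` is a hypothetical helper written for this
example, not part of the arrow package. It returns `NULL` when the cast fails.

```{r, cast_array_trycatch}
library(arrow)

# Hypothetical helper (not an arrow function): attempt a cast and
# return NULL instead of stopping when the cast raises an error
try_cast <- function(arr, type) {
  tryCatch(
    arr$cast(target_type = type),
    error = function(e) {
      message("Cast failed: ", conditionMessage(e))
      NULL
    }
  )
}

int_arr <- Array$create(1:5)

# int32 to double is a supported cast, so an Array is returned
ok <- try_cast(int_arr, float64())
ok$type

# int32 to binary is the unsupported cast shown above, so NULL is returned
bad <- try_cast(int_arr, binary())
is.null(bad)
```
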
## Update data type of a field in an existing Arrow Table
You want to change the type of one or more fields in an existing Arrow Table.
### Solution
```{r, cast_table}
# Set up a tibble to use in this example
oscars <- tibble::tibble(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)
# Convert tibble to an Arrow table
oscars_arrow <- arrow_table(oscars)
# The default mapping from numeric column "num_awards" is to a double
oscars_arrow
# Set up a schema with "num_awards" as a 16-bit integer
oscars_schema <- schema(actor = string(), num_awards = int16())
# Cast the Table to the new schema
oscars_arrow_int <- oscars_arrow$cast(target_schema = oscars_schema)
oscars_arrow_int
```
```{r, test_cast_table, opts.label = "test"}
test_that("cast_table works as expected", {
  expect_equal(
    oscars_arrow_int$schema,
    schema(actor = string(), num_awards = int16())
  )
})
```
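
The effect of the cast is also visible column by column and after converting
back to R. The sketch below assumes the arrow package is installed and uses a
plain `data.frame` in place of the tibble to stay self-contained. It compares
the column type before and after the cast, then checks the class of the column
once the Table is converted with `as.data.frame()`.

```{r, cast_table_round_trip}
library(arrow)

oscars <- data.frame(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)
oscars_arrow <- arrow_table(oscars)
oscars_arrow_int <- oscars_arrow$cast(
  target_schema = schema(actor = string(), num_awards = int16())
)

# Inspect the type of a single column before and after the cast
oscars_arrow$num_awards$type
oscars_arrow_int$num_awards$type

# Back in R, the int16 column arrives as an integer vector
oscars_int_df <- as.data.frame(oscars_arrow_int)
class(oscars_int_df$num_awards)
```
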
### Discussion {#no-compat-type}
Some Arrow data types have no R equivalent. Attempting to cast to these data
types, or using a schema that contains them, results in an error.
```{r, float_16_conversion, error=TRUE, eval=FALSE}
# Set up a tibble to use in this example
oscars <- tibble::tibble(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)
# Convert tibble to an Arrow table
oscars_arrow <- arrow_table(oscars)
# Set up a schema with "num_awards" as float16, which doesn't have an R equivalent
oscars_schema_invalid <- schema(actor = string(), num_awards = float16())
# Casting the double column "num_awards" to float16 raises an error
oscars_arrow$cast(target_schema = oscars_schema_invalid)
```
```{r}
## Error: NotImplemented: Unsupported cast from double to halffloat using function cast_half_float
```
```{r, test_float_16_conversion, opts.label = "test"}
test_that("float_16_conversion works as expected", {
  oscars_schema_invalid <- schema(actor = string(), num_awards = float16())
  expect_error(
    oscars_arrow$cast(target_schema = oscars_schema_invalid),
    "NotImplemented: Unsupported cast from double to halffloat using function cast_half_float"
  )
})
```
```
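
If the goal is simply a narrower floating point column, one possible
alternative is `float32()`. This is an assumption about the use case rather
than something the recipe above prescribes. The sketch below, which assumes the
arrow package is installed, casts the same column to `float32()` and shows that
it converts back to an R double.

```{r, float_32_conversion}
library(arrow)

oscars <- data.frame(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)
oscars_arrow <- arrow_table(oscars)

# float32 is a supported cast target for a double column
oscars_schema_f32 <- schema(actor = string(), num_awards = float32())
oscars_arrow_f32 <- oscars_arrow$cast(target_schema = oscars_schema_f32)
oscars_arrow_f32$num_awards$type

# Converting back to R yields a double vector again
oscars_f32_df <- as.data.frame(oscars_arrow_f32)
class(oscars_f32_df$num_awards)
```
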
## Specify data types when creating an Arrow Table from an R object
You want to manually specify Arrow data types when converting a data frame to
an Arrow object.
### Solution
```{r, use_schema}
# Set up a tibble to use in this example
oscars <- tibble::tibble(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)
# Set up a schema with "num_awards" as a 16-bit integer
oscars_schema <- schema(actor = string(), num_awards = int16())
# Create an Arrow Table from the data, using the schema
oscars_data_arrow <- arrow_table(oscars, schema = oscars_schema)
oscars_data_arrow
```
```{r, test_use_schema, opts.label = "test"}
test_that("use_schema works as expected", {
  expect_s3_class(oscars_data_arrow, "Table")
  expect_equal(
    oscars_data_arrow$schema,
    oscars_schema
  )
})
```
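
The same idea applies to a single vector. As a minimal sketch, assuming the
arrow package is installed, `Array$create()` and `chunked_array()` both accept a
`type` argument, so the type can be set at creation time instead of casting
afterwards. The variable names here are illustrative only.

```{r, use_type_array}
library(arrow)

awards <- c(4, 3, 3)

# Without a type, a numeric vector is inferred as a double
awards_default <- Array$create(awards)
awards_default$type

# Supplying a type creates the Array as int16 directly
awards_arr <- Array$create(awards, type = int16())
awards_arr$type

# chunked_array() takes the same argument
awards_chunked <- chunked_array(awards, type = int16())
awards_chunked$type
```
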
## Specify data types when reading in files
You want to manually specify Arrow data types when reading in files.
### Solution
```{r, use_schema_dataset}
# Set up a tibble to use in this example
oscars <- tibble::tibble(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)
# Write the dataset to disk
write_dataset(oscars, path = "oscars_data")
# Set up a schema with "num_awards" as a 16-bit integer
oscars_schema <- schema(actor = string(), num_awards = int16())
# Read the dataset in, using the schema instead of inferring the types automatically
oscars_dataset_arrow <- open_dataset("oscars_data", schema = oscars_schema)
oscars_dataset_arrow
```
```{r, test_use_schema_dataset, opts.label = "test"}
test_that("use_schema_dataset works as expected", {
  expect_s3_class(oscars_dataset_arrow, "Dataset")
  expect_equal(
    oscars_dataset_arrow$schema,
    oscars_schema
  )
})
```
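
Types can also be specified when reading a single file rather than a dataset.
The sketch below assumes the arrow package is installed and uses a temporary
CSV file created just for this example. It passes a partial schema to the
`col_types` argument of `read_csv_arrow()`, so only `num_awards` is overridden
and `actor` is still inferred.

```{r, use_schema_csv}
library(arrow)

# Write a small CSV file to a temporary location for this example
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
    num_awards = c(4, 3, 3)
  ),
  csv_path,
  row.names = FALSE
)

# Read the file as an Arrow Table, overriding the type of one column
oscars_csv <- read_csv_arrow(
  csv_path,
  col_types = schema(num_awards = int16()),
  as_data_frame = FALSE
)
oscars_csv$schema

# Remove the temporary file
unlink(csv_path)
```
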
```{r, include=FALSE}
unlink("oscars_data", recursive = TRUE)
```