| <!--- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| # Defining Data Types |
| |
| ## Introduction |
| |
| As discussed in previous chapters, Arrow automatically infers the most |
| appropriate data type when reading in data or converting R objects to Arrow |
| objects. However, you might want to manually tell Arrow which data types to |
| use, for example, to ensure interoperability with databases and data warehouse |
| systems. This chapter includes recipes for: |
| |
| * changing the data types of existing Arrow objects |
| * defining data types during the process of creating Arrow objects |
| |
| A table showing the default mappings between R and Arrow data types can be found |
| in [R data type to Arrow data type mappings](https://arrow.apache.org/docs/r/articles/arrow.html#r-to-arrow). |
| |
| A table containing Arrow data types and their R equivalents can be found in |
| [Arrow data type to R data type mapping](https://arrow.apache.org/docs/r/articles/arrow.html#arrow-to-r). |
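
To check which type Arrow inferred for an object, you can inspect its `type` field. The following is a minimal sketch, assuming the arrow package is installed; the vectors are made up for illustration.

```{r, inspect_inferred_types}
library(arrow)

# Arrow infers a type from each R vector
int_type <- Array$create(1:3)$type
dbl_type <- Array$create(c(1.5, 2.5))$type
chr_type <- Array$create(c("a", "b"))$type

# Print the name of each inferred type
int_type$ToString()
dbl_type$ToString()
chr_type$ToString()
```

With the default mappings, these should correspond to int32, double and string respectively.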
| |
| ## Update data type of an existing Arrow Array |
| |
| You want to change the data type of an existing Arrow Array. |
| |
| ### Solution |
| |
| ```{r, cast_array} |
| # Create an Array to cast |
| integer_arr <- Array$create(1:5) |
| |
| # Cast to an unsigned int8 type |
| uint_arr <- integer_arr$cast(target_type = uint8()) |
| |
| uint_arr |
| ``` |
| |
| ```{r, test_cast_array, opts.label = "test"} |
| test_that("cast_array works as expected", { |
| expect_equal( |
| uint_arr$type, |
| uint8() |
| ) |
| }) |
| ``` |
| |
| ### Discussion |
| |
| Not every pair of data types is compatible. Attempting to cast between |
| incompatible types, such as from int32 to binary, raises an error. |
| |
| ```{r, incompat, eval = FALSE} |
| int_arr <- Array$create(1:5) |
| int_arr$cast(target_type = binary()) |
| ``` |
| |
| ```{r} |
| ## Error: NotImplemented: Unsupported cast from int32 to binary using function cast_binary |
| ``` |
| |
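| ```{r, test_incompat, opts.label = "test"} |
| test_that("test_incompat works as expected", { |
|   # Recreate the array here because the chunk above is not evaluated |
|   int_arr <- Array$create(1:5) |
|   expect_error( |
|     int_arr$cast(target_type = binary()) |
|   ) |
| }) |
| ``` |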
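
A cast can also fail because of the values rather than the types. The sketch below assumes the `safe` argument of `$cast()` behaves as described in the arrow package documentation: by default, a cast that would lose information raises an error, while `safe = FALSE` allows the values to be truncated. The array and the `"cast failed"` marker are made up for illustration.

```{r, cast_unsafe}
library(arrow)

dbl_arr <- Array$create(c(1.5, 2.5, 3.5))

# The default (safe) cast refuses to drop the fractional part;
# tryCatch() captures the error instead of stopping the script
safe_result <- tryCatch(
  dbl_arr$cast(target_type = int32()),
  error = function(e) "cast failed"
)
safe_result

# Opting out of the safety check truncates each value
truncated_arr <- dbl_arr$cast(target_type = int32(), safe = FALSE)
truncated_arr
```

Disabling the safety check is only appropriate when losing the fractional part (or overflowing) is acceptable for your data.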
| |
| ## Update data type of a field in an existing Arrow Table |
| |
| You want to change the type of one or more fields in an existing Arrow Table. |
| |
| ### Solution |
| |
| ```{r, cast_table} |
| # Set up a tibble to use in this example |
| oscars <- tibble::tibble( |
| actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), |
| num_awards = c(4, 3, 3) |
| ) |
| |
| # Convert tibble to an Arrow table |
| oscars_arrow <- arrow_table(oscars) |
| |
| # The default mapping from numeric column "num_awards" is to a double |
| oscars_arrow |
| |
| # Set up schema with "num_awards" as integer |
| oscars_schema <- schema(actor = string(), num_awards = int16()) |
| |
| # Cast to an int16 |
| oscars_arrow_int <- oscars_arrow$cast(target_schema = oscars_schema) |
| |
| oscars_arrow_int |
| ``` |
| |
| ```{r, test_cast_table, opts.label = "test"} |
| test_that("cast_table works as expected", { |
| expect_equal( |
| oscars_arrow_int$schema, |
| schema(actor = string(), num_awards = int16()) |
| ) |
| }) |
| ``` |
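
To see the effect of the cast from the R side, you can convert the tables back to data frames and compare the column classes. This is a minimal sketch that rebuilds the example data so it runs on its own; it assumes the default Arrow to R mappings, in which int16 becomes an R integer and double stays numeric. The object names are chosen for illustration only.

```{r, cast_table_roundtrip}
library(arrow)

oscars_tbl <- arrow_table(
  data.frame(
    actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
    num_awards = c(4, 3, 3)
  )
)

# Cast "num_awards" from double to int16
oscars_int <- oscars_tbl$cast(
  target_schema = schema(actor = string(), num_awards = int16())
)

# Compare the R classes after converting back to data frames
class_before <- class(as.data.frame(oscars_tbl)$num_awards)
class_after <- class(as.data.frame(oscars_int)$num_awards)

class_before
class_after
```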
| |
| ### Discussion {#no-compat-type} |
| |
| Some Arrow data types have no R equivalent. Attempting to cast to these data |
| types, or using a schema which contains them, results in an error. |
| |
| ```{r, float_16_conversion, error=TRUE, eval=FALSE} |
| # Set up a tibble to use in this example |
| oscars <- tibble::tibble( |
| actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), |
| num_awards = c(4, 3, 3) |
| ) |
| |
| # Convert tibble to an Arrow table |
| oscars_arrow <- arrow_table(oscars) |
| |
| # Set up schema with "num_awards" as float16 which doesn't have an R equivalent |
| oscars_schema_invalid <- schema(actor = string(), num_awards = float16()) |
| |
| # Attempting to cast the double column "num_awards" to float16 results in an error |
| oscars_arrow$cast(target_schema = oscars_schema_invalid) |
| ``` |
| |
| ```{r} |
| ## Error: NotImplemented: Unsupported cast from double to halffloat using function cast_half_float |
| ``` |
| |
| ```{r, test_float_16_conversion, opts.label = "test"} |
| test_that("float_16_conversion works as expected", { |
| |
| oscars_schema_invalid <- schema(actor = string(), num_awards = float16()) |
| |
| expect_error( |
| oscars_arrow$cast(target_schema = oscars_schema_invalid), |
| "NotImplemented: Unsupported cast from double to halffloat using function cast_half_float" |
| ) |
| }) |
| ``` |
| |
| ## Specify data types when creating an Arrow table from an R object |
| |
| You want to manually specify Arrow data types when converting a data frame to |
| an Arrow object. |
| |
| ### Solution |
| |
| ```{r, use_schema} |
| # Set up a tibble to use in this example |
| oscars <- tibble::tibble( |
| actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), |
| num_awards = c(4, 3, 3) |
| ) |
| |
| # Set up schema with "num_awards" as integer |
| oscars_schema <- schema(actor = string(), num_awards = int16()) |
| |
| # Create an Arrow Table containing the data and schema |
| oscars_data_arrow <- arrow_table(oscars, schema = oscars_schema) |
| |
| oscars_data_arrow |
| ``` |
| |
| ```{r, test_use_schema, opts.label = "test"} |
| test_that("use_schema works as expected", { |
| expect_s3_class(oscars_data_arrow, "Table") |
| expect_equal( |
| oscars_data_arrow$schema, |
| oscars_schema |
| ) |
| }) |
| ``` |
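
The same idea applies to a single vector. As a sketch, assuming the `type` argument of `Array$create()`, you can request a type directly instead of building a schema. The object names are hypothetical.

```{r, array_with_type}
library(arrow)

# Without `type`, an R double vector becomes an Arrow double
awards_default <- Array$create(c(4, 3, 3))

# With `type`, the values are stored as int16 from the start
awards_int <- Array$create(c(4, 3, 3), type = int16())

awards_default$type
awards_int$type
```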
| |
| ## Specify data types when reading in files |
| |
| You want to manually specify Arrow data types when reading in files. |
| |
| ### Solution |
| |
| ```{r, use_schema_dataset} |
| # Set up a tibble to use in this example |
| oscars <- tibble::tibble( |
| actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), |
| num_awards = c(4, 3, 3) |
| ) |
| |
| # Write the dataset to disk |
| write_dataset(oscars, path = "oscars_data") |
| |
| # Set up schema with "num_awards" as integer |
| oscars_schema <- schema(actor = string(), num_awards = int16()) |
| |
| # Read the dataset in, using the schema instead of inferring the types automatically |
| oscars_dataset_arrow <- open_dataset("oscars_data", schema = oscars_schema) |
| |
| oscars_dataset_arrow |
| ``` |
| |
| ```{r, test_use_schema_dataset, opts.label = "test"} |
| test_that("use_schema_dataset works as expected", { |
|   expect_s3_class(oscars_dataset_arrow, "Dataset") |
|   expect_equal( |
|     oscars_dataset_arrow$schema, |
|     oscars_schema |
|   ) |
| }) |
| ``` |
| |
| ```{r, include=FALSE} |
| unlink("oscars_data", recursive = TRUE) |
| ``` |
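
A schema can also be supplied when reading a single file. The following sketch assumes the `schema` and `skip` arguments of `read_csv_arrow()` as described in the arrow package documentation; the temporary file and object names are hypothetical. Because the schema provides the column names, the header row is skipped with `skip = 1`.

```{r, use_schema_csv}
library(arrow)

oscars <- data.frame(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
)

# Write the data to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(oscars, csv_path, row.names = FALSE)

# Read the file as an Arrow Table, using the schema instead of type inference
oscars_csv_arrow <- read_csv_arrow(
  csv_path,
  schema = schema(actor = string(), num_awards = int16()),
  skip = 1,
  as_data_frame = FALSE
)

oscars_csv_arrow$schema

# Clean up the temporary file
unlink(csv_path)
```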
| |