docs/dev/r/articles/developers/data_object_layout.md - arrow-site - Git at Google

 <div id="main" class="col-md-9" role="main">

 # Internal structure of Arrow objects

 This article describes the internal structure of Arrow data objects.
 Users of the arrow R package will not generally need to understand the
 internal structure of Arrow data objects. We include it here to help
 orient those R users and Arrow developers who wish to understand the
 [Arrow
 specification](https://arrow.apache.org/docs/format/Columnar.html). This
 article provides a deeper dive into some of the topics described in the
 [data objects
 article](https://arrow.apache.org/docs/r/articles/data_objects.md), and
 is intended mostly for developers. It is not necessary knowledge for
 using the arrow package.

 We begin by describing two key concepts:

 -   Values in an array are stored in one or more **buffers**. A buffer
     is a sequential virtual address space (i.e., block of memory) with a
     given length. Given a pointer specifying the memory address where
     the buffer starts, you can reach any byte in the buffer with an
     “offset” value that specifies a location relative to the start of
     the buffer.
 -   The **physical layout** of an array is a term used to describe how
     data in an array is laid out in memory, without taking into account
     how that information is interpreted. As an example: a 32-bit signed
     integer and 32-bit floating point number have the same layout: they
     are both 32 bits, represented as 4 contiguous bytes in memory. The
     meaning is different, but the layout is the same.

 We can unpack these ideas using a simple array of integer values:

 <div id="cb1" class="sourceCode">

 ``` r
 integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
 integer_array
 ```

 </div>

     ## Array
     ## <int32>
     ## [
     ##   1,
     ##   null,
     ##   2,
     ##   4,
     ##   8
     ## ]

 We can inspect the `integer_array$type` attribute to see that the values
 in the Array are stored as signed 32 bit integers. When laid out in
 memory by the Arrow C++ library, an integer array consists of two pieces
 of metadata and two buffers that store the data. The metadata specify
 the length of the array and a count of the number of null values, both
 stored as 64-bit integers. These metadata can be viewed from R using
 `integer_array$length()` and `integer_array$null_count` respectively.
 The number of buffers associated with an array depends on the exact type
 of data being stored. For an integer array there are two: a “validity
 bitmap buffer” and a “data value buffer”. Schematically we could depict
 the array as follows:

 ![](array_layout_integer.png)

 This image shows the array as a rectangle subdivided into two parts, one
 for the metadata and the other for the buffers. Underneath the rectangle
 we’ve unpacked the contents of the buffers for you, showing the contents
 of the two buffers in the area enclosed in a dotted line. At the very
 bottom of the figure, you can see the contents of specific bytes.

 <div class="section level2">

 ## Validity bitmap buffer

 The validity bitmap is binary-valued, and contains a 1 whenever the
 corresponding slot in the array contains a valid, non-null value. At an
 abstract level we can assume this contains the following five bits:

     10111

 However this is a slight over-simplification for three reasons. First,
 because memory is allocated in byte-size units there are three trailing
 bits at the end (assumed to be zero), giving us the bitmap `10111000`.
 Second, while we have written this from left-to-right, this written
 format is typically presumed to represent [big endian
 format](https://en.wikipedia.org/wiki/Endianness) where the
 most-significant bit is written first (i.e., to the lowest-valued memory
 address). Arrow adopts a little-endian convention, which would more
 naturally correspond toa right-to-left ordering when written in English.
 To reflect this we write the bits in right-to-left order: `00011101`.
 Finally, Arrow encourages [naturally aligned data
 structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in
 which allocated memory addresses are a multiple of the data block sizes.
 Arrow uses *64 byte alignment*, so each data structure must be a
 multiple of 64 bytes in size. This design feature exists to allow
 efficient use of modern hardware, as discussed in the [Arrow
 specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding).
 This is what the buffer looks like this in memory:

 <div class="grid">

 <div class="g-col-6">

 | Byte 0 (validity bitmap) | Bytes 1-63    |
 |--------------------------|---------------|
 | `00011101`               | `0` (padding) |

 </div>

 </div>

 </div>

 <div class="section level2">

 ## Data buffer

 The data buffer, like the validity bitmap, is padded out to a length of
 64 bytes to preserve natural alignment. Here’s the diagram showing the
 physical layout:

 <div class="grid">

 <div class="g-col-12">

 | Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
 |-----------|-------------|------------|-------------|-------------|-------------|
 | `1`       | unspecified | `2`        | `4`         | `8`         | unspecified |

 </div>

 </div>

 Each integer occupies 4 bytes, as per the requirements of a 32-bit
 signed integer. Notice that the bytes associated with the missing value
 are left unspecified: space is allocated for the value but those bytes
 are not filled.

 </div>

 <div class="section level2">

 ## Offset buffer

 Some types of Arrow array include a third buffer known as the offset
 buffer. This is most frequently encountered in the context of string
 arrays, such as this one:

 <div id="cb4" class="sourceCode">

 ``` r
 string_array <- Array$create(c("hello", "amazing", "and", "cruel", "world"))
 string_array
 ```

 </div>

     ## Array
     ## <string>
     ## [
     ##   "hello",
     ##   "amazing",
     ##   "and",
     ##   "cruel",
     ##   "world"
     ## ]

 Using the same schematic notation as before, this is the structure of
 the object. It has the same metadata as before but as shown below, there
 are now three buffers:

 ![](array_layout_string.png)

 To understand the role of the offset buffer, it helps to note the format
 of the data buffer for a string array: it concatenates all strings end
 to end in one contiguous section of memory. For the `string_array`
 object, the contents of the data buffer would look like one long
 utf8-encoded string:

     helloamazingandcruelworld

 Because individual strings can be of variable length, the role of the
 offset buffer is to specify where the boundaries between the slots are.
 The second slot in our array is the string `"amazing"`. If the positions
 in the data array are indexed like this

 | h   | e   | l   | l   | o   | a   | m   | a   | z   | i   | n   | g   | a   | n   | d   | …   |
 |-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
 | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  | 13  | 14  | …   |

 then we can see that the string of interest begins at position 5 and
 ends at position 11. The offset buffer consists integers that store
 these break point locations. For `string_array` it might look like this:

     0 5 12 15 20 25

 The difference between the `utf8()` data type and the `large_utf8()`
 data type is that these the `utf8()` data type stores these as 32-bit
 integers whereas the `large_utf8()` type stores them as 64-bit integers.

 </div>

 <div class="section level2">

 ## Chunked arrays

 Arrays are immutable objects: once an Array has been initialized the
 values it stores cannot be altered. This ensures that multiple entities
 can safely refer to an Array via pointers, and not run the risk that the
 values will change. Using immutable Arrays makes it possible for Arrow
 to avoid unnecessary copies of data objects.

 There are limitations to immutable Arrays, most notably when new batches
 of data arrive. Because an array is immutable, you can’t add the new
 information to an existing array. The only thing you can do if you don’t
 want to disturb or copy your existing array is create a new array that
 contains the new data. Doing that preserves the immutability of arrays
 and doesn’t lead to any unnecessary copying but now we have a new
 problem: the data are split across two arrays. Each array contains only
 one “chunk” of the data. What would be ideal is an abstraction layer
 that allows us to treat these two Arrays as though they were a single
 “Array-like” object.

 This is the problem that chunked arrays solve. A chunked array is a
 wrapper around a list of arrays, and allows you to index their contents
 “as if” they were a single array. Physically, the data are still stored
 in separate places – each array is one chunk, and these chunks don’t
 have to be adjacent to each other in memory – but the chunked array
 provides us will a layer of abstraction that allows us to pretend that
 they are all one thing.

 To illustrate, let’s use the `chunked_array()` function:

 <div id="cb8" class="sourceCode">

 ``` r
 chunked_string_array <- chunked_array(
   c("hello", "amazing", "and", "cruel", "world"),
   c("I", "love", "you")
 )
 ```

 </div>

 The `chunked_array()` function is just a wrapper around the
 functionality that `ChunkedArray$create()` provides. Let’s take a look
 at the object:

 <div id="cb9" class="sourceCode">

 ``` r
 chunked_string_array
 ```

 </div>

     ## ChunkedArray
     ## <string>
     ## [
     ##   [
     ##     "hello",
     ##     "amazing",
     ##     "and",
     ##     "cruel",
     ##     "world"
     ##   ],
     ##   [
     ##     "I",
     ##     "love",
     ##     "you"
     ##   ]
     ## ]

 The double bracketing in this output is intended to highlight the
 “list-like” nature of chunked arrays. There are three separate arrays,
 wrapped in a container object that is secretly a list of arrays, but
 allows that list to behave just like a regular one-dimensional data
 structure. Schematically it looks like this:

 ![](chunked_array_layout.png)

 As this figure illustrates, there really are three arrays here, each
 with its own validity bitmap, offset buffer, and data buffer.

 </div>

 <div class="section level2">

 ## Record batches

 A record batch is table-like data structure comprised of a sequence of
 arrays. The arrays can be of different types but they must all be the
 same length. Each array is referred to as one of the “fields” or
 “columns” of the record batch. Each field must have a (UTF8-encoded)
 name, and these names form part of the metadata for the record batch.
 When stored in memory, the record batch does not include physical
 storage for the values stored in each field: instead it contains
 pointers to the relevant array objects. It does, however, contain its
 own validity bitmap.

 Here is a record batch containing 5 rows and 3 columns:

 <div id="cb11" class="sourceCode">

 ``` r
 rb <- record_batch(
   strs = c("hello", "amazing", "and", "cruel", "world"),
   ints = c(1L, NA, 2L, 4L, 8L),
   dbls = c(1.1, 3.2, 0.2, NA, 11)
 )
 rb
 ```

 </div>

     ## RecordBatch
     ## 5 rows x 3 columns
     ## $strs <string>
     ## $ints <int32>
     ## $dbls <double>

 At an abstract level the `rb` object behaves like a two dimensional
 structure with rows and columns, but in terms of how it is represented
 in memory it is fundamentally a list of arrays as shown below:

 ![](record_batch_layout.png)

 </div>

 <div class="section level2">

 ## Tables

 To deal with situations where a rectangular data set can grow over time
 (as more data are added), we need a tabular data structure that is
 similar to a record batch with one exception: instead of storing each
 column as an array, we now want to store it as a chunked array. This is
 what the `Table` class in **arrow** does.

 To illustrate, suppose we have a second set of data that arrives as a
 record batch:

 <div id="cb13" class="sourceCode">

 ``` r
 new_rb <- record_batch(
   strs = c("I", "love", "you"),
   ints = c(5L, 0L, 0L),
   dbls = c(7.1, -0.1, 2)
 )

 df <- concat_tables(arrow_table(rb), arrow_table(new_rb))
 df
 ```

 </div>

     ## Table
     ## 8 rows x 3 columns
     ## $strs <string>
     ## $ints <int32>
     ## $dbls <double>

 Here is the underlying structure of this Table:

 ![](table_layout.png)

 </div>

 </div>
	<div id="main" class="col-md-9" role="main">

	# Internal structure of Arrow objects

	This article describes the internal structure of Arrow data objects.
	Users of the arrow R package will not generally need to understand the
	internal structure of Arrow data objects. We include it here to help
	orient those R users and Arrow developers who wish to understand the
	[Arrow
	specification](https://arrow.apache.org/docs/format/Columnar.html). This
	article provides a deeper dive into some of the topics described in the
	[data objects
	article](https://arrow.apache.org/docs/r/articles/data_objects.md), and
	is intended mostly for developers. It is not necessary knowledge for
	using the arrow package.

	We begin by describing two key concepts:

	- Values in an array are stored in one or more buffers. A buffer
	is a sequential virtual address space (i.e., block of memory) with a
	given length. Given a pointer specifying the memory address where
	the buffer starts, you can reach any byte in the buffer with an
	“offset” value that specifies a location relative to the start of
	the buffer.
	- The physical layout of an array is a term used to describe how
	data in an array is laid out in memory, without taking into account
	how that information is interpreted. As an example: a 32-bit signed
	integer and 32-bit floating point number have the same layout: they
	are both 32 bits, represented as 4 contiguous bytes in memory. The
	meaning is different, but the layout is the same.

	We can unpack these ideas using a simple array of integer values:

	<div id="cb1" class="sourceCode">

	``` r
	integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
	integer_array
	```

	</div>

	## Array
	## <int32>
	## [
	## 1,
	## null,
	## 2,
	## 4,
	## 8
	## ]

	We can inspect the `integer_array$type` attribute to see that the values
	in the Array are stored as signed 32 bit integers. When laid out in
	memory by the Arrow C++ library, an integer array consists of two pieces
	of metadata and two buffers that store the data. The metadata specify
	the length of the array and a count of the number of null values, both
	stored as 64-bit integers. These metadata can be viewed from R using
	`integer_array$length()` and `integer_array$null_count` respectively.
	The number of buffers associated with an array depends on the exact type
	of data being stored. For an integer array there are two: a “validity
	bitmap buffer” and a “data value buffer”. Schematically we could depict
	the array as follows:

	![](array_layout_integer.png)

	This image shows the array as a rectangle subdivided into two parts, one
	for the metadata and the other for the buffers. Underneath the rectangle
	we’ve unpacked the contents of the buffers for you, showing the contents
	of the two buffers in the area enclosed in a dotted line. At the very
	bottom of the figure, you can see the contents of specific bytes.

	<div class="section level2">

	## Validity bitmap buffer

	The validity bitmap is binary-valued, and contains a 1 whenever the
	corresponding slot in the array contains a valid, non-null value. At an
	abstract level we can assume this contains the following five bits:

	10111

	However this is a slight over-simplification for three reasons. First,
	because memory is allocated in byte-size units there are three trailing
	bits at the end (assumed to be zero), giving us the bitmap `10111000`.
	Second, while we have written this from left-to-right, this written
	format is typically presumed to represent [big endian
	format](https://en.wikipedia.org/wiki/Endianness) where the
	most-significant bit is written first (i.e., to the lowest-valued memory
	address). Arrow adopts a little-endian convention, which would more
	naturally correspond toa right-to-left ordering when written in English.
	To reflect this we write the bits in right-to-left order: `00011101`.
	Finally, Arrow encourages [naturally aligned data
	structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in
	which allocated memory addresses are a multiple of the data block sizes.
	Arrow uses 64 byte alignment, so each data structure must be a
	multiple of 64 bytes in size. This design feature exists to allow
	efficient use of modern hardware, as discussed in the [Arrow
	specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding).
	This is what the buffer looks like this in memory:

	<div class="grid">

	<div class="g-col-6">

	\| Byte 0 (validity bitmap) \| Bytes 1-63 \|
	\|--------------------------\|---------------\|
	\| `00011101` \| `0` (padding) \|

	</div>

	</div>

	</div>

	<div class="section level2">

	## Data buffer

	The data buffer, like the validity bitmap, is padded out to a length of
	64 bytes to preserve natural alignment. Here’s the diagram showing the
	physical layout:

	<div class="grid">

	<div class="g-col-12">

	\| Bytes 0-3 \| Bytes 4-7 \| Bytes 8-11 \| Bytes 12-15 \| Bytes 16-19 \| Bytes 20-63 \|
	\|-----------\|-------------\|------------\|-------------\|-------------\|-------------\|
	\| `1` \| unspecified \| `2` \| `4` \| `8` \| unspecified \|

	</div>

	</div>

	Each integer occupies 4 bytes, as per the requirements of a 32-bit
	signed integer. Notice that the bytes associated with the missing value
	are left unspecified: space is allocated for the value but those bytes
	are not filled.

	</div>

	<div class="section level2">

	## Offset buffer

	Some types of Arrow array include a third buffer known as the offset
	buffer. This is most frequently encountered in the context of string
	arrays, such as this one:

	<div id="cb4" class="sourceCode">

	``` r
	string_array <- Array$create(c("hello", "amazing", "and", "cruel", "world"))
	string_array
	```

	</div>

	## Array
	## <string>
	## [
	## "hello",
	## "amazing",
	## "and",
	## "cruel",
	## "world"
	## ]

	Using the same schematic notation as before, this is the structure of
	the object. It has the same metadata as before but as shown below, there
	are now three buffers:

	![](array_layout_string.png)

	To understand the role of the offset buffer, it helps to note the format
	of the data buffer for a string array: it concatenates all strings end
	to end in one contiguous section of memory. For the `string_array`
	object, the contents of the data buffer would look like one long
	utf8-encoded string:

	helloamazingandcruelworld

	Because individual strings can be of variable length, the role of the
	offset buffer is to specify where the boundaries between the slots are.
	The second slot in our array is the string `"amazing"`. If the positions
	in the data array are indexed like this

	\| h \| e \| l \| l \| o \| a \| m \| a \| z \| i \| n \| g \| a \| n \| d \| … \|
	\|-----\|-----\|-----\|-----\|-----\|-----\|-----\|-----\|-----\|-----\|-----\|-----\|-----\|-----\|-----\|-----\|
	\| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \| 6 \| 7 \| 8 \| 9 \| 10 \| 11 \| 12 \| 13 \| 14 \| … \|

	then we can see that the string of interest begins at position 5 and
	ends at position 11. The offset buffer consists integers that store
	these break point locations. For `string_array` it might look like this:

	0 5 12 15 20 25

	The difference between the `utf8()` data type and the `large_utf8()`
	data type is that these the `utf8()` data type stores these as 32-bit
	integers whereas the `large_utf8()` type stores them as 64-bit integers.

	</div>

	<div class="section level2">

	## Chunked arrays

	Arrays are immutable objects: once an Array has been initialized the
	values it stores cannot be altered. This ensures that multiple entities
	can safely refer to an Array via pointers, and not run the risk that the
	values will change. Using immutable Arrays makes it possible for Arrow
	to avoid unnecessary copies of data objects.

	There are limitations to immutable Arrays, most notably when new batches
	of data arrive. Because an array is immutable, you can’t add the new
	information to an existing array. The only thing you can do if you don’t
	want to disturb or copy your existing array is create a new array that
	contains the new data. Doing that preserves the immutability of arrays
	and doesn’t lead to any unnecessary copying but now we have a new
	problem: the data are split across two arrays. Each array contains only
	one “chunk” of the data. What would be ideal is an abstraction layer
	that allows us to treat these two Arrays as though they were a single
	“Array-like” object.

	This is the problem that chunked arrays solve. A chunked array is a
	wrapper around a list of arrays, and allows you to index their contents
	“as if” they were a single array. Physically, the data are still stored
	in separate places – each array is one chunk, and these chunks don’t
	have to be adjacent to each other in memory – but the chunked array
	provides us will a layer of abstraction that allows us to pretend that
	they are all one thing.

	To illustrate, let’s use the `chunked_array()` function:

	<div id="cb8" class="sourceCode">

	``` r
	chunked_string_array <- chunked_array(
	c("hello", "amazing", "and", "cruel", "world"),
	c("I", "love", "you")
	)
	```

	</div>

	The `chunked_array()` function is just a wrapper around the
	functionality that `ChunkedArray$create()` provides. Let’s take a look
	at the object:

	<div id="cb9" class="sourceCode">

	``` r
	chunked_string_array
	```

	</div>

	## ChunkedArray
	## <string>
	## [
	## [
	## "hello",
	## "amazing",
	## "and",
	## "cruel",
	## "world"
	## ],
	## [
	## "I",
	## "love",
	## "you"
	## ]
	## ]

	The double bracketing in this output is intended to highlight the
	“list-like” nature of chunked arrays. There are three separate arrays,
	wrapped in a container object that is secretly a list of arrays, but
	allows that list to behave just like a regular one-dimensional data
	structure. Schematically it looks like this:

	![](chunked_array_layout.png)

	As this figure illustrates, there really are three arrays here, each
	with its own validity bitmap, offset buffer, and data buffer.

	</div>

	<div class="section level2">

	## Record batches

	A record batch is table-like data structure comprised of a sequence of
	arrays. The arrays can be of different types but they must all be the
	same length. Each array is referred to as one of the “fields” or
	“columns” of the record batch. Each field must have a (UTF8-encoded)
	name, and these names form part of the metadata for the record batch.
	When stored in memory, the record batch does not include physical
	storage for the values stored in each field: instead it contains
	pointers to the relevant array objects. It does, however, contain its
	own validity bitmap.

	Here is a record batch containing 5 rows and 3 columns:

	<div id="cb11" class="sourceCode">

	``` r
	rb <- record_batch(
	strs = c("hello", "amazing", "and", "cruel", "world"),
	ints = c(1L, NA, 2L, 4L, 8L),
	dbls = c(1.1, 3.2, 0.2, NA, 11)
	)
	rb
	```

	</div>

	## RecordBatch
	## 5 rows x 3 columns
	## $strs <string>
	## $ints <int32>
	## $dbls <double>

	At an abstract level the `rb` object behaves like a two dimensional
	structure with rows and columns, but in terms of how it is represented
	in memory it is fundamentally a list of arrays as shown below:

	![](record_batch_layout.png)

	</div>

	<div class="section level2">

	## Tables

	To deal with situations where a rectangular data set can grow over time
	(as more data are added), we need a tabular data structure that is
	similar to a record batch with one exception: instead of storing each
	column as an array, we now want to store it as a chunked array. This is
	what the `Table` class in arrow does.

	To illustrate, suppose we have a second set of data that arrives as a
	record batch:

	<div id="cb13" class="sourceCode">

	``` r
	new_rb <- record_batch(
	strs = c("I", "love", "you"),
	ints = c(5L, 0L, 0L),
	dbls = c(7.1, -0.1, 2)
	)

	df <- concat_tables(arrow_table(rb), arrow_table(new_rb))
	df
	```

	</div>

	## Table
	## 8 rows x 3 columns
	## $strs <string>
	## $ints <int32>
	## $dbls <double>

	Here is the underlying structure of this Table:

	![](table_layout.png)

	</div>

	</div>