blob: cee2dabd8645593bb2e0821e0ce2f6a751d5ae55 [file] [log] [blame]
<!DOCTYPE html>
<html lang="en"><head><meta charset="UTF-8"/><meta name="viewport" content="width=device-width, initial-scale=1.0"/><title>User Manual · Arrow.jl</title><script data-outdated-warner src="../assets/warner.js"></script><link rel="canonical" href="https://arrow.juliadata.org/manual/"/><link href="https://cdnjs.cloudflare.com/ajax/libs/lato-font/3.0.0/css/lato-font.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/juliamono/0.045/juliamono.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.4/css/fontawesome.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.4/css/solid.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.4/css/brands.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.13.24/katex.min.css" rel="stylesheet" type="text/css"/><script>documenterBaseURL=".."</script><script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.6/require.min.js" data-main="../assets/documenter.js"></script><script src="../siteinfo.js"></script><script src="../../versions.js"></script><link class="docs-theme-link" rel="stylesheet" type="text/css" href="../assets/themes/documenter-dark.css" data-theme-name="documenter-dark" data-theme-primary-dark/><link class="docs-theme-link" rel="stylesheet" type="text/css" href="../assets/themes/documenter-light.css" data-theme-name="documenter-light" data-theme-primary/><script src="../assets/themeswap.js"></script></head><body><div id="documenter"><nav class="docs-sidebar"><div class="docs-package-name"><span class="docs-autofit"><a href="../">Arrow.jl</a></span></div><form class="docs-search" action="../search/"><input class="docs-search-query" id="documenter-search-query" name="q" type="text" placeholder="Search docs"/></form><ul class="docs-menu"><li><a class="tocitem" href="../">Home</a></li><li class="is-active"><a class="tocitem" href>User Manual</a><ul class="internal"><li><a class="tocitem" href="#Support-for-generic-path-like-types"><span>Support for generic path-like types</span></a></li><li><a class="tocitem" href="#Reading-arrow-data"><span>Reading arrow data</span></a></li><li><a class="tocitem" href="#Writing-arrow-data"><span>Writing arrow data</span></a></li></ul></li><li><a class="tocitem" href="../reference/">API Reference</a></li></ul><div class="docs-version-selector field has-addons"><div class="control"><span class="docs-label button is-static is-size-7">Version</span></div><div class="docs-selector control is-expanded"><div class="select is-fullwidth is-size-7"><select id="documenter-version-selector"></select></div></div></div></nav><div class="docs-main"><header class="docs-navbar"><nav class="breadcrumb"><ul class="is-hidden-mobile"><li class="is-active"><a href>User Manual</a></li></ul><ul class="is-hidden-tablet"><li class="is-active"><a href>User Manual</a></li></ul></nav><div class="docs-right"><a class="docs-edit-link" href="https://github.com/apache/arrow-julia/blob/main/docs/src/manual.md#L" title="Edit on GitHub"><span class="docs-icon fab"></span><span class="docs-label is-hidden-touch">Edit on GitHub</span></a><a class="docs-settings-button fas fa-cog" id="documenter-settings-button" href="#" title="Settings"></a><a class="docs-sidebar-button fa fa-bars is-hidden-desktop" id="documenter-sidebar-button" href="#"></a></div></header><article class="content" id="documenter-page"><p>&lt;!–- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the &quot;License&quot;); you may not use this file except in compliance with the License. You may obtain a copy of the License at</p><pre><code class="nohighlight hljs">http://www.apache.org/licenses/LICENSE-2.0</code></pre><p>Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an &quot;AS IS&quot; BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. –&gt;</p><h1 id="User-Manual"><a class="docs-heading-anchor" href="#User-Manual">User Manual</a><a id="User-Manual-1"></a><a class="docs-heading-anchor-permalink" href="#User-Manual" title="Permalink"></a></h1><p>The goal of this documentation is to provide a brief introduction to the arrow data format, then provide a walk-through of the functionality provided in the Arrow.jl Julia package, with an aim to expose a little of the machinery &quot;under the hood&quot; to help explain how things work and how that influences real-world use-cases for the arrow data format.</p><p>The best place to learn about the Apache arrow project is <a href="https://arrow.apache.org/">the website itself</a>, specifically the data format <a href="https://arrow.apache.org/docs/format/Columnar.html">specification</a>. Put briefly, the arrow project provides a formal speficiation for how columnar, &quot;table&quot; data can be laid out efficiently in memory to standardize and maximize the ability to share data across languages/platforms. In the current <a href="https://github.com/apache/arrow">apache/arrow GitHub repository</a>, language implementations exist for C++, Java, Go, Javascript, Rust, to name a few. Other database vendors and data processing frameworks/applications have also built support for the arrow format, allowing for a wide breadth of possibility for applications to &quot;speak the data language&quot; of arrow.</p><p>The <a href="https://github.com/apache/arrow-julia">Arrow.jl</a> Julia package is another implementation, allowing the ability to both read and write data in the arrow format. As a data format, arrow specifies an exact memory layout to be used for columnar table data, and as such, &quot;reading&quot; involves custom Julia objects (<a href="../reference/#Arrow.Table"><code>Arrow.Table</code></a> and <a href="../reference/#Arrow.Stream"><code>Arrow.Stream</code></a>), which read the <em>metadata</em> of an &quot;arrow memory blob&quot;, then <em>wrap</em> the array data contained therein, having learned the type and size, amongst other properties, from the metadata. Let&#39;s take a closer look at what this &quot;reading&quot; of arrow memory really means/looks like.</p><h2 id="Support-for-generic-path-like-types"><a class="docs-heading-anchor" href="#Support-for-generic-path-like-types">Support for generic path-like types</a><a id="Support-for-generic-path-like-types-1"></a><a class="docs-heading-anchor-permalink" href="#Support-for-generic-path-like-types" title="Permalink"></a></h2><p>Arrow.jl attempts to support any path-like type wherever a function takes a path as an argument. The Arrow.jl API should generically work as long as the type supports:</p><ul><li><code>Base.open(path, mode)::I where I &lt;: IO</code></li></ul><p>When a custom <code>IO</code> subtype is returned (<code>I</code>) then the following methods also need to be defined:</p><ul><li><code>Base.read(io::I, ::Type{UInt8})</code> or <code>Base.read(io::I)</code></li><li><code>Base.write(io::I, x)</code></li></ul><h2 id="Reading-arrow-data"><a class="docs-heading-anchor" href="#Reading-arrow-data">Reading arrow data</a><a id="Reading-arrow-data-1"></a><a class="docs-heading-anchor-permalink" href="#Reading-arrow-data" title="Permalink"></a></h2><p>After installing the Arrow.jl Julia package (via <code>] add Arrow</code>), and if you have some arrow data, let&#39;s say a file named <code>data.arrow</code> generated from the <a href="https://arrow.apache.org/docs/python/"><code>pyarrow</code></a> library (a Python library for interfacing with arrow data), you can then read that arrow data into a Julia session by doing:</p><pre><code class="language-julia hljs">using Arrow
table = Arrow.Table(&quot;data.arrow&quot;)</code></pre><h3 id="Arrow.Table"><a class="docs-heading-anchor" href="#Arrow.Table"><code>Arrow.Table</code></a><a id="Arrow.Table-1"></a><a class="docs-heading-anchor-permalink" href="#Arrow.Table" title="Permalink"></a></h3><p>The type of <code>table</code> in this example will be an <code>Arrow.Table</code>. When &quot;reading&quot; the arrow data, <code>Arrow.Table</code> first <a href="https://en.wikipedia.org/wiki/Mmap">&quot;mmapped&quot;</a> the <code>data.arrow</code> file, which is an important technique for dealing with data larger than available RAM on a system. By &quot;mmapping&quot; a file, the OS doesn&#39;t actually load the entire file contents into RAM at the same time, but file contents are &quot;swapped&quot; into RAM as different regions of a file are requested. Once &quot;mmapped&quot;, <code>Arrow.Table</code> then inspected the metadata in the file to determine the number of columns, their names and types, at which byte offset each column begins in the file data, and even how many &quot;batches&quot; are included in this file (arrow tables may be partitioned into one or more &quot;record batches&quot; each containing portions of the data). Armed with all the appropriate metadata, <code>Arrow.Table</code> then created custom array objects (<a href="@ref"><code>ArrowVector</code></a>), which act as &quot;views&quot; into the raw arrow memory bytes. This is a significant point in that no extra memory is allocated for &quot;data&quot; when reading arrow data. This is in contrast to if we wanted to read data from a csv file as columns into Julia structures; we would need to allocate those array structures ourselves, then parse the file, &quot;filling in&quot; each element of the array with the data we parsed from the file. Arrow data, on the other hand, is <em>already laid out in memory or on disk</em> in a binary format, and as long as we have the metadata to interpret the raw bytes, we can figure out whether to treat those bytes as a <code>Vector{Float64}</code>, etc. A sample of the kinds of arrow array types you might see when deserializing arrow data, include:</p><ul><li><a href="../reference/#Arrow.Primitive"><code>Arrow.Primitive</code></a>: the most common array type for simple, fixed-size elements like integers, floats, time types, and decimals</li><li><a href="../reference/#Arrow.List"><code>Arrow.List</code></a>: an array type where its own elements are also arrays of some kind, like string columns, where each element can be thought of as an array of characters</li><li><a href="../reference/#Arrow.FixedSizeList"><code>Arrow.FixedSizeList</code></a>: similar to the <code>List</code> type, but where each array element has a fixed number of elements itself; you can think of this like a <code>Vector{NTuple{N, T}}</code>, where <code>N</code> is the fixed-size width</li><li><a href="../reference/#Arrow.Map"><code>Arrow.Map</code></a>: an array type where each element is like a Julia <code>Dict</code>; a list of key value pairs like a <code>Vector{Dict}</code></li><li><a href="../reference/#Arrow.Struct"><code>Arrow.Struct</code></a>: an array type where each element is an instance of a custom struct, i.e. an ordered collection of named &amp; typed fields, kind of like a <code>Vector{NamedTuple}</code></li><li><a href="../reference/#Arrow.DenseUnion"><code>Arrow.DenseUnion</code></a>: an array type where elements may be of several different types, stored compactly; can be thought of like <code>Vector{Union{A, B}}</code></li><li><a href="../reference/#Arrow.SparseUnion"><code>Arrow.SparseUnion</code></a>: another array type where elements may be of several different types, but stored as if made up of identically lengthed child arrays for each possible type (less memory efficient than <code>DenseUnion</code>)</li><li><a href="../reference/#Arrow.DictEncoded"><code>Arrow.DictEncoded</code></a>: a special array type where values are &quot;dictionary encoded&quot;, meaning the list of unique, possible values for an array are stored internally in an &quot;encoding pool&quot;, whereas each stored element of the array is just an integer &quot;code&quot; to index into the encoding pool for the actual value.</li></ul><p>And while these custom array types do subtype <code>AbstractArray</code>, there is no current support for <code>setindex!</code>. Remember, these arrays are &quot;views&quot; into the raw arrow bytes, so for array types other than <code>Arrow.Primitive</code>, it gets pretty tricky to allow manipulating those raw arrow bytes. Nevetheless, it&#39;s as simple as calling <code>copy(x)</code> where <code>x</code> is any <code>ArrowVector</code> type, and a normal Julia <code>Vector</code> type will be fully materialized (which would then allow mutating/manipulating values).</p><p>So, what can you do with an <code>Arrow.Table</code> full of data? Quite a bit actually!</p><p>Because <code>Arrow.Table</code> implements the <a href="https://juliadata.github.io/Tables.jl/stable/">Tables.jl</a> interface, it opens up a world of integrations for using arrow data. A few examples include:</p><ul><li><code>df = DataFrame(Arrow.Table(file))</code>: Build a <a href="https://juliadata.github.io/DataFrames.jl/stable/"><code>DataFrame</code></a>, using the arrow vectors themselves; this allows utilizing a host of DataFrames.jl functionality directly on arrow data; grouping, joining, selecting, etc.</li><li><code>Tables.datavaluerows(Arrow.Table(file)) |&gt; @map(...) |&gt; @filter(...) |&gt; DataFrame</code>: use <a href="https://www.queryverse.org/Query.jl/stable/standalonequerycommands/"><code>Query.jl</code>&#39;s</a> row-processing utilities to map, group, filter, mutate, etc. directly over arrow data.</li><li><code>Arrow.Table(file) |&gt; SQLite.load!(db, &quot;arrow_table&quot;)</code>: load arrow data directly into an sqlite database/table, where sql queries can be executed on the data</li><li><code>Arrow.Table(file) |&gt; CSV.write(&quot;arrow.csv&quot;)</code>: write arrow data out to a csv file</li></ul><p>A full list of Julia packages leveraging the Tables.jl inteface can be found <a href="https://github.com/JuliaData/Tables.jl/blob/master/INTEGRATIONS.md">here</a>.</p><p>Apart from letting other packages have all the fun, an <code>Arrow.Table</code> itself can be plenty useful. For example, with <code>tbl = Arrow.Table(file)</code>:</p><ul><li><code>tbl[1]</code>: retrieve the first column via indexing; the number of columns can be queried via <code>length(tbl)</code></li><li><code>tbl[:col1]</code> or <code>tbl.col1</code>: retrieve the column named <code>col1</code>, either via indexing with the column name given as a <code>Symbol</code>, or via &quot;dot-access&quot;</li><li><code>for col in tbl</code>: iterate through columns in the table</li><li><code>AbstractDict</code> methods like <code>haskey(tbl, :col1)</code>, <code>get(tbl, :col1, nothing)</code>, <code>keys(tbl)</code>, or <code>values(tbl)</code></li></ul><h3 id="Arrow-types"><a class="docs-heading-anchor" href="#Arrow-types">Arrow types</a><a id="Arrow-types-1"></a><a class="docs-heading-anchor-permalink" href="#Arrow-types" title="Permalink"></a></h3><p>In the arrow data format, specific logical types are supported, a list of which can be found <a href="https://arrow.apache.org/docs/status.html#data-types">here</a>. These include booleans, integers of various bit widths, floats, decimals, time types, and binary/string. While most of these map naturally to types builtin to Julia itself, there are a few cases where the definitions are slightly different, and in these cases, by default, they are converted to more &quot;friendly&quot; Julia types (this auto conversion can be avoided by passing <code>convert=false</code> to <code>Arrow.Table</code>, like <code>Arrow.Table(file; convert=false)</code>). Examples of arrow to julia type mappings include:</p><ul><li><code>Date</code>, <code>Time</code>, <code>Timestamp</code>, and <code>Duration</code> all have natural Julia defintions in <code>Dates.Date</code>, <code>Dates.Time</code>, <code>TimeZones.ZonedDateTime</code>, and <code>Dates.Period</code> subtypes, respectively.</li><li><code>Char</code> and <code>Symbol</code> Julia types are mapped to arrow string types, with additional metadata of the original Julia type; this allows deserializing directly to <code>Char</code> and <code>Symbol</code> in Julia, while other language implementations will see these columns as just strings</li><li>Similarly to the above, the <code>UUID</code> Julia type is mapped to a 128-bit <code>FixedSizeBinary</code> arrow type.</li><li><code>Decimal128</code> and <code>Decimal256</code> have no corresponding builtin Julia types, so they&#39;re deserialized using a compatible type definition in Arrow.jl itself: <code>Arrow.Decimal</code></li></ul><p>Note that when <code>convert=false</code> is passed, data will be returned in Arrow.jl-defined types that exactly match the arrow definitions of those types; the authoritative source for how each type represents its data can be found in the arrow <a href="https://github.com/apache/arrow/blob/master/format/Schema.fbs"><code>Schema.fbs</code></a> file.</p><p>One note on performance: when writing <code>TimeZones.ZonedDateTime</code> columns to the arrow format (via <code>Arrow.write</code>), it is preferrable to &quot;wrap&quot; the columns in <code>Arrow.ToTimestamp(col)</code>, as long as the column has <code>ZonedDateTime</code> elements that all share a common timezone. This ensures the writing process can know &quot;upfront&quot; which timezone will be encoded and is thus much more efficient and performant.</p><h4 id="Custom-types"><a class="docs-heading-anchor" href="#Custom-types">Custom types</a><a id="Custom-types-1"></a><a class="docs-heading-anchor-permalink" href="#Custom-types" title="Permalink"></a></h4><p>To support writing your custom Julia struct, Arrow.jl utilizes the format&#39;s mechanism for &quot;extension types&quot; by allowing the storing of Julia type name and metadata in the field metadata. To &quot;hook in&quot; to this machinery, custom types can utilize the interface methods defined in the <code>Arrow.ArrowTypes</code> submodule. For example:</p><pre><code class="language-julia hljs">using Arrow
struct Person
id::Int
name::String
end
# overload interface method for custom type Person; return a symbol as the &quot;name&quot;
# this instructs Arrow.write what &quot;label&quot; to include with a column with this custom type
ArrowTypes.arrowname(::Type{Person}) = :Person
# overload JuliaType on `Val{:Person}`, which is like a dispatchable string
# return our custom *type* Person; this enables Arrow.Table to know how the &quot;label&quot;
# on a custom column should be mapped to a Julia type and deserialized
ArrowTypes.JuliaType(::Val{:Person}) = Person
table = (col1=[Person(1, &quot;Bob&quot;), Person(2, &quot;Jane&quot;)],)
io = IOBuffer()
Arrow.write(io, table)
seekstart(io)
table2 = Arrow.Table(io)</code></pre><p>In this example, we&#39;re writing our <code>table</code>, which is a NamedTuple with one column named <code>col1</code>, which has two elements which are instances of our custom <code>Person</code> struct. We overload <code>Arrowtypes.arrowname</code> so that Arrow.jl knows how to serialize our <code>Person</code> struct. We then overload <code>ArrowTypes.JuliaType</code> so the deserialization process knows how to map from our type label back to our <code>Person</code> struct type. We can then write our data in the arrow format to an in-memory <code>IOBuffer</code>, then read the table back in using <code>Arrow.Table</code>. The table we get back will be an <code>Arrow.Table</code>, with a single <code>Arrow.Struct</code> column with element type <code>Person</code>.</p><p>Note that without calling <code>Arrowtypes.JuliaType</code>, we may get into a weird limbo state where we&#39;ve written our table with <code>Person</code> structs out as a table, but when reading back in, Arrow.jl doesn&#39;t know what a <code>Person</code> is; deserialization won&#39;t fail, but we&#39;ll just get a <code>Namedtuple{(:id, :name), Tuple{Int, String}}</code> back instead of <code>Person</code>.</p><p>While this example is very simple, it shows the basics to allow a custom type to be serialized/deserialized. But the <code>ArrowTypes</code> module offers even more powerful functionality for &quot;hooking&quot; non-native arrow types into the serialization/deserialization processes. Let&#39;s walk through a couple more examples; if you&#39;ve had enough custom type shenanigans, feel free to skip to the next section.</p><p>Let&#39;s take a look at how Arrow.jl allows serializing the <code>nothing</code> value, which is often referred to as the &quot;software engineer&#39;s NULL&quot; in Julia. While Arrow.jl treats <code>missing</code> as the default arrow NULL value, <code>nothing</code> is pretty similar, but we&#39;d still like to treat it separately if possible. Here&#39;s how we enable serialization/deserialization in the <code>ArrowTypes</code> module:</p><pre><code class="language-julia hljs">ArrowTypes.ArrowKind(::Type{Nothing}) = ArrowTypes.NullKind()
ArrowTypes.ArrowType(::Type{Nothing}) = Missing
ArrowTypes.toarrow(::Nothing) = missing
const NOTHING = Symbol(&quot;JuliaLang.Nothing&quot;)
ArrowTypes.arrowname(::Type{Nothing}) = NOTHING
ArrowTypes.JuliaType(::Val{NOTHING}) = Nothing
ArrowTypes.fromarrow(::Type{Nothing}, ::Missing) = nothing</code></pre><p>Let&#39;s walk through what&#39;s going on here, line-by-line:</p><ul><li><code>ArrowKind</code> overload: <code>ArrowKind</code>s are generic &quot;categories&quot; of types supported by the arrow format, like <code>PrimitiveKind</code>, <code>ListKind</code>, etc. They each correspond to a different data layout strategy supported in the arrow format. Here, we define <code>nothing</code>&#39;s kind to be <code>NullKind</code>, which means no actual memory is needed for storage, it&#39;s strictly a &quot;metadata&quot; type where we store the type and # of elements. In our <code>Person</code> example, we didn&#39;t need to overload this since types declared like <code>struct T</code> or <code>mutable struct T</code> are defined as <code>ArrowTypes.StructKind</code> by default</li><li><code>ArrowType</code> overload: here we&#39;re signaling that our type (<code>Nothing</code>) maps to the natively supported arrow type of <code>Missing</code>; this is important for the serializer so it knows which arrow type it will be serializing. Again, we didn&#39;t need to overload this for <code>Person</code> since the serializer knows how to serialize custom structs automatically by using reflection methods like <code>fieldnames(T)</code> and <code>getfield(x, i)</code>.</li><li><code>ArrowTypes.toarrow</code> overload: this is a sister method to <code>ArrowType</code>; we said our type will map to the <code>Missing</code> arrow type, so here we actually define ___how___ it converts to the arrow type; and in this case, it just returns <code>missing</code>. This is yet another method that didn&#39;t show up for <code>Person</code>; why? Well, as we noted in <code>ArrowType</code>, the serializer already knows how to serialize custom structs by using all their fields; if, for some reason, we wanted to omit some fields or otherwise transform things, then we could define corresponding <code>ArrowType</code> and <code>toarrow</code> methods</li><li><code>arrowname</code> overload: similar to our <code>Person</code> example, we need to instruct the serializer how to label our custom type in the arrow type metadata; here we give it the symbol <code>Symbol(&quot;JuliaLang.Nothing&quot;)</code>. Note that while this will ultimately allow us to disambiguate <code>nothing</code> from <code>missing</code> when reading arrow data, if we pass this data to other language implementations, they will only treat the data as <code>missing</code> since they (probably) won&#39;t know how to &quot;understand&quot; the <code>JuliaLang.Nothing</code> type label</li><li><code>JuliaType</code> overload: again, like our <code>Person</code> example, we instruct the deserializer that when it encounters the <code>JuliaLang.Nothing</code> type label, it should treat those values as <code>Nothing</code> type.</li><li>And finally, <code>fromarrow</code> overload: this allows specifying how the native-arrow data should be converted back to our custom type. <code>fromarrow(T, x...)</code> by default will call <code>T(x...)</code>, which is why we didn&#39;t need this overload for <code>Person</code>, but in this example, <code>Nothing(missing)</code> won&#39;t work, so we define our own custom conversion.</li></ul><p>Let&#39;s run through one more complex example, just for fun and to really see how far the system can be pushed:</p><pre><code class="language-julia hljs">using Intervals
table = (col = [
Interval{Closed,Unbounded}(1,nothing),
],)
const NAME = Symbol(&quot;JuliaLang.Interval&quot;)
ArrowTypes.arrowname(::Type{Interval{T, L, R}}) where {T, L, R} = NAME
const LOOKUP = Dict(
&quot;Closed&quot; =&gt; Closed,
&quot;Unbounded&quot; =&gt; Unbounded
)
ArrowTypes.arrowmetadata(::Type{Interval{T, L, R}}) where {T, L, R} = string(L, &quot;.&quot;, R)
function ArrowTypes.JuliaType(::Val{NAME}, ::Type{NamedTuple{names, types}}, meta) where {names, types}
L, R = split(meta, &quot;.&quot;)
return Interval{fieldtype(types, 1), LOOKUP[L], LOOKUP[R]}
end
ArrowTypes.fromarrow(::Type{Interval{T, L, R}}, first, last) where {T, L, R} = Interval{L, R}(first, R == Unbounded ? nothing : last)
io = Arrow.tobuffer(table)
tbl = Arrow.Table(io)</code></pre><p>Again, let&#39;s break down what&#39;s going on here:</p><ul><li>Here we&#39;re trying to save an <code>Interval</code> type in the arrow format; this type is unique in that it has two type parameters (<code>Closed</code> and <code>Unbounded</code>) that are not inferred/based on fields, but are just &quot;type tags&quot; on the type itself</li><li>Note that we define a generic <code>arrowname</code> method on all <code>Interval</code>s, regardless of type parameters. We just want to let arrow know which general type we&#39;re dealing with here</li><li>Next we use a new method <code>ArrowTypes.arrowmetadata</code> to encode the two non-field-based type parameters as a string with a dot delimiter; we encode this information here because remember, we have to match our <code>arrowname</code> Symbol typename in our <code>JuliaType(::Val(name))</code> definition in order to dispatch correctly; if we encoded the type parameters in <code>arrowname</code>, we would need separate <code>arrowname</code> definitions for each unique combination of those two type parameters, and corresponding <code>JuliaType</code> definitions for each as well; yuck. Instead, we let <code>arrowname</code> be generic to our type, and store the type parameters <em>for this specific column</em> using <code>arrowmetadata</code></li><li>Now in <code>JuliaType</code>, note we&#39;re using the 3-argument overload; we want the <code>NamedTuple</code> type that is the native arrow type our <code>Interval</code> is being serialized as; we use this to retrieve the 1st type parameter for our <code>Interval</code>, which is simply the type of the two <code>first</code> and <code>last</code> fields. Then we use the 3rd argument, which is whatever string we returned from <code>arrowmetadata</code>. We call <code>L, R = split(meta, &quot;.&quot;)</code> to parse the two type parameters (in this case <code>Closed</code> and <code>Unbounded</code>), then do a lookup on those strings from a predefined <code>LOOKUP</code> Dict that matches the type parameter name as string to the actual type. We then have all the information to recreate the full <code>Interval</code> type. Neat!</li><li>The one final wrinkle is in our <code>fromarrow</code> method; <code>Interval</code>s that are <code>Unbounded</code>, actually take <code>nothing</code> as the 2nd argument. So letting the default <code>fromarrow</code> definition call <code>Interval{T, L, R}(first, last)</code>, where <code>first</code> and <code>last</code> are both integers isn&#39;t going to work. Instead, we check if the <code>R</code> type parameter is <code>Unbounded</code> and if so, pass <code>nothing</code> as the 2nd arg, otherwise we can pass <code>last</code>.</li></ul><p>This stuff can definitely make your eyes glaze over if you stare at it long enough. As always, don&#39;t hesitate to reach out for quick questions on the <a href="https://julialang.slack.com/messages/data/">#data</a> slack channel, or <a href="https://github.com/apache/arrow-julia/issues/new">open a new issue</a> detailing what you&#39;re trying to do.</p><h3 id="Arrow.Stream"><a class="docs-heading-anchor" href="#Arrow.Stream"><code>Arrow.Stream</code></a><a id="Arrow.Stream-1"></a><a class="docs-heading-anchor-permalink" href="#Arrow.Stream" title="Permalink"></a></h3><p>In addition to <code>Arrow.Table</code>, the Arrow.jl package also provides <code>Arrow.Stream</code> for processing arrow data. While <code>Arrow.Table</code> will iterate all record batches in an arrow file/stream, concatenating columns, <code>Arrow.Stream</code> provides a way to <em>iterate</em> through record batches, one at a time. Each iteration yields an <code>Arrow.Table</code> instance, with columns/data for a single record batch. This allows, if so desired, &quot;batch processing&quot; of arrow data, one record batch at a time, instead of creating a single long table via <code>Arrow.Table</code>.</p><h3 id="Custom-application-metadata"><a class="docs-heading-anchor" href="#Custom-application-metadata">Custom application metadata</a><a id="Custom-application-metadata-1"></a><a class="docs-heading-anchor-permalink" href="#Custom-application-metadata" title="Permalink"></a></h3><p>The Arrow format allows data producers to <a href="https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata">attach custom metadata</a> to various Arrow objects.</p><p>Arrow.jl provides a convenient accessor for this metadata via <a href="../reference/#Arrow.getmetadata-Tuple{Arrow.Table}"><code>Arrow.getmetadata</code></a>. <code>Arrow.getmetadata(t::Arrow.Table)</code> will return an immutable <code>AbstractDict{String,String}</code> that represents the <a href="https://github.com/apache/arrow/blob/85d8175ea24b4dd99f108a673e9b63996d4f88cc/format/Schema.fbs#L515"><code>custom_metadata</code> of the table&#39;s associated <code>Schema</code></a> (or <code>nothing</code> if no such metadata exists), while <code>Arrow.getmetadata(c::Arrow.ArrowVector)</code> will return a similar representation of <a href="https://github.com/apache/arrow/blob/85d8175ea24b4dd99f108a673e9b63996d4f88cc/format/Schema.fbs#L480">the column&#39;s associated <code>Field</code> <code>custom_metadata</code></a> (or <code>nothing</code> if no such metadata exists).</p><p>To attach custom schema/column metadata to Arrow tables at serialization time, see the <code>metadata</code> and <code>colmetadata</code> keyword arguments to <a href="../reference/#Arrow.write"><code>Arrow.write</code></a>.</p><h2 id="Writing-arrow-data"><a class="docs-heading-anchor" href="#Writing-arrow-data">Writing arrow data</a><a id="Writing-arrow-data-1"></a><a class="docs-heading-anchor-permalink" href="#Writing-arrow-data" title="Permalink"></a></h2><p>Ok, so that&#39;s a pretty good rundown of <em>reading</em> arrow data, but how do you <em>produce</em> arrow data? Enter <code>Arrow.write</code>.</p><h3 id="Arrow.write"><a class="docs-heading-anchor" href="#Arrow.write"><code>Arrow.write</code></a><a id="Arrow.write-1"></a><a class="docs-heading-anchor-permalink" href="#Arrow.write" title="Permalink"></a></h3><p>With <code>Arrow.write</code>, you provide either an <code>io::IO</code> argument or a <a href="#support-for-generic-path-like-types"><code>file_path</code></a> to write the arrow data to, as well as a Tables.jl-compatible source that contains the data to be written.</p><p>What are some examples of Tables.jl-compatible sources? A few examples include:</p><ul><li><code>Arrow.write(io, df::DataFrame)</code>: A <code>DataFrame</code> is a collection of indexable columns</li><li><code>Arrow.write(io, CSV.File(file))</code>: read data from a csv file and write out to arrow format</li><li><code>Arrow.write(io, DBInterface.execute(db, sql_query))</code>: Execute an SQL query against a database via the <a href="https://github.com/JuliaDatabases/DBInterface.jl"><code>DBInterface.jl</code></a> interface, and write the query resultset out directly in the arrow format. Packages that implement DBInterface include <a href="https://juliadatabases.github.io/SQLite.jl/stable/">SQLite.jl</a>, <a href="https://juliadatabases.github.io/MySQL.jl/dev/">MySQL.jl</a>, and <a href="http://juliadatabases.github.io/ODBC.jl/latest/">ODBC.jl</a>.</li><li><code>df |&gt; @map(...) |&gt; Arrow.write(io)</code>: Write the results of a <a href="https://www.queryverse.org/Query.jl/stable/">Query.jl</a> chain of operations directly out as arrow data</li><li><code>jsontable(json) |&gt; Arrow.write(io)</code>: Treat a json array of objects or object of arrays as a &quot;table&quot; and write it out as arrow data using the <a href="https://github.com/JuliaData/JSONTables.jl">JSONTables.jl</a> package</li><li><code>Arrow.write(io, (col1=data1, col2=data2, ...))</code>: a <code>NamedTuple</code> of <code>AbstractVector</code>s or an <code>AbstractVector</code> of <code>NamedTuple</code>s are both considered tables by default, so they can be quickly constructed for easy writing of arrow data if you already have columns of data</li></ul><p>And these are just a few examples of the numerous <a href="https://github.com/JuliaData/Tables.jl/blob/master/INTEGRATIONS.md">integrations</a>.</p><p>In addition to just writing out a single &quot;table&quot; of data as a single arrow record batch, <code>Arrow.write</code> also supports writing out multiple record batches when the input supports the <code>Tables.partitions</code> functionality. One immediate, though perhaps not incredibly useful example, is <code>Arrow.Stream</code>. <code>Arrow.Stream</code> implements <code>Tables.partitions</code> in that it iterates &quot;tables&quot; (specifically <code>Arrow.Table</code>), and as such, <code>Arrow.write</code> will iterate an <code>Arrow.Stream</code>, and write out each <code>Arrow.Table</code> as a separate record batch. Another important point for why this example works is because an <code>Arrow.Stream</code> iterates <code>Arrow.Table</code>s that all have the same schema. This is important because when writing arrow data, a &quot;schema&quot; message is always written first, with all subsequent record batches written with data matching the initial schema.</p><p>In addition to inputs that support <code>Tables.partitions</code>, note that the Tables.jl itself provides the <code>Tables.partitioner</code> function, which allows providing your own separate instances of similarly-schema-ed tables as &quot;partitions&quot;, like:</p><pre><code class="language-julia hljs"># treat 2 separate NamedTuples of vectors with same schema as 1 table, 2 partitions
tbl_parts = Tables.partitioner([(col1=data1, col2=data2), (col1=data3, col2=data4)])
Arrow.write(io, tbl_parts)
# treat an array of csv files with same schema where each file is a partition
# in this form, a function `CSV.File` is applied to each element of 2nd argument
csv_parts = Tables.partitioner(CSV.File, csv_files)
Arrow.write(io, csv_parts)</code></pre><h3 id="Arrow.Writer"><a class="docs-heading-anchor" href="#Arrow.Writer"><code>Arrow.Writer</code></a><a id="Arrow.Writer-1"></a><a class="docs-heading-anchor-permalink" href="#Arrow.Writer" title="Permalink"></a></h3><p>With <code>Arrow.Writer</code>, you instantiate an <code>Arrow.Writer</code> object, write sources using it, and then close it. This allows for incrmental writes to the same sink. It is similar to <code>Arrow.append</code> without having to close and re-open the sink in between writes and without the limitation of only supporting the IPC stream format.</p><h3 id="Multithreaded-writing"><a class="docs-heading-anchor" href="#Multithreaded-writing">Multithreaded writing</a><a id="Multithreaded-writing-1"></a><a class="docs-heading-anchor-permalink" href="#Multithreaded-writing" title="Permalink"></a></h3><p>By default, <code>Arrow.write</code> will use multiple threads to write multiple record batches simultaneously (e.g. if julia is started with <code>julia -t 8</code> or the <code>JULIA_NUM_THREADS</code> environment variable is set). The number of concurrent tasks to use when writing can be controlled by passing the <code>ntasks</code> keyword argument to <code>Arrow.write</code>. Passing <code>ntasks=1</code> avoids any multithreading when writing.</p><h3 id="Compression"><a class="docs-heading-anchor" href="#Compression">Compression</a><a id="Compression-1"></a><a class="docs-heading-anchor-permalink" href="#Compression" title="Permalink"></a></h3><p>Compression is supported when writing via the <code>compress</code> keyword argument. Possible values include <code>:lz4</code>, <code>:zstd</code>, or your own initialized <code>LZ4FrameCompressor</code> or <code>ZstdCompressor</code> objects; will cause all buffers in each record batch to use the respective compression encoding or compressor.</p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../">« Home</a><a class="docs-footer-nextpage" href="../reference/">API Reference »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> version 0.27.24 on <span class="colophon-date" title="Tuesday 13 June 2023 13:49">Tuesday 13 June 2023</span>. Using Julia version 1.9.0.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>