var documenterSearchIndex = {"docs":
[{"location":"manual/","page":"User Manual","title":"User Manual","text":"<!–- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"http://www.apache.org/licenses/LICENSE-2.0","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. –>","category":"page"},{"location":"manual/#User-Manual","page":"User Manual","title":"User Manual","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"The goal of this documentation is to provide a brief introduction to the arrow data format, then provide a walk-through of the functionality provided in the Arrow.jl Julia package, with an aim to expose a little of the machinery \"under the hood\" to help explain how things work and how that influences real-world use-cases for the arrow data format.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"The best place to learn about the Apache arrow project is the website itself, specifically the data format specification. Put briefly, the arrow project provides a formal speficiation for how columnar, \"table\" data can be laid out efficiently in memory to standardize and maximize the ability to share data across languages/platforms. In the current apache/arrow GitHub repository, language implementations exist for C++, Java, Go, Javascript, Rust, to name a few. Other database vendors and data processing frameworks/applications have also built support for the arrow format, allowing for a wide breadth of possibility for applications to \"speak the data language\" of arrow.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"The Arrow.jl Julia package is another implementation, allowing the ability to both read and write data in the arrow format. As a data format, arrow specifies an exact memory layout to be used for columnar table data, and as such, \"reading\" involves custom Julia objects (Arrow.Table and Arrow.Stream), which read the metadata of an \"arrow memory blob\", then wrap the array data contained therein, having learned the type and size, amongst other properties, from the metadata. Let's take a closer look at what this \"reading\" of arrow memory really means/looks like.","category":"page"},{"location":"manual/#Support-for-generic-path-like-types","page":"User Manual","title":"Support for generic path-like types","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Arrow.jl attempts to support any path-like type wherever a function takes a path as an argument. 
The Arrow.jl API should generically work as long as the type supports:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Base.open(path, mode)::I where I <: IO","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"When a custom IO subtype (I) is returned, the following methods also need to be defined:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Base.read(io::I, ::Type{UInt8}) or Base.read(io::I)\nBase.write(io::I, x)","category":"page"},{"location":"manual/#Reading-arrow-data","page":"User Manual","title":"Reading arrow data","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"After installing the Arrow.jl Julia package (via ] add Arrow), and if you have some arrow data, let's say a file named data.arrow generated from the pyarrow library (a Python library for interfacing with arrow data), you can then read that arrow data into a Julia session by doing:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"using Arrow\n\ntable = Arrow.Table(\"data.arrow\")","category":"page"},{"location":"manual/#Arrow.Table","page":"User Manual","title":"Arrow.Table","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"The type of table in this example will be an Arrow.Table. When \"reading\" the arrow data, Arrow.Table first \"mmapped\" the data.arrow file, which is an important technique for dealing with data larger than available RAM on a system. By \"mmapping\" a file, the OS doesn't actually load the entire file contents into RAM at the same time, but file contents are \"swapped\" into RAM as different regions of a file are requested. Once \"mmapped\", Arrow.Table then inspected the metadata in the file to determine the number of columns, their names and types, at which byte offset each column begins in the file data, and even how many \"batches\" are included in this file (arrow tables may be partitioned into one or more \"record batches\", each containing portions of the data). Armed with all the appropriate metadata, Arrow.Table then created custom array objects (ArrowVector), which act as \"views\" into the raw arrow memory bytes. This is a significant point in that no extra memory is allocated for \"data\" when reading arrow data. This is in contrast to reading data from a csv file as columns into Julia structures; we would need to allocate those array structures ourselves, then parse the file, \"filling in\" each element of the array with the data we parsed from the file. Arrow data, on the other hand, is already laid out in memory or on disk in a binary format, and as long as we have the metadata to interpret the raw bytes, we can figure out whether to treat those bytes as a Vector{Float64}, etc. 
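For illustration, a minimal sketch (the file name \"data.arrow\" is assumed to exist):","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"using Arrow\n\ntable = Arrow.Table(\"data.arrow\")\ncol = table[1]   # an ArrowVector \"view\" into the mmapped arrow bytes; no data is copied\nvec = copy(col)  # materializes a normal, mutable Julia Vector","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"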
A sample of the kinds of arrow array types you might see when deserializing arrow data includes:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Arrow.Primitive: the most common array type for simple, fixed-size elements like integers, floats, time types, and decimals\nArrow.List: an array type where its own elements are also arrays of some kind, like string columns, where each element can be thought of as an array of characters\nArrow.FixedSizeList: similar to the List type, but where each array element has a fixed number of elements itself; you can think of this like a Vector{NTuple{N, T}}, where N is the fixed-size width\nArrow.Map: an array type where each element is like a Julia Dict; a list of key-value pairs like a Vector{Dict}\nArrow.Struct: an array type where each element is an instance of a custom struct, i.e. an ordered collection of named & typed fields, kind of like a Vector{NamedTuple}\nArrow.DenseUnion: an array type where elements may be of several different types, stored compactly; can be thought of like Vector{Union{A, B}}\nArrow.SparseUnion: another array type where elements may be of several different types, but stored as if made up of equal-length child arrays for each possible type (less memory efficient than DenseUnion)\nArrow.DictEncoded: a special array type where values are \"dictionary encoded\", meaning the list of unique, possible values for an array is stored internally in an \"encoding pool\", whereas each stored element of the array is just an integer \"code\" to index into the encoding pool for the actual value.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"And while these custom array types do subtype AbstractArray, there is no current support for setindex!. Remember, these arrays are \"views\" into the raw arrow bytes, so for array types other than Arrow.Primitive, it gets pretty tricky to allow manipulating those raw arrow bytes. Nevertheless, it's as simple as calling copy(x) where x is any ArrowVector type, and a normal Julia Vector type will be fully materialized (which would then allow mutating/manipulating values).","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"So, what can you do with an Arrow.Table full of data? Quite a bit, actually!","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Because Arrow.Table implements the Tables.jl interface, it opens up a world of integrations for using arrow data. A few examples include:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"df = DataFrame(Arrow.Table(file)): Build a DataFrame, using the arrow vectors themselves; this allows utilizing a host of DataFrames.jl functionality directly on arrow data; grouping, joining, selecting, etc.\nTables.datavaluerows(Arrow.Table(file)) |> @map(...) |> @filter(...) |> DataFrame: use Query.jl's row-processing utilities to map, group, filter, mutate, etc. 
directly over arrow data.\nArrow.Table(file) |> SQLite.load!(db, \"arrow_table\"): load arrow data directly into an SQLite database/table, where SQL queries can be executed on the data\nArrow.Table(file) |> CSV.write(\"arrow.csv\"): write arrow data out to a csv file","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"A full list of Julia packages leveraging the Tables.jl interface can be found here.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Apart from letting other packages have all the fun, an Arrow.Table itself can be plenty useful. For example, with tbl = Arrow.Table(file):","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"tbl[1]: retrieve the first column via indexing; the number of columns can be queried via length(tbl)\ntbl[:col1] or tbl.col1: retrieve the column named col1, either via indexing with the column name given as a Symbol, or via \"dot-access\"\nfor col in tbl: iterate through columns in the table\nAbstractDict methods like haskey(tbl, :col1), get(tbl, :col1, nothing), keys(tbl), or values(tbl)","category":"page"},{"location":"manual/#Arrow-types","page":"User Manual","title":"Arrow types","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"In the arrow data format, specific logical types are supported, a list of which can be found here. These include booleans, integers of various bit widths, floats, decimals, time types, and binary/string. While most of these map naturally to types built into Julia itself, there are a few cases where the definitions are slightly different, and in these cases, by default, they are converted to more \"friendly\" Julia types (this auto conversion can be avoided by passing convert=false to Arrow.Table, like Arrow.Table(file; convert=false)). Examples of arrow-to-Julia type mappings include:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Date, Time, Timestamp, and Duration all have natural Julia definitions in Dates.Date, Dates.Time, TimeZones.ZonedDateTime, and Dates.Period subtypes, respectively.\nChar and Symbol Julia types are mapped to arrow string types, with additional metadata of the original Julia type; this allows deserializing directly to Char and Symbol in Julia, while other language implementations will see these columns as just strings\nSimilarly to the above, the UUID Julia type is mapped to a 128-bit FixedSizeBinary arrow type.\nDecimal128 and Decimal256 have no corresponding built-in Julia types, so they're deserialized using a compatible type definition in Arrow.jl itself: Arrow.Decimal","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Note that when convert=false is passed, data will be returned in Arrow.jl-defined types that exactly match the arrow definitions of those types; the authoritative source for how each type represents its data can be found in the arrow Schema.fbs file.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"One note on performance: when writing TimeZones.ZonedDateTime columns to the arrow format (via Arrow.write), it is preferable to \"wrap\" the columns in Arrow.ToTimestamp(col), as long as the column has ZonedDateTime elements that all share a common timezone. 
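For example, a hedged sketch (the column values here are made up):","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"using Arrow, Dates, TimeZones\n\nzdt = ZonedDateTime(DateTime(2021, 1, 1), tz\"UTC\")\ncol = [zdt, zdt + Hour(1)]  # every element shares the UTC timezone\nArrow.write(\"ts.arrow\", (ts=Arrow.ToTimestamp(col),))  # timezone encoded from the first element","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"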
This ensures the writing process can know \"upfront\" which timezone will be encoded and is thus much more efficient and performant.","category":"page"},{"location":"manual/#Custom-types","page":"User Manual","title":"Custom types","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"To support writing your custom Julia struct, Arrow.jl utilizes the format's mechanism for \"extension types\" by allowing the Julia type name and metadata to be stored in the field metadata. To \"hook in\" to this machinery, custom types can utilize the interface methods defined in the Arrow.ArrowTypes submodule. For example:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"using Arrow, Arrow.ArrowTypes\n\nstruct Person\n id::Int\n name::String\nend\n\n# overload interface method for custom type Person; return a symbol as the \"name\"\n# this instructs Arrow.write what \"label\" to include with a column with this custom type\nArrowTypes.arrowname(::Type{Person}) = :Person\n# overload JuliaType on `Val{:Person}`, which is like a dispatchable string\n# return our custom *type* Person; this enables Arrow.Table to know how the \"label\"\n# on a custom column should be mapped to a Julia type and deserialized\nArrowTypes.JuliaType(::Val{:Person}) = Person\n\ntable = (col1=[Person(1, \"Bob\"), Person(2, \"Jane\")],)\nio = IOBuffer()\nArrow.write(io, table)\nseekstart(io)\ntable2 = Arrow.Table(io)","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"In this example, we're writing our table, a NamedTuple with one column named col1, whose two elements are instances of our custom Person struct. We overload ArrowTypes.arrowname so that Arrow.jl knows how to serialize our Person struct. We then overload ArrowTypes.JuliaType so the deserialization process knows how to map from our type label back to our Person struct type. We can then write our data in the arrow format to an in-memory IOBuffer, then read the table back in using Arrow.Table. The table we get back will be an Arrow.Table, with a single Arrow.Struct column with element type Person.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Note that without defining ArrowTypes.JuliaType, we may get into a weird limbo state where we've written our table with Person structs out as a table, but when reading back in, Arrow.jl doesn't know what a Person is; deserialization won't fail, but we'll just get a NamedTuple{(:id, :name), Tuple{Int, String}} back instead of Person.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"While this example is very simple, it shows the basics to allow a custom type to be serialized/deserialized. But the ArrowTypes module offers even more powerful functionality for \"hooking\" non-native arrow types into the serialization/deserialization processes. Let's walk through a couple more examples; if you've had enough custom type shenanigans, feel free to skip to the next section.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Let's take a look at how Arrow.jl allows serializing the nothing value, which is often referred to as the \"software engineer's NULL\" in Julia. While Arrow.jl treats missing as the default arrow NULL value, nothing is pretty similar, but we'd still like to treat it separately if possible. 
Here's how we enable serialization/deserialization in the ArrowTypes module:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"ArrowTypes.ArrowKind(::Type{Nothing}) = ArrowTypes.NullKind()\nArrowTypes.ArrowType(::Type{Nothing}) = Missing\nArrowTypes.toarrow(::Nothing) = missing\nconst NOTHING = Symbol(\"JuliaLang.Nothing\")\nArrowTypes.arrowname(::Type{Nothing}) = NOTHING\nArrowTypes.JuliaType(::Val{NOTHING}) = Nothing\nArrowTypes.fromarrow(::Type{Nothing}, ::Missing) = nothing","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Let's walk through what's going on here, line-by-line:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"ArrowKind overload: ArrowKinds are generic \"categories\" of types supported by the arrow format, like PrimitiveKind, ListKind, etc. They each correspond to a different data layout strategy supported in the arrow format. Here, we define nothing's kind to be NullKind, which means no actual memory is needed for storage; it's strictly a \"metadata\" type where we store the type and number of elements. In our Person example, we didn't need to overload this since types declared like struct T or mutable struct T are defined as ArrowTypes.StructKind by default\nArrowType overload: here we're signaling that our type (Nothing) maps to the natively supported arrow type of Missing; this is important for the serializer so it knows which arrow type it will be serializing. Again, we didn't need to overload this for Person since the serializer knows how to serialize custom structs automatically by using reflection methods like fieldnames(T) and getfield(x, i).\nArrowTypes.toarrow overload: this is a sister method to ArrowType; we said our type will map to the Missing arrow type, so here we actually define how it converts to the arrow type; and in this case, it just returns missing. This is yet another method that didn't show up for Person; why? Well, as we noted in ArrowType, the serializer already knows how to serialize custom structs by using all their fields; if, for some reason, we wanted to omit some fields or otherwise transform things, then we could define corresponding ArrowType and toarrow methods\narrowname overload: similar to our Person example, we need to instruct the serializer how to label our custom type in the arrow type metadata; here we give it the symbol Symbol(\"JuliaLang.Nothing\"). Note that while this will ultimately allow us to disambiguate nothing from missing when reading arrow data, if we pass this data to other language implementations, they will only treat the data as missing since they (probably) won't know how to \"understand\" the JuliaLang.Nothing type label\nJuliaType overload: again, like our Person example, we instruct the deserializer that when it encounters the JuliaLang.Nothing type label, it should treat those values as Nothing type.\nAnd finally, fromarrow overload: this allows specifying how the native-arrow data should be converted back to our custom type. fromarrow(T, x...) 
by default will call T(x...), which is why we didn't need this overload for Person, but in this example, Nothing(missing) won't work, so we define our own custom conversion.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Let's run through one more complex example, just for fun and to really see how far the system can be pushed:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"using Arrow, Arrow.ArrowTypes, Intervals\ntable = (col = [\n Interval{Closed,Unbounded}(1,nothing),\n],)\nconst NAME = Symbol(\"JuliaLang.Interval\")\nArrowTypes.arrowname(::Type{Interval{T, L, R}}) where {T, L, R} = NAME\nconst LOOKUP = Dict(\n \"Closed\" => Closed,\n \"Unbounded\" => Unbounded\n)\nArrowTypes.arrowmetadata(::Type{Interval{T, L, R}}) where {T, L, R} = string(L, \".\", R)\nfunction ArrowTypes.JuliaType(::Val{NAME}, ::Type{NamedTuple{names, types}}, meta) where {names, types}\n L, R = split(meta, \".\")\n return Interval{fieldtype(types, 1), LOOKUP[L], LOOKUP[R]}\nend\nArrowTypes.fromarrow(::Type{Interval{T, L, R}}, first, last) where {T, L, R} = Interval{L, R}(first, R == Unbounded ? nothing : last)\nio = Arrow.tobuffer(table)\ntbl = Arrow.Table(io)","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Again, let's break down what's going on here:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Here we're trying to save an Interval type in the arrow format; this type is unique in that it has two type parameters (Closed and Unbounded) that are not inferred/based on fields, but are just \"type tags\" on the type itself\nNote that we define a generic arrowname method on all Intervals, regardless of type parameters. We just want to let arrow know which general type we're dealing with here\nNext we use a new method ArrowTypes.arrowmetadata to encode the two non-field-based type parameters as a string with a dot delimiter; we encode this information here because remember, we have to match our arrowname Symbol typename in our JuliaType(::Val{name}) definition in order to dispatch correctly; if we encoded the type parameters in arrowname, we would need separate arrowname definitions for each unique combination of those two type parameters, and corresponding JuliaType definitions for each as well; yuck. Instead, we let arrowname be generic to our type, and store the type parameters for this specific column using arrowmetadata\nNow in JuliaType, note we're using the 3-argument overload; we want the NamedTuple type that is the native arrow type our Interval is being serialized as; we use this to retrieve the 1st type parameter for our Interval, which is simply the type of the first and last fields. Then we use the 3rd argument, which is whatever string we returned from arrowmetadata. We call L, R = split(meta, \".\") to parse the two type parameters (in this case Closed and Unbounded), then do a lookup on those strings from a predefined LOOKUP Dict that matches the type parameter name as string to the actual type. We then have all the information to recreate the full Interval type. Neat!\nThe one final wrinkle is in our fromarrow method; Intervals that are Unbounded actually take nothing as the 2nd argument. So letting the default fromarrow definition call Interval{T, L, R}(first, last), where first and last are both integers, isn't going to work. 
Instead, we check if the R type parameter is Unbounded and, if so, pass nothing as the 2nd arg; otherwise, we pass last.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"This stuff can definitely make your eyes glaze over if you stare at it long enough. As always, don't hesitate to reach out for quick questions on the #data slack channel, or open a new issue detailing what you're trying to do.","category":"page"},{"location":"manual/#Arrow.Stream","page":"User Manual","title":"Arrow.Stream","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"In addition to Arrow.Table, the Arrow.jl package also provides Arrow.Stream for processing arrow data. While Arrow.Table will iterate all record batches in an arrow file/stream, concatenating columns, Arrow.Stream provides a way to iterate through record batches, one at a time. Each iteration yields an Arrow.Table instance, with columns/data for a single record batch. This allows, if so desired, \"batch processing\" of arrow data, one record batch at a time, instead of creating a single long table via Arrow.Table.","category":"page"},{"location":"manual/#Custom-application-metadata","page":"User Manual","title":"Custom application metadata","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"The Arrow format allows data producers to attach custom metadata to various Arrow objects.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Arrow.jl provides a convenient accessor for this metadata via Arrow.getmetadata. Arrow.getmetadata(t::Arrow.Table) will return an immutable AbstractDict{String,String} that represents the custom_metadata of the table's associated Schema (or nothing if no such metadata exists), while Arrow.getmetadata(c::Arrow.ArrowVector) will return a similar representation of the column's associated Field custom_metadata (or nothing if no such metadata exists).","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"To attach custom schema/column metadata to Arrow tables at serialization time, see the metadata and colmetadata keyword arguments to Arrow.write.","category":"page"},{"location":"manual/#Writing-arrow-data","page":"User Manual","title":"Writing arrow data","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Ok, so that's a pretty good rundown of reading arrow data, but how do you produce arrow data? Enter Arrow.write.","category":"page"},{"location":"manual/#Arrow.write","page":"User Manual","title":"Arrow.write","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"With Arrow.write, you provide either an io::IO argument or a file_path to write the arrow data to, as well as a Tables.jl-compatible source that contains the data to be written.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"What are some examples of Tables.jl-compatible sources? 
A few examples include:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Arrow.write(io, df::DataFrame): A DataFrame is a collection of indexable columns\nArrow.write(io, CSV.File(file)): read data from a csv file and write out to arrow format\nArrow.write(io, DBInterface.execute(db, sql_query)): Execute an SQL query against a database via the DBInterface.jl interface, and write the query resultset out directly in the arrow format. Packages that implement DBInterface include SQLite.jl, MySQL.jl, and ODBC.jl.\ndf |> @map(...) |> Arrow.write(io): Write the results of a Query.jl chain of operations directly out as arrow data\njsontable(json) |> Arrow.write(io): Treat a json array of objects or object of arrays as a \"table\" and write it out as arrow data using the JSONTables.jl package\nArrow.write(io, (col1=data1, col2=data2, ...)): a NamedTuple of AbstractVectors or an AbstractVector of NamedTuples are both considered tables by default, so they can be quickly constructed for easy writing of arrow data if you already have columns of data","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"And these are just a few examples of the numerous integrations.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"In addition to just writing out a single \"table\" of data as a single arrow record batch, Arrow.write also supports writing out multiple record batches when the input supports the Tables.partitions functionality. One immediate, though perhaps not incredibly useful, example is Arrow.Stream. Arrow.Stream implements Tables.partitions in that it iterates \"tables\" (specifically Arrow.Table), and as such, Arrow.write will iterate an Arrow.Stream, and write out each Arrow.Table as a separate record batch. Another reason this example works is that an Arrow.Stream iterates Arrow.Tables that all have the same schema. This is important because when writing arrow data, a \"schema\" message is always written first, with all subsequent record batches written with data matching the initial schema.","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"In addition to inputs that support Tables.partitions, note that Tables.jl itself provides the Tables.partitioner function, which allows providing your own separate instances of similarly-schemaed tables as \"partitions\", like:","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"# treat 2 separate NamedTuples of vectors with same schema as 1 table, 2 partitions\ntbl_parts = Tables.partitioner([(col1=data1, col2=data2), (col1=data3, col2=data4)])\nArrow.write(io, tbl_parts)\n\n# treat an array of csv files with same schema where each file is a partition\n# in this form, a function `CSV.File` is applied to each element of 2nd argument\ncsv_parts = Tables.partitioner(CSV.File, csv_files)\nArrow.write(io, csv_parts)","category":"page"},{"location":"manual/#Arrow.Writer","page":"User Manual","title":"Arrow.Writer","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"With Arrow.Writer, you instantiate an Arrow.Writer object, write sources using it, and then close it. This allows for incremental writes to the same sink. 
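For example, a minimal sketch (the file name \"out.arrow\" is hypothetical):","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"using Arrow\n\nwriter = open(Arrow.Writer, \"out.arrow\")\nArrow.write(writer, (col1=[1, 2],))  # first record batch\nArrow.write(writer, (col1=[3, 4],))  # second record batch; must match the first schema\nclose(writer)","category":"page"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"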
It is similar to Arrow.append without having to close and re-open the sink in between writes and without the limitation of only supporting the IPC stream format.","category":"page"},{"location":"manual/#Multithreaded-writing","page":"User Manual","title":"Multithreaded writing","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"By default, Arrow.write will use multiple threads to write multiple record batches simultaneously (e.g. if julia is started with julia -t 8 or the JULIA_NUM_THREADS environment variable is set). The number of concurrent tasks to use when writing can be controlled by passing the ntasks keyword argument to Arrow.write. Passing ntasks=1 avoids any multithreading when writing.","category":"page"},{"location":"manual/#Compression","page":"User Manual","title":"Compression","text":"","category":"section"},{"location":"manual/","page":"User Manual","title":"User Manual","text":"Compression is supported when writing via the compress keyword argument. Possible values include :lz4, :zstd, or your own initialized LZ4FrameCompressor or ZstdCompressor objects; these will cause all buffers in each record batch to use the respective compression encoding or compressor.","category":"page"},{"location":"reference/","page":"API Reference","title":"API Reference","text":"<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at","category":"page"},{"location":"reference/","page":"API Reference","title":"API Reference","text":"http://www.apache.org/licenses/LICENSE-2.0","category":"page"},{"location":"reference/","page":"API Reference","title":"API Reference","text":"Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->","category":"page"},{"location":"reference/#API-Reference","page":"API Reference","title":"API Reference","text":"","category":"section"},{"location":"reference/","page":"API Reference","title":"API Reference","text":"Modules = [Arrow]\nOrder = [:type, :function]","category":"page"},{"location":"reference/#Arrow.ArrowVector","page":"API Reference","title":"Arrow.ArrowVector","text":"Arrow.ArrowVector\n\nAn abstract type that subtypes AbstractVector. Each specific arrow array type subtypes ArrowVector. See BoolVector, Primitive, List, Map, FixedSizeList, Struct, DenseUnion, SparseUnion, and DictEncoded for more details.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.BoolVector","page":"API Reference","title":"Arrow.BoolVector","text":"Arrow.BoolVector\n\nA bit-packed array type, similar to ValidityBitmap, but which holds boolean values, true or false.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.Compressed","page":"API Reference","title":"Arrow.Compressed","text":"Arrow.Compressed\n\nRepresents the compressed version of an ArrowVector. Holds a reference to the original column. 
May have Compressed children for nested array types.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.DenseUnion","page":"API Reference","title":"Arrow.DenseUnion","text":"Arrow.DenseUnion\n\nAn ArrowVector where the type of each element is one of a fixed set of types, meaning its eltype is like a julia Union{type1, type2, ...}. An Arrow.DenseUnion, in comparison to Arrow.SparseUnion, stores elements in a set of arrays, one array per possible type, and an \"offsets\" array, where each offset element is the index into one of the typed arrays. This allows a sort of \"compression\", where no extra space is used/allocated to store all the elements.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.DictEncode","page":"API Reference","title":"Arrow.DictEncode","text":"Arrow.DictEncode(::AbstractVector, id::Integer=nothing)\n\nSignals that a column/array should be dictionary encoded when serialized to the arrow streaming/file format. An optional id number may be provided to signal that multiple columns should use the same pool when being dictionary encoded.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.DictEncoded","page":"API Reference","title":"Arrow.DictEncoded","text":"Arrow.DictEncoded\n\nA dictionary encoded array type (similar to a PooledArray). Behaves just like a normal array in most respects; internally, possible values are stored in the encoding::DictEncoding field, while the indices::Vector{<:Integer} field holds the \"codes\" of each element for indexing into the encoding pool. Any column/array can be dict encoded when serializing to the arrow format either by passing the dictencode=true keyword argument to Arrow.write (which causes all columns to be dict encoded), or by wrapping individual columns/arrays in Arrow.DictEncode(x).\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.DictEncoding","page":"API Reference","title":"Arrow.DictEncoding","text":"Arrow.DictEncoding\n\nRepresents the \"pool\" of possible values for a DictEncoded array type. Whether the order of values is significant can be checked by looking at the isOrdered boolean field.\n\nThe S type parameter, while not tied directly to any field, is the signed integer \"index type\" of the parent DictEncoded. We keep track of this in the DictEncoding in order to validate the length of the pool doesn't exceed the index type limit. The general workflow of writing arrow data means the initial schema will typically be based off the data in the first record batch, and subsequent record batches need to match the same schema exactly. 
For example, if a non-first record batch dict encoded column were to cause a DictEncoding pool to overflow on unique values, a fatal error should be thrown.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.FixedSizeList","page":"API Reference","title":"Arrow.FixedSizeList","text":"Arrow.FixedSizeList\n\nAn ArrowVector where each element is a \"fixed size\" list of some kind, like an NTuple{N, T}.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.List","page":"API Reference","title":"Arrow.List","text":"Arrow.List\n\nAn ArrowVector where each element is a variable sized list of some kind, like an AbstractVector or AbstractString.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.Map","page":"API Reference","title":"Arrow.Map","text":"Arrow.Map\n\nAn ArrowVector where each element is a \"map\" of some kind, like a Dict.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.Primitive","page":"API Reference","title":"Arrow.Primitive","text":"Arrow.Primitive\n\nAn ArrowVector where each element is a \"fixed size\" scalar of some kind, like an integer, float, decimal, or time type.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.SparseUnion","page":"API Reference","title":"Arrow.SparseUnion","text":"Arrow.SparseUnion\n\nAn ArrowVector where the type of each element is one of a fixed set of types, meaning its eltype is like a julia Union{type1, type2, ...}. An Arrow.SparseUnion, in comparison to Arrow.DenseUnion, stores elements in a set of arrays, one array per possible type, and each typed array has the same length as the full array. This ends up with \"wasted\" space, since only one slot among the typed arrays is valid per full array element, but can allow for certain optimizations when each typed array has the same length.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.Stream","page":"API Reference","title":"Arrow.Stream","text":"Arrow.Stream(io::IO; convert::Bool=true)\nArrow.Stream(file::String; convert::Bool=true)\nArrow.Stream(bytes::Vector{UInt8}, pos=1, len=nothing; convert::Bool=true)\nArrow.Stream(inputs::Vector; convert::Bool=true)\n\nStart reading an arrow formatted table, from:\n\nio, bytes will be read all at once via read(io)\nfile, bytes will be read via Mmap.mmap(file)\nbytes, a byte vector directly, optionally specifying the starting byte position pos and length len\nA Vector of any of the above, in which each input should be an IPC stream or arrow file, and all inputs must have the same schema\n\nReads the initial schema message from the arrow stream/file, then returns an Arrow.Stream object which will iterate over record batch messages, producing an Arrow.Table on each iteration.\n\nBy iterating Arrow.Table, Arrow.Stream satisfies the Tables.partitions interface, and as such can be passed to Tables.jl-compatible sink functions.\n\nThis allows iterating over extremely large \"arrow tables\" in chunks represented as record batches.\n\nSupports the convert keyword argument which controls whether certain arrow primitive types will be lazily converted to more friendly Julia defaults; by default, convert=true.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.Struct","page":"API Reference","title":"Arrow.Struct","text":"Arrow.Struct\n\nAn ArrowVector where each element is a \"struct\" of some kind with ordered, named fields, like a NamedTuple{names, types} or regular julia struct.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.Table","page":"API Reference","title":"Arrow.Table","text":"Arrow.Table(io::IO; 
convert::Bool=true)\nArrow.Table(file::String; convert::Bool=true)\nArrow.Table(bytes::Vector{UInt8}, pos=1, len=nothing; convert::Bool=true)\nArrow.Table(inputs::Vector; convert::Bool=true)\n\nRead an arrow formatted table, from:\n\nio, bytes will be read all at once via read(io)\nfile, bytes will be read via Mmap.mmap(file)\nbytes, a byte vector directly, optionally specifying the starting byte position pos and length len\nA Vector of any of the above, in which each input should be an IPC stream or arrow file, and all inputs must have the same schema\n\nReturns an Arrow.Table object that allows column access via table.col1, table[:col1], or table[1].\n\nNOTE: the columns in an Arrow.Table are views into the original arrow memory, and hence are not easily modifiable (with e.g. push!, append!, etc.). To mutate arrow columns, call copy(x) to materialize the arrow data as a normal Julia array.\n\nArrow.Table also satisfies the Tables.jl interface, and so can easily be materialized via any supporting sink function: e.g. DataFrame(Arrow.Table(file)), SQLite.load!(db, \"table\", Arrow.Table(file)), etc.\n\nSupports the convert keyword argument which controls whether certain arrow primitive types will be lazily converted to more friendly Julia defaults; by default, convert=true.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.ToTimestamp","page":"API Reference","title":"Arrow.ToTimestamp","text":"Arrow.ToTimestamp(x::AbstractVector{ZonedDateTime})\n\nWrapper array that provides a more efficient encoding of ZonedDateTime elements to the arrow format. In the arrow format, timestamp columns with timezone information are encoded as the arrow equivalent of a Julia type parameter, meaning an entire column should have elements all with the same timezone. If a ZonedDateTime column is passed to Arrow.write, for correctness, it must scan each element to check each timezone. Arrow.ToTimestamp provides a \"bypass\" of this process by encoding the timezone of the first element of the AbstractVector{ZonedDateTime}, which in turn allows Arrow.write to avoid costly checking/conversion and can encode the ZonedDateTime as Arrow.Timestamp directly.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.ValidityBitmap","page":"API Reference","title":"Arrow.ValidityBitmap","text":"Arrow.ValidityBitmap\n\nA bit-packed array type where each bit corresponds to an element in an ArrowVector, indicating whether that element is \"valid\" (bit == 1), or not (bit == 0). 
Used to indicate element missingness (whether it's null).\n\nIf the null count of an array is zero, the ValidityBitmap will be \"empty\" and all elements are treated as \"valid\"/non-null.\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.Writer","page":"API Reference","title":"Arrow.Writer","text":"Arrow.Writer{T<:IO}\n\nAn object that can be used to incrementally write Arrow partitions.\n\nExamples\n\njulia> writer = open(Arrow.Writer, tempname())\n\njulia> partition1 = (col1 = [1, 2], col2 = [\"A\", \"B\"])\n(col1 = [1, 2], col2 = [\"A\", \"B\"])\n\njulia> Arrow.write(writer, partition1)\n\njulia> partition2 = (col1 = [3, 4], col2 = [\"C\", \"D\"])\n(col1 = [3, 4], col2 = [\"C\", \"D\"])\n\njulia> Arrow.write(writer, partition2)\n\njulia> close(writer)\n\nIt's also possible to automatically close the Writer using a do-block:\n\njulia> open(Arrow.Writer, tempname()) do writer\n partition1 = (col1 = [1, 2], col2 = [\"A\", \"B\"])\n Arrow.write(writer, partition1)\n partition2 = (col1 = [3, 4], col2 = [\"C\", \"D\"])\n Arrow.write(writer, partition2)\n end\n\n\n\n\n\n","category":"type"},{"location":"reference/#Arrow.append","page":"API Reference","title":"Arrow.append","text":"Arrow.append(io::IO, tbl)\nArrow.append(file::String, tbl)\ntbl |> Arrow.append(file)\n\nAppend any Tables.jl-compatible tbl to an existing arrow formatted file or IO. The existing arrow data must be in IPC stream format. Note that appending to the \"feather formatted file\" is not allowed, as this file format doesn't support appending. That means files written like Arrow.write(filename::String, tbl) cannot be appended to; instead, you should write like Arrow.write(filename::String, tbl; file=false).\n\nWhen an IO object is provided to be written on to, it must support seeking. For example, a file opened in r+ mode or an IOBuffer that is readable, writable and seekable can be appended to, but not a network stream.\n\nMultiple record batches will be written based on the number of Tables.partitions(tbl) that are provided; by default, this is just one for a given table, but some table sources support automatic partitioning. Note you can turn multiple table objects into partitions by doing Tables.partitioner([tbl1, tbl2, ...]), but note that each table must have the exact same Tables.Schema.\n\nBy default, Arrow.append will use multiple threads to write multiple record batches simultaneously (e.g. 
if julia is started with julia -t 8 or the JULIA_NUM_THREADS environment variable is set).\n\nSupported keyword arguments to Arrow.append include:\n\nalignment::Int=8: specify the number of bytes to align buffers to when written in messages; strongly recommended to only use alignment values of 8 or 64 for modern memory cache line optimization\ncolmetadata=nothing: the metadata that should be written as the table's columns' custom_metadata fields; must either be nothing or an AbstractDict of column_name::Symbol => column_metadata where column_metadata is an iterable of <:AbstractString pairs.\ndictencode::Bool=false: whether all columns should use dictionary encoding when being written; to dict encode specific columns, wrap the column/array in Arrow.DictEncode(col)\ndictencodenested::Bool=false: whether nested data type columns should also dict encode nested arrays/buffers; other language implementations may not support this\ndenseunions::Bool=true: whether Julia Vector{<:Union} arrays should be written using the dense union layout; passing false will result in the sparse union layout\nlargelists::Bool=false: causes list column types to be written with Int64 offset arrays; mainly for testing purposes; by default, Int64 offsets will be used only if needed\nmaxdepth::Int=6: deepest allowed nested serialization level; this is provided by default to prevent accidental infinite recursion with mutually recursive data structures\nmetadata=Arrow.getmetadata(tbl): the metadata that should be written as the table's schema's custom_metadata field; must either be nothing or an iterable of <:AbstractString pairs.\nntasks::Int: number of concurrent threaded tasks to allow while writing input partitions out as arrow record batches; default is no limit; to disable multithreaded writing, pass ntasks=1\nconvert::Bool: whether certain arrow primitive types in the schema of the file should be converted to Julia defaults for matching them to the schema of tbl; by default, convert=true.\nfile::Bool: applicable when an IO is provided, whether it is a file; by default file=false.\n\n\n\n\n\n","category":"function"},{"location":"reference/#Arrow.arrowtype","page":"API Reference","title":"Arrow.arrowtype","text":"Given a FlatBuffers.Builder and a Julia column or column eltype, write the field.type flatbuffer definition of the eltype\n\n\n\n\n\n","category":"function"},{"location":"reference/#Arrow.getmetadata-Tuple{Arrow.Table}","page":"API Reference","title":"Arrow.getmetadata","text":"Arrow.getmetadata(x)\n\nIf x isa Arrow.Table return a Base.ImmutableDict{String,String} representation of x's Schema custom_metadata, or nothing if no such metadata exists.\n\nIf x isa Arrow.ArrowVector, return a Base.ImmutableDict{String,String} representation of x's Field custom_metadata, or nothing if no such metadata exists.\n\nOtherwise, return nothing.\n\nSee the official Arrow documentation for more details on custom application metadata.\n\n\n\n\n\n","category":"method"},{"location":"reference/#Arrow.juliaeltype","page":"API Reference","title":"Arrow.juliaeltype","text":"Given a flatbuffers metadata type definition (a Field instance from Schema.fbs), translate to the appropriate Julia storage eltype\n\n\n\n\n\n","category":"function"},{"location":"reference/#Arrow.write","page":"API Reference","title":"Arrow.write","text":"Arrow.write(io::IO, tbl)\nArrow.write(file::String, tbl)\ntbl |> Arrow.write(io_or_file)\n\nWrite any Tables.jl-compatible tbl out as arrow formatted data. 
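For example, a brief sketch (the table literal is arbitrary):\n\nArrow.write(\"data.arrow\", (col1=[1, 2, 3],))  # \"file\" format\nio = IOBuffer()\nArrow.write(io, (col1=[1, 2, 3],))            # IPC \"streaming\" format\n\n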
Providing an io::IO argument will cause the data to be written to it in the \"streaming\" format, unless the file=true keyword argument is passed. Providing a file::String argument will result in the \"file\" format being written.\n\nMultiple record batches will be written based on the number of Tables.partitions(tbl) that are provided; by default, this is just one for a given table, but some table sources support automatic partitioning. Note you can turn multiple table objects into partitions by doing Tables.partitioner([tbl1, tbl2, ...]), but note that each table must have the exact same Tables.Schema.\n\nBy default, Arrow.write will use multiple threads to write multiple record batches simultaneously (e.g. if julia is started with julia -t 8 or the JULIA_NUM_THREADS environment variable is set).\n\nSupported keyword arguments to Arrow.write include:\n\ncolmetadata=nothing: the metadata that should be written as the table's columns' custom_metadata fields; must either be nothing or an AbstractDict of column_name::Symbol => column_metadata where column_metadata is an iterable of <:AbstractString pairs.\ncompress: possible values include :lz4, :zstd, or your own initialized LZ4FrameCompressor or ZstdCompressor objects; will cause all buffers in each record batch to use the respective compression encoding\nalignment::Int=8: specify the number of bytes to align buffers to when written in messages; strongly recommended to only use alignment values of 8 or 64 for modern memory cache line optimization\ndictencode::Bool=false: whether all columns should use dictionary encoding when being written; to dict encode specific columns, wrap the column/array in Arrow.DictEncode(col)\ndictencodenested::Bool=false: whether nested data type columns should also dict encode nested arrays/buffers; other language implementations may not support this\ndenseunions::Bool=true: whether Julia Vector{<:Union} arrays should be written using the dense union layout; passing false will result in the sparse union layout\nlargelists::Bool=false: causes list column types to be written with Int64 offset arrays; mainly for testing purposes; by default, Int64 offsets will be used only if needed\nmaxdepth::Int=6: deepest allowed nested serialization level; this is provided by default to prevent accidental infinite recursion with mutually recursive data structures\nmetadata=Arrow.getmetadata(tbl): the metadata that should be written as the table's schema's custom_metadata field; must either be nothing or an iterable of <:AbstractString pairs.\nntasks::Int: number of buffered threaded tasks to allow while writing input partitions out as arrow record batches; default is no limit; for unbuffered writing, pass ntasks=0\nfile::Bool=false: if an io argument is being written to, passing file=true will cause the arrow file format to be written instead of just IPC streaming\n\n\n\n\n\n","category":"function"},{"location":"","page":"Home","title":"Home","text":"<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. 
You may obtain a copy of the License at","category":"page"},{"location":"","page":"Home","title":"Home","text":"http://www.apache.org/licenses/LICENSE-2.0","category":"page"},{"location":"","page":"Home","title":"Home","text":"Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->","category":"page"},{"location":"#Arrow.jl","page":"Home","title":"Arrow.jl","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Pages = [\"manual.md\", \"reference.md\"]\nDepth = 3","category":"page"},{"location":"","page":"Home","title":"Home","text":"Arrow","category":"page"},{"location":"#Arrow","page":"Home","title":"Arrow","text":"Arrow.jl\n\nA pure Julia implementation of the Apache Arrow memory format specification.\n\nThis implementation supports the 1.0 version of the specification, including support for:\n\nAll primitive data types\nAll nested data types\nDictionary encodings, nested dictionary encodings, and messages\nExtension types\nStreaming, file, record batch, and replacement and isdelta dictionary messages\nBuffer compression/decompression via the standard LZ4 frame and Zstd formats\n\nIt currently doesn't include support for:\n\nTensors or sparse tensors\nFlight RPC\nC data interface\n\nThird-party data formats:\n\ncsv and parquet support via the existing CSV.jl and Parquet.jl packages\nOther Tables.jl-compatible packages automatically supported (DataFrames.jl, JSONTables.jl, JuliaDB.jl, SQLite.jl, MySQL.jl, JDBC.jl, ODBC.jl, XLSX.jl, etc.)\nNo current Julia packages support ORC or Avro data formats\n\nSee docs for official Arrow.jl API with the User Manual and reference docs for Arrow.Table, Arrow.write, and Arrow.Stream.\n\n\n\n\n\n","category":"module"}]
}