commit	6d0ac4946f062414e2b60aa3d67c2875bb2e7958
author	Jacob Quinn <quinn.jacobd@gmail.com>	Thu Nov 03 10:19:32 2022 -0600
committer	GitHub <noreply@github.com>	Thu Nov 03 10:19:32 2022 -0600
tree	c9f38929209a76c927f2de4ab04eb9269341f129
parent	0be54a9799ca3346737d547ee79aa3a2c28a2a96

Ensure Julia types have alignment respected (#357)

Fixes https://github.com/apache/arrow-julia/issues/345.

Alternative to https://github.com/apache/arrow-julia/pull/350.

Whew, a bit of a rabbit hole here. It turns out this issue doesn't have
anything to do w/ the _arrow_ alignment when writing, but actually goes
back to a core Julia issue, specifically the inconsistency reported in
https://github.com/JuliaLang/julia/issues/42326. Essentially, different
platforms (Intel vs. arm64) require different alignments for types like
`UInt128` (8 bytes on Intel, 16 bytes on arm64).

And it turns out this isn't even a Julia issue, but all the way back to LLVM.

So why does this crop up in Arrow.jl? Because when you try to serialize/deserialize
an `Arrow.Primitive{Int128}` array, we're going to write it out in the proper arrow
format, but when you read it back in, we've been using the zero-copy technique of:

```julia
unsafe_wrap(Array, Ptr{Int128}(arrow_ptr), len)
```
in order to give the user a Julia `Vector{Int128}` backed by the underlying arrow data.
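
As a rough illustration of what "zero-copy" means here, the sketch below wraps a plain byte vector as `Int64` values using only Base functions. The `bytes` buffer is a stand-in for arrow data and is not how Arrow.jl actually obtains its memory; `Int64` is used instead of `Int128` so the example does not depend on 16-byte alignment.

```julia
# Stand-in for a buffer of arrow data (assumption: any Vector{UInt8} will do here).
bytes = zeros(UInt8, 32)
bytes[1:8] .= 0xff   # set every bit in the first 8 bytes

first_value, same_memory = GC.@preserve bytes begin
    # Reinterpret the same memory as Int64 values; nothing is copied.
    ints = unsafe_wrap(Array, convert(Ptr{Int64}, pointer(bytes)), 4)
    # All bits set reads as -1 regardless of byte order, and the wrapped
    # array starts at the very same address as the byte vector.
    (ints[1], pointer(ints) == convert(Ptr{Int64}, pointer(bytes)))
end
```

The wrapped array does not own its memory, which is why `bytes` is kept alive with `GC.@preserve` while it is in use.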

_BUT_, in Julia when you allocate any `Vector` with an eltype size of less than
4, only 8 bytes of alignment are specified. Most users will pass arrow data as a
file (which is mmapped as `Vector{UInt8}`), a byte vector directly (`Vector{UInt8}`),
or an IO (which we read as `Vector{UInt8}`), so these vectors are only 8-byte
aligned. This then throws the fatal error reported in the original #345 issue
about the pointer not being 16-byte aligned.
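
To make the alignment condition concrete, here is a minimal sketch of how a pointer's alignment can be checked. `isaligned` is a hypothetical helper written for this note, not something Arrow.jl or Base provides.

```julia
# Hypothetical helper: is `ptr` aligned to `n` bytes? `n` must be a power of two.
isaligned(ptr::Ptr, n::Integer) = (UInt(ptr) & (UInt(n) - 1)) == 0

bytes = zeros(UInt8, 64)
GC.@preserve bytes begin
    base = pointer(bytes)
    # Bytes from `base` to the next 16-byte boundary (0 if already on one).
    pad = (16 - UInt(base) % 16) % 16
    println(isaligned(base + pad, 16))      # true: exactly on a 16-byte boundary
    println(isaligned(base + pad + 8, 16))  # false: 8 bytes past the boundary
    println(isaligned(base + pad + 8, 8))   # true: still fine for 8-byte types
end
```

An address like the second one is acceptable for `Int64` but not for a type that needs 16-byte alignment, which is the situation described above.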

So one option to consider is allowing users to pass in a 16-byte aligned arrow "source"
and then that would just work, right? Well, except Julia doesn't expose any way of
"upcasting" an array's alignment, so it's purely based off the array eltype. Which in
turn is a struggle because the current Arrow.jl architecture assumes the source will be
exactly a `Vector{UInt8}` everywhere. So essentially it requires some herculean effort
to try and pass your own aligned array; the only option I can think of is if you
did something like:
* Mmap an arrow file into `Vector{UInt8}`
* Allocate your own `Vector{UInt32}` and copy arrow data into it
* Use `unsafe_wrap` to make a `Vector{UInt8}` of the `Vector{UInt32}` data
* But you _MUST_ keep a reference to the original `Vector{UInt32}` array; otherwise, your new `Vector{UInt8}` gets corrupted
* Pass the new `Vector{UInt8}` into `Arrow.Table` to read
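
The steps above could look roughly like the following sketch. The `raw` vector is a stand-in for the mmapped file contents, and the names `raw`, `backing`, and `aligned` are made up for illustration; the final call into `Arrow.Table` is left as a comment since it needs real arrow data.

```julia
# Stand-in for the mmapped arrow bytes (assumption: real code would use the file contents).
raw = UInt8.(1:100)

# Backing storage with a wider eltype; round the length up to whole UInt32s.
backing = Vector{UInt32}(undef, cld(length(raw), sizeof(UInt32)))

GC.@preserve raw backing begin
    # Copy the arrow bytes into the backing storage, byte by byte.
    unsafe_copyto!(convert(Ptr{UInt8}, pointer(backing)), pointer(raw), length(raw))
end

# View the backing storage as bytes. `aligned` does not own this memory, so
# `backing` must stay reachable for as long as `aligned` is in use.
aligned = unsafe_wrap(Array, convert(Ptr{UInt8}, pointer(backing)), length(raw))

# tbl = Arrow.Table(aligned)   # hand the byte vector to Arrow.jl
```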

Anyway, not easy to say the least.

The proposed solution here is as follows:
* We check `sizeof(T)` to see if it's >= 16 bytes, which we're using as a proxy
for the alignment, regardless of platform (core devs are leaning towards the 8-byte
alignment on Intel actually being a bug for 16-byte primitives and I agree)
* We use `sizeof` since the `jl_datatype_t` alignment field isn't part of the public Julia
API and thus subject to change (and in fact I think it did just change location in Julia#master)
* It's a good enough proxy for our purposes anyway
* If the original arrow pointer _isn't_ 16-byte aligned, then we'll allocate a new `Vector{T}`,
which _will_ be aligned, then copy the arrow data directly into it via pointers. Simple, easy,
just one extra allocation/copy.
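
A minimal sketch of that idea, using only Base functions. `wrap_or_copy` is a hypothetical name and this is not the actual code in `src/table.jl`; the sketch treats any type of 16 bytes or more as needing 16-byte alignment so that `Int128` itself qualifies.

```julia
# Hypothetical helper, not the actual Arrow.jl code: expose `len` elements of
# type `T` starting at `ptr`, copying only when a 16-byte-or-larger `T` sits
# at an address that is not 16-byte aligned.
function wrap_or_copy(::Type{T}, ptr::Ptr{UInt8}, len::Integer) where {T}
    if sizeof(T) >= 16 && (UInt(ptr) & 0x0f) != 0
        out = Vector{T}(undef, len)   # fresh allocation, aligned per the description above
        GC.@preserve out begin
            # Raw byte copy from the arrow buffer into the new vector.
            unsafe_copyto!(convert(Ptr{UInt8}, pointer(out)), ptr, len * sizeof(T))
        end
        return out
    end
    # Aligned (or small) case: keep the zero-copy path.
    return unsafe_wrap(Array, convert(Ptr{T}, ptr), len)
end
```

The caller is still responsible for keeping the source buffer alive whenever the zero-copy branch is taken.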

If Julia _does_ get the ability in the future to specify a custom larger-than-eltype-required alignment
for arrays, then we could potentially do that ourselves when reading, but it's a little tricky because
we really only need to do that if there are 16-byte primitives we'll be deserializing and we don't know that
until we read a schema message. So *shrug*.

src/table.jl

1 file changed

README.md

Arrow

This is a pure Julia implementation of the Apache Arrow data standard. This package provides Julia AbstractVector objects for referencing data that conforms to the Arrow standard. This allows users to seamlessly interface Arrow formatted data with a great deal of existing Julia code.

Please see this document for a description of the Arrow memory layout.

Installation

The package can be installed by typing in the following in a Julia REPL:

julia> using Pkg; Pkg.add("Arrow")

Or, to use the official Apache code that follows the official Apache release process, you can do:

julia> using Pkg; Pkg.add(url="https://github.com/apache/arrow", subdir="julia/Arrow.jl")

Local Development

When developing on Arrow.jl it is recommended that you run the following to ensure that any changes to ArrowTypes.jl are immediately available to Arrow.jl without requiring a release:

julia --project -e 'using Pkg; Pkg.develop(path="src/ArrowTypes")'

Format Support

This implementation supports the 1.0 version of the specification, including support for:

All primitive data types
All nested data types
Dictionary encodings and messages
Extension types
Streaming, file, record batch, and replacement and isdelta dictionary messages

It currently doesn't include support for:

Tensors or sparse tensors
Flight RPC
C data interface

Third-party data formats:

CSV, Parquet, and Avro support via the existing CSV.jl, Parquet.jl, and Avro.jl packages
Other Tables.jl-compatible packages automatically supported (DataFrames.jl, JSONTables.jl, JuliaDB.jl, SQLite.jl, MySQL.jl, JDBC.jl, ODBC.jl, XLSX.jl, etc.)
No current Julia packages support ORC

See the full documentation for details on reading and writing arrow data.