The Apache Arrow team is pleased to announce the 0.7.0 release. It includes 133 resolved JIRAs many new features and bug fixes to the various language implementations. The Arrow memory format remains stable since the 0.3.x release.
See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.
We include some highlights from the release in this post.
Since the last release we have added Kou to the Arrow Project Management Committee. He is also a PMC for Apache Subversion, and a major contributor to many other open source projects.
As an active member of the Ruby community in Japan, Kou has been developing the GLib-based C bindings for Arrow with associated Ruby wrappers, to enable Ruby users to benefit from the work that's happening in Apache Arrow.
We are excited to be collaborating with the Ruby community on shared infrastructure for in-memory analytics and data science.
Paul Taylor from the Falcor and ReactiveX projects has worked to expand the JavaScript implementation (which is written in TypeScript), using the latest in modern JavaScript build and packaging technology. We are looking forward to building out the JS implementation and bringing it up to full functionality with the C++ and Java implementations.
We are looking for more JavaScript developers to join the project and work together to make Arrow for JS work well with many kinds of front end use cases, like real time data visualization.
As part of longer-term efforts to build an Arrow-native in-memory analytics library, we implemented a variety of type conversion functions. These functions are essential in ETL tasks when conforming one table schema to another. These are similar to the astype
function in NumPy.
In [17]: import pyarrow as pa In [18]: arr = pa.array([True, False, None, True]) In [19]: arr Out[19]: <pyarrow.lib.BooleanArray object at 0x7ff6fb069b88> [ True, False, NA, True ] In [20]: arr.cast(pa.int32()) Out[20]: <pyarrow.lib.Int32Array object at 0x7ff6fb0383b8> [ 1, 0, NA, 1 ]
Over time these will expand to support as many input-and-output type combinations with optimized conversions.
To help with GPU-related projects using Arrow, like the GPU Open Analytics Initiative, we have started a C++ add-on library to simplify Arrow memory management on CUDA-enabled graphics cards. We would like to expand this to include a library of reusable CUDA kernel functions for GPU analytics on Arrow columnar memory.
For example, we could write a record batch from CPU memory to GPU device memory like so (some error checking omitted):
#include <arrow/api.h> #include <arrow/gpu/cuda_api.h> using namespace arrow; gpu::CudaDeviceManager* manager; std::shared_ptr<gpu::CudaContext> context; gpu::CudaDeviceManager::GetInstance(&manager) manager_->GetContext(kGpuNumber, &context); std::shared_ptr<RecordBatch> batch = GetCpuData(); std::shared_ptr<gpu::CudaBuffer> device_serialized; gpu::SerializeRecordBatch(*batch, context_.get(), &device_serialized));
We can then “read” the GPU record batch, but the returned arrow::RecordBatch
internally will contain GPU device pointers that you can use for CUDA kernel calls:
std::shared_ptr<RecordBatch> device_batch; gpu::ReadRecordBatch(batch->schema(), device_serialized, default_memory_pool(), &device_batch)); // Now run some CUDA kernels on device_batch
Phillip Cloud has been working on decimal support in C++ to enable Parquet read/write support in C++ and Python, and also end-to-end testing against the Arrow Java libraries.
In the upcoming releases, we hope to complete the remaining data types that need end-to-end testing between Java and C++:
Some highlights of Python development outside of bug fixes and general API improvements include:
put
and get
arbitrary Python objects in Plasma objectsflavor='spark'
option to pyarrow.parquet.write_table
to enable easy writing of Parquet files maximized for Spark compatibilityparquet.write_to_dataset
function with support for partitioned writesUpcoming Arrow releases will continue to expand the project to cover more use cases. In addition to completing end-to-end testing for all the major data types, some of us will be shifting attention to building Arrow-native in-memory analytics libraries.
We are looking for more JavaScript, R, and other programming language developers to join the project and expand the available implementations and bindings to more languages.