layout: default title: Apache Arrow 0.8.0 Release permalink: /release/0.8.0.html

Apache Arrow 0.8.0 (18 December 2017)

This is a major release.

Download

Contributors

$ git shortlog -sn apache-arrow-0.7.1..apache-arrow-0.8.0
    90  Wes McKinney
    23  Phillip Cloud
    21  Kouhei Sutou
    13  Licht-T
    12  Korn, Uwe
    12  Philipp Moritz
    12  Uwe L. Korn
    10  Bryan Cutler
     5  Li Jin
     5  Robert Nishihara
     4  Paul Taylor
     4  siddharth
     3  Max Risuhin
     3  Stephanie
     2  Rene Sugar
     2  Heimir Sverrisson
     2  Brian Hulette
     2  Yuliya Feldman
     2  dhirschf
     2  Matthias Vallentin
     1  vkorukanti
     1  Andrew Andrade
     1  Benjamin Goldberg
     1  Ivan Sadikov
     1  John Jenkins
     1  Joris Van den Bossche
     1  Lewis John McGibbney
     1  Lu Qi
     1  Manuel
     1  Nick White
     1  Ofek Lev
     1  Shixiong Zhu
     1  Siddharth Teotia
     1  Stephen G
     1  Victor Uriarte
     1  Wataru Shimizu
     1  ksdevlife
     1  lmeyerov
     1  rvernica
     1  Amir Malekpour

Patch Committers

The following Apache committers committed contributed patches to the repository.

$ git shortlog -csn apache-arrow-0.7.0..apache-arrow-0.8.0
   236  Wes McKinney
    35  Uwe L. Korn
    10  Philipp Moritz
     5  Kouhei Sutou
     1  Steven Phillips

Changelog

New Features and Improvements

  • ARROW-1032 - [JS] Support custom_metadata
  • ARROW-1047 - [Java] Add generalized stream writer and reader interfaces that are decoupled from IO / message framing
  • ARROW-1087 - [Python] add get_include to expose directory containing header files
  • ARROW-1114 - [C++] Create Record Batch Builder class as a reusable and efficient way to transpose row-by-row data to columns
  • ARROW-1134 - [C++] Allow C++/CLI projects to build with Arrow
  • ARROW-1178 - [Python] Create alternative to Table.from_pandas that yields a list of RecordBatch objects with a given chunk size
  • ARROW-1226 - [C++] Improve / correct doxygen function documentation in arrow::ipc
  • ARROW-1250 - [Python] Define API for user type checking of array types
  • ARROW-1369 - Support boolean types in the javascript arrow reader library
  • ARROW-1371 - [Website] Add “Powered By” page to the website
  • ARROW-1455 - [Python] Add Dockerfile for validating Dask integration outside of usual CI
  • ARROW-1471 - [JAVA] Document requirements and non/requirements for ValueVector updates
  • ARROW-1472 - [JAVA] Design updated ValueVector Object Hierarchy
  • ARROW-1473 - [JAVA] Create Prototype Code Hierarchy (Implementation Phase 1)
  • ARROW-1474 - [JAVA] ValueVector hierarchy (Implementation Phase 2)
  • ARROW-1476 - [JAVA] Implement final ValueVector updates
  • ARROW-1482 - [C++] Implement casts between date32 and date64
  • ARROW-1483 - [C++] Implement casts between time32 and time64
  • ARROW-1484 - [C++] Implement (safe and unsafe) casts between timestamps and times of different units
  • ARROW-1486 - [C++] Decide if arrow::RecordBatch needs to be copyable
  • ARROW-1487 - [C++] Implement casts from List<A> to List<B>, where a cast function is defined from any A to B
  • ARROW-1488 - [C++] Implement ArrayBuilder::Finish in terms of internal::ArrayData
  • ARROW-1498 - [GitHub] Add CONTRIBUTING.md and ISSUE_TEMPLATE.md
  • ARROW-1503 - [Python] Add serialization callbacks for pandas objects in pyarrow.serialize
  • ARROW-1522 - [C++] Support pyarrow.Buffer as built-in type in pyarrow.serialize
  • ARROW-1523 - [C++] Add helper data struct with methods for reading a validity bitmap possibly having a non-zero offset
  • ARROW-1524 - [C++] More graceful solution for handling non-zero offsets on inputs and outputs in compute library
  • ARROW-1525 - [C++] Change functions in arrow/compare.h to not return Status
  • ARROW-1526 - [Python] Unit tests to exercise code path in PARQUET-1100
  • ARROW-1535 - [Python] Enable sdist source tarballs to build assuming that Arrow C++ libraries are available on the host system
  • ARROW-1538 - [C++] Support Ubuntu 14.04 in .deb packaging automation
  • ARROW-1539 - [C++] Remove functions deprecated as of 0.7.0 and prior releases
  • ARROW-1556 - [C++] Incorporate AssertArraysEqual function from PARQUET-1100 patch
  • ARROW-1559 - [C++] Kernel implementations for “unique” (compute distinct elements of array)
  • ARROW-1573 - [C++] Implement stateful kernel function that uses DictionaryBuilder to compute dictionary indices
  • ARROW-1575 - [Python] Add pyarrow.column factory function
  • ARROW-1577 - [JS] Package release script for NPM modules
  • ARROW-1588 - [C++/Format] Harden Decimal Format
  • ARROW-1593 - [PYTHON] serialize_pandas should pass through the preserve_index keyword
  • ARROW-1594 - [Python] Enable multi-threaded conversions in Table.from_pandas
  • ARROW-1600 - [C++] Zero-copy Buffer constructor from std::string
  • ARROW-1602 - [C++] Add IsValid/IsNotNull method to arrow::Array
  • ARROW-1603 - [C++] Add BinaryArray method to get a value as a std::string
  • ARROW-1604 - [Python] Support common type aliases in cast(...) and various type= arguments
  • ARROW-1605 - [Python] pyarrow.array should be able to yield smaller integer types without an explicit cast
  • ARROW-1607 - [C++] Implement DictionaryBuilder for Decimals
  • ARROW-1613 - [Java] ArrowReader should not close the input ReadChannel
  • ARROW-1616 - [Python] Add “write” method to RecordBatchStreamWriter that dispatches to write_table/write_back as appropriate
  • ARROW-1626 - Add make targets to run the inter-procedural static analysis tool called “infer”.
  • ARROW-1627 - [JAVA] Reduce heap usage(Phase 2) - memory footprint in AllocationManager.BufferLedger
  • ARROW-1630 - [Serialization] Support Python datetime objects
  • ARROW-1631 - [C++] Add GRPC to ThirdpartyToolchain.cmake
  • ARROW-1635 - Add release management guide for PMCs
  • ARROW-1637 - [C++] IPC round-trip for null type
  • ARROW-1641 - [C++] Do not include in public headers
  • ARROW-1648 - C++: Add cast from Dictionary[NullType] to NullType
  • ARROW-1649 - C++: Print number of nulls in PrettyPrint for NullArray
  • ARROW-1651 - [JS] Lazy row accessor in Table
  • ARROW-1652 - [JS] Separate Vector into BatchVector and CompositeVector
  • ARROW-1654 - [Python] pa.DataType cannot be pickled
  • ARROW-1662 - Move OSX Dependency management into brew bundle Brewfiles
  • ARROW-1665 - [Serialization] Support more custom datatypes in the default serialization context
  • ARROW-1666 - [GLib] Enable gtk-doc on Travis CI Mac environment
  • ARROW-1667 - [GLib] Support Meson
  • ARROW-1671 - [C++] Change arrow::MakeArray to not return Status
  • ARROW-1675 - [Python] Use RecordBatch.from_pandas in FeatherWriter.write
  • ARROW-1677 - [Blog] Add blog post on Ray and Arrow Python serialization
  • ARROW-1679 - [GLib] Add garrow_record_batch_reader_read_next()
  • ARROW-1683 - [Python] Restore “TimestampType” to pyarrow namespace
  • ARROW-1684 - [Python] Simplify user API for reading nested Parquet columns
  • ARROW-1685 - [GLib] Add GArrowTableReader
  • ARROW-1689 - [Python] Categorical Indices Should Be Zero-Copy
  • ARROW-1690 - [GLib] Add garrow_array_is_valid()
  • ARROW-1691 - [Java] Conform Java Decimal type implementation to format decisions in ARROW-1588
  • ARROW-1697 - [GitHub] Add ISSUE_TEMPLATE.md
  • ARROW-1701 - [Serialization] Support zero copy PyTorch Tensor serialization
  • ARROW-1702 - Update jemalloc in manylinux1 build
  • ARROW-1703 - [C++] Vendor exact version of jemalloc we depend on
  • ARROW-1707 - Update dev README after movement to GitBox
  • ARROW-1710 - [Java] Remove non-nullable vectors in new vector class hierarchy
  • ARROW-1716 - [Format/JSON] Use string integer value for Decimals in JSON
  • ARROW-1717 - [Java] Remove public static helper method in vector classes for JSONReader/Writer
  • ARROW-1718 - [Python] Implement casts from timestamp to date32/date64 and support in Array.from_pandas
  • ARROW-1719 - [Java] Remove accessor/mutator
  • ARROW-1721 - [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
  • ARROW-1724 - [Packaging] Support Ubuntu 17.10
  • ARROW-1725 - [Packaging] Upload .deb for Ubuntu 17.10
  • ARROW-1726 - [GLib] Add setup description to verify C GLib build
  • ARROW-1727 - [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries
  • ARROW-1728 - [C++] Run clang-format checks in Travis CI
  • ARROW-1734 - C++/Python: Add cast function on Column-level
  • ARROW-1736 - [GLib] Add GArrowCastOptions:allow-time-truncate
  • ARROW-1737 - [GLib] Use G_DECLARE_DERIVABLE_TYPE
  • ARROW-1746 - [Python] Add build dependencies for Arch Linux
  • ARROW-1747 - [C++] Don't export symbols of statically linked libraries
  • ARROW-1748 - [GLib] Add GArrowRecordBatchBuilder
  • ARROW-1750 - [C++] Remove the need for arrow/util/random.h
  • ARROW-1752 - [Packaging] Add GPU packages for Debian and Ubuntu
  • ARROW-1753 - [Python] Provide for matching subclasses with register_type in serialization context
  • ARROW-1755 - [C++] Add build options for MSVC to use static runtime libraries
  • ARROW-1758 - [Python] Remove pickle=True option for object serialization
  • ARROW-1763 - [Python] DataType should be hashable
  • ARROW-1765 - [Doc] Use dependencies from conda in C++ docker build
  • ARROW-1767 - [C++] Support file reads and writes over 2GB on Windows
  • ARROW-1772 - [C++] Add public-api-test module in style of parquet-cpp
  • ARROW-1773 - [C++] Add casts from date/time types to compatible signed integers
  • ARROW-1775 - Ability to abort created but unsealed Plasma objects
  • ARROW-1777 - [C++] Add static ctor ArrayData::Make for nicer syntax in places
  • ARROW-1779 - [Java] Integration test breaks without zeroing out validity vectors
  • ARROW-1782 - [Python] Expose compressors as pyarrow.compress, pyarrow.decompress
  • ARROW-1783 - [Python] Convert SerializedPyObject to/from sequence of component buffers with minimal memory allocation / copying
  • ARROW-1784 - [Python] Read and write pandas.DataFrame in pyarrow.serialize by decomposing the BlockManager rather than coercing to Arrow format
  • ARROW-1785 - [Format/C++/Java] Remove VectorLayout metadata from Flatbuffers metadata
  • ARROW-1787 - [Python] Support reading parquet files into DataFrames in a backward compatible way
  • ARROW-1794 - [C++/Python] Rename DecimalArray to Decimal128Array
  • ARROW-1801 - [Docs] Update install instructions to use red-data-tools repos
  • ARROW-1802 - [GLib] Add Arrow GPU support
  • ARROW-1806 - [GLib] Add garrow_record_batch_writer_write_table()
  • ARROW-1808 - [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns
  • ARROW-1809 - [GLib] Use .xml instead of .sgml for GTK-Doc main file
  • ARROW-1810 - [Plasma] Remove test shell scripts
  • ARROW-1817 - Configure JsonFileReader to read NaN for floats
  • ARROW-1818 - Examine Java Dependencies
  • ARROW-1819 - [Java] Remove legacy vector classes
  • ARROW-1826 - [JAVA] Avoid branching at cell level (copyFrom)
  • ARROW-1827 - [Java] Add checkstyle config file and header file
  • ARROW-1828 - [C++] Implement hash kernel specialization for BooleanType
  • ARROW-1834 - [Doc] Build documentation in separate build folders
  • ARROW-1838 - [C++] Use compute::Datum uniformly for input argument to kernels
  • ARROW-1841 - [JS] Update text-encoding-utf-8 and tslib for node ESModules support
  • ARROW-1844 - [C++] Basic benchmark suite for hash kernels
  • ARROW-1849 - [GLib] Add input checks to GArrowRecordBatch
  • ARROW-1850 - [C++] Use const void* in Writable::Write instead of const uint8_t*
  • ARROW-1854 - [Python] Improve performance of serializing object dtype ndarrays
  • ARROW-1855 - [GLib] Add workaround for build failure on macOS
  • ARROW-1857 - [Python] Add switch for boost linkage with static parquet in wheels
  • ARROW-1859 - [GLib] Add GArrowDictionaryDataType
  • ARROW-1862 - [GLib] Add GArrowDictionaryArray
  • ARROW-1864 - [Java] Upgrade Netty to 4.1.x
  • ARROW-1867 - [Java] Add BitVector APIs from old vector class
  • ARROW-1874 - [GLib] Add garrow_array_unique()
  • ARROW-1878 - [GLib] Add garrow_array_dictionary_encode()
  • ARROW-1884 - [C++] Make JsonReader/JsonWriter classes internal APIs
  • ARROW-1885 - [Java] Restore previous MapVector class names
  • ARROW-1901 - [Python] Support recursive mkdir for DaskFilesystem
  • ARROW-1902 - [Python] Remove mkdir race condition from write_to_dataset
  • ARROW-1905 - [Python] Add more functions for checking exact types in pyarrow.types
  • ARROW-1911 - Add Graphistry to Arrow JS proof points
  • ARROW-480 - [Python] Add accessors for Parquet column statistics
  • ARROW-504 - [Python] Add adapter to write pandas.DataFrame in user-selected chunk size to streaming format
  • ARROW-507 - [C++/Python] Construct List container from offsets and values subarrays
  • ARROW-541 - [JS] Implement JavaScript-compatible implementation
  • ARROW-571 - [Python] Add APIs to build Parquet files incrementally from Arrow tables
  • ARROW-587 - Add JIRA fix version to merge tool
  • ARROW-609 - [C++] Function for casting from days since UNIX epoch to int64 date
  • ARROW-838 - [Python] Efficient construction of arrays from non-pandas 1D NumPy arrays
  • ARROW-905 - [Docs] Add Dockerfile for reproducible documentation generation
  • ARROW-942 - Support integration testing on Python 2.7
  • ARROW-950 - [Site] Add Google Analytics tag
  • ARROW-972 - [Python] Add test cases and basic APIs for UnionArray

Bug Fixes

  • ARROW-1282 - Large memory reallocation by Arrow causes hang in jemalloc
  • ARROW-1341 - [C++] Deprecate arrow::MakeTable in favor of new ctor from ARROW-1334
  • ARROW-1347 - [JAVA] List null type should use consistent name for inner field
  • ARROW-1398 - [Python] No support reading columns of type decimal(19,4)
  • ARROW-1409 - [Format] Use for “page” attribute in Buffer in metadata
  • ARROW-1540 - [C++] Fix valgrind warnings in cuda-test if possible
  • ARROW-1541 - [C++] Race condition with arrow_gpu
  • ARROW-1543 - [C++] row_wise_conversion example doesn't correspond to ListBuilder constructor arguments
  • ARROW-1549 - [JS] Integrate auto-generated Arrow test files
  • ARROW-1555 - [Python] write_to_dataset on s3
  • ARROW-1584 - [PYTHON] serialize_pandas on empty dataframe
  • ARROW-1585 - serialize_pandas round trip fails on integer columns
  • ARROW-1586 - [PYTHON] serialize_pandas roundtrip loses columns name
  • ARROW-1609 - Plasma: Build fails with Xcode 9.0
  • ARROW-1615 - CXX flags for development more permissive than Travis CI builds
  • ARROW-1617 - [Python] Do not use symlinks in python/cmake_modules
  • ARROW-1620 - Python: Download Boost in manylinux1 build from bintray
  • ARROW-1624 - [C++] Follow up fixes / tweaks to compiler warnings for Plasma / LLVM 4.0, add to readme
  • ARROW-1625 - [Serialization] Support OrderedDict properly
  • ARROW-1629 - [C++] Fix problematic code paths identified by infer tool
  • ARROW-1633 - [Python] numpy “unicode” arrays not understood
  • ARROW-1640 - Resolve OpenSSL issues in Travis CI
  • ARROW-1647 - [Plasma] Potential bug when reading/writing messages.
  • ARROW-1653 - [Plasma] Use static cast to avoid compiler warning.
  • ARROW-1656 - [C++] Endianness Macro is Incorrect on Windows And Mac
  • ARROW-1657 - [C++] Multithreaded Read Test Failing on Arch Linux
  • ARROW-1658 - [Python] Out of bounds dictionary indices causes segfault after converting to pandas
  • ARROW-1663 - [Java] Follow up on ARROW-1347 and make schema backward compatible
  • ARROW-1670 - [Python] Speed up deserialization code path
  • ARROW-1672 - [Python] Failure to write Feather bytes column
  • ARROW-1673 - [Python] NumPy boolean arrays get converted to uint8 arrays on NdarrayToTensor roundtrip
  • ARROW-1676 - [C++] Correctly truncate oversized validity bitmaps when writing Feather format
  • ARROW-1678 - [Python] Incorrect serialization of numpy.float16
  • ARROW-1680 - [Python] Timestamp unit change not done in from_pandas() conversion
  • ARROW-1686 - Documentation generation script creates “apidocs” directory under site/java
  • ARROW-1693 - [JS] Error reading dictionary-encoded integration test files
  • ARROW-1695 - [Serialization] Fix reference counting of numpy arrays created in custom serialializer
  • ARROW-1698 - [JS] File reader attempts to load the same dictionary batch more than once
  • ARROW-1704 - [GLib] Go example in test suite is broken
  • ARROW-1708 - [JS] Linter problem breaks master build
  • ARROW-1709 - [C++] Decimal.ToString is incorrect for negative scale
  • ARROW-1711 - [Python] flake8 checks still not failing builds
  • ARROW-1714 - [Python] No named pd.Series name serialized as u‘None’
  • ARROW-1720 - [Python] Segmentation fault while trying to access an out-of-bound chunk
  • ARROW-1723 - Windows: __declspec(dllexport) specified when building arrow static library
  • ARROW-1730 - [Python] Incorrect result from pyarrow.array when passing timestamp type
  • ARROW-1732 - [Python] RecordBatch.from_pandas fails on DataFrame with no columns when preserve_index=False
  • ARROW-1735 - [C++] Cast kernels cannot write into sliced output array
  • ARROW-1738 - [Python] Wrong datetime conversion when pa.array with unit
  • ARROW-1739 - [Python] Fix usages of assertRaises causing broken build
  • ARROW-1742 - C++: clang-format is not detected correct on OSX anymore
  • ARROW-1743 - [Python] Table to_pandas fails when index contains categorical column
  • ARROW-1745 - Compilation failure on Mac OS in plasma tests
  • ARROW-1749 - [C++] Handle range of Decimal128 values that require 39 digits to be displayed
  • ARROW-1751 - [Python] Pandas 0.21.0 introduces a breaking API change for MultiIndex construction
  • ARROW-1754 - [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
  • ARROW-1756 - [Python] Observed int32 overflow in Feather write/read path
  • ARROW-1762 - [C++] unittest failure for language environment
  • ARROW-1764 - [Python] Add -c conda-forge for Windows dev installation instructions
  • ARROW-1766 - [GLib] Fix failing builds on OSX
  • ARROW-1768 - [Python] Fix suppressed exception in ParquetWriter.del
  • ARROW-1770 - [GLib] Fix GLib compiler warning
  • ARROW-1771 - [C++] ARROW-1749 Breaks Public API test in parquet-cpp
  • ARROW-1776 - [C++[ arrow::gpu::CudaContext::bytes_allocated() isn't defined
  • ARROW-1778 - [Python] Link parquet-cpp statically, privately in manylinux1 wheels
  • ARROW-1781 - [CI] OSX Builds on Travis-CI time out often
  • ARROW-1788 - Plasma store crashes when trying to abort objects for disconnected client
  • ARROW-1791 - Integration tests generate date[DAY] values outside of reasonable range
  • ARROW-1793 - [Integration] fix a typo for README.md
  • ARROW-1800 - [C++] Fix and simplify random_decimals
  • ARROW-1805 - [Python] ignore non-parquet files when exploring dataset
  • ARROW-1811 - [C++/Python] Rename all Decimal based APIs to Decimal128
  • ARROW-1812 - Plasma store modifies hash table while iterating during client disconnect
  • ARROW-1821 - Add integration test case to explicitly check for optional validity buffer
  • ARROW-1829 - [Plasma] Clean up eviction policy bookkeeping
  • ARROW-1830 - [Python] Error when loading all the files in a dictionary
  • ARROW-1836 - [C++] Fix C4996 warning from arrow/util/variant.h on MSVC builds
  • ARROW-1839 - [C++/Python] Add Decimal Parquet Read/Write Tests
  • ARROW-1840 - [Website] The installation command failed on Windows10 anaconda environment.
  • ARROW-1845 - [Python] Expose Decimal128Type
  • ARROW-1852 - [Plasma] Make retrieving manager file descriptor const
  • ARROW-1853 - [Plasma] Fix off-by-one error in retry processing
  • ARROW-1863 - [Python] PyObjectStringify could render bytes-like output for more types of objects
  • ARROW-1865 - [C++] Adding a column to an empty Table fails
  • ARROW-1869 - Fix typo in LowCostIdentityHashMap
  • ARROW-1871 - [Python/C++] Appending Python Decimals with different scales requires rescaling
  • ARROW-1873 - [Python] Segmentation fault when loading total 2GB of parquet files
  • ARROW-1877 - Incorrect comparison in JsonStringArrayList.equals
  • ARROW-1879 - [Python] Dask integration tests are not skipped if dask is not installed
  • ARROW-1881 - [Python] setuptools_scm picks up JS version tags
  • ARROW-1882 - [C++] Reintroduce DictionaryBuilder
  • ARROW-1883 - [Python] BUG: Table.to_pandas metadata checking fails if columns are not present
  • ARROW-1889 - [Python] --exclude is not available in older git versions
  • ARROW-1890 - [Python] Masking for date32 arrays not working
  • ARROW-1891 - [Python] NaT date32 values are only converted to nulls if from_pandas is used
  • ARROW-1892 - [Python] Unknown list item type: binary
  • ARROW-1893 - [Python] test_primitive_serialization fails on Python 2.7.3
  • ARROW-1895 - [Python] Add field_name to pandas index metadata
  • ARROW-1897 - [Python] Incorrect numpy_type for pandas metadata of Categoricals
  • ARROW-1904 - [C++] Deprecate PrimitiveArray::raw_values
  • ARROW-1906 - [Python] Creating a pyarrow.Array with timestamp of different unit is not casted
  • ARROW-1908 - [Python] Construction of arrow table from pandas DataFrame with duplicate column names crashes
  • ARROW-1910 - CPP README Brewfile link incorrect
  • ARROW-1914 - [C++] make -j may fail to build with -DARROW_GPU=on
  • ARROW-1915 - [Python] Parquet tests should be optional
  • ARROW-1916 - [Java] Do not exclude java/dev/checkstyle from source releases
  • ARROW-1917 - [GLib] Must set GI_TYPELIB_PATH in verify-release-candidate.sh
  • ARROW-226 - [C++] libhdfs: feedback to help determining cause of failure in opening file path
  • ARROW-641 - [C++] Do not build/run io-hdfs-test if ARROW_HDFS=off