The Arrow community would like to introduce version 1.0.0 of the Arrow Database Connectivity (ADBC) specification. ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. Or in other words: ADBC is a single API for getting Arrow data in and out of different databases.
Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. That way, they can code to the same API regardless of the underlying database, saving on development time. Roughly speaking, when an application executes a query with these APIs:
When columnar data comes into play, however, problems arise. JDBC is a row-oriented API, and while ODBC can support columnar data, its type system and data representation are not a perfect match for Arrow. So generally, columnar data must be converted to rows in step 5, spending resources without performing “useful” work.
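To make that overhead concrete, here is a minimal sketch of what a columnar client ends up doing on top of a row-oriented API. It uses Python's built-in `sqlite3` module as a stand-in for a row-oriented interface (it is not JDBC or ODBC), and the `rows_to_columns` helper and the `customer` table contents are hypothetical names chosen for illustration:

```python
import sqlite3


def rows_to_columns(cursor):
    """Pivot row tuples from a DBAPI cursor into a dict of column lists."""
    names = [d[0] for d in cursor.description]
    columns = {name: [] for name in names}
    for row in cursor:  # the API hands back one Python tuple per row
        for name, value in zip(names, row):
            columns[name].append(value)  # per-value work just to rebuild columns
    return columns


# Hypothetical in-memory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

cur = conn.execute("SELECT id, name FROM customer ORDER BY id")
columns = rows_to_columns(cur)
conn.close()
```

Every value passes through a per-row, per-column loop before the client has columns again. When the database was columnar to begin with, that work only undoes a conversion the row-oriented API forced.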
This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. Otherwise, they're leaving performance on the table. At the same time, that conversion isn't always avoidable. Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them.
Developers have a few options:
As is, clients must choose between tedious integration work and leaving performance on the table. We can make this better.
ADBC is an Arrow-based, vendor-neutral API for interacting with databases. Applications that use ADBC simply receive Arrow data. They don't have to do any conversions themselves, and they don't have to integrate each database's specific SDK.
Just like JDBC/ODBC, underneath the ADBC API are drivers that translate the API for specific databases.
The application only deals with one API, and only works with Arrow data.
ADBC API and driver implementations are under development. For example, in Python, the ADBC packages offer a familiar [DBAPI 2.0 (PEP 249)][pep-249]-style interface, with extensions to get Arrow data. We can get Arrow data out of PostgreSQL easily:
```python
import adbc_driver_postgresql.dbapi

uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM customer")
        table = cur.fetch_arrow_table()
        # Process the results
```
Or SQLite:
```python
import adbc_driver_sqlite.dbapi

uri = "file:mydb.sqlite"
with adbc_driver_sqlite.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM customer")
        table = cur.fetch_arrow_table()
        # Process the results
```
Note: implementations are still under development. See the documentation for up-to-date examples.
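Since the two examples above differ only in the driver module and the URI, the shared part can be factored out. The following is a rough sketch rather than part of the ADBC packages: `fetch_arrow` is a hypothetical helper, and it assumes only the calls already shown above (`connect(uri)`, `cursor()`, `execute()`, and `fetch_arrow_table()`).

```python
from typing import Any


def fetch_arrow(dbapi_module: Any, uri: str, query: str) -> Any:
    """Run `query` through an ADBC DBAPI-style module and return Arrow data.

    Hypothetical helper: `dbapi_module` is assumed to expose `connect(uri)`
    the way `adbc_driver_postgresql.dbapi` and `adbc_driver_sqlite.dbapi`
    do in the examples above.
    """
    with dbapi_module.connect(uri) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            # The result is fetched before the cursor and connection close.
            return cur.fetch_arrow_table()
```

Under those assumptions, switching databases means passing a different module and URI, for example `fetch_arrow(adbc_driver_sqlite.dbapi, "file:mydb.sqlite", "SELECT * FROM customer")`, while the application code that consumes the Arrow table stays the same.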
ADBC fills a specific niche that related projects do not address. It is both:
ADBC doesn't intend to replace JDBC or ODBC in general. But for applications that just want bulk columnar data access, ADBC lets them avoid data conversion overhead and tedious integration work.
Similarly, within the Arrow project, ADBC does not replace Flight SQL, but instead complements it. ADBC is an API that lets clients work with different databases easily. Meanwhile, Flight SQL is a wire protocol that database servers can implement to simultaneously support ADBC, [JDBC][flight-sql-jdbc], and ODBC users.
ADBC works as part of the Arrow ecosystem to “cover the bases” for database interaction:
To start using ADBC, see the documentation for build instructions and a short tutorial. (A formal release of the packages is still under way.) If you're interested in learning more or contributing, please reach out on the [mailing list][dev@arrow.apache.org] or on GitHub Issues.
ADBC was only possible with the help and involvement of several Arrow community members and projects. In particular, we would like to thank members of the [DuckDB project][duckdb] and the [R DBI project][dbi], who constructed prototypes based on early revisions of the standard and provided feedback on the design. And ADBC builds on existing Arrow projects, including the [Arrow C Data Interface][c-data-interface] and [nanoarrow][nanoarrow].
Thanks to Fernanda Foertter for assistance with some of the diagrams.
[c-data-interface]: {% link _posts/2020-05-04-introducing-arrow-c-data-interface.md %}
[dbi]: https://www.r-dbi.org/
[dev@arrow.apache.org]: https://arrow.apache.org/community/
[duckdb]: https://duckdb.org/
[flight-sql]: {% link _posts/2022-02-16-introducing-arrow-flight-sql.md %}
[flight-sql-jdbc]: {% link _posts/2022-11-01-arrow-flight-sql-jdbc.md %}
[jdbc]: https://docs.oracle.com/javase/tutorial/jdbc/overview/index.html
[nanoarrow]: https://github.com/apache/arrow-nanoarrow
[odbc]: https://learn.microsoft.com/en-us/sql/odbc/reference/what-is-odbc?view=sql-server-ver16
[pep-249]: https://www.python.org/dev/peps/pep-0249/
[substrait]: https://substrait.io/
[turbodbc]: https://turbodbc.readthedocs.io/en/latest/
[why-arrow]: {% link faq.md %}#why-define-a-standard-for-columnar-in-memory