| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| Integrating PyArrow with Java |
| ============================= |
| |
| Arrow supports exchanging data within the same process through the |
| :ref:`c-data-interface`. |
| |
| This can be used to exchange data between Python and Java functions and |
| methods so that the two languages can interact without any cost of |
| marshaling and unmarshaling data. |
| |
| .. note:: |
| |
| The article takes for granted that you have a ``Python`` environment |
| with ``pyarrow`` correctly installed and a ``Java`` environment with |
| ``arrow`` library correctly installed. |
| The ``Arrow Java`` version must have been compiled with ``mvn -Parrow-c-data`` to |
| ensure CData exchange support is enabled. |
| See `Python Install Instructions <https://arrow.apache.org/docs/python/install.html>`_ |
| and `Java Documentation <https://arrow.apache.org/docs/java/>`_ |
| for further details. |
| |
| Invoking Java methods from Python |
| --------------------------------- |
| |
| Suppose we have a simple Java class providing a number as its output: |
| |
| .. code-block:: Java |
| |
| public class Simple { |
| public static int getNumber() { |
| return 4; |
| } |
| } |
| |
| We would save such class in the ``Simple.java`` file and proceed with |
| compiling it to ``Simple.class`` using ``javac Simple.java``. |
| |
| Once the ``Simple.class`` file is created we can use the class |
| from Python using the |
| `JPype <https://jpype.readthedocs.io/>`_ library which |
| enables a Java runtime within the Python interpreter. |
| |
| ``jpype1`` can be installed using ``pip`` like most Python libraries |
| |
| .. code-block:: console |
| |
| $ pip install jpype1 |
| |
| The most basic thing we can do with our ``Simple`` class is to |
| use the ``Simple.getNumber`` method from Python and see |
| if it will return the result. |
| |
| To do so, we can create a ``simple.py`` file which uses ``jpype`` to |
| import the ``Simple`` class from ``Simple.class`` file and invoke |
| the ``Simple.getNumber`` method: |
| |
| .. code-block:: python |
| |
| import jpype |
| from jpype.types import * |
| |
| jpype.startJVM(classpath=["./"]) |
| |
| Simple = JClass('Simple') |
| |
| print(Simple.getNumber()) |
| |
| Running the ``simple.py`` file will show how our Python code is able |
| to access the ``Java`` method and print the expected result: |
| |
| .. code-block:: console |
| |
| $ python simple.py |
| 4 |
| |
| Java to Python using pyarrow.jvm |
| -------------------------------- |
| |
| PyArrow provides a ``pyarrow.jvm`` module that makes easier to |
| interact with Java classes and convert the Java objects to actual |
| Python objects. |
| |
| To showcase ``pyarrow.jvm`` we could create a more complex |
| class, named ``FillTen.java`` |
| |
| .. code-block:: java |
| |
| import org.apache.arrow.memory.RootAllocator; |
| import org.apache.arrow.vector.BigIntVector; |
| |
| |
| public class FillTen { |
| static RootAllocator allocator = new RootAllocator(); |
| |
| public static BigIntVector createArray() { |
| BigIntVector intVector = new BigIntVector("ints", allocator); |
| intVector.allocateNew(10); |
| intVector.setValueCount(10); |
| FillTen.fillVector(intVector); |
| return intVector; |
| } |
| |
| private static void fillVector(BigIntVector iv) { |
| iv.setSafe(0, 1); |
| iv.setSafe(1, 2); |
| iv.setSafe(2, 3); |
| iv.setSafe(3, 4); |
| iv.setSafe(4, 5); |
| iv.setSafe(5, 6); |
| iv.setSafe(6, 7); |
| iv.setSafe(7, 8); |
| iv.setSafe(8, 9); |
| iv.setSafe(9, 10); |
| } |
| } |
| |
| This class provides a public ``createArray`` method that anyone can invoke |
| to get back an array containing numbers from 1 to 10. |
| |
| Given that this class now has a dependency on a bunch of packages, |
| compiling it with ``javac`` is not enough anymore. We need to create |
| a dedicated ``pom.xml`` file where we can collect the dependencies: |
| |
| .. code-block:: xml |
| |
| <project> |
| <modelVersion>4.0.0</modelVersion> |
| |
| <groupId>org.apache.arrow.py2java</groupId> |
| <artifactId>FillTen</artifactId> |
| <version>1</version> |
| |
| <properties> |
| <maven.compiler.source>8</maven.compiler.source> |
| <maven.compiler.target>8</maven.compiler.target> |
| </properties> |
| |
| <dependencies> |
| <dependency> |
| <groupId>org.apache.arrow</groupId> |
| <artifactId>arrow-memory</artifactId> |
| <version>8.0.0</version> |
| <type>pom</type> |
| </dependency> |
| <dependency> |
| <groupId>org.apache.arrow</groupId> |
| <artifactId>arrow-memory-netty</artifactId> |
| <version>8.0.0</version> |
| <type>jar</type> |
| </dependency> |
| <dependency> |
| <groupId>org.apache.arrow</groupId> |
| <artifactId>arrow-vector</artifactId> |
| <version>8.0.0</version> |
| <type>pom</type> |
| </dependency> |
| <dependency> |
| <groupId>org.apache.arrow</groupId> |
| <artifactId>arrow-c-data</artifactId> |
| <version>8.0.0</version> |
| <type>jar</type> |
| </dependency> |
| </dependencies> |
| </project> |
| |
| Once the ``FillTen.java`` file with the class is created |
| as ``src/main/java/FillTen.java`` we can use ``maven`` to |
| compile the project with ``mvn package`` and get it |
| available in the ``target`` directory. |
| |
| .. code-block:: console |
| |
| $ mvn package |
| [INFO] Scanning for projects... |
| [INFO] |
| [INFO] ------------------< org.apache.arrow.py2java:FillTen >------------------ |
| [INFO] Building FillTen 1 |
| [INFO] --------------------------------[ jar ]--------------------------------- |
| [INFO] |
| [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ FillTen --- |
| [INFO] Changes detected - recompiling the module! |
| [INFO] Compiling 1 source file to /experiments/java2py/target/classes |
| [INFO] |
| [INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ FillTen --- |
| [INFO] Building jar: /experiments/java2py/target/FillTen-1.jar |
| [INFO] ------------------------------------------------------------------------ |
| [INFO] BUILD SUCCESS |
| [INFO] ------------------------------------------------------------------------ |
| |
| Now that we have the package built, we can make it available to Python. |
| To do so, we need to make sure that not only the package itself is available, |
| but that also its dependencies are. |
| |
| We can use ``maven`` to collect all dependencies and make them available in a single place |
| (the ``dependencies`` directory) so that we can more easily load them from Python: |
| |
| .. code-block:: console |
| |
| $ mvn org.apache.maven.plugins:maven-dependency-plugin:2.7:copy-dependencies -DoutputDirectory=dependencies |
| [INFO] Scanning for projects... |
| [INFO] |
| [INFO] ------------------< org.apache.arrow.py2java:FillTen >------------------ |
| [INFO] Building FillTen 1 |
| [INFO] --------------------------------[ jar ]--------------------------------- |
| [INFO] |
| [INFO] --- maven-dependency-plugin:2.7:copy-dependencies (default-cli) @ FillTen --- |
| [INFO] Copying jsr305-3.0.2.jar to /experiments/java2py/dependencies/jsr305-3.0.2.jar |
| [INFO] Copying netty-common-4.1.72.Final.jar to /experiments/java2py/dependencies/netty-common-4.1.72.Final.jar |
| [INFO] Copying arrow-memory-core-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-memory-core-8.0.0-SNAPSHOT.jar |
| [INFO] Copying arrow-vector-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-vector-8.0.0-SNAPSHOT.jar |
| [INFO] Copying arrow-c-data-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-c-data-8.0.0-SNAPSHOT.jar |
| [INFO] Copying arrow-vector-8.0.0-SNAPSHOT.pom to /experiments/java2py/dependencies/arrow-vector-8.0.0-SNAPSHOT.pom |
| [INFO] Copying jackson-core-2.11.4.jar to /experiments/java2py/dependencies/jackson-core-2.11.4.jar |
| [INFO] Copying jackson-annotations-2.11.4.jar to /experiments/java2py/dependencies/jackson-annotations-2.11.4.jar |
| [INFO] Copying slf4j-api-1.7.25.jar to /experiments/java2py/dependencies/slf4j-api-1.7.25.jar |
| [INFO] Copying arrow-memory-netty-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-memory-netty-8.0.0-SNAPSHOT.jar |
| [INFO] Copying arrow-format-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-format-8.0.0-SNAPSHOT.jar |
| [INFO] Copying flatbuffers-java-1.12.0.jar to /experiments/java2py/dependencies/flatbuffers-java-1.12.0.jar |
| [INFO] Copying arrow-memory-8.0.0-SNAPSHOT.pom to /experiments/java2py/dependencies/arrow-memory-8.0.0-SNAPSHOT.pom |
| [INFO] Copying netty-buffer-4.1.72.Final.jar to /experiments/java2py/dependencies/netty-buffer-4.1.72.Final.jar |
| [INFO] Copying jackson-databind-2.11.4.jar to /experiments/java2py/dependencies/jackson-databind-2.11.4.jar |
| [INFO] Copying commons-codec-1.10.jar to /experiments/java2py/dependencies/commons-codec-1.10.jar |
| [INFO] ------------------------------------------------------------------------ |
| [INFO] BUILD SUCCESS |
| [INFO] ------------------------------------------------------------------------ |
| |
| .. note:: |
| |
| Instead of manually collecting dependencies, you could also rely on the |
| ``maven-assembly-plugin`` to build a single ``jar`` with all dependencies. |
| |
| Once our package and all its dependencies are available, |
| we can invoke it from ``fillten_pyarrowjvm.py`` script that will |
| import the ``FillTen`` class and print out the result of invoking ``FillTen.createArray`` |
| |
| .. code-block:: python |
| |
| import jpype |
| import jpype.imports |
| from jpype.types import * |
| |
| # Start a JVM making available all dependencies we collected |
| # and our class from target/FillTen-1.jar |
| jpype.startJVM(classpath=["./dependencies/*", "./target/*"]) |
| |
| FillTen = JClass('FillTen') |
| |
| array = FillTen.createArray() |
| print("ARRAY", type(array), array) |
| |
| # Convert the proxied BigIntVector to an actual pyarrow array |
| import pyarrow.jvm |
| pyarray = pyarrow.jvm.array(array) |
| print("ARRAY", type(pyarray), pyarray) |
| del pyarray |
| |
| Running the python script will lead to two lines getting printed: |
| |
| .. code-block:: |
| |
| ARRAY <java class 'org.apache.arrow.vector.BigIntVector'> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] |
| ARRAY <class 'pyarrow.lib.Int64Array'> [ |
| 1, |
| 2, |
| 3, |
| 4, |
| 5, |
| 6, |
| 7, |
| 8, |
| 9, |
| 10 |
| ] |
| |
| The first line is the raw result of invoking the ``FillTen.createArray`` method. |
| The resulting object is a proxy to the actual Java object, so it's not really a pyarrow |
| Array, it will lack most of its capabilities and methods. |
| That's why we subsequently use ``pyarrow.jvm.array`` to convert it to an actual |
| ``pyarrow`` array. That allows us to treat it like any other ``pyarrow`` array. |
| The result is the second line in the output where the array is correctly reported |
| as being of type ``pyarrow.lib.Int64Array`` and is printed using the ``pyarrow`` style. |
| |
| .. note:: |
| |
| At the moment the ``pyarrow.jvm`` module is fairly limited in capabilities, |
| nested types like structs are not supported and it only works on a JVM running |
| within the same process like JPype. |
| |
| Java to Python communication using the C Data Interface |
| ------------------------------------------------------- |
| |
| The C Data Interface is a protocol implemented in Arrow to exchange data within different |
| environments without the cost of marshaling and copying data. |
| |
| This allows to expose data coming from Python or Java to functions that are implemented |
| in the other language. |
| |
| .. note:: |
| |
| In the future the ``pyarrow.jvm`` will be implemented to leverage the C Data |
| interface, at the moment is instead specifically written for JPype |
| |
| To showcase how C Data works, we are going to tweak a bit both our ``FillTen`` Java |
| class and our ``fillten.py`` Python script. Given a PyArrow array, we are going to |
| expose a function in Java that sets its content to by the numbers from 1 to 10. |
| |
| Using C Data interface in ``pyarrow`` at the moment requires installing ``cffi`` |
| explicitly, like most Python distributions it can be installed with |
| |
| .. code-block:: console |
| |
| $ pip install cffi |
| |
| The first thing we would have to do is to tweak the Python script so that it |
| sends to Java the exported references to the Array and its Schema according to the |
| C Data interface: |
| |
| .. code-block:: python |
| |
| import jpype |
| import jpype.imports |
| from jpype.types import * |
| |
| # Init the JVM and make FillTen class available to Python. |
| jpype.startJVM(classpath=["./dependencies/*", "./target/*"]) |
| FillTen = JClass('FillTen') |
| |
| # Create a Python array of 10 elements |
| import pyarrow as pa |
| array = pa.array([0]*10) |
| |
| from pyarrow.cffi import ffi as arrow_c |
| |
| # Export the Python array through C Data |
| c_array = arrow_c.new("struct ArrowArray*") |
| c_array_ptr = int(arrow_c.cast("uintptr_t", c_array)) |
| array._export_to_c(c_array_ptr) |
| |
| # Export the Schema of the Array through C Data |
| c_schema = arrow_c.new("struct ArrowSchema*") |
| c_schema_ptr = int(arrow_c.cast("uintptr_t", c_schema)) |
| array.type._export_to_c(c_schema_ptr) |
| |
| # Send Array and its Schema to the Java function |
| # that will populate the array with numbers from 1 to 10 |
| FillTen.fillCArray(c_array_ptr, c_schema_ptr) |
| |
| # See how the content of our Python array was changed from Java |
| # while it remained of the Python type. |
| print("ARRAY", type(array), array) |
| |
| .. note:: |
| |
| Changing content of arrays is not a safe operation, it was done |
| for the purpose of creating this example, and it mostly works only |
| because the array hasn't changed size, type or nulls. |
| |
| In the FillTen Java class, we already have the ``fillVector`` |
| method, but that method is private and even if we made it public it |
| would only accept a ``BigIntVector`` object and not the C Data array |
| and schema references. |
| |
| So we have to expand our ``FillTen`` class adding a ``fillCArray`` |
| method that is able to perform the work of ``fillVector`` but |
| on the C Data exchanged entities instead of the ``BigIntVector`` one: |
| |
| .. code-block:: java |
| |
| import org.apache.arrow.c.ArrowArray; |
| import org.apache.arrow.c.ArrowSchema; |
| import org.apache.arrow.c.Data; |
| import org.apache.arrow.memory.RootAllocator; |
| import org.apache.arrow.vector.FieldVector; |
| import org.apache.arrow.vector.BigIntVector; |
| |
| |
| public class FillTen { |
| static RootAllocator allocator = new RootAllocator(); |
| |
| public static void fillCArray(long c_array_ptr, long c_schema_ptr) { |
| ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr); |
| ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr); |
| |
| FieldVector v = Data.importVector(allocator, arrow_array, arrow_schema, null); |
| FillTen.fillVector((BigIntVector)v); |
| } |
| |
| private static void fillVector(BigIntVector iv) { |
| iv.setSafe(0, 1); |
| iv.setSafe(1, 2); |
| iv.setSafe(2, 3); |
| iv.setSafe(3, 4); |
| iv.setSafe(4, 5); |
| iv.setSafe(5, 6); |
| iv.setSafe(6, 7); |
| iv.setSafe(7, 8); |
| iv.setSafe(8, 9); |
| iv.setSafe(9, 10); |
| } |
| } |
| |
| The goal of the ``fillCArray`` method is to get the Array and Schema received in |
| C Data exchange format and turn them back to an object of type ``FieldVector`` |
| so that Arrow Java knows how to deal with it. |
| |
| If we run again ``mvn package``, update the maven dependencies |
| and then our Python script, we should be able to see how the |
| values printed by the Python script have been properly changed by the Java code: |
| |
| .. code-block:: console |
| |
| $ mvn package |
| $ mvn org.apache.maven.plugins:maven-dependency-plugin:2.7:copy-dependencies -DoutputDirectory=dependencies |
| $ python fillten.py |
| ARRAY <class 'pyarrow.lib.Int64Array'> [ |
| 1, |
| 2, |
| 3, |
| 4, |
| 5, |
| 6, |
| 7, |
| 8, |
| 9, |
| 10 |
| ] |
| |
| We can also use the C Stream Interface to exchange |
| :py:class:`pyarrow.RecordBatchReader` between Java and Python. We'll |
| use this Java class as a demo, which lets you read an Arrow IPC file |
| via Java's implementation, or write data to a JSON file: |
| |
| .. code-block:: java |
| |
| import java.io.File; |
| import java.nio.file.Files; |
| import java.nio.file.Paths; |
| |
| import org.apache.arrow.c.ArrowArrayStream; |
| import org.apache.arrow.c.Data; |
| import org.apache.arrow.memory.BufferAllocator; |
| import org.apache.arrow.memory.RootAllocator; |
| import org.apache.arrow.vector.ipc.ArrowFileReader; |
| import org.apache.arrow.vector.ipc.ArrowReader; |
| import org.apache.arrow.vector.ipc.JsonFileWriter; |
| |
| public class PythonInteropDemo implements AutoCloseable { |
| private final BufferAllocator allocator; |
| |
| public PythonInteropDemo() { |
| this.allocator = new RootAllocator(); |
| } |
| |
| public void exportStream(String path, long cStreamPointer) throws Exception { |
| try (final ArrowArrayStream stream = ArrowArrayStream.wrap(cStreamPointer)) { |
| ArrowFileReader reader = new ArrowFileReader(Files.newByteChannel(Paths.get(path)), allocator); |
| Data.exportArrayStream(allocator, reader, stream); |
| } |
| } |
| |
| public void importStream(String path, long cStreamPointer) throws Exception { |
| try (final ArrowArrayStream stream = ArrowArrayStream.wrap(cStreamPointer); |
| final ArrowReader input = Data.importArrayStream(allocator, stream); |
| JsonFileWriter writer = new JsonFileWriter(new File(path))) { |
| writer.start(input.getVectorSchemaRoot().getSchema(), input); |
| while (input.loadNextBatch()) { |
| writer.write(input.getVectorSchemaRoot()); |
| } |
| } |
| } |
| |
| @Override |
| public void close() throws Exception { |
| allocator.close(); |
| } |
| } |
| |
| On the Python side, we'll use JPype as before, except this time we'll |
| send RecordBatchReaders back and forth: |
| |
| .. code-block:: python |
| |
| import tempfile |
| |
| import jpype |
| import jpype.imports |
| from jpype.types import * |
| |
| # Init the JVM and make demo class available to Python. |
| jpype.startJVM(classpath=["./dependencies/*", "./target/*"]) |
| PythonInteropDemo = JClass("PythonInteropDemo") |
| demo = PythonInteropDemo() |
| |
| # Create a Python record batch reader |
| import pyarrow as pa |
| schema = pa.schema([ |
| ("ints", pa.int64()), |
| ("strs", pa.string()) |
| ]) |
| batches = [ |
| pa.record_batch([ |
| [0, 2, 4, 8], |
| ["a", "b", "c", None], |
| ], schema=schema), |
| pa.record_batch([ |
| [None, 32, 64, None], |
| ["e", None, None, "h"], |
| ], schema=schema), |
| ] |
| reader = pa.RecordBatchReader.from_batches(schema, batches) |
| |
| from pyarrow.cffi import ffi as arrow_c |
| |
| # Export the Python reader through C Data |
| c_stream = arrow_c.new("struct ArrowArrayStream*") |
| c_stream_ptr = int(arrow_c.cast("uintptr_t", c_stream)) |
| reader._export_to_c(c_stream_ptr) |
| |
| # Send reader to the Java function that writes a JSON file |
| with tempfile.NamedTemporaryFile() as temp: |
| demo.importStream(temp.name, c_stream_ptr) |
| |
| # Read the JSON file back |
| with open(temp.name) as source: |
| print("JSON file written by Java:") |
| print(source.read()) |
| |
| |
| # Write an Arrow IPC file for Java to read |
| with tempfile.NamedTemporaryFile() as temp: |
| with pa.ipc.new_file(temp.name, schema) as sink: |
| for batch in batches: |
| sink.write_batch(batch) |
| |
| demo.exportStream(temp.name, c_stream_ptr) |
| with pa.RecordBatchReader._import_from_c(c_stream_ptr) as source: |
| print("IPC file read by Java:") |
| print(source.read_all()) |
| |
| .. code-block:: console |
| |
| $ mvn package |
| $ mvn org.apache.maven.plugins:maven-dependency-plugin:2.7:copy-dependencies -DoutputDirectory=dependencies |
| $ python demo.py |
| JSON file written by Java: |
| {"schema":{"fields":[{"name":"ints","nullable":true,"type":{"name":"int","bitWidth":64,"isSigned":true},"children":[]},{"name":"strs","nullable":true,"type":{"name":"utf8"},"children":[]}]},"batches":[{"count":4,"columns":[{"name":"ints","count":4,"VALIDITY":[1,1,1,1],"DATA":["0","2","4","8"]},{"name":"strs","count":4,"VALIDITY":[1,1,1,0],"OFFSET":[0,1,2,3,3],"DATA":["a","b","c",""]}]},{"count":4,"columns":[{"name":"ints","count":4,"VALIDITY":[0,1,1,0],"DATA":["0","32","64","0"]},{"name":"strs","count":4,"VALIDITY":[1,0,0,1],"OFFSET":[0,1,1,1,2],"DATA":["e","","","h"]}]}]} |
| IPC file read by Java: |
| pyarrow.Table |
| ints: int64 |
| strs: string |
| ---- |
| ints: [[0,2,4,8],[null,32,64,null]] |
| strs: [["a","b","c",null],["e",null,null,"h"]] |