.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
===========
ValueVector
===========
The :class:`ValueVector` interface (called Array in the C++ implementation and in
:doc:`the specification <../format/Columnar>`) is an abstraction used to store a
sequence of values of the same type in an individual column. Internally, those values are
represented by one or several buffers, the number and meaning of which depend on the vector’s data type.
There are concrete subclasses of :class:`ValueVector` for each primitive data type
and nested type described in the specification. There are a few differences in naming
with the type names described in the specification:
Table with non-intuitive names (BigInt = 64 bit integer, etc).
A vector must be allocated before any attempt to read from it or write to it.
Code that uses a :class:`ValueVector` should strive to follow this order of operations:
create > allocate > mutate > set value count > access > clear (or allocate again to start the process over).
We will go through a concrete example to demonstrate each operation in the next section.
Vector Life Cycle
=================
As discussed above, each vector goes through several steps in its life cycle,
and each step is triggered by a vector operation. In particular, we have the following vector operations:
1. **Vector creation**: we create a new vector object by, for example, the vector constructor.
The following code creates a new ``IntVector`` by the constructor:
.. code-block:: Java
RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
...
IntVector vector = new IntVector("int vector", allocator);
By now, a vector object is created. However, no underlying memory has been allocated, so we need the
following step.
2. **Vector allocation**: in this step, we allocate memory for the vector. For most vectors, we
have two options: 1) if we know the maximum vector capacity, we can specify it by calling the
``allocateNew(int)`` method; 2) otherwise, we should call the ``allocateNew()`` method, and a default
capacity will be allocated for it. For our running example, we assume that the vector capacity never
exceeds 10:
.. code-block:: Java
vector.allocateNew(10);
3. **Vector mutation**: now we can populate the vector with values we desire. For all vectors, we can populate
vector values through vector writers (An example will be given in the next section). For primitive types,
we can also mutate the vector by the set methods. There are two classes of set methods: 1) if we can
be sure the vector has enough capacity, we can call the ``set(index, value)`` method. 2) if we are not sure
about the vector capacity, we should call the ``setSafe(index, value)`` method, which will automatically
take care of vector reallocation, if the capacity is not sufficient. For our running example, we know the
vector has enough capacity, so we can call
.. code-block:: Java
vector.set(/*index*/5, /*value*/25);
4. **Set value count**: for this step, we set the value count of the vector by calling the
``setValueCount(int)`` method:
.. code-block:: Java
vector.setValueCount(10);
After this step, the vector enters an immutable state. In other words, we should no longer mutate it,
unless we reuse the vector by allocating it again (this will be discussed shortly).
5. **Vector access**: it is time to access vector values. Similarly, we have two options to access values:
1) get methods and 2) vector reader. Vector reader works for all types of vectors, while get methods are
only available for primitive vectors. A concrete example for vector reader will be given in the next section.
Below is an example of vector access by get method:
.. code-block:: Java
int value = vector.get(5); // value == 25
6. **Vector clear**: when we are done with the vector, we should clear it to release its memory. This is done by
calling the ``close()`` method:
.. code-block:: Java
vector.close();
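
The six steps above can be combined into a single sketch. The class name ``VectorLifeCycleExample`` and the method ``writeAndRead`` are illustrative only; the Arrow calls are the ones already shown in the individual steps. The sketch assumes that a try-with-resources block is acceptable for the clear step: it closes the vector first and the allocator second, even if an exception is thrown.

.. code-block:: Java

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;

    public class VectorLifeCycleExample {

      // Walks one IntVector through the whole life cycle and returns the value read back.
      static int writeAndRead() {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vector = new IntVector("int vector", allocator)) { // 1. create
          vector.allocateNew(10);     // 2. allocate room for 10 values
          vector.set(5, 25);          // 3. mutate (capacity is known to be sufficient)
          vector.setValueCount(10);   // 4. set value count
          return vector.get(5);       // 5. access
        }                             // 6. clear: close() runs when the block exits
      }

      public static void main(String[] args) {
        System.out.println(writeAndRead()); // prints 25
      }
    }

Resources declared in the try header are closed in reverse order, so the vector releases its buffers before the allocator is closed.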
Some points to note about the steps above:
* The steps are not necessarily performed in a linear sequence. Instead, they can be in a loop. For example,
when a vector enters the access step, we can also go back to the vector mutation step, and then set value
count, access vector, and so on.
* We should try to make sure the above steps are carried out in order. Otherwise, the vector
may be in an undefined state, and unexpected behavior may occur. However, this restriction
is not strict: it is possible to violate the order above and still get
correct results.
* When mutating vector values through set methods, we should prefer ``set(index, value)`` methods to
``setSafe(index, value)`` methods whenever possible, to avoid unnecessary performance overhead of handling
vector capacity.
* All vectors implement the ``AutoCloseable`` interface, and they must be closed explicitly when they are
no longer used, to avoid resource leaks. To make sure of this, it is recommended to place vector-related operations
in a try-with-resources block.
* For fixed-width vectors (e.g. IntVector), we can set values at different indices in arbitrary order.
For variable-width vectors (e.g. VarCharVector), however, we must set values in non-decreasing order of the
indices. Otherwise, the values after the set position will become invalid. For example, suppose we use the
following statements to populate a variable-width vector:
.. code-block:: Java
VarCharVector vector = new VarCharVector("vector", allocator);
vector.allocateNew();
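// VarCharVector takes the value as bytes; StandardCharsets is java.nio.charset.StandardCharsets
vector.setSafe(0, "zero".getBytes(StandardCharsets.UTF_8));
vector.setSafe(1, "one".getBytes(StandardCharsets.UTF_8));
...
vector.setSafe(9, "nine".getBytes(StandardCharsets.UTF_8));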
Then we set the value at position 5 again:
.. code-block:: Java
vector.setSafe(5, "5".getBytes(java.nio.charset.StandardCharsets.UTF_8));
After that, the values at positions 6, 7, 8, and 9 of the vector will become invalid.
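
The invalidation described above follows from how variable-width values are laid out in the Arrow columnar format: the value bytes are stored back to back in one data buffer, and an offsets buffer records where each value starts and ends. The sketch below is a simplified plain-Java model of that layout, not Arrow's implementation; the class ``OffsetModelSketch`` and its methods are hypothetical. In this model, overwriting position 5 with a shorter value moves the boundary between positions 5 and 6, so position 6 no longer reads back as "six".

.. code-block:: Java

    import java.nio.charset.StandardCharsets;

    public class OffsetModelSketch {
      final byte[] data = new byte[64];   // all value bytes, stored back to back
      final int[] offsets = new int[11];  // value i spans data[offsets[i] .. offsets[i + 1])

      // Writes a value starting at the current start offset of the given index.
      void set(int index, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        int start = offsets[index];
        System.arraycopy(bytes, 0, data, start, bytes.length);
        offsets[index + 1] = start + bytes.length; // also moves the start of index + 1
      }

      String get(int index) {
        int start = offsets[index];
        int end = offsets[index + 1];
        return new String(data, start, end - start, StandardCharsets.UTF_8);
      }

      // Populates positions 0..9 in order, then overwrites position 5.
      static OffsetModelSketch demo() {
        OffsetModelSketch v = new OffsetModelSketch();
        String[] words = {"zero", "one", "two", "three", "four",
                          "five", "six", "seven", "eight", "nine"};
        for (int i = 0; i < words.length; i++) {
          v.set(i, words[i]);
        }
        v.set(5, "5"); // out-of-order write
        return v;
      }

      public static void main(String[] args) {
        OffsetModelSketch v = demo();
        System.out.println(v.get(5)); // prints 5
        System.out.println(v.get(6)); // prints ivesix
      }
    }

The model only corrupts the neighbouring position. The real vector may keep additional bookkeeping, so none of the positions after the rewritten one should be relied on.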
Building ValueVector
====================
Note that the current implementation doesn't enforce the rule that Arrow objects are immutable.
:class:`ValueVector` instances can be created directly with the ``new`` keyword, and values can be populated
through the set/setSafe APIs or through concrete subclasses of :class:`FieldWriter`.
For example, the code below shows how to build a :class:`BigIntVector`. In this case, we build a
vector with indices 0 to 7, holding the values 1 to 8, where the element that should hold the fourth value is null:
.. code-block:: Java
try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
BigIntVector vector = new BigIntVector("vector", allocator)) {
vector.allocateNew(8);
vector.set(0, 1);
vector.set(1, 2);
vector.set(2, 3);
vector.setNull(3);
vector.set(4, 5);
vector.set(5, 6);
vector.set(6, 7);
vector.set(7, 8);
vector.setValueCount(8); // this finalizes the vector by convention.
...
}
The :class:`BigIntVector` holds two ArrowBufs. The first buffer holds the null bitmap, which consists
here of a single byte with the bits 1|1|1|1|0|1|1|1, written from the most significant to the least
significant bit (a bit is 1 if the value is non-null, so the bit for index 3 is 0).
The second buffer contains all the above values. As the fourth entry is null, the value at that position
in the buffer is undefined. Note that, compared with the set API, the setSafe API checks the value capacity
before setting a value and reallocates the buffers if necessary.
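
The bitmap byte quoted above can be reproduced with a small plain-Java sketch. It assumes the bit numbering of the Arrow columnar format, where the value at index ``i`` maps to bit ``i`` counted from the least significant bit. The class ``ValidityBitmapSketch`` and its methods are hypothetical helpers, not part of the Arrow API.

.. code-block:: Java

    public class ValidityBitmapSketch {

      // Packs up to 8 validity flags into one byte: value i maps to bit i (LSB first).
      static int pack(boolean[] valid) {
        int bits = 0;
        for (int i = 0; i < valid.length; i++) {
          if (valid[i]) {
            bits |= 1 << i;
          }
        }
        return bits;
      }

      // Returns true if the value at the given index is non-null.
      static boolean isSet(int bitmap, int index) {
        return ((bitmap >> index) & 1) == 1;
      }

      public static void main(String[] args) {
        // Index 3 (the fourth value) is null, as in the BigIntVector example.
        boolean[] valid = {true, true, true, false, true, true, true, true};
        int bitmap = pack(valid);
        System.out.println(Integer.toBinaryString(bitmap)); // prints 11110111
        System.out.println(isSet(bitmap, 3));               // prints false
      }
    }

``Integer.toBinaryString`` prints the most significant bit first, which is why the zero for index 3 appears as the fifth character of the string.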
Here is how to build a vector using a writer:
.. code-block:: Java
try (BigIntVector vector = new BigIntVector("vector", allocator);
BigIntWriter writer = new BigIntWriterImpl(vector)) {
writer.setPosition(0);
writer.writeBigInt(1);
writer.setPosition(1);
writer.writeBigInt(2);
writer.setPosition(2);
writer.writeBigInt(3);
// writer.setPosition(3) is not called, which means the fourth value is null.
writer.setPosition(4);
writer.writeBigInt(5);
writer.setPosition(5);
writer.writeBigInt(6);
writer.setPosition(6);
writer.writeBigInt(7);
writer.setPosition(7);
writer.writeBigInt(8);
}
Vector values can be accessed through the get API or through concrete subclasses of :class:`FieldReader`.
Note that writers and readers are not as efficient as direct access:
.. code-block:: Java
// access via get API
for (int i = 0; i < vector.getValueCount(); i++) {
if (!vector.isNull(i)) {
System.out.println(vector.get(i));
}
}
// access via reader
BigIntReader reader = vector.getReader();
for (int i = 0; i < vector.getValueCount(); i++) {
reader.setPosition(i);
if (reader.isSet()) {
System.out.println(reader.readLong());
}
}
Building ListVector
===================
A :class:`ListVector` is a vector that holds a list of values for each index. Working with one involves the same steps as mentioned above (create > allocate > mutate > set value count > access > clear), but the details differ slightly, since you need to both create the vector and set the list of values for each index.
For example, the code below shows how to build a :class:`ListVector` of ints using the writer :class:`UnionListWriter`. We build a vector with indices 0 to 9, where each index contains a list of five values: [[0, 0, 0, 0, 0], [0, 1, 2, 3, 4], [0, 2, 4, 6, 8], …, [0, 9, 18, 27, 36]]. List values can be added in any order, so writing a list such as [3, 1, 2] would be just as valid.
.. code-block:: Java
try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
ListVector listVector = ListVector.empty("vector", allocator)) {
UnionListWriter writer = listVector.getWriter();
for (int i = 0; i < 10; i++) {
writer.startList();
writer.setPosition(i);
for (int j = 0; j < 5; j++) {
writer.writeInt(j * i);
}
writer.setValueCount(5);
writer.endList();
}
listVector.setValueCount(10);
}
:class:`ListVector` values can be accessed either through the get API or through the reader class :class:`UnionListReader`. To read all the values, first enumerate through the indexes, and then enumerate through the inner list values.
.. code-block:: Java
// access via get API
for (int i = 0; i < listVector.getValueCount(); i++) {
if (!listVector.isNull(i)) {
ArrayList<Integer> elements = (ArrayList<Integer>) listVector.getObject(i);
for (Integer element : elements) {
System.out.println(element);
}
}
}
// access via reader
UnionListReader reader = listVector.getReader();
for (int i = 0; i < listVector.getValueCount(); i++) {
reader.setPosition(i);
while (reader.next()) {
IntReader intReader = reader.reader();
if (intReader.isSet()) {
System.out.println(intReader.readInteger());
}
}
}
Slicing
=======
As in the C++ implementation, it is possible to make zero-copy slices of vectors to obtain a vector
referring to some logical sub-sequence of the data, through :class:`TransferPair`:
.. code-block:: Java
IntVector vector = new IntVector("intVector", allocator);
for (int i = 0; i < 10; i++) {
vector.setSafe(i, i);
}
vector.setValueCount(10);
TransferPair tp = vector.getTransferPair(allocator);
tp.splitAndTransfer(0, 5);
IntVector sliced = (IntVector) tp.getTo();
// In this case, the vector values are [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] and the sliced values are [0, 1, 2, 3, 4].
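
The slicing snippet above leaves out resource handling. A self-contained variant might look like the sketch below, which assumes that both the source vector and the slice should be closed before the allocator. The class name ``SliceExample`` and the method ``sliceFirstFive`` are illustrative; the Arrow calls are the same ones used above.

.. code-block:: Java

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.arrow.vector.util.TransferPair;

    public class SliceExample {

      // Builds [0..9], slices the first five values, and copies them into an int[].
      static int[] sliceFirstFive() {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vector = new IntVector("intVector", allocator)) {
          for (int i = 0; i < 10; i++) {
            vector.setSafe(i, i); // setSafe grows the buffers as needed
          }
          vector.setValueCount(10);

          TransferPair tp = vector.getTransferPair(allocator);
          // The target vector is a separate object and has to be closed as well.
          try (IntVector sliced = (IntVector) tp.getTo()) {
            tp.splitAndTransfer(0, 5); // start index 0, length 5
            int[] out = new int[sliced.getValueCount()];
            for (int i = 0; i < out.length; i++) {
              out[i] = sliced.get(i);
            }
            return out;
          }
        }
      }

      public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(sliceFirstFive())); // prints [0, 1, 2, 3, 4]
      }
    }

Copying the values into an ``int[]`` is only done here so that the result outlives the vectors; code that keeps working with the slice would use ``sliced`` directly inside the inner block.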