blob: 1ca1f5e252a720f6ba0bb777409179c6c19aa14c [file] [log] [blame]
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.drill.exec.vector.accessor.writer;
import org.apache.drill.exec.vector.BaseDataValueVector;
import org.apache.drill.exec.vector.UInt4Vector;
import org.apache.drill.exec.vector.accessor.InvalidConversionError;
import org.apache.drill.exec.vector.accessor.ValueType;
import org.apache.drill.exec.vector.accessor.impl.HierarchicalFormatter;
/**
* Specialized column writer for the (hidden) offset vector used
* with variable-length or repeated vectors. See comments in the
* <tt>ColumnAccessors.java</tt> template file for more details.
* <p>
* Note that the <tt>lastWriteIndex</tt> tracked here corresponds
* to the data values; it is one less than the actual offset vector
* last write index due to the nature of offset vector layouts. The selection
* of last write index basis makes roll-over processing easier as only this
* writer need know about the +1 translation required for writing.
* <p>
* The states illustrated in the base class apply here as well,
* remembering that the end offset for a row (or array position)
* is written one ahead of the vector index.
* <p>
* The vector index does create an interesting dynamic for the child
* writers. From the child writer's perspective, the states described in
* the super class are the only states of interest. Here we want to
* take the perspective of the parent.
* <p>
* The offset vector is an implementation of a repeat level. A repeat
* level can occur for a single array, or for a collection of columns
* within a repeated map. (A repeat level also occurs for variable-width
* fields, but this is a bit harder to see, so let's ignore that for
* now.)
* <p>
* The key point to realize is that each repeat level introduces an
* isolation level in terms of indexing. That is, empty values in the
* outer level have no affect on indexing in the inner level. In fact,
* the nature of a repeated outer level means that there are no empties
* in the inner level.
* <p>
* To illustrate:<pre><code>
* Offset Vector Data Vector Indexes
* lw, v > | 10 | - - - - - > | X | 10
* | 12 | - - + | X | < lw' 11
* | | + - - > | | < v' 12
* </code></pre>
* In the above, the client has just written an array of two elements
* at the current write position. The data starts at offset 10 in
* the data vector, and the next write will be at 12. The end offset
* is written one ahead of the vector index.
* <p>
* From the data vector's perspective, its last-write (lw') reflects
* the last element written. If this is an array of scalars, then the
* write index is automatically incremented, as illustrated by v'.
* (For map arrays, the index must be incremented by calling
* <tt>save()</tt> on the map array writer.)
* <p>
* Suppose the client now skips some arrays:<pre><code>
* Offset Vector Data Vector
* lw > | 10 | - - - - - > | X | 10
* | 12 | - - + | X | < lw' 11
* | | + - - > | | < v' 12
* | | | | 13
* v > | | | | 14
* </code></pre>
* The last write position does not move and there are gaps in the
* offset vector. The vector index points to the current row. Note
* that the data vector last write and vector indexes do not change,
* this reflects the fact that the the data vector's vector index
* (v') matches the tail offset
* <p>
* The
* client now writes a three-element vector:<pre><code>
* Offset Vector Data Vector
* | 10 | - - - - - > | X | 10
* | 12 | - - + | X | 11
* | 12 | - - + - - > | Y | 12
* | 12 | - - + | Y | 13
* lw, v > | 12 | - - + | Y | < lw' 14
* | 15 | - - - - - > | | < v' 15
* </code></pre>
* Quite a bit just happened. The empty offset slots were back-filled
* with the last write offset in the data vector. The client wrote
* three values, which advanced the last write and vector indexes
* in the data vector. And, the last write index in the offset
* vector also moved to reflect the update of the offset vector.
* Note that as a result, multiple positions in the offset vector
* point to the same location in the data vector. This is fine; we
* compute the number of entries as the difference between two successive
* offset vector positions, so the empty positions have become 0-length
* arrays.
* <p>
* Note that, for an array of scalars, when overflow occurs,
* we need only worry about two
* states in the data vector. Either data has been written for the
* row (as in the third example above), and so must be moved to the
* roll-over vector, or no data has been written and no move is
* needed. We never have to worry about missing values because the
* cannot occur in the data vector.
* <p>
* See {@link ObjectArrayWriter} for information about arrays of
* maps (arrays of multiple columns.)
*
* <h4>Empty Slots</h4>
*
* The offset vector writer handles empty slots in two distinct ways.
* First, the writer handles its own empties. Suppose that this is the offset
* vector for a VarChar column. Suppose we write "Foo" in the first slot. Now
* we have an offset vector with the values <tt>[ 0 3 ]</tt>. Suppose the client
* skips several rows and next writes at slot 5. We must copy the latest
* offset (3) into all the skipped slots: <tt>[ 0 3 3 3 3 3 ]</tt>. The result
* is a set of four empty VarChars in positions 1, 2, 3 and 4. (Here, remember
* that the offset vector always has one more value than the the number of rows.)
* <p>
* The second way to fill empties is in the data vector. The data vector may choose
* to fill the four "empty" slots with a value, say "X". In this case, it is up to
* the data vector to fill in the values, calling into this vector to set each
* offset. Note that when doing this, the calls are a bit different than for writing
* a regular value because we want to write at the "last write position", not the
* current row position. See {@link BaseVarWidthWriter} for an example.
*/
public class OffsetVectorWriterImpl extends AbstractFixedWidthWriter implements OffsetVectorWriter {
private static final int VALUE_WIDTH = UInt4Vector.VALUE_WIDTH;
private final UInt4Vector vector;
/**
* Offset of the first value for the current row. Used during
* overflow or if the row is restarted.
*/
private int rowStartOffset;
/**
* Cached value of the end offset for the current value. Used
* primarily for variable-width columns to allow the column to be
* rewritten multiple times within the same row. The start offset
* value is updated with the end offset only when the value is
* committed in {@link @endValue()}.
*/
protected int nextOffset;
public OffsetVectorWriterImpl(UInt4Vector vector) {
this.vector = vector;
}
@Override public BaseDataValueVector vector() { return vector; }
@Override public int width() { return VALUE_WIDTH; }
@Override
protected void realloc(int size) {
vector.reallocRaw(size);
setBuffer();
}
@Override
public ValueType valueType() { return ValueType.INTEGER; }
@Override
public void startWrite() {
super.startWrite();
nextOffset = 0;
rowStartOffset = 0;
// Special handling for first value. Alloc vector if needed.
// Offset vectors require a 0 at position 0. The (end) offset
// for row 0 starts at position 1, which is handled in
// writeOffset() below.
if (capacity * VALUE_WIDTH < MIN_BUFFER_SIZE) {
realloc(MIN_BUFFER_SIZE);
}
drillBuf.setInt(0, 0);
}
@Override
public int nextOffset() {return nextOffset; }
@Override
public int rowStartOffset() { return rowStartOffset; }
@Override
public void startRow() { rowStartOffset = nextOffset; }
/**
* Return the write offset, which is one greater than the index reported
* by the vector index.
*
* @return the offset in which to write the current offset of the end
* of the current data value
*/
protected final int prepareWrite() {
// This is performance critical code; every operation counts.
// Please be thoughtful when changing the code.
int valueIndex = prepareFill();
int fillCount = valueIndex - lastWriteIndex - 1;
if (fillCount > 0) {
fillEmpties(fillCount);
}
// Track the last write location for zero-fill use next time around.
lastWriteIndex = valueIndex;
return valueIndex + 1;
}
public final int prepareFill() {
int valueIndex = vectorIndex.vectorIndex();
if (valueIndex + 1 < capacity) {
return valueIndex;
}
resize(valueIndex + 1);
// Call to resize may cause rollover, so get new write index afterwards.
return vectorIndex.vectorIndex();
}
@Override
protected final void fillEmpties(int fillCount) {
for (int i = 0; i < fillCount; i++) {
fillOffset(nextOffset);
}
}
@Override
public final void setNextOffset(int newOffset) {
int writeIndex = prepareWrite();
drillBuf.setInt(writeIndex * VALUE_WIDTH, newOffset);
nextOffset = newOffset;
}
public final void reviseOffset(int newOffset) {
int writeIndex = vectorIndex.vectorIndex() + 1;
drillBuf.setInt(writeIndex * VALUE_WIDTH, newOffset);
nextOffset = newOffset;
}
public final void fillOffset(int newOffset) {
drillBuf.setInt((++lastWriteIndex + 1) * VALUE_WIDTH, newOffset);
nextOffset = newOffset;
}
@Override
public final void setValue(Object value) {
throw new InvalidConversionError(
"setValue() not supported for the offset vector writer: " + value);
}
@Override
public void skipNulls() {
// Nothing to do. Fill empties logic will fill in missing offsets.
}
@Override
public void restartRow() {
nextOffset = rowStartOffset;
super.restartRow();
}
@Override
public void preRollover() {
// Rollover is occurring. This means the current row is not complete.
// We want to keep 0..(row index - 1) which gives us (row index)
// rows. But, this being an offset vector, we add one to account
// for the extra 0 value at the start. That is, we want to set
// the value count to the current row start index, which already
// is set to one past the index of the last zero-based index.
// (Offset vector indexes are confusing.)
setValueCount(vectorIndex.rowStartIndex());
}
@Override
public void postRollover() {
int newNext = nextOffset - rowStartOffset;
super.postRollover();
nextOffset = newNext;
}
@Override
public void setValueCount(int valueCount) {
// Value count is in row positions, not index
// positions. (There are one more index positions
// than row positions.)
int offsetCount = valueCount + 1;
mandatoryResize(offsetCount);
fillEmpties(valueCount - lastWriteIndex - 1);
vector().getBuffer().writerIndex(offsetCount * VALUE_WIDTH);
}
@Override
public void dump(HierarchicalFormatter format) {
format.extend();
super.dump(format);
format
.attribute("lastWriteIndex", lastWriteIndex)
.attribute("nextOffset", nextOffset)
.endObject();
}
@Override
public void setDefaultValue(Object value) {
throw new UnsupportedOperationException("Encoding not supported for offset vectors");
}
}