The C++ Core ORC API reads and writes ORC files into its own orc::ColumnVectorBatch vectorized classes.
Data is passed to ORC as instances of orc::ColumnVectorBatch that contain the data a batch of rows. The focus is on speed and accessing the data fields directly. numElements
is the number of rows. ColumnVectorBatch is the parent type of the different kinds of columns and has some fields that are shared across all of the column types. In particular, the hasNulls
flag if there is any null in this column for this batch. For columns where hasNulls == true
the notNull
buffer is false if that value is null.
namespace orc { struct ColumnVectorBatch { uint64_t numElements; DataBuffer<char> notNull; bool hasNulls; ... } }
The subtypes of ColumnVectorBatch are:
ORC Type | ColumnVectorBatch |
---|---|
array | ListVectorBatch |
binary | StringVectorBatch |
bigint | LongVectorBatch |
boolean | LongVectorBatch |
char | StringVectorBatch |
date | LongVectorBatch |
decimal | Decimal64VectorBatch, Decimal128VectorBatch |
double | DoubleVectorBatch |
float | DoubleVectorBatch |
int | LongVectorBatch |
map | MapVectorBatch |
smallint | LongVectorBatch |
string | StringVectorBatch |
struct | StructVectorBatch |
timestamp | TimestampVectorBatch |
tinyint | LongVectorBatch |
uniontype | UnionVectorBatch |
varchar | StringVectorBatch |
LongVectorBatch handles all of the integer types (boolean, bigint, date, int, smallint, and tinyint). The data is represented as a buffer of int64_t where each value is sign-extended as necessary.
struct LongVectorBatch: public ColumnVectorBatch { DataBuffer<int64_t> data; ... };
TimestampVectorBatch handles timestamp values. The data is represented as two buffers of int64_t for seconds and nanoseconds respectively. Note that we always assume data is in GMT timezone; therefore it is user's responsibility to convert wall clock time from local timezone to GMT.
struct TimestampVectorBatch: public ColumnVectorBatch { DataBuffer<int64_t> data; DataBuffer<int64_t> nanoseconds; ... };
DoubleVectorBatch handles all of the floating point types (double, and float). The data is represented as a buffer of doubles.
struct DoubleVectorBatch: public ColumnVectorBatch { DataBuffer<double> data; ... };
Decimal64VectorBatch handles decimal columns with precision no greater than 18. Decimal128VectorBatch handles the others. The data is represented as a buffer of int64_t and orc::Int128 respectively.
struct Decimal64VectorBatch: public ColumnVectorBatch { DataBuffer<int64_t> values; ... }; struct Decimal128VectorBatch: public ColumnVectorBatch { DataBuffer<Int128> values; ... };
StringVectorBatch handles all of the binary types (binary, char, string, and varchar). The data is represented as a char* buffer, and a length buffer.
struct StringVectorBatch: public ColumnVectorBatch { DataBuffer<char*> data; DataBuffer<int64_t> length; ... };
StructVectorBatch handles the struct columns and represents the data as a buffer of ColumnVectorBatch
.
struct StructVectorBatch: public ColumnVectorBatch { std::vector<ColumnVectorBatch*> fields; ... };
UnionVectorBatch handles the union columns. It uses tags
to indicate which subtype has the value and offsets
indicates the offset in child batch of that subtype. A individual ColumnVectorBatch
is used for each subtype.
struct UnionVectorBatch: public ColumnVectorBatch { DataBuffer<unsigned char> tags; DataBuffer<uint64_t> offsets; std::vector<ColumnVectorBatch*> children; ... };
ListVectorBatch handles the array columns and represents the data as a buffer of integers for the offsets and a ColumnVectorBatch
for the children values.
struct ListVectorBatch: public ColumnVectorBatch { DataBuffer<int64_t> offsets; ORC_UNIQUE_PTR<ColumnVectorBatch> elements; ... };
MapVectorBatch handles the map columns and represents the data as two arrays of integers for the offsets and two ColumnVectorBatch
s for the keys and values.
struct MapVectorBatch: public ColumnVectorBatch { DataBuffer<int64_t> offsets; ORC_UNIQUE_PTR<ColumnVectorBatch> keys; ORC_UNIQUE_PTR<ColumnVectorBatch> elements; ... };
To write an ORC file, you need to include OrcFile.hh
and define the schema; then use orc::OutputStream
and orc::WriterOptions
to create a orc::Writer
with the desired filename. This example sets the required schema parameter, but there are many other options to control the ORC writer.
ORC_UNIQUE_PTR<OutputStream> outStream = writeLocalFile("my-file.orc"); ORC_UNIQUE_PTR<Type> schema( Type::buildTypeFromString("struct<x:int,y:int>")); WriterOptions options; ORC_UNIQUE_PTR<Writer> writer = createWriter(*schema, outStream.get(), options);
Now you need to create a row batch, set the data, and write it to the file as the batch fills up. When the file is done, close the Writer
.
uint64_t batchSize = 1024, rowCount = 10000; ORC_UNIQUE_PTR<ColumnVectorBatch> batch = writer->createRowBatch(batchSize); StructVectorBatch *root = dynamic_cast<StructVectorBatch *>(batch.get()); LongVectorBatch *x = dynamic_cast<LongVectorBatch *>(root->fields[0]); LongVectorBatch *y = dynamic_cast<LongVectorBatch *>(root->fields[1]); uint64_t rows = 0; for (uint64_t i = 0; i < rowCount; ++i) { x->data[rows] = i; y->data[rows] = i * 3; rows++; if (rows == batchSize) { root->numElements = rows; x->numElements = rows; y->numElements = rows; writer->add(*batch); rows = 0; } } if (rows != 0) { root->numElements = rows; x->numElements = rows; y->numElements = rows; writer->add(*batch); rows = 0; } writer->close();
To read ORC files, include OrcFile.hh
file to create a orc::Reader
that contains the metadata about the file. There are a few options to the orc::Reader
, but far fewer than the writer and none of them are required. The reader has methods for getting the number of rows, schema, compression, etc. from the file.
ORC_UNIQUE_PTR<InputStream> inStream = readLocalFile("my-file.orc"); ReaderOptions options; ORC_UNIQUE_PTR<Reader> reader = createReader(inStream, options);
To get the data, create a orc::RowReader
object. By default, the RowReader reads all rows and all columns, but there are options to control the data that is read.
RowReaderOptions rowReaderOptions; ORC_UNIQUE_PTR<RowReader> rowReader = reader->createRowReader(rowReaderOptions); ORC_UNIQUE_PTR<ColumnVectorBatch> batch = rowReader->createRowBatch(1024);
With a orc::RowReader
the user can ask for the next batch until there are no more left. The reader will stop the batch at certain boundaries, so the returned batch may not be full, but it will always contain some rows.
while (rowReader->next(*batch)) { for (uint64_t r = 0; r < batch->numElements; ++r) { ... process row r from batch } }