| ~~ Licensed to the Apache Software Foundation (ASF) under one or more |
| ~~ contributor license agreements. See the NOTICE file distributed with |
| ~~ this work for additional information regarding copyright ownership. |
| ~~ The ASF licenses this file to You under the Apache License, Version 2.0 |
| ~~ (the "License"); you may not use this file except in compliance with |
| ~~ the License. You may obtain a copy of the License at |
| ~~ |
| ~~ https://www.apache.org/licenses/LICENSE-2.0 |
| ~~ |
| ~~ Unless required by applicable law or agreed to in writing, software |
| ~~ distributed under the License is distributed on an "AS IS" BASIS, |
| ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| ~~ See the License for the specific language governing permissions and |
| ~~ limitations under the License. |
| --- |
| Trevni: A Column File Format |
| --- |
| |
| Trevni: A Column File Format |
| |
| Version 0.1 |
| |
| DRAFT |
| |
| This document is the authoritative specification of a file format. |
| Its intent is to permit compatible, independent implementations that |
| read and/or write files in this format. |
| |
| Introduction |
| |
| Data sets are often described as a <table> composed of <rows> and |
| <columns>. Each record in the dataset is considered a row, with |
| each field of the record occupying a different column. Writing |
| records to a file one-by-one as they are created results in a |
| <row-major> format, like Hadoop’s SequenceFile or Avro data files. |
| |
| In many cases higher query performance may be achieved if the data |
| is instead organized in a <column-major> format, where multiple |
| values of a given column are stored adjacently. This document |
| defines such a column-major file format for datasets. |
| |
| To permit scalable, distributed query evaluation, datasets are |
| partitioned into row groups, containing distinct collections of |
| rows. Each row group is organized in column-major order, while row |
| groups form a row-major partitioning of the entire dataset. |
| |
| Rationale |
| |
| * Goals |
| |
| The format is meant to satisfy the following goals: |
| |
| [[1]] Maximize the size of row groups. Disk drives are used most |
| efficiently when sequentially accessing data. Consider a drive that |
| takes 10ms to seek and transfers at 100MB/second. If a 10-column |
| dataset whose values are all the same size is split into 10MB row |
| groups, then accessing a single column will require a sequence of |
| seek+1MB reads, for a cost of 20ms/MB processed. If the same |
| dataset is split into 100MB row groups then this drops to 11ms/MB |
| processed. This effect is exaggerated for datasets with larger |
| numbers of columns and with columns whose values are smaller than |
| average. So we’d prefer row groups that are 100MB or greater. |
| |
| [[1]] Permit random access within a row group. Some queries will |
| first examine one column, and, only when certain relatively rare |
| criteria are met, examine other columns. Rather than iterating |
| through selected columns of the row-group in parallel, one might |
| iterate through one column and randomly access another. We call |
| this support for WHERE clauses, after the SQL clause of that name. |
| |
| [[1]] Minimize the number of files per dataset. HDFS is a primary |
| intended deployment platform for these files. The HDFS Namenode |
| requires memory for each file in the filesystem, so an |
| HDFS-friendly format should require as few distinct files as |
| possible. |
| |
| [[1]] Support co-location of columns within row groups. Row groups |
| are the unit of parallel operation on a column dataset. For |
| efficient file I/O, the entirety of a row group should ideally |
| reside on the host that is evaluating the query in order to avoid |
| network latencies and bottlenecks. |
| |
| [[1]] Data integrity. The format should permit applications to |
| detect data corruption. Many filesystems guard against corruption, |
| but files may be moved between filesystems and can be corrupted at |
| points in that process. It is best if the data in a file can be |
| validated independently. |
| |
| [[1]] Extensibility. The format should permit applications to store |
| additional annotations about a dataset in its files, such as type |
| information, origin, etc. Some environments have metadata stores |
| for such information, but not all do, and files might be moved |
| among systems with different metadata systems. Keeping this |
| information within the file itself simplifies its coordination. |
| |
| [[1]] Minimal overhead. The column format should not make datasets |
| appreciably larger. Storage is a primary cost, and choosing this |
| format should not increase it. |
| |
| [[1]] Primary format. The column format should be usable as a |
| primary format for datasets, not as an auxiliary, accelerated |
| format. Applications that process a dataset in row-major order |
| should be able to easily consume column files, and applications |
| that produce datasets in row-major order should be able to easily |
| generate them. |
| |
| * Design |
| |
| To meet these goals we propose the following design. |
| |
| [[1]] Each row group is a separate file. All values of a column in |
| a file are written contiguously. This maximizes the row group size, |
| optimizing performance when queries touch only a few, small columns. |
| |
| [[1]] Each file occupies a single HDFS block. A larger than normal |
| block size may be specified, e.g., ~1GB instead of the typical |
| ~100MB. This guarantees co-location and eliminates network use when |
| query processing can be co-located with the file. This also |
| moderates the memory impact on the HDFS Namenode since no small |
| files are written. |
| |
| [[1]] Each column in a file is written as a sequence of ~64kB |
| compressed blocks. The sequence is prefixed by a table describing |
| all of the blocks in the column to permit random access within the |
| column. |
| |
| [[1]] Application-specific metadata may be added at the file, |
| column, and block levels. |
| |
| [[1]] Checksums are included with each block, providing data integrity. |
| |
| * Discussion |
| |
| The use of a single block per file achieves the same effect as the |
| custom block placement policy described in the {{CIF}} paper, |
| while still permitting HDFS rebalancing and not increasing the |
| number of files in the namespace. |
| |
| Format Specification |
| |
| This section formally describes the proposed column file format. |
| |
| * Data Model |
| |
| We assume a simple data model, where a record is a set of named |
| fields, and the value of each field is a sequence of untyped bytes. |
| A type system may be layered on top of this, as specified in the |
| Type Mapping section below. |
| |
| * Primitive Values |
| |
| We define the following primitive value types (a sketch of their |
| encodings follows this list): |
| |
| * Signed 64-bit <<long>> values are written using a variable-length |
| zig-zag coding, where the high-order bit in each byte determines |
| whether subsequent bytes are present. For example: |
| |
| *---------------*------------* |
| decimal value   | hex bytes |
| *---------------*------------* |
| 0               | 00 |
| *---------------*------------* |
| -1              | 01 |
| *---------------*------------* |
| 1               | 02 |
| *---------------*------------* |
| ... |
| *---------------*------------* |
| -64             | 7f |
| *---------------*------------* |
| 64              | 80 01 |
| *---------------*------------* |
| |
| * <<bytes>> are encoded as a <long> followed by that many bytes of data. |
| |
| * a <<string>> is encoded as a <long> followed by that many bytes of |
| UTF-8 encoded character data. |
| |
| For example, the three-character string "foo" would be encoded as |
| the <long> value 3 (encoded as hex 06) followed by the UTF-8 |
| encoding of 'f', 'o', and 'o' (the hex bytes 66 6f 6f): 06 66 6f 6f |
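| |
| A minimal sketch of these two encodings in Java; the class and |
| method names are illustrative, not part of any reference |
| implementation: |
| |
| --- |
| import java.io.ByteArrayOutputStream; |
| import java.io.IOException; |
| import java.nio.charset.StandardCharsets; |
| |
| public class TrevniEncodingSketch { |
|   /** Writes a signed long as a zig-zag varint: the sign is folded |
|    *  into the low bit, then seven bits are emitted per byte, with |
|    *  the high bit of each byte marking that more bytes follow. */ |
|   static void writeLong(ByteArrayOutputStream out, long n) { |
|     long z = (n << 1) ^ (n >> 63); // zig-zag: 0,-1,1,-2,... -> 0,1,2,3,... |
|     while ((z & ~0x7FL) != 0) { |
|       out.write((int) ((z & 0x7F) | 0x80)); // low 7 bits, high bit set |
|       z >>>= 7; |
|     } |
|     out.write((int) z); // final byte, high bit clear |
|   } |
| |
|   /** Writes a string as a <long> byte count followed by UTF-8 bytes. */ |
|   static void writeString(ByteArrayOutputStream out, String s) |
|       throws IOException { |
|     byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); |
|     writeLong(out, utf8.length); |
|     out.write(utf8); |
|   } |
| |
|   public static void main(String[] args) throws IOException { |
|     ByteArrayOutputStream out = new ByteArrayOutputStream(); |
|     writeString(out, "foo"); // expect: 06 66 6f 6f |
|     for (byte b : out.toByteArray()) System.out.printf("%02x ", b); |
|   } |
| } |
| --- |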
| |
| * Type Names |
| |
| The following type names are used to describe column values: |
| |
| * <<null>>, requires zero bytes. Sometimes used in array columns. |
| |
| * <<boolean>>, one bit, packed into bytes, little-endian. |
| |
| * <<int>>, like <long>, but restricted to 32-bit signed values. |
| |
| * <<long>>, 64-bit signed values, represented as above. |
| |
| * <<fixed32>>, 32-bit values stored as four bytes, little-endian. |
| |
| * <<fixed64>>, 64-bit values stored as eight bytes, little-endian. |
| |
| * <<float>>, a 32-bit IEEE floating point value, little-endian. |
| |
| * <<double>>, a 64-bit IEEE floating point value, little-endian. |
| |
| * <<string>>, as above. |
| |
| * <<bytes>>, as above; may be used to encapsulate more complex objects. |
| |
| [] |
| |
| Type names are represented as <strings> (UTF-8 encoded, length-prefixed). |
| |
| * Metadata |
| |
| <<Metadata>> consists of: |
| |
| * A <long> indicating the number of metadata key/value pairs. |
| |
| * For each pair, a <string> key and <bytes> value. |
| |
| [] |
| |
| All metadata properties that start with "trevni." are reserved. |
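| |
| As an illustration, metadata can be serialized with the primitives |
| above; a minimal sketch in Java, reusing the zig-zag writeLong from |
| the earlier sketch (the class and method names are hypothetical): |
| |
| --- |
| import java.io.ByteArrayOutputStream; |
| import java.io.IOException; |
| import java.nio.charset.StandardCharsets; |
| import java.util.Map; |
| |
| public class MetadataSketch { |
|   // Same zig-zag varint writer as the earlier encoding sketch. |
|   static void writeLong(ByteArrayOutputStream out, long n) { |
|     long z = (n << 1) ^ (n >> 63); |
|     while ((z & ~0x7FL) != 0) { out.write((int) ((z & 0x7F) | 0x80)); z >>>= 7; } |
|     out.write((int) z); |
|   } |
| |
|   /** Metadata: a <long> pair count, then for each pair a <string> |
|    *  key and a <bytes> value, both length-prefixed. */ |
|   static void writeMetadata(ByteArrayOutputStream out, Map<String, byte[]> meta) |
|       throws IOException { |
|     writeLong(out, meta.size()); |
|     for (Map.Entry<String, byte[]> e : meta.entrySet()) { |
|       byte[] key = e.getKey().getBytes(StandardCharsets.UTF_8); |
|       writeLong(out, key.length); |
|       out.write(key); |
|       writeLong(out, e.getValue().length); |
|       out.write(e.getValue()); |
|     } |
|   } |
| } |
| --- |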
| |
| ** File Metadata |
| |
| The following file metadata properties are defined: |
| |
| * <<trevni.codec>> the name of the default compression codec used to |
| compress blocks, as a <string>. Implementations are required to |
| support the "null" codec. Optional. If absent, it is assumed to |
| be "null". Codecs are described in more detail below. |
| |
| * <<trevni.checksum>> the name of the checksum algorithm used in this |
| file, as a <string>. Implementations are required to support the |
| "crc-32” checksum. Optional. If absent, it is assumed to be |
| "null". Checksums are described in more detail below. |
| |
| [] |
| |
| ** Column Metadata |
| |
| The following column metadata properties are defined: |
| |
| * <<trevni.codec>> the name of the compression codec used to compress |
| the blocks of this column, as a <string>. Implementations are |
| required to support the "null" codec. Optional. If absent, it is |
| assumed to be "null". Codecs are described in more detail below. |
| |
| * <<trevni.name>> the name of the column, as a <string>. Required. |
| |
| * <<trevni.type>> the type of data in the column. One of the type names |
| above. Required. |
| |
| * <<trevni.values>> if present, indicates that the initial value of each |
| block in this column will be stored in the block’s descriptor. |
| Not permitted for array columns or columns that specify a parent. |
| |
| * <<trevni.array>> if present, indicates that each row in this column |
| contains a sequence of values of the named type rather than just a |
| single value. An integer length precedes each sequence of values, |
| indicating the count of values in the sequence. A negative length |
| run-length encodes a run of sequences whose lengths are all zero or |
| all one: -1 indicates two zeros, -2 two ones, -3 three zeros, -4 |
| three ones, etc. (see the decoding sketch after this list). |
| |
| * <<trevni.parent>> if present, the name of an <array> column whose |
| lengths are also used by this column. Thus values of this column |
| are sequences but no lengths are stored in this column. |
| |
| [] |
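| |
| The negative-length rule can be read as a small run-length code over |
| the common sequence lengths zero and one. The sketch below shows one |
| decoding consistent with the examples given above; it is |
| illustrative only, and its generalization beyond the listed cases is |
| an assumption: |
| |
| --- |
| import java.util.Arrays; |
| |
| public class ArrayLengthSketch { |
|   /** Expands one length read from an array column into explicit |
|    *  sequence lengths. Non-negative lengths stand for themselves; |
|    *  negative lengths encode runs of all-zero or all-one lengths |
|    *  (-1 = two 0s, -2 = two 1s, -3 = three 0s, -4 = three 1s, ...). */ |
|   static int[] expandLength(int length) { |
|     if (length >= 0) return new int[] { length }; |
|     int m = -length; |
|     int value = (m % 2 == 1) ? 0 : 1; // odd encodes zeros, even ones |
|     int count = (m % 2 == 1) ? (m + 3) / 2 : m / 2 + 1; // 2, 2, 3, 3, ... |
|     int[] run = new int[count]; |
|     Arrays.fill(run, value); |
|     return run; |
|   } |
| } |
| --- |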
| |
| For example, consider the following row, as JSON, where all values |
| are primitive types, but one has multiple values. |
| |
| --- |
| {"id"=566, "date"=23423234234 |
| "from"="foo@bar.com", |
| "to"=["bar@baz.com", "bang@foo.com"], |
| "content"="Hi!"} |
| --- |
| |
| The columns for this might be specified as: |
| |
| --- |
| name=id type=int |
| name=date type=long |
| name=from type=string |
| name=to type=string array=true |
| name=content type=string |
| --- |
| |
| If a row contains an array of records, e.g. "received" in the following: |
| |
| --- |
| {"id"=566, "date"=23423234234 |
| "from"="foo@bar.com", |
| "to"=["bar@baz.com", "bang@foo.com"], |
| "content"="Hi!" |
| "received"=[{"date"=234234234234, "host"="192.168.0.0.1"}, |
| {"date"=234234545645, "host"="192.168.0.0.2"}] |
| } |
| --- |
| |
| Then one can define a parent column followed by a column for each |
| field in the record, adding the following columns: |
| |
| --- |
| name=received type=null array=true |
| name=date type=long parent=received |
| name=host type=string parent=received |
| --- |
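| |
| To illustrate the parent mechanism, the row above would shred |
| roughly as follows: the <received> column stores only a length (its |
| values are <null>), while <date> and <host> store one value per |
| <received> entry and no lengths of their own: |
| |
| --- |
| received: [length=2] |
| date: 234234234234, 234234545645 |
| host: "192.168.0.1", "192.168.0.2" |
| --- |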
| |
| If an array value itself contains an array, e.g. the "sigs" below: |
| |
| --- |
| {"id"=566, "date"=23423234234 |
| "from"="foo@bar.com", |
| "to"=["bar@baz.com", "bang@foo.com"], |
| "content"="Hi!" |
| "received"=[{"date"=234234234234, "host"="192.168.0.0.1", |
| "sigs"=[{"algo"="weak", "value"="0af345de"}]}, |
| {"date"=234234545645, "host"="192.168.0.0.2", |
| "sigs"=[]}] |
| } |
| --- |
| |
| Then a parent column may be defined that itself has a parent column. |
| |
| --- |
| name=sigs type=null array=true parent=received |
| name=algo type=string parent=sigs |
| name=value type=string parent=sigs |
| --- |
| |
| ** Block Metadata |
| |
| No block metadata properties are currently defined. |
| |
| * File Format |
| |
| A <<file>> consists of: |
| |
| * A <file header>, followed by |
| |
| * one or more <columns>. |
| |
| [] |
| |
| A <<file header>> consists of: |
| |
| * Four bytes, ASCII 'T', 'r', 'v', followed by 0x02. |
| |
| * a <fixed64> indicating the number of rows in the file |
| |
| * a <fixed32> indicating the number of columns in the file |
| |
| * file <metadata>. |
| |
| * for each column, its <column metadata> |
| |
| * for each column, its starting position in the file as a <fixed64>. |
| |
| [] |
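| |
| As an illustration, the fixed-size prefix of the file header might |
| be read as follows in Java; the class and method names are |
| hypothetical, and parsing of the metadata and column start positions |
| is omitted: |
| |
| --- |
| import java.io.DataInputStream; |
| import java.io.FileInputStream; |
| import java.io.IOException; |
| import java.util.Arrays; |
| |
| public class HeaderSketch { |
|   static long readFixed64(DataInputStream in) throws IOException { |
|     long v = 0; // little-endian: least significant byte first |
|     for (int i = 0; i < 8; i++) v |= (in.read() & 0xFFL) << (8 * i); |
|     return v; |
|   } |
| |
|   static int readFixed32(DataInputStream in) throws IOException { |
|     int v = 0; // little-endian: least significant byte first |
|     for (int i = 0; i < 4; i++) v |= (in.read() & 0xFF) << (8 * i); |
|     return v; |
|   } |
| |
|   public static void main(String[] args) throws IOException { |
|     try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) { |
|       byte[] magic = new byte[4]; |
|       in.readFully(magic); |
|       if (!Arrays.equals(magic, new byte[] { 'T', 'r', 'v', 0x02 })) |
|         throw new IOException("Not a Trevni file"); |
|       long rows = readFixed64(in); |
|       int columns = readFixed32(in); |
|       System.out.println(rows + " rows, " + columns + " columns"); |
|       // File metadata, column metadata, and column start positions follow. |
|     } |
|   } |
| } |
| --- |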
| |
| A <<column>> consists of: |
| |
| * A <fixed32> indicating the number of blocks in this column. |
| |
| * For each block, a <block descriptor> |
| |
| * One or more <blocks>. |
| |
| [] |
| |
| A <<block descriptor>> consists of: |
| |
| * A <fixed32> indicating the number of rows in the block |
| |
| * A <fixed32> indicating the size in bytes of the block before the |
| codec is applied (excluding checksum). |
| |
| * A <fixed32> indicating the size in bytes of the block after the |
| codec is applied (excluding checksum). |
| |
| * If this column’s metadata declares it to include values, the first |
| value in the block, serialized according to this column's type. |
| |
| [] |
| |
| A <<block>> consists of: |
| |
| * The serialized column values. If a column is an array column then |
| value sequences are preceded by their length, as an <int>. If a |
| codec is specified, the values and lengths are compressed by that |
| codec. |
| |
| * The checksum, as determined by the file metadata. |
| |
| [] |
| |
| * Codecs |
| |
| [null] The "null" codec simply passes data through uncompressed. |
| |
| [deflate] The "deflate" codec writes the data block using the |
| deflate algorithm as specified in RFC 1951. |
| |
| [snappy] The "snappy" codec uses Google's Snappy compression library. |
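| |
| As a sketch, the "deflate" codec can be implemented with |
| java.util.zip.Deflater; passing nowrap=true produces raw RFC 1951 |
| output without a zlib wrapper. The class name is illustrative: |
| |
| --- |
| import java.util.Arrays; |
| import java.util.zip.Deflater; |
| |
| public class DeflateSketch { |
|   /** Compresses a block with raw RFC 1951 deflate (no zlib wrapper). */ |
|   static byte[] deflate(byte[] raw) { |
|     Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); |
|     deflater.setInput(raw); |
|     deflater.finish(); |
|     byte[] buf = new byte[Math.max(64, raw.length)]; |
|     int n = 0; |
|     while (!deflater.finished()) { |
|       if (n == buf.length) buf = Arrays.copyOf(buf, buf.length * 2); |
|       n += deflater.deflate(buf, n, buf.length - n); |
|     } |
|     deflater.end(); |
|     return Arrays.copyOf(buf, n); |
|   } |
| } |
| --- |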
| |
| * Checksum algorithms |
| |
| [null] The "null" checksum contains zero bytes. |
| |
| [crc-32] Each "crc-32" checksum contains the four bytes of an ISO |
| 3309 CRC-32 checksum of the uncompressed block data, as a <fixed32>. |
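| |
| java.util.zip.CRC32 computes this ISO 3309 CRC-32, so the four |
| checksum bytes, stored little-endian per the <fixed32> definition |
| above, might be produced as follows (the class name is |
| illustrative): |
| |
| --- |
| import java.util.zip.CRC32; |
| |
| public class ChecksumSketch { |
|   /** CRC-32 of the uncompressed block data, as a little-endian fixed32. */ |
|   static byte[] crc32Fixed32(byte[] uncompressedBlock) { |
|     CRC32 crc = new CRC32(); |
|     crc.update(uncompressedBlock, 0, uncompressedBlock.length); |
|     long v = crc.getValue(); |
|     return new byte[] { |
|       (byte) v, (byte) (v >> 8), (byte) (v >> 16), (byte) (v >> 24) |
|     }; |
|   } |
| } |
| --- |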
| |
| * Type Mappings |
| |
| We define a standard mapping for how types defined in various |
| serialization systems are represented in a column file. Records |
| from these systems are <shredded> into columns. When records are |
| nested, a depth-first recursive walk can assign a separate column |
| for each primitive value. |
| |
| ** Avro |
| |
| ** Protocol Buffers |
| |
| ** Thrift |
| |
| Implementation Notes |
| |
| Some possible techniques for writing column files include: |
| |
| [[1]] Use a standard ~100MB block, buffer in memory up to the block |
| size, then flush the file directly to HDFS. A single reduce task |
| might create multiple output files. The namenode requires memory |
| proportional to the number of names plus the number of blocks times |
| the replication factor. This would increase the number of names but |
| not blocks, so it should still be much better than a file per column. |
| |
| [[1]] Spill each column to a separate local, temporary file; then, |
| when the file is closed, append these files, writing a single file |
| to HDFS whose block size is set to that of the entire file. This |
| would be a bit slower than (1) and may fail when the local disk is |
| full, but it would make better use of the HDFS namespace and further |
| reduce seeks when processing columns whose values are small. |
| |
| [[1]] Use a separate mapreduce job to convert row-major files to |
| column-major. The map would output (row#, column#, value) tuples, |
| partitioned by row# but sorted by column# then row#. The reducer |
| could directly write the column file. But the column file format |
| would need to be changed to write counts, descriptors, etc. at the |
| end of the file rather than at the front. |
| |
| [] |
| |
| (1) is the simplest to implement and most implementations should |
| start with it. |
| |
| * References |
| |
| {CIF} {{{https://arxiv.org/pdf/1105.4252.pdf}<Column-Oriented Storage |
| Techniques for MapReduce>}}, Floratou, Patel, Shekita, & Tata, VLDB |
| 2011. |
| |
| {DREMEL} {{{https://research.google.com/pubs/archive/36632.pdf}<Dremel: |
| Interactive Analysis of Web-Scale Datasets>}}, Melnik, Gubarev, Long, |
| Romer, Shivakumar, & Tolton, VLDB 2010. |