specification/ORCv2/index.html - orc - Git at Google

 <!DOCTYPE HTML>
 <html lang="en-US">
 <head>
   <meta charset="UTF-8">
   <title>Evolving Draft for ORC Specification v2</title>
   <meta name="viewport" content="width=device-width,initial-scale=1">
   <meta name="generator" content="Jekyll v4.3.4">
   <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
   <link rel="stylesheet" href="/css/screen.css">
   <link rel="icon" type="image/x-icon" href="/favicon.ico">
   <!--[if lt IE 9]>
   <script src="/js/html5shiv.min.js"></script>
   <script src="/js/respond.min.js"></script>
   <![endif]-->
   <!-- Matomo -->
   <script>
     var _paq = window._paq = window._paq || [];
     /* tracker methods like "setCustomDimension" should be called before "trackPageView" */
     _paq.push(["setDoNotTrack", true]);
     _paq.push(["disableCookies"]);
     _paq.push(['trackPageView']);
     _paq.push(['enableLinkTracking']);
     (function() {
       var u="https://analytics.apache.org/";
       _paq.push(['setTrackerUrl', u+'matomo.php']);
       _paq.push(['setSiteId', '68']);
       var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
       g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
     })();
   </script>
   <!-- End Matomo Code -->
 </head>


 <body class="wrap">
   <header role="banner">
   <nav class="mobile-nav show-on-mobiles">
     <ul>
   <li class="">
     <a href="/">Home</a>
   </li>
   <li class="">
     <a href="/releases/"><span class="show-on-mobiles">Rel</span>
                          <span class="hide-on-mobiles">Releases</span></a>
   </li>
   <li class="">
     <a href="/docs/"><span class="show-on-mobiles">Doc</span>
                      <span class="hide-on-mobiles">Documentation</span></a>
   </li>
   <li class="">
     <a href="/talks/"><span class="show-on-mobiles">Talk</span>
                       <span class="hide-on-mobiles">Talks</span></a>
   </li>
   <li class="">
     <a href="/news/">News</a>
   </li>
   <li class="">
     <a href="/develop/"><span class="show-on-mobiles">Dev</span>
                         <span class="hide-on-mobiles">Develop</span></a>
   </li>
   <li class="">
     <a href="/help/">Help</a>
   </li>
 </ul>

   </nav>
   <div class="grid">
     <div class="unit one-quarter center-on-mobiles">
       <h1>
         <a href="/">
           <span class="sr-only">Apache ORC</span>
           <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
         </a>
       </h1>
     </div>
     <nav class="main-nav unit three-quarters hide-on-mobiles">
       <ul>
   <li class="">
     <a href="/">Home</a>
   </li>
   <li class="">
     <a href="/releases/"><span class="show-on-mobiles">Rel</span>
                          <span class="hide-on-mobiles">Releases</span></a>
   </li>
   <li class="">
     <a href="/docs/"><span class="show-on-mobiles">Doc</span>
                      <span class="hide-on-mobiles">Documentation</span></a>
   </li>
   <li class="">
     <a href="/talks/"><span class="show-on-mobiles">Talk</span>
                       <span class="hide-on-mobiles">Talks</span></a>
   </li>
   <li class="">
     <a href="/news/">News</a>
   </li>
   <li class="">
     <a href="/develop/"><span class="show-on-mobiles">Dev</span>
                         <span class="hide-on-mobiles">Develop</span></a>
   </li>
   <li class="">
     <a href="/help/">Help</a>
   </li>
 </ul>

     </nav>
   </div>
 </header>


   <section class="standalone">
   <div class="grid">

     <div class="unit whole">
       <article>
         <h1>Evolving Draft for ORC Specification v2</h1>
         <p>This specification is rapidly evolving and should only be used for
 developers on the project.</p>

 <h1 id="to-do-items">TO DO items</h1>

 <p>The list of things that we plan to change:</p>

 <ul>
   <li>Move decimal encoding to RLEv3 and remove variable length encoding.</li>
   <li>Create a better float/double encoding that splits mantissa and
 exponent.</li>
   <li>Create a dictionary encoding for float, double, and decimal.</li>
   <li>Create RLEv3:
     <ul>
       <li>64 and 128 bit variants</li>
       <li>Zero suppression</li>
       <li>Evaluate the rle subformats</li>
     </ul>
   </li>
   <li>Group stripe data into stripelets to enable Async IO for reads.</li>
   <li>Reorder stripe data into (stripe metadata, index, dictionary, data)</li>
   <li>Stop sorting dictionaries and record the sort order separately in the index.</li>
   <li>Remove use of RLEv1 and RLEv2.</li>
   <li>Remove non-utf8 bloom filter.</li>
   <li>Use numeric value for decimal statistics and bloom filter.</li>
   <li>Add Zstd with dictionary.</li>
 </ul>

 <h1 id="motivation">Motivation</h1>

 <p>Hive’s RCFile was the standard format for storing tabular data in
 Hadoop for several years. However, RCFile has limitations because it
 treats each column as a binary blob without semantics. In Hive 0.11 we
 added a new file format named Optimized Row Columnar (ORC) file that
 uses and retains the type information from the table definition. ORC
 uses type specific readers and writers that provide light weight
 compression techniques such as dictionary encoding, bit packing, delta
 encoding, and run length encoding – resulting in dramatically smaller
 files. Additionally, ORC can apply generic compression using zlib, or
 Snappy on top of the lightweight compression for even smaller
 files. However, storage savings are only part of the gain. ORC
 supports projection, which selects subsets of the columns for reading,
 so that queries reading only one column read only the required
 bytes. Furthermore, ORC files include light weight indexes that
 include the minimum and maximum values for each column in each set of
 10,000 rows and the entire file. Using pushdown filters from Hive, the
 file reader can skip entire sets of rows that aren’t important for
 this query.</p>

 <p><img src="/img/OrcFileLayout.png" alt="ORC file structure" /></p>

 <h1 id="file-tail">File Tail</h1>

 <p>Since HDFS does not support changing the data in a file after it is
 written, ORC stores the top level index at the end of the file. The
 overall structure of the file is given in the figure above.  The
 file’s tail consists of 3 parts; the file metadata, file footer and
 postscript.</p>

 <p>The metadata for ORC is stored using
 <a href="https://s.apache.org/protobuf_encoding">Protocol Buffers</a>, which provides
 the ability to add new fields without breaking readers. This document
 incorporates the Protobuf definition from the
 <a href="https://github.com/apache/orc/blob/main/proto/orc_proto.proto">ORC source code</a> and the
 reader is encouraged to review the Protobuf encoding if they need to
 understand the byte-level encoding</p>

 <p>The sections of the file tail are (and their protobuf message type):</p>
 <ul>
   <li>encrypted stripe statistics: list of ColumnarStripeStatistics</li>
   <li>stripe statistics: Metadata</li>
   <li>footer: Footer</li>
   <li>postscript: PostScript</li>
   <li>psLen: byte</li>
 </ul>

 <h2 id="postscript">Postscript</h2>

 <p>The Postscript section provides the necessary information to interpret
 the rest of the file including the length of the file’s Footer and
 Metadata sections, the version of the file, and the kind of general
 compression used (eg. none, zlib, or snappy). The Postscript is never
 compressed and ends one byte before the end of the file. The version
 stored in the Postscript is the lowest version of Hive that is
 guaranteed to be able to read the file and it stored as a sequence of
 the major and minor version. This file version is encoded as [0,12].</p>

 <p>The process of reading an ORC file works backwards through the
 file. Rather than making multiple short reads, the ORC reader reads
 the last 16k bytes of the file with the hope that it will contain both
 the Footer and Postscript sections. The final byte of the file
 contains the serialized length of the Postscript, which must be less
 than 256 bytes. Once the Postscript is parsed, the compressed
 serialized length of the Footer is known and it can be decompressed
 and parsed.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message PostScript {
  // the length of the footer section in bytes
  optional uint64 footerLength = 1;
  // the kind of generic compression used
  optional CompressionKind compression = 2;
  // the maximum size of each compression chunk
  optional uint64 compressionBlockSize = 3;
  // the version of the writer
  repeated uint32 version = 4 [packed = true];
  // the length of the metadata section in bytes
  optional uint64 metadataLength = 5;
  // the fixed string "ORC"
  optional string magic = 8000;
 }
 </code></pre></div></div>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>enum CompressionKind {
  NONE = 0;
  ZLIB = 1;
  SNAPPY = 2;
  LZO = 3;
  LZ4 = 4;
  ZSTD = 5;
 }
 </code></pre></div></div>

 <h2 id="footer">Footer</h2>

 <p>The Footer section contains the layout of the body of the file, the
 type schema information, the number of rows, and the statistics about
 each of the columns.</p>

 <p>The file is broken in to three parts- Header, Body, and Tail. The
 Header consists of the bytes “ORC’’ to support tools that want to
 scan the front of the file to determine the type of the file. The Body
 contains the rows and indexes, and the Tail gives the file level
 information as described in this section.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message Footer {
  // the length of the file header in bytes (always 3)
  optional uint64 headerLength = 1;
  // the length of the file header and body in bytes
  optional uint64 contentLength = 2;
  // the information about the stripes
  repeated StripeInformation stripes = 3;
  // the schema information
  repeated Type types = 4;
  // the user metadata that was added
  repeated UserMetadataItem metadata = 5;
  // the total number of rows in the file
  optional uint64 numberOfRows = 6;
  // the statistics of each column across the file
  repeated ColumnStatistics statistics = 7;
  // the maximum number of rows in each index entry
  optional uint32 rowIndexStride = 8;
  // Each implementation that writes ORC files should register for a code
  // 0 = ORC Java
  // 1 = ORC C++
  // 2 = Presto
  // 3 = Scritchley Go from https://github.com/scritchley/orc
  // 4 = Trino
  // 5 = CUDF
  optional uint32 writer = 9;
  // information about the encryption in this file
  optional Encryption encryption = 10;
  // the number of bytes in the encrypted stripe statistics
  optional uint64 stripeStatisticsLength = 11;
 }
 </code></pre></div></div>

 <h3 id="stripe-information">Stripe Information</h3>

 <p>The body of the file is divided into stripes. Each stripe is self
 contained and may be read using only its own bytes combined with the
 file’s Footer and Postscript. Each stripe contains only entire rows so
 that rows never straddle stripe boundaries. Stripes have three
 sections: a set of indexes for the rows within the stripe, the data
 itself, and a stripe footer. Both the indexes and the data sections
 are divided by columns so that only the data for the required columns
 needs to be read.</p>

 <p>The encryptStripeId and encryptedLocalKeys support column
 encryption. They are set on the first stripe of each ORC file with
 column encryption and not set after that. For a stripe with the values
 set, the reader should use those values for that stripe. Subsequent
 stripes use the previous encryptStripeId + 1 and the same keys.</p>

 <p>The current ORC merging code merges entire files, and thus the reader
 will get the correct values on what was the first stripe and continue
 on. If we develop a merge tool that reorders stripes or does partial
 merges, these values will need to be set correctly by that tool.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message StripeInformation {
  // the start of the stripe within the file
  optional uint64 offset = 1;
  // the length of the indexes in bytes
  optional uint64 indexLength = 2;
  // the length of the data in bytes
  optional uint64 dataLength = 3;
  // the length of the footer in bytes
  optional uint64 footerLength = 4;
  // the number of rows in the stripe
  optional uint64 numberOfRows = 5;
  // If this is present, the reader should use this value for the encryption
  // stripe id for setting the encryption IV. Otherwise, the reader should
  // use one larger than the previous stripe's encryptStripeId.
  // For unmerged ORC files, the first stripe will use 1 and the rest of the
  // stripes won't have it set. For merged files, the stripe information
  // will be copied from their original files and thus the first stripe of
  // each of the input files will reset it to 1.
  // Note that 1 was choosen, because protobuf v3 doesn't serialize
  // primitive types that are the default (eg. 0).
  optional uint64 encryptStripeId = 6;
  // For each encryption variant, the new encrypted local key to use until we
  // find a replacement.
  repeated bytes encryptedLocalKeys = 7;
 }
 </code></pre></div></div>

 <h3 id="type-information">Type Information</h3>

 <p>All of the rows in an ORC file must have the same schema. Logically
 the schema is expressed as a tree as in the figure below, where
 the compound types have subcolumns under them.</p>

 <p><img src="/img/TreeWriters.png" alt="ORC column structure" /></p>

 <p>The equivalent Hive DDL would be:</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>create table Foobar (
  myInt int,
  myMap map&lt;string,
  struct&lt;myString : string,
  myDouble: double&gt;&gt;,
  myTime timestamp
 );
 </code></pre></div></div>

 <p>The type tree is flattened in to a list via a pre-order traversal
 where each type is assigned the next id. Clearly the root of the type
 tree is always type id 0. Compound types have a field named subtypes
 that contains the list of their children’s type ids.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message Type {
  enum Kind {
  BOOLEAN = 0;
  BYTE = 1;
  SHORT = 2;
  INT = 3;
  LONG = 4;
  FLOAT = 5;
  DOUBLE = 6;
  STRING = 7;
  BINARY = 8;
  TIMESTAMP = 9;
  LIST = 10;
  MAP = 11;
  STRUCT = 12;
  UNION = 13;
  DECIMAL = 14;
  DATE = 15;
  VARCHAR = 16;
  CHAR = 17;
  TIMESTAMP_INSTANT = 18;
  }
  // the kind of this type
  required Kind kind = 1;
  // the type ids of any subcolumns for list, map, struct, or union
  repeated uint32 subtypes = 2 [packed=true];
  // the list of field names for struct
  repeated string fieldNames = 3;
  // the maximum length of the type for varchar or char in UTF-8 characters
  optional uint32 maximumLength = 4;
  // the precision and scale for decimal
  optional uint32 precision = 5;
  optional uint32 scale = 6;
 }
 </code></pre></div></div>

 <h3 id="column-statistics">Column Statistics</h3>

 <p>The goal of the column statistics is that for each column, the writer
 records the count and depending on the type other useful fields. For
 most of the primitive types, it records the minimum and maximum
 values; and for numeric types it additionally stores the sum.
 From Hive 1.1.0 onwards, the column statistics will also record if
 there are any null values within the row group by setting the hasNull flag.
 The hasNull flag is used by ORC’s predicate pushdown to better answer
 ‘IS NULL’ queries.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message ColumnStatistics {
  // the number of values
  optional uint64 numberOfValues = 1;
  // At most one of these has a value for any column
  optional IntegerStatistics intStatistics = 2;
  optional DoubleStatistics doubleStatistics = 3;
  optional StringStatistics stringStatistics = 4;
  optional BucketStatistics bucketStatistics = 5;
  optional DecimalStatistics decimalStatistics = 6;
  optional DateStatistics dateStatistics = 7;
  optional BinaryStatistics binaryStatistics = 8;
  optional TimestampStatistics timestampStatistics = 9;
  optional bool hasNull = 10;
 }
 </code></pre></div></div>

 <p>For integer types (tinyint, smallint, int, bigint), the column
 statistics includes the minimum, maximum, and sum. If the sum
 overflows long at any point during the calculation, no sum is
 recorded.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message IntegerStatistics {
  optional sint64 minimum = 1;
  optional sint64 maximum = 2;
  optional sint64 sum = 3;
 }
 </code></pre></div></div>

 <p>For floating point types (float, double), the column statistics
 include the minimum, maximum, and sum. If the sum overflows a double,
 no sum is recorded.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message DoubleStatistics {
  optional double minimum = 1;
  optional double maximum = 2;
  optional double sum = 3;
 }
 </code></pre></div></div>

 <p>For strings, the minimum value, maximum value, and the sum of the
 lengths of the values are recorded.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message StringStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  // sum will store the total length of all strings
  optional sint64 sum = 3;
 }
 </code></pre></div></div>

 <p>For booleans, the statistics include the count of false and true values.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message BucketStatistics {
  repeated uint64 count = 1 [packed=true];
 }
 </code></pre></div></div>

 <p>For decimals, the minimum, maximum, and sum are stored.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
 }
 </code></pre></div></div>

 <p>Date columns record the minimum and maximum values as the number of
 days since the UNIX epoch (1/1/1970 in UTC).</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message DateStatistics {
  // min,max values saved as days since epoch
  optional sint32 minimum = 1;
  optional sint32 maximum = 2;
 }
 </code></pre></div></div>

 <p>Timestamp columns record the minimum and maximum values as the number of
 milliseconds since the UNIX epoch (1/1/1970 00:00:00). The timestamp is
 adjusted to UTC before being converted to milliseconds and stored in
 <code class="language-plaintext highlighter-rouge">minimumUtc</code> and <code class="language-plaintext highlighter-rouge">maximumUtc</code>.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message TimestampStatistics {
  // min,max values saved as milliseconds since epoch
  optional sint64 minimum = 1;
  optional sint64 maximum = 2;
  // min,max values saved as milliseconds since UNIX epoch
  optional sint64 minimumUtc = 3;
  optional sint64 maximumUtc = 4;
 }
 </code></pre></div></div>

 <p>Binary columns store the aggregate number of bytes across all of the values.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message BinaryStatistics {
  // sum will store the total binary blob length
  optional sint64 sum = 1;
 }
 </code></pre></div></div>

 <h3 id="user-metadata">User Metadata</h3>

 <p>The user can add arbitrary key/value pairs to an ORC file as it is
 written. The contents of the keys and values are completely
 application defined, but the key is a string and the value is
 binary. Care should be taken by applications to make sure that their
 keys are unique and in general should be prefixed with an organization
 code.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message UserMetadataItem {
  // the user defined key
  required string name = 1;
  // the user defined binary value
  required bytes value = 2;
 }
 </code></pre></div></div>

 <h3 id="file-metadata">File Metadata</h3>

 <p>The file Metadata section contains column statistics at the stripe
 level granularity. These statistics enable input split elimination
 based on the predicate push-down evaluated per a stripe.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message StripeStatistics {
  repeated ColumnStatistics colStats = 1;
 }
 </code></pre></div></div>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message Metadata {
  repeated StripeStatistics stripeStats = 1;
 }
 </code></pre></div></div>

 <h1 id="column-encryption">Column Encryption</h1>

 <p>ORC as of Apache ORC 1.6 supports column encryption where the data and
 statistics of specific columns are encrypted on disk. Column
 encryption provides fine-grain column level security even when many
 users have access to the file itself. The encryption is transparent to
 the user and the writer only needs to define which columns and
 encryption keys to use. When reading an ORC file, if the user has
 access to the keys, they will get the real data. If they do not have
 the keys, they will get the masked data.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message Encryption {
   // all of the masks used in this file
   repeated DataMask mask = 1;
   // all of the keys used in this file
   repeated EncryptionKey key = 2;
   // The encrypted variants.
   // Readers should prefer the first variant that the user has access to
   // the corresponding key. If they don't have access to any of the keys,
   // they should get the unencrypted masked data.
   repeated EncryptionVariant variants = 3;
   // How are the local keys encrypted?
   optional KeyProviderKind keyProvider = 4;
 }
 </code></pre></div></div>

 <p>Each encrypted column in each file will have a random local key
 generated for it. Thus, even though all of the decryption happens
 locally in the reader, a malicious user that stores the key only
 enables access that column in that file. The local keys are encrypted
 by the Hadoop or Ranger Key Management Server (KMS). The encrypted
 local keys are stored in the file footer’s StripeInformation.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>enum KeyProviderKind {
   UNKNOWN = 0;
   HADOOP = 1;
   AWS = 2;
   GCP = 3;
   AZURE = 4;
 }
 </code></pre></div></div>

 <p>When ORC is using the Hadoop or Ranger KMS, it generates a random encrypted
 local key (16 or 32 bytes for 128 or 256 bit AES respectively). Using the
 first 16 bytes as the IV, it uses AES/CTR to decrypt the local key.</p>

 <p>With the AWS KMS, the GenerateDataKey method is used to create a new local
 key and the Decrypt method is used to decrypt it.</p>

 <h2 id="data-masks">Data Masks</h2>

 <p>The user’s data is statically masked before writing the unencrypted
 variant. Because the masking was done statically when the file was
 written, the information about the masking is just informational.</p>

 <p>The three standard masks are:</p>

 <ul>
   <li>nullify - all values become null</li>
   <li>redact - replace characters with constants such as X or 9</li>
   <li>sha256 - replace string with the SHA 256 of the value</li>
 </ul>

 <p>The default is nullify, but masks may be defined by the user. Masks
 are not allowed to change the type of the column, just the values.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message DataMask {
   // the kind of masking, which may include third party masks
   optional string name = 1;
   // parameters for the mask
   repeated string maskParameters = 2;
   // the unencrypted column roots this mask was applied to
   repeated uint32 columns = 3 [packed = true];
 }
 </code></pre></div></div>

 <h2 id="encryption-keys">Encryption Keys</h2>

 <p>In addition to the encrypted local keys, which are stored in the
 footer’s StripeInformation, the file also needs to describe the master
 key that was used to encrypt the local keys. The master keys are
 described by name, their version, and the encryption algorithm.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message EncryptionKey {
   optional string keyName = 1;
   optional uint32 keyVersion = 2;
   optional EncryptionAlgorithm algorithm = 3;
 }
 </code></pre></div></div>

 <p>The encryption algorithm is stored using an enumeration and since
 ProtoBuf uses the 0 value as a default, we added an unused value. That
 ensures that if we add a new algorithm that old readers will get
 UNKNOWN_ENCRYPTION instead of a real value.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>enum EncryptionAlgorithm {
   // used for detecting future algorithms
   UNKNOWN_ENCRYPTION = 0;
   // 128 bit AES/CTR
   AES_CTR_128 = 1;
   // 256 bit AES/CTR
   AES_CTR_256 = 2;
 }
 </code></pre></div></div>

 <h2 id="encryption-variants">Encryption Variants</h2>

 <p>Each encrypted column is written as two variants:</p>

 <ul>
   <li>encrypted unmasked - for users with access to the key</li>
   <li>unencrypted masked - for all other users</li>
 </ul>

 <p>The changes to the format were done so that old ORC readers will read
 the masked unencrypted data. Encryption variants encrypt a subtree of
 columns and use a single local key. The initial version of encryption
 support only allows the two variants, but this may be extended later
 and thus readers should use the first variant of a column that the
 reader has access to.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message EncryptionVariant {
   // the column id of the root column that is encrypted in this variant
   optional uint32 root = 1;
   // the key that encrypted this variant
   optional uint32 key = 2;
   // The master key that was used to encrypt the local key, referenced as
   // an index into the Encryption.key list.
   optional bytes encryptedKey = 3;
   // the stripe statistics for this variant
   repeated Stream stripeStatistics = 4;
   // encrypted file statistics as a FileStatistics
   optional bytes fileStatistics = 5;
 }
 </code></pre></div></div>

 <p>Each variant stores stripe and file statistics separately. The file
 statistics are serialized as a FileStatistics, compressed, encrypted
 and stored in the EncryptionVariant.fileStatistics.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message FileStatistics {
   repeated ColumnStatistics column = 1;
 }
 </code></pre></div></div>

 <p>The stripe statistics for each column are serialized as
 ColumnarStripeStatistics, compressed, encrypted and stored in a stream
 of kind STRIPE_STATISTICS. By making the column stripe statistics
 independent of each other, the reader only reads and parses the
 columns contained in the SARG.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message ColumnarStripeStatistics {
   // one value for each stripe in the file
   repeated ColumnStatistics colStats = 1;
 }
 </code></pre></div></div>

 <h2 id="stream-encryption">Stream Encryption</h2>

 <p>Our encryption is done using AES/CTR. CTR is a mode that has some very
 nice properties for us:</p>

 <ul>
   <li>It is seeded so that identical data is encrypted differently.</li>
   <li>It does not require padding the stream to the cipher length.</li>
   <li>It allows readers to seek in to a stream.</li>
   <li>The IV does not need to be randomly generated.</li>
 </ul>

 <p>To ensure that we don’t reuse IV, we set the IV as:</p>

 <ul>
   <li>bytes 0 to 2  - column id</li>
   <li>bytes 3 to 4  - stream kind</li>
   <li>bytes 5 to 7  - stripe id</li>
   <li>bytes 8 to 15 - cipher block counter</li>
 </ul>

 <p>However, it is critical for CTR that we never reuse an initialization
 vector (IV) with the same local key.</p>

 <p>For data in the footer, use the number of stripes in the file as the
 stripe id. This guarantees when we write an intermediate footer in to
 a file that we don’t use the same IV.</p>

 <p>Additionally, we never reuse a local key for new data. For example, when
 merging files, we don’t reuse local key from the input files for the new
 file tail, but always generate a new local key.</p>

 <h1 id="compression">Compression</h1>

 <p>If the ORC file writer selects a generic compression codec (zlib or
 snappy), every part of the ORC file except for the Postscript is
 compressed with that codec. However, one of the requirements for ORC
 is that the reader be able to skip over compressed bytes without
 decompressing the entire stream. To manage this, ORC writes compressed
 streams in chunks with headers as in the figure below.
 To handle uncompressable data, if the compressed data is larger than
 the original, the original is stored and the isOriginal flag is
 set. Each header is 3 bytes long with (compressedLength * 2 +
 isOriginal) stored as a little endian value. For example, the header
 for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d,
 0x03]. The header for 5 bytes that did not compress would be [0x0b,
 0x00, 0x00]. Each compression chunk is compressed independently so
 that as long as a decompressor starts at the top of a header, it can
 start decompressing without the previous bytes.</p>

 <p><img src="/img/CompressionStream.png" alt="compression streams" /></p>

 <p>The default compression chunk size is 256K, but writers can choose
 their own value. Larger chunks lead to better compression, but require
 more memory. The chunk size is recorded in the Postscript so that
 readers can allocate appropriately sized buffers. Readers are
 guaranteed that no chunk will expand to more than the compression chunk
 size.</p>

 <p>ORC files without generic compression write each stream directly
 with no headers.</p>

 <h1 id="run-length-encoding">Run Length Encoding</h1>

 <h2 id="base-128-varint">Base 128 Varint</h2>

 <p>Variable width integer encodings take advantage of the fact that most
 numbers are small and that having smaller encodings for small numbers
 shrinks the overall size of the data. ORC uses the varint format from
 Protocol Buffers, which writes data in little endian format using the
 low 7 bits of each byte. The high bit in each byte is set if the
 number continues into the next byte.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Unsigned Original</th>
       <th style="text-align: left">Serialized</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">0</td>
       <td style="text-align: left">0x00</td>
     </tr>
     <tr>
       <td style="text-align: left">1</td>
       <td style="text-align: left">0x01</td>
     </tr>
     <tr>
       <td style="text-align: left">127</td>
       <td style="text-align: left">0x7f</td>
     </tr>
     <tr>
       <td style="text-align: left">128</td>
       <td style="text-align: left">0x80, 0x01</td>
     </tr>
     <tr>
       <td style="text-align: left">129</td>
       <td style="text-align: left">0x81, 0x01</td>
     </tr>
     <tr>
       <td style="text-align: left">16,383</td>
       <td style="text-align: left">0xff, 0x7f</td>
     </tr>
     <tr>
       <td style="text-align: left">16,384</td>
       <td style="text-align: left">0x80, 0x80, 0x01</td>
     </tr>
     <tr>
       <td style="text-align: left">16,385</td>
       <td style="text-align: left">0x81, 0x80, 0x01</td>
     </tr>
   </tbody>
 </table>

 <p>For signed integer types, the number is converted into an unsigned
 number using a zigzag encoding. Zigzag encoding moves the sign bit to
 the least significant bit using the expression (val « 1) ^ (val »
 63) and derives its name from the fact that positive and negative
 numbers alternate once encoded. The unsigned number is then serialized
 as above.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Signed Original</th>
       <th style="text-align: left">Unsigned</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">0</td>
       <td style="text-align: left">0</td>
     </tr>
     <tr>
       <td style="text-align: left">-1</td>
       <td style="text-align: left">1</td>
     </tr>
     <tr>
       <td style="text-align: left">1</td>
       <td style="text-align: left">2</td>
     </tr>
     <tr>
       <td style="text-align: left">-2</td>
       <td style="text-align: left">3</td>
     </tr>
     <tr>
       <td style="text-align: left">2</td>
       <td style="text-align: left">4</td>
     </tr>
   </tbody>
 </table>

 <h2 id="byte-run-length-encoding">Byte Run Length Encoding</h2>

 <p>For byte streams, ORC uses a very light weight encoding of identical
 values.</p>

 <ul>
   <li>Run - a sequence of at least 3 identical values</li>
   <li>Literals - a sequence of non-identical values</li>
 </ul>

 <p>The first byte of each group of values is a header that determines
 whether it is a run (value between 0 to 127) or literal list (value
 between -128 to -1). For runs, the control byte is the length of the
 run minus the length of the minimal run (3) and the control byte for
 literal lists is the negative length of the list. For example, a
 hundred 0’s is encoded as [0x61, 0x00] and the sequence 0x44, 0x45
 would be encoded as [0xfe, 0x44, 0x45]. The next group can choose
 either of the encodings.</p>

 <h2 id="boolean-run-length-encoding">Boolean Run Length Encoding</h2>

 <p>For encoding boolean types, the bits are put in the bytes from most
 significant to least significant. The bytes are encoded using byte run
 length encoding as described in the previous section. For example,
 the byte sequence [0xff, 0x80] would be one true followed by
 seven false values.</p>

 <h2 id="integer-run-length-encoding-version-1">Integer Run Length Encoding, version 1</h2>

 <p>In Hive 0.11 ORC files used Run Length Encoding version 1 (RLEv1),
 which provides a lightweight compression of signed or unsigned integer
 sequences. RLEv1 has two sub-encodings:</p>

 <ul>
   <li>Run - a sequence of values that differ by a small fixed delta</li>
   <li>Literals - a sequence of varint encoded values</li>
 </ul>

 <p>Runs start with an initial byte of 0x00 to 0x7f, which encodes the
 length of the run - 3. A second byte provides the fixed delta in the
 range of -128 to 127. Finally, the first value of the run is encoded
 as a base 128 varint.</p>

 <p>For example, if the sequence is 100 instances of 7 the encoding would
 start with 100 - 3, followed by a delta of 0, and a varint of 7 for
 an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers
 running from 100 to 1, the first byte is 100 - 3, the delta is -1,
 and the varint is 100 for an encoding of [0x61, 0xff, 0x64].</p>

 <p>Literals start with an initial byte of 0x80 to 0xff, which corresponds
 to the negative of number of literals in the sequence. Following the
 header byte, the list of N varints is encoded. Thus, if there are
 no runs, the overhead is 1 byte for each 128 integers. Numbers
 [2, 3, 6, 7, 11] would be encoded as [0xfb, 0x02, 0x03, 0x06, 0x07, 0xb].</p>

 <h2 id="integer-run-length-encoding-version-2">Integer Run Length Encoding, version 2</h2>

 <p>In Hive 0.12, ORC introduced Run Length Encoding version 2 (RLEv2),
 which has improved compression and fixed bit width encodings for
 faster expansion. RLEv2 uses four sub-encodings based on the data:</p>

 <ul>
   <li>Short Repeat - used for short sequences with repeated values</li>
   <li>Direct - used for random sequences with a fixed bit width</li>
   <li>Patched Base - used for random sequences with a variable bit width</li>
   <li>Delta - used for monotonically increasing or decreasing sequences</li>
 </ul>

 <h3 id="short-repeat">Short Repeat</h3>

 <p>The short repeat encoding is used for short repeating integer
 sequences with the goal of minimizing the overhead of the header. All
 of the bits listed in the header are from the first byte to the last
 and from most significant bit to least significant bit. If the type is
 signed, the value is zigzag encoded.</p>

 <ul>
   <li>1 byte header
     <ul>
       <li>2 bits for encoding type (0)</li>
       <li>3 bits for width (W) of repeating value (1 to 8 bytes)</li>
       <li>3 bits for repeat count (3 to 10 values)</li>
     </ul>
   </li>
   <li>W bytes in big endian format, which is zigzag encoded if they type
 is signed</li>
 </ul>

 <p>The unsigned sequence of [10000, 10000, 10000, 10000, 10000] would be
 serialized with short repeat encoding (0), a width of 2 bytes (1), and
 repeat count of 5 (2) as [0x0a, 0x27, 0x10].</p>

 <h3 id="direct">Direct</h3>

 <p>The direct encoding is used for integer sequences whose values have a
 relatively constant bit width. It encodes the values directly using a
 fixed width big endian encoding. The width of the values is encoded
 using the table below.</p>

 <p>The 5 bit width encoding table for RLEv2:</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Width in Bits</th>
       <th style="text-align: left">Encoded Value</th>
       <th style="text-align: left">Notes</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">0</td>
       <td style="text-align: left">0</td>
       <td style="text-align: left">for delta encoding</td>
     </tr>
     <tr>
       <td style="text-align: left">1</td>
       <td style="text-align: left">0</td>
       <td style="text-align: left">for non-delta encoding</td>
     </tr>
     <tr>
       <td style="text-align: left">2</td>
       <td style="text-align: left">1</td>
       <td style="text-align: left"> </td>
     </tr>
     <tr>
       <td style="text-align: left">4</td>
       <td style="text-align: left">3</td>
       <td style="text-align: left"> </td>
     </tr>
     <tr>
       <td style="text-align: left">8</td>
       <td style="text-align: left">7</td>
       <td style="text-align: left"> </td>
     </tr>
     <tr>
       <td style="text-align: left">16</td>
       <td style="text-align: left">15</td>
       <td style="text-align: left"> </td>
     </tr>
     <tr>
       <td style="text-align: left">24</td>
       <td style="text-align: left">23</td>
       <td style="text-align: left"> </td>
     </tr>
     <tr>
       <td style="text-align: left">32</td>
       <td style="text-align: left">27</td>
       <td style="text-align: left"> </td>
     </tr>
     <tr>
       <td style="text-align: left">40</td>
       <td style="text-align: left">28</td>
       <td style="text-align: left"> </td>
     </tr>
     <tr>
       <td style="text-align: left">48</td>
       <td style="text-align: left">29</td>
       <td style="text-align: left"> </td>
     </tr>
     <tr>
       <td style="text-align: left">56</td>
       <td style="text-align: left">30</td>
       <td style="text-align: left"> </td>
     </tr>
     <tr>
       <td style="text-align: left">64</td>
       <td style="text-align: left">31</td>
       <td style="text-align: left"> </td>
     </tr>
     <tr>
       <td style="text-align: left">3</td>
       <td style="text-align: left">2</td>
       <td style="text-align: left">deprecated</td>
     </tr>
     <tr>
       <td style="text-align: left">5 &lt;= x &lt;= 7</td>
       <td style="text-align: left">x - 1</td>
       <td style="text-align: left">deprecated</td>
     </tr>
     <tr>
       <td style="text-align: left">9 &lt;= x &lt;= 15</td>
       <td style="text-align: left">x - 1</td>
       <td style="text-align: left">deprecated</td>
     </tr>
     <tr>
       <td style="text-align: left">17 &lt;= x &lt;= 21</td>
       <td style="text-align: left">x - 1</td>
       <td style="text-align: left">deprecated</td>
     </tr>
     <tr>
       <td style="text-align: left">26</td>
       <td style="text-align: left">24</td>
       <td style="text-align: left">deprecated</td>
     </tr>
     <tr>
       <td style="text-align: left">28</td>
       <td style="text-align: left">25</td>
       <td style="text-align: left">deprecated</td>
     </tr>
     <tr>
       <td style="text-align: left">30</td>
       <td style="text-align: left">26</td>
       <td style="text-align: left">deprecated</td>
     </tr>
   </tbody>
 </table>

 <ul>
   <li>2 bytes header
     <ul>
       <li>2 bits for encoding type (1)</li>
       <li>5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
 width encoding table</li>
       <li>9 bits for length (L) (1 to 512 values)</li>
     </ul>
   </li>
   <li>W * L bits (padded to the next byte) encoded in big endian format, which is
 zigzag encoding if the type is signed</li>
 </ul>

 <p>The unsigned sequence of [23713, 43806, 57005, 48879] would be
 serialized with direct encoding (1), a width of 16 bits (15), and
 length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
 0xbe, 0xef].</p>

 <blockquote>
   <p>Note: the run length(4) is one-off. We can get 4 by adding 1 to 3
 (See <a href="https://github.com/apache/hive/commit/69deabeaac020ba60b0f2156579f53e9fe46157a#diff-c00fea1863eaf0d6f047535e874274199020ffed3eb00deb897f513aa86f6b59R232-R236">Hive-4123</a>)</p>
 </blockquote>

 <p><img src="/img/Direct.png" alt="Direct" /></p>

 <h3 id="patched-base">Patched Base</h3>

 <p>The patched base encoding is used for integer sequences whose bit
 widths varies a lot. The minimum signed value of the sequence is found
 and subtracted from the other values. The bit width of those adjusted
 values is analyzed and the 95 percentile of the bit width is chosen
 as W. The 5% of values larger than W use patches from a patch list
 to set the additional bits. Patches are encoded as a list of gaps in
 the index values and the additional value bits.</p>

 <ul>
   <li>4 bytes header
     <ul>
       <li>2 bits for encoding type (2)</li>
       <li>5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
   width encoding table</li>
       <li>9 bits for length (L) (1 to 512 values)</li>
       <li>3 bits for base value width (BW) (1 to 8 bytes)</li>
       <li>5 bits for patch width (PW) (1 to 64 bits) using  the 5 bit width
 encoding table</li>
       <li>3 bits for patch gap width (PGW) (1 to 8 bits)</li>
       <li>5 bits for patch list length (PLL) (0 to 31 patches)</li>
     </ul>
   </li>
   <li>Base value (BW bytes) - The base value is stored as a big endian value
 with negative values marked by the most significant bit set. If it that
 bit is set, the entire value is negated.</li>
   <li>Data values (W * L bits padded to the byte) - A sequence of W bit positive
 values that are added to the base value.</li>
   <li>Patch list (PLL * closestFixedBits(PGW + PW) bits) - A list of patches for
 values that didn’t fit within W bits. Each entry in the list consists of a
 gap, which is the number of elements skipped from the previous
 patch, and a patch value. Patches are applied by logically or’ing
 the data values with the relevant patch shifted W bits left. If a
 patch is 0, it was introduced to skip over more than 255 items. The
 combined length of each patch (PGW + PW) must be less or equal to 64.
 (PGW + PW) is padded to the closest fixed bit size according to the
 below table before being encoded in the patch list.</li>
 </ul>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">(PGW + PW)</th>
       <th style="text-align: left">closestFixedBits(PGW + PW)</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">1 &lt;= x &lt;= 24</td>
       <td style="text-align: left">x</td>
     </tr>
     <tr>
       <td style="text-align: left">25</td>
       <td style="text-align: left">26</td>
     </tr>
     <tr>
       <td style="text-align: left">26</td>
       <td style="text-align: left">26</td>
     </tr>
     <tr>
       <td style="text-align: left">27</td>
       <td style="text-align: left">28</td>
     </tr>
     <tr>
       <td style="text-align: left">28</td>
       <td style="text-align: left">28</td>
     </tr>
     <tr>
       <td style="text-align: left">29</td>
       <td style="text-align: left">30</td>
     </tr>
     <tr>
       <td style="text-align: left">30</td>
       <td style="text-align: left">30</td>
     </tr>
     <tr>
       <td style="text-align: left">31</td>
       <td style="text-align: left">32</td>
     </tr>
     <tr>
       <td style="text-align: left">32</td>
       <td style="text-align: left">32</td>
     </tr>
     <tr>
       <td style="text-align: left">33 &lt;= x &lt;= 40</td>
       <td style="text-align: left">40</td>
     </tr>
     <tr>
       <td style="text-align: left">41 &lt;= x &lt;= 48</td>
       <td style="text-align: left">48</td>
     </tr>
     <tr>
       <td style="text-align: left">49 &lt;= x &lt;= 56</td>
       <td style="text-align: left">56</td>
     </tr>
     <tr>
       <td style="text-align: left">57 &lt;= x &lt;= 64</td>
       <td style="text-align: left">64</td>
     </tr>
   </tbody>
 </table>

 <p>The unsigned sequence of [2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070,
 2080, 2090, 2100, 2110, 2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190]
 has a minimum of 2000, which makes the adjusted
 sequence [30, 0, 20, 998000, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
 150, 160, 170, 180, 190]. It has an
 encoding of patched base (2), a bit width of 8 (7), a length of 20
 (19), a base value width of 2 bytes (1), a patch width of 12 bits (11),
 patch gap width of 2 bits (1), and a patch list length of 1 (1). The
 base value is 2000 and the combined result is [0x8e, 0x13, 0x2b, 0x21, 0x07,
 0xd0, 0x1e, 0x00, 0x14, 0x70, 0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e,
 0x78, 0x82, 0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8]</p>

 <h3 id="delta">Delta</h3>

 <p>The Delta encoding is used for monotonically increasing or decreasing
 sequences. The first two numbers in the sequence can not be identical,
 because the encoding is using the sign of the first delta to determine
 if the series is increasing or decreasing.</p>

 <ul>
   <li>2 bytes header
     <ul>
       <li>2 bits for encoding type (3)</li>
       <li>5 bits for encoded width (W) of deltas (0 to 64 bits) using the 5 bit
 width encoding table</li>
       <li>9 bits for run length (L) (1 to 512 values)</li>
     </ul>
   </li>
   <li>Base value - encoded as (signed or unsigned) varint</li>
   <li>Delta base - encoded as signed varint</li>
   <li>Delta values (W * (L - 2)) bytes - encode each delta after the first
 one. If the delta base is positive, the sequence is increasing and if it is
 negative the sequence is decreasing.</li>
 </ul>

 <p>The unsigned sequence of [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] would be
 serialized with delta encoding (3), a width of 4 bits (3), length of
 10 (9), a base of 2 (2), and first delta of 1 (2). The resulting
 sequence is [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46].</p>

 <h1 id="stripes">Stripes</h1>

 <p>The body of ORC files consists of a series of stripes. Stripes are
 large (typically ~200MB) and independent of each other and are often
 processed by different tasks. The defining characteristic for columnar
 storage formats is that the data for each column is stored separately
 and that reading data out of the file should be proportional to the
 number of columns read.</p>

 <p>In ORC files, each column is stored in several streams that are stored
 next to each other in the file. For example, an integer column is
 represented as two streams PRESENT, which uses one with a bit per
 value recording if the value is non-null, and DATA, which records the
 non-null values. If all of a column’s values in a stripe are non-null,
 the PRESENT stream is omitted from the stripe. For binary data, ORC
 uses three streams PRESENT, DATA, and LENGTH, which stores the length
 of each value. The details of each type will be presented in the
 following subsections.</p>

 <p>The layout of each stripe looks like:</p>
 <ul>
   <li>index streams
     <ul>
       <li>unencrypted</li>
       <li>encryption variant 1..N</li>
     </ul>
   </li>
   <li>data streams
     <ul>
       <li>unencrypted</li>
       <li>encryption variant 1..N</li>
     </ul>
   </li>
   <li>stripe footer</li>
 </ul>

 <p>There is a general order for index and data streams:</p>
 <ul>
   <li>Index streams are always placed together in the beginning of the stripe.</li>
   <li>Data streams are placed together after index streams (if any).</li>
   <li>Inside index streams or data streams, the unencrypted streams should be
 placed first and then followed by streams grouped by each encryption variant.</li>
 </ul>

 <p>There is no fixed order within each unencrypted or encryption variant in the
 index and data streams:</p>
 <ul>
   <li>Different stream kinds of the same column can be placed in any order.</li>
   <li>Streams from different columns can even be placed in any order.
 To get the precise information (a.k.a stream kind, column id and location) of
 a stream within a stripe, the streams field in the StripeFooter described below
 is the single source of truth.</li>
 </ul>

 <p>In the example of the integer column mentioned above, the order of the
 PRESENT stream and the DATA stream cannot be determined in advance.
 We need to get the precise information by <strong>StripeFooter</strong>.</p>

 <h2 id="stripe-footer">Stripe Footer</h2>

 <p>The stripe footer contains the encoding of each column and the
 directory of the streams including their location.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message StripeFooter {
  // the location of each stream
  repeated Stream streams = 1;
  // the encoding of each column
  repeated ColumnEncoding columns = 2;
  optional string writerTimezone = 3;
  // one for each column encryption variant
  repeated StripeEncryptionVariant encryption = 4;
 }
 </code></pre></div></div>

 <p>If the file includes encrypted columns, those streams and column
 encodings are stored separately in a StripeEncryptionVariant per an
 encryption variant. Additionally, the StripeFooter will contain two
 additional virtual streams ENCRYPTED_INDEX and ENCRYPTED_DATA that
 allocate the space that is used by the encryption variants to store
 the encrypted index and data streams.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message StripeEncryptionVariant {
   repeated Stream streams = 1;
   repeated ColumnEncoding encoding = 2;
 }
 </code></pre></div></div>

 <p>To describe each stream, ORC stores the kind of stream, the column id,
 and the stream’s size in bytes. The details of what is stored in each stream
 depends on the type and encoding of the column.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message Stream {
  enum Kind {
    // boolean stream of whether the next value is non-null
    PRESENT = 0;
    // the primary data stream
    DATA = 1;
    // the length of each value for variable length data
    LENGTH = 2;
    // the dictionary blob
    DICTIONARY_DATA = 3;
    // deprecated prior to Hive 0.11
    // It was used to store the number of instances of each value in the
    // dictionary
    DICTIONARY_COUNT = 4;
    // a secondary data stream
    SECONDARY = 5;
    // the index for seeking to particular row groups
    ROW_INDEX = 6;
    // original bloom filters used before ORC-101
    BLOOM_FILTER = 7;
    // bloom filters that consistently use utf8
    BLOOM_FILTER_UTF8 = 8;

    // Virtual stream kinds to allocate space for encrypted index and data.
    ENCRYPTED_INDEX = 9;
    ENCRYPTED_DATA = 10;

    // stripe statistics streams
    STRIPE_STATISTICS = 100;
    // A virtual stream kind that is used for setting the encryption IV.
    FILE_STATISTICS = 101;
  }
  required Kind kind = 1;
  // the column id
  optional uint32 column = 2;
  // the number of bytes in the file
  optional uint64 length = 3;
 }
 </code></pre></div></div>

 <p>Depending on their type several options for encoding are possible. The
 encodings are divided into direct or dictionary-based categories and
 further refined as to whether they use RLE v1 or v2.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message ColumnEncoding {
  enum Kind {
  // the encoding is mapped directly to the stream using RLE v1
  DIRECT = 0;
  // the encoding uses a dictionary of unique values using RLE v1
  DICTIONARY = 1;
  // the encoding is direct using RLE v2
  DIRECT_V2 = 2;
  // the encoding is dictionary-based using RLE v2
  DICTIONARY_V2 = 3;
  }
  required Kind kind = 1;
  // for dictionary encodings, record the size of the dictionary
  optional uint32 dictionarySize = 2;
 }
 </code></pre></div></div>

 <h1 id="column-encodings"><a id="column-encoding-section">Column Encodings</a></h1>

 <h2 id="smallint-int-and-bigint-columns">SmallInt, Int, and BigInt Columns</h2>

 <p>All of the 16, 32, and 64 bit integer column types use the same set of
 potential encodings, which is basically whether they use RLE v1 or
 v2. If the PRESENT stream is not included, all of the values are
 present. For values that have false bits in the present stream, no
 values are included in the data stream.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Signed Integer RLE v1</td>
     </tr>
     <tr>
       <td style="text-align: left">DIRECT_V2</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Signed Integer RLE v2</td>
     </tr>
   </tbody>
 </table>

 <blockquote>
   <p>Note that the order of the Stream is not fixed. It also applies to other Column types.</p>
 </blockquote>

 <h2 id="float-and-double-columns">Float and Double Columns</h2>

 <p>Floating point types are stored using IEEE 754 floating point bit
 layout. Float columns use 4 bytes per value and double columns use 8
 bytes.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">IEEE 754 floating point representation</td>
     </tr>
   </tbody>
 </table>

 <h2 id="string-char-and-varchar-columns">String, Char, and VarChar Columns</h2>

 <p>String, char, and varchar columns may be encoded either using a
 dictionary encoding or a direct encoding. A direct encoding should be
 preferred when there are many distinct values. In all of the
 encodings, the PRESENT stream encodes whether the value is null. The
 Java ORC writer automatically picks the encoding after the first row
 group (10,000 rows).</p>

 <p>For direct encoding the UTF-8 bytes are saved in the DATA stream and
 the length of each value is written into the LENGTH stream. In direct
 encoding, if the values were [“Nevada”, “California”]; the DATA
 would be “NevadaCalifornia” and the LENGTH would be [6, 10].</p>

 <p>For dictionary encodings the dictionary is sorted (in lexicographical
 order of bytes in the UTF-8 encodings) and UTF-8 bytes of
 each unique value are placed into DICTIONARY_DATA. The length of each
 item in the dictionary is put into the LENGTH stream. The DATA stream
 consists of the sequence of references to the dictionary elements.</p>

 <p>In dictionary encoding, if the values were [“Nevada”,
 “California”, “Nevada”, “California”, and “Florida”]; the
 DICTIONARY_DATA would be “CaliforniaFloridaNevada” and LENGTH would
 be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">String contents</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">LENGTH</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v1</td>
     </tr>
     <tr>
       <td style="text-align: left">DICTIONARY</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v1</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DICTIONARY_DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">String contents</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">LENGTH</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v1</td>
     </tr>
     <tr>
       <td style="text-align: left">DIRECT_V2</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">String contents</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">LENGTH</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v2</td>
     </tr>
     <tr>
       <td style="text-align: left">DICTIONARY_V2</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v2</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DICTIONARY_DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">String contents</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">LENGTH</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v2</td>
     </tr>
   </tbody>
 </table>

 <h2 id="boolean-columns">Boolean Columns</h2>

 <p>Boolean columns are rare, but have a simple encoding.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
   </tbody>
 </table>

 <h2 id="tinyint-columns">TinyInt Columns</h2>

 <p>TinyInt (byte) columns use byte run length encoding.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Byte RLE</td>
     </tr>
   </tbody>
 </table>

 <h2 id="binary-columns">Binary Columns</h2>

 <p>Binary data is encoded with a PRESENT stream, a DATA stream that records
 the contents, and a LENGTH stream that records the number of bytes per a
 value.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">String contents</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">LENGTH</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v1</td>
     </tr>
     <tr>
       <td style="text-align: left">DIRECT_V2</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">String contents</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">LENGTH</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v2</td>
     </tr>
   </tbody>
 </table>

 <h2 id="decimal-columns">Decimal Columns</h2>

 <p>Since Hive 0.13, all decimals have had fixed precision and scale.
 The goal is to use RLEv3 for the value and use the fixed scale from
 the type. As an interim solution, we are using RLE v2 for short decimals
 (precision &lt;= 18) and the old encoding for long decimals.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Signed Integer RLE v2</td>
     </tr>
     <tr>
       <td style="text-align: left">DIRECT_V2</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unbounded base 128 varints</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">SECONDARY</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Signed Integer RLE v2</td>
     </tr>
   </tbody>
 </table>

 <h2 id="date-columns">Date Columns</h2>

 <p>Date data is encoded with a PRESENT stream, a DATA stream that records
 the number of days after January 1, 1970 in UTC.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Signed Integer RLE v1</td>
     </tr>
     <tr>
       <td style="text-align: left">DIRECT_V2</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Signed Integer RLE v2</td>
     </tr>
   </tbody>
 </table>

 <h2 id="timestamp-columns">Timestamp Columns</h2>

 <p>Timestamp records times down to nanoseconds as a PRESENT stream that
 records non-null values, a DATA stream that records the number of
 seconds after 1 January 2015, and a SECONDARY stream that records the
 number of nanoseconds.</p>

 <p>Because the number of nanoseconds often has a large number of trailing
 zeros, the number has trailing decimal zero digits removed and the
 last three bits are used to record how many zeros were removed. if the
 trailing zeros are more than 2. Thus 1000 nanoseconds would be
 serialized as 0x0a and 100000 would be serialized as 0x0c.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Signed Integer RLE v1</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">SECONDARY</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v1</td>
     </tr>
     <tr>
       <td style="text-align: left">DIRECT_V2</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Signed Integer RLE v2</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">SECONDARY</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v2</td>
     </tr>
   </tbody>
 </table>

 <h2 id="struct-columns">Struct Columns</h2>

 <p>Structs have no data themselves and delegate everything to their child
 columns except for their PRESENT stream. They have a child column
 for each of the fields.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
   </tbody>
 </table>

 <h2 id="list-columns">List Columns</h2>

 <p>Lists are encoded as the PRESENT stream and a length stream with
 number of items in each list. They have a single child column for the
 element values.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">LENGTH</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v1</td>
     </tr>
     <tr>
       <td style="text-align: left">DIRECT_V2</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">LENGTH</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v2</td>
     </tr>
   </tbody>
 </table>

 <h2 id="map-columns">Map Columns</h2>

 <p>Maps are encoded as the PRESENT stream and a length stream with number
 of items in each map. They have a child column for the key and
 another child column for the value.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">LENGTH</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v1</td>
     </tr>
     <tr>
       <td style="text-align: left">DIRECT_V2</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">LENGTH</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Unsigned Integer RLE v2</td>
     </tr>
   </tbody>
 </table>

 <h2 id="union-columns">Union Columns</h2>

 <p>Unions are encoded as the PRESENT stream and a tag stream that controls which
 potential variant is used. They have a child column for each variant of the
 union. Currently ORC union types are limited to 256 variants, which matches
 the Hive type model.</p>

 <table>
   <thead>
     <tr>
       <th style="text-align: left">Encoding</th>
       <th style="text-align: left">Stream Kind</th>
       <th style="text-align: left">Optional</th>
       <th style="text-align: left">Contents</th>
     </tr>
   </thead>
   <tbody>
     <tr>
       <td style="text-align: left">DIRECT</td>
       <td style="text-align: left">PRESENT</td>
       <td style="text-align: left">Yes</td>
       <td style="text-align: left">Boolean RLE</td>
     </tr>
     <tr>
       <td style="text-align: left"> </td>
       <td style="text-align: left">DATA</td>
       <td style="text-align: left">No</td>
       <td style="text-align: left">Byte RLE</td>
     </tr>
   </tbody>
 </table>

 <h1 id="indexes">Indexes</h1>

 <h2 id="row-group-index">Row Group Index</h2>

 <p>The row group indexes consist of a ROW_INDEX stream for each primitive
 column that has an entry for each row group. Row groups are controlled
 by the writer and default to 10,000 rows. Each RowIndexEntry gives the
 position of each stream for the column and the statistics for that row
 group.</p>

 <p>The index streams are placed at the front of the stripe, because in
 the default case of streaming they do not need to be read. They are
 only loaded when either predicate push down is being used or the
 reader seeks to a particular row.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message RowIndexEntry {
  repeated uint64 positions = 1 [packed=true];
  optional ColumnStatistics statistics = 2;
 }
 </code></pre></div></div>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message RowIndex {
  repeated RowIndexEntry entry = 1;
 }
 </code></pre></div></div>

 <p>To record positions, each stream needs a sequence of numbers. For
 uncompressed streams, the position is the byte offset of the RLE run’s
 start location followed by the number of values that need to be
 consumed from the run. In compressed streams, the first number is the
 start of the compression chunk in the stream, followed by the number
 of decompressed bytes that need to be consumed, and finally the number
 of values consumed in the RLE.</p>

 <p>For columns with multiple streams, the sequences of positions in each
 stream are concatenated. That was an unfortunate decision on my part
 that we should fix at some point, because it makes code that uses the
 indexes error-prone.</p>

 <p>Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.</p>

 <p>Note that for columns with multiple streams, the order of stream
 positions in the RowIndex is <strong>fixed</strong>, which may be different to
 the actual data stream placement, and it is the same as
 <a href="#column-encoding-section">Column Encodings</a> section we described above.</p>

 <h2 id="bloom-filter-index">Bloom Filter Index</h2>

 <p>Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
 Predicate pushdown can make use of bloom filters to better prune
 the row groups that do not satisfy the filter condition.
 The bloom filter indexes consist of a BLOOM_FILTER stream for each
 column specified through ‘orc.bloom.filter.columns’ table properties.
 A BLOOM_FILTER stream records a bloom filter entry for each row
 group (default to 10,000 rows) in a column. Only the row groups that
 satisfy min/max row index evaluation will be evaluated against the
 bloom filter index.</p>

 <p>Each bloom filter entry stores the number of hash functions (‘k’) used
 and the bitset backing the bloom filter. The original encoding (pre
 ORC-101) of bloom filters used the bitset field encoded as a repeating
 sequence of longs in the bitset field with a little endian encoding
 (0x1 is bit 0 and 0x2 is bit 1.) After ORC-101, the encoding is a
 sequence of bytes with a little endian encoding in the utf8bitset field.</p>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message BloomFilter {
  optional uint32 numHashFunctions = 1;
  repeated fixed64 bitset = 2;
  optional bytes utf8bitset = 3;
 }
 </code></pre></div></div>

 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message BloomFilterIndex {
  repeated BloomFilter bloomFilter = 1;
 }
 </code></pre></div></div>

 <p>Bloom filter internally uses two different hash functions to map a key
 to a position in the bit set. For tinyint, smallint, int, bigint, float
 and double types, Thomas Wang’s 64-bit integer hash function is used.
 Doubles are converted to IEEE-754 64 bit representation (using Java’s
 Double.doubleToLongBits(double)). Floats are as converted to double
 (using Java’s float to double cast).  All these primitive types
 are cast to long base type before being passed on to the hash function.
 For strings and binary types, Murmur3 64 bit hash algorithm is used.
 The 64 bit variant of Murmur3 considers only the most significant
 8 bytes of Murmur3 128-bit algorithm. The 64 bit hashcode generated
 from the above algorithms is used as a base to derive ‘k’ different
 hash functions. We use the idea mentioned in the paper “Less Hashing,
 Same Performance: Building a Better Bloom Filter” by Kirsch et. al. to
 quickly compute the k hashcodes.</p>

 <p>The algorithm for computing k hashcodes and setting the bit position
 in a bloom filter is as follows:</p>

 <ol>
   <li>Get 64 bit base hash code from Murmur3 or Thomas Wang’s hash algorithm.</li>
   <li>Split the above hashcode into two 32-bit hashcodes (say hash1 and hash2).</li>
   <li>k’th hashcode is obtained by (where k &gt; 0):
     <ul>
       <li>combinedHash = hash1 + (k * hash2)</li>
     </ul>
   </li>
   <li>If combinedHash is negative flip all the bits:
     <ul>
       <li>combinedHash = ~combinedHash</li>
     </ul>
   </li>
   <li>Bit set position is obtained by performing modulo with m:
     <ul>
       <li>position = combinedHash % m</li>
     </ul>
   </li>
   <li>Set the position in bit set. The LSB 6 bits identifies the long index
 within bitset and bit position within the long uses little endian order.
     <ul>
       <li>bitset[position »&gt; 6] |= (1L « position);</li>
     </ul>
   </li>
 </ol>

 <p>Bloom filter streams are interlaced with row group indexes. This placement
 makes it convenient to read the bloom filter stream and row index stream
 together in single read operation.</p>

 <p><img src="/img/BloomFilter.png" alt="bloom filter" /></p>

       </article>
     </div>

     <div class="clear"></div>

   </div>
 </section>


   <footer role="contentinfo">
   <p style="margin-left: 20px; margin-right; 20px; text-align: center">The contents of this website are &copy;&nbsp;2025
      <a href="https://www.apache.org/">Apache Software Foundation</a>
      under the terms of the <a
       href="https://www.apache.org/licenses/LICENSE-2.0.html">
       Apache&nbsp;License&nbsp;v2</a>. Apache ORC and its logo are trademarks
       of the Apache Software Foundation.</p>
 </footer>

   <script>
   var anchorForId = function (id) {
     var anchor = document.createElement("a");
     anchor.className = "header-link";
     anchor.href      = "#" + id;
     anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
     anchor.title = "Permalink";
     return anchor;
   };

   var linkifyAnchors = function (level, containingElement) {
     var headers = containingElement.getElementsByTagName("h" + level);
     for (var h = 0; h < headers.length; h++) {
       var header = headers[h];

       if (typeof header.id !== "undefined" && header.id !== "") {
         header.appendChild(anchorForId(header.id));
       }
     }
   };

   document.onreadystatechange = function () {
     if (this.readyState === "complete") {
       var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
       if (!contentBlock) {
         return;
       }
       for (var level = 1; level <= 6; level++) {
         linkifyAnchors(level, contentBlock);
       }
     }
   };
 </script>


 </body>
 </html>