<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Apache Parquet – Documentation</title><link>/docs/</link><description>Recent content in Documentation on Apache Parquet</description><generator>Hugo -- gohugo.io</generator><atom:link href="/docs/index.xml" rel="self" type="application/rss+xml"/><item><title>Docs: Encodings</title><link>/docs/file-format/data-pages/encodings/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/file-format/data-pages/encodings/</guid><description>
&lt;h3 id="a-nameplainaplain-plain--0">&lt;a name="PLAIN">&lt;/a>Plain: (PLAIN = 0)&lt;/h3>
&lt;p>Supported Types: all&lt;/p>
&lt;p>This is the plain encoding that must be supported for all types. It is
intended to be the simplest encoding. Values are encoded back to back.&lt;/p>
&lt;p>The plain encoding is used whenever a more efficient encoding cannot be used. It
stores the data in the following format:&lt;/p>
&lt;ul>
&lt;li>BOOLEAN: &lt;a href="#RLE">Bit Packed&lt;/a>, LSB first&lt;/li>
&lt;li>INT32: 4 bytes little endian&lt;/li>
&lt;li>INT64: 8 bytes little endian&lt;/li>
&lt;li>INT96: 12 bytes little endian (deprecated)&lt;/li>
&lt;li>FLOAT: 4 bytes IEEE little endian&lt;/li>
&lt;li>DOUBLE: 8 bytes IEEE little endian&lt;/li>
&lt;li>BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array&lt;/li>
&lt;li>FIXED_LEN_BYTE_ARRAY: the bytes contained in the array&lt;/li>
&lt;/ul>
&lt;p>For native types, this outputs the data as little endian. Floating
point types are encoded in IEEE 754 format.&lt;/p>
&lt;p>For the byte array type, it encodes the length as a 4-byte little-endian
integer, followed by the bytes.&lt;/p>
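&lt;p>As an illustration, here is a minimal sketch in plain Python (no Parquet library assumed; the values are made up) of how a few INT32 and BYTE_ARRAY values are laid out by the plain encoding:&lt;/p>
&lt;pre tabindex="0">&lt;code># PLAIN lays values out back to back.
# INT32: each value is 4 bytes, little endian.
ints = [1, 2, 300]
int32_plain = b''.join(v.to_bytes(4, 'little', signed=True) for v in ints)
assert int32_plain.hex() == '01000000' '02000000' '2c010000'

# BYTE_ARRAY: a 4-byte little-endian length, then the bytes themselves.
strings = [b'Hello', b'World']
byte_array_plain = b''.join(len(s).to_bytes(4, 'little') + s for s in strings)
assert byte_array_plain.hex() == '0500000048656c6c6f' '05000000576f726c64'
&lt;/code>&lt;/pre>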
&lt;h3 id="dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8">Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)&lt;/h3>
&lt;p>The dictionary encoding builds a dictionary of values encountered in a given column. The
dictionary will be stored in a dictionary page per column chunk. The values are stored as integers
using the &lt;a href="#RLE">RLE/Bit-Packing Hybrid&lt;/a> encoding. If the dictionary grows too big, whether in size
or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is
written first, before the data pages of the column chunk.&lt;/p>
&lt;p>Dictionary page format: the entries in the dictionary - in dictionary order - using the &lt;a href="#PLAIN">plain&lt;/a> encoding.&lt;/p>
&lt;p>Data page format: the bit width used to encode the entry ids, stored as 1 byte (max bit width = 32),
followed by the values encoded using the RLE/Bit-Packing Hybrid encoding described above (with the given bit width).&lt;/p>
&lt;p>Using the PLAIN_DICTIONARY enum value is deprecated in the Parquet 2.0 specification. Prefer using RLE_DICTIONARY
in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.&lt;/p>
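&lt;p>A minimal sketch in plain Python (hypothetical column values, illustrative variable names) of what goes into the dictionary page and the start of a dictionary-encoded data page:&lt;/p>
&lt;pre tabindex="0">&lt;code># A hypothetical BYTE_ARRAY column and the dictionary built while writing it.
column = [b'red', b'green', b'red', b'blue', b'green']
dictionary = list(dict.fromkeys(column))          # [b'red', b'green', b'blue']
indices = [dictionary.index(v) for v in column]   # [0, 1, 0, 2, 1]

# Dictionary page: the entries back to back, using the plain encoding.
dictionary_page = b''.join(len(e).to_bytes(4, 'little') + e for e in dictionary)

# Data page: one byte giving the bit width, then the entry ids encoded with
# the RLE/Bit-Packing Hybrid encoding described below (not shown here).
bit_width = max(indices).bit_length()             # 2 bits cover ids 0..2
data_page_prefix = bytes([bit_width])
&lt;/code>&lt;/pre>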
&lt;h3 id="a-namerlearun-length-encoding--bit-packing-hybrid-rle--3">&lt;a name="RLE">&lt;/a>Run Length Encoding / Bit-Packing Hybrid (RLE = 3)&lt;/h3>
&lt;p>This encoding uses a combination of bit-packing and run length encoding to more efficiently store repeated values.&lt;/p>
&lt;p>The grammar for this encoding looks like this, given a fixed bit-width known in advance:&lt;/p>
&lt;pre tabindex="0">&lt;code>rle-bit-packed-hybrid: &amp;lt;length&amp;gt; &amp;lt;encoded-data&amp;gt;
length := length of the &amp;lt;encoded-data&amp;gt; in bytes stored as 4 bytes little endian (unsigned int32)
encoded-data := &amp;lt;run&amp;gt;*
run := &amp;lt;bit-packed-run&amp;gt; | &amp;lt;rle-run&amp;gt;
bit-packed-run := &amp;lt;bit-packed-header&amp;gt; &amp;lt;bit-packed-values&amp;gt;
bit-packed-header := varint-encode(&amp;lt;bit-pack-scaled-run-len&amp;gt; &amp;lt;&amp;lt; 1 | 1)
// we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8
bit-pack-scaled-run-len := (bit-packed-run-len) / 8
bit-packed-run-len := *see 3 below*
bit-packed-values := *see 1 below*
rle-run := &amp;lt;rle-header&amp;gt; &amp;lt;repeated-value&amp;gt;
rle-header := varint-encode( (rle-run-len) &amp;lt;&amp;lt; 1)
rle-run-len := *see 3 below*
repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width)
&lt;/code>&lt;/pre>&lt;ol>
&lt;li>
&lt;p>The bit-packing here is done in a different order than the one in the &lt;a href="#BITPACKED">deprecated bit-packing&lt;/a> encoding.
The values are packed from the least significant bit of each byte to the most significant bit,
though the order of the bits in each value remains in the usual order of most significant to least
significant. For example, to pack the same values as the example in the deprecated encoding below:&lt;/p>
&lt;p>The numbers 1 through 7 using bit width 3:&lt;/p>
&lt;pre tabindex="0">&lt;code>dec value: 0 1 2 3 4 5 6 7
bit value: 000 001 010 011 100 101 110 111
bit label: ABC DEF GHI JKL MNO PQR STU VWX
&lt;/code>&lt;/pre>&lt;p>would be encoded like this where spaces mark byte boundaries (3 bytes):&lt;/p>
&lt;pre tabindex="0">&lt;code>bit value: 10001000 11000110 11111010
bit label: HIDEFABC RMNOJKLG VWXSTUPQ
&lt;/code>&lt;/pre>&lt;p>The reason for this packing order is to have fewer word boundaries on little-endian hardware
when deserializing more than one byte at a time. This is because 4 bytes can be read into a
32-bit register (or 8 bytes into a 64-bit register) and values can be unpacked just by
shifting and ORing with a mask. (To make this optimization work on a big-endian machine,
you would have to use the ordering used in the &lt;a href="#BITPACKED">deprecated bit-packing&lt;/a> encoding.)
A sketch of this packing order follows this list.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>varint-encode() is ULEB-128 encoding, see &lt;a href="https://en.wikipedia.org/wiki/LEB128">https://en.wikipedia.org/wiki/LEB128&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>bit-packed-run-len and rle-run-len must be in the range [1, 2&lt;sup>31&lt;/sup> - 1].
This means that a Parquet implementation can always store the run length in a signed
32-bit integer. This length restriction was not part of the Parquet 2.5.0 and earlier
specifications, but longer runs were not readable by the most common Parquet
implementations so, in practice, were not safe for Parquet writers to emit.&lt;/p>
&lt;/li>
&lt;/ol>
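&lt;p>As a check on the packing order described in note 1 above, a minimal sketch in plain Python (the helper name is illustrative) that packs the numbers 0 through 7 with bit width 3 from the least significant bit of each byte upward and reproduces the three example bytes:&lt;/p>
&lt;pre tabindex="0">&lt;code>def pack_lsb_first(values, bit_width):
    # Pack each value starting at the least significant free bit of the buffer.
    buffer, bits_used = 0, 0
    out = bytearray()
    for v in values:
        buffer |= v &amp;lt;&amp;lt; bits_used        # place the value above the bits already used
        bits_used += bit_width
        while bits_used &amp;gt;= 8:
            out.append(buffer &amp;amp; 0xFF)   # emit the lowest complete byte
            buffer &amp;gt;&amp;gt;= 8
            bits_used -= 8
    if bits_used:
        out.append(buffer &amp;amp; 0xFF)       # final byte, zero padded
    return bytes(out)

# Packing 0 through 7 with bit width 3 reproduces the example bytes above.
assert pack_lsb_first(range(8), 3) == bytes([0b10001000, 0b11000110, 0b11111010])
&lt;/code>&lt;/pre>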
&lt;p>Note that the RLE encoding method is only supported for the following types of
data:&lt;/p>
&lt;ul>
&lt;li>Repetition and definition levels&lt;/li>
&lt;li>Dictionary indices&lt;/li>
&lt;li>Boolean values in data pages, as an alternative to PLAIN encoding&lt;/li>
&lt;/ul>
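&lt;p>A minimal sketch in plain Python (helper names are illustrative) of a single rle-run from the grammar above, encoding 1000 definition levels that all equal 1 with bit width 1; the 4-byte &amp;lt;length&amp;gt; prefix of the surrounding rle-bit-packed-hybrid structure is omitted:&lt;/p>
&lt;pre tabindex="0">&lt;code>def uleb128(n):
    # varint-encode() from the grammar above (unsigned LEB128).
    out = bytearray()
    while True:
        byte = n &amp;amp; 0x7F
        n &amp;gt;&amp;gt;= 7
        if n:
            out.append(byte | 0x80)    # more bytes follow: set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

def rle_run(value, run_length, bit_width):
    # rle-header := varint-encode(rle-run-len &amp;lt;&amp;lt; 1), then the repeated value
    # stored in round-up-to-next-byte(bit-width) bytes.
    return uleb128(run_length &amp;lt;&amp;lt; 1) + value.to_bytes((bit_width + 7) // 8, 'little')

# 1000 repetitions of the level value 1 (bit width 1) take only three bytes.
assert rle_run(1, 1000, 1) == bytes([0xD0, 0x0F, 0x01])
&lt;/code>&lt;/pre>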
&lt;h3 id="a-namebitpackedabit-packed-deprecated-bit_packed--4">&lt;a name="BITPACKED">&lt;/a>Bit-packed (Deprecated) (BIT_PACKED = 4)&lt;/h3>
&lt;p>This is a bit-packed-only encoding, which is deprecated and superseded by the &lt;a href="#RLE">RLE/bit-packing&lt;/a> hybrid encoding.
Each value is encoded back to back using a fixed width.
There is no padding between values, except for the last byte, which is padded with 0s.
For example, if the max repetition level was 3 (2 bits) and the max definition level was 3
(2 bits), encoding 30 level values would take 30 * 2 = 60 bits, rounded up to 8 bytes.&lt;/p>
&lt;p>This implementation is deprecated because the &lt;a href="#RLE">RLE/bit-packing&lt;/a> hybrid is a superset of this implementation.
For compatibility reasons, this implementation packs values from the most significant bit to the least significant bit,
which is not the same as the &lt;a href="#RLE">RLE/bit-packing&lt;/a> hybrid.&lt;/p>
&lt;p>For example, the numbers 1 through 7 using bit width 3:&lt;/p>
&lt;pre tabindex="0">&lt;code>dec value: 0 1 2 3 4 5 6 7
bit value: 000 001 010 011 100 101 110 111
bit label: ABC DEF GHI JKL MNO PQR STU VWX
&lt;/code>&lt;/pre>&lt;p>would be encoded like this where spaces mark byte boundaries (3 bytes):&lt;/p>
&lt;pre tabindex="0">&lt;code>bit value: 00000101 00111001 01110111
bit label: ABCDEFGH IJKLMNOP QRSTUVWX
&lt;/code>&lt;/pre>&lt;p>Note that the BIT_PACKED encoding method is only supported for encoding
repetition and definition levels.&lt;/p>
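&lt;p>A minimal sketch in plain Python (the helper name is illustrative) of this deprecated most-significant-bit-first packing, reproducing the example bytes above:&lt;/p>
&lt;pre tabindex="0">&lt;code>def pack_msb_first(values, bit_width):
    # Deprecated BIT_PACKED order: fill each byte from its most significant bit.
    bits = ''.join(format(v, '0{}b'.format(bit_width)) for v in values)
    bits += '0' * (-len(bits) % 8)      # pad the last byte with zeros
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

# 0 through 7 with bit width 3 pack into 00000101 00111001 01110111.
assert pack_msb_first(range(8), 3) == bytes([0b00000101, 0b00111001, 0b01110111])
&lt;/code>&lt;/pre>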
&lt;h3 id="a-namedeltaencadelta-encoding-delta_binary_packed--5">&lt;a name="DELTAENC">&lt;/a>Delta Encoding (DELTA_BINARY_PACKED = 5)&lt;/h3>
&lt;p>Supported Types: INT32, INT64&lt;/p>
&lt;p>This encoding is adapted from the Binary packing described in &lt;a href="http://arxiv.org/pdf/1209.2137v5.pdf">&amp;ldquo;Decoding billions of integers per second through vectorization&amp;rdquo;&lt;/a> by D. Lemire and L. Boytsov.&lt;/p>
&lt;p>In delta encoding we make use of variable length integers for storing various numbers (not the deltas themselves). For unsigned values, we use ULEB128, which is the unsigned version of LEB128 (&lt;a href="https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128">https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128&lt;/a>). For signed values, we use zigzag encoding (&lt;a href="https://developers.google.com/protocol-buffers/docs/encoding#signed-integers">https://developers.google.com/protocol-buffers/docs/encoding#signed-integers&lt;/a>) to map negative values to positive ones and apply ULEB128 on the result.&lt;/p>
&lt;p>Delta encoding consists of a header followed by blocks of delta encoded values binary packed. Each block is made of miniblocks, each of them binary packed with its own bit width.&lt;/p>
&lt;p>The header is defined as follows:&lt;/p>
&lt;pre tabindex="0">&lt;code>&amp;lt;block size in values&amp;gt; &amp;lt;number of miniblocks in a block&amp;gt; &amp;lt;total value count&amp;gt; &amp;lt;first value&amp;gt;
&lt;/code>&lt;/pre>&lt;ul>
&lt;li>the block size is a multiple of 128; it is stored as a ULEB128 int&lt;/li>
&lt;li>the miniblock count per block is a divisor of the block size such that their quotient, the number of values in a miniblock, is a multiple of 32; it is stored as a ULEB128 int&lt;/li>
&lt;li>the total value count is stored as a ULEB128 int&lt;/li>
&lt;li>the first value is stored as a zigzag ULEB128 int&lt;/li>
&lt;/ul>
&lt;p>Each block contains&lt;/p>
&lt;pre tabindex="0">&lt;code>&amp;lt;min delta&amp;gt; &amp;lt;list of bitwidths of miniblocks&amp;gt; &amp;lt;miniblocks&amp;gt;
&lt;/code>&lt;/pre>&lt;ul>
&lt;li>the min delta is a zigzag ULEB128 int (we compute a minimum as we need positive integers for bit packing)&lt;/li>
&lt;li>the bit width of each miniblock is stored as a byte&lt;/li>
&lt;li>each miniblock is a list of bit packed ints according to the bit width stored at the beginning of the block&lt;/li>
&lt;/ul>
&lt;p>To encode a block, we will:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Compute the differences between consecutive elements. For the first element in the block, use the last element in the previous block or, in the case of the first block, use the first value of the whole sequence, stored in the header.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute the frame of reference (the minimum of the deltas in the block). Subtract this min delta from all deltas in the block. This guarantees that all values are non-negative.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Encode the frame of reference (min delta) as a zigzag ULEB128 int followed by the bit widths of the miniblocks and the delta values (minus the min delta) bit packed per miniblock.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Having multiple blocks allows us to adapt to changes in the data by changing the frame of reference (the min delta) which can result in smaller values after the subtraction which, again, means we can store them with a lower bit width.&lt;/p>
&lt;p>If there are not enough values to fill the last miniblock, we pad the miniblock so that its length is always the number of values in a full miniblock multiplied by the bit width. The values of the padding bits should be zero, but readers must accept paddings consisting of arbitrary bits as well.&lt;/p>
&lt;p>If, in the last block, less than &lt;code>&amp;lt;number of miniblocks in a block&amp;gt;&lt;/code> miniblocks are needed to store the values, the bytes storing the bit widths of the unneeded miniblocks are still present, their value should be zero, but readers must accept arbitrary values as well. There are no additional padding bytes for the miniblock bodies though, as if their bit widths were 0 (regardless of the actual byte values). The reader knows when to stop reading by keeping track of the number of values read.&lt;/p>
&lt;p>The following examples use 8 as the block size to keep the examples short, but in real cases it would be invalid.&lt;/p>
&lt;h4 id="example-1">Example 1&lt;/h4>
&lt;p>1, 2, 3, 4, 5&lt;/p>
&lt;p>After step 1), we compute the deltas as:&lt;/p>
&lt;p>1, 1, 1, 1&lt;/p>
&lt;p>The minimum delta is 1 and after step 2, the deltas become&lt;/p>
&lt;p>0, 0, 0, 0&lt;/p>
&lt;p>The final encoded data is:&lt;/p>
&lt;p>header:
8 (block size), 1 (miniblock count), 5 (value count), 1 (first value)&lt;/p>
&lt;p>block
1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0)&lt;/p>
&lt;h4 id="example-2">Example 2&lt;/h4>
&lt;p>For the values 7, 5, 3, 1, 2, 3, 4, 5, the deltas would be&lt;/p>
&lt;p>-2, -2, -2, 1, 1, 1, 1&lt;/p>
&lt;p>The minimum is -2, so the relative deltas are:&lt;/p>
&lt;p>0, 0, 0, 3, 3, 3, 3&lt;/p>
&lt;p>The encoded data is&lt;/p>
&lt;p>header:
8 (block size), 1 (miniblock count), 8 (value count), 7 (first value)&lt;/p>
&lt;p>block
-2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2 bits)&lt;/p>
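&lt;p>A minimal sketch in plain Python that reproduces the numbers of Example 2; the ULEB128/zigzag and bit-packing steps are left out:&lt;/p>
&lt;pre tabindex="0">&lt;code>values = [7, 5, 3, 1, 2, 3, 4, 5]
first_value = values[0]                                  # goes into the header
assert first_value == 7

# Step 1: deltas between consecutive elements.
deltas = [b - a for a, b in zip(values, values[1:])]
assert deltas == [-2, -2, -2, 1, 1, 1, 1]

# Step 2: frame of reference, then subtract it from every delta.
min_delta = min(deltas)
relative_deltas = [d - min_delta for d in deltas]
assert min_delta == -2 and relative_deltas == [0, 0, 0, 3, 3, 3, 3]

# Step 3: bit width needed to pack the adjusted deltas of the miniblock.
bit_width = max(relative_deltas).bit_length()
assert bit_width == 2
&lt;/code>&lt;/pre>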
&lt;h4 id="characteristics">Characteristics&lt;/h4>
&lt;p>This encoding is similar to the &lt;a href="#RLE">RLE/bit-packing&lt;/a> encoding. However, the &lt;a href="#RLE">RLE/bit-packing&lt;/a> encoding is specifically used when the range of ints is small over the entire page, as is true of repetition and definition levels; it uses a single bit width for the whole page.
The delta encoding algorithm described above stores a bit width per miniblock and is less sensitive to variations in the size of the encoded integers. It also behaves somewhat like RLE: a block containing all the same values is bit packed to a zero bit width and is reduced to just its header.&lt;/p>
&lt;h3 id="delta-length-byte-array-delta_length_byte_array--6">Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6)&lt;/h3>
&lt;p>Supported Types: BYTE_ARRAY&lt;/p>
&lt;p>This encoding is always preferred over PLAIN for byte array columns.&lt;/p>
&lt;p>For this encoding, we will take all the byte array lengths and encode them using delta
encoding (DELTA_BINARY_PACKED). The byte array data follows all of the length data just
concatenated back to back. The expected savings is from the cost of encoding the lengths
and possibly better compression in the data (it is no longer interleaved with the lengths).&lt;/p>
&lt;p>The data stream looks like:&lt;/p>
&lt;p>&amp;lt;Delta Encoded Lengths&amp;gt; &amp;lt;Byte Array Data&amp;gt;&lt;/p>
&lt;p>For example, if the data was &amp;ldquo;Hello&amp;rdquo;, &amp;ldquo;World&amp;rdquo;, &amp;ldquo;Foobar&amp;rdquo;, &amp;ldquo;ABCDEF&amp;rdquo;:&lt;/p>
&lt;p>The encoded data would be DeltaEncoding(5, 5, 6, 6) &amp;ldquo;HelloWorldFoobarABCDEF&amp;rdquo;&lt;/p>
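&lt;p>A minimal sketch in plain Python of that layout; the DELTA_BINARY_PACKED step for the lengths is left out:&lt;/p>
&lt;pre tabindex="0">&lt;code>values = [b'Hello', b'World', b'Foobar', b'ABCDEF']

# The lengths come first (encoded with DELTA_BINARY_PACKED in a real writer),
# then all the byte array data concatenated back to back.
lengths = [len(v) for v in values]
byte_array_data = b''.join(values)

assert lengths == [5, 5, 6, 6]
assert byte_array_data == b'HelloWorldFoobarABCDEF'
&lt;/code>&lt;/pre>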
&lt;h3 id="delta-strings-delta_byte_array--7">Delta Strings: (DELTA_BYTE_ARRAY = 7)&lt;/h3>
&lt;p>Supported Types: BYTE_ARRAY&lt;/p>
&lt;p>This is also known as incremental encoding or front compression: for each element in a
sequence of strings, store the length of the prefix shared with the previous entry, followed by the suffix.&lt;/p>
&lt;p>For a longer description, see &lt;a href="https://en.wikipedia.org/wiki/Incremental_encoding">https://en.wikipedia.org/wiki/Incremental_encoding&lt;/a>.&lt;/p>
&lt;p>This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by
the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).&lt;/p>
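&lt;p>A minimal sketch in plain Python (hypothetical input strings, illustrative helper name) of how the prefix lengths and suffixes are derived before being handed to DELTA_BINARY_PACKED and DELTA_LENGTH_BYTE_ARRAY respectively:&lt;/p>
&lt;pre tabindex="0">&lt;code>def common_prefix_len(a, b):
    n = 0
    while n &amp;lt; min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

values = [b'axis', b'axle', b'babble', b'babyhood']
prefix_lengths, suffixes = [], []
previous = b''
for v in values:
    p = common_prefix_len(previous, v)   # length of the prefix shared with the previous entry
    prefix_lengths.append(p)
    suffixes.append(v[p:])
    previous = v

assert prefix_lengths == [0, 2, 0, 3]
assert suffixes == [b'axis', b'le', b'babble', b'yhood']
&lt;/code>&lt;/pre>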
&lt;h3 id="byte-stream-split-byte_stream_split--9">Byte Stream Split: (BYTE_STREAM_SPLIT = 9)&lt;/h3>
&lt;p>Supported Types: FLOAT, DOUBLE&lt;/p>
&lt;p>This encoding does not reduce the size of the data but can lead to a significantly better
compression ratio and speed when a compression algorithm is used afterwards.&lt;/p>
&lt;p>This encoding creates K byte-streams of length N where K is the size in bytes of the data
type and N is the number of elements in the data sequence.
The bytes of each value are scattered to the corresponding streams. The 0-th byte goes to the
0-th stream, the 1-st byte goes to the 1-st stream and so on.
The streams are concatenated in the following order: 0-th stream, 1-st stream, etc.&lt;/p>
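&lt;p>A minimal sketch of this scattering in plain Python; it reproduces the worked example below:&lt;/p>
&lt;pre tabindex="0">&lt;code>K = 4   # size of a FLOAT in bytes
raw_values = [bytes.fromhex('AABBCCDD'), bytes.fromhex('00112233'), bytes.fromhex('A3B4C5D6')]

# Stream i receives byte i of every value; the streams are then concatenated.
streams = [bytes(value[i] for value in raw_values) for i in range(K)]
encoded = b''.join(streams)

assert encoded.hex().upper() == 'AA00A3BB11B4CC22C5DD33D6'
&lt;/code>&lt;/pre>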
&lt;p>Example:
Original data is three 32-bit floats and for simplicity we look at their raw representation.&lt;/p>
&lt;pre tabindex="0">&lt;code> Element 0 Element 1 Element 2
Bytes AA BB CC DD 00 11 22 33 A3 B4 C5 D6
&lt;/code>&lt;/pre>&lt;p>After applying the transformation, the data has the following representation:&lt;/p>
&lt;pre tabindex="0">&lt;code>Bytes AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6
&lt;/code>&lt;/pre></description></item><item><title>Docs: License</title><link>/docs/asf/license/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/asf/license/</guid><description>
&lt;p>&lt;a href="https://www.apache.org/licenses/">License&lt;/a>&lt;/p></description></item><item><title>Docs: Modules</title><link>/docs/contribution-guidelines/modules/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/contribution-guidelines/modules/</guid><description>
&lt;p>The &lt;a href="https://github.com/apache/parquet-format">parquet-format&lt;/a> project contains format specifications and Thrift definitions of metadata required to properly read Parquet files.&lt;/p>
&lt;p>The &lt;a href="https://github.com/apache/parquet-mr">parquet-mr&lt;/a> project contains multiple sub-modules, which implement the core components of reading and writing a nested, column-oriented data stream, map this core onto the parquet format, and provide Hadoop Input/Output Formats, Pig loaders, and other Java-based utilities for interacting with Parquet.&lt;/p>
&lt;p>The &lt;a href="https://github.com/apache/parquet-cpp">parquet-cpp&lt;/a> project is a C++ library to read-write Parquet files.&lt;/p>
&lt;p>The &lt;a href="https://github.com/apache/arrow-rs/tree/master/parquet">parquet-rs&lt;/a> project is a Rust library to read-write Parquet files.&lt;/p>
&lt;p>The &lt;a href="https://github.com/Parquet/parquet-compatibility">parquet-compatibility&lt;/a> project (deprecated) contains compatibility tests that can be used to verify that implementations in different languages can read and write each other’s files. As of January 2022 compatibility tests only exist up to version 1.2.0.&lt;/p></description></item><item><title>Docs: Spark Summit 2020</title><link>/docs/learning-resources/presentations/spark-summit-2020/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/learning-resources/presentations/spark-summit-2020/</guid><description>
&lt;p>&lt;a href="https://www.slideshare.net/databricks/the-apache-spark-file-format-ecosystem">Slides&lt;/a>&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe src="https://www.youtube.com/embed/auNAzC3AU18" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" allowfullscreen title="YouTube Video">&lt;/iframe>
&lt;/div></description></item><item><title>Docs: Building Parquet</title><link>/docs/contribution-guidelines/building/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/contribution-guidelines/building/</guid><description>
&lt;p>Java resources can be built using &lt;code>mvn package&lt;/code>. The current stable version should always be available from Maven Central.&lt;/p>
&lt;p>C++ thrift resources can be generated via make.&lt;/p>
&lt;p>Thrift can also be code-generated into any other Thrift-supported language.&lt;/p></description></item><item><title>Docs: Hadoop Summit 2014</title><link>/docs/learning-resources/presentations/hadoop-summit-2014/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/learning-resources/presentations/hadoop-summit-2014/</guid><description>
&lt;p>&lt;a href="https://www.slideshare.net/cloudera/hadoop-summit-36479635">Slides&lt;/a>&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe src="https://www.youtube.com/embed/MZNjmfx4LMc" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" allowfullscreen title="YouTube Video">&lt;/iframe>
&lt;/div></description></item><item><title>Docs: Motivation</title><link>/docs/overview/motivation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/overview/motivation/</guid><description>
&lt;p>We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.&lt;/p>
&lt;p>Parquet is built from the ground up with complex nested data structures in mind, and uses the &lt;a href="https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper">record shredding and assembly algorithm&lt;/a> described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.&lt;/p>
&lt;p>Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.&lt;/p>
&lt;p>Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.&lt;/p></description></item><item><title>Docs: Security</title><link>/docs/asf/security/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/asf/security/</guid><description>
&lt;p>&lt;a href="https://www.apache.org/security/">Security&lt;/a>&lt;/p></description></item><item><title>Docs: #CONF 2014</title><link>/docs/learning-resources/presentations/conf-2014-parquet-summit-twitter/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/learning-resources/presentations/conf-2014-parquet-summit-twitter/</guid><description>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe src="https://www.youtube.com/embed/Qfp6Uv1UrA0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" allowfullscreen title="YouTube Video">&lt;/iframe>
&lt;/div></description></item><item><title>Docs: Contributing to Parquet</title><link>/docs/contribution-guidelines/contributing/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/contribution-guidelines/contributing/</guid><description>
&lt;h2 id="pull-requests">Pull Requests&lt;/h2>
&lt;p>We prefer to receive contributions in the form of GitHub pull requests. Please send pull requests against the &lt;a href="https://github.com/apache/parquet-mr">github.com/apache/parquet-mr&lt;/a> repository. If you’ve previously forked Parquet from its old location, you will need to add a remote or update your origin remote to &lt;a href="https://github.com/apache/parquet-mr.git">https://github.com/apache/parquet-mr.git&lt;/a>. Here are a few tips to get your contribution in:&lt;/p>
&lt;ol>
&lt;li>Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.&lt;/li>
&lt;li>Create a JIRA for your patch on the &lt;a href="https://issues.apache.org/jira/browse/PARQUET">Parquet Project JIRA&lt;/a>.&lt;/li>
&lt;li>Submit the patch as a GitHub pull request against the master branch. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. Prefix your pull request name with the JIRA name (ex: &lt;a href="https://github.com/apache/parquet-mr/pull/5">https://github.com/apache/parquet-mr/pull/5&lt;/a>).&lt;/li>
&lt;li>Make sure that your code passes the unit tests. You can run the tests with &lt;code>mvn test&lt;/code> in the root directory.&lt;/li>
&lt;li>Add new unit tests for your code.&lt;/li>
&lt;li>All Pull Requests are tested automatically on &lt;a href="https://github.com/apache/parquet-mr/actions">GitHub Actions&lt;/a>. &lt;a href="https://travis-ci.org/github/apache/parquet-mr">TravisCI&lt;/a> is also used to run the tests on the ARM64 CPU architecture.&lt;/li>
&lt;/ol>
&lt;p>If you’d like to report a bug but don’t have time to fix it, you can still post it to our &lt;a href="https://issues.apache.org/jira/browse/PARQUET">issue tracker&lt;/a>, or email the mailing list (&lt;a href="mailto:dev@parquet.apache.org">dev@parquet.apache.org&lt;/a>).&lt;/p>
&lt;h2 id="committers">Committers&lt;/h2>
&lt;p>Merging a pull request requires being a committer on the project.&lt;/p>
&lt;p>How to merge a pull request (assuming you have apache and github-apache remotes set up):&lt;/p>
&lt;pre>&lt;code>git remote add github-apache git@github.com:apache/parquet-mr.git
git remote add apache https://gitbox.apache.org/repos/asf?p=parquet-mr.git
&lt;/code>&lt;/pre>
&lt;p>Then run the following command:&lt;/p>
&lt;pre>&lt;code>dev/merge_parquet_pr.py
&lt;/code>&lt;/pre>
&lt;p>example output:&lt;/p>
&lt;pre>&lt;code>Which pull request would you like to merge? (e.g. 34):
&lt;/code>&lt;/pre>
&lt;p>Type the pull request number (from &lt;a href="https://github.com/apache/parquet-mr/pulls">https://github.com/apache/parquet-mr/pulls&lt;/a>) and hit enter.&lt;/p>
&lt;pre>&lt;code>=== Pull Request #X ===
title Blah Blah Blah
source repo/branch
target master
url https://api.github.com/repos/apache/parquet-mr/pulls/X
Proceed with merging pull request #3? (y/n):
&lt;/code>&lt;/pre>
&lt;p>If this looks good, type &lt;code>y&lt;/code> and hit enter.&lt;/p>
&lt;pre>&lt;code>From gitbox.apache.org:/repos/asf/parquet-mr.git
* [new branch] master -&amp;gt; PR_TOOL_MERGE_PR_3_MASTER
Switched to branch 'PR_TOOL_MERGE_PR_3_MASTER'
Merge complete (local ref PR_TOOL_MERGE_PR_3_MASTER). Push to apache? (y/n):
&lt;/code>&lt;/pre>
&lt;p>A local branch with the merge has been created. Type &lt;code>y&lt;/code> and hit enter to push it to apache master&lt;/p>
&lt;pre>&lt;code>Counting objects: 67, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (26/26), done.
Writing objects: 100% (36/36), 5.32 KiB, done.
Total 36 (delta 17), reused 0 (delta 0)
To gitbox.apache.org:/repos/asf/parquet-mr.git
b767ac4..485658a PR_TOOL_MERGE_PR_X_MASTER -&amp;gt; master
Restoring head pointer to b767ac4e
Note: checking out 'b767ac4e'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:
git checkout -b new_branch_name
HEAD is now at b767ac4... Update README.md
Deleting local branch PR_TOOL_MERGE_PR_X
Deleting local branch PR_TOOL_MERGE_PR_X_MASTER
Pull request #X merged!
Merge hash: 485658a5
Would you like to pick 485658a5 into another branch? (y/n):
&lt;/code>&lt;/pre>
&lt;p>For now, just type &lt;code>n&lt;/code>, as we only have one branch.&lt;/p>
&lt;h2 id="website">Website&lt;/h2>
&lt;h3 id="release-documentation">Release Documentation&lt;/h3>
&lt;p>To create documentation for a new release of &lt;code>parquet-format&lt;/code>, create a new &lt;code>&amp;lt;releaseNumber&amp;gt;.md&lt;/code> file under &lt;code>content/en/blog/parquet-format&lt;/code>. Please see existing files in that directory as an example.&lt;/p>
&lt;p>To create documentation for a new release of &lt;code>parquet-mr&lt;/code>, create a new &lt;code>&amp;lt;releaseNumber&amp;gt;.md&lt;/code> file under &lt;code>content/en/blog/parquet-mr&lt;/code>. Please see existing files in that directory as an example.&lt;/p>
&lt;h3 id="website-development-and-deployment">Website development and deployment&lt;/h3>
&lt;h4 id="staging">Staging&lt;/h4>
&lt;p>To make a change to the &lt;code>staging&lt;/code> version of the website:&lt;/p>
&lt;ol>
&lt;li>Make a PR against the &lt;code>staging&lt;/code> branch in the repository&lt;/li>
&lt;li>Once the PR is merged, the &lt;code>Build and Deploy Parquet Site&lt;/code>
job in the &lt;a href="https://github.com/apache/parquet-site/blob/staging/.github/workflows/deploy.yml">deployment workflow&lt;/a> will be run, populating the &lt;code>asf-staging&lt;/code> branch on this repo with the necessary files.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Do not directly edit the &lt;code>asf-staging&lt;/code> branch of this repo&lt;/strong>&lt;/p>
&lt;h4 id="production">Production&lt;/h4>
&lt;p>To make a change to the &lt;code>production&lt;/code> version of the website:&lt;/p>
&lt;ol>
&lt;li>Make a PR against the &lt;code>production&lt;/code> branch in the repository&lt;/li>
&lt;li>Once the PR is merged, the &lt;code>Build and Deploy Parquet Site&lt;/code>
job in the &lt;a href="https://github.com/apache/parquet-site/blob/production/.github/workflows/deploy.yml">deployment workflow&lt;/a> will be run, populating the &lt;code>asf-site&lt;/code> branch on this repo with the necessary files.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Do not directly edit the &lt;code>asf-site&lt;/code> branch of this repo&lt;/strong>&lt;/p></description></item><item><title>Docs: Sponsor</title><link>/docs/asf/sponsor/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/asf/sponsor/</guid><description>
&lt;p>&lt;a href="https://www.apache.org/foundation/thanks.html">Sponsor&lt;/a>&lt;/p></description></item><item><title>Docs: Donate</title><link>/docs/asf/donate/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/asf/donate/</guid><description>
&lt;p>&lt;a href="https://www.apache.org/foundation/sponsorship.html">Donate&lt;/a>&lt;/p></description></item><item><title>Docs: Releasing Parquet</title><link>/docs/contribution-guidelines/releasing/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/contribution-guidelines/releasing/</guid><description>
&lt;h3 id="setup">Setup&lt;/h3>
&lt;p>You will need:&lt;/p>
&lt;ul>
&lt;li>PGP code signing keys, published in &lt;a href="https://downloads.apache.org/parquet/KEYS">KEYS&lt;/a>&lt;/li>
&lt;li>Permission to stage artifacts in Nexus&lt;/li>
&lt;/ul>
&lt;p>Make sure you have permission to deploy Parquet artifacts to Nexus by pushing a snapshot:&lt;/p>
&lt;pre>&lt;code>mvn deploy
&lt;/code>&lt;/pre>
&lt;p>If you have problems, read the &lt;a href="https://www.apache.org/dev/publishing-maven-artifacts.html">publishing Maven artifacts documentation&lt;/a>&lt;/p>
&lt;h3 id="release-process">Release process&lt;/h3>
&lt;p>Parquet uses the maven-release-plugin to tag a release and push binary artifacts to staging in Nexus. Once Maven completes the release, the official source tarball is built from the tag.&lt;/p>
&lt;p>Before you start the release process:&lt;/p>
&lt;ol>
&lt;li>Verify that the release is finished (no planned JIRAs are pending)&lt;/li>
&lt;li>Build and test the project&lt;/li>
&lt;li>Update the change log
&lt;ul>
&lt;li>Go to the release notes for the release in JIRA&lt;/li>
&lt;li>Copy the HTML and convert it to markdown with an &lt;a href="https://domchristie.github.io/turndown/">online converter&lt;/a>&lt;/li>
&lt;li>Add the content to CHANGES.md and update formatting&lt;/li>
&lt;li>Commit the update to CHANGES.md&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h4 id="1-run-the-prepare-script">1. Run the prepare script&lt;/h4>
&lt;pre>&lt;code>dev/prepare-release.sh &amp;lt;version&amp;gt; &amp;lt;rc-number&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>This runs Maven’s &lt;code>release:prepare&lt;/code> with a consistent tag name. After this step, the release tag will exist in the git repository.&lt;/p>
&lt;p>If this step fails, you can roll back the changes by running these commands.&lt;/p>
&lt;pre>&lt;code>find ./ -type f -name '*.releaseBackup' -exec rm {} \;
find ./ -type f -name 'pom.xml' -exec git checkout {} \;
&lt;/code>&lt;/pre>
&lt;h4 id="2-run-releaseperform-to-stage-binaries">2. Run release:perform to stage binaries&lt;/h4>
&lt;pre>&lt;code>mvn release:perform
&lt;/code>&lt;/pre>
&lt;p>This uploads binary artifacts for the release tag to &lt;a href="https://repository.apache.org/">Nexus&lt;/a>.&lt;/p>
&lt;h4 id="3-in-nexus-close-the-staging-repository">3. In Nexus, close the staging repository&lt;/h4>
&lt;p>Closing a staging repository makes the binaries available in &lt;a href="https://repository.apache.org/content/groups/staging/org/apache/parquet/">staging&lt;/a>, but does not publish them.&lt;/p>
&lt;ol>
&lt;li>Go to &lt;a href="https://repository.apache.org/">Nexus&lt;/a>.&lt;/li>
&lt;li>In the menu on the left, choose “Staging Repositories”.&lt;/li>
&lt;li>Select the Parquet repository.&lt;/li>
&lt;li>At the top, click “Close” and follow the instructions. For the comment use “Apache Parquet [Format] ”.&lt;/li>
&lt;/ol>
&lt;h4 id="4-run-the-source-tarball-script">4. Run the source tarball script&lt;/h4>
&lt;pre>&lt;code>dev/source-release.sh &amp;lt;version&amp;gt; &amp;lt;rc-number&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>This script builds the source tarball from the release tag’s SHA1, signs it, and uploads the necessary files with SVN.&lt;/p>
&lt;p>The source release is pushed to &lt;a href="https://dist.apache.org/repos/dist/dev/parquet/">https://dist.apache.org/repos/dist/dev/parquet/&lt;/a>&lt;/p>
&lt;p>The last message from the script is the release commit’s SHA1 hash and URL for the VOTE e-mail.&lt;/p>
&lt;h4 id="5-send-a-vote-e-mail-to-devparquetapacheorgmailtodevparquetapacheorg">5. Send a VOTE e-mail to &lt;a href="mailto:dev@parquet.apache.org">dev@parquet.apache.org&lt;/a>&lt;/h4>
&lt;p>Here is a template you can use. Make sure everything applies to your release.&lt;/p>
&lt;pre>&lt;code>Subject: [VOTE] Release Apache Parquet &amp;lt;VERSION&amp;gt; RC&amp;lt;NUM&amp;gt;
Hi everyone,
I propose the following RC to be released as official Apache Parquet &amp;lt;VERSION&amp;gt; release.
The commit id is &amp;lt;SHA1&amp;gt;
* This corresponds to the tag: apache-parquet-&amp;lt;VERSION&amp;gt;-rc&amp;lt;NUM&amp;gt;
* https://github.com/apache/parquet-mr/tree/&amp;lt;SHA1&amp;gt;
The release tarball, signature, and checksums are here:
* https://dist.apache.org/repos/dist/dev/parquet/&amp;lt;PATH&amp;gt;
You can find the KEYS file here:
* https://downloads.apache.org/parquet/KEYS
Binary artifacts are staged in Nexus here:
* https://repository.apache.org/content/groups/staging/org/apache/parquet/
This release includes important changes that I should have summarized here, but I'm lazy.
Please download, verify, and test.
Please vote in the next 72 hours.
[ ] +1 Release this as Apache Parquet &amp;lt;VERSION&amp;gt;
[ ] +0
[ ] -1 Do not release this because...
&lt;/code>&lt;/pre>
&lt;h3 id="publishing-after-the-vote-passes">Publishing after the vote passes&lt;/h3>
&lt;p>After a release candidate passes a vote, the candidate needs to be published as the final release.&lt;/p>
&lt;h4 id="1-tag-final-release-and-set-development-version">1. Tag final release and set development version&lt;/h4>
&lt;pre>&lt;code>dev/finalize-release &amp;lt;release-version&amp;gt; &amp;lt;rc-num&amp;gt; &amp;lt;new-development-version-without-SNAPSHOT-suffix&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>This adds the final release tag at the same commit as the RC tag and sets the new development version in the pom files. If everything is fine, push the changes and the new tag to GitHub: &lt;code>git push --follow-tags&lt;/code>&lt;/p>
&lt;h4 id="2-release-the-binary-repository-in-nexus">2. Release the binary repository in Nexus&lt;/h4>
&lt;h4 id="3-copy-the-release-artifacts-in-svn-into-releases">3. Copy the release artifacts in SVN into releases&lt;/h4>
&lt;p>First, check out the candidates and releases locations in SVN:&lt;/p>
&lt;pre>&lt;code>mkdir parquet
cd parquet
svn co https://dist.apache.org/repos/dist/dev/parquet candidates
svn co https://dist.apache.org/repos/dist/release/parquet releases
&lt;/code>&lt;/pre>
&lt;p>Next, copy the directory for the release candidate that passed from candidates to releases and rename it; remove the “-rcN” part of the directory name.&lt;/p>
&lt;pre>&lt;code>cp -r candidates/apache-parquet-&amp;lt;VERSION&amp;gt;-rcN/ releases/apache-parquet-&amp;lt;VERSION&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>Then add and commit the release artifacts:&lt;/p>
&lt;pre>&lt;code>cd releases
svn add apache-parquet-&amp;lt;version&amp;gt;
svn ci -m &amp;quot;Parquet: Add release &amp;lt;VERSION&amp;gt;&amp;quot;
&lt;/code>&lt;/pre>
&lt;h4 id="4-update-parquetapacheorg">4. Update parquet.apache.org&lt;/h4>
&lt;p>Update the downloads page on parquet.apache.org. Instructions for updating the site are on the &lt;a href="http://parquet.apache.org/docs/contribution-guidelines/contributing/">contribution page&lt;/a>.&lt;/p>
&lt;h4 id="5-send-an-announce-e-mail-to-announceapacheorgmailtoannounceapacheorg-and-the-dev-list">5. Send an ANNOUNCE e-mail to &lt;a href="mailto:announce@apache.org">announce@apache.org&lt;/a> and the dev list&lt;/h4>
&lt;pre>&lt;code>[ANNOUNCE] Apache Parquet release &amp;lt;VERSION&amp;gt;
I'm pleased to announce the release of Parquet &amp;lt;VERSION&amp;gt;!
Parquet is a general-purpose columnar file format for nested data. It uses
space-efficient encodings and a compressed and splittable structure for
processing frameworks like Hadoop.
Changes are listed at: https://github.com/apache/parquet-mr/blob/apache-parquet-&amp;lt;VERSION&amp;gt;/CHANGES.md
This release can be downloaded from: https://parquet.apache.org/downloads/
Java artifacts are available from Maven Central.
Thanks to everyone for contributing!
&lt;/code>&lt;/pre></description></item><item><title>Docs: Strata 2013</title><link>/docs/learning-resources/presentations/strata-2013/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/learning-resources/presentations/strata-2013/</guid><description>
&lt;p>&lt;a href="https://www.slideshare.net/julienledem/parquet-stratany-hadoopworld2013">Slides&lt;/a>&lt;/p></description></item><item><title>Docs: Configurations</title><link>/docs/file-format/configurations/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/file-format/configurations/</guid><description>
&lt;h3 id="row-group-size">Row Group Size&lt;/h3>
&lt;p>Larger row groups allow for larger column chunks which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write). We recommend large row groups (512MB - 1GB). Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file.&lt;/p>
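&lt;p>As an illustration, a writer sketch assuming the pyarrow Python bindings are available; note that pyarrow's &lt;code>row_group_size&lt;/code> is a row count, so it has to be chosen so that a row group lands near the byte sizes recommended above:&lt;/p>
&lt;pre tabindex="0">&lt;code>import pyarrow as pa
import pyarrow.parquet as pq

# A hypothetical table; 'example.parquet' is just an illustrative path.
table = pa.table({'id': list(range(1_000_000)),
                  'value': [float(i) for i in range(1_000_000)]})

# Cap each row group at one million rows; pick the cap so that a row group
# ends up close to the recommended size in bytes for your data.
pq.write_table(table, 'example.parquet', row_group_size=1_000_000)
&lt;/code>&lt;/pre>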
&lt;h3 id="data-page--size">Data Page Size&lt;/h3>
&lt;p>Data pages should be considered indivisible, so smaller data pages allow for more fine-grained reading (e.g. single row lookup). Larger page sizes incur less space overhead (fewer page headers) and potentially less parsing overhead (fewer headers to process). Note: for sequential scans, readers are not expected to read one page at a time; a page is not the IO chunk. We recommend 8KB for page sizes.&lt;/p></description></item><item><title>Docs: Events</title><link>/docs/asf/events/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/asf/events/</guid><description>
&lt;p>&lt;a href="https://apachecon.com/?ref=parquet.apache.org">Events&lt;/a>&lt;/p></description></item><item><title>Docs: Extensibility</title><link>/docs/file-format/extensibility/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/file-format/extensibility/</guid><description>
&lt;p>There are many places in the format for compatible extensions:&lt;/p>
&lt;ul>
&lt;li>File Version: The file metadata contains a version.&lt;/li>
&lt;li>Encodings: Encodings are specified by enum and more can be added in the future.&lt;/li>
&lt;li>Page types: Additional page types can be added and safely skipped.&lt;/li>
&lt;/ul></description></item><item><title>Docs: Logical Types</title><link>/docs/file-format/types/logicaltypes/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/file-format/types/logicaltypes/</guid><description>
&lt;p>Logical types are used to extend the types that parquet can be used to store, by specifying how the primitive types should be interpreted. This keeps the set of primitive types to a minimum and reuses parquet’s efficient encodings. For example, strings are stored as byte arrays (binary) with a UTF8 annotation. These annotations define how to further decode and interpret the data. Annotations are stored as a ConvertedType in the file metadata and are documented in LogicalTypes.md.&lt;/p></description></item><item><title>Docs: Metadata</title><link>/docs/file-format/metadata/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/file-format/metadata/</guid><description>
&lt;p>There are three types of metadata: file metadata, column (chunk) metadata and page header metadata. All thrift structures are serialized using the TCompactProtocol.&lt;/p>
&lt;p>&lt;img src="/images/FileFormat.gif" alt="File Layout">&lt;/p></description></item><item><title>Docs: Nested Encoding</title><link>/docs/file-format/nestedencoding/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/file-format/nestedencoding/</guid><description>
&lt;p>To encode nested columns, Parquet uses the Dremel encoding with definition and repetition levels. Definition levels specify how many optional fields in the path for the column are defined. Repetition levels specify at which repeated field in the path the value is repeated. The max definition and repetition levels can be computed from the schema (i.e. how much nesting there is). This defines the maximum number of bits required to store the levels (levels are defined for all values in the column).&lt;/p>
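&lt;p>For example, a minimal sketch in plain Python (hypothetical schema) of how the maximum levels, and the bits needed to store them, follow from the repetition of each field on the column's path:&lt;/p>
&lt;pre tabindex="0">&lt;code># Hypothetical schema: message doc { optional group links { repeated int64 backward; } }
# One repetition entry per field on the path to the column 'links.backward'.
path = ['optional', 'repeated']

max_definition_level = sum(1 for r in path if r != 'required')   # optional/repeated add one
max_repetition_level = sum(1 for r in path if r == 'repeated')   # repeated adds one
assert (max_definition_level, max_repetition_level) == (2, 1)

# Bits required per level value (levels range from 0 to the maximum).
assert max_definition_level.bit_length() == 2
assert max_repetition_level.bit_length() == 1
&lt;/code>&lt;/pre>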
&lt;p>Two encodings for the levels are supported: BIT_PACKED and RLE. Only RLE is now used, as it supersedes BIT_PACKED.&lt;/p></description></item><item><title>Docs: Checksumming</title><link>/docs/file-format/data-pages/checksumming/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/file-format/data-pages/checksumming/</guid><description>
&lt;p>Column chunks are composed of pages written back to back. The pages share a common header and readers can skip over page they are not interested in. The data for the page follows the header and can be compressed and/or encoded. The compression and encoding is specified in the page metadata.&lt;/p></description></item><item><title>Docs: Column Chunks</title><link>/docs/file-format/data-pages/columnchunks/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/file-format/data-pages/columnchunks/</guid><description>
&lt;p>Column chunks are composed of pages written back to back. The pages share a common header and readers can skip over page they are not interested in. The data for the page follows the header and can be compressed and/or encoded. The compression and encoding is specified in the page metadata.&lt;/p></description></item><item><title>Docs: Error Recovery</title><link>/docs/file-format/data-pages/errorrecovery/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/file-format/data-pages/errorrecovery/</guid><description>
&lt;p>If the file metadata is corrupt, the file is lost. If the column metadata is corrupt, that column chunk is lost (but column chunks for this column in other row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups.&lt;/p>
&lt;p>Potential extension: With smaller row groups, the biggest issue is placing the file metadata at the end. If an error happens while writing the file metadata, all the data written will be unreadable. This can be fixed by writing the file metadata every Nth row group. Each file metadata would be cumulative and include all the row groups written so far. Combining this with the strategy used for orc or avro files using sync markers, a reader could recover partially written files.&lt;/p></description></item><item><title>Docs: Nulls</title><link>/docs/file-format/nulls/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/file-format/nulls/</guid><description>
&lt;p>Nullity is encoded in the definition levels (which are run-length encoded). NULL values are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and nothing else.&lt;/p></description></item></channel></rss>