Merge pull request #7 from kevinburkesegment/update-links

source/documentation/latest: update links
diff --git a/Gemfile.lock b/Gemfile.lock
index 0504300..1553ade 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -30,7 +30,7 @@
     execjs (2.7.0)
     fast_blank (1.0.0)
     fastimage (2.1.7)
-    ffi (1.11.1)
+    ffi (1.15.4)
     haml (5.1.2)
       temple (>= 0.8.0)
       tilt
@@ -136,4 +136,4 @@
   redcarpet!
 
 BUNDLED WITH
-   2.0.2
+   2.3.4
diff --git a/output/documentation/latest/index.html b/output/documentation/latest/index.html
index 84181bd..c496e22 100644
--- a/output/documentation/latest/index.html
+++ b/output/documentation/latest/index.html
@@ -146,9 +146,9 @@
 
 <p>The <a href="https://github.com/apache/parquet-cpp">parquet-cpp</a> project is a C++ library to read-write Parquet files.</p>
 
-<p>The <a href="https://github.com/sunchao/parquet-rs">parquet-rs</a> project is a Rust library to read-write Parquet files.</p>
+<p>The <a href="https://github.com/apache/arrow-rs/tree/master/parquet">parquet-rs</a> project is a Rust library to read-write Parquet files.</p>
 
-<p>The <a href="https://github.com/Parquet/parquet-compatibility">parquet-compatibility</a> project contains compatibility tests that can be used to verify that implementations in different languages can read and write each other&rsquo;s files.</p>
+<p>The <a href="https://github.com/Parquet/parquet-compatibility">parquet-compatibility</a> project (deprecated) contains compatibility tests that can be used to verify that implementations in different languages can read and write each other&rsquo;s files. As of January 2022 compatibility tests only exist up to version 1.2.0.</p>
 
 <h2 id="building">Building</h2>
 
@@ -165,8 +165,8 @@
 <h2 id="glossary">Glossary</h2>
 
 <ul>
-<li><p>Block (hdfs block): This means a block in hdfs and the meaning is 
-unchanged for describing this file format.  The file format is 
+<li><p>Block (hdfs block): This means a block in hdfs and the meaning is
+unchanged for describing this file format.  The file format is
 designed to work well on top of hdfs.</p></li>
 <li><p>File: An hdfs file that must include the metadata for the file.
 It does not need to actually contain the data.</p></li>
@@ -182,7 +182,7 @@
 
 <p>Hierarchically, a file consists of one or more row groups.  A row group
 contains exactly one column chunk per column.  Column chunks contain one or
-more pages. </p>
+more pages.</p>
 
 <h2 id="unit-of-parallelization">Unit of parallelization</h2>
 
@@ -213,14 +213,14 @@
 4-byte length in bytes of file metadata
 4-byte magic number "PAR1"
 </code></pre></div>
-<p>In the above example, there are N columns in this table, split into M row 
-groups.  The file metadata contains the locations of all the column metadata 
-start locations.  More details on what is contained in the metadata can be found 
+<p>In the above example, there are N columns in this table, split into M row
+groups.  The file metadata contains the locations of all the column metadata
+start locations.  More details on what is contained in the metadata can be found
 in the thrift files.</p>
 
 <p>Metadata is written after the data to allow for single pass writing.</p>
 
-<p>Readers are expected to first read the file metadata to find all the column 
+<p>Readers are expected to first read the file metadata to find all the column
 chunks they are interested in.  The column chunks should then be read sequentially.</p>
 
 <p><img src="https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif" alt="File Layout" /></p>
@@ -263,31 +263,31 @@
 
 <h2 id="nested-encoding">Nested Encoding</h2>
 
-<p>To encode nested columns, Parquet uses the Dremel encoding with definition and 
-repetition levels.  Definition levels specify how many optional fields in the 
+<p>To encode nested columns, Parquet uses the Dremel encoding with definition and
+repetition levels.  Definition levels specify how many optional fields in the
 path for the column are defined.  Repetition levels specify at which repeated field
 in the path the value is repeated.  The max definition and repetition levels can
 be computed from the schema (i.e. how much nesting there is).  This defines the
 maximum number of bits required to store the levels (levels are defined for all
-values in the column).  </p>
+values in the column).</p>
 
 <p>Two encodings for the levels are supported: BIT_PACKED and RLE. Only RLE is now used as it supersedes BIT_PACKED.</p>
 
 <h2 id="nulls">Nulls</h2>
 
-<p>Nullity is encoded in the definition levels (which is run-length encoded).  NULL values 
-are not encoded in the data.  For example, in a non-nested schema, a column with 1000 NULLs 
+<p>Nullity is encoded in the definition levels (which are run-length encoded).  NULL values
+are not encoded in the data.  For example, in a non-nested schema, a column with 1000 NULLs
 would be encoded with run-length encoding (0, 1000 times) for the definition levels and
-nothing else.  </p>
+nothing else.</p>
 
 <h2 id="data-pages">Data Pages</h2>
 
 <p>For data pages, the 3 pieces of information are encoded back to back, after the page
-header. We have the </p>
+header. We have the</p>
 
 <ul>
-<li>definition levels data,<br></li>
-<li>repetition levels data, </li>
+<li>definition levels data,</li>
+<li>repetition levels data,</li>
 <li>encoded values.
 The size specified in the header is for all 3 pieces combined.</li>
 </ul>
@@ -296,7 +296,7 @@
 are optional, based on the schema definition.  If the column is not nested (i.e.
 the path to the column has length 1), we do not encode the repetition levels (it would
 always have the value 1).  For data that is required, the definition levels are
-skipped (if encoded, it will always have the value of the max definition level). </p>
+skipped (if encoded, it will always have the value of the max definition level).</p>
 
 <p>For example, in the case where the column is non-nested and required, the data in the
 page is only the encoded values.</p>
@@ -305,52 +305,52 @@
 
 <h2 id="column-chunks">Column chunks</h2>
 
-<p>Column chunks are composed of pages written back to back.  The pages share a common 
-header and readers can skip over page they are not interested in.  The data for the 
-page follows the header and can be compressed and/or encoded.  The compression and 
+<p>Column chunks are composed of pages written back to back.  The pages share a common
+header and readers can skip over pages they are not interested in.  The data for the
+page follows the header and can be compressed and/or encoded.  The compression and
 encoding is specified in the page metadata.</p>
 
 <h2 id="checksumming">Checksumming</h2>
 
-<p>Data pages can be individually checksummed.  This allows disabling of checksums at the 
+<p>Data pages can be individually checksummed.  This allows disabling of checksums at the
 HDFS file level, to better support single row lookups.</p>
 
 <h2 id="error-recovery">Error recovery</h2>
 
-<p>If the file metadata is corrupt, the file is lost.  If the column metdata is corrupt, 
-that column chunk is lost (but column chunks for this column in other row groups are 
-okay).  If a page header is corrupt, the remaining pages in that chunk are lost.  If 
-the data within a page is corrupt, that page is lost.  The file will be more 
+<p>If the file metadata is corrupt, the file is lost.  If the column metadata is corrupt,
+that column chunk is lost (but column chunks for this column in other row groups are
+okay).  If a page header is corrupt, the remaining pages in that chunk are lost.  If
+the data within a page is corrupt, that page is lost.  The file will be more
 resilient to corruption with smaller row groups.</p>
 
-<p>Potential extension: With smaller row groups, the biggest issue is placing the file 
-metadata at the end.  If an error happens while writing the file metadata, all the 
-data written will be unreadable.  This can be fixed by writing the file metadata 
-every Nth row group.<br>
-Each file metadata would be cumulative and include all the row groups written so 
-far.  Combining this with the strategy used for rc or avro files using sync markers, 
-a reader could recover partially written files.  </p>
+<p>Potential extension: With smaller row groups, the biggest issue is placing the file
+metadata at the end.  If an error happens while writing the file metadata, all the
+data written will be unreadable.  This can be fixed by writing the file metadata
+every Nth row group.
+Each file metadata would be cumulative and include all the row groups written so
+far.  Combining this with the strategy used for rc or avro files using sync markers,
+a reader could recover partially written files.</p>
 
 <h2 id="separating-metadata-and-column-data">Separating metadata and column data.</h2>
 
 <p>The format is explicitly designed to separate the metadata from the data. This
 allows splitting columns into multiple files, as well as having a single metadata
-file reference multiple parquet files.  </p>
+file reference multiple parquet files.</p>
 
 <h2 id="configurations">Configurations</h2>
 
 <ul>
-<li>Row group size: Larger row groups allow for larger column chunks which makes it 
-possible to do larger sequential IO.  Larger groups also require more buffering in 
-the write path (or a two pass write).  We recommend large row groups (512MB - 1GB). 
-Since an entire row group might need to be read, we want it to completely fit on 
-one HDFS block.  Therefore, HDFS block sizes should also be set to be larger.  An 
-optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block 
+<li>Row group size: Larger row groups allow for larger column chunks which makes it
+possible to do larger sequential IO.  Larger groups also require more buffering in
+the write path (or a two pass write).  We recommend large row groups (512MB - 1GB).
+Since an entire row group might need to be read, we want it to completely fit on
+one HDFS block.  Therefore, HDFS block sizes should also be set to be larger.  An
+optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block
 per HDFS file.</li>
-<li>Data page size: Data pages should be considered indivisible so smaller data pages 
-allow for more fine grained reading (e.g. single row lookup).  Larger page sizes 
+incur less space overhead (fewer page headers) and potentially less parsing overhead
-(processing headers).  Note: for sequential scans, it is not expected to read a page 
+<li>Data page size: Data pages should be considered indivisible so smaller data pages
+allow for more fine grained reading (e.g. single row lookup).  Larger page sizes
+incur less space overhead (less page headers) and potentially less parsing overhead
+(processing headers).  Note: for sequential scans, it is not expected to read a page
 at a time; this is not the IO chunk.  We recommend 8KB for page sizes.</li>
 </ul>
 
@@ -360,7 +360,7 @@
 
 <ul>
 <li>File Version: The file metadata contains a version.</li>
-<li>Encodings: Encodings are specified by enum and more can be added in the future.<br></li>
+<li>Encodings: Encodings are specified by enum and more can be added in the future.</li>
 <li>Page types: Additional page types can be added and safely skipped.</li>
 </ul>
 
diff --git a/source/documentation/latest.html.md b/source/documentation/latest.html.md
index ad3a6b6..b9df579 100644
--- a/source/documentation/latest.html.md
+++ b/source/documentation/latest.html.md
@@ -16,9 +16,9 @@
 
 The [parquet-cpp](https://github.com/apache/parquet-cpp) project is a C++ library to read-write Parquet files.
 
-The [parquet-rs](https://github.com/sunchao/parquet-rs) project is a Rust library to read-write Parquet files.
+The [parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet) project is a Rust library to read-write Parquet files.
 
-The [parquet-compatibility](https://github.com/Parquet/parquet-compatibility) project contains compatibility tests that can be used to verify that implementations in different languages can read and write each other's files.
+The [parquet-compatibility](https://github.com/Parquet/parquet-compatibility) project (deprecated) contains compatibility tests that can be used to verify that implementations in different languages can read and write each other's files. As of January 2022, compatibility tests only exist up to version 1.2.0.
 
 ## Building
 
@@ -35,8 +35,8 @@
 [how-to-release]: ../how-to-release/
 
 ## Glossary
-  - Block (hdfs block): This means a block in hdfs and the meaning is 
-    unchanged for describing this file format.  The file format is 
+  - Block (hdfs block): This means a block in hdfs and the meaning is
+    unchanged for describing this file format.  The file format is
     designed to work well on top of hdfs.
 
   - File: An hdfs file that must include the metadata for the file.
@@ -55,7 +55,7 @@
 
 Hierarchically, a file consists of one or more row groups.  A row group
 contains exactly one column chunk per column.  Column chunks contain one or
-more pages. 
+more pages.
 
 ## Unit of parallelization
   - MapReduce - File/Row Group
@@ -83,14 +83,14 @@
     4-byte length in bytes of file metadata
     4-byte magic number "PAR1"
 
-In the above example, there are N columns in this table, split into M row 
-groups.  The file metadata contains the locations of all the column metadata 
-start locations.  More details on what is contained in the metadata can be found 
+In the above example, there are N columns in this table, split into M row
+groups.  The file metadata contains the locations of all the column metadata
+start locations.  More details on what is contained in the metadata can be found
 in the thrift files.
 
 Metadata is written after the data to allow for single pass writing.
 
-Readers are expected to first read the file metadata to find all the column 
+Readers are expected to first read the file metadata to find all the column
 chunks they are interested in.  The column chunks should then be read sequentially.
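+
+As a minimal sketch of that read path, using only the Python standard library
+(`example.parquet` is a hypothetical file name, the 4-byte length is assumed to be
+little-endian here, and decoding the Thrift-serialized file metadata itself is left out):
+
+    import struct
+
+    # Footer layout per the diagram above: file metadata (Thrift), then a
+    # 4-byte length of that metadata, then the magic "PAR1".
+    with open("example.parquet", "rb") as f:
+        f.seek(-8, 2)                              # last 8 bytes of the file
+        meta_len, magic = struct.unpack("<I4s", f.read(8))
+        assert magic == b"PAR1", "not a Parquet file"
+        f.seek(-(8 + meta_len), 2)                 # back up over the metadata
+        file_metadata = f.read(meta_len)           # column chunk locations live in here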
 
  ![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
@@ -129,28 +129,28 @@
 [logical-types]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
 
 ## Nested Encoding
-To encode nested columns, Parquet uses the Dremel encoding with definition and 
-repetition levels.  Definition levels specify how many optional fields in the 
+To encode nested columns, Parquet uses the Dremel encoding with definition and
+repetition levels.  Definition levels specify how many optional fields in the
 path for the column are defined.  Repetition levels specify at which repeated field
 in the path the value is repeated.  The max definition and repetition levels can
 be computed from the schema (i.e. how much nesting there is).  This defines the
 maximum number of bits required to store the levels (levels are defined for all
-values in the column).  
+values in the column).
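+
+As an illustration of computing the max levels (the helper below is hypothetical,
+not part of any Parquet API; "required", "optional" and "repeated" are the
+repetition of each field on the column's path):
+
+    def max_levels(path):
+        # definition levels count every field on the path that may be missing;
+        # repetition levels count every field on the path that may repeat
+        max_definition = sum(1 for r in path if r in ("optional", "repeated"))
+        max_repetition = sum(1 for r in path if r == "repeated")
+        return max_definition, max_repetition
+
+    # a column a.b where a is repeated and b is optional
+    assert max_levels(["repeated", "optional"]) == (2, 1)
+    # each stored level then needs enough bits to represent values 0..max level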
 
 Two encodings for the levels are supported: BIT_PACKED and RLE. Only RLE is now used as it supersedes BIT_PACKED.
 
 ## Nulls
-Nullity is encoded in the definition levels (which is run-length encoded).  NULL values 
-are not encoded in the data.  For example, in a non-nested schema, a column with 1000 NULLs 
+Nullity is encoded in the definition levels (which are run-length encoded).  NULL values
+are not encoded in the data.  For example, in a non-nested schema, a column with 1000 NULLs
 would be encoded with run-length encoding (0, 1000 times) for the definition levels and
-nothing else.  
+nothing else.
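+
+A conceptual sketch of that example (this only collapses levels into runs; it is
+not the actual byte-level RLE encoding Parquet writes):
+
+    def runs(levels):
+        # collapse a sequence of definition levels into (level, count) runs
+        out = []
+        for level in levels:
+            if out and out[-1][0] == level:
+                out[-1][1] += 1
+            else:
+                out.append([level, 1])
+        return [tuple(r) for r in out]
+
+    # 1000 NULLs in an optional, non-nested column: definition level 0 repeated
+    # 1000 times, and no values at all in the data
+    assert runs([0] * 1000) == [(0, 1000)]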
 
 ## Data Pages
 For data pages, the 3 pieces of information are encoded back to back, after the page
-header. We have the 
+header. We have the
 
- - definition levels data,  
- - repetition levels data, 
+ - definition levels data,
+ - repetition levels data,
  - encoded values.
 The size specified in the header is for all 3 pieces combined.
 
@@ -158,7 +158,7 @@
 are optional, based on the schema definition.  If the column is not nested (i.e.
 the path to the column has length 1), we do not encode the repetition levels (it would
 always have the value 1).  For data that is required, the definition levels are
-skipped (if encoded, it will always have the value of the max definition level). 
+skipped (if encoded, it will always have the value of the max definition level).
 
 For example, in the case where the column is non-nested and required, the data in the
 page is only the encoded values.
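+
+The rules above can be summarized with a small hypothetical helper (it only says
+which pieces are present, not how they are laid out):
+
+    def page_pieces(max_definition_level, max_repetition_level):
+        pieces = []
+        if max_definition_level > 0:     # some field on the path is optional or repeated
+            pieces.append("definition levels")
+        if max_repetition_level > 0:     # the column or an ancestor is repeated
+            pieces.append("repetition levels")
+        pieces.append("encoded values")  # always present
+        return pieces
+
+    # non-nested and required: the page holds only the encoded values
+    assert page_pieces(0, 0) == ["encoded values"]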
@@ -166,52 +166,52 @@
 The supported encodings are described in [Encodings.md](https://github.com/apache/parquet-format/blob/master/Encodings.md)
 
 ## Column chunks
-Column chunks are composed of pages written back to back.  The pages share a common 
-header and readers can skip over page they are not interested in.  The data for the 
-page follows the header and can be compressed and/or encoded.  The compression and 
+Column chunks are composed of pages written back to back.  The pages share a common
+header and readers can skip over pages they are not interested in.  The data for the
+page follows the header and can be compressed and/or encoded.  The compression and
 encoding is specified in the page metadata.
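+
+One way to see this structure on an actual file is with the pyarrow library (an
+external tool, shown here only as an illustration; `example.parquet` is a
+hypothetical file):
+
+    import pyarrow.parquet as pq
+
+    meta = pq.ParquetFile("example.parquet").metadata
+    for rg in range(meta.num_row_groups):
+        row_group = meta.row_group(rg)
+        for col in range(row_group.num_columns):
+            chunk = row_group.column(col)
+            # per column chunk: column path, compression codec, and sizes
+            print(chunk.path_in_schema, chunk.compression,
+                  chunk.total_compressed_size, chunk.total_uncompressed_size)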
 
 ## Checksumming
-Data pages can be individually checksummed.  This allows disabling of checksums at the 
+Data pages can be individually checksummed.  This allows disabling of checksums at the
 HDFS file level, to better support single row lookups.
 
 ## Error recovery
-If the file metadata is corrupt, the file is lost.  If the column metdata is corrupt, 
-that column chunk is lost (but column chunks for this column in other row groups are 
-okay).  If a page header is corrupt, the remaining pages in that chunk are lost.  If 
-the data within a page is corrupt, that page is lost.  The file will be more 
+If the file metadata is corrupt, the file is lost.  If the column metadata is corrupt,
+that column chunk is lost (but column chunks for this column in other row groups are
+okay).  If a page header is corrupt, the remaining pages in that chunk are lost.  If
+the data within a page is corrupt, that page is lost.  The file will be more
 resilient to corruption with smaller row groups.
 
-Potential extension: With smaller row groups, the biggest issue is placing the file 
-metadata at the end.  If an error happens while writing the file metadata, all the 
-data written will be unreadable.  This can be fixed by writing the file metadata 
-every Nth row group.  
-Each file metadata would be cumulative and include all the row groups written so 
-far.  Combining this with the strategy used for rc or avro files using sync markers, 
-a reader could recover partially written files.  
+Potential extension: With smaller row groups, the biggest issue is placing the file
+metadata at the end.  If an error happens while writing the file metadata, all the
+data written will be unreadable.  This can be fixed by writing the file metadata
+every Nth row group.
+Each file metadata would be cumulative and include all the row groups written so
+far.  Combining this with the strategy used for rc or avro files using sync markers,
+a reader could recover partially written files.
 
 ## Separating metadata and column data.
 The format is explicitly designed to separate the metadata from the data. This
 allows splitting columns into multiple files, as well as having a single metadata
-file reference multiple parquet files.  
+file reference multiple parquet files.
 
 ## Configurations
-- Row group size: Larger row groups allow for larger column chunks which makes it 
-possible to do larger sequential IO.  Larger groups also require more buffering in 
-the write path (or a two pass write).  We recommend large row groups (512MB - 1GB). 
-Since an entire row group might need to be read, we want it to completely fit on 
-one HDFS block.  Therefore, HDFS block sizes should also be set to be larger.  An 
-optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block 
+- Row group size: Larger row groups allow for larger column chunks which makes it
+possible to do larger sequential IO.  Larger groups also require more buffering in
+the write path (or a two pass write).  We recommend large row groups (512MB - 1GB).
+Since an entire row group might need to be read, we want it to completely fit on
+one HDFS block.  Therefore, HDFS block sizes should also be set to be larger.  An
+optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block
 per HDFS file.
-- Data page size: Data pages should be considered indivisible so smaller data pages 
-allow for more fine grained reading (e.g. single row lookup).  Larger page sizes 
+incur less space overhead (fewer page headers) and potentially less parsing overhead
-(processing headers).  Note: for sequential scans, it is not expected to read a page 
+- Data page size: Data pages should be considered indivisible so smaller data pages
+allow for more fine grained reading (e.g. single row lookup).  Larger page sizes
+incur less space overhead (less page headers) and potentially less parsing overhead
+(processing headers).  Note: for sequential scans, it is not expected to read a page
 at a time; this is not the IO chunk.  We recommend 8KB for page sizes.
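+
+Both settings can be applied at write time, for example with the pyarrow library
+(external to this project; note that its `row_group_size` is a row count rather
+than a byte size, so the numbers below are only illustrative):
+
+    import pyarrow as pa
+    import pyarrow.parquet as pq
+
+    table = pa.table({"id": list(range(1_000_000))})   # toy data
+    pq.write_table(
+        table,
+        "example.parquet",
+        row_group_size=1_000_000,  # rows per row group; choose a count near the target byte size
+        data_page_size=8 * 1024,   # target data page size in bytes (8KB, as recommended above)
+    )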
 
 ## Extensibility
 There are many places in the format for compatible extensions:
 
 - File Version: The file metadata contains a version.
-- Encodings: Encodings are specified by enum and more can be added in the future.  
+- Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.