Spec: Fix table of content generation (#11067)
diff --git a/format/spec.md b/format/spec.md
index 601cbcc..6b80e87 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -30,13 +30,13 @@
The format version number is incremented when new features are added that will break forward-compatibility---that is, when older readers would not read newer table features correctly. Tables may continue to be written with an older version of the spec to ensure compatibility by not using features that are not yet implemented by processing engines.
-#### Version 1: Analytic Data Tables
+### Version 1: Analytic Data Tables
Version 1 of the Iceberg spec defines how to manage large analytic tables using immutable file formats: Parquet, Avro, and ORC.
All version 1 data and metadata files are valid after upgrading a table to version 2. [Appendix E](#version-2) documents how to default version 2 fields when reading version 1 metadata.
-#### Version 2: Row-level Deletes
+### Version 2: Row-level Deletes
Version 2 of the Iceberg spec adds row-level updates and deletes for analytic tables with immutable files.
@@ -44,7 +44,7 @@
In addition to row-level deletes, version 2 makes some requirements stricter for writers. The full set of changes is listed in [Appendix E](#version-2).
-#### Version 3: Extended Types and Capabilities
+### Version 3: Extended Types and Capabilities
Version 3 of the Iceberg spec extends data types and existing metadata structures to add new capabilities:
@@ -75,7 +75,7 @@
The manifests that make up a snapshot are stored in a manifest list file. Each manifest list stores metadata about manifests, including partition stats and data file counts. These stats are used to avoid reading manifests that are not required for an operation.
-#### Optimistic Concurrency
+### Optimistic Concurrency
An atomic swap of one table metadata file for another provides the basis for serializable isolation. Readers use the snapshot that was current when they load the table metadata and are not affected by changes until they refresh and pick up a new metadata location.
@@ -85,7 +85,7 @@
The conditions required for a write to successfully commit determine the isolation level. Writers can select what to validate and can make different isolation guarantees.
-#### Sequence Numbers
+### Sequence Numbers
The relative age of data and delete files relies on a sequence number that is assigned to every successful commit. When a snapshot is created for a commit, it is optimistically assigned the next sequence number, and it is written into the snapshot's metadata. If the commit fails and must be retried, the sequence number is reassigned and written into new snapshot metadata.
@@ -94,7 +94,7 @@
Inheriting the sequence number from manifest metadata allows writing a new manifest once and reusing it in commit retries. To change a sequence number for a retry, only the manifest list must be rewritten, and it would be rewritten anyway with the latest set of manifests.
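The retry behavior described above can be sketched as rewriting only the manifest-list entries for manifests added by the commit; the entry structure and field names below are illustrative assumptions, not the reference implementation:

```python
# Sketch: on a commit retry, manifests written for the first attempt are
# reused; only the manifest list is rewritten so its entries carry the
# newly assigned sequence number. Entry structure is illustrative.
def rewrite_manifest_list(entries, new_sequence_number):
    return [
        {**e, "sequence_number": new_sequence_number}
        if e["added_by_this_commit"] else e
        for e in entries
    ]

entries = [
    {"path": "m1.avro", "sequence_number": 7, "added_by_this_commit": False},
    {"path": "m2.avro", "sequence_number": 8, "added_by_this_commit": True},
]
retried = rewrite_manifest_list(entries, new_sequence_number=9)
```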
-#### Row-level Deletes
+### Row-level Deletes
Row-level deletes are stored in delete files.
@@ -106,7 +106,7 @@
Like data files, delete files are tracked by partition. In general, a delete file must be applied to older data files with the same partition; see [Scan Planning](#scan-planning) for details. Column metrics can be used to determine whether a delete file's rows overlap the contents of a data file or a scan range.
-#### File System Operations
+### File System Operations
Iceberg only requires that file systems support the following operations:
@@ -121,9 +121,9 @@
Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files.
-# Specification
+## Specification
-### Terms
+#### Terms
* **Schema** -- Names and types of fields in a table.
* **Partition spec** -- A definition of how partition values are derived from data fields.
@@ -133,7 +133,7 @@
* **Data file** -- A file that contains rows of a table.
* **Delete file** -- A file that encodes rows of a table that are deleted by position or data values.
-### Writer requirements
+#### Writer requirements
Some tables in this spec have columns that specify requirements for tables by version. These requirements are intended for writers when adding metadata files (including manifest files and manifest lists) to a table with the given version.
@@ -158,19 +158,19 @@
Readers may be more strict for metadata JSON files because the JSON files are not reused and will always match the table version. Required fields that were not present or were optional in prior versions may be handled as required fields. For example, a v2 table that is missing `last-sequence-number` can throw an exception.
-### Writing data files
+#### Writing data files
All columns must be written to data files even if they introduce redundancy with metadata stored in manifest files (e.g. columns with identity partition transforms). Writing all columns provides a backup in case of corruption or bugs in the metadata layer.
Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
-## Schemas and Data Types
+### Schemas and Data Types
A table's **schema** is a list of named columns. All data types are either primitives or nested types, which are maps, lists, or structs. A table schema is also a struct type.
For the representations of these types in Avro, ORC, and Parquet file formats, see Appendix A.
-### Nested Types
+#### Nested Types
A **`struct`** is a tuple of typed values. Each field in the tuple is named and has an integer id that is unique in the table schema. Each field can be either optional or required, meaning that values can (or cannot) be null. Fields may be any type. Fields may have an optional comment or doc string. Fields can have [default values](#default-values).
@@ -178,7 +178,7 @@
A **`map`** is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types.
-### Primitive Types
+#### Primitive Types
Supported primitive types are defined in the table below. Primitive types added after v1 have an "added by" version that is the first spec version in which the type is allowed. For example, nanosecond-precision timestamps are part of the v3 spec; using v3 types in v1 or v2 tables can break forward compatibility.
@@ -211,7 +211,7 @@
For details on how to serialize a schema to JSON, see Appendix C.
-### Default values
+#### Default values
Default values can be tracked for struct fields (both nested structs and the top-level schema's struct). There can be two defaults with a field:
@@ -227,7 +227,7 @@
Default values are attributes of fields in schemas and serialized with fields in the JSON format. See [Appendix C](#appendix-c-json-serialization).
-### Schema Evolution
+#### Schema Evolution
Schemas may be evolved by type promotion or adding, deleting, renaming, or reordering fields in structs (both nested structs and the top-level schema’s struct).
@@ -275,7 +275,7 @@
* If a field value is missing from a struct's `write-default`, the field's `write-default` must be used for the field
-#### Column Projection
+##### Column Projection
Columns in Iceberg data files are selected by field id. The table schema's column names and order may change after a data file is written, and projection must be done using field ids.
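The field-id-based projection rule above can be sketched as follows; the column and schema structures are illustrative, not the reference implementation:

```python
# Sketch: resolve a projected schema against a data file by field id,
# never by name. Columns missing from the file resolve to None
# (projected as null unless a default or name mapping applies).
def project_by_field_id(file_columns, projected_field_ids):
    by_id = {col["field_id"]: col for col in file_columns}
    return {fid: by_id.get(fid) for fid in projected_field_ids}

file_cols = [
    {"field_id": 1, "name": "id"},
    {"field_id": 2, "name": "ts"},  # renamed to "event_ts" in the table schema
]
# field id 3 was added to the table schema after this file was written
projection = project_by_field_id(file_cols, [1, 2, 3])
```

Matching by field id keeps projection correct after renames and reorders, since the ids in the file never change.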
@@ -307,7 +307,7 @@
For details on serialization, see [Appendix C](#name-mapping-serialization).
-### Identifier Field IDs
+#### Identifier Field IDs
A schema can optionally track the set of primitive fields that identify rows in a table, using the property `identifier-field-ids` (see JSON encoding in Appendix C).
@@ -316,7 +316,7 @@
Identifier fields may be nested in structs but cannot be nested within maps or lists. Float, double, and optional fields cannot be used as identifier fields and a nested field cannot be used as an identifier field if it is nested in an optional struct, to avoid null values in identifiers.
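A validity check for identifier fields following the rules above can be sketched like this; the field descriptor shape is an illustrative assumption:

```python
# Sketch: validate a candidate identifier field. Identifier fields must
# not be float, double, or optional, and must not be nested inside a
# map, list, or optional struct.
def valid_identifier_field(field):
    if field["type"] in ("float", "double"):
        return False  # floating point cannot reliably identify rows
    if field["optional"]:
        return False  # identifier values must not be null
    for ancestor in field["ancestors"]:
        if ancestor["type"] in ("map", "list"):
            return False
        if ancestor["type"] == "struct" and ancestor["optional"]:
            return False
    return True

ok = valid_identifier_field(
    {"type": "long", "optional": False,
     "ancestors": [{"type": "struct", "optional": False}]}
)
bad = valid_identifier_field(
    {"type": "double", "optional": False, "ancestors": []}
)
```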
-### Reserved Field IDs
+#### Reserved Field IDs
Iceberg tables must not use field ids greater than 2147483447 (`Integer.MAX_VALUE - 200`). This id range is reserved for metadata columns that can be used in user data schemas, like the `_file` column that holds the file path in which a row was stored.
@@ -335,7 +335,7 @@
| **`2147483543 _row_id`** | `long` | A unique long assigned when row-lineage is enabled, see [Row Lineage](#row-lineage) |
| **`2147483542 _last_updated_sequence_number`** | `long` | The sequence number which last updated this row when row-lineage is enabled [Row Lineage](#row-lineage) |
-### Row Lineage
+#### Row Lineage
In v3 and later, an Iceberg table can track row lineage fields for all newly created rows. Row lineage is enabled by setting the field `row-lineage` to true in the table's metadata. When enabled, engines must maintain the `next-row-id` table field and the following row-level fields when writing data files:
@@ -347,7 +347,7 @@
When row lineage is enabled, new snapshots cannot include [Equality Deletes](#equality-delete-files). Row lineage is incompatible with equality deletes because lineage values must be maintained, but equality deletes are used to avoid reading existing data before writing changes.
-#### Row lineage assignment
+##### Row lineage assignment
Row lineage fields are written when row lineage is enabled. When not enabled, row lineage fields (`_row_id` and `_last_updated_sequence_number`) must not be written to data files. The rest of this section applies when row lineage is enabled.
@@ -368,7 +368,7 @@
3. If the write has not modified the row, the existing non-null `_last_updated_sequence_number` value must be copied to the new data file
-#### Row lineage example
+##### Row lineage example
This example demonstrates how `_row_id` and `_last_updated_sequence_number` are assigned for a snapshot when row lineage is enabled. This starts with a table with row lineage enabled and a `next-row-id` of 1000.
@@ -409,7 +409,7 @@
When the new snapshot is committed, the table's `next-row-id` must also be updated (even if the new snapshot is not in the main branch). Because 225 rows were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 1,000 + 225 = 1,225:
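The arithmetic above can be written as a small sketch of the update rule: `next-row-id` advances by the number of rows in data files added by the snapshot.

```python
# Sketch: the table's next-row-id advances by the sum of record counts
# for data files added by the committed snapshot.
def updated_next_row_id(next_row_id, added_file_record_counts):
    return next_row_id + sum(added_file_record_counts)

# added1: 100 rows, added2: 0 rows, added3: 125 rows
new_value = updated_next_row_id(1000, [100, 0, 125])  # 1225
```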
-### Enabling Row Lineage for Non-empty Tables
+##### Enabling Row Lineage for Non-empty Tables
Any snapshot without the field `first-row-id` does not have any lineage information and values for `_row_id` and `_last_updated_sequence_number` cannot be assigned accurately.
@@ -419,7 +419,7 @@
and assigned `_row_id` and `_last_updated_sequence_number` as if they were new rows.
-## Partitioning
+### Partitioning
Data files are stored in manifests with a tuple of partition values that are used in scans to filter out files that cannot contain records that match the scan’s filter predicate. Partition values for a data file must be the same for all records stored in the data file. (Manifests store data files from any partition, as long as the partition spec is the same for the data files.)
@@ -440,7 +440,7 @@
Partition field IDs must be reused if an existing partition spec contains an equivalent field.
-### Partition Transforms
+#### Partition Transforms
| Transform name | Description | Source types | Result type |
|-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
@@ -458,7 +458,7 @@
The `void` transform may be used to replace the transform in an existing partition field so that the field is effectively dropped in v1 tables. See partition evolution below.
-### Bucket Transform Details
+#### Bucket Transform Details
Bucket partition transforms use a 32-bit hash of the source value. The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.
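A pure-Python sketch of the 32-bit Murmur3 hash (x86 variant, seed 0) and the bucket formula follows. This is an illustrative implementation, not the reference one, and it hashes raw bytes only; the per-type byte encodings that feed the hash are defined in Appendix B and are not reproduced here.

```python
# Sketch: Murmur3 x86 32-bit, seeded with 0, plus the bucket formula
# (positive hash modulo N). Illustrative, not the reference code.
MASK32 = 0xFFFFFFFF

def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data) & ~3
    for i in range(0, n, 4):  # mix each full 4-byte little-endian block
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = ((k << 15) | (k >> 17)) & MASK32
        k = (k * c2) & MASK32
        h ^= k
        h = ((h << 13) | (h >> 19)) & MASK32
        h = (h * 5 + 0xE6546B64) & MASK32
    k = 0
    tail = data[n:]  # mix the remaining 1-3 bytes, if any
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & MASK32
        k = ((k << 15) | (k >> 17)) & MASK32
        k = (k * c2) & MASK32
        h ^= k
    h ^= len(data)  # finalization (fmix32)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h

def bucket(value_bytes: bytes, num_buckets: int) -> int:
    # positive hash value modulo the bucket count
    return (murmur3_x86_32(value_bytes) & 0x7FFFFFFF) % num_buckets

h_empty = murmur3_x86_32(b"")
b16 = bucket(b"iceberg", 16)
```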
@@ -475,7 +475,7 @@
For hash function details by type, see Appendix B.
-### Truncate Transform Details
+#### Truncate Transform Details
| **Type** | **Config** | **Truncate specification** | **Examples** |
|---------------|-----------------------|------------------------------------------------------------------|----------------------------------|
@@ -493,7 +493,7 @@
4. In contrast to strings, binary values do not have an assumed encoding and are truncated to `L` bytes.
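The truncate rules above can be sketched for a few source types; widths and values here are illustrative:

```python
# Sketch of the truncate transform for int, string, and binary sources.
def truncate_int(width: int, v: int) -> int:
    # Python's % already yields a non-negative remainder for positive
    # widths, matching v - (v mod W) with a positive mod
    return v - (v % width)

def truncate_string(length: int, s: str) -> str:
    return s[:length]  # first L code points, not bytes

def truncate_binary(length: int, b: bytes) -> bytes:
    return b[:length]  # first L bytes, no assumed encoding

a = truncate_int(10, 1)            # 0
b = truncate_int(10, -1)           # -10
c = truncate_string(3, "iceberg")  # "ice"
```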
-### Partition Evolution
+#### Partition Evolution
Table partitioning can be evolved by adding, removing, renaming, or reordering partition spec fields.
@@ -510,7 +510,7 @@
3. Only add partition fields at the end of the previous partition spec
-## Sorting
+### Sorting
Users can sort their data within partitions by columns to improve performance. How the data is sorted can be declared per data or delete file by a **sort order**.
@@ -530,7 +530,7 @@
A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.
-## Manifests
+### Manifests
A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a [snapshot](#snapshots), which tracks all of the files in a table at some point in time. Manifests are tracked by a [manifest list](#manifest-lists) for each table snapshot.
@@ -598,7 +598,7 @@
The column metrics maps are used when filtering to select both data and delete files. For delete files, the metrics must store bounds and counts for all deleted rows, or must be omitted. Storing metrics for deleted rows ensures that the values can be used during job planning to find delete files that must be merged during a scan.
-### Manifest Entry Fields
+#### Manifest Entry Fields
The manifest entry fields are used to keep track of the snapshot in which files were added or logically deleted. The `data_file` struct is nested inside of the manifest entry so that it can be easily passed to job planning without the manifest entry fields.
@@ -616,7 +616,7 @@
1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires. It is not recommended to add a deleted file back to a table. Adding a deleted file can lead to edge cases where incremental deletes can break table snapshots.
2. Manifest list files are required in v2, so that the `sequence_number` and `snapshot_id` to inherit are always available.
-### Sequence Number Inheritance
+#### Sequence Number Inheritance
Manifests track the sequence number when a data or delete file was added to the table.
@@ -629,7 +629,7 @@
When reading v1 manifests with no sequence number column, sequence numbers for all files must default to 0.
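A minimal reader-side sketch of this inheritance rule, with the v1 default applied, might look like the following (a simplified view, not the reference implementation):

```python
# Sketch: resolve a file's data sequence number, inheriting from the
# manifest list when the manifest entry's value is null and defaulting
# to 0 for v1 manifests that lack the column entirely.
def data_sequence_number(entry_value, manifest_sequence_number):
    if entry_value is not None:
        return entry_value
    if manifest_sequence_number is not None:
        return manifest_sequence_number  # inherited from the manifest list
    return 0  # v1 manifest with no sequence number column

explicit = data_sequence_number(5, 9)          # 5
inherited = data_sequence_number(None, 9)      # 9
v1_default = data_sequence_number(None, None)  # 0
```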
-### First Row ID Inheritance
+#### First Row ID Inheritance
Row ID inheritance is used when row lineage is enabled. When not enabled, a data file's `first_row_id` must always be set to `null`. The rest of this section applies when row lineage is enabled.
@@ -639,7 +639,7 @@
The `first_row_id` is only inherited for added data files. The inherited value must be written into the data file metadata for existing and deleted entries. The value of `first_row_id` for delete files is always `null`.
-## Snapshots
+### Snapshots
A snapshot consists of the following fields:
@@ -673,7 +673,7 @@
Valid snapshots are stored as a list in table metadata. For serialization, see Appendix C.
-### Snapshot Row IDs
+#### Snapshot Row IDs
When row lineage is not enabled, `first-row-id` must be omitted. The rest of this section applies when row lineage is enabled.
@@ -811,13 +811,13 @@
2. The snapshot is not one of the first `min-snapshots-to-keep` in the branch (including the branch's referenced snapshot)
5. Expire any snapshot not in the set of snapshots to retain.
-## Table Metadata
+### Table Metadata
Table metadata is stored as JSON. Each table metadata change creates a new table metadata file that is committed by an atomic operation. This operation is used to ensure that a new version of table metadata replaces the version on which it was based. This produces a linear history of table versions and ensures that concurrent writes are not lost.
The atomic operation used to commit metadata depends on how tables are tracked and is not standardized by this spec. See the sections below for examples.
-### Table Metadata Fields
+#### Table Metadata Fields
Table metadata consists of the following fields:
@@ -853,7 +853,7 @@
When a new snapshot is added, the table's `next-row-id` should be updated to the previous `next-row-id` plus the sum of `record_count` for all data files added in the snapshot (this is also equal to the sum of `added_rows_count` for all manifests added in the snapshot). This ensures that `next-row-id` is always higher than any assigned row ID in the table.
-### Table Statistics
+#### Table Statistics
Table statistics files are valid [Puffin files](puffin-spec.md). Statistics are informational. A reader can choose to
ignore statistics information. Statistics support is not required to read the table correctly. A table can contain
@@ -881,7 +881,7 @@
| _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
-### Partition Statistics
+#### Partition Statistics
Partition statistics files are based on [partition statistics file spec](#partition-statistics-file).
Partition statistics are not required for reading or planning and readers may ignore them.
@@ -897,7 +897,7 @@
| _required_ | _required_ | **`statistics-path`** | `string` | Path of the partition statistics file. See [Partition statistics file](#partition-statistics-file). |
| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
-#### Partition Statistics File
+##### Partition Statistics File
Statistics information for each unique partition tuple is stored as a row in any of the table's data file formats (for example, Parquet or ORC).
These rows must be sorted in ascending order with NULLS FIRST by the `partition` field to optimize filtering rows while scanning.
@@ -934,7 +934,7 @@
and then the table has evolved into `spec#1` which has just one field `{field#2}`.
The unified partition type looks like `Struct<field#1, field#2>`.
-## Commit Conflict Resolution and Retry
+### Commit Conflict Resolution and Retry
When two commits happen at the same time and are based on the same version, only one commit will succeed. In most cases, the failed commit can be applied to the new current version of table metadata and retried. Updates verify the conditions under which they can be applied to a new version and retry if those conditions are met.
@@ -944,7 +944,7 @@
* Table schema updates and partition spec changes must validate that the schema has not changed between the base version and the current version.
-### File System Tables
+#### File System Tables
_Note: This file system based scheme to commit a metadata file is **deprecated** and will be removed in version 4 of this spec. The scheme is **unsafe** in object stores and local file systems._
@@ -963,7 +963,7 @@
1. The file system table scheme is implemented in [HadoopTableOperations](../javadoc/{{ icebergVersion }}/index.html?org/apache/iceberg/hadoop/HadoopTableOperations.html).
-### Metastore Tables
+#### Metastore Tables
The atomic swap needed to commit new versions of table metadata can be implemented by storing a pointer in a metastore or database that is updated with a check-and-put operation [1]. The check-and-put validates that the version of the table that a write is based on is still current and then makes the new metadata from the write the current version.
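The check-and-put pattern can be sketched as below; the in-memory metastore class is a stand-in, and a real metastore or database performs the compare-and-swap atomically:

```python
# Sketch: commit new table metadata by a check-and-put on the pointer
# stored in a metastore. The metastore here is an in-memory stand-in.
class CommitFailedException(Exception):
    pass

class InMemoryMetastore:
    def __init__(self):
        self._locations = {}

    def check_and_put(self, table, expected, new_location):
        # A real metastore performs this check and update atomically.
        if self._locations.get(table) != expected:
            return False
        self._locations[table] = new_location
        return True

def commit(metastore, table, base_location, new_location):
    # Validate that the base version is still current, then swap.
    if not metastore.check_and_put(table, base_location, new_location):
        raise CommitFailedException(
            f"{table}: base metadata {base_location} is no longer current")

store = InMemoryMetastore()
commit(store, "db.tbl", None, "v1.metadata.json")  # first commit
commit(store, "db.tbl", "v1.metadata.json", "v2.metadata.json")
```

A writer that loses the race reloads the table, reapplies its changes to the new current metadata, and retries the commit.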
@@ -980,7 +980,7 @@
1. The metastore table scheme is partly implemented in [BaseMetastoreTableOperations](../javadoc/{{ icebergVersion }}/index.html?org/apache/iceberg/BaseMetastoreTableOperations.html).
-## Delete Formats
+### Delete Formats
This section details how to encode row-level deletes in Iceberg delete files. Row-level deletes are not supported in v1.
@@ -991,7 +991,7 @@
Both position and equality deletes allow encoding deleted row values with a delete. This can be used to reconstruct a stream of changes to a table.
-### Position Delete Files
+#### Position Delete Files
Position-based delete files identify deleted rows by file and position in one or more data files, and may optionally contain the deleted row.
@@ -1016,7 +1016,7 @@
* Sorting by `file_path` allows filter pushdown by file in columnar storage formats.
* Sorting by `pos` allows filtering rows while scanning, to avoid keeping deletes in memory.
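Applying position deletes to a single data file can be sketched as follows; the record structures are illustrative:

```python
# Sketch: apply position deletes to one data file's rows. Delete
# records are (file_path, pos) pairs; only pairs naming this file
# apply, and pos is the 0-based row position in the file.
def apply_position_deletes(data_file_path, rows, delete_records):
    deleted = {pos for path, pos in delete_records if path == data_file_path}
    return [row for pos, row in enumerate(rows) if pos not in deleted]

rows = ["a", "b", "c", "d"]
deletes = [
    ("s3://bucket/data/f1.parquet", 1),
    ("s3://bucket/data/f2.parquet", 0),  # different file, ignored here
]
live = apply_position_deletes("s3://bucket/data/f1.parquet", rows, deletes)
```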
-### Equality Delete Files
+#### Equality Delete Files
Equality delete files identify deleted rows in a collection of data files by one or more column values, and may optionally contain additional columns of the deleted row.
@@ -1068,7 +1068,7 @@
If a delete column in an equality delete file is later dropped from the table, it must still be used when applying the equality deletes. If a column was added to a table and later used as a delete column in an equality delete file, the column value is read for older data files using normal projection rules (defaults to `null`).
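Applying equality deletes can be sketched like this; delete columns are identified by field id in a real table, but names are used here for brevity:

```python
# Sketch: drop rows whose values in the delete columns match any row in
# the equality delete file. Column names stand in for field ids.
def apply_equality_deletes(rows, delete_rows, equality_columns):
    deleted_keys = {
        tuple(d[c] for c in equality_columns) for d in delete_rows
    }
    return [
        r for r in rows
        if tuple(r[c] for c in equality_columns) not in deleted_keys
    ]

rows = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
live = apply_equality_deletes(rows, [{"id": 2}], ["id"])
```

In a real scan, a delete file applies only to data files with a strictly smaller data sequence number; this sketch shows only the matching step.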
-### Delete File Stats
+#### Delete File Stats
Manifests hold the same statistics for delete files and data files. For delete files, the metrics describe the values that were deleted.