| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| # RFC-45: Asynchronous Metadata Indexing |
| |
| ## Proposers |
| |
| - @codope |
| - @manojpec |
| |
| ## Approvers |
| |
| - @nsivabalan |
| - @vinothchandar |
| |
| ## Status |
| |
| JIRA: [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) |
| |
| ## Abstract |
| |
Metadata indexing (aka metadata bootstrapping) is the process of creating one
or more metadata-based indexes, e.g. the data partitions to files index, that
are stored in the Hudi metadata table. Currently, the metadata table (referred
to as MDT hereafter) supports a single partition, which is created
synchronously with the corresponding data table, i.e. commits are first applied
to the metadata table and then to the data table. Our goal for the MDT is to
support multiple partitions to boost the performance of existing index and
record lookups. However, synchronous metadata indexing does not scale well as
we add more partitions to the MDT, because the regular writers (writing to the
data table) have to wait until the MDT commit completes. In this RFC, we
propose a design to support asynchronous metadata indexing.
| |
| ## Background |
| |
We can read more about the MDT design in
[RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements).
Here is a quick summary of the current state (Hudi v0.10.1). The MDT is an
internal Merge-on-Read (MOR) table with a single partition called `files`,
which stores the data partitions to files index used in file listing. The MDT
is co-located with the data table (inside the `.hoodie/metadata` directory
under the basepath). To handle multi-writer scenarios, users configure a lock
provider, and only one writer can access the MDT in read-write mode. Hence, any
write to the MDT is guarded by the data table lock. This ensures that only one
write is committed to the MDT at any point in time and thus guarantees
serializability. However, the locking overhead adversely affects write
throughput and will reach its scalability limits as we add more partitions to
the MDT.
| |
| ## Goals |
| |
| - Support indexing one or more partitions in MDT while regular writers and table |
| services (such as cleaning or compaction) are in progress. |
| - Locking to be as lightweight as possible. |
| - Keep required config changes to a minimum to simplify deployment / upgrade in |
| production. |
| - Do not require specific ordering of how writers and table service pipelines |
| need to be upgraded / restarted. |
| - If an external long-running process is being used to initialize the index, the |
| process should be made idempotent so it can handle errors from previous runs. |
| - To re-initialize the index, make it as simple as running the external |
| initialization process again without having to change configs. |
| |
| ## Implementation |
| |
| ### High Level Design |
| |
| #### A new Hudi action: INDEXING |
| |
We introduce a new action, `index`, which denotes the index-building process.
Its mechanics are as follows:
| |
1. From an external process, users can issue a CREATE INDEX statement or run a
   job to trigger indexing for an existing table.
    1. This will schedule the INDEXING action and add
       a `<instant_time>.index.requested` file to the timeline, which contains
       the indexing plan. Index scheduling will also initialize the file groups
       for the partitions for which indexing is planned. The creation of file
       groups will be done within a lock.
    2. From here on, the index-building process will continue to build an
       index up to instant time `t`, where `t` is the latest completed instant
       on the timeline without any "holes", i.e. no pending async operations
       prior to it.
    3. The indexing process will write these out as base files within the
       corresponding metadata partition. A metadata partition cannot be used
       if there is any pending indexing action against it. As and when indexing
       is completed for a partition, the table config (`hoodie.properties`)
       will be updated to indicate that the partition is available for reads or
       synchronous updates. The Hudi table config will be the source of truth
       for the current state of the metadata index.

2. Any inflight writers (i.e. with instant time `t'` > `t`) will check for any
   new indexing request on the timeline before preparing to commit.
    1. Such writers will proceed to additionally add log entries corresponding
       to each such indexing request into the metadata partition (see the
       sketch below).
    2. There is an inherent TOCTOU issue here: an inflight writer may not see
       an indexing request that was just added and proceed to commit without
       it. We correct for this during indexing action completion. In the
       average case this does not happen, and the design has liveness.

3. When the indexing process is about to complete (i.e. indexing up to
   instant `t` is done but the indexing commit is not yet complete), it will
   check all completed commit instants after `t` to ensure each of them added
   entries per its indexing plan, and otherwise simply abort after a
   configurable timeout. Let's call this the **indexing catchup**. So the
   indexer will not only write base files but also ensure that log entries due
   to instants after `t` land in the same file group, i.e. no new file group
   is initialized by writers while indexing is in progress.
    1. The corner case here is that the indexing catchup does not factor in an
       inflight writer just about to commit. But given that indexing takes
       some finite amount of time to go from requested to completion (or we
       can add a configurable artificial delay here, say 60 seconds), an
       inflight writer that is just about to commit concurrently has a very
       high chance of seeing the indexing plan and aborting itself.
| |
We could simply introduce a lock for adding events to the timeline, and these
races would vanish completely, while still providing great scalability and
asynchrony for these processes. The indexer will error out if no lock provider
is configured.
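
To make the writer-side check in step 2 concrete, here is a minimal sketch. The
`Timeline` and `MetadataWriter` interfaces below are hypothetical stand-ins for
the corresponding Hudi components, not actual APIs:

```
// Illustrative sketch only: before committing at instant t' > t, a writer
// looks for pending INDEXING instants and additionally logs updates for them.
import java.util.List;

public class WriterPreCommitCheck {

  interface Timeline {
    // All INDEXING instants currently in requested or inflight state.
    List<String> pendingIndexInstants();
  }

  interface MetadataWriter {
    // Append log entries for this commit into the file group that the
    // indexing plan already initialized for the given index instant.
    void appendLogEntries(String indexInstant, String commitInstant);
  }

  static void beforeCommit(Timeline timeline, MetadataWriter metadataWriter,
                           String commitInstant) {
    for (String indexInstant : timeline.pendingIndexInstants()) {
      metadataWriter.appendLogEntries(indexInstant, commitInstant);
    }
  }
}
```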
| |
| #### Multi-writer scenario |
| |
|  |
| |
Let us walk through a concrete multi-writer scenario to understand the above
indexing mechanism. In this scenario, let instant `t0` be the last completed
instant on the timeline. Suppose a user triggered index building from an
external process at `t3`. This will create the `t3.index.requested` file with
the indexing plan. The plan contains the metadata partitions that need to be
created and the last completed instant, e.g.
| |
| ``` |
| [ |
| {MetadataPartitionType.FILES.partitionPath(), t0}, |
| {MetadataPartitionType.BLOOM_FILTER.partitionPath(), t0}, |
| {MetadataPartitionType.COLUMN_STATS.partitionPath(), t0} |
| ] |
| ``` |
| |
Further, suppose there were two inflight writers, Writer1 and Writer2 (with
inflight instants `t1` and `t2` respectively), while the indexing was requested
or inflight. In this case, the writers will check for a pending index action
and find the pending instant `t3`. Now, if the metadata index creation is
pending, which means the indexer has already initialized a file group, then
each writer will create log files in the same file group for the metadata
index update. This will happen within the existing data table lock.
| |
The indexer runs in a loop until the metadata for data up to `t0`, plus the
data written due to `t1` and `t2`, has been indexed, or the indexing has timed
out. Whether or not indexing timed out, the table config will be updated with
any MDT partition(s) for which indexing was complete through `t2`. In case of
a timeout, the indexer will abort. At this point, the user can trigger the
indexing process again; this time, however, the indexer will check for
available partitions in the table config and skip those partitions (see the
sketch below). This design ensures that the regular writers do not fail due to
indexing.
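
A minimal sketch of that skip logic, assuming a hypothetical view of the
completed partitions recorded in the table config (not the actual Hudi API):

```
// Illustrative sketch only: a re-triggered indexer drops any planned
// partition that hoodie.properties already marks as complete.
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ReTriggeredIndexer {

  static List<String> partitionsToIndex(List<String> plannedPartitions,
                                        Set<String> completedInTableConfig) {
    return plannedPartitions.stream()
        .filter(partition -> !completedInTableConfig.contains(partition))
        .collect(Collectors.toList());
  }
}
```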
| |
| ### Low Level Design |
| |
| #### Schedule Indexing |
| |
Scheduling initializes the file groups for the metadata partitions within a
lock. It does not update any table config.
| |
| ``` |
1 Run pre-scheduling validation (valid index requested, lock provider configured, idempotent checks)
2 Begin transaction
  2.a Get the base instant
  2.b Start initializing file groups for each partition
  2.c Create the index plan and save the indexing.requested instant to the timeline
3 End transaction
| ``` |
| |
If any of the above steps fails, we abort gracefully, i.e. delete the metadata
partition if it was initialized.
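
Here is a sketch of the scheduling flow, with a `ReentrantLock` standing in for
the configured lock provider, and hypothetical `Timeline` and `MetadataTable`
interfaces in place of the actual Hudi classes:

```
// Illustrative sketch only: schedule indexing inside a transaction so that
// all file groups are initialized before the plan becomes visible.
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class ScheduleIndexing {

  private static final ReentrantLock LOCK = new ReentrantLock();

  interface Timeline {
    String lastCompletedInstant();                  // step 2.a
    void saveIndexRequested(String indexInstant,    // step 2.c
                            String baseInstant,
                            List<String> partitions);
  }

  interface MetadataTable {
    void initFileGroups(String partition);          // step 2.b
    void deletePartitionIfExists(String partition); // abort path
  }

  static void schedule(Timeline timeline, MetadataTable mdt,
                       String indexInstant, List<String> partitions) {
    LOCK.lock(); // begin transaction
    try {
      String baseInstant = timeline.lastCompletedInstant();
      for (String partition : partitions) {
        mdt.initFileGroups(partition);
      }
      timeline.saveIndexRequested(indexInstant, baseInstant, partitions);
    } catch (RuntimeException e) {
      // Abort gracefully: clean up any partially initialized partitions.
      partitions.forEach(mdt::deletePartitionIfExists);
      throw e;
    } finally {
      LOCK.unlock(); // end transaction
    }
  }
}
```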
| |
| #### Run Indexing |
| |
| This is a separate executor, which reads the plan and builds the index. |
| |
| ``` |
1 Run pre-indexing checks (lock provider configured, indexing.requested exists, idempotent checks)
2 Read the indexing plan; if any of the requested partitions is inflight or already completed, then error out and return early
3 Transition indexing.requested to inflight
4 Build metadata partitions
  4.a Build the base file in the metadata partition, indexing up to the instant as per the plan
  4.b Update the inflight partitions config in hoodie.properties
5 Determine the catchup start instant based on the write and non-write timelines
6 Start the indexing catchup in a separate thread (that can be interrupted upon timeout)
  6.a For each instant to catch up
    6.a.i   if the instant is completed and has a corresponding deltacommit in the metadata timeline, then continue
    6.a.ii  if the instant is inflight, then reload the active timeline periodically until it completes or times out
    6.a.iii update the metadata table, if needed, within a lock
7 Build the indexing commit metadata with the partition info and the caught-up-to instant
8 Begin transaction
  8.a Update the completed metadata partitions in the table config
  8.b Save the indexing commit metadata to the timeline and transition indexing.inflight to completed
9 End transaction
| ``` |
| |
If any of the above steps fails, we abort gracefully, i.e. delete the metadata
partition if it exists and revert the table config updates.
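
The catchup in step 6 could look roughly like the sketch below, run on a
separate thread so that the indexer can interrupt it on timeout. As before, the
`Timeline` interface is an illustrative stand-in:

```
// Illustrative sketch only: for each instant after the base instant, wait for
// it to complete and verify that it logged a matching update in the MDT.
import java.util.List;

public class IndexingCatchup implements Runnable {

  interface Timeline {
    boolean isCompleted(String instant);
    boolean hasMetadataUpdate(String instant); // matching deltacommit in MDT
    void reload();                             // re-read the active timeline
  }

  private final Timeline timeline;
  private final List<String> instantsToCatchUp;

  IndexingCatchup(Timeline timeline, List<String> instantsToCatchUp) {
    this.timeline = timeline;
    this.instantsToCatchUp = instantsToCatchUp;
  }

  @Override
  public void run() {
    for (String instant : instantsToCatchUp) {
      // Wait for inflight instants to complete; the indexer interrupts this
      // thread once the configured timeout elapses and then aborts.
      while (!timeline.isCompleted(instant)) {
        try {
          Thread.sleep(1000L);
          timeline.reload();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return; // timed out
        }
      }
      if (!timeline.hasMetadataUpdate(instant)) {
        // The writer missed the indexing plan: apply this instant's metadata
        // update here, within a lock (elided in this sketch).
      }
    }
  }
}
```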
| |
| #### Configs |
| |
| ``` |
| # enable metadata |
| hoodie.metadata.enable=true |
| # enable asynchronous metadata indexing |
| hoodie.metadata.index.async=true |
| # enable column stats index |
| hoodie.metadata.index.column.stats.enable=true |
| # set indexing catchup timeout |
| hoodie.metadata.index.check.timeout.seconds=60 |
| # set OCC concurrency mode |
| hoodie.write.concurrency.mode=optimistic_concurrency_control |
| # set lock provider |
| hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider |
| ``` |
| |
| #### Table upgrade/downgrade |
| |
While upgrading from a previous version to the current version, if metadata is
enabled and the `files` partition exists, then the completed metadata
partitions in `hoodie.properties` will be updated to include the `files`
partition. While downgrading to a previous version, if the metadata table
exists, it is deleted, because the metadata table in the current version has a
schema that is not forward compatible.
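
A sketch of that upgrade/downgrade handling, again with hypothetical stand-ins
for the table config and metadata table rather than the actual Hudi classes:

```
// Illustrative sketch only.
public class MetadataUpgradeDowngrade {

  interface TableConfig {
    boolean isMetadataEnabled();
    boolean filesPartitionExists();
    void setCompletedMetadataPartitions(String partitions);
  }

  interface MetadataTable {
    boolean exists();
    void delete();
  }

  static void upgrade(TableConfig config) {
    if (config.isMetadataEnabled() && config.filesPartitionExists()) {
      // Record the pre-existing files index as a completed partition.
      config.setCompletedMetadataPartitions("files");
    }
  }

  static void downgrade(MetadataTable mdt) {
    if (mdt.exists()) {
      // The current schema is not forward compatible with older readers.
      mdt.delete();
    }
  }
}
```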
| |
| ### Error Handling |
| |
| **Case 1: Writer fails while indexer is inflight** |
| |
This means the index update due to the writer did not complete. The indexer
continues to build the index, ignoring the failed instant due to the writer.
The next update by the writer will trigger a rollback of the failed instant,
which will also roll back the incomplete updates in the metadata table.
| |
| **Case 2: Indexer fails while writer is inflight** |
| |
The writer will commit, adding log entries to the metadata partition. However,
the table config will indicate that the partition is not ready for use. When
the indexer is re-triggered, it will check the plan and the table config to
figure out which MDT partitions to index, and start indexing those partitions.
| |
| **Case 3: Race conditions** |
| |
a) The writer went inflight just after an indexing request was added, but the
indexer has not yet started executing.
| |
In this case, the writer will continue to log updates in the metadata
partition. At the time of execution, the indexer will see that there are
already some log files and ensure that the indexing catchup passes.
| |
| b) Inflight writer about to commit, but indexing completed just before that. |
| |
Ideally, the indexing catchup in the indexer should have failed. But this could
happen in the following sequence of events:

1. There is no pending data commit. The indexing check passed, but the indexing
   commit is not yet completed (the table config is yet to be updated).
2. The writer went inflight knowing that the MDT partition is not ready for
   use.
3. The indexing commit completed and the table config was updated.

In this case, the writer will continue to write log files under the latest base
file group in the MDT partition. Even though the indexer missed the updates due
to the writer, there is no "index loss" as such, i.e. the metadata due to the
writer is still updated in the MDT partition. Async compaction on the MDT will
eventually merge the updates into another base file.

Alternatively, we could introduce a lock for adding events to the metadata
timeline.
| |
c) Inflight writer about to commit, but the index is still being scheduled.
| |
| Consider the following scenario: |
| |
1. The writer is inflight.
2. The indexer is starting and creating the file groups. Suppose there are 100
   file groups to be created.
3. The writer just finished and tries to write log blocks, but it sees only a
   subset of the file groups created so far (as step 2 has not completed yet).
   This will cause the writer to incorrectly write updates to fewer shards.
| |
To avoid this, we ensure that scheduling of the metadata index always happens
within a lock. Since the initialization of file groups happens at scheduling
time, the indexer will hold the lock until all the file groups are created.
| |
| **Case 4: Async table services** |
| |
| The metadata partition cannot be used if there is any pending index action |
| against it. So, async compaction/cleaning/clustering will ignore the metadata |
| partition for which indexing is inflight. |
| |
| **Case 5: Data timeline with holes** |
| |
Let's say the data timeline when the indexer is started looks
like: `C1, C2, ..., C5 (inflight), C6, C7, C8`, where `C1` is a commit at
instant `1`. In this case, the latest completed instant without any hole
is `C4`, so the indexer will index up to `C4`. Instants `C5`-`C8` will go
through the indexing catchup. If `C5` does not complete before the timeout,
then the indexer will abort, and it will run through the same process again
when re-triggered.
| |
The above example contained only write commits; however, the indexer considers
non-write commits (such as clean/restore/rollback) as well. Let's take such an
example:
| |
| | DC | DC | DC | CLEAN | DC | DC | COMPACT | DC | INDEXING | DC | |
| | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
| | C | C | C | I | C | C | R | C | R | I | |
| |
Here, DC indicates a deltacommit, the second row is the instant time, and the
last row shows whether the action is completed (C), inflight (I), or
requested (R). In this case, the base instant up to which there are no holes in
the write timeline is `DC6`. The indexer will also check the earliest pending
instant in the non-write timeline before this base instant, which is `CLEAN4`.
While the indexing is done up to the base instant, the remaining instants
(`CLEAN4`, `COMPACT7`, `DC8`) are checked during the indexing catchup to verify
that they logged updates to the corresponding file group as per the index plan.
Note that during catchup, the indexer won't move past an instant unless that
instant actually reaches the completed state. For instance, if `CLEAN4` stays
inflight until the configured timeout, the indexer will abort.
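
For the write timeline, finding the base instant is a simple scan for the first
"hole". A minimal sketch, with a simplified `Instant` type and assuming the
instants are already in timeline order:

```
// Illustrative sketch only: return the last completed instant that has no
// earlier pending instant, or null if the very first instant is pending.
import java.util.List;

public class BaseInstantResolver {

  static class Instant {
    final String time;
    final boolean completed;

    Instant(String time, boolean completed) {
      this.time = time;
      this.completed = completed;
    }
  }

  static Instant baseInstant(List<Instant> timeline) {
    Instant base = null;
    for (Instant instant : timeline) {
      if (!instant.completed) {
        break; // first hole: everything from here on goes through catchup
      }
      base = instant;
    }
    return base;
  }
}
```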
| |
| ## Summary of key proposals |
| |
| - New INDEXING action on data timeline. |
| - Async indexer to handle state change for the new action. |
| - Concept of "indexing catchup" to reconcile instants that went inflight after |
| indexer started. |
| - Table config to be the source of truth for inflight and completed MDT |
| partitions. |
- Indexer will error out if a lock provider is not configured.
| |
| ## Rollout/Adoption Plan |
| |
| - What impact (if any) will there be on existing users? |
| |
There can be two kinds of existing users:

a) Enabling metadata for the first time: There should not be any impact on such
users. When they enable metadata, they can trigger the indexing process.

b) Metadata already enabled: Such users already have a metadata table with at
least one partition. If they trigger the indexing process, the indexer should
take into account the existing metadata and ignore instants up to which the MDT
is in sync with the data table.
| |
| - If we are changing behavior how will we phase out the older behavior? |
| |
The changes will be backward compatible, and if async indexing is disabled,
the existing behavior of MDT creation and updates will be used.
| |
| - If we need special migration tools, describe them here. |
| |
| Not required. |
| |
- When will we remove the existing behavior?

Not required.
| |
| ## Test Plan |
| |
| - Extensive unit tests to cover all scenarios including conflicts and |
| error-handling. |
| - Run a long-running test on EMR cluster with async indexing enabled. |