JIRA: https://issues.apache.org/jira/browse/HUDI-53
HUDI requires an index during updates to locate existing records by their unique record keys. The HUDI Index is a mapping from a record key to the record's file path. Hudi supports several index implementations, such as the Bloom Index, Simple Index, and HBase Index.
We are proposing a new index, called the Record Index, which will save the record-key to file-path mapping within the HUDI Metadata Table. Since the HUDI Metadata Table is internal to a HUDI dataset, the Record Index is updated and queried using the resources already available to the HUDI dataset.
The Bloom and Simple Index are slow for large datasets, as they incur a high cost gathering the index data from various data files at lookup time. Furthermore, these indexes do not save a one-to-one record-key to file-path mapping but deduce the mapping via an optimized search at lookup time. The per-file overhead required by these indexes means that datasets with a large number of files or records will not work well with these indexes.
The HBase Index saves a one-to-one mapping for each record key, so it is very fast and scales with the dataset size. However, the HBase Index requires a separate HBase cluster to be maintained. HBase is operationally difficult to maintain and scale for throughput, and requires dedicated resources and expertise.
The Record Index will provide the speed and scalability of the HBase Index without its limitations and overhead. Since the HUDI Metadata Table is itself a HUDI table, all future improvements to write and query performance will automatically improve the Record Index as well.
The Record Index will save the record-key to file-path mapping in a new partition within the HUDI Metadata Table. The Metadata Table uses HBase's HFile, a sorted and indexed key-value file format, to store and retrieve data; HFile supports fast map-like lookups by key. Since we will be storing a mapping for every single record key, Record Index lookups for a large number of keys become direct key lookups in the HUDI Metadata Table and should benefit greatly from the fast lookups in HFile.
A new partition, record_index, will be added under the metadata table. The existing metadata table payload schema will be extended and shared for this partition as well. The type field will be used to identify record_index payload records. Here is the schema for the record_index payload record:
{
  "name": "recordIndexMetadata",
  "doc": "Metadata Index that contains information about record keys and their location in the dataset",
  "type": [
    "null",
    {
      "type": "record",
      "name": "HoodieRecordIndexInfo",
      "fields": [
        {
          "name": "partition",
          "type": "string",
          "doc": "Partition which contains the record",
          "avro.java.string": "String"
        },
        {
          "name": "fileIdHighBits",
          "type": "long",
          "doc": "fileId which contains the record (high 64 bits)"
        },
        {
          "name": "fileIdLowBits",
          "type": "long",
          "doc": "fileId which contains the record (low 64 bits)"
        },
        {
          "name": "fileIndex",
          "type": "int",
          "doc": "index of the file"
        },
        {
          "name": "instantTime",
          "type": "long",
          "doc": "Epoch time in millisecond at which record was added"
        }
      ]
    }
  ],
  "default": null
}
The key for the record index record will be the actual record key. The partition name is saved as a string. HUDI base file names follow a format which includes a UUID fileID, an integer file index, a write token, and a timestamp. The record index payload only saves the fileID and file index information. The fileID is split into its UUID and integer file index parts: the UUID is encoded into two longs and the file index is saved as an integer. The commit timestamp is encoded as epoch time in milliseconds.
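To make the encoding concrete, here is a minimal Java sketch, assuming a fileID of the form <uuid>-<fileIndex>; the class name, example values, and parsing below are illustrative only, not Hudi's actual payload construction code.

import java.util.UUID;

public class RecordIndexEncodingSketch {
  public static void main(String[] args) {
    // Illustrative fileId of the form "<uuid>-<fileIndex>".
    String fileId = "0d1f2c3a-4b5e-6f70-8899-aabbccddeeff-0";

    // Split the fileId into its UUID part and the trailing integer file index.
    int lastDash = fileId.lastIndexOf('-');
    UUID uuid = UUID.fromString(fileId.substring(0, lastDash));
    int fileIndex = Integer.parseInt(fileId.substring(lastDash + 1));

    long fileIdHighBits = uuid.getMostSignificantBits(); // stored in the payload as a long
    long fileIdLowBits = uuid.getLeastSignificantBits(); // stored in the payload as a long
    long instantTime = System.currentTimeMillis();       // commit time as epoch milliseconds (illustrative)

    System.out.printf("high=%d low=%d fileIndex=%d instantTime=%d%n",
        fileIdHighBits, fileIdLowBits, fileIndex, instantTime);
  }
}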
This schema format is chosen to minimize the data size of each mapping to ensure the smallest possible size of the record index even for datasets with billions of records.
Experiments have shown that with random UUID record keys and datestr partitions (YYYY/MM/DD), we can achieve an average size of 50 to 55 bytes per mapping saved in the record index. The size may be even lower for keys that compress better.
The picture below gives a pictorial representation of the record index partition in the metadata table.
Like any other HUDI Metadata Table index, the record index can be initialized inline (before the writer writes records to the dataset) or via the Async Indexer.
The initialization involves the following steps:
1. Get the list of all files in the dataset. Since the files partition is a pre-requisite for all other partitions in the Metadata Table, the list of all files can be taken from the Metadata Table itself and does not involve listing the entire dataset.
2. Read the record keys from all the files in the dataset.
3. Create the record index partition and write the record-key to location mappings into its fileGroups.

We will add functionality to automatically estimate the number of fileGroups to use for the record index partition based on the number of records in the dataset (available after Step 2 above). This should simplify rollout, as the user does not have to worry about choosing the number of fileGroups for optimal performance. Configs will also allow specifying the number of fileGroups explicitly. A sketch of such an estimate follows.
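Below is a minimal sketch of how the fileGroup count could be estimated, assuming a configurable target size per fileGroup and the 50 to 55 bytes-per-mapping figure from the experiments above; the class name, heuristic, and default values are assumptions for illustration, not the actual implementation.

public class RecordIndexFileGroupEstimator {
  // Estimate the number of fileGroups so that each fileGroup stays under a target size.
  static int estimateFileGroups(long totalRecords, long avgBytesPerMapping, long maxBytesPerFileGroup) {
    long totalBytes = totalRecords * avgBytesPerMapping;
    // Ceiling division, with a minimum of one fileGroup.
    return (int) Math.max(1, (totalBytes + maxBytesPerFileGroup - 1) / maxBytesPerFileGroup);
  }

  public static void main(String[] args) {
    // 1 billion records at ~55 bytes per mapping with a 512 MB target per fileGroup -> ~103 fileGroups.
    System.out.println(estimateFileGroups(1_000_000_000L, 55L, 512L * 1024 * 1024));
  }
}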
For the incoming upsert records, given their keys, we need to tag their current location. The key lookup requires the following steps:
1. Map each record key to one of the fileGroups of the record index partition.
2. Look up the keys mapped to each fileGroup from that fileGroup's base HFile and log files.
3. Build the record location (partition, fileID and file index, instant time) from the returned payloads and tag the incoming records.
Given N fileGroups in the record index, an indexing lookup of M keys is reduced to N lookups of M/N keys in parallel. Hence, for fastest lookup operation, the number of executors for the writer process should be >= N.
This also means that lookups from the record index can be scaled with growing data size by increasing the number of fileGroups (N) in the record index partition and increasing the number of executors available for the lookup. The sketch below illustrates how keys could be spread across fileGroups.
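The following is a minimal, hypothetical sketch of distributing a batch of incoming keys across the N fileGroups for parallel lookup; the hash-based fileGroupIndex helper is an assumption for illustration, as Hudi's actual key-to-fileGroup mapping may differ.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecordIndexLookupSketch {
  // Hypothetical key-to-fileGroup mapping; Hudi uses its own hashing scheme.
  static int fileGroupIndex(String recordKey, int numFileGroups) {
    return Math.floorMod(recordKey.hashCode(), numFileGroups);
  }

  public static void main(String[] args) {
    int numFileGroups = 4; // N
    List<String> keys = Arrays.asList("key-001", "key-002", "key-003", "key-004", "key-005");

    // Group the M incoming keys by target fileGroup; each group becomes one parallel HFile lookup.
    Map<Integer, List<String>> keysPerFileGroup = new HashMap<>();
    for (String key : keys) {
      keysPerFileGroup
          .computeIfAbsent(fileGroupIndex(key, numFileGroups), i -> new ArrayList<>())
          .add(key);
    }
    keysPerFileGroup.forEach((fg, fgKeys) -> System.out.println("fileGroup " + fg + " -> " + fgKeys));
  }
}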
HDFS-based experiments have shown that, on average, key lookups from HFiles in the HUDI Metadata Table complete in 1-2 msec. So for a lookup of M keys we expect a ballpark time of K + (M / N) * 2 msec, where K is the overhead of opening the HFile (~100 msec) and merging the log files. Periodic compaction of the Metadata Table keeps the value of K low.
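As a rough, illustrative application of this formula (the numbers here are assumptions, not measurements): looking up M = 1,000,000 keys against N = 100 fileGroups with at least 100 executors gives roughly 100 msec + (1,000,000 / 100) * 2 msec, or about 20 seconds of lookup time per fileGroup, with all fileGroups processed in parallel.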
Let's walk through the writer flow to update the record index.
Whenever a new commit is applied to the metadata table, we do the following:
- Ensure that WriteStatus tracks all written record keys for every commit.
- When a new batch of writes is ingested into Hudi, tag the records with their original file group location. Refer to the Metadata Index lookup section for more details.
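Purely as an illustration of the bookkeeping described above, here is a minimal sketch in which a hypothetical WrittenRecord type stands in for the per-record information tracked via WriteStatus; it is not Hudi's actual WriteStatus API or metadata-table writer code.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecordIndexCommitSketch {
  // Hypothetical stand-in for the per-record information the write path would need to track.
  static class WrittenRecord {
    final String recordKey;
    final String partitionPath;
    final String fileId;
    final long commitEpochMs;

    WrittenRecord(String recordKey, String partitionPath, String fileId, long commitEpochMs) {
      this.recordKey = recordKey;
      this.partitionPath = partitionPath;
      this.fileId = fileId;
      this.commitEpochMs = commitEpochMs;
    }
  }

  public static void main(String[] args) {
    // Records written by one commit (illustrative values).
    List<WrittenRecord> writtenInCommit = Arrays.asList(
        new WrittenRecord("key-001", "2023/04/01", "0d1f2c3a-4b5e-6f70-8899-aabbccddeeff-0", 1680307200000L),
        new WrittenRecord("key-002", "2023/04/01", "0d1f2c3a-4b5e-6f70-8899-aabbccddeeff-0", 1680307200000L));

    // Each written record becomes one record-index upsert, keyed by the record key itself.
    Map<String, String> recordIndexUpserts = new HashMap<>();
    for (WrittenRecord w : writtenInCommit) {
      recordIndexUpserts.put(w.recordKey, w.partitionPath + "/" + w.fileId + "@" + w.commitEpochMs);
    }
    recordIndexUpserts.forEach((k, v) -> System.out.println(k + " -> " + v));
  }
}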