Designing file formats

Use little JVM heap

Lucene generally prefers to avoid loading gigabytes of data into the JVM heap. Could this data be stored in a file and accessed through an org.apache.lucene.store.RandomAccessInput instead?
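
As a rough illustration of the idea, the sketch below keeps a column of fixed-width values in a file and reads a single value by absolute position instead of materializing a long[] on the heap. It uses only JDK classes (a memory-mapped FileChannel) as a stand-in for RandomAccessInput; the class and method names (OffHeapLongs, write, open, get) are hypothetical and the layout is an assumption made for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class OffHeapLongs {
  /** Writes the values as fixed-width little-endian longs (assumed layout for this sketch). */
  static void write(Path path, long[] values) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (long v : values) {
      buf.putLong(v);
    }
    Files.write(path, buf.array());
  }

  /** Maps the file read-only; the bytes live in the page cache, not in the JVM heap. */
  static ByteBuffer open(Path path) throws IOException {
    try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
      // A mapping stays valid after the channel that created it is closed.
      return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).order(ByteOrder.LITTLE_ENDIAN);
    }
  }

  /** Random access: computes the byte position of the value and reads only those 8 bytes. */
  static long get(ByteBuffer mapped, int index) {
    return mapped.getLong(index * Long.BYTES);
  }
}
```

Because every value has the same width, the position of a value is a simple multiplication, which is what makes random access cheap. A single JDK mapping is limited to 2GB, so a real implementation would have to split larger files into several mappings.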

Avoid options

One of the hardest problems with file formats is maintaining backward compatibility. Avoid giving options to the user, and instead let the file format make decisions based on the information it has. If an expert user wants to optimize for a specific case, they can write a custom codec and maintain it on their own.

How to split the data into files?

Most file formats split the data into 3 files:

  • metadata,
  • index data,
  • raw data.

The metadata file contains all the data that is read once at open time. This helps on several fronts:

  • One can validate the checksums of this data at open time without significant overhead since all data needs to be read anyway. This helps detect corruption early.
  • There is no need to perform expensive seeks into the index/raw data files at open time: slices into these files can be created from offsets that have been written into the metadata file.
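
The second point can be sketched with plain JDK buffers. In this hypothetical layout, the metadata holds a count followed by (offset, length) pairs, and opening consists of reading the metadata sequentially and creating one slice of the data per pair, without reading the data itself. MetaSlicesSketch and its methods are made-up names, and the in-memory ByteBuffers stand in for real files.

```java
import java.nio.ByteBuffer;

final class MetaSlicesSketch {
  /** Serializes a count followed by (offset, length) pairs, as a metadata file might. */
  static ByteBuffer writeMeta(long[] offsets, long[] lengths) {
    ByteBuffer meta = ByteBuffer.allocate(Integer.BYTES + offsets.length * 2 * Long.BYTES);
    meta.putInt(offsets.length);
    for (int i = 0; i < offsets.length; i++) {
      meta.putLong(offsets[i]);
      meta.putLong(lengths[i]);
    }
    meta.flip();
    return meta;
  }

  /** Reads the metadata once, front to back, and slices the data without scanning it. */
  static ByteBuffer[] openSlices(ByteBuffer meta, ByteBuffer data) {
    int count = meta.getInt();
    ByteBuffer[] slices = new ByteBuffer[count];
    for (int i = 0; i < count; i++) {
      long offset = meta.getLong();
      long length = meta.getLong();
      // Cheap consistency check: a slice must fit inside the data.
      if (offset < 0 || length < 0 || offset + length > data.capacity()) {
        throw new IllegalStateException("corrupt metadata: slice " + i + " is out of bounds");
      }
      ByteBuffer view = data.duplicate();
      view.position((int) offset);
      view.limit((int) (offset + length));
      slices[i] = view.slice();
    }
    return slices;
  }
}
```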

The index file contains data structures that help search the raw data. For KD trees, this would be the inner nodes; for doc values, the jump tables; for KNN vectors, the HNSW graph structure; for terms, the FST that stores term prefixes; etc. Keeping it in a separate file from the data file lets users do things like MMapDirectory#setPreload(boolean) on these files, which are generally rather small and accessed randomly. A separate file is also convenient at times because index and raw data can then be written on the fly without buffering all index data in memory.

The raw file contains the data that needs to be retrieved.

Some file formats are simpler, e.g. the compound file format's index is so small that it can be loaded fully into memory at open time. So it becomes read-once and can be stored in the same file as metadata.

Some file formats are more complex, e.g. postings have multiple types of data (docs, freqs, positions, offsets, payloads) that are optionally retrieved, so they use multiple data files in order not to have to read lots of useless data.

Don't use too many files

The maximum number of file descriptors is usually not infinite. It's ok to use multiple files per segment as described above, but this number should always be small. For instance, it would be bad practice to use a different file per field.

Add codec headers and footers to all files

Use CodecUtil to add headers and footers to all files of the index. This helps make sure that we are opening the right file, and helps differentiate Lucene bugs from file corruption.
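
To show what a header and footer buy, here is a minimal JDK-only sketch of the general technique. It is not CodecUtil's API or byte layout: the magic numbers, the field order, and the names FramedFileSketch, frame and check are all assumptions made for this example. The header identifies the format and version, and the footer carries a CRC-32 of everything that precedes it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class FramedFileSketch {
  // Hypothetical magic numbers, for this sketch only.
  static final int HEADER_MAGIC = 0xC0DEC0DE;
  static final int FOOTER_MAGIC = ~HEADER_MAGIC;

  /** Layout: header magic, format name, version, payload, footer magic, CRC-32 as a long. */
  static byte[] frame(String format, int version, byte[] payload) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(HEADER_MAGIC);
    out.writeUTF(format);
    out.writeInt(version);
    out.write(payload);
    out.writeInt(FOOTER_MAGIC);
    out.flush();
    CRC32 crc = new CRC32();
    crc.update(bytes.toByteArray()); // the checksum covers everything written so far
    out.writeLong(crc.getValue());
    out.flush();
    return bytes.toByteArray();
  }

  /** Returns the version found in the header, or throws with a message that says what is wrong. */
  static int check(byte[] file, String expectedFormat, int minVersion, int maxVersion)
      throws IOException {
    if (file.length < Integer.BYTES + Long.BYTES) {
      throw new IOException("file too short to contain a footer: " + file.length + " bytes");
    }
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
    if (in.readInt() != HEADER_MAGIC) {
      throw new IOException("bad header magic: this is not a file of the expected kind");
    }
    String format = in.readUTF();
    if (!format.equals(expectedFormat)) {
      throw new IOException("format mismatch: expected " + expectedFormat + " but got " + format);
    }
    int version = in.readInt();
    if (version < minVersion || version > maxVersion) {
      throw new IOException("unsupported version: " + version);
    }
    ByteBuffer buf = ByteBuffer.wrap(file); // big-endian, like DataOutputStream
    if (buf.getInt(file.length - Long.BYTES - Integer.BYTES) != FOOTER_MAGIC) {
      throw new IOException("bad footer magic: file is corrupt or truncated");
    }
    CRC32 crc = new CRC32();
    crc.update(file, 0, file.length - Long.BYTES);
    if (crc.getValue() != buf.getLong(file.length - Long.BYTES)) {
      throw new IOException("checksum mismatch: file is corrupt");
    }
    return version;
  }
}
```

With this framing, opening the wrong file fails on the header, while a flipped bit fails on the checksum. That is the distinction between a bug and a corruption that the paragraph above refers to.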

Validate checksums of the metadata file when opening a segment

If data has been organized in such a way that the metadata file only contains read-once data, then verifying checksums is very cheap and helps detect corruption early. It also lets us give users a meaningful error message that tells them their index is corrupt, rather than a confusing exception saying that Lucene tried to read data beyond the end of the file.

Validate structures of other files when opening a segment

One of the most frequent cases of index corruption that we have observed over the years is file truncation. Verifying that index files have the expected codec header and a correct structure for the codec footer when opening a segment helps detect a significant spectrum of cases of corruption.
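
A structural footer check only needs the last few bytes of a file, so it stays cheap even for large data files. The sketch below assumes a hypothetical 16-byte footer (a magic number, a checksum algorithm id that is always 0, and a CRC-32 stored as a long) and reads it with a positional read. FooterStructureCheck and its constants are illustrative names, not Lucene's.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FooterStructureCheck {
  // Hypothetical footer for this sketch: [int magic][int algorithm id = 0][long CRC-32].
  static final int FOOTER_MAGIC = 0x5EC0F00D;
  static final int FOOTER_LENGTH = 2 * Integer.BYTES + Long.BYTES;

  /** Checks that the file ends with a well-formed footer, without reading the rest of it. */
  static void checkFooterStructure(Path path) throws IOException {
    try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
      long size = ch.size();
      if (size < FOOTER_LENGTH) {
        throw new IOException("file too short to contain a footer (" + size + " bytes): " + path);
      }
      ByteBuffer footer = ByteBuffer.allocate(FOOTER_LENGTH);
      long start = size - FOOTER_LENGTH;
      while (footer.hasRemaining()) {
        // Positional read: does not move the channel's own position.
        if (ch.read(footer, start + footer.position()) < 0) {
          throw new IOException("unexpected end of file: " + path);
        }
      }
      footer.flip();
      if (footer.getInt() != FOOTER_MAGIC) {
        throw new IOException("bad footer magic, file may be truncated: " + path);
      }
      if (footer.getInt() != 0) {
        throw new IOException("unknown checksum algorithm: " + path);
      }
      // A CRC-32 fits in 32 bits, so the upper half of the long must be zero.
      if ((footer.getLong() & 0xFFFFFFFF00000000L) != 0) {
        throw new IOException("illegal checksum value: " + path);
      }
    }
  }
}
```

This check does not verify the checksum itself, only that a footer of the expected shape sits where the file ends. After a truncation, the last bytes of the file are very unlikely to look like a footer.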

Do as many consistency checks as reasonable

It is common for some data to be redundant, e.g. data from the metadata file might be redundant with information from FieldInfos, or all files from the same file format should have the same version in their codec header. Checking that these redundant pieces of information are consistent is always a good idea, as it would make cases of corruption much easier to debug.

Make sure to not leak files

Be paranoid about where exceptions might be thrown and make sure that files get closed on all paths. For example, if opening the data file fails while the index file is already open, make sure that the index file gets closed as well. Lucene has tests that randomly throw exceptions when interacting with the Directory in order to detect such bugs, but it might take many runs before randomization hits the exact case that triggers a bug.
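
The index/data scenario above could be handled along the following lines. This is a sketch with plain java.io.Closeable handles; TwoFileReader and its Opener interface are hypothetical stand-ins for a format reader and a Directory.

```java
import java.io.Closeable;
import java.io.IOException;

final class TwoFileReader implements Closeable {
  /** Hypothetical stand-in for whatever opens a file by name. */
  interface Opener {
    Closeable open(String name) throws IOException;
  }

  private final Closeable index;
  private final Closeable data;

  TwoFileReader(Opener opener) throws IOException {
    Closeable idx = null;
    Closeable dat = null;
    try {
      idx = opener.open("index");
      dat = opener.open("data"); // if this throws, idx is already open
    } catch (Throwable t) {
      // Close whatever was opened, keeping the original exception as the primary one.
      closeSuppressing(t, idx, dat);
      throw t;
    }
    this.index = idx;
    this.data = dat;
  }

  private static void closeSuppressing(Throwable primary, Closeable... resources) {
    for (Closeable resource : resources) {
      if (resource == null) {
        continue;
      }
      try {
        resource.close();
      } catch (Throwable closeFailure) {
        primary.addSuppressed(closeFailure);
      }
    }
  }

  @Override
  public void close() throws IOException {
    try {
      index.close();
    } finally {
      data.close(); // runs even if closing the index file failed
    }
  }
}
```

The same care applies to close(): a failure to close one file must not prevent the other from being closed.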

Verify checksums upon merges

Merges need to read most if not all input data anyway, so make sure to verify checksums before starting a merge by calling checkIntegrity() on the file format reader in order to make sure that file corruptions don't get propagated by merges. All default implementations do this.

How to make backward-compatible changes to file formats?

See here.