title: "FileFormat"
weight: 7
type: docs
aliases:
  - /concepts/spec/fileformat.html

File Format

Paimon currently supports the Parquet, Avro, ORC, CSV, JSON, and Lance file formats.

  • The recommended columnar format is Parquet, which offers a high compression rate and fast column projection queries.
  • The recommended row-based format is Avro, which performs well when reading and writing full rows (all columns).
  • The recommended format for testing is CSV, which is easier to read but has the worst read-write performance.
  • The recommended format for ML workloads is Lance, which is optimized for vector search and machine learning use cases.

PARQUET

Parquet is the default file format for Paimon.

The following table lists the type mapping from Paimon type to Parquet type.

Limitations:

  1. Parquet does not support nullable map keys.
  2. A TIMESTAMP type with precision 9 is written as Parquet INT96. This INT96 value is time zone converted, so it requires an additional adjustment.
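
To make the INT96 limitation more concrete, the sketch below decodes a raw 12-byte Parquet INT96 timestamp into nanoseconds since the Unix epoch. It relies only on the standard INT96 layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little endian). The function name `int96_to_epoch_nanos` is hypothetical, and the sketch deliberately does not apply the time zone adjustment mentioned above, because this section does not describe that adjustment.

```python
import struct

JULIAN_DAY_OF_EPOCH = 2_440_588            # Julian day number of 1970-01-01
NANOS_PER_DAY = 86_400 * 1_000_000_000


def int96_to_epoch_nanos(raw: bytes) -> int:
    """Decode a 12-byte INT96 timestamp into nanoseconds since the Unix epoch.

    Illustrative helper only: no time zone adjustment is applied here.
    """
    if len(raw) != 12:
        raise ValueError("an INT96 value is exactly 12 bytes")
    # "<qi": little endian, 8-byte nanoseconds of day, then 4-byte Julian day.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanos_of_day
```

A reader that needs a correct instant would still have to apply the adjustment on top of the decoded value.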

AVRO

The following table lists the type mapping from Paimon type to Avro type.

Note:

In addition to the types listed above, Paimon maps nullable types to Avro union(something, null), where something is the Avro type converted from the Paimon type.

Refer to the Avro Specification for more information about Avro types.
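
As an illustration of the nullable mapping in the note above, the following sketch builds an Avro record schema as plain JSON, wrapping one field in a union with null. The helper `to_nullable_avro`, the record name `Row`, and the field names are hypothetical. The branch order simply follows the wording union(something, null); the order in schemas that Paimon actually writes may differ.

```python
import json


def to_nullable_avro(avro_type):
    """Wrap an Avro type in a two-branch union with null.

    Branch order mirrors the note's wording, union(something, null);
    this is an assumption, not a statement about Paimon's writer.
    """
    return [avro_type, "null"]


# Hypothetical record with one required and one nullable field.
schema = {
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "long"},                          # non-nullable
        {"name": "comment", "type": to_nullable_avro("string")}, # nullable
    ],
}

schema_json = json.dumps(schema, indent=2)
```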

ORC

The following table lists the type mapping from Paimon type to ORC type.

Limitations:

  1. ORC has a time zone bias when mapping the TIMESTAMP_LOCAL_ZONE type: it saves the millisecond value corresponding to the UTC literal time. For compatibility reasons, this behavior cannot be changed.

CSV

Experimental feature, not recommended for production.

Format Options:

The Paimon CSV format uses the Jackson databind API to parse and generate CSV strings.

The following table lists the type mapping from Paimon type to CSV type.

TEXT

Experimental feature, not recommended for production.

Format Options:

A Paimon text table contains a single field of string type.

JSON

Experimental feature, not recommended for production.

Format Options:

The Paimon JSON format uses the Jackson databind API to parse and generate JSON strings.

The following table lists the type mapping from Paimon type to JSON type.

LANCE

Lance is a modern columnar data format optimized for machine learning and vector search workloads. It provides high-performance read and write operations with native support for Apache Arrow.

The following table lists the type mapping from Paimon type to Lance (Arrow) type.

Limitations:

  1. Lance file format does not support MAP type.
  2. Lance file format does not support TIMESTAMP_LOCAL_ZONE type.

BLOB

The BLOB format is a specialized format for storing large binary objects such as images, videos, and other multimodal data. Unlike other formats that store data inline, BLOB format stores large binary data in separate files with an optimized layout for random access.

BLOB files use the .blob extension and have the following structure:

+------------------+
| Blob Entry 1     |
|   Magic Number   |  4 bytes (1481511375, Little Endian)
|   Blob Data      |  Variable length
|   Length         |  8 bytes (Little Endian)
|   CRC32          |  4 bytes (Little Endian)
+------------------+
| Blob Entry 2     |
|   ...            |
+------------------+
| Index            |  Variable (Delta-Varint compressed)
+------------------+
| Index Length     |  4 bytes (Little Endian)
| Version          |  1 byte
+------------------+
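
The layout above can be explored with a short sketch that checks an entry's magic number and splits off the trailing footer. It uses only what the diagram states: a 4-byte little-endian magic number at the start of each entry, and a footer made of a 4-byte little-endian index length followed by a 1-byte version. The function names are hypothetical, treating the index length as unsigned is an assumption, and the index bytes are returned undecoded because their exact encoding is not specified here.

```python
import struct

BLOB_MAGIC = 1481511375   # magic number taken from the layout diagram
FOOTER_SIZE = 4 + 1       # index length (4 bytes) + version (1 byte)


def starts_with_blob_magic(entry: bytes) -> bool:
    """Return True if the bytes begin with the little-endian magic number."""
    if len(entry) < 4:
        return False
    (magic,) = struct.unpack("<I", entry[:4])
    return magic == BLOB_MAGIC


def split_footer(file_bytes: bytes):
    """Return (raw_index_bytes, version) read from the end of a .blob file.

    Assumption: the index length is an unsigned count of the bytes that
    sit immediately before the 5-byte footer.
    """
    if len(file_bytes) < FOOTER_SIZE:
        raise ValueError("file is too short to contain a footer")
    index_length, version = struct.unpack("<IB", file_bytes[-FOOTER_SIZE:])
    index_end = len(file_bytes) - FOOTER_SIZE
    index_start = index_end - index_length
    if index_start < 0:
        raise ValueError("index length exceeds file size")
    return file_bytes[index_start:index_end], version
```

Reading the footer first is what makes the index usable for random access: a reader can locate the index from the end of the file without scanning the entries.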

Key features:

  • CRC32 Checksums: Each blob entry has a CRC32 checksum for data integrity verification
  • Indexed Access: The index at the end enables efficient random access to any blob in the file
  • Delta-Varint Compression: The index uses delta-varint compression for space efficiency
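
For readers unfamiliar with delta-varint compression, here is a minimal, generic sketch of the technique: store each value as the difference from the previous one, then write each difference as a variable-length integer with 7 payload bits per byte. It is not Paimon's wire format. The function names, the unsigned varint flavor, and the restriction to non-decreasing values are all assumptions made for illustration.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, high bit = more bytes."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)          # last byte of this integer
            return bytes(out)


def delta_varint_encode(values) -> bytes:
    """Encode a non-decreasing sequence as varint-encoded deltas."""
    out = bytearray()
    previous = 0
    for value in values:
        out += encode_varint(value - previous)
        previous = value
    return bytes(out)


def delta_varint_decode(data: bytes) -> list:
    """Invert delta_varint_encode."""
    values, previous, current, shift = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                # integer continues in the next byte
        else:
            previous += current       # undo the delta step
            values.append(previous)
            current, shift = 0, 0
    return values


offsets = [0, 1000, 1_000_000, 1_000_500]   # hypothetical blob offsets
encoded = delta_varint_encode(offsets)
```

Because neighboring offsets in a file tend to be close together, the deltas are small and most of them fit in one to three bytes, which is where the space saving comes from.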

Limitations:

  1. BLOB format only supports a single BLOB type column per file.
  2. BLOB format does not support predicate pushdown.
  3. Statistics collection is not supported for BLOB columns.

For usage details, configuration options, and examples, see [Blob Type]({{< ref "append-table/blob" >}}).