Logical types are used to extend the types that parquet can be used to store, by specifying how the primitive types should be interpreted. This keeps the set of primitive types to a minimum and reuses parquet's efficient encodings. For example, strings are stored as byte arrays (binary) with a UTF8 annotation.
This file contains the specification for all logical types.
The parquet format's ConvertedType stores the type annotation. The annotation may require additional metadata fields, as well as rules for those fields.
UTF8 may only be used to annotate the binary primitive type and indicates that the byte array should be interpreted as a UTF-8 encoded character string.
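As a rough illustration of the UTF8 annotation, the sketch below converts between a Python string and the bytes that would be stored in the binary primitive. The helper names to_utf8_binary and from_utf8_binary are hypothetical and not part of any parquet library.

```python
def to_utf8_binary(text: str) -> bytes:
    """Encode a string as the byte array stored under a UTF8 annotation."""
    return text.encode("utf-8")


def from_utf8_binary(raw: bytes) -> str:
    """Interpret a stored byte array as a UTF-8 character string.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    return raw.decode("utf-8")
```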
INT_8, INT_16, INT_32, and INT_64 annotations can be used to specify the maximum number of bits in the stored value. Implementations may use these annotations to produce smaller in-memory representations when reading data.
If a stored value is larger than the maximum allowed by the annotation, the behavior is not defined and can be determined by the implementation. Implementations must not write values that are larger than the annotation allows.
INT_8, INT_16, and INT_32 must annotate an int32 primitive type and INT_64 must annotate an int64 primitive type. INT_32 and INT_64 are implied by the int32 and int64 primitive types if no other annotation is present and should be considered optional.
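A writer could enforce the rule that values must fit the annotation with a range check such as the sketch below. It assumes the usual two's complement range for each bit width, which the text above implies but does not spell out. The names SIGNED_BITS, signed_range, and check_writable are hypothetical.

```python
# Bit widths of the signed integer annotations described above.
SIGNED_BITS = {"INT_8": 8, "INT_16": 16, "INT_32": 32, "INT_64": 64}


def signed_range(annotation: str) -> tuple[int, int]:
    """Return (min, max) for an annotation, assuming two's complement."""
    bits = SIGNED_BITS[annotation]
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def check_writable(annotation: str, value: int) -> int:
    """Return value unchanged, or raise if it does not fit the annotation."""
    lo, hi = signed_range(annotation)
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {annotation}")
    return value
```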
UINT_8, UINT_16, UINT_32, and UINT_64 annotations can be used to specify unsigned integer types, along with a maximum number of bits in the stored value. Implementations may use these annotations to produce smaller in-memory representations when reading data.
If a stored value is larger than the maximum allowed by the annotation, the behavior is not defined and can be determined by the implementation. Implementations must not write values that are larger than the annotation allows.
UINT_8, UINT_16, and UINT_32 must annotate an int32 primitive type and UINT_64 must annotate an int64 primitive type.
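Because UINT_32 annotates a signed int32 primitive, values above 2^31 - 1 need some mapping onto the signed range. The text above does not state the mapping; the sketch below assumes the common approach of reinterpreting the same 32 bits. The function names are hypothetical.

```python
import struct


def uint32_to_int32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as the int32 that would be stored.

    Assumption: the bit pattern is preserved (not stated in the text above).
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError(f"{value} does not fit in UINT_32")
    return struct.unpack("<i", struct.pack("<I", value))[0]


def int32_to_uint32(stored: int) -> int:
    """Recover the unsigned value from a stored int32 by masking to 32 bits."""
    return stored & 0xFFFFFFFF
```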
DECIMAL annotation represents arbitrary-precision signed decimal numbers of the form unscaledValue * 10^(-scale).
The primitive type stores an unscaled integer value. For byte arrays, binary and fixed, the unscaled number must be encoded as two's complement using big-endian byte order (the most significant byte is the zeroth element). The scale stores the number of digits of that value that are to the right of the decimal point, and the precision stores the maximum number of digits supported in the unscaled value.
If not specified, the scale is 0. Scale must be zero or a positive integer less than the precision. Precision is required and must be a non-zero positive integer. A precision too large for the underlying type (see below) is an error.
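The byte-array encoding described above can be sketched as follows: the unscaled integer is written as big-endian two's complement, and the decimal value is recovered by applying the scale. The helper names are hypothetical, and choosing the shortest possible byte length is an illustrative choice that suits the binary primitive; a fixed_len_byte_array would use its declared length instead.

```python
from decimal import Decimal


def encode_unscaled(unscaled: int) -> bytes:
    """Encode an unscaled value as minimal big-endian two's complement."""
    # Magnitude bits plus one sign bit, rounded up to whole bytes.
    magnitude = unscaled if unscaled >= 0 else ~unscaled
    length = max(1, (magnitude.bit_length() + 1 + 7) // 8)
    return unscaled.to_bytes(length, byteorder="big", signed=True)


def decode_unscaled(raw: bytes) -> int:
    """Decode big-endian two's complement bytes into the unscaled integer."""
    return int.from_bytes(raw, byteorder="big", signed=True)


def to_decimal(unscaled: int, scale: int = 0) -> Decimal:
    """Compute unscaledValue * 10^(-scale)."""
    return Decimal(unscaled).scaleb(-scale)
```

For example, the value 123.45 with scale 2 has the unscaled value 12345, which encodes to the two bytes 0x30 0x39.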
DECIMAL can be used to annotate the following types:
- int32: for 1 <= precision <= 9
- int64: for 1 <= precision <= 18; precision <= 10 will produce a warning
- fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
- binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
A SchemaElement with the DECIMAL ConvertedType must also have both scale and precision fields set, even if scale is 0 by default.
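The precision limit for a fixed_len_byte_array of length n can be computed with exact integer arithmetic, as in the sketch below. It relies on the fact that floor(log_10(x)) equals the number of decimal digits of x minus one. The function name max_precision is hypothetical.

```python
def max_precision(n: int) -> int:
    """Return floor(log_10(2^(8*n - 1) - 1)) for an n-byte array (n >= 1)."""
    if n < 1:
        raise ValueError("array length must be at least 1")
    largest = (1 << (8 * n - 1)) - 1  # largest signed value in n bytes
    return len(str(largest)) - 1      # digit count minus one


# Lengths 4 and 8 reproduce the int32 and int64 limits listed above.
limits = {n: max_precision(n) for n in (1, 4, 8, 16)}
```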
DATE is used for a logical date type, without a time of day. It must annotate an int32 that stores the number of days from the Unix epoch, 1 January 1970.
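A minimal sketch of the DATE conversion, using hypothetical helper names, might look like this. Dates before the epoch produce negative day counts under this approach.

```python
from datetime import date, timedelta

EPOCH_DATE = date(1970, 1, 1)  # the Unix epoch


def date_to_days(d: date) -> int:
    """Number of days from the Unix epoch to d."""
    return (d - EPOCH_DATE).days


def days_to_date(days: int) -> date:
    """Inverse of date_to_days."""
    return EPOCH_DATE + timedelta(days=days)
```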
TIME_MILLIS is used for a logical time type, without a date. It must annotate an int32 that stores the number of milliseconds after midnight.
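The TIME_MILLIS value can be derived from a time of day as sketched below. The helper name is hypothetical, and truncating sub-millisecond precision is an assumption made for illustration.

```python
from datetime import time


def time_to_millis(t: time) -> int:
    """Milliseconds after midnight; microseconds are truncated to millis."""
    seconds = (t.hour * 60 + t.minute) * 60 + t.second
    return seconds * 1000 + t.microsecond // 1000
```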
TIMESTAMP_MILLIS is used for a combined logical date and time type. It must annotate an int64 that stores the number of milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.
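The TIMESTAMP_MILLIS conversion can be sketched with timezone-aware datetimes, as below. The helper names are hypothetical, and rejecting naive datetimes is a design choice of this sketch so that the UTC reference point is never ambiguous.

```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamp_to_millis(dt: datetime) -> int:
    """Milliseconds from the Unix epoch (UTC) to an aware datetime."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Floor division of timedeltas yields a whole number of milliseconds.
    return (dt - EPOCH_UTC) // timedelta(milliseconds=1)


def millis_to_timestamp(millis: int) -> datetime:
    """Inverse of timestamp_to_millis, returning a UTC datetime."""
    return EPOCH_UTC + timedelta(milliseconds=millis)
```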
INTERVAL is used for an interval of time. It must annotate a fixed_len_byte_array of length 12. This array stores three little-endian unsigned integers that represent durations at different granularities of time. The first stores a number in months, the second stores a number in days, and the third stores a number in milliseconds. This representation is independent of any particular timezone or date.
Each component in this representation is independent of the others. For example, there is no requirement that a large number of days should be expressed as a mix of months and days because there is not a constant conversion from days to months.
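The 12-byte INTERVAL layout described above can be sketched with Python's struct module: three little-endian unsigned 32-bit integers for months, days, and milliseconds. The function names are hypothetical.

```python
import struct

# "<III": little-endian, three unsigned 32-bit integers (12 bytes total).
INTERVAL_LAYOUT = struct.Struct("<III")


def encode_interval(months: int, days: int, millis: int) -> bytes:
    """Pack the three independent components into a 12-byte array."""
    return INTERVAL_LAYOUT.pack(months, days, millis)


def decode_interval(raw: bytes) -> tuple[int, int, int]:
    """Unpack a 12-byte array into (months, days, millis)."""
    if len(raw) != 12:
        raise ValueError("INTERVAL requires exactly 12 bytes")
    return INTERVAL_LAYOUT.unpack(raw)
```

The components are stored as given, with no normalization between them, so 45 days stays 45 days rather than being converted to one month and some days.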
JSON is used for an embedded JSON document. It must annotate a binary primitive type. The binary data is interpreted as a UTF-8 encoded character string of valid JSON as defined by the JSON specification.
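A JSON-annotated value could be produced and validated as sketched below, using only the standard json module. The helper names are hypothetical.

```python
import json


def to_json_binary(obj) -> bytes:
    """Serialize a Python object to UTF-8 encoded JSON bytes."""
    return json.dumps(obj).encode("utf-8")


def from_json_binary(raw: bytes):
    """Parse stored bytes as UTF-8 JSON.

    Raises UnicodeDecodeError for invalid UTF-8 and
    json.JSONDecodeError for invalid JSON.
    """
    return json.loads(raw.decode("utf-8"))
```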
BSON is used for an embedded BSON document. It must annotate a binary primitive type. The binary data is interpreted as an encoded BSON document as defined by the BSON specification.