Add metadata in the schema for storing decimals.
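
For reviewers, here is a minimal sketch of the DECIMAL encoding that LogicalTypes.md describes in this patch: a big-endian two's complement unscaled value, a scale, and the precision bound for fixed-length arrays. It is not part of the change, and the helper names (`max_precision`, `encode_unscaled`, `decode_decimal`) are hypothetical, not part of any Parquet API.

```python
from decimal import Decimal


def max_precision(n: int) -> int:
    """Digits that fit in n bytes: floor(log_10(2^(8*n - 1) - 1))."""
    max_value = 2 ** (8 * n - 1) - 1
    # For a positive integer x, floor(log10(x)) == len(str(x)) - 1.
    return len(str(max_value)) - 1


def encode_unscaled(unscaled: int) -> bytes:
    """Encode an unscaled integer as minimal-length big-endian two's complement."""
    # Magnitude bits plus one sign bit, rounded up to whole bytes.
    magnitude = unscaled if unscaled >= 0 else ~unscaled
    length = (magnitude.bit_length() + 1 + 7) // 8
    return unscaled.to_bytes(length, "big", signed=True)


def decode_decimal(data: bytes, scale: int) -> Decimal:
    """Interpret the bytes as unscaledValue * 10^(-scale)."""
    unscaled = int.from_bytes(data, "big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # The tuple constructor is exact and ignores the context precision.
    return Decimal((sign, digits, -scale))
```

With these helpers, 1.23 (precision 3, scale 2) is stored as the single byte `0x7b`, and `max_precision(4)` and `max_precision(8)` give the int32 and int64 limits of 9 and 18 digits.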
diff --git a/LogicalTypes.md b/LogicalTypes.md
new file mode 100644
index 0000000..96775af
--- /dev/null
+++ b/LogicalTypes.md
@@ -0,0 +1,47 @@
+Parquet Logical Type Definitions
+====
+
+Logical types are used to extend the types that parquet can be used to store,
+by specifying how the primitive types should be interpreted. This keeps the set
+of primitive types to a minimum and reuses parquet's efficient encodings. For
+example, strings are stored as byte arrays (binary) with a UTF8 annotation.
+
+This file contains the specification for all logical types.
+
+### Metadata
+
+The parquet format's `ConvertedType` stores the type annotation. The annotation
+may require additional metadata fields, as well as rules for those fields.
+
+### UTF8 (Strings)
+
+`UTF8` may only be used to annotate the binary primitive type and indicates
+that the byte array should be interpreted as a UTF-8 encoded character string.
+
+### DECIMAL
+
+The `DECIMAL` annotation represents arbitrary-precision signed decimal numbers
+of the form `unscaledValue * 10^(-scale)`.
+
+The primitive type stores an unscaled integer value. For byte arrays, binary
+and fixed, the unscaled number must be encoded as two's complement using
+big-endian byte order (the most significant byte is the zeroth element). The
+scale stores the number of digits of that value that are to the right of the
+decimal point, and the precision stores the maximum number of digits supported
+in the unscaled value.
+
+If not specified, the scale is 0. Scale must be zero or a positive integer less
+than the precision. Precision is required and must be a positive integer. A
+precision too large for the underlying type (see below) is an error.
+
+`DECIMAL` can be used to annotate the following types:
+* `int32`: for 1 <= precision <= 9
+* `int64`: for 1 <= precision <= 18; precision < 10 will produce a
+ warning
+* `fixed_len_byte_array`: precision is limited by the array size. Length `n`
+ can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
+* `binary`: `precision` is not limited, but is required. The minimum number of
+ bytes to store the unscaled value should be used.
+
+A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
+`scale` and `precision` fields set, even if scale is 0 by default.
diff --git a/README.md b/README.md
index 0b7a058..d64247f 100644
--- a/README.md
+++ b/README.md
@@ -114,6 +114,18 @@
- DOUBLE: IEEE 64-bit floating point values
- BYTE_ARRAY: arbitrarily long byte arrays.
+### Logical Types
+Logical types are used to extend the types that parquet can be used to store,
+by specifying how the primitive types should be interpreted. This keeps the set
+of primitive types to a minimum and reuses parquet's efficient encodings. For
+example, strings are stored as byte arrays (binary) with a UTF8 annotation.
+These annotations define how to further decode and interpret the data.
+Annotations are stored as a `ConvertedType` in the file metadata and are
+documented in
+[LogicalTypes.md][logical-types].
+
+[logical-types]: https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md
+
## Nested Encoding
To encode nested columns, Parquet uses the Dremel encoding with definition and
repetition levels. Definition levels specify how many optional fields in the
diff --git a/src/thrift/parquet.thrift b/src/thrift/parquet.thrift
index dea26ef..52dea7f 100644
--- a/src/thrift/parquet.thrift
+++ b/src/thrift/parquet.thrift
@@ -60,6 +60,22 @@
/** an enum is converted into a binary field */
ENUM = 4;
+
+ /**
+ * A decimal value.
+ *
+ * This may be used to annotate binary or fixed primitive types. The
+ * underlying byte array stores the unscaled value encoded as two's
+ * complement using big-endian byte order (the most significant byte is the
+ * zeroth element). The decimal's value is unscaledValue * 10^{-scale}.
+ *
+ * This must be accompanied by a (maximum) precision and a scale in the
+ * SchemaElement. The precision specifies the number of digits in the decimal
+ * and the scale stores the location of the decimal point. For example 1.23
+ * would have precision 3 (3 total digits) and scale 2 (the decimal point is
+ * 2 digits over).
+ */
+ DECIMAL = 5;
}
/**
@@ -86,7 +102,7 @@
2: optional binary min;
/** count of null value in the column */
3: optional i64 null_count;
- /** count of dictinct values occuring */
+ /** count of distinct values occurring */
4: optional i64 distinct_count;
}
@@ -125,6 +141,12 @@
* Used to record the original type to help with cross conversion.
*/
6: optional ConvertedType converted_type;
+
+ /** Used when this column contains decimal data.
+ * See the DECIMAL converted type for more details.
+ */
+ 7: optional i32 scale
+ 8: optional i32 precision
}
/**
@@ -145,9 +167,9 @@
PLAIN = 0;
/** Group VarInt encoding for INT32/INT64.
+ * This encoding is deprecated. It was never used
*/
-// GROUP_VAR_INT = 1;
-// This encoding is deprecated. It was never used
+ // GROUP_VAR_INT = 1;
/**
* Deprecated: Dictionary encoding. The values in the dictionary are encoded in the