Add metadata in the schema for storing decimals.

commit: b2836e591da8216cfca47075baee2c9a7b0b9289
author: Nong Li <nong@cloudera.com> Mon Mar 03 16:10:10 2014 -0800
committer: Nong Li <nong@cloudera.com> Tue Apr 15 10:32:44 2014 -0700
tree: 67fe872134ed29da75bd150a1ae246ced5f46f20
parent: f84edb351e7f981ac539384c2d7acf64d2808a6b
diff --git a/LogicalTypes.md b/LogicalTypes.md
new file mode 100644
index 0000000..96775af
--- /dev/null
+++ b/LogicalTypes.md

@@ -0,0 +1,47 @@
+Parquet Logical Type Definitions
+====
+
+Logical types are used to extend the types that parquet can be used to store,
+by specifying how the primitive types should be interpreted. This keeps the set
+of primitive types to a minimum and reuses parquet's efficient encodings. For
+example, strings are stored as byte arrays (binary) with a UTF8 annotation.
+
+This file contains the specification for all logical types.
+
+### Metadata
+
+The parquet format's `ConvertedType` stores the type annotation. The annotation
+may require additional metadata fields, as well as rules for those fields.
+
+### UTF8 (Strings)
+
+`UTF8` may only be used to annotate the binary primitive type and indicates
+that the byte array should be interpreted as a UTF-8 encoded character string.
+
+### DECIMAL
+
+`DECIMAL` annotation represents arbitrary-precision signed decimal numbers of
+the form `unscaledValue * 10^(-scale)`.
+
+The primitive type stores an unscaled integer value. For byte arrays, binary
+and fixed, the unscaled number must be encoded as two's complement using
+big-endian byte order (the most significant byte is the zeroth element). The
+scale stores the number of digits of that value that are to the right of the
+decimal point, and the precision stores the maximum number of digits supported
+in the unscaled value.
+
+If not specified, the scale is 0. Scale must be zero or a positive integer less
+than the precision. Precision is required and must be a non-zero positive
+integer. A precision too large for the underlying type (see below) is an error.
+
+`DECIMAL` can be used to annotate the following types:
+* `int32`: for 1 <= precision <= 9
+* `int64`: for 1 <= precision <= 18; precision <= 10 will produce a
+  warning
+* `fixed_len_byte_array`: precision is limited by the array size. Length `n`
+  can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
+* `binary`: `precision` is not limited, but is required. The minimum number of
+  bytes to store the unscaled value should be used.
+
+A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
+`scale` and `precision` fields set, even if scale is 0 by default.
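
Reviewer note: the byte-array encoding described in the new LogicalTypes.md can be illustrated with a short Python sketch. It is not part of this commit or of any Parquet library; `encode_unscaled` and `decode_decimal` are hypothetical helper names chosen for illustration. It assumes only what the text states: the unscaled value is stored as big-endian two's complement, and the decimal value is `unscaledValue * 10^(-scale)`.

```python
from decimal import Decimal


def encode_unscaled(unscaled: int) -> bytes:
    """Encode an unscaled integer as big-endian two's complement.

    Uses the minimum number of bytes, as the text recommends for `binary`.
    (Hypothetical helper, not a Parquet API.)
    """
    length = 1
    # Grow until the signed range of `length` bytes contains the value.
    while not (-(1 << (8 * length - 1)) <= unscaled < (1 << (8 * length - 1))):
        length += 1
    return unscaled.to_bytes(length, byteorder="big", signed=True)


def decode_decimal(raw: bytes, scale: int) -> Decimal:
    """Interpret raw bytes as unscaledValue * 10^(-scale)."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(ch) for ch in str(abs(unscaled)))
    # Building from a (sign, digits, exponent) tuple is exact and does not
    # depend on the current decimal context precision.
    return Decimal((sign, digits, -scale))
```

For example, the value 1.23 with scale 2 has unscaled value 123, which fits in a single byte; `decode_decimal(encode_unscaled(123), 2)` reconstructs `Decimal("1.23")`.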

diff --git a/README.md b/README.md
index 0b7a058..d64247f 100644
--- a/README.md
+++ b/README.md

@@ -114,6 +114,18 @@
   - DOUBLE: IEEE 64-bit floating point values
   - BYTE_ARRAY: arbitrarily long byte arrays.
 
+### Logical Types
+Logical types are used to extend the types that parquet can be used to store,
+by specifying how the primitive types should be interpreted. This keeps the set
+of primitive types to a minimum and reuses parquet's efficient encodings. For
+example, strings are stored as byte arrays (binary) with a UTF8 annotation.
+These annotations define how to further decode and interpret the data.
+Annotations are stored as a `ConvertedType` in the file metadata and are
+documented in
+[LogicalTypes.md][logical-types].
+
+[logical-types]: https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md
+
 ## Nested Encoding
 To encode nested columns, Parquet uses the Dremel encoding with definition and 
 repetition levels.  Definition levels specify how many optional fields in the 

diff --git a/src/thrift/parquet.thrift b/src/thrift/parquet.thrift
index dea26ef..52dea7f 100644
--- a/src/thrift/parquet.thrift
+++ b/src/thrift/parquet.thrift

@@ -60,6 +60,22 @@
 
   /** an enum is converted into a binary field */
   ENUM = 4;
+
+  /**
+   * A decimal value.
+   *
+   * This may be used to annotate binary or fixed primitive types. The
+   * underlying byte array stores the unscaled value encoded as two's
+   * complement using big-endian byte order (the most significant byte is the
+   * zeroth element). The value of the decimal is the value * 10^{-scale}.
+   *
+   * This must be accompanied by a (maximum) precision and a scale in the
+   * SchemaElement. The precision specifies the number of digits in the decimal
+   * and the scale stores the location of the decimal point. For example 1.23
+   * would have precision 3 (3 total digits) and scale 2 (the decimal point is
+   * 2 digits over).
+   */
+  DECIMAL = 5;
 }
 
 /**
@@ -86,7 +102,7 @@
    2: optional binary min;
    /** count of null value in the column */
    3: optional i64 null_count;
-   /** count of dictinct values occuring */
+   /** count of distinct values occurring */
    4: optional i64 distinct_count;
 }
 
@@ -125,6 +141,12 @@
    * Used to record the original type to help with cross conversion.
    */
   6: optional ConvertedType converted_type;
+
+  /** Used when this column contains decimal data.
+   * See the DECIMAL converted type for more details.
+   */
+  7: optional i32 scale
+  8: optional i32 precision
 }
 
 /**
@@ -145,9 +167,9 @@
   PLAIN = 0;
 
   /** Group VarInt encoding for INT32/INT64.
+   * This encoding is deprecated. It was never used
    */
-//  GROUP_VAR_INT = 1;
-// This encoding is deprecated. It was never used
+  //  GROUP_VAR_INT = 1;
 
   /**
    * Deprecated: Dictionary encoding. The values in the dictionary are encoded in the
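
Reviewer note: the precision and scale rules this commit adds for `SchemaElement` can be sketched as a small validation routine. This is only an illustration under stated assumptions: `max_precision` and `check_decimal` are hypothetical names, physical types are passed as plain strings, and the rules are taken directly from the LogicalTypes.md text above (precision is a positive integer, scale is at least 0 and less than precision, and precision is bounded by the underlying type).

```python
def max_precision(num_bytes: int) -> int:
    """Digits storable in num_bytes: floor(log_10(2^(8*n - 1) - 1)).

    Computed with integer arithmetic: for x >= 1, floor(log10(x)) equals
    the number of decimal digits of x minus one.
    """
    largest = (1 << (8 * num_bytes - 1)) - 1
    return len(str(largest)) - 1


def check_decimal(physical_type, precision, scale=0, type_length=None):
    """Raise ValueError if the DECIMAL annotation breaks the stated rules.

    `physical_type` is a plain string here (an assumption for this sketch).
    """
    if precision is None or precision <= 0:
        raise ValueError("precision is required and must be positive")
    if not 0 <= scale < precision:
        raise ValueError("scale must be >= 0 and less than precision")

    if physical_type == "int32":
        limit = 9
    elif physical_type == "int64":
        limit = 18
    elif physical_type == "fixed_len_byte_array":
        if type_length is None:
            raise ValueError("fixed_len_byte_array needs a type_length")
        limit = max_precision(type_length)
    elif physical_type == "binary":
        limit = None  # precision is required but not limited
    else:
        raise ValueError(f"DECIMAL cannot annotate {physical_type}")

    if limit is not None and precision > limit:
        raise ValueError(f"precision {precision} exceeds {limit}")
```

The bound agrees with the integer cases in the list: `max_precision(4)` is 9 and `max_precision(8)` is 18, matching the limits given for `int32` and `int64`.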