  1. array_empty.metadata
  2. array_empty.value
  3. array_nested.metadata
  4. array_nested.value
  5. array_primitive.metadata
  6. array_primitive.value
  7. data_dictionary.json
  8. long_string.metadata
  9. long_string.value
  10. object_empty.metadata
  11. object_empty.value
  12. object_nested.metadata
  13. object_nested.value
  14. object_primitive.metadata
  15. object_primitive.value
  16. primitive_binary.metadata
  17. primitive_binary.value
  18. primitive_boolean_false.metadata
  19. primitive_boolean_false.value
  20. primitive_boolean_true.metadata
  21. primitive_boolean_true.value
  22. primitive_date.metadata
  23. primitive_date.value
  24. primitive_decimal16.metadata
  25. primitive_decimal16.value
  26. primitive_decimal4.metadata
  27. primitive_decimal4.value
  28. primitive_decimal8.metadata
  29. primitive_decimal8.value
  30. primitive_double.metadata
  31. primitive_double.value
  32. primitive_float.metadata
  33. primitive_float.value
  34. primitive_int16.metadata
  35. primitive_int16.value
  36. primitive_int32.metadata
  37. primitive_int32.value
  38. primitive_int64.metadata
  39. primitive_int64.value
  40. primitive_int8.metadata
  41. primitive_int8.value
  42. primitive_null.metadata
  43. primitive_null.value
  44. primitive_string.metadata
  45. primitive_string.value
  46. primitive_time.metadata
  47. primitive_time.value
  48. primitive_timestamp.metadata
  49. primitive_timestamp.value
  50. primitive_timestamp_nanos.metadata
  51. primitive_timestamp_nanos.value
  52. primitive_timestampntz.metadata
  53. primitive_timestampntz.value
  54. primitive_timestampntz_nanos.metadata
  55. primitive_timestampntz_nanos.value
  56. primitive_uuid.metadata
  57. primitive_uuid.value
  58. README.md
  59. regen.py
  60. short_string.metadata
  61. short_string.value
variant/README.md

Variant Binary Encoding

This directory contains binary artifacts encoded using the Parquet Variant binary encoding. These files are not valid Parquet files, but rather raw binary data.

Structure

  • data_dictionary.json - contains the JSON representation for each example

Each example consists of 2 files:

  • .metadata -- the binary contents of the metadata field
  • .value -- the binary contents of the value field
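As an illustration of the layout above, the following is a minimal sketch of how a test harness might read one example pair from this directory. The helper names `load_example` and `list_examples` are hypothetical and are not part of regen.py or any Parquet library.

```python
from pathlib import Path


def load_example(directory, name):
    """Return (metadata_bytes, value_bytes) for one example, e.g. 'primitive_int8'."""
    base = Path(directory)
    metadata = (base / f"{name}.metadata").read_bytes()
    value = (base / f"{name}.value").read_bytes()
    return metadata, value


def list_examples(directory):
    """Return the sorted names of examples that have both a .metadata and a .value file."""
    base = Path(directory)
    names = {p.stem for p in base.glob("*.metadata")}
    return sorted(n for n in names if (base / f"{n}.value").exists())
```

Both files are read as raw bytes because they are not Parquet files and cannot be opened with a Parquet reader.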

Descriptions

  1. primitive_<type> -- Examples of primitives (basic_type = 0), one for each of the primitive types listed in the spec
  2. short_string -- Example of a short string (basic_type = 1)
  3. object_empty -- Example of an object (basic_type = 2) with no fields
  4. object_primitive -- Example of an object with only primitive fields
  5. object_nested -- Example of an object with other objects in its fields
  6. array_empty -- Example of an array (basic_type = 3) with no elements
  7. array_primitive -- Example of an array with only primitive elements
  8. array_nested -- Example of an array with objects and other arrays in the elements

Regenerating these files

The files in this directory were initially generated by running the regen.py script, which uses Apache Spark. Where necessary, the files were subsequently modified to ensure that they conform to the Parquet spec.

Modification 1: Created metadata and value for primitive_null

Per https://github.com/apache/parquet-testing/issues/81, Spark did not generate any metadata for null and left primitive_null.metadata empty. The metadata for primitive_null should be the same 3 bytes as for the other primitive types:

  • header = 0x01
  • dictionary_size = 0x00
  • offsets = dictionary_size + 1 = 1 value of 1 byte: 0x00
cp primitive_int8.metadata primitive_null.metadata
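The three bytes can be checked with a small parser. This is a sketch only: the bit layout of the header byte (version in the low 4 bits, sorted_strings in bit 4, offset_size_minus_one in the top 2 bits) is an assumption based on a reading of the Variant encoding spec, and `parse_metadata` is a hypothetical helper.

```python
def parse_metadata(data: bytes) -> dict:
    """Decode a Variant metadata blob into its header fields and dictionary strings."""
    header = data[0]
    version = header & 0x0F                   # assumed: low 4 bits
    sorted_strings = bool((header >> 4) & 1)  # assumed: bit 4
    offset_size = ((header >> 6) & 0x03) + 1  # assumed: top 2 bits, stored minus one

    pos = 1
    dictionary_size = int.from_bytes(data[pos:pos + offset_size], "little")
    pos += offset_size

    # There are dictionary_size + 1 offsets, each offset_size bytes wide.
    offsets = []
    for _ in range(dictionary_size + 1):
        offsets.append(int.from_bytes(data[pos:pos + offset_size], "little"))
        pos += offset_size

    # Dictionary strings follow the offsets; offsets are relative to this point.
    strings = [
        data[pos + offsets[i]:pos + offsets[i + 1]].decode("utf-8")
        for i in range(dictionary_size)
    ]
    return {
        "version": version,
        "sorted_strings": sorted_strings,
        "offset_size": offset_size,
        "dictionary_size": dictionary_size,
        "offsets": offsets,
        "strings": strings,
    }


# The 3-byte metadata described above: header, dictionary_size, one offset.
print(parse_metadata(bytes([0x01, 0x00, 0x00])))
```

For the bytes 0x01 0x00 0x00 this yields version 1, an empty dictionary, and a single offset of 0.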

The value for a primitive null should be a value_header with no value_data, resulting in a single 0 byte:

echo -n 'a' | tr a '\0' > primitive_null.value
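To confirm that the resulting byte decodes as expected, the first byte of a value can be split into its two fields. This sketch assumes the layout from the Variant encoding spec, where the low 2 bits hold basic_type and the upper 6 bits hold the type-specific value_header; `split_value_header` is a hypothetical helper.

```python
def split_value_header(value: bytes) -> tuple:
    """Return (basic_type, value_header) taken from the first byte of a Variant value."""
    first = value[0]
    basic_type = first & 0x03  # assumed: low 2 bits
    value_header = first >> 2  # assumed: upper 6 bits
    return basic_type, value_header


# The single zero byte written to primitive_null.value.
print(split_value_header(b"\x00"))  # (0, 0)
```

Both fields are 0 for the single zero byte, and no value_data follows.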

Modification 2: Created TimeNTZ/Timestamp with timezone nanos/Timestamp without timezone nanos/UUID with Iceberg test code

Currently, Spark does not support Variant values containing UUID, Time, or nanosecond-precision Timestamp. The primitive_time.[metadata/value], primitive_timestamp_nanos.[metadata/value], primitive_timestampntz_nanos.[metadata/value] and primitive_uuid.[metadata/value] files were generated by Iceberg test code.