title: Xlang Serialization Format
sidebar_position: 0
id: fory_xlang_serialization_spec
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the “License”); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0
Apache Fory™ xlang serialization enables automatic cross-language object serialization with support for shared references, circular references, and polymorphism. Unlike traditional serialization frameworks that require IDL definitions and schema compilation, Fory serializes objects directly without any intermediate steps.
Key characteristics:
This specification defines the Fory xlang binary format. The format is dynamic rather than static, which enables flexibility and ease of use at the cost of additional complexity in the wire format.
- List<SomeClass>, we can save dynamic serializer dispatch since SomeClass is morphic (final).
- struct whose type mapping will be encoded as a name.
- compatible_struct whose type mapping will be encoded as a name.
- ext type whose type mapping will be encoded as a name.
- list/map/set/array are not allowed as key of map.

Note:
For polymorphism, if a non-final class is registered and only one of its subclasses is registered, then all elements in a List/Map can be assumed to have the same type, which reduces runtime check cost.
Collection/Array polymorphism is not fully supported, since some languages such as Go have only one collection type. If users want to get back exactly the type they passed, they must pass that type when deserializing or annotate the struct field with that type.
Due to differences between the type systems of languages, these types can't be mapped one-to-one between languages. When deserializing, Fory uses the target data structure type and the data type in the data jointly to determine how to deserialize and populate the target data structure. For example:
class Foo {
  int[] intArray;
  Object[] objects;
  List<Object> objectList;
}

class Foo2 {
  int[] intArray;
  List<Object> objects;
  List<Object> objectList;
}
intArray has an int32_array type, but both the objects and objectList fields in the serialized data have the list data type. When deserializing, the implementation will create an Object array for objects, but create an ArrayList for objectList to populate its elements. The serialized data of Foo can also be deserialized into Foo2.
Users can also provide meta hints for fields of a type, or for the type as a whole. Here is an example in Java that uses annotations to provide such information.
@ForyObject(fieldsNullable = false, trackingRef = false)
class Foo {
  @ForyField(trackingRef = false)
  int[] intArray;
  @ForyField(polymorphic = true)
  Object object;
  @ForyField(tagId = 1, nullable = true)
  List<Object> objectList;
}
Such information can be provided in other languages too:
All internal data types are expressed using an ID in range 0~64. Users can use IDs in range 0~8192 for registering their custom types (struct/ext/enum). User type IDs are in a separate namespace and combined with internal type IDs via bit shifting: (user_type_id << 8) | internal_type_id.
| Type ID | Name | Description |
|---|---|---|
| 0 | UNKNOWN | Unknown type, used for dynamic typing |
| 1 | BOOL | Boolean value |
| 2 | INT8 | 8-bit signed integer |
| 3 | INT16 | 16-bit signed integer |
| 4 | INT32 | 32-bit signed integer |
| 5 | VAR_INT32 | Variable-length encoded 32-bit signed integer |
| 6 | INT64 | 64-bit signed integer |
| 7 | VAR_INT64 | Variable-length encoded 64-bit signed integer |
| 8 | SLI_INT64 | Small Long as Int encoded 64-bit signed integer |
| 9 | FLOAT16 | 16-bit floating point (half precision) |
| 10 | FLOAT32 | 32-bit floating point (single precision) |
| 11 | FLOAT64 | 64-bit floating point (double precision) |
| 12 | STRING | UTF-8/UTF-16/Latin1 encoded string |
| 13 | ENUM | Enum registered by numeric ID |
| 14 | NAMED_ENUM | Enum registered by namespace + type name |
| 15 | STRUCT | Struct registered by numeric ID (schema consistent) |
| 16 | COMPATIBLE_STRUCT | Struct with schema evolution support (by ID) |
| 17 | NAMED_STRUCT | Struct registered by namespace + type name |
| 18 | NAMED_COMPATIBLE_STRUCT | Struct with schema evolution (by name) |
| 19 | EXT | Extension type registered by numeric ID |
| 20 | NAMED_EXT | Extension type registered by namespace + type name |
| 21 | LIST | Ordered collection (List, Array, Vector) |
| 22 | SET | Unordered collection of unique elements |
| 23 | MAP | Key-value mapping |
| 24 | DURATION | Time duration (seconds + nanoseconds) |
| 25 | TIMESTAMP | Point in time (nanoseconds since epoch) |
| 26 | LOCAL_DATE | Date without timezone (days since epoch) |
| 27 | DECIMAL | Arbitrary precision decimal |
| 28 | BINARY | Raw binary data |
| 29 | ARRAY | Generic array type |
| 30 | BOOL_ARRAY | 1D boolean array |
| 31 | INT8_ARRAY | 1D int8 array |
| 32 | INT16_ARRAY | 1D int16 array |
| 33 | INT32_ARRAY | 1D int32 array |
| 34 | INT64_ARRAY | 1D int64 array |
| 35 | FLOAT16_ARRAY | 1D float16 array |
| 36 | FLOAT32_ARRAY | 1D float32 array |
| 37 | FLOAT64_ARRAY | 1D float64 array |
| 38 | UNION | Tagged union type (one of several alternatives) |
| 39 | NONE | Empty/unit type (no data) |
When registering user types (struct/ext/enum), the full type ID combines user ID and internal type ID:
Full Type ID = (user_type_id << 8) | internal_type_id
Examples:
| User ID | Type | Internal ID | Full Type ID | Decimal |
|---|---|---|---|---|
| 0 | STRUCT | 15 | (0 << 8) \| 15 | 15 |
| 0 | ENUM | 13 | (0 << 8) \| 13 | 13 |
| 1 | STRUCT | 15 | (1 << 8) \| 15 | 271 |
| 1 | COMPATIBLE_STRUCT | 16 | (1 << 8) \| 16 | 272 |
| 2 | NAMED_STRUCT | 17 | (2 << 8) \| 17 | 529 |
When reading type IDs:
- internal_type_id = full_type_id & 0xFF
- user_type_id = full_type_id >> 8

See Type mapping
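The packing and unpacking rules above can be sketched in a few lines of Python. The helper names `make_full_type_id` and `split_full_type_id` are hypothetical and only illustrate the bit arithmetic; they are not part of any Fory API.

```python
from typing import Tuple

STRUCT = 15        # internal type IDs taken from the table above
NAMED_STRUCT = 17


def make_full_type_id(user_type_id: int, internal_type_id: int) -> int:
    """Combine a user type ID with an internal type ID (low 8 bits)."""
    if not 0 <= internal_type_id <= 0xFF:
        raise ValueError("internal type ID must fit in 8 bits")
    return (user_type_id << 8) | internal_type_id


def split_full_type_id(full_type_id: int) -> Tuple[int, int]:
    """Return (user_type_id, internal_type_id) for a full type ID."""
    return full_type_id >> 8, full_type_id & 0xFF
```

For instance, `make_full_type_id(1, STRUCT)` yields 271, matching the third row of the examples table.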
Here is the overall format:
| fory header | object ref meta | object type meta | object value data |
The data is serialized using little-endian byte order overall. If byte swapping is costly for some object, Fory will write the byte order for that object into the data instead of converting it to little endian.
Fory header format for xlang serialization:
|   2 bytes    |         1 byte bitmap          |   1 byte   |          optional 4 bytes          |
+--------------+--------------------------------+------------+------------------------------------+
| magic number | 4 bits reserved | 4 bits meta  |  language  | unsigned int for meta start offset |
Detailed byte layout:
Byte 0-1: Magic number (0x62d4) - little endian
Byte 2: Bitmap flags
- Bit 0: null flag (0x01)
- Bit 1: endian flag (0x02)
- Bit 2: xlang flag (0x04)
- Bit 3: oob flag (0x08)
- Bits 4-7: reserved
Byte 3: Language ID (only present when xlang flag is set)
Byte 4-7: Meta start offset (only present when meta share mode is enabled)
Magic number: 0x62d4 (2 bytes, little endian), used to identify the Fory xlang serialization protocol.

| Language | ID |
|---|---|
| XLANG | 0 |
| JAVA | 1 |
| PYTHON | 2 |
| CPP | 3 |
| GO | 4 |
| JAVASCRIPT | 5 |
| RUST | 6 |
| DART | 7 |
If compatible mode is enabled, an uncompressed unsigned int32 (4 bytes, little endian) is appended to indicate the start offset of metadata. During serialization, this is initially written as a placeholder (e.g., -1 or 0), then updated after all objects are serialized and metadata is collected.
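As an illustration of the byte layout above, here is a minimal Python sketch of a header reader. The function name `parse_header` and the `has_meta_offset` parameter are assumptions made for this example: the caller is expected to know from its own configuration whether the 4-byte meta start offset is present.

```python
import struct

MAGIC_NUMBER = 0x62D4
NULL_FLAG, ENDIAN_FLAG, XLANG_FLAG, OOB_FLAG = 0x01, 0x02, 0x04, 0x08


def parse_header(data: bytes, has_meta_offset: bool = False) -> dict:
    """Parse the Fory xlang header and report how many bytes it occupies."""
    # Bytes 0-1: little-endian magic number, byte 2: bitmap flags.
    magic, bitmap = struct.unpack_from("<HB", data, 0)
    if magic != MAGIC_NUMBER:
        raise ValueError(f"unexpected magic number: {magic:#06x}")
    header = {
        "null": bool(bitmap & NULL_FLAG),
        "endian": bool(bitmap & ENDIAN_FLAG),
        "xlang": bool(bitmap & XLANG_FLAG),
        "oob": bool(bitmap & OOB_FLAG),
    }
    pos = 3
    if header["xlang"]:
        # Byte 3: language ID, only present when the xlang flag is set.
        header["language"] = data[pos]
        pos += 1
    if has_meta_offset:
        # Uncompressed unsigned int32, little endian.
        (header["meta_start_offset"],) = struct.unpack_from("<I", data, pos)
        pos += 4
    header["size"] = pos
    return header
```

Calling `parse_header(bytes([0xD4, 0x62, 0x06, 0x02]))` reports the endian and xlang flags as set and language ID 2 (PYTHON).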
Reference tracking handles whether the object is null, and whether to track reference for the object by writing corresponding flags and maintaining internal state.
| Flag | Byte Value (int8) | Hex | Description |
|---|---|---|---|
| NULL FLAG | -3 | 0xFD | Object is null. No further bytes are written for this object. |
| REF FLAG | -2 | 0xFE | Object was already serialized. Followed by unsigned varint32 reference ID. |
| NOT_NULL VALUE FLAG | -1 | 0xFF | Object is non-null but reference tracking is disabled for this type. Object data follows immediately. |
| REF VALUE FLAG | 0 | 0x00 | Object is referencable and this is its first occurrence. Object data follows. Assigns next reference ID. |
Writing:
function write_ref_or_null(buffer, obj):
    if obj is null:
        buffer.write_int8(NULL_FLAG)  // -3
        return true  // done, no more data to write
    if reference_tracking_enabled:
        ref_id = lookup_written_objects(obj)
        if ref_id exists:
            buffer.write_int8(REF_FLAG)  // -2
            buffer.write_varuint32(ref_id)
            return true  // done, reference written
        else:
            buffer.write_int8(REF_VALUE_FLAG)  // 0
            add_to_written_objects(obj, next_ref_id++)
            return false  // continue to serialize object data
    else:
        buffer.write_int8(NOT_NULL_VALUE_FLAG)  // -1
        return false  // continue to serialize object data
Reading:
function read_ref_or_null(buffer):
    flag = buffer.read_int8()
    switch flag:
        case NULL_FLAG (-3):
            return (null, true)  // null object, done
        case REF_FLAG (-2):
            ref_id = buffer.read_varuint32()
            obj = get_from_read_objects(ref_id)
            return (obj, true)  // referenced object, done
        case NOT_NULL_VALUE_FLAG (-1):
            return (null, false)  // non-null, continue reading
        case REF_VALUE_FLAG (0):
            reserve_ref_slot()  // will be filled after reading
            return (null, false)  // non-null, continue reading
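The writer side of this protocol can be sketched in Python as follows. `RefWriter` is a hypothetical class written for this example, and it keys objects by `id(obj)`, which is only safe while the objects stay alive for the whole serialization; a real implementation would use an identity map.

```python
NULL_FLAG, REF_FLAG, NOT_NULL_VALUE_FLAG, REF_VALUE_FLAG = -3, -2, -1, 0


def write_varuint32(out: bytearray, value: int) -> None:
    """Standard 7-bit varint with a continuation bit."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


class RefWriter:
    def __init__(self, tracking: bool = True):
        self.tracking = tracking
        self.written = {}  # id(obj) -> reference ID

    def write_ref_or_null(self, out: bytearray, obj) -> bool:
        """Write the ref flag; return True if nothing else must be written."""
        if obj is None:
            out.append(NULL_FLAG & 0xFF)
            return True
        if not self.tracking:
            out.append(NOT_NULL_VALUE_FLAG & 0xFF)
            return False
        ref_id = self.written.get(id(obj))
        if ref_id is not None:
            out.append(REF_FLAG & 0xFF)
            write_varuint32(out, ref_id)
            return True
        out.append(REF_VALUE_FLAG)
        self.written[id(obj)] = len(self.written)  # next sequential ID
        return False
```

Writing the same list twice followed by `None` produces the bytes `00 FE 00 FD`: first occurrence, back-reference to ID 0, then null.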
Reference IDs are assigned sequentially starting from 0. An ID is assigned when REF_VALUE_FLAG is written (first occurrence).

When reference tracking is disabled globally or for specific types, only the NULL and NOT_NULL VALUE flags will be used for reference meta. This reduces overhead for types that are known not to have references.
Languages with nullable and reference types by default (Java, Python, JavaScript):
In xlang mode, for cross-language compatibility:
- Optional types (e.g., java.util.Optional, typing.Optional) are treated as nullable

Annotation examples:

// Java: use @ForyField annotation
public class MyClass {
  @ForyField(nullable = true, ref = true)
  private Object refField;

  @ForyField(nullable = false)
  private String requiredField;
}

# Python: use typing with fory field descriptors
from pyfory import Fory, ForyField

class MyClass:
    ref_field: ForyField(SomeType, nullable=True, ref=True)
    required_field: ForyField(str, nullable=False)
Languages with non-nullable types by default:
| Language | Null Representation | Reference Tracking Support |
|---|---|---|
| Rust | Option::None | Via Rc<T>, Arc<T>, Weak<T> |
| C++ | std::nullopt, nullptr | Via std::shared_ptr<T>, weak_ptr<T> |
| Go | nil interface/pointer | Via pointer/interface types |
Important: For languages like Rust that don't have implicit reference semantics, reference tracking must use explicit smart pointers (Rc, Arc).
Every type to be serialized has a type id to indicate its type.

- Type.ENUM + registered id
- Type.NAMED_ENUM + registered namespace+typename
- Type.LIST
- Type.SET
- Type.MAP
- Type.EXT + registered id
- Type.NAMED_EXT + registered namespace+typename
- Type.STRUCT + struct meta
- Type.NAMED_STRUCT + struct meta

Every type must be registered with an ID or name first. The registration can be used for security checks and type identification.
Struct is a special type: depending on whether schema compatibility is enabled, Fory will write struct meta differently.
Only ext/enum/struct can be registered using namespaced type.
- type_id. Schema evolution related meta will be ignored.
- struct vs table in flatbuffers:
- captured_type_defs: captured_type_defs[type def stub] = map size ahead when registering type.
- captured_type_defs, write that index as | unsigned varint: index |.

If schema evolution mode is enabled globally when creating fory, and enabled for the current type, type meta will be written using one of the following modes. Which mode to use is configured when creating fory.
Normal mode(meta share not enabled):
- type def to captured_type_defs: captured_type_defs[type def] = map size.
- captured_type_defs, write that index as | unsigned varint: index |.
- captured_type_defs:

Firstly, set current to meta start offset of fory header
Then write captured_type_defs one by one:
buffer.write_var_uint32(len(writing_type_defs) - len(schema_consistent_type_def_stubs))
for type_meta in writing_type_defs:
    if not type_meta.is_stub():
        type_meta.write_type_def(buffer)
writing_type_defs = copy(schema_consistent_type_def_stubs)
Meta share mode: the writing steps are same as the normal mode, but captured_type_defs will be shared across multiple serializations of different objects. For example, suppose we have a batch to serialize:
captured_type_defs = {}
stream = ...
# add `Type1` to `captured_type_defs` and write `Type1`
fory.serialize(stream, [Type1()])
# add `Type2` to `captured_type_defs` and write `Type2`, `Type1` is written before.
fory.serialize(stream, [Type1(), Type2()])
# `Type1` and `Type2` are written before, no need to write meta.
fory.serialize(stream, [Type1(), Type2()])
Streaming mode(streaming mode doesn't support meta share):
If type meta hasn't been written before, the data will be written as:
| unsigned varint: 0b11111111 | type def |
If type meta has been written before, the data will be written as:
| unsigned varint: written index << 1 |
written index is the id in captured_type_defs.
With this mode, meta start offset can be omitted.
The normal mode and meta share mode forbid streaming writes, since they need to look back and update the start offset after the whole object graph has been written and meta collection has finished. Only in this way can we ensure that a deserialization failure in meta share mode doesn't lose shared meta.
Here we mainly describe the meta layout for schema evolution mode:
|    8 bytes header    |   variable bytes   |  variable bytes   |
+----------------------+--------------------+-------------------+
| global binary header |    meta header     |    fields meta    |

For languages which support inheritance, if a parent class and a subclass have fields with the same name, the field in the subclass is used.
50 bits hash + 1bit compress flag + write fields meta + 12 bits meta size. Right is the lower bits.
- >= 0b1111_1111_1111, then write meta_size - 0b1111_1111_1111 next.
- flags + all layers class meta.

Meta header is an 8-bit number value.

- 0b00000~0b11110 are used to record num fields. 0b11111 is reserved to indicate that Fory needs to read more bytes for the length using Fory unsigned int encoding. Note that num_fields is the number of compatible fields. Users can use tag id to mark some fields as compatible fields in a schema consistent context. In such cases, schema consistent fields are serialized first, then compatible fields. At deserialization, Fory uses the field info of those fields which aren't annotated by tag id to deserialize schema consistent fields, then uses the field info in meta to deserialize compatible fields.

Format:
|   field info: variable bytes    | variable bytes  | ... |
+---------------------------------+-----------------+-----+
| header + type info + field name | next field info | ... |

The field header is 8 bits. Annotations can be used to provide more specific info; if no annotation exists, Fory infers that info automatically.
The format for field header is:
2 bits field name encoding + 4 bits size + nullability flag + ref tracking flag
Detailed spec:
- 2 bits field name encoding: UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID. TAG_ID is encoded as 11.
- 4 bits size: 0~14 indicate length 1~15; the value 15 indicates that more bytes must be read, in which case size - 15 is encoded as a varint next.
- If the encoding is TAG_ID, then num_bytes of field name will be used to store tag id.

Field type info is written as unsigned int8. Detailed id spec is:
- Type.STRUCT.
- Type.NAMED_STRUCT.
- Type.ENUM.
- Type.NAMED_ENUM.
- Type.EXT.
- Type.NAMED_EXT.
- Type.LIST/SET, then write element type recursively.
- Type.XXX_ARRAY.
- Type.TENSOR.
- Type.LIST, then write element type recursively.
- Type.MAP, then write key and value type recursively.
- Type.UNKNOWN instead. For such types, actual type will be written when serializing such field values.

Polymorphism spec:

- struct/named_struct/ext/named_ext are taken as polymorphic, the meta for those types are written separately instead of inlining here to reduce meta space cost if object of this type is serialized in current object graph multiple times, and the field value may be null too.
- enum is taken as morphic, if deserialization doesn't have this field, or the type is not enum, enum value will be skipped.
- list/map/set are taken as morphic, when serializing values of those type, the concrete types won't be written again.

List/Set/Map nested type spec:
- list: | list type id | nested type id << 2 + nullability flag + ref tracking flag | ... multi-layer type info |
- set: | set type id | nested type id << 2 + nullability flag + ref tracking flag | ... multi-layer type info |
- map: | map type id | key type info | value type info |
  - key type info: | nested type id << 2 + nullability flag + ref tracking flag | ... multi-layer type info |
  - value type info: | nested type id << 2 + nullability flag + ref tracking flag | ... multi-layer type info |

If tag id is set, tag id will be used instead. Otherwise the meta string of the field name will be written.
Field order is left as an implementation detail and is not exposed in the specification; deserialization needs to re-sort fields based on the Fory field sort algorithm. In this way, Fory can compute statistics for field names or types and use a more compact encoding.
If one wants to support inheritance for structs, one can implement the following spec.
Fields are serialized from parent type to leaf type. Fields are sorted using the Fory struct field sort algorithm.
Meta layout for schema evolution mode:
|    8 bytes header    | variable bytes | variable bytes |   variable bytes   |   variable bytes   |
+----------------------+----------------+----------------+--------------------+--------------------+
| global binary header |  meta header   |  fields meta   | parent meta header | parent fields meta |
Meta header is a 64 bits number value encoded in little endian order.
- 0b0000~0b1110 are used to record num classes. 0b1111 is reserved to indicate that Fory needs to read more bytes for the length using Fory unsigned int encoding. If the current type doesn't have a parent type, or the parent type doesn't have fields to serialize, or we're in a context which serializes fields of the current type only, num classes will be 1.
- flags + all layers type meta.

| unsigned varint | var uint |  field info: variable bytes   | variable bytes  | ... |
+-----------------+----------+-------------------------------+-----------------+-----+
|   num_fields    | type id  | header + type id + field name | next field info | ... |
Same encoding algorithm as the previous layer.
Meta string is a compressed encoding for metadata strings such as field names, type names, and namespaces. This compression significantly reduces the size of type metadata in serialized data.
| ID | Name | Bits/Char | Character Set |
|---|---|---|---|
| 0 | UTF8 | 8 | Any UTF-8 character |
| 1 | LOWER_SPECIAL | 5 | a-z . _ $ \| |
| 2 | LOWER_UPPER_DIGIT_SPECIAL | 6 | a-z A-Z 0-9 . _ |
| 3 | FIRST_TO_LOWER_SPECIAL | 5 | First char uppercase, rest a-z . _ |
| 4 | ALL_TO_LOWER_SPECIAL | 5 | a-z A-Z . _ (uppercase escaped) |
| Character | Code (binary) | Code (decimal) |
|---|---|---|
| a-z | 00000-11001 | 0-25 |
| . | 11010 | 26 |
| _ | 11011 | 27 |
| $ | 11100 | 28 |
| \| | 11101 | 29 |
Note: The | character is used as an escape sequence in ALL_TO_LOWER_SPECIAL encoding.
| Character | Code (binary) | Code (decimal) |
|---|---|---|
| a-z | 000000-011001 | 0-25 |
| A-Z | 011010-110011 | 26-51 |
| 0-9 | 110100-111101 | 52-61 |
| . | 111110 | 62 |
| _ | 111111 | 63 |
For strings containing only a-z, ., _, $, |:
function encode_lower_special(str):
    bits = [0]  // leading flag bit, set below if needed
    for char in str:
        bits.append(lookup_lower_special[char])  // 5 bits each
    // Pad to byte boundary; the flag bit counts toward the total
    total_bits = 1 + len(str) * 5
    padding_bits = (8 - (total_bits % 8)) % 8
    // First bit indicates if last char should be stripped (due to padding):
    // 5 or more zero padding bits would otherwise decode as an extra 'a'
    strip_last = (padding_bits >= 5)
    if strip_last:
        set first bit to 1
    else:
        leave first bit as 0
    return pack_bits_to_bytes(bits)
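To make the padding rule concrete, here is a runnable Python sketch of LOWER_SPECIAL encoding and decoding. It assumes that the leading flag bit is counted in the total bit length and that bits are packed most-significant-bit first; both are assumptions of this sketch rather than statements confirmed by the text above.

```python
LOWER_SPECIAL_CHARS = "abcdefghijklmnopqrstuvwxyz._$|"  # codes 0..29
BITS_PER_CHAR = 5


def encode_lower_special(text: str) -> bytes:
    total_bits = 1 + len(text) * BITS_PER_CHAR  # flag bit + payload
    byte_length = (total_bits + 7) // 8
    value = 0
    for ch in text:
        value = (value << BITS_PER_CHAR) | LOWER_SPECIAL_CHARS.index(ch)
    padding = byte_length * 8 - total_bits
    if padding >= BITS_PER_CHAR:
        # Padding is long enough to look like one more character.
        value |= 1 << (total_bits - 1)
    return (value << padding).to_bytes(byte_length, "big")


def decode_lower_special(data: bytes) -> str:
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    count = (total_bits - 1) // BITS_PER_CHAR
    if data and data[0] & 0x80:  # strip-last flag
        count -= 1
    chars = []
    for i in range(count):
        shift = total_bits - 1 - BITS_PER_CHAR * (i + 1)
        chars.append(LOWER_SPECIAL_CHARS[(value >> shift) & 0x1F])
    return "".join(chars)
```

With these assumptions, `encode_lower_special("ab")` produces two bytes, `80 20`: 11 meaningful bits plus 5 padding bits, so the strip flag is set.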
For strings like MyFieldName where only the first character is uppercase:
function encode_first_to_lower_special(str):
    // Convert first char to lowercase
    modified = str[0].lower() + str[1:]
    // Then use LOWER_SPECIAL encoding
    return encode_lower_special(modified)
For strings with multiple uppercase characters like MyTypeName:
function encode_all_to_lower_special(str):
    result = ""
    for char in str:
        if char.is_upper():
            result += "|" + char.lower()  // Escape uppercase with |
        else:
            result += char
    return encode_lower_special(result)
Example: MyType → |my|type → encoded with LOWER_SPECIAL
function choose_encoding(str):
    if all chars in str are in [a-z . _ $ |]:
        return LOWER_SPECIAL
    if first char is uppercase AND rest are in [a-z . _]:
        return FIRST_TO_LOWER_SPECIAL
    if all chars are in [a-z A-Z . _]:
        lower_special_size = encode_all_to_lower_special(str).size
        luds_size = encode_lower_upper_digit_special(str).size
        if lower_special_size <= luds_size:
            return ALL_TO_LOWER_SPECIAL
        else:
            return LOWER_UPPER_DIGIT_SPECIAL
    if all chars are in [a-z A-Z 0-9 . _]:
        return LOWER_UPPER_DIGIT_SPECIAL
    return UTF8
Meta strings are written with a header that includes the encoding type:
| 3 bits encoding | 5+ bits length | encoded bytes |
Or for larger strings:
| varuint: (length << 3) | encoding | encoded bytes |
Different contexts use different special characters:
| Context | Special Chars | Notes |
|---|---|---|
| Field Name | . _ $ \| | $ for inner classes, \| for escape |
| Namespace | . _ | Package/module separators |
| Type Name | $ _ | $ for inner classes in Java |
Meta strings are deduplicated within a serialization session:
First occurrence: | (length << 1) | [hash if large] | encoding | bytes |
Reference:        | ((id + 1) << 1) | 1 |

- false, 1 for true
- b & 0x80 == 0x80), then the next byte should be read until a byte with unset continuation bit.

Encoding Algorithm:
function write_varuint32(value):
    while value >= 0x80:
        buffer.write_byte((value & 0x7F) | 0x80)  // 7 bits of data + continuation bit
        value = value >> 7
    buffer.write_byte(value)  // final byte without continuation bit
Decoding Algorithm:
function read_varuint32():
    result = 0
    shift = 0
    while true:
        byte = buffer.read_byte()
        result = result | ((byte & 0x7F) << shift)
        if (byte & 0x80) == 0:
            break
        shift = shift + 7
    return result
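The two algorithms above translate directly into Python. The function signatures (a `bytearray` sink and a `(value, next_position)` return) are choices made for this sketch, not a prescribed API.

```python
from typing import Tuple


def write_varuint32(out: bytearray, value: int) -> None:
    """Append value as a varint: 7 data bits per byte, high bit = continue."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value out of range for unsigned 32-bit")
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_varuint32(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7
```

As a worked case, 300 encodes to the two bytes `AC 02`: the low seven bits `0101100` with the continuation bit set, followed by the remaining value 2.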
Byte sizes by value range:
| Value Range | Bytes |
|---|---|
| 0 ~ 127 | 1 |
| 128 ~ 16383 | 2 |
| 16384 ~ 2097151 | 3 |
| 2097152 ~ 268435455 | 4 |
| 268435456 ~ 4294967295 | 5 |
ZigZag Encoding:
// Encode: convert signed to unsigned
zigzag_value = (value << 1) ^ (value >> 31)
// Decode: convert unsigned back to signed
original = (zigzag_value >> 1) ^ (-(zigzag_value & 1))
// Or equivalently:
original = (zigzag_value >> 1) ^ (~(zigzag_value & 1) + 1)
ZigZag encoding maps signed integers to unsigned integers so that small absolute values (positive or negative) have small encoded values:
| Original | ZigZag Encoded |
|---|---|
| 0 | 0 |
| -1 | 1 |
| 1 | 2 |
| -2 | 3 |
| 2 | 4 |
| ... | ... |
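The mapping in the table can be checked with a short Python sketch. Because Python integers are unbounded, the encoder masks the result to 32 bits to mimic fixed-width arithmetic; the function names are illustrative only.

```python
def zigzag_encode32(value: int) -> int:
    """Map a signed 32-bit integer to an unsigned one."""
    # value >> 31 is 0 for non-negative values and -1 for negative ones.
    return ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF


def zigzag_decode32(encoded: int) -> int:
    """Inverse of zigzag_encode32."""
    return (encoded >> 1) ^ -(encoded & 1)
```

Encoding 0, -1, 1, -2, 2 in turn gives 0, 1, 2, 3, 4, matching the table row by row.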
Fory supports two encoding schemes for 64-bit integers:
Fory SLI (Small Long as Int) Encoding:
Optimized for values that fit in 31 bits (common case for IDs, timestamps, etc.):
if value in [0, 2147483647]:  // fits in 31 bits
    write 4 bytes: ((int32) value) << 1  // bit 0 is 0, indicating 4-byte encoding
else:
    write 1 byte: 0x01  // bit 0 is 1, indicating 9-byte encoding
    write 8 bytes: value as little-endian int64
Reading:
first_byte = peek_byte()  // inspect the first byte without consuming it
if (first_byte & 1) == 0:
    return read_uint32_le() >> 1  // 4-byte encoding
else:
    skip 1 byte  // flag byte
    return read_int64_le()  // read remaining 8 bytes
Fory PVL (Progressive Variable-Length) Encoding:
Standard varint encoding extended to 64 bits:
function write_varuint64(value):
    while value >= 0x80:
        buffer.write_byte((value & 0x7F) | 0x80)
        value = value >> 7
    buffer.write_byte(value)
| Value Range | Bytes |
|---|---|
| 0 ~ 127 | 1 |
| 128 ~ 16383 | 2 |
| ... | ... |
| 2^56 ~ 2^63-1 | 9 |
A specialized encoding used for string headers that combines size (up to 36 bits) with encoding flags:
// Write: encodes (size << 2) | encoding_flags
function write_varuint36_small(value):
    if value < 0x80:
        buffer.write_byte(value)
    else:
        // Standard varint encoding for values >= 128
        write_varuint64(value)
This encoding is optimized for the common case where string length fits in 7 bits (strings < 32 characters).
Fory SLI (Small Long as Int) Encoding for signed:
Optimized for small signed values:
if value in [-1073741824, 1073741823]:  // fits in 31 bits signed
    write 4 bytes: ((int32) value) << 1  // bit 0 is 0
else:
    write 1 byte: 0x01  // bit 0 is 1
    write 8 bytes: value as little-endian int64
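A runnable Python sketch of the signed SLI scheme follows. The reader shown here is an assumption derived from the writer: it inspects the low bit of the first byte and, on the long path, consumes one flag byte plus eight value bytes, so that the long form is nine bytes in total.

```python
import struct
from typing import Tuple

SLI_MIN, SLI_MAX = -(1 << 30), (1 << 30) - 1


def write_sli_int64(out: bytearray, value: int) -> None:
    if SLI_MIN <= value <= SLI_MAX:
        # Shifted value fits in a signed int32 and has bit 0 cleared.
        out.extend(struct.pack("<i", value << 1))
    else:
        out.append(0x01)  # bit 0 set: full 8-byte value follows
        out.extend(struct.pack("<q", value))


def read_sli_int64(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next position)."""
    if data[pos] & 1 == 0:
        (small,) = struct.unpack_from("<i", data, pos)
        return small >> 1, pos + 4  # arithmetic shift keeps the sign
    (big,) = struct.unpack_from("<q", data, pos + 1)
    return big, pos + 9
```

Small values such as -1 take 4 bytes, while 1073741824 (just outside the 31-bit signed range) takes 9.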
Fory PVL (Progressive Variable-Length) Encoding for signed:
Uses ZigZag encoding first, then varint:
// Encode
zigzag_value = (value << 1) ^ (value >> 63)
write_varuint64(zigzag_value)
// Decode
zigzag_value = read_varuint64()
value = (zigzag_value >> 1) ^ (-(zigzag_value & 1))
Format:
| varuint36_small: (size << 2) | encoding | binary data |
The header is encoded using varuint36_small format, which combines the byte length and encoding type:
header = (byte_length << 2) | encoding_type
| Encoding Type | Value | Description |
|---|---|---|
| LATIN1 | 0 | ISO-8859-1 single-byte encoding |
| UTF16 | 1 | UTF-16 encoding (2 bytes per code unit) |
| UTF8 | 2 | UTF-8 variable-length encoding |
| Reserved | 3 | Reserved for future use |
Writing:
function write_string(str):
    bytes = encode_to_bytes(str, chosen_encoding)
    header = (bytes.length << 2) | encoding_type
    buffer.write_varuint36_small(header)
    buffer.write_bytes(bytes)
Reading:
function read_string():
    header = buffer.read_varuint36_small()
    encoding = header & 0x03
    byte_length = header >> 2
    bytes = buffer.read_bytes(byte_length)
    return decode_bytes(bytes, encoding)
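The string format can be exercised end to end with the Python sketch below. Two points are assumptions of this example: the writer picks LATIN1 when possible and UTF8 otherwise (one reasonable strategy for Python), and UTF16 payloads are treated as little endian, in line with the overall byte order.

```python
from typing import Tuple

LATIN1, UTF16, UTF8 = 0, 1, 2
CODECS = {LATIN1: "latin-1", UTF16: "utf-16-le", UTF8: "utf-8"}


def write_varuint(out: bytearray, value: int) -> None:
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_varuint(data: bytes, pos: int) -> Tuple[int, int]:
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def write_string(out: bytearray, text: str) -> None:
    try:
        encoding, payload = LATIN1, text.encode("latin-1")
    except UnicodeEncodeError:
        encoding, payload = UTF8, text.encode("utf-8")
    write_varuint(out, (len(payload) << 2) | encoding)  # header
    out.extend(payload)


def read_string(data: bytes, pos: int = 0) -> Tuple[str, int]:
    header, pos = read_varuint(data, pos)
    encoding, byte_length = header & 0x03, header >> 2
    payload = data[pos:pos + byte_length]
    return payload.decode(CODECS[encoding]), pos + byte_length
```

For the string "hi" the header is `(2 << 2) | 0 = 8`, so the output is the three bytes `08 68 69`.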
Writing:
| Language | Encoding Strategy |
|---|---|
| Java (JDK8) | Detect at runtime: LATIN1 if all chars < 256, else UTF16 |
| Java (JDK9+) | Use String's internal coder: LATIN1 or UTF16 |
| Python | Can write LATIN1, UTF16, or UTF8 based on string content |
| C++ | UTF8 (std::string) or UTF16 (std::u16string) |
| Rust | UTF8 (String) |
| Go | UTF8 (string) |
| JavaScript | UTF8 |
Reading: All languages support decoding all three encodings (LATIN1, UTF16, UTF8).
Recommendation: Select encoding based on maximum performance - use the encoding that matches the language's native string representation to avoid conversion overhead.
Empty strings are encoded with a header whose length bits are 0 (for example, header 0 for LATIN1; any encoding is acceptable) followed by no data bytes.
Duration is an absolute length of time, independent of any calendar/timezone, as a count of seconds and nanoseconds.
Format:
| signed varint64: seconds | signed int32: nanoseconds |
- seconds: Number of seconds in the duration, encoded as a signed varint64. Can be positive or negative.
- nanoseconds: Nanosecond adjustment to the duration, encoded as a signed int32. Value range is [0, 999,999,999] for positive durations, and [-999,999,999, 0] for negative durations.

Notes:
Format:
| varuint32: length | 1 byte elements header | [optional type info] | elements data |
The elements header is a single byte that encodes metadata about the collection elements to optimize serialization:
| bit 7-4 (reserved) |    bit 3    |      bit 2       |  bit 1   |   bit 0   |
+--------------------+-------------+------------------+----------+-----------+
|      reserved      | is_same_type| is_decl_elem_type| has_null | track_ref |
| Bit | Name | Value | Meaning when SET (1) | Meaning when UNSET (0) |
|---|---|---|---|---|
| 0 | track_ref | 0x01 | Track references for elements | Don't track element references |
| 1 | has_null | 0x02 | Collection may contain null elements | No null elements (skip null checks) |
| 2 | is_decl_elem_type | 0x04 | Elements are the declared generic type | Element types differ from declared type |
| 3 | is_same_type | 0x08 | All elements have the same runtime type | Elements have different runtime types |
Common header values:
| Header (hex) | Decimal | Meaning |
|---|---|---|
| 0x0C | 12 | Declared type + same type, non-null, no ref tracking (optimal) |
| 0x0D | 13 | Declared type + same type, non-null, with ref tracking |
| 0x0E | 14 | Declared type + same type, may have nulls, no ref tracking |
| 0x08 | 8 | Same type but not declared type (type info written once) |
| 0x00 | 0 | Different types, non-null, no ref tracking (type per element) |
When is_decl_elem_type (bit 2) is NOT set, the element type info is written once after the header if is_same_type (bit 3) is set:
| header (0x08) | type_id (varuint32) | elements... |
When both is_decl_elem_type and is_same_type are NOT set, type info is written per element.
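A minimal Python sketch of how a writer might compute the elements header is shown below. The function `elements_header` is hypothetical, and it simplifies two points: "same type" is decided by exact runtime type of the non-null elements, and the declared-type bit is only set when that single runtime type equals the declared type.

```python
TRACK_REF = 0x01
HAS_NULL = 0x02
IS_DECL_ELEM_TYPE = 0x04
IS_SAME_TYPE = 0x08


def elements_header(elems, declared_type=None, track_ref=False) -> int:
    """Compute the one-byte elements header for a collection."""
    header = TRACK_REF if track_ref else 0
    if any(e is None for e in elems):
        header |= HAS_NULL
    runtime_types = {type(e) for e in elems if e is not None}
    if len(runtime_types) == 1:
        header |= IS_SAME_TYPE
        if declared_type in runtime_types:
            header |= IS_DECL_ELEM_TYPE
    return header
```

With a declared element type of `int`, the list `[1, 2, 3]` yields 0x0C, while `[1, "a"]` with no declared type yields 0x00, so type info would be written per element.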
The header determines how each element is serialized:
Based on the elements header, the serialization of elements data may skip ref flag/null flag/element type info.
fory = ...
buffer = ...
elems = ...
if element_type_is_same:
    if not is_declared_type:
        fory.write_type(buffer, elem_type)
    elem_serializer = get_serializer(...)
    if track_ref:
        for elem in elems:
            if not ref_resolver.write_ref_or_null(buffer, elem):
                elem_serializer.write(buffer, elem)
    elif has_null:
        for elem in elems:
            if elem is None:
                buffer.write_byte(null_flag)
            else:
                buffer.write_byte(not_null_flag)
                elem_serializer.write(buffer, elem)
    else:
        for elem in elems:
            elem_serializer.write(buffer, elem)
else:
    if track_ref:
        for elem in elems:
            fory.write_ref(buffer, elem)
    elif has_null:
        for elem in elems:
            fory.write_nullable(buffer, elem)
    else:
        for elem in elems:
            fory.write_value(buffer, elem)
CollectionSerializer#writeElements can be taken as an example.
Primitive arrays are treated as a binary buffer: serialization just writes the array size as an unsigned int, then copies the whole buffer into the stream.
Such serialization won't compress the array. If users want to compress a primitive array, they need to register custom serializers for such types or mark it as a list type.
Tensor is a special primitive multi-dimensional array in which all dimensions have the same size and type. The serialization format is:
| num_dims(unsigned varint) | shape[0](unsigned varint) | shape[...] | shape[N] | element type | data |
The data is contiguous to reduce copying, and may be zero-copy in some cases.
Object arrays are serialized using the list format. The object component type is taken as the list element generic type.
Map uses a chunk-based format to handle heterogeneous key-value pairs efficiently:
| varuint32: total_size | chunk_1 | chunk_2 | ... | chunk_n |
Each chunk contains up to 255 key-value pairs with the same metadata characteristics:
|    1 byte    |     1 byte     |        variable bytes        |
+--------------+----------------+------------------------------+
|  KV header   |  chunk size N  | N key-value pairs (N*2 obj)  |
The KV header is a single byte encoding metadata for both keys and values:
|  bit 7-6   |     bit 5     |    bit 4     |     bit 3     |     bit 2     |    bit 1     |     bit 0     |
+------------+---------------+--------------+---------------+---------------+--------------+---------------+
|  reserved  | val_decl_type | val_has_null | val_track_ref | key_decl_type | key_has_null | key_track_ref |
| Bit | Name | Value | Meaning when SET (1) |
|---|---|---|---|
| 0 | key_track_ref | 0x01 | Track references for keys |
| 1 | key_has_null | 0x02 | Keys may be null (rare, usually invalid) |
| 2 | key_decl_type | 0x04 | Key is the declared generic type |
| 3 | val_track_ref | 0x08 | Track references for values |
| 4 | val_has_null | 0x10 | Values may be null |
| 5 | val_decl_type | 0x20 | Value is the declared generic type |
Common KV header values:
| Header (hex) | Decimal | Meaning |
|---|---|---|
| 0x24 | 36 | Key + value are declared types, non-null, no ref tracking (optimal) |
| 0x2C | 44 | Key + value declared types, value tracks refs |
| 0x34 | 52 | Key + value declared types, value may be null |
| 0x00 | 0 | Key + value not declared types, non-null, no ref tracking |
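The bit assignments above can be unpacked with a small Python helper. `decode_kv_header` is a hypothetical name used for illustration; it simply maps each of the six defined bits to a boolean.

```python
KV_HEADER_BITS = [
    "key_track_ref",  # bit 0, 0x01
    "key_has_null",   # bit 1, 0x02
    "key_decl_type",  # bit 2, 0x04
    "val_track_ref",  # bit 3, 0x08
    "val_has_null",   # bit 4, 0x10
    "val_decl_type",  # bit 5, 0x20
]


def decode_kv_header(header: int) -> dict:
    """Expand a KV header byte into named boolean flags."""
    return {name: bool(header & (1 << bit)) for bit, name in enumerate(KV_HEADER_BITS)}
```

Decoding 0x2C, for example, reports `key_decl_type`, `val_track_ref` and `val_decl_type` as set, which matches the second row of the table.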
Map iteration is expensive. Computing a single header for all pairs would require two passes. The chunk-based approach allows:
Fory uses the first key-value pair to predict the header optimistically, so it can't know how many pairs share the same meta (key ref tracking, key has null, and so on). If we didn't write chunk by chunk with a maximum chunk size, we would have to reserve at least X bytes as a placeholder to update later with the number of pairs sharing the same meta, where X is the number of bytes needed for the varint encoding of the map size.
Most maps are smaller than 255 entries, and if all pairs have the same meta, there will be only one chunk. This is common in Go/Rust, where objects are not references by default.
Also, if only one or two keys have different meta, we can make it into a different chunk, so that most pairs can share meta.
The implementation can accumulate read count with map size to decide whether to read more chunks.
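The chunking rule above can be sketched as a single pass that starts a new chunk whenever the header changes or the current chunk already holds 255 pairs. This is an illustration under stated assumptions, not Fory's implementation: `Chunk`, `plan_chunks`, and `count_chunks_to_read` are hypothetical names, and the per-pair headers are passed in precomputed, whereas a real writer would derive each header while iterating the map.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One chunk: a KV header plus the number of pairs sharing it (1..255).
struct Chunk {
  std::uint8_t header;
  std::uint8_t size;
};

// Groups consecutive pairs with identical headers into chunks of at most
// 255 pairs. pair_headers[i] is the KV header of the i-th pair in
// iteration order (assumed to be computed by the caller).
inline std::vector<Chunk> plan_chunks(
    const std::vector<std::uint8_t>& pair_headers) {
  std::vector<Chunk> chunks;
  for (std::uint8_t h : pair_headers) {
    if (!chunks.empty() && chunks.back().header == h &&
        chunks.back().size < 255) {
      ++chunks.back().size;  // same metadata, room left: extend the chunk
    } else {
      chunks.push_back(Chunk{h, 1});  // metadata changed or chunk is full
    }
  }
  return chunks;
}

// Reader side: consume chunks until the accumulated pair count reaches the
// map size that precedes the first chunk on the wire.
inline std::size_t count_chunks_to_read(const std::vector<Chunk>& chunks,
                                        std::size_t map_size) {
  std::size_t pairs_read = 0;
  std::size_t chunks_read = 0;
  while (pairs_read < map_size && chunks_read < chunks.size()) {
    pairs_read += chunks[chunks_read].size;
    ++chunks_read;
  }
  return chunks_read;
}
```

A map of 300 pairs with identical metadata produces two chunks (255 and 45 pairs), while a single pair with different metadata in the middle of a small map splits it into three chunks.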
Enums are serialized as an unsigned varint. If the order of enum values changes, the deserialized enum value may not be the value users expect. In such cases, users must register an enum serializer that writes the enum value as an enumerated string with unique hash disabled.
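To make the ordinal dependency concrete, here is a minimal sketch that writes an enum as an unsigned varint. It assumes the common base-128 layout (7 data bits per byte, low-order group first, high bit as a continuation flag); the authoritative layout is the one defined by the varint part of this specification. The `Color` enum and `write_enum` helper are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends value as an unsigned varint: 7 data bits per byte, low-order
// group first, high bit set on every byte except the last (assumed layout).
inline void write_varuint32(std::vector<std::uint8_t>& out,
                            std::uint32_t value) {
  while (value >= 0x80) {
    out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<std::uint8_t>(value));
}

// Reads one unsigned varint starting at pos and advances pos past it.
inline std::uint32_t read_varuint32(const std::vector<std::uint8_t>& in,
                                    std::size_t& pos) {
  std::uint32_t result = 0;
  int shift = 0;
  while (pos < in.size() && shift < 35) {  // at most 5 bytes for 32 bits
    std::uint8_t b = in[pos++];
    result |= static_cast<std::uint32_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0) break;
    shift += 7;
  }
  return result;
}

// Hypothetical enum: only the ordinal reaches the wire, so reordering the
// enumerators changes the meaning of previously written data.
enum class Color : std::uint32_t { Red = 0, Green = 1, Blue = 2 };

inline void write_enum(std::vector<std::uint8_t>& out, Color c) {
  write_varuint32(out, static_cast<std::uint32_t>(c));
}
```

With this layout, `Color::Blue` is written as the single byte 0x02. A peer that declares `Blue` in a different position would decode that byte as a different enumerator, which is the failure mode described above.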
Not supported for now.
A struct is an object of a class/pojo/struct/bean/record type. A struct is serialized by writing its field data in Fory order.
Depending on schema compatibility, structs will have different formats.
Fields are ordered as follows; each group of fields has its own order:
If two fields have the same type, they are sorted by their snake_case field name.
Object will be written as:
|    4 byte     |  variable bytes  |
+---------------+------------------+
|   type hash   |   field values   |
The type hash is used to check type schema consistency across languages. It is the first 32 bits of the 56-bit value of the type meta.
Object fields are serialized one by one using the following format:
not null primitive field value:
|   var bytes    |
+----------------+
|   value data   |
+----------------+
nullable primitive field value:
| one byte  |   var bytes   |
+-----------+---------------+
| null flag |  field value  |
+-----------+---------------+
other internal types supported by fory
| var bytes | var objects |
+-----------+-------------+
| null flag | value data  |
+-----------+-------------+
list field type:
| one byte  | var objects |
+-----------+-------------+
| ref meta  | value data  |
set field type:
| one byte  | var objects |
+-----------+-------------+
| ref meta  | value data  |
map field type:
| one byte  | var objects |
+-----------+-------------+
| ref meta  | value data  |
+-----------+-------------+-------------+
other types such as enum/struct/ext
| one byte  | var bytes  | var objects|
+-----------+------------+------------+
| ref flag  | type meta  | value data |
+-----------+------------+------------+
Type hash algorithm:

- Start with the empty string `""`.
- For each field, append:
  - `snake_case(field_name),`. For camelCase names, convert the name to snake_case first.
  - `$type_id,`. For other fields, use the type id `TypeId::UNKNOWN` instead.
  - `$nullable;`, which is `1` if nullable, `0` otherwise.

Schema evolution mode has a similar format to schema consistent mode for objects, except:

- Schema consistent mode writes the type by id only, but schema evolution mode also writes a type consisting of field names, types and other meta, see Type meta.
- The final custom type needs to be written too, because peers may not have this type defined.

Type will be serialized using type meta format.
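Returning to the type hash algorithm listed earlier, the string that gets hashed can be sketched as below. This covers only the string-building step; the hash function applied to the result is not shown. `FieldDesc`, `to_snake_case`, and `type_fingerprint` are hypothetical names, the fields are assumed to be already sorted in Fory order, the camelCase conversion rule is a simple assumption (an underscore before each non-leading uppercase letter), and the type ids in the usage note are placeholders rather than real Fory type ids.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical field descriptor. The caller is assumed to have sorted the
// fields and to have substituted the UNKNOWN type id where required.
struct FieldDesc {
  std::string name;
  std::uint32_t type_id;
  bool nullable;
};

// Converts camelCase to snake_case: an underscore is inserted before every
// uppercase letter except at position 0, and the letter is lowercased.
inline std::string to_snake_case(const std::string& name) {
  std::string out;
  for (std::size_t i = 0; i < name.size(); ++i) {
    unsigned char c = static_cast<unsigned char>(name[i]);
    if (std::isupper(c)) {
      if (i > 0) out.push_back('_');
      out.push_back(static_cast<char>(std::tolower(c)));
    } else {
      out.push_back(static_cast<char>(c));
    }
  }
  return out;
}

// Builds "<snake_name>,<type_id>,<nullable>;" for every field, in order.
inline std::string type_fingerprint(const std::vector<FieldDesc>& fields) {
  std::string s;  // starts as the empty string
  for (const FieldDesc& f : fields) {
    s += to_snake_case(f.name);
    s += ',';
    s += std::to_string(f.type_id);
    s += ',';
    s += f.nullable ? '1' : '0';
    s += ';';
  }
  return s;
}
```

For two fields `userId` (placeholder type id 4, not nullable) and `name` (placeholder type id 12, nullable), this produces `user_id,4,0;name,12,1;`. Because the string depends on field names, type ids, and nullability, any mismatch between peers changes the hash.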
For type evolution, the serializer will encode the type meta into the serialized data. The deserializer will compare this meta with class meta in the current process, and use the diff to determine how to deserialize the data.
For Java/JavaScript/Python, we can use the diff to generate serializer code at runtime and load it as a class or function for deserialization. In this way, type evolution will be as fast as schema consistent mode.
For C++/Rust, we can't generate the serializer code at runtime, so the code must be generated at compile time using metaprogramming. At that point the type schema in other processes is unknown, so a serializer for such inconsistent types cannot be generated in advance. Instead, the generated code may need a loop that compares field names one by one to decide whether to deserialize and assign a field or skip its value.
One fast approach is to turn the string comparison into jump instructions:

- Assume the current type has n fields, and the peer type has n1 fields.
- Assign a field id starting from 0 to every sorted field in the current type at compile time.
- Compare the received type meta with the current type: a peer field whose name matches a local field takes that field's id, and every other peer field takes a new id starting from n. Cache this meta at runtime.
- Use a switch on the field id to deserialize the data and assign or skip the field value. Contiguous field ids are optimized into a jump in the switch block, so this is very fast.

Here is an example. Suppose process A has a class Foo with version 1 defined as Foo1, and process B has a class Foo with version 2 defined as Foo2:
// class Foo with version 1
class Foo1 {
  int32_t v1;      // id 0
  std::string v2;  // id 1
};

// class Foo with version 2
class Foo2 {
  // id 0, but will have id 2 in process A
  bool v0;
  // id 1, but will have id 0 in process A
  int32_t v1;
  // id 2, but will have id 3 in process A
  int64_t long_value;
  // id 3, but will have id 1 in process A
  std::string v2;
  // id 4, but will have id 4 in process A
  std::vector<std::string> list;
};
When process A receives a serialized Foo2 from process B, it deserializes the data as follows:
Foo1 foo1 = ...;
const std::vector<fory::FieldInfo> &field_infos = type_meta.field_infos;
for (const auto &field_info : field_infos) {
  switch (field_info.field_id) {
    case 0:
      foo1.v1 = buffer.read_varint32();
      break;
    case 1:
      foo1.v2 = fory.read_string();
      break;
    default:
      fory.skip_data(field_info);
  }
}
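The field ids consumed by the switch above have to be computed once per received type meta. A minimal sketch of that mapping, assuming fields are matched by name only and using a hypothetical `assign_field_ids` function (not part of the Fory API), might be:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Maps each peer field, in the order of the received type meta, to a field
// id. local_sorted holds the current type's field names in sorted order, so
// the index of a name is its compile-time field id.
inline std::vector<int> assign_field_ids(
    const std::vector<std::string>& local_sorted,
    const std::vector<std::string>& peer_fields) {
  std::unordered_map<std::string, int> local_ids;
  for (std::size_t i = 0; i < local_sorted.size(); ++i) {
    local_ids[local_sorted[i]] = static_cast<int>(i);
  }
  // Fields unknown to the current type get ids starting from n, so they
  // fall through to the default (skip) branch of the switch.
  int next_unknown = static_cast<int>(local_sorted.size());
  std::vector<int> ids;
  ids.reserve(peer_fields.size());
  for (const std::string& name : peer_fields) {
    auto it = local_ids.find(name);
    ids.push_back(it != local_ids.end() ? it->second : next_unknown++);
  }
  return ids;
}
```

For the Foo1/Foo2 example, the local names `v1, v2` and the peer names `v0, v1, long_value, v2, list` yield the ids `2, 0, 3, 1, 4`, matching the comments in the Foo2 definition. The result would be cached per peer type so the name lookups are not repeated for every object.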
This section provides a step-by-step guide for implementing Fory xlang serialization in a new language.
Buffer Implementation

- write_int8, write_int16, write_int32, write_int64
- write_float32, write_float64
- read_* counterparts for all write methods

Varint Encoding

- write_varuint32 / read_varuint32
- write_varint32 / read_varint32 (with ZigZag)
- write_varuint64 / read_varuint64
- write_varint64 / read_varint64 (with ZigZag)
- write_varuint36_small / read_varuint36_small (for strings)

Header Handling

- 0x62d4

Primitive Types

String Serialization

- (byte_length << 2) | encoding

Temporal Types
Reference Tracking
List/Array Serialization
Map Serialization
Set Serialization
Meta strings are required for enum and struct serialization (encoding field names, type names, namespaces).
Type Registration

- (user_id << 8) | internal_type_id

Field Ordering
Schema Consistent Mode
Schema Evolution Mode (Optional)
- Java: ThreadSafeFory wrapper
- Python: id(obj) for reference tracking
- Python: dataclass support via code generation
- C++: FORY_STRUCT, FORY_FIELD_INFO
- C++: std::shared_ptr for reference tracking
- Rust: #[derive(ForyObject)]
- Rust: Rc<T> / Arc<T> for reference tracking
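Several items in the implementation checklist reduce to small bit-level helpers. The sketch below illustrates three of them: ZigZag mapping for the signed varint variants, the string header `(byte_length << 2) | encoding`, and the type id `(user_id << 8) | internal_type_id`. All function names are hypothetical, the 2-bit mask on the encoding is an assumption inferred from the shift amount, and the concrete encoding and type id values are defined elsewhere in the specification.

```cpp
#include <cstdint>

// ZigZag maps signed integers to unsigned ones so that small magnitudes
// stay small as varints: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
inline std::uint32_t zigzag_encode32(std::int32_t v) {
  return (static_cast<std::uint32_t>(v) << 1) ^
         (v < 0 ? 0xFFFFFFFFu : 0u);
}

// Inverse of zigzag_encode32. The final cast relies on two's complement
// conversion, which is guaranteed from C++20 and universal in practice.
inline std::int32_t zigzag_decode32(std::uint32_t u) {
  return static_cast<std::int32_t>((u >> 1) ^ (0u - (u & 1u)));
}

// String header: byte length in the upper bits, encoding in the low 2 bits.
// Masking the encoding to 2 bits is an assumption based on the shift by 2.
inline std::uint64_t string_header(std::uint64_t byte_length,
                                   std::uint8_t encoding) {
  return (byte_length << 2) | (encoding & 0x3u);
}

// Combined type id: user-registered id in the upper bits, internal type id
// in the low 8 bits.
inline std::uint32_t full_type_id(std::uint32_t user_id,
                                  std::uint8_t internal_type_id) {
  return (user_id << 8) | internal_type_id;
}
```

As a worked example, a 5-byte string with encoding value 2 has the header (5 << 2) | 2 = 22, and user id 1 combined with internal type id 0x0F gives 0x10F.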