Fury Java Serialization is an automatic object serialization framework that supports reference and polymorphism. Fury will convert an object from/to fury java serialization binary format. Fury has two core concepts for java serialization:
The serialization format is a dynamic binary format. The dynamics and the reference/polymorphism support make Fury flexible and much easier to use, but they also make the format more complex than that of static serialization frameworks.
Here is the overall format:
| fury header | object ref meta | object class meta | object value data |
The data are serialized using little endian byte order overall. If byte swapping is costly for some object, Fury will write the byte order for that object into the data instead of converting it to little endian.
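As a minimal sketch of what little endian order means for a fixed-width value, the snippet below writes an `int` with the JDK's `ByteBuffer`. The class and method names (`LittleEndianSketch`, `writeIntLE`) are hypothetical and are not part of Fury.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class LittleEndianSketch {
  /** Writes a 4-byte int with the least significant byte first. */
  static byte[] writeIntLE(int value) {
    return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
  }

  public static void main(String[] args) {
    byte[] b = writeIntLE(0x01020304);
    // The least significant byte (0x04) is stored at index 0.
    System.out.printf("%02x %02x %02x %02x%n", b[0], b[1], b[2], b[3]);
  }
}
```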
The Fury header starts with one byte:
|    4 bits     | 1 bit | 1 bit | 1 bit  | 1 bit |          optional 4 bytes          |
+---------------+-------+-------+--------+-------+------------------------------------+
| reserved bits |  oob  | xlang | endian | null  | unsigned int for meta start offset |
- oob flag: 1 when BufferCallback is not null, 0 otherwise.

If meta share mode is enabled, an uncompressed unsigned int is appended to indicate the start offset of metadata.
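The header byte can be sketched as a small bit mask. This is only an illustration: the bit positions are an assumption made by reading the layout table right to left (so `null` is the lowest bit), and the names `FuryHeaderSketch`, `headerByte` and the flag constants are hypothetical.

```java
final class FuryHeaderSketch {
  // Assumed bit positions: the rightmost table column is taken as bit 0.
  static final int NULL_FLAG = 1;
  static final int ENDIAN_FLAG = 1 << 1;
  static final int XLANG_FLAG = 1 << 2;
  static final int OOB_FLAG = 1 << 3;

  /** Packs the four flags into the low 4 bits; the 4 reserved bits stay zero. */
  static int headerByte(boolean isNull, boolean endian, boolean xlang, boolean oob) {
    int header = 0;
    if (isNull) header |= NULL_FLAG;
    if (endian) header |= ENDIAN_FLAG;
    if (xlang) header |= XLANG_FLAG;
    if (oob) header |= OOB_FLAG;
    return header;
  }

  static boolean isOob(int header) {
    return (header & OOB_FLAG) != 0;
  }

  public static void main(String[] args) {
    int header = headerByte(false, true, false, true);
    System.out.println(Integer.toBinaryString(header));
  }
}
```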
Reference tracking handles whether the object is null, and whether to track reference for the object by writing corresponding flags and maintaining internal state.
Reference flags:
Flag | Byte Value | Description |
---|---|---|
NULL FLAG | -3 | This flag indicates the object is a null value. We don't use another byte to indicate REF, so that we can save one byte. |
REF FLAG | -2 | This flag indicates the object was already serialized previously, and Fury will write a ref id in unsigned varint format instead of serializing it again |
NOT_NULL VALUE FLAG | -1 | This flag indicates the object is a non-null value and fury doesn't track ref for this type of object. |
REF VALUE FLAG | 0 | This flag indicates the object is referenceable and is being serialized for the first time. |
When reference tracking is disabled globally or for specific types, or for certain types within a particular context (e.g., a field of a class), only the NULL and NOT_NULL VALUE flags will be used for reference meta.
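The flag selection described above can be sketched as follows. This is a simplified illustration, not Fury's implementation: `RefFlagSketch` and `writeRefOrNull` are hypothetical names, ref ids are assumed to be assigned sequentially from 0, and flags and ids are collected in a list instead of being written as bytes and varints.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class RefFlagSketch {
  static final byte NULL_FLAG = -3;
  static final byte REF_FLAG = -2;
  static final byte NOT_NULL_VALUE_FLAG = -1;
  static final byte REF_VALUE_FLAG = 0;

  // Identity-based lookup: two equal but distinct objects get separate ids.
  private final Map<Object, Integer> written = new IdentityHashMap<>();
  // Flags and ref ids, kept as ints for readability.
  final List<Integer> out = new ArrayList<>();

  /** Returns true when the caller still has to serialize the object's value. */
  boolean writeRefOrNull(Object obj, boolean trackRef) {
    if (obj == null) {
      out.add((int) NULL_FLAG);
      return false;
    }
    if (!trackRef) {
      out.add((int) NOT_NULL_VALUE_FLAG);
      return true;
    }
    Integer id = written.get(obj);
    if (id != null) {
      out.add((int) REF_FLAG);
      out.add(id); // a real writer would emit this as an unsigned varint
      return false;
    }
    written.put(obj, written.size()); // assumed: ids grow from 0
    out.add((int) REF_VALUE_FLAG);
    return true;
  }
}
```

Writing the same instance twice with tracking enabled produces `REF VALUE FLAG` the first time and `REF FLAG` plus the id the second time, so the value bytes appear only once.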
Fury supports registering a class with an optional id; the registration can be used for security checks and class identification. If a class is registered, it will have a user-provided or auto-growing unsigned int, i.e. class_id.
Depending on whether meta share mode and registration are enabled for the current class, Fury will write class meta differently.
If schema consistent mode is enabled globally or enabled for current class, class meta will be written as follows:
- If the class is registered, it will be written as class_id << 1.
- If the class is not registered:
  - If the class is not an array, Fury will write one byte 0bxxxxxxx1 first, then write class name.
    - The lowest bit is 1, which is different from first bit 0 of encoded class id. Fury can use this information to determine whether to read class by class id for deserialization.
  - If the class is an array, Fury will write dimensions << 1 | 1 first, then write component class subsequently. This can reduce array class name cost if component class is or will be serialized.
  - The class name is written as package name and class name. If meta share mode is enabled, class will be written as an unsigned varint which points to index in MetaContext.

If schema evolution mode is enabled globally or enabled for current class, class meta will be written as follows:
This mode will forbid streaming writing since it needs to look back to update the start offset after the whole object graph has been written and meta collection is finished. Only in this way can we ensure that a deserialization failure doesn't lose shared meta. Meta streamline will be supported in the future for enclosed meta sharing which doesn't cross multiple serializations of different objects.
For schema consistent mode, class will be encoded as an enumerated string by full class name. Here we mainly describe the meta layout for schema evolution mode:
|      8 bytes meta header      | meta size |   variable bytes   |  variable bytes   | variable bytes |
+-------------------------------+-----------+--------------------+-------------------+----------------+
| 7 bytes hash + 1 bytes header | 1~2 bytes | current class meta | parent class meta | ...            |
Class meta are encoded from parent class to leaf class; only classes with serializable fields will be encoded.
Meta header is a 64-bit number encoded in little endian order.
- The lowest 4 bits, 0b0000~0b1110, are used to record num classes. 0b1111 is preserved to indicate that Fury needs to read more bytes for the length using Fury unsigned int encoding. If the current class doesn't have a parent class, or the parent class doesn't have fields to serialize, or we're in a context which serializes fields of the current class only (ObjectStreamSerializer#SlotInfo is an example), num classes will be 1.
- The 7 bytes hash is the hash of flags + all layers class meta.
|      unsigned varint       |      meta string      |     meta string     |  field info: variable bytes   | variable bytes  | ... |
+----------------------------+-----------------------+---------------------+-------------------------------+-----------------+-----+
| num fields + register flag | header + package name | header + class name | header + type id + field name | next field info | ... |
- num fields: num fields << 1 | register flag(1 when class registered) is encoded as an unsigned varint.
  - If the current class is schema consistent, num fields will be 0 to flag it.
- Package name encoding:
  - Encoding algorithm: UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL
  - Header: 6 bits size | 2 bits encoding flags. The 6 bits size: 0~63 will be used to indicate size 0~62; the value 63 means more bytes need to be read for the size, and the encoding will encode size - 62 as a varint next.
- Class name encoding:
  - Encoding algorithm: UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL
  - Header: 6 bits size | 2 bits encoding flags. The 6 bits size: 0~63 will be used to indicate size 1~64; the value 63 means more bytes need to be read for the size, and the encoding will encode size - 63 as a varint next.
- Field info:
  - Header: 3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag. Users can use annotations to provide this info.
    - Field name encoding: UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID. If a tag id is used, the 2 bits encoding will be 11.
    - Size of field name: the 3 bits size: 0~7 will be used to indicate length 1~7; the value 7 means more bytes need to be read for the size, and the encoding will encode size - 7 as a varint next. If the encoding is TAG_ID, then num_bytes of field name will be used to store the tag id.
    - Polymorphism flag: when set, the actual type of the field will be the declared field type even if the type is not final.
  - Type id: the type will be encoded as OBJECT_ID if it isn't final and FINAL_OBJECT_ID if it's final. The meta for such types is written separately instead of being inlined here, to reduce meta space cost if an object of this type is serialized in the current object graph multiple times; the field value may be null too.

Field order is left as an implementation detail and is not exposed in the specification; deserialization needs to re-sort fields based on the Fury field comparator. In this way, Fury can compute statistics for field names or types and use a more compact encoding.
Same encoding algorithm as the previous layer except:
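- Header:
  - If the string has been written before, varint index + sharing flag(set) will be written.
  - If the string hasn't been written before:
    - If the meta string encoding is LOWER_SPECIAL and the length of the encoded string <= 64, then header will be 6 bits size + encoding flag(set) + sharing flag(unset).
    - Otherwise, header will be 3 bits unset + 3 bits encoding flags + encoding flag(unset) + sharing flag(unset).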
Meta string is mainly used to encode meta strings such as class name and field names.
String binary encoding algorithm:
Algorithm | Pattern | Description |
---|---|---|
LOWER_SPECIAL | a-z._$\| | every char is written using 5 bits, a-z: 0b00000~0b11001, ._$\|: 0b11010~0b11101, prepend one bit at the start to indicate whether to strip the last char, since the last byte may have 7 redundant bits (1 indicates strip last char) |
LOWER_UPPER_DIGIT_SPECIAL | a-zA-Z0~9._ | every char is written using 6 bits, a-z: 0b000000~0b011001, A-Z: 0b011010~0b110011, 0~9: 0b110100~0b111101, ._: 0b111110~0b111111, prepend one bit at the start to indicate whether to strip the last char, since the last byte may have 7 redundant bits (1 indicates strip last char) |
UTF-8 | any chars | UTF-8 encoding |
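
The LOWER_SPECIAL row above can be illustrated with a small encoder/decoder pair. This is a sketch under stated assumptions rather than Fury's implementation: bits are packed most significant bit first, the strip flag occupies the highest bit of the first byte, and the names `LowerSpecialSketch`, `encode` and `decode` are hypothetical.

```java
final class LowerSpecialSketch {
  // Index in this string is the 5-bit code: a-z are 0~25, "._$|" are 26~29.
  private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz._$|";

  static byte[] encode(String s) {
    int totalBits = 1 + 5 * s.length(); // 1 flag bit + 5 bits per char
    byte[] out = new byte[(totalBits + 7) / 8];
    int padding = out.length * 8 - totalBits;
    // With 5 or more padding bits a reader would see one extra char, so flag it.
    if (padding >= 5) out[0] |= (byte) 0x80;
    int bitPos = 1;
    for (char c : s.toCharArray()) {
      int code = ALPHABET.indexOf(c);
      if (code < 0) throw new IllegalArgumentException("unsupported char: " + c);
      for (int i = 4; i >= 0; i--, bitPos++) {
        if (((code >> i) & 1) != 0) out[bitPos / 8] |= (byte) (0x80 >> (bitPos % 8));
      }
    }
    return out;
  }

  static String decode(byte[] data) {
    if (data.length == 0) return "";
    int count = (data.length * 8 - 1) / 5;
    if ((data[0] & 0x80) != 0) count--; // strip flag set: drop the phantom last char
    StringBuilder sb = new StringBuilder(count);
    int bitPos = 1;
    for (int k = 0; k < count; k++) {
      int code = 0;
      for (int i = 0; i < 5; i++, bitPos++) {
        code = (code << 1) | ((data[bitPos / 8] >> (7 - bitPos % 8)) & 1);
      }
      sb.append(ALPHABET.charAt(code));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    byte[] encoded = encode("org.apache.fury");
    System.out.println(encoded.length + " bytes, decoded: " + decode(encoded));
  }
}
```

Under these assumptions a 15-character package name needs 1 + 15 * 5 = 76 bits, which fits in 10 bytes instead of the 15 bytes UTF-8 would use.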
Encoding flags:
Encoding Flag | Pattern | Encoding Algorithm |
---|---|---|
LOWER_SPECIAL | every char is in a-z._$\| | LOWER_SPECIAL |
FIRST_TO_LOWER_SPECIAL | every char is in a-z[c1,c2] except that the first char is upper case | replace the first upper case char with its lower case, then use LOWER_SPECIAL |
ALL_TO_LOWER_SPECIAL | every char is in a-zA-Z[c1,c2] | replace every upper case char by \| + lower case, then use LOWER_SPECIAL; use this encoding if it's smaller than encoding LOWER_UPPER_DIGIT_SPECIAL |
LOWER_UPPER_DIGIT_SPECIAL | every char is in a-zA-Z[c1,c2] | use LOWER_UPPER_DIGIT_SPECIAL encoding if it's smaller than encoding FIRST_TO_LOWER_SPECIAL |
UTF8 | any utf-8 char | use UTF-8 encoding |
Compression | any utf-8 char | lossless compression |
Notes:
- For package name encoding, c1,c2 should be ._; for field/type name encoding, c1,c2 should be _$.
- Flags and data can be encoded as flags + data jointly, using 3 bits of the first byte for flags and the other bytes for data.

The shared meta string format consists of header and encoded string binary. Header of encoded string binary will be inlined in shared meta header.
Header is written using little endian order; Fury can read this flag first to determine how to deserialize the data.
If string hasn't been written before, the data will be written as follows:
| unsigned varint: string binary size + 1 bit: not written before | 56 bits: unique hash | 3 bits encoding flags + string binary |
If string binary size is less than 16 bytes, the hash will be omitted to save space. The unique hash can be omitted too if the caller passes a flag to disable it. In such cases, the format will be:
| unsigned varint: string binary size + 1 bit: not written before | 3 bits encoding flags + string binary |
If string has been written before, the data will be written as follows:
| unsigned varint: written string id + 1 bit: written before |
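- boolean: 0 for false, 1 for true.
- unsigned varint: the first bit of every byte indicates whether there is a next byte. If the first bit is set, i.e. b & 0x80 == 0x80, then the next byte should be read, until the first bit of the next byte is unset.
- signed varint: first convert the number into an unsigned int with the (v << 1) ^ (v >> 31) ZigZag algorithm, then encode it as an unsigned int.
- unsigned varlong: the first bit of every byte indicates whether there is a next byte. If the first bit is set, i.e. b & 0x80 == 0x80, then the next byte should be read until the first bit is unset.
- signed varlong:
  - A long that is small enough is encoded as a 4 bytes int: | little-endian: ((int) value) << 1 |
  - Otherwise it is written as 9 bytes: | 0b1 | little-endian 8 bytes long |
  - With variable-length encoding, first convert the number into an unsigned long with the (v << 1) ^ (v >> 63) ZigZag algorithm to reduce the cost of small negative numbers, then encode it as an unsigned long.
- float: convert the float to a 4 bytes int with Float.floatToRawIntBits, then write it as binary in little endian order.
- double: convert the double to an 8 bytes long with Double.doubleToRawLongBits, then write it as binary in little endian order.

String format:
| header: size << 2 | 2 bits encoding flags | binary data |
- size + encoding will be concatenated as a long and encoded as an unsigned var long. The lowest 2 bits are used for encoding: 0 for latin, 1 for utf-16, 2 for utf-8.
- Binary data: the string encoded as latin/utf-16/utf-8.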
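The variable-length integer rules and the string header above can be sketched together. This is an illustration under assumptions, not Fury's code: the varint here is a plain 7-bits-per-byte scheme (Fury's own long encoding may differ for large values), `size` is assumed to be the byte length of the binary data, and all names (`VarintSketch`, `writeVarUint`, `writeLatinString`, ...) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class VarintSketch {
  static final int LATIN = 0; // encoding flag value for latin, as listed in the text

  /** Writes v 7 bits at a time, lowest bits first; a set high bit means more bytes follow. */
  static void writeVarUint(ByteArrayOutputStream out, long v) {
    while ((v & ~0x7FL) != 0) {
      out.write((int) ((v & 0x7F) | 0x80));
      v >>>= 7;
    }
    out.write((int) v);
  }

  /** Reads bytes until one has its high bit unset. */
  static long readVarUint(byte[] buf, int offset) {
    long result = 0;
    int shift = 0;
    while (true) {
      int b = buf[offset++] & 0xFF;
      result |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) return result;
      shift += 7;
    }
  }

  /** ZigZag maps small negative numbers to small unsigned ones: -1 -> 1, 1 -> 2. */
  static int zigZag32(int v) {
    return (v << 1) ^ (v >> 31);
  }

  static int unZigZag32(int n) {
    return (n >>> 1) ^ -(n & 1);
  }

  /** Writes header (size << 2 | encoding) as a varint, followed by the latin bytes. */
  static byte[] writeLatinString(String s) {
    byte[] data = s.getBytes(StandardCharsets.ISO_8859_1);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVarUint(out, ((long) data.length << 2) | LATIN);
    out.write(data, 0, data.length);
    return out.toByteArray();
  }

  public static void main(String[] args) {
    System.out.println(zigZag32(-1));
    System.out.println(writeLatinString("abc").length);
  }
}
```

For a short latin string the header fits in a single byte, so `"abc"` takes 4 bytes in this sketch: one header byte and three data bytes.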
Which encoding to choose:
- For JDK8: Fury detects latin at runtime; if the string is a latin string, it uses latin encoding, otherwise it uses utf-16.
- For JDK9+: Fury uses the coder in the String object for encoding; latin/utf-16 will be used for encoding.
- If the string is encoded by utf-8, then Fury will use utf-8 to decode the data. But currently Fury doesn't enable utf-8 encoding by default for java. Cross-language string serialization of Fury uses utf-8 by default.

All collection serializers must extend AbstractCollectionSerializer.
Format:
length(unsigned varint) | collection header | elements header | elements data
Collection header:
- For ArrayList/LinkedArrayList/HashSet/LinkedHashSet, this will be empty.
- For TreeSet, this will be Comparator.
- For subclass of ArrayList, this may be extra object field info.

In most cases, all collection elements are the same type and not null; the elements header encodes this homogeneous information to avoid the cost of writing it for every element. Specifically, four kinds of information are encoded by the elements header, each using one bit:
- If track elements ref, use the first bit 0b1 of the header to flag it.
- If the collection has null, use the second bit 0b10 of the header to flag it. If ref tracking is enabled for this element type, this flag is invalid.
- If the collection element types are not the declared type, use the 3rd bit 0b100 of the header to flag it.
- If the collection element types are different, use the 4th bit 0b1000 of the header to flag it.

By default, all bits are unset, which means all elements won't track ref, all elements are the same type, not null, and the actual element is the declared type in the custom class field.
The implementation can generate different deserialization code based on the read header, and look up the generated code from a linear map/list.
Based on the elements header, the serialization of elements data may skip ref flag/null flag/element class info.
CollectionSerializer#write/read can be taken as an example.
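Computing an elements header before writing the elements can be sketched as below. The bit meanings are an assumption (0b1 track ref, 0b10 has null, 0b100 element type is not the declared type, 0b1000 element types differ), and `ElementsHeaderSketch` and `elementsHeader` are hypothetical names; the real logic lives in Fury's collection serializers.

```java
import java.util.Collection;

final class ElementsHeaderSketch {
  // Assumed bit assignment, one bit per kind of information.
  static final int TRACKING_REF = 0b1;
  static final int HAS_NULL = 0b10;
  static final int NOT_DECL_ELEMENT_TYPE = 0b100;
  static final int NOT_SAME_TYPE = 0b1000;

  static int elementsHeader(Collection<?> elements, Class<?> declaredType, boolean trackRef) {
    int header = trackRef ? TRACKING_REF : 0;
    Class<?> first = null;
    for (Object e : elements) {
      if (e == null) {
        // With ref tracking on, null is already covered by the per-element ref flag.
        if (!trackRef) header |= HAS_NULL;
        continue;
      }
      Class<?> type = e.getClass();
      if (type != declaredType) header |= NOT_DECL_ELEMENT_TYPE;
      if (first == null) {
        first = type;
      } else if (first != type) {
        header |= NOT_SAME_TYPE;
      }
    }
    return header;
  }
}
```

A header of 0 is the cheapest case: a reader can then skip the ref flag, the null flag and the class info for every element.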
Primitive arrays are treated as a binary buffer: serialization just writes the length of the array as an unsigned int, then copies the whole buffer into the stream.
Such serialization won't compress the array. If users want to compress primitive array, users need to register custom serializers for such types.
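A minimal sketch of the "length, then raw buffer" layout for an `int[]` is shown below. It assumes the length prefix counts bytes and is written as a fixed 4-byte little endian int; `PrimitiveArraySketch` and `writeIntArray` are hypothetical names, not Fury API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PrimitiveArraySketch {
  static byte[] writeIntArray(int[] values) {
    int byteLength = values.length * Integer.BYTES;
    ByteBuffer buf =
        ByteBuffer.allocate(Integer.BYTES + byteLength).order(ByteOrder.LITTLE_ENDIAN);
    buf.putInt(byteLength); // assumption: the prefix is the size in bytes
    buf.asIntBuffer().put(values); // bulk copy; the view inherits little endian order
    return buf.array();
  }

  public static void main(String[] args) {
    System.out.println(writeIntArray(new int[] {1, 2}).length);
  }
}
```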
Object array is serialized using the collection format. Object component type will be taken as collection element generic type.
All Map serializers must extend AbstractMapSerializer.
Format:
| length(unsigned varint) | map header | key value pairs data |
Map header:
- For HashMap/LinkedHashMap, this will be empty.
- For TreeMap, this will be Comparator.
- For subclass of Map, this may be extra object field info.

Map iteration is too expensive, so Fury won't compute the header in advance as it does for collections, since that introduces considerable overhead. Users can use the MapFieldInfo annotation to provide the header in advance. Otherwise Fury will use the first key-value pair to predict the header optimistically, and update the chunk header if the prediction fails at some pair.
Fury will serialize map chunk by chunk, every chunk has 127 pairs at most.
|     1 byte     |     1 byte     | variable bytes  |
+----------------+----------------+-----------------+
| chunk size: N  |   KV header    |   N*2 objects   |
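KV header:
- If track key ref, use the first bit 0b1 of the header to flag it.
- If the key has null, use the second bit 0b10 of the header to flag it. If ref tracking is enabled for this key type, this flag is invalid.
- If the key types of the map are different, use the 3rd bit 0b100 of the header to flag it.
- If the actual key type of the map is not the declared key type, use the 4th bit 0b1000 of the header to flag it.
- If track value ref, use the 5th bit 0b10000 of the header to flag it.
- If the value has null, use the 6th bit 0b100000 of the header to flag it. If ref tracking is enabled for this value type, this flag is invalid.
- If the value types of the map are different, use the 7th bit 0b1000000 of the header to flag it.
- If the value type of the map is not the declared value type, use the 8th bit 0b10000000 of the header to flag it.

If streaming write is enabled, Fury can't update the written chunk size. In such cases, map key-value data format will be: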
|     1 byte     | variable bytes  |
+----------------+-----------------+
|   KV header    |   N*2 objects   |
KV header will be a header marked by MapFieldInfo in java. The implementation can generate different deserialization code based on the read header, and look up the generated code from a linear map/list.
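The chunking rule described above (at most 127 pairs per chunk) can be sketched as follows. This only shows the size-based split; a real writer would also close a chunk when the predicted KV header stops matching. `MapChunkSketch` and `chunks` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class MapChunkSketch {
  static final int MAX_CHUNK_SIZE = 127;

  /** Splits the entries into consecutive chunks of at most MAX_CHUNK_SIZE pairs. */
  static <K, V> List<List<Map.Entry<K, V>>> chunks(Map<K, V> map) {
    List<List<Map.Entry<K, V>>> result = new ArrayList<>();
    List<Map.Entry<K, V>> current = new ArrayList<>();
    for (Map.Entry<K, V> entry : map.entrySet()) {
      if (current.size() == MAX_CHUNK_SIZE) {
        result.add(current); // chunk is full: its size byte would be 127
        current = new ArrayList<>();
      }
      current.add(entry);
    }
    if (!current.isEmpty()) result.add(current);
    return result;
  }
}
```

For a map with 300 entries this yields three chunks of 127, 127 and 46 pairs, each of which would be preceded by its own chunk size byte and KV header.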
Enums are serialized as an unsigned var int. If the order of enum values changes, the deserialized enum value may not be the value users expect. In such cases, users must register an enum serializer that writes the enum value as an enumerated string with unique hash disabled.
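The ordering hazard can be shown with plain Java enums. `ColorV1` and `ColorV2` are hypothetical stand-ins for the writer's and the reader's version of the same enum; nothing here uses Fury itself.

```java
final class EnumOrdinalSketch {
  enum ColorV1 { RED, GREEN, BLUE } // order used by the writer
  enum ColorV2 { GREEN, RED, BLUE } // same constants, reordered on the reader side

  /** Ordinal-based encoding: compact, but tied to declaration order. */
  static int write(ColorV1 c) {
    return c.ordinal();
  }

  static ColorV2 readByOrdinal(int ordinal) {
    return ColorV2.values()[ordinal];
  }

  /** Name-based encoding: larger, but stable when constants are reordered. */
  static ColorV2 readByName(String name) {
    return ColorV2.valueOf(name);
  }

  public static void main(String[] args) {
    System.out.println(readByOrdinal(write(ColorV1.RED))); // GREEN, not RED
    System.out.println(readByName(ColorV1.RED.name()));    // RED
  }
}
```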
Object means object of pojo/struct/bean/record type. Object will be serialized by writing its fields data in Fury order.
Depending on schema compatibility, objects will have different formats.
Fields are ordered as follows; every group of fields has its own order:
Object fields are serialized one by one using the following format:
Primitive field value:
|   var bytes    |
+----------------+
|   value data   |
+----------------+
Boxed field value:
| one byte  |   var bytes   |
+-----------+---------------+
| null flag |  field value  |
+-----------+---------------+
field value of final type with ref tracking:
| var bytes | var objects |
+-----------+-------------+
| ref meta  | value data  |
+-----------+-------------+
field value of final type without ref tracking:
| one byte  | var objects |
+-----------+-------------+
| null flag | field value |
+-----------+-------------+
field value of non-final type with ref tracking:
| one byte  |  var bytes  | var objects |
+-----------+-------------+-------------+
| ref meta  | class meta  | value data  |
+-----------+-------------+-------------+
field value of non-final type without ref tracking:
| one byte  | var bytes  | var objects |
+-----------+------------+-------------+
| null flag | class meta | value data  |
+-----------+------------+-------------+
Schema evolution has a similar format to schema consistent mode for objects, except:
- For the object type itself, schema consistent mode will write class by id/name, but schema evolution mode will write class field names, types and other meta too, see Class meta.
- Class meta of final custom type needs to be written too, because peers may not have this class defined.

Class will be serialized using class meta format.