---
title: Java Serialization Format
sidebar_position: 1
id: fory_java_serialization_spec
license: |
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
---
## Spec overview
Apache Fory™ Java Serialization is an automatic object serialization framework that supports references and polymorphism. Apache Fory™
converts an object to and from the Fory Java serialization binary format. Apache Fory™ has two core concepts for Java serialization:
- **Apache Fory™ Java Binary format**
- **Framework to convert object to/from Apache Fory™ Java Binary format**
The serialization format is a dynamic binary format. Its dynamic nature and its reference/polymorphism support make Apache Fory™ flexible
and much easier to use, but they also make the format more complex than those of static serialization frameworks.
Here is the overall format:
```
| fory header | object ref meta | object class meta | object value data |
```
Data is serialized in little endian byte order overall. If byte swapping is costly for some object,
Fory writes the byte order for that object into the data instead of converting it to little endian.
## Fory header
The Fory header starts with one byte:
```
| 4 bits | 1 bit | 1 bit | 1 bit | 1 bit | optional 4 bytes |
+---------------+-------+-------+--------+-------+------------------------------------+
| reserved bits | oob | xlang | endian | null | unsigned int for meta start offset |
```
- null flag: 1 when the object is null, 0 otherwise. If the object is null, the other bits won't be set.
- endian flag: 1 when data is encoded in little endian, 0 for big endian.
- xlang flag: 1 when serialization uses the xlang format, 0 when serialization uses the Fory Java format.
- oob flag: 1 when the passed `BufferCallback` is not null, 0 otherwise.
If meta share mode is enabled, an uncompressed unsigned int is appended to indicate the start offset of metadata.
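The flag bits above can be sketched as simple bit masks. This is a minimal illustration, not Fory's implementation: the class `ForyHeaderFlags` and its method names are hypothetical, and the bit positions assume the diagram is read right to left, so that the null flag is the lowest bit.

```java
/** Hypothetical helper, not part of Fory: reads and writes the flag bits of the first header byte. */
final class ForyHeaderFlags {
    // Assumed bit positions: the diagram is read right to left, so null is bit 0.
    static final int NULL_FLAG = 1;
    static final int LITTLE_ENDIAN_FLAG = 1 << 1;
    static final int XLANG_FLAG = 1 << 2;
    static final int OOB_FLAG = 1 << 3;

    static int encode(boolean isNull, boolean littleEndian, boolean xlang, boolean oob) {
        if (isNull) {
            // For a null object the other bits are left unset.
            return NULL_FLAG;
        }
        int header = 0;
        if (littleEndian) header |= LITTLE_ENDIAN_FLAG;
        if (xlang) header |= XLANG_FLAG;
        if (oob) header |= OOB_FLAG;
        return header;
    }

    static boolean isSet(int header, int flag) {
        return (header & flag) != 0;
    }
}
```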
## Reference Meta
Reference meta records whether the object is null and whether references to it are tracked, by writing the
corresponding flag and maintaining internal state.
Reference flags:
| Flag                | Byte Value | Description                                                                                                                                                    |
| ------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| NULL FLAG           | `-3`       | This flag indicates the object is a null value. We don't use another byte to indicate REF, so that we can save one byte.                                       |
| REF FLAG            | `-2`       | This flag indicates the object was already serialized previously, and Fory will write a ref id in unsigned varint format instead of serializing it again.     |
| NOT_NULL VALUE FLAG | `-1`       | This flag indicates the object is a non-null value and Fory doesn't track ref for this type of object.                                                         |
| REF VALUE FLAG      | `0`        | This flag indicates the object is referenceable and is being serialized for the first time.                                                                   |
When reference tracking is disabled globally, for specific types, or for certain types within a particular
context (e.g., a field of a class), only the `NULL` and `NOT_NULL VALUE` flags are used for reference meta.
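How a writer might pick one of the four flags can be sketched as follows. This is a hedged illustration only: the class `RefFlagSketch` is hypothetical, and assigning ref ids in first-seen order through an identity map is an assumption, not something the specification states.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical sketch, not Fory's resolver: chooses the reference flag for an object. */
final class RefFlagSketch {
    static final byte NULL_FLAG = -3;
    static final byte REF_FLAG = -2;
    static final byte NOT_NULL_VALUE_FLAG = -1;
    static final byte REF_VALUE_FLAG = 0;

    // Assumption: ref ids are handed out in the order objects are first written.
    private final Map<Object, Integer> written = new IdentityHashMap<>();

    /** Returns the flag to write. After REF_FLAG the caller writes refId(obj) as an unsigned varint. */
    byte flagFor(Object obj, boolean trackRef) {
        if (obj == null) {
            return NULL_FLAG;
        }
        if (!trackRef) {
            return NOT_NULL_VALUE_FLAG;
        }
        if (written.containsKey(obj)) {
            return REF_FLAG;
        }
        written.put(obj, written.size());
        return REF_VALUE_FLAG;
    }

    int refId(Object obj) {
        return written.get(obj);
    }
}
```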
## Class Meta
Fory supports registering a class with an optional id. Registration can be used for security checks and class
identification.
A registered class has a user-provided or auto-incremented unsigned int id, i.e. `class_id`.
Depending on whether meta share mode and registration are enabled for the current class, Fory writes class meta
differently.
### Schema consistent
If schema consistent mode is enabled globally or enabled for current class, class meta will be written as follows:
- If the class is registered, it is written as a Fory unsigned varint: `class_id << 1`.
- If the class is not registered:
- If the class is not an array, Fory writes one byte `0bxxxxxxx1` first, then writes the class name.
- The lowest bit is `1`, which differs from the lowest bit `0` of an encoded class id. Fory uses this bit to determine whether to read the class by class id during deserialization.
- If the class is not registered and is an array, Fory writes one byte `dimensions << 1 | 1` first, then writes the component class. This can reduce the array class name cost if the component class is or will be serialized.
- The class is written as two enumerated strings by default: `package name` and `class name`. If meta share mode is enabled, the class is written as an unsigned varint which points to an index in `MetaContext`.
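The low-bit tagging described above can be sketched as follows. The class `ClassTagSketch` and its method names are hypothetical, and the sketch only covers the leading value, not the class name that follows for unregistered classes.

```java
/** Hypothetical sketch of the leading class meta value in schema consistent mode. */
final class ClassTagSketch {
    /** Registered class: class_id << 1, so the lowest bit is 0. */
    static int registeredTag(int classId) {
        return classId << 1;
    }

    /** Unregistered array class: dimensions << 1 | 1, so the lowest bit is 1. */
    static int unregisteredArrayTag(int dimensions) {
        return (dimensions << 1) | 1;
    }

    /** The reader checks the lowest bit to decide whether a class id follows. */
    static boolean isRegistered(int tag) {
        return (tag & 1) == 0;
    }

    /** Recovers the class id (or the array dimensions) from the tag. */
    static int payload(int tag) {
        return tag >>> 1;
    }
}
```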
### Schema evolution
If schema evolution mode is enabled globally or enabled for current class, class meta will be written as follows:
- If meta share mode is not enabled, class meta is written as in schema consistent mode. Additionally, field meta such as field type and name is written with the field value using a key-value like layout.
- If meta share mode is enabled, class meta is written as a meta-share encoded binary if the class hasn't been written before; otherwise an unsigned varint id which references the previously written class meta is written.
## Meta share
> This mode forbids streaming writes, since Fory needs to look back and update the start offset after the whole object
> graph has been written and meta collection is finished. Only in this way can we ensure that a deserialization failure
> doesn't lose shared meta.
> Meta streamline will be supported in the future for enclosed meta sharing which doesn't cross multiple serializations
> of different objects.
For schema consistent mode, the class is encoded as an enumerated string of the full class name. Here we mainly describe
the meta layout for schema evolution mode:
```
| 8 bytes global meta header | 1~2 bytes | variable bytes | variable bytes | variable bytes |
+-------------------------------+-------------|--------------------+-------------------+----------------+
| 50 bits hash + 14 bits header | type header | current class meta | parent class meta | ... |
```
Class meta is encoded from parent class to leaf class; only classes with serializable fields are encoded.
### Global meta header
The meta header is a 64-bit number encoded in little endian order.
- The lower 12 bits encode the meta size. If the meta size is `>= 0b1111_1111_1111`, then write
`meta_size - 0b1111_1111_1111` next.
- The 13th bit indicates whether fields meta is written. When the class is schema-consistent or uses a registered
serializer, fields meta is skipped and class meta is used only to share the namespace and type name.
- The 14th bit indicates whether the meta is compressed.
- The other 50 bits store the unique hash of `flags + all layers class meta`.
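The bit layout above can be sketched as a pack/unpack pair. This is an illustration under stated assumptions: the class `GlobalMetaHeaderSketch` is hypothetical, the hash is assumed to occupy the upper 50 bits, and the size field is assumed to saturate at `0b1111_1111_1111` when the extra size is written separately.

```java
/** Hypothetical sketch of the 64-bit global meta header layout. */
final class GlobalMetaHeaderSketch {
    static final int SIZE_MASK = 0xFFF;             // lower 12 bits: meta size
    static final long FIELDS_META_FLAG = 1L << 12;  // 13th bit
    static final long COMPRESSED_FLAG = 1L << 13;   // 14th bit
    static final int HEADER_BITS = 14;              // assumption: hash sits above these bits

    static long pack(long hash50, int metaSize, boolean fieldsMetaBit, boolean compressed) {
        // Assumption: sizes >= 0xFFF saturate the field; the remainder is written after the header.
        long header = Math.min(metaSize, SIZE_MASK);
        if (fieldsMetaBit) header |= FIELDS_META_FLAG;
        if (compressed) header |= COMPRESSED_FLAG;
        return (hash50 << HEADER_BITS) | header;
    }

    static int sizeField(long header) {
        return (int) (header & SIZE_MASK);
    }

    static boolean needsExtraSize(long header) {
        return sizeField(header) == SIZE_MASK;
    }

    static long hash(long header) {
        return header >>> HEADER_BITS;
    }
}
```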
### Type header
- The lowest 4 bits `0b0000~0b1110` record num classes. `0b1111` is reserved to indicate that Fory needs to
read more bytes for the length using Fory unsigned int encoding. If the current class doesn't have a parent class, or the parent
class doesn't have fields to serialize, or we're in a context which serializes fields of the current class
only (`ObjectStreamSerializer#SlotInfo` is an example), num classes will be 1.
- The other 4 bits are reserved for future extensions.
- If num classes is greater than or equal to `0b1111`, write `num_classes - 0b1111` as a varuint next.
### Single layer class meta
```
| unsigned varint | meta string | meta string | field info: variable bytes | variable bytes | ... |
+----------------------------+-----------------------+---------------------+-------------------------------+-----------------+-----+
| num fields + register flag | header + package name | header + class name | header + type id + field name | next field info | ... |
```
- num fields: encode `num fields << 1 | register flag(1 when class registered)` as unsigned varint.
- If the class is registered, an unsigned varint class id is written next, and package and class name are
omitted.
- If the current class is schema consistent, num fields will be `0` to flag it.
- If the current class isn't schema consistent, num fields will be the number of compatible fields. For example,
users can use a tag id to mark some fields as compatible fields in a schema consistent context. In such cases, schema
consistent fields are serialized first, then compatible fields. At deserialization, Fory uses the
field info of the fields which aren't annotated by a tag id to deserialize schema consistent fields, then uses the
field info in meta to deserialize compatible fields.
- Package name encoding (omitted when class is registered):
- encoding algorithm: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL`
- Header: `6 bits size | 2 bits encoding flags`. The 6-bit size field stores sizes `0~62` directly; the value `63`
indicates that more bytes need to be read, and `size - 63` is encoded as a varint next.
- Class name encoding (omitted when class is registered):
- encoding algorithm: `UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL`
- header: `6 bits size | 2 bits encoding flags`. The 6-bit size field stores sizes `0~62` directly; the value `63`
indicates that more bytes need to be read, and `size - 63` is encoded as a varint next.
- Field info:
- header (8 bits): `3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag`.
Users can use annotations to provide this info.
- 2 bits field name encoding:
- encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID`
- If tag id is used, i.e. the field name is written as an unsigned varint tag id, the 2 bits encoding will be `11`.
- size of field name:
- The `3 bits size: 0~7` will be used to indicate length `1~7`; the value `7` indicates that more bytes need to be read,
and the encoding will encode `size - 7` as a varint next.
- If encoding is `TAG_ID`, then num_bytes of field name will be used to store tag id.
- ref tracking: when set to 1, ref tracking will be enabled for this field.
- nullability: when set to 1, this field can be null.
- polymorphism: when set to 1, the actual type of the field will be the declared field type even if the type is
not `final`.
- type id:
- For registered type-consistent classes, it will be the registered class id.
- Otherwise it will be encoded as `OBJECT_ID` if the type isn't `final` and `FINAL_OBJECT_ID` if it's `final`. The
meta for such types is written separately instead of being inlined here, to reduce meta space cost when objects of
this type are serialized multiple times in the current object graph; the field value may be null too.
- Field name: If tag id is set, tag id will be used instead. Otherwise the meta string encoding length and data will
be written.
Field order is left as an implementation detail and is not exposed in the specification; deserialization needs to
re-sort fields based on the Fory field comparator. This lets Fory compute statistics for field names or types and
use a more compact encoding.
### Other layers class meta
Same encoding algorithm as the previous layer except:
- header + package name:
- Header:
- If package name has been written before: `varint index + sharing flag(set)` will be written
- If package name hasn't been written before:
- If meta string encoding is `LOWER_SPECIAL` and the length of encoded string `<=` 64, then header will be
`6 bits size + encoding flag(set) + sharing flag(unset)`.
- Otherwise, header will
be `3 bits unset + 3 bits encoding flags + encoding flag(unset) + sharing flag(unset)`
## Meta String
Meta string is mainly used to encode meta strings such as class name and field names.
### Encoding Algorithms
String binary encoding algorithm:
| Algorithm | Pattern | Description |
| ------------------------- | ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`, prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) |
| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b000000~0b011001`, `A-Z`: `0b011010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111`, prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) |
| UTF-8 | any chars | UTF-8 encoding |
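The `LOWER_SPECIAL` row can be sketched as a 5-bit packer with the leading strip flag. This is a minimal sketch, not Fory's encoder: the class `LowerSpecialSketch` is hypothetical, and packing bits most-significant-bit first within each byte is an assumption.

```java
/** Hypothetical sketch of LOWER_SPECIAL: 5 bits per char, first bit is the strip-last-char flag. */
final class LowerSpecialSketch {
    private static final String SPECIALS = "._$|"; // values 0b11010~0b11101

    static int charToValue(char c) {
        if (c >= 'a' && c <= 'z') return c - 'a';
        int idx = SPECIALS.indexOf(c);
        if (idx < 0) throw new IllegalArgumentException("unsupported char: " + c);
        return 26 + idx;
    }

    static char valueToChar(int v) {
        if (v < 26) return (char) ('a' + v);
        if (v < 30) return SPECIALS.charAt(v - 26);
        throw new IllegalArgumentException("invalid value: " + v);
    }

    static byte[] encode(String s) {
        int totalBits = 1 + s.length() * 5;
        byte[] out = new byte[(totalBits + 7) / 8];
        int bitPos = 1; // bit 0 is reserved for the strip flag
        for (int n = 0; n < s.length(); n++) {
            int v = charToValue(s.charAt(n));
            for (int i = 4; i >= 0; i--, bitPos++) {
                if (((v >> i) & 1) != 0) out[bitPos >> 3] |= 0x80 >> (bitPos & 7);
            }
        }
        // If the padding could hold a whole char, tell the reader to drop it.
        if (out.length * 8 - totalBits >= 5) out[0] |= 0x80;
        return out;
    }

    static String decode(byte[] data) {
        if (data.length == 0) return "";
        int numChars = (data.length * 8 - 1) / 5;
        if ((data[0] & 0x80) != 0) numChars--; // strip flag set
        StringBuilder sb = new StringBuilder();
        int bitPos = 1;
        for (int n = 0; n < numChars; n++) {
            int v = 0;
            for (int i = 0; i < 5; i++, bitPos++) {
                v = (v << 1) | ((data[bitPos >> 3] >> (7 - (bitPos & 7))) & 1);
            }
            sb.append(valueToChar(v));
        }
        return sb.toString();
    }
}
```

With this layout a 15-character package name such as `org.apache.fory` takes 10 bytes instead of 15.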
Encoding flags:
| Encoding Flag | Pattern | Encoding Algorithm |
| ------------------------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| LOWER_SPECIAL | every char is in `a-z._$\|` | `LOWER_SPECIAL` |
| FIRST_TO_LOWER_SPECIAL | every char is in `a-z[c1,c2]` except first char is upper case | replace first upper case char to lower case, then use `LOWER_SPECIAL` |
| ALL_TO_LOWER_SPECIAL | every char is in `a-zA-Z[c1,c2]` | replace every upper case char by `\|` + `lower case`, then use `LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `LOWER_UPPER_DIGIT_SPECIAL` |
| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z[c1,c2]` | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than Encoding `FIRST_TO_LOWER_SPECIAL` |
| UTF8 | any utf-8 char | use `UTF-8` encoding |
| Compression | any utf-8 char | lossless compression |
Notes:
- For package name encoding, `c1,c2` should be `._`; for field/type name encoding, `c1,c2` should be `_$`.
- Depending on the case, one can choose to encode `flags + data` jointly, using 3 bits of the first byte for flags and the other
bytes for data.
### Shared meta string
The shared meta string format consists of a header and the encoded string binary. The header of the encoded string binary is
inlined in the shared meta header.
The header is written in little endian order; Fory can read this flag first to determine how to deserialize the data.
#### Write by data
If string hasn't been written before, the data will be written as follows:
```
| unsigned varint: string binary size + 1 bit: not written before | 56 bits: unique hash | 3 bits encoding flags + string binary |
```
If the string binary size is less than `16` bytes, the hash is omitted to save space. The unique hash can be omitted too
if the caller passes a flag to disable it. In such cases, the format will be:
```
| unsigned varint: string binary size + 1 bit: not written before | 3 bits encoding flags + string binary |
```
#### Write by ref
If string has been written before, the data will be written as follows:
```
| unsigned varint: written string id + 1 bit: written before |
```
## Value Format
### Basic types
#### Bool
- size: 1 byte
- format: 0 for `false`, 1 for `true`
#### Byte
- size: 1 byte
- format: write as pure byte.
#### Short
- size: 2 bytes
- byte order: little endian order
#### Char
- size: 2 bytes
- byte order: little endian order
#### Unsigned int
- size: 1~5 bytes
- Format: The most significant bit (MSB) of every byte indicates whether another byte follows. If the MSB is set,
i.e. `(b & 0x80) == 0x80`, the next byte is read, and reading continues until a byte whose MSB is unset.
#### Signed int
- size: 1~5 bytes
- Format: First convert the number into a non-negative unsigned int with the ZigZag algorithm `(v << 1) ^ (v >> 31)`, then encode
it as an unsigned int.
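The two int encodings above can be sketched as follows. The class `VarIntSketch` is hypothetical, and writing the lowest 7 bits first is an assumption, since the text only specifies the continuation bit.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of the unsigned varint and ZigZag int encodings. */
final class VarIntSketch {
    /** Writes 7 bits per byte; the MSB of each byte marks that another byte follows. */
    static byte[] writeVarUint32(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // assumption: lowest 7 bits go first
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    static int readVarUint32(byte[] data) {
        int result = 0;
        int shift = 0;
        for (byte b : data) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break; // MSB unset: last byte
            shift += 7;
        }
        return result;
    }

    /** Maps small negative numbers to small unsigned numbers. */
    static int zigZagEncode(int v) {
        return (v << 1) ^ (v >> 31);
    }

    static int zigZagDecode(int u) {
        return (u >>> 1) ^ -(u & 1);
    }

    static byte[] writeVarInt32(int v) {
        return writeVarUint32(zigZagEncode(v));
    }

    static int readVarInt32(byte[] data) {
        return zigZagDecode(readVarUint32(data));
    }
}
```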
#### Unsigned long
- size: 1~9 bytes
- Fory PVL (Progressive Variable-length Long) encoding:
- positive long format: the first bit of every byte indicates whether another byte follows. If the first bit is set,
i.e. `(b & 0x80) == 0x80`, the next byte is read, and reading continues until a byte whose first bit is unset.
#### Signed long
- size: 1~9 bytes
- Fory SLI (Small Long as Int) encoding:
- If the long is in [-1073741824, 1073741823], encode it as a 4-byte int: `| little-endian: ((int) value) << 1 |`
- Otherwise write 9 bytes: `| 0b1 | little-endian 8 bytes long |`
- Fory PVL (Progressive Variable-length Long) encoding:
- First convert the number into a non-negative unsigned long with the ZigZag algorithm `(v << 1) ^ (v >> 63)` to reduce the cost of
small negative numbers, then encode it as an unsigned long.
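The SLI branch can be sketched with `java.nio.ByteBuffer`. The class `SliLongSketch` is hypothetical; the reader relies on the lowest bit of the first byte, which is `0` for the shifted 4-byte int and `1` for the 9-byte form.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch of the SLI (Small Long as Int) encoding. */
final class SliLongSketch {
    static final long HALF_MIN = -1073741824L; // -2^30
    static final long HALF_MAX = 1073741823L;  // 2^30 - 1

    static byte[] write(long value) {
        if (value >= HALF_MIN && value <= HALF_MAX) {
            // Shifting left by one leaves the lowest bit 0, which marks the 4-byte form.
            return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(((int) value) << 1).array();
        }
        // The leading 0b1 byte marks the 9-byte form.
        return ByteBuffer.allocate(9).order(ByteOrder.LITTLE_ENDIAN)
            .put((byte) 0b1).putLong(value).array();
    }

    static long read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        if ((data[0] & 1) == 0) {
            return buf.getInt() >> 1; // arithmetic shift restores the sign
        }
        buf.get(); // skip the marker byte
        return buf.getLong();
    }
}
```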
#### Float
- size: 4 bytes
- format: convert the float to a 4-byte int with `Float.floatToRawIntBits`, then write it in little endian order.
#### Double
- size: 8 bytes
- format: convert the double to an 8-byte long with `Double.doubleToRawLongBits`, then write it in little endian order.
### String
Format:
```
| header: size << 2 | 2 bits encoding flags | binary data |
```
- `size + encoding` are concatenated as a long and encoded as an unsigned var long. The lowest 2 bits are used for the
encoding: 0 for `latin`, 1 for `utf-16`, 2 for `utf-8`.
- encoded string binary data based on encoding: `latin/utf-16/utf-8`.
Which encoding to choose:
- For JDK8: Fory detects `latin` at runtime; if the string is a `latin` string, it uses `latin` encoding, otherwise
`utf-16`.
- For JDK9+: Fory uses the `coder` in the `String` object, so `latin`/`utf-16` will be used for encoding.
- If the string is encoded by `utf-8`, Fory will use `utf-8` to decode the data. Currently Fory doesn't enable
utf-8 encoding by default for Java. Cross-language string serialization in Fory uses `utf-8` by default.
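The string header can be sketched as follows. The class `StringHeaderSketch` is hypothetical, and treating `size` as the byte length of the encoded data is an assumption; the latin check mirrors the JDK8 runtime detection described above.

```java
/** Hypothetical sketch of the string header: size << 2 | 2 bits encoding flags. */
final class StringHeaderSketch {
    static final int LATIN = 0;
    static final int UTF16 = 1;
    static final int UTF8 = 2;

    static long header(int byteSize, int encoding) {
        return ((long) byteSize << 2) | encoding;
    }

    static int size(long header) {
        return (int) (header >>> 2);
    }

    static int encoding(long header) {
        return (int) (header & 0b11);
    }

    /** Runtime detection: every char must fit in one byte. */
    static boolean isLatin(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) return false;
        }
        return true;
    }

    /** Assumption: size counts encoded bytes, so utf-16 uses two bytes per char. */
    static long headerFor(String s) {
        return isLatin(s) ? header(s.length(), LATIN) : header(s.length() * 2, UTF16);
    }
}
```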
### Collection
> All collection serializers must extend `CollectionLikeSerializer`.
Format:
```
length(unsigned varint) | collection header | elements header | elements data
```
#### Collection header
- For `ArrayList/LinkedArrayList/HashSet/LinkedHashSet`, this will be empty.
- For `TreeSet`, this will be `Comparator`
- For subclass of `ArrayList`, this may be extra object field info.
#### Elements header
In most cases, all collection elements have the same type and are not null. The elements header encodes this homogeneous
information to avoid the cost of writing it for every element. Specifically, four kinds of information
are encoded by the elements header, each using one bit:
- If element refs are tracked, use the first bit `0b1` of the header to flag it.
- If the collection has null elements, use the second bit `0b10` of the header to flag it. If ref tracking is enabled for this
element type, this flag is invalid.
- If the collection element types are the declared type, use the 3rd bit `0b100` of the header to flag it.
- If the collection element types are the same, use the 4th bit `0b1000` of the header to flag it.
By default, all bits are unset, which means all elements won't track ref, all elements are same type, not null and
the actual element is the declared type in the custom class field.
The implementation can generate different deserialization code based on the header read, and look up the generated code from a
linear map/list.
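One way to compute the four header bits listed above in a single pass is sketched below. The class `ElementsHeaderSketch` is hypothetical, and leaving the null bit unset when ref tracking is on is one reading of "this flag is invalid", not a confirmed rule.

```java
import java.util.Collection;

/** Hypothetical sketch: computes the elements header bits for a collection. */
final class ElementsHeaderSketch {
    static final int TRACKING_REF = 0b1;
    static final int HAS_NULL = 0b10;
    static final int DECLARED_TYPE = 0b100;
    static final int SAME_TYPE = 0b1000;

    static int compute(Collection<?> elements, Class<?> declaredType, boolean trackRef) {
        boolean hasNull = false;
        boolean sameType = true;
        Class<?> first = null;
        for (Object e : elements) {
            if (e == null) {
                hasNull = true;
            } else if (first == null) {
                first = e.getClass();
            } else if (first != e.getClass()) {
                sameType = false;
            }
        }
        int header = 0;
        if (trackRef) {
            header |= TRACKING_REF; // assumption: the null bit is not used in this case
        } else if (hasNull) {
            header |= HAS_NULL;
        }
        if (sameType) {
            header |= SAME_TYPE;
            if (first == declaredType) header |= DECLARED_TYPE;
        }
        return header;
    }
}
```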
#### Elements data
Based on the elements header, the serialization of elements data may skip `ref flag`/`null flag`/`element class info`.
`CollectionSerializer#write/read` can be taken as an example.
### Array
#### Primitive array
A primitive array is treated as a binary buffer: serialization just writes the array length as an unsigned int,
then copies the whole buffer into the stream.
Such serialization won't compress the array. Users who want to compress primitive arrays need to register custom
serializers for such types.
#### Object array
Object array is serialized using the collection format. Object component type will be taken as collection element
generic type.
### Map
> All Map serializers must extend `MapLikeSerializer`.
Format:
```
| length(unsigned varint) | map header | key value pairs data |
```
#### Map header
- For `HashMap/LinkedHashMap`, this will be empty.
- For `TreeMap`, this will be `Comparator`
- For other `Map`, this may be extra object field info.
#### Map Key-Value data
Map iteration is expensive, so Fory won't compute the header in advance as it does for collections, since doing so introduces
[considerable overhead](https://github.com/apache/fory/issues/925).
Users can use the `MapFieldInfo` annotation to provide the header in advance. Otherwise Fory uses the first key-value pair to
predict the header optimistically, and updates the chunk header if the prediction fails at some pair.
Fory serializes a map chunk by chunk; every chunk has at most 127 pairs.
```
| 1 byte | 1 byte | variable bytes |
+----------------+----------------+-----------------+
| KV header | chunk size: N | N*2 objects |
```
KV header:
- If track key ref, use the first bit `0b1` of the header to flag it.
- If the key has null, use the second bit `0b10` of the header to flag it. If ref tracking is enabled for this
key type, this flag is invalid.
- If the actual key type of map is the declared key type, use the 3rd bit `0b100` of the header to flag it.
- If track value ref, use the 4th bit `0b1000` of the header to flag it.
- If the value has null, use the 5th bit `0b10000` of the header to flag it. If ref tracking is enabled for this
value type, this flag is invalid.
- If the value type of map is the declared value type, use the 6th bit `0b100000` of the header to flag it.
- If key or value is null, that key and value will be written as a separate chunk, and chunk size writing will be
skipped too.
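The chunk size limit can be sketched as follows. The class `MapChunkSketch` is hypothetical and deliberately simplified: it assumes every pair shares one KV header and that no key or value is null, so it only shows how a map is split into chunks of at most 127 pairs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits a map of non-null pairs into chunk sizes. */
final class MapChunkSketch {
    static final int MAX_CHUNK_SIZE = 127;

    // KV header bits as listed above.
    static final int TRACK_KEY_REF = 0b1;
    static final int KEY_HAS_NULL = 0b10;
    static final int KEY_DECLARED_TYPE = 0b100;
    static final int TRACK_VALUE_REF = 0b1000;
    static final int VALUE_HAS_NULL = 0b10000;
    static final int VALUE_DECLARED_TYPE = 0b100000;

    /** Returns the chunk size byte written for each chunk. */
    static List<Integer> chunkSizes(int totalPairs) {
        List<Integer> sizes = new ArrayList<>();
        for (int remaining = totalPairs; remaining > 0; remaining -= MAX_CHUNK_SIZE) {
            sizes.add(Math.min(remaining, MAX_CHUNK_SIZE));
        }
        return sizes;
    }
}
```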
If streaming write is enabled, Fory can't update the written `chunk size`. In such cases, the map key-value data
format will be:
```
| 1 byte | variable bytes |
+----------------+-----------------+
| KV header | N*2 objects |
```
`KV header` will be a header marked by `MapFieldInfo` in Java. The implementation can generate different deserialization
code based on the header read, and look up the generated code from a linear map/list.
### Enum
Enums are serialized as an unsigned var int. If the order of enum values changes, the deserialized enum value may not be
the value users expect. In such cases, users must register an enum serializer that writes the enum value as an enumerated
string with unique hash disabled.
### Object
Object means an object of `pojo/struct/bean/record` type.
An object is serialized by writing its field data in Fory field order.
Depending on schema compatibility, objects will have different formats.
#### Field order
Fields are ordered as follows; every group of fields has its own order:
- primitive fields: larger size type first, smaller later, variable size type last.
- boxed primitive fields: same order as primitive fields
- final fields: same type together, then sorted by field name lexicographically.
- collection fields: same order as final fields
- map fields: same order as final fields
- other fields: same order as final fields
#### Schema consistent
Object fields will be serialized one by one using following format:
```
Primitive field value:
| var bytes |
+----------------+
| value data |
+----------------+
Boxed field value:
| one byte | var bytes |
+-----------+---------------+
| null flag | field value |
+-----------+---------------+
field value of final type with ref tracking:
| var bytes | var objects |
+-----------+-------------+
| ref meta | value data |
+-----------+-------------+
field value of final type without ref tracking:
| one byte | var objects |
+-----------+-------------+
| null flag | field value |
+-----------+-------------+
field value of non-final type with ref tracking:
| one byte | var bytes | var objects |
+-----------+-------------+-------------+
| ref meta | class meta | value data |
+-----------+-------------+-------------+
field value of non-final type without ref tracking:
| one byte | var bytes | var objects |
+-----------+------------+------------+
| null flag | class meta | value data |
+-----------+------------+------------+
```
#### Schema evolution
Schema evolution mode has a similar format to schema consistent mode for objects, except:
- For the object type itself, `schema consistent` mode writes the class by id/name, but `schema evolution` mode also
writes class field names, types and other meta, see [Class meta](#class-meta).
- Class meta of a `final custom type` needs to be written too, because peers may not have this class defined.
### Class
Class will be serialized using class meta format.
## Implementation guidelines
- Try to merge multiple bytes into an int/long write before writing to reduce memory IO and bound check cost.
- Read multiple bytes as an int/long, then split into multiple bytes to reduce memory IO and bound check cost.
- Try to use one varint/long to write flags and length together to save one byte and reduce memory IO.
- Condition branches are less expensive compared to memory IO cost unless there are too many branches.