by Raymie Stata
In the early days of Avro, schema resolution was implemented in a number of places, e.g., GenericDatumReader as well as ResolvingGrammarGenerator. However, schema resolution is complicated and thus error-prone. Multiple implementations were hard to maintain, both for correctness and for updates to the schema-resolution spec.
To address the problems of multiple implementations, we converged on the implementation found in ResolvingGrammarGenerator (together with ResolvingDecoder) as the single implementation, and refactored other parts of Avro to depend on this implementation.
Converging on a single implementation solved the maintenance problem and has served well for a number of years. However, the code in ResolvingGrammarGenerator does two things: it contains the logic for schema resolution itself, and it contains the logic for embedding that resolution into a grammar that can be used by ResolvingDecoder.
Recently, Avro contributors have wanted access to the logic of schema resolution apart from ResolvingDecoder. For example, AVRO-2247 proposes a new, faster approach to implementing DatumReaders. The initial implementation of AVRO-2247 was forced to reimplement schema resolution -- going back to the world of multiple implementations -- because there isn't a reusable implementation of our resolution logic.
Similarly, as I've been working on extending the performance improvements of AVRO-2090 when writing data, I've been thinking about the possibilities of dynamic code generation. Here too, I can't reuse ResolvingGrammarGenerator, which would force me to reimplement the schema-resolution logic.
We introduce a new class to encapsulate the logic of schema resolution independent from the logic of implementing schema resolution as a ResolvingDecoder grammar. In particular, the new class is org.apache.avro.Resolver, with the following key function:
public static Resolver.Action resolve(Schema writer, Schema reader);
The subclasses of Resolver.Action encapsulate various ways to resolve schemas. The resolve function walks the reader's and writer's schema parse trees together and generates a tree of Resolver.Action nodes indicating how to resolve each subtree of the writer's schema into the corresponding subtree of the reader's.
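The simultaneous walk described above can be illustrated with a deliberately tiny model. This is a sketch under stated assumptions, not Avro's implementation: ToySchema, ToyAction, and ToyResolve are hypothetical names invented for illustration, the model covers only primitives and records, and it ignores reader-only fields (which real resolution must handle through defaults).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: ToySchema and ToyAction are hypothetical stand-ins,
// not Avro's Schema or Resolver.Action classes.
final class ToyResolve {

    /** A primitive (fields == null) or a record with named, ordered fields. */
    static final class ToySchema {
        final String type;                    // e.g. "int", "long", "string", "record"
        final Map<String, ToySchema> fields;  // null for primitives

        private ToySchema(String type, Map<String, ToySchema> fields) {
            this.type = type;
            this.fields = fields;
        }

        static ToySchema prim(String type) {
            return new ToySchema(type, null);
        }

        /** Arguments alternate: field name, field schema, field name, ... */
        static ToySchema record(Object... nameSchemaPairs) {
            Map<String, ToySchema> m = new LinkedHashMap<>();
            for (int i = 0; i < nameSchemaPairs.length; i += 2) {
                m.put((String) nameSchemaPairs[i], (ToySchema) nameSchemaPairs[i + 1]);
            }
            return new ToySchema("record", m);
        }
    }

    /** One node of the resulting action tree. */
    static final class ToyAction {
        final String kind;  // DO_NOTHING, PROMOTE, RECORD, SKIP or ERROR
        final List<ToyAction> children = new ArrayList<>();

        ToyAction(String kind) {
            this.kind = kind;
        }

        @Override
        public String toString() {
            return children.isEmpty() ? kind : kind + children;
        }
    }

    /** Walks both schemas together and returns one action per writer subtree. */
    static ToyAction resolve(ToySchema writer, ToySchema reader) {
        boolean writerIsRecord = writer.fields != null;
        boolean readerIsRecord = reader.fields != null;
        if (writerIsRecord != readerIsRecord) {
            return new ToyAction("ERROR");
        }
        if (writerIsRecord) {
            ToyAction record = new ToyAction("RECORD");
            // Visit fields in writer order, since that is the order of the written data.
            // Simplification: fields that exist only in the reader are ignored here.
            for (Map.Entry<String, ToySchema> f : writer.fields.entrySet()) {
                ToySchema readerField = reader.fields.get(f.getKey());
                record.children.add(readerField == null
                        ? new ToyAction("SKIP")
                        : resolve(f.getValue(), readerField));
            }
            return record;
        }
        if (writer.type.equals(reader.type)) {
            return new ToyAction("DO_NOTHING");
        }
        if (writer.type.equals("int") && reader.type.equals("long")) {
            return new ToyAction("PROMOTE");
        }
        return new ToyAction("ERROR");
    }
}
```

In this toy, resolving a writer record with fields (id: int, name: string, legacy: int) against a reader record with fields (id: long, name: string) yields a RECORD node whose children are PROMOTE, DO_NOTHING, and SKIP, one per writer field.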
Resolver.Action has the following subclasses:
DoNothing -- nothing needs to be done to resolve the writer's data into the reader's schema. That is, the reader should read the data written by the writer as if it were written using the reader's own schema. This can be generated for any kind of schema -- for example, if the reader's and writer's schemas are the exact same union schema, a DoNothing will be generated -- so consumers of Resolver need to be able to handle DoNothing for all schemas.
Promote -- the writer's value needs to be promoted to the reader's schema. Generated only for numeric and bytes/string types.
ContainerAction -- no resolution is needed directly on container schemas (arrays and maps), but a ContainerAction contains the Action needed for the contained schema.
EnumAdjust -- resolution involves dealing with reordering of symbols and symbols that have been removed from the enumeration. An EnumAdjust object contains the information needed to do so.
RecordAdjust -- resolution involves recursively resolving the schemas for each field, and dealing with reordering and removal of fields. A RecordAdjust object contains the information needed to do so.
SkipAction -- only generated as a sub-action of a RecordAdjust action. Used to indicate that a writer's field does not appear in the reader's schema and thus should be skipped.
WriterUnion -- generated when the writer's schema is a union and the reader's schema is not the identical union. Has subactions for resolving each branch of the writer's union against the reader's schema.
ReaderUnion -- generated when the reader's schema is a union and the writer's is not. Has information indicating which of the reader's union branches is the best fit for the writer's schema, and a subaction for resolving the schema of that branch against the writer's schema.
ErrorAction -- generated when the (sub)schemas can't be resolved.
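To make the EnumAdjust case concrete, the kind of information such an action needs can be sketched as a lookup table from writer ordinals to reader ordinals. This is only an illustration: EnumAdjustSketch and its methods are hypothetical names, and the real EnumAdjust object may carry more or differently shaped information.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of an enum adjustment table; not Avro's EnumAdjust class.
final class EnumAdjustSketch {

    /**
     * For each writer symbol (by ordinal), returns the ordinal of the same
     * symbol in the reader's enum, or -1 if the reader no longer has it.
     */
    static int[] adjustments(List<String> writerSymbols, List<String> readerSymbols) {
        int[] table = new int[writerSymbols.size()];
        for (int i = 0; i < table.length; i++) {
            // List.indexOf returns -1 when the symbol is absent.
            table[i] = readerSymbols.indexOf(writerSymbols.get(i));
        }
        return table;
    }

    /** Maps an ordinal read from the writer's data to the reader's symbol. */
    static String translate(int writerOrdinal, int[] table, List<String> readerSymbols) {
        int readerOrdinal = table[writerOrdinal];
        if (readerOrdinal < 0) {
            throw new IllegalStateException("writer symbol is not present in the reader's enum");
        }
        return readerSymbols.get(readerOrdinal);
    }

    public static void main(String[] args) {
        List<String> writer = Arrays.asList("A", "B", "C");
        List<String> reader = Arrays.asList("C", "A");
        int[] table = adjustments(writer, reader);
        System.out.println(Arrays.toString(table));        // [1, -1, 0]
        System.out.println(translate(2, table, reader));   // C
    }
}
```

With writer symbols A, B, C and reader symbols C, A, the table is [1, -1, 0]: A and C are reordered, and B has been removed, so data containing B cannot be translated in this sketch.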
These new classes are similar to the family of Symbol objects we've defined for ResolvingGrammarGenerator. For example, Action.RecordAdjust is similar to Symbol.FieldOrderAction, and Action.EnumAdjust is similar to Symbol.EnumAdjustAction. This similarity is not surprising, since those Symbol objects were designed to encapsulate the logic of schema resolution as well.
However, ResolvingGrammarGenerator embeds those Symbol objects into flattened productions highly optimized for the LL(1) parser implemented by ResolvingDecoder. The Resolver, in contrast, captures the schema-resolution logic in a tree-like structure that closely mirrors the syntax tree of the schemas being resolved. This tree-like representation is easily consumed by multiple implementations of resolution -- be it the grammar-based implementation of ResolvingDecoder, the "action-sequence"-based implementation of AVRO-2247, or the dynamic code-gen implementation being considered as an extension to AVRO-2090.
We have reimplemented ResolvingGrammarGenerator to eliminate its implementation of schema-resolution logic and instead consume the output of Resolver.resolve. Thus, it might be helpful to study ResolvingGrammarGenerator to better understand how to consume this output in other circumstances.
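As a rough picture of what consuming an action tree can look like, the sketch below flattens a small tree into a linear list of read instructions. It is not how ResolvingGrammarGenerator or AVRO-2247 does it: the node classes here (DoNothing, Promote, Skip, Record) are simplified, hypothetical stand-ins defined only for this example, and the string instructions are invented for readability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical consumer of a simplified action tree; the node types below
// are illustrative stand-ins, not the Resolver.Action subclasses themselves.
final class ActionTreeConsumer {

    static abstract class Node { }

    static final class DoNothing extends Node {
        final String type;
        DoNothing(String type) { this.type = type; }
    }

    static final class Promote extends Node {
        final String from;
        final String to;
        Promote(String from, String to) { this.from = from; this.to = to; }
    }

    static final class Skip extends Node {
        final String type;
        Skip(String type) { this.type = type; }
    }

    static final class Record extends Node {
        final List<Node> fieldActions;  // one action per writer field, in writer order
        Record(Node... fieldActions) { this.fieldActions = Arrays.asList(fieldActions); }
    }

    /** Depth-first walk that turns the tree into a flat instruction list. */
    static List<String> flatten(Node root) {
        List<String> out = new ArrayList<>();
        emit(root, out);
        return out;
    }

    private static void emit(Node node, List<String> out) {
        if (node instanceof Record) {
            for (Node child : ((Record) node).fieldActions) {
                emit(child, out);
            }
        } else if (node instanceof DoNothing) {
            out.add("read " + ((DoNothing) node).type);
        } else if (node instanceof Promote) {
            Promote p = (Promote) node;
            out.add("read " + p.from + " as " + p.to);
        } else if (node instanceof Skip) {
            out.add("skip " + ((Skip) node).type);
        } else {
            throw new IllegalArgumentException("unhandled action: " + node);
        }
    }

    public static void main(String[] args) {
        Node tree = new Record(
                new Promote("int", "long"),
                new Record(new DoNothing("string"), new Skip("int")));
        // Prints: [read int as long, read string, skip int]
        System.out.println(flatten(tree));
    }
}
```

Because the tree mirrors the schema structure, a different consumer could walk the same nodes and emit grammar productions or generated code instead of strings.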