| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| https://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" |
| "https://forrest.apache.org/dtd/document-v20.dtd" [ |
| <!ENTITY % avro-entities PUBLIC "-//Apache//ENTITIES Avro//EN" |
| "../../../../build/avro.ent"> |
| %avro-entities; |
| ]> |
| <document> |
| <header> |
| <title>Apache Avro™ &AvroVersion; Getting Started (Java)</title> |
| </header> |
| <body> |
| <p> |
| This is a short guide for getting started with Apache Avro™ using |
| Java. This guide only covers using Avro for data serialization; see |
| Patrick Hunt's <a href="https://github.com/phunt/avro-rpc-quickstart">Avro |
| RPC Quick Start</a> for a good introduction to using Avro for RPC. |
| </p> |
| <section id="download_install"> |
| <title>Download</title> |
| <p> |
| Avro implementations for C, C++, C#, Java, PHP, Python, and Ruby can be |
| downloaded from the <a |
| href="https://avro.apache.org/releases.html">Apache Avro™ |
| Releases</a> page. This guide uses Avro &AvroVersion;, the latest |
| version at the time of writing. For the examples in this guide, |
| download <em>avro-&AvroVersion;.jar</em> and |
| <em>avro-tools-&AvroVersion;.jar</em>. |
| </p> |
| <p> |
| Alternatively, if you are using Maven, add the following dependency to |
| your POM: |
| </p> |
| <source> |
| <dependency> |
| <groupId>org.apache.avro</groupId> |
| <artifactId>avro</artifactId> |
| <version>&AvroVersion;</version> |
| </dependency> |
| </source> |
| <p> |
| As well as the Avro Maven plugin (for performing code generation): |
| </p> |
| <source> |
| <plugin> |
| <groupId>org.apache.avro</groupId> |
| <artifactId>avro-maven-plugin</artifactId> |
| <version>&AvroVersion;</version> |
| <executions> |
| <execution> |
| <phase>generate-sources</phase> |
| <goals> |
| <goal>schema</goal> |
| </goals> |
| <configuration> |
| <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory> |
| <outputDirectory>${project.basedir}/src/main/java/</outputDirectory> |
| </configuration> |
| </execution> |
| </executions> |
| </plugin> |
| <plugin> |
| <groupId>org.apache.maven.plugins</groupId> |
| <artifactId>maven-compiler-plugin</artifactId> |
| <configuration> |
| <source>1.8</source> |
| <target>1.8</target> |
| </configuration> |
| </plugin> |
| </source> |
| <p> |
| You may also build the required Avro jars from source. Building Avro is |
| beyond the scope of this guide; see the <a |
| href="https://cwiki.apache.org/AVRO/Build+Documentation">Build |
| Documentation</a> page in the wiki for more information. |
| </p> |
| </section> |
| |
| <section> |
| <title>Defining a schema</title> |
| <p> |
| Avro schemas are defined using JSON. Schemas are composed of <a |
| href="spec.html#schema_primitive">primitive types</a> |
| (<code>null</code>, <code>boolean</code>, <code>int</code>, |
| <code>long</code>, <code>float</code>, <code>double</code>, |
| <code>bytes</code>, and <code>string</code>) and <a |
| href="spec.html#schema_complex">complex types</a> (<code>record</code>, |
| <code>enum</code>, <code>array</code>, <code>map</code>, |
| <code>union</code>, and <code>fixed</code>). You can learn more about |
| Avro schemas and types from the specification, but for now let's start |
| with a simple schema example, <em>user.avsc</em>: |
| </p> |
| <source> |
| {"namespace": "example.avro", |
| "type": "record", |
| "name": "User", |
| "fields": [ |
| {"name": "name", "type": "string"}, |
| {"name": "favorite_number", "type": ["int", "null"]}, |
| {"name": "favorite_color", "type": ["string", "null"]} |
| ] |
| } |
| </source> |
| <p> |
| This schema defines a record representing a hypothetical user. (Note |
| that a schema file can only contain a single schema definition.) At |
| minimum, a record definition must include its type (<code>"type": |
| "record"</code>), a name (<code>"name": "User"</code>), and fields, in |
| this case <code>name</code>, <code>favorite_number</code>, and |
| <code>favorite_color</code>. We also define a namespace |
| (<code>"namespace": "example.avro"</code>), which together with the name |
| attribute defines the "full name" of the schema |
        (<code>example.avro.User</code> in this case).
      </p>
| <p> |
| Fields are defined via an array of objects, each of which defines a name |
| and type (other attributes are optional, see the <a |
| href="spec.html#schema_record">record specification</a> for more |
| details). The type attribute of a field is another schema object, which |
| can be either a primitive or complex type. For example, the |
| <code>name</code> field of our User schema is the primitive type |
| <code>string</code>, whereas the <code>favorite_number</code> and |
| <code>favorite_color</code> fields are both <code>union</code>s, |
| represented by JSON arrays. <code>union</code>s are a complex type that |
| can be any of the types listed in the array; e.g., |
| <code>favorite_number</code> can either be an <code>int</code> or |
| <code>null</code>, essentially making it an optional field. |
| </p> |
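      <p>
        Note that a union by itself does not give a field a default value. If
        you also want readers to fill in a default when the field is missing
        from older data, add an explicit <code>"default"</code> attribute; the
        default value for a union field must match the union's <em>first</em>
        branch, so <code>"default": null</code> requires listing
        <code>null</code> first. An illustrative variant of the schema above:
      </p>

```
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["null", "int"], "default": null},
     {"name": "favorite_color", "type": ["null", "string"], "default": null}
 ]
}
```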
| </section> |
| |
| <section> |
| <title>Serializing and deserializing with code generation</title> |
| <section> |
| <title>Compiling the schema</title> |
| <p> |
| Code generation allows us to automatically create classes based on our |
| previously-defined schema. Once we have defined the relevant classes, |
| there is no need to use the schema directly in our programs. We use the |
| avro-tools jar to generate code as follows: |
| </p> |
| <source> |
| java -jar /path/to/avro-tools-&AvroVersion;.jar compile schema <schema file> <destination> |
| </source> |
| <p> |
| This will generate the appropriate source files in a package based on |
| the schema's namespace in the provided destination folder. For |
| instance, to generate a <code>User</code> class in package |
| <code>example.avro</code> from the schema defined above, run |
| </p> |
| <source> |
| java -jar /path/to/avro-tools-&AvroVersion;.jar compile schema user.avsc . |
| </source> |
| <p> |
          Note that if you are using the Avro Maven plugin, there is no need to
| manually invoke the schema compiler; the plugin automatically |
| performs code generation on any .avsc files present in the configured |
| source directory. |
| </p> |
| </section> |
| <section> |
| <title>Creating Users</title> |
| <p> |
| Now that we've completed the code generation, let's create some |
| <code>User</code>s, serialize them to a data file on disk, and then |
| read back the file and deserialize the <code>User</code> objects. |
| </p> |
| <p> |
| First let's create some <code>User</code>s and set their fields. |
| </p> |
| <source> |
| User user1 = new User(); |
| user1.setName("Alyssa"); |
| user1.setFavoriteNumber(256); |
| // Leave favorite color null |
| |
| // Alternate constructor |
| User user2 = new User("Ben", 7, "red"); |
| |
| // Construct via builder |
| User user3 = User.newBuilder() |
| .setName("Charlie") |
| .setFavoriteColor("blue") |
| .setFavoriteNumber(null) |
| .build(); |
| </source> |
| <p> |
| As shown in this example, Avro objects can be created either by |
          invoking a constructor directly or by using a builder. Unlike
          constructors, builders automatically set any default values
          specified in the schema. Additionally, builders validate the data as
          it is set, whereas objects constructed directly will not cause an
          error until the object is serialized. However, using constructors
          directly generally offers better performance, as builders create a
          copy of the data structure before it is written.
| </p> |
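        <p>
          The fail-fast behavior of builders can be seen directly. The sketch
          below assumes the generated <code>User</code> class from the schema
          above is on the classpath, along with the Avro jar; the class name
          <code>BuilderValidation</code> is just illustrative. Because
          <code>name</code> has no default value, building a record without
          setting it throws an <code>AvroRuntimeException</code> immediately:
        </p>

```java
import org.apache.avro.AvroRuntimeException;
import example.avro.User;

public class BuilderValidation {
  public static void main(String[] args) {
    try {
      // name is never set and has no default, so build() throws
      User bad = User.newBuilder().setFavoriteNumber(7).build();
    } catch (AvroRuntimeException e) {
      System.out.println("builder rejected incomplete record");
    }
  }
}
```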
| <p> |
| Note that we do not set <code>user1</code>'s favorite color. Since |
          that field is of type <code>["string", "null"]</code>, we can either
| set it to a <code>string</code> or leave it <code>null</code>; it is |
| essentially optional. Similarly, we set <code>user3</code>'s favorite |
| number to null (using a builder requires setting all fields, even if |
| they are null). |
| </p> |
| </section> |
| <section> |
| <title>Serializing</title> |
| <p> |
| Now let's serialize our <code>User</code>s to disk. |
| </p> |
| <source> |
| // Serialize user1, user2 and user3 to disk |
| DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class); |
| DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter); |
| dataFileWriter.create(user1.getSchema(), new File("users.avro")); |
| dataFileWriter.append(user1); |
| dataFileWriter.append(user2); |
| dataFileWriter.append(user3); |
| dataFileWriter.close(); |
| </source> |
| <p> |
| We create a <code>DatumWriter</code>, which converts Java objects into |
| an in-memory serialized format. The <code>SpecificDatumWriter</code> |
| class is used with generated classes and extracts the schema from the |
| specified generated type. |
| </p> |
| <p> |
| Next we create a <code>DataFileWriter</code>, which writes the |
| serialized records, as well as the schema, to the file specified in the |
| <code>dataFileWriter.create</code> call. We write our users to the file |
| via calls to the <code>dataFileWriter.append</code> method. When we are |
| done writing, we close the data file. |
| </p> |
| </section> |
| <section> |
| <title>Deserializing</title> |
| <p> |
| Finally, let's deserialize the data file we just created. |
| </p> |
| <source> |
| // Deserialize Users from disk |
| DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class); |
DataFileReader<User> dataFileReader = new DataFileReader<User>(new File("users.avro"), userDatumReader);
| User user = null; |
| while (dataFileReader.hasNext()) { |
| // Reuse user object by passing it to next(). This saves us from |
| // allocating and garbage collecting many objects for files with |
| // many items. |
| user = dataFileReader.next(user); |
| System.out.println(user); |
| } |
| </source> |
| <p> |
| This snippet will output: |
| </p> |
| <source> |
| {"name": "Alyssa", "favorite_number": 256, "favorite_color": null} |
| {"name": "Ben", "favorite_number": 7, "favorite_color": "red"} |
| {"name": "Charlie", "favorite_number": null, "favorite_color": "blue"} |
| </source> |
| <p> |
| Deserializing is very similar to serializing. We create a |
| <code>SpecificDatumReader</code>, analogous to the |
| <code>SpecificDatumWriter</code> we used in serialization, which |
| converts in-memory serialized items into instances of our generated |
          class, in this case <code>User</code>. We pass the
          <code>DatumReader</code> and the data file
          to a <code>DataFileReader</code>, analogous to the
| <code>DataFileWriter</code>, which reads both the schema used by the |
| writer as well as the data from the file on disk. The data will be |
| read using the writer's schema included in the file and the |
| schema provided by the reader, in this case the <code>User</code> |
| class. The writer's schema is needed to know the order in which |
| fields were written, while the reader's schema is needed to know what |
| fields are expected and how to fill in default values for fields |
| added since the file was written. If there are differences between |
| the two schemas, they are resolved according to the |
| <a href="spec.html#Schema+Resolution">Schema Resolution</a> |
| specification. |
| </p> |
| <p> |
| Next we use the <code>DataFileReader</code> to iterate through the |
| serialized <code>User</code>s and print the deserialized object to |
          stdout. Note how we perform the iteration: we create a single
          <code>User</code> object in which we store the current deserialized
          user, and pass this record object to every call of
| <code>dataFileReader.next</code>. This is a performance optimization |
| that allows the <code>DataFileReader</code> to reuse the same |
| <code>User</code> object rather than allocating a new |
| <code>User</code> for every iteration, which can be very expensive in |
| terms of object allocation and garbage collection if we deserialize a |
| large data file. While this technique is the standard way to iterate |
| through a data file, it's also possible to use <code>for (User user : |
| dataFileReader)</code> if performance is not a concern. |
| </p> |
| </section> |
| <section> |
| <title>Compiling and running the example code</title> |
| <p> |
| This example code is included as a Maven project in the |
| <em>examples/java-example</em> directory in the Avro docs. From this |
| directory, execute the following commands to build and run the |
| example: |
| </p> |
| <source> |
| $ mvn compile # includes code generation via Avro Maven plugin |
| $ mvn -q exec:java -Dexec.mainClass=example.SpecificMain |
| </source> |
| </section> |
| <section> |
| <title>Beta feature: Generating faster code</title> |
| <p> |
| In this release we have introduced a new approach to |
| generating code that speeds up decoding of objects by more |
| than 10% and encoding by more than 30% (future performance |
| enhancements are underway). To ensure a smooth introduction |
| of this change into production systems, this feature is |
| controlled by a feature flag, the system |
| property <code>org.apache.avro.specific.use_custom_coders</code>. |
        In this first release, this feature is off by default. To
        turn it on, set the system property to <code>true</code> at
        runtime. In the example above, you could enable
        the faster coders as follows:
| </p> |
| <source> |
| $ mvn -q exec:java -Dexec.mainClass=example.SpecificMain \ |
| -Dorg.apache.avro.specific.use_custom_coders=true |
| </source> |
| <p> |
| Note that you do <em>not</em> have to recompile your Avro |
| schema to have access to this feature. The feature is |
| compiled and built into your code, and you turn it on and |
| off at runtime using the feature flag. As a result, you can |
| turn it on during testing, for example, and then off in |
| production. Or you can turn it on in production, and |
| quickly turn it off if something breaks. |
| </p> |
| <p> |
| We encourage the Avro community to exercise this new feature |
        early to help build confidence. (For those paying
        on-demand for compute resources in the cloud, it can lead
        to meaningful cost savings.) As confidence builds, we will
| turn this feature on by default, and eventually eliminate |
| the feature flag (and the old code). |
| </p> |
| </section> |
| </section> |
| |
| <section> |
| <title>Serializing and deserializing without code generation</title> |
| <p> |
| Data in Avro is always stored with its corresponding schema, meaning we |
| can always read a serialized item regardless of whether we know the |
| schema ahead of time. This allows us to perform serialization and |
| deserialization without code generation. |
| </p> |
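      <p>
        This point can be demonstrated concretely: a
        <code>GenericDatumReader</code> created with no schema at all can still
        read a data file, because it recovers the writer's schema from the file
        header. The sketch below is self-contained aside from the Avro jar;
        the file name <em>users-demo.avro</em> and class name are illustrative:
      </p>

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SchemaFromFile {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\", "
      + "\"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}");
    File file = new File("users-demo.avro");

    // Write one record; the data file stores the schema alongside the data
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alyssa");
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, file);
    writer.append(user);
    writer.close();

    // Read it back with no schema supplied: the reader recovers it from the file
    DataFileReader<GenericRecord> in = new DataFileReader<GenericRecord>(
        new File("users-demo.avro"), new GenericDatumReader<GenericRecord>());
    System.out.println(in.getSchema().getFullName());  // example.avro.User
    in.close();
  }
}
```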
| <p> |
| Let's go over the same example as in the previous section, but without |
| using code generation: we'll create some users, serialize them to a data |
        file on disk, and then read back the file and deserialize the user
        objects.
| </p> |
| <section> |
| <title>Creating users</title> |
| <p> |
| First, we use a <code>Parser</code> to read our schema definition and |
| create a <code>Schema</code> object. |
| </p> |
| <source> |
| Schema schema = new Schema.Parser().parse(new File("user.avsc")); |
| </source> |
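        <p>
          The parser is not limited to files; it can just as easily parse a
          schema held in a <code>String</code>, which is convenient in tests.
          A small sketch (it assumes only the Avro jar on the classpath; the
          class name is illustrative):
        </p>

```java
import org.apache.avro.Schema;

public class ParseFromString {
  public static void main(String[] args) {
    String json =
        "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\", "
      + "\"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}";
    Schema schema = new Schema.Parser().parse(json);
    System.out.println(schema.getFullName());  // namespace + "." + name
  }
}
```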
| <p> |
| Using this schema, let's create some users. |
| </p> |
| <source> |
| GenericRecord user1 = new GenericData.Record(schema); |
| user1.put("name", "Alyssa"); |
| user1.put("favorite_number", 256); |
| // Leave favorite color null |
| |
| GenericRecord user2 = new GenericData.Record(schema); |
| user2.put("name", "Ben"); |
| user2.put("favorite_number", 7); |
| user2.put("favorite_color", "red"); |
| </source> |
| <p> |
| Since we're not using code generation, we use |
| <code>GenericRecord</code>s to represent users. |
| <code>GenericRecord</code> uses the schema to verify that we only |
| specify valid fields. If we try to set a non-existent field (e.g., |
| <code>user1.put("favorite_animal", "cat")</code>), we'll get an |
| <code>AvroRuntimeException</code> when we run the program. |
| </p> |
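        <p>
          The failure mode is easy to reproduce. This self-contained sketch
          (assuming the Avro jar on the classpath; a trimmed one-field schema
          and the class name are illustrative) catches the exception:
        </p>

```java
import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class InvalidFieldDemo {
  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(
        "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\", "
      + "\"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}");
    GenericRecord user1 = new GenericData.Record(schema);
    user1.put("name", "Alyssa");            // fine: "name" is in the schema
    try {
      user1.put("favorite_animal", "cat"); // not a schema field
    } catch (AvroRuntimeException e) {
      System.out.println("rejected: not a valid field");
    }
  }
}
```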
| <p> |
| Note that we do not set <code>user1</code>'s favorite color. Since |
          that field is of type <code>["string", "null"]</code>, we can either
| set it to a <code>string</code> or leave it <code>null</code>; it is |
| essentially optional. |
| </p> |
| </section> |
| <section> |
| <title>Serializing</title> |
| <p> |
| Now that we've created our user objects, serializing and deserializing |
| them is almost identical to the example above which uses code |
| generation. The main difference is that we use generic instead of |
| specific readers and writers. |
| </p> |
| <p> |
| First we'll serialize our users to a data file on disk. |
| </p> |
| <source> |
| // Serialize user1 and user2 to disk |
| File file = new File("users.avro"); |
| DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema); |
| DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter); |
| dataFileWriter.create(schema, file); |
| dataFileWriter.append(user1); |
| dataFileWriter.append(user2); |
| dataFileWriter.close(); |
| </source> |
| <p> |
| We create a <code>DatumWriter</code>, which converts Java objects into |
| an in-memory serialized format. Since we are not using code |
| generation, we create a <code>GenericDatumWriter</code>. It requires |
| the schema both to determine how to write the |
| <code>GenericRecord</code>s and to verify that all non-nullable fields |
| are present. |
| </p> |
| <p> |
| As in the code generation example, we also create a |
| <code>DataFileWriter</code>, which writes the serialized records, as |
| well as the schema, to the file specified in the |
| <code>dataFileWriter.create</code> call. We write our users to the |
| file via calls to the <code>dataFileWriter.append</code> method. When |
| we are done writing, we close the data file. |
| </p> |
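        <p>
          Data files are not the only possible sink. When you need Avro bytes
          in memory (for a message queue, say), you can pair the same
          <code>GenericDatumWriter</code> with a <code>BinaryEncoder</code>
          instead of a <code>DataFileWriter</code>. A sketch under the same
          classpath assumption (note that raw binary encoding, unlike a data
          file, does not embed the schema, so the reader must already know it):
        </p>

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class InMemoryRoundTrip {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\", "
      + "\"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}");
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alyssa");

    // Encode to a byte array (no schema header, unlike a data file)
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
    encoder.flush();

    // Decode; the reader must supply the schema itself
    Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord back = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    System.out.println(back.get("name"));  // Alyssa
  }
}
```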
| </section> |
| <section> |
| <title>Deserializing</title> |
| <p> |
| Finally, we'll deserialize the data file we just created. |
| </p> |
| <source> |
| // Deserialize users from disk |
| DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema); |
| DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, datumReader); |
| GenericRecord user = null; |
| while (dataFileReader.hasNext()) { |
| // Reuse user object by passing it to next(). This saves us from |
| // allocating and garbage collecting many objects for files with |
| // many items. |
| user = dataFileReader.next(user); |
System.out.println(user);
}
| </source> |
| <p>This outputs:</p> |
| <source> |
| {"name": "Alyssa", "favorite_number": 256, "favorite_color": null} |
| {"name": "Ben", "favorite_number": 7, "favorite_color": "red"} |
| </source> |
| <p> |
| Deserializing is very similar to serializing. We create a |
| <code>GenericDatumReader</code>, analogous to the |
| <code>GenericDatumWriter</code> we used in serialization, which |
          converts in-memory serialized items into <code>GenericRecord</code>s.
| We pass the <code>DatumReader</code> and the previously created |
| <code>File</code> to a <code>DataFileReader</code>, analogous to the |
| <code>DataFileWriter</code>, which reads both the schema used by the |
| writer as well as the data from the file on disk. The data will be |
| read using the writer's schema included in the file, and the reader's |
| schema provided to the <code>GenericDatumReader</code>. The writer's |
| schema is needed to know the order in which fields were written, |
| while the reader's schema is needed to know what fields are expected |
| and how to fill in default values for fields added since the file |
| was written. If there are differences between the two schemas, they |
| are resolved according to the |
| <a href="spec.html#Schema+Resolution">Schema Resolution</a> |
| specification. |
| </p> |
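        <p>
          Schema resolution can be exercised directly with the generic API.
          The self-contained sketch below (assuming the Avro jar on the
          classpath; the trimmed schemas and the added <code>age</code> field
          are illustrative) writes a record with one schema and reads it with
          a reader schema that adds a field with a default, which resolution
          fills in:
        </p>

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ResolutionDemo {
  public static void main(String[] args) throws Exception {
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
      + "[{\"name\":\"name\",\"type\":\"string\"}]}");
    // Reader schema adds an "age" field with a default value
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
      + "[{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}");

    GenericRecord rec = new GenericData.Record(writer);
    rec.put("name", "Alyssa");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    // Read with both schemas: the missing field gets its default
    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord resolved =
        new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
    System.out.println(resolved.get("age"));  // -1
  }
}
```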
| <p> |
| Next, we use the <code>DataFileReader</code> to iterate through the |
| serialized users and print the deserialized object to stdout. Note |
          how we perform the iteration: we create a single
          <code>GenericRecord</code> object in which we store the current
          deserialized user, and pass this record object to every call of
| <code>dataFileReader.next</code>. This is a performance optimization |
| that allows the <code>DataFileReader</code> to reuse the same record |
| object rather than allocating a new <code>GenericRecord</code> for |
| every iteration, which can be very expensive in terms of object |
| allocation and garbage collection if we deserialize a large data file. |
| While this technique is the standard way to iterate through a data |
| file, it's also possible to use <code>for (GenericRecord user : |
| dataFileReader)</code> if performance is not a concern. |
| </p> |
| </section> |
| <section> |
| <title>Compiling and running the example code</title> |
| <p> |
| This example code is included as a Maven project in the |
| <em>examples/java-example</em> directory in the Avro docs. From this |
| directory, execute the following commands to build and run the |
| example: |
| </p> |
| <source> |
| $ mvn compile |
| $ mvn -q exec:java -Dexec.mainClass=example.GenericMain |
| </source> |
| </section> |
| </section> |
| </body> |
| </document> |