| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| https://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" |
| "https://forrest.apache.org/dtd/document-v20.dtd" [ |
| <!ENTITY % avro-entities PUBLIC "-//Apache//ENTITIES Avro//EN" |
| "../../../../build/avro.ent"> |
| %avro-entities; |
| ]> |
| <document> |
| <header> |
| <title>Apache Avro™ &AvroVersion; Getting Started (Python)</title> |
| </header> |
| <body> |
| <p> |
| This is a short guide for getting started with Apache Avro™ using |
| Python. This guide only covers using Avro for data serialization; see |
| Patrick Hunt's <a href="https://github.com/phunt/avro-rpc-quickstart">Avro |
| RPC Quick Start</a> for a good introduction to using Avro for RPC. |
| </p> |
| |
| <section id="notice_python3"> |
| <title>Notice for Python 3 users</title> |
| <p> |
| A package called "avro-python3" had been provided to support |
| Python 3 previously, but the codebase was consolidated into |
| the "avro" package that supports Python 3 now. |
| |
| The avro-python3 package will be removed in the near future, |
| so users should use the "avro" package instead. |
| They are mostly API compatible, but there's a few minor difference |
| (e.g., function name capitalization, |
| such as avro.schema.Parse vs avro.schema.parse). |
| </p> |
| </section> |
| |
| <section id="download_install"> |
| <title>Download and Install</title> |
| <p> |
| The easiest way to get started in Python is to install <a href="https://pypi.org/project/avro/">avro from PyPI</a> |
| using <a href="https://pip.pypa.io/en/stable/">pip</a>, the Python Package Installer. |
| </p> |
| <source> |
| $ python3 -m pip install avro |
| </source> |
| <p>Consider doing a local install or using a virtualenv to avoid permissions problems and interfering with system packages:</p> |
| <source> |
| $ python3 -m pip install --user install avro |
| </source> |
| <p>or</p> |
| <source> |
| $ python3 -m venv avro-venv |
| $ avro-venv/bin/pip install avro |
| </source> |
| <p> |
| The official releases of the Avro implementations for C, C++, C#, Java, |
| PHP, Python, and Ruby can be downloaded from the <a |
| href="https://avro.apache.org/releases.html">Apache Avro™ |
| Releases</a> page. This guide uses Avro &AvroVersion;, the latest |
| version at the time of writing. Download and install |
| <em>avro-&AvroVersion;-py2.py3-none-any.whl</em> or |
| <em>avro-&AvroVersion;.tar.gz</em> via |
| <code>python -m pip avro-&AvroVersion;-py2.py3-none-any.whl</code> |
| or |
| <code>python -m pip avro-&AvroVersion;.tar.gz</code>. |
| (As above, consider using a virtualenv or user-local install.) |
| </p> |
| <p>Check that you can import avro from a Python prompt.</p> |
| <source> |
| $ python3 -c 'import avro; print(avro.__version__)' |
| </source> |
| <p>The above should print &AvroVersion;. It should not raise an <code>ImportError</code>.</p> |
| <p> |
| Alternatively, you may build the Avro Python library from source. From |
| your the root Avro directory, run the commands |
| </p> |
| <source> |
| $ cd lang/py/ |
| $ python3 -m pip install -e . |
| $ python3 |
| </source> |
| </section> |
| |
| <section> |
| <title>Defining a schema</title> |
| <p> |
| Avro schemas are defined using JSON. Schemas are composed of <a |
| href="spec.html#schema_primitive">primitive types</a> |
| (<code>null</code>, <code>boolean</code>, <code>int</code>, |
| <code>long</code>, <code>float</code>, <code>double</code>, |
| <code>bytes</code>, and <code>string</code>) and <a |
| href="spec.html#schema_complex">complex types</a> (<code>record</code>, |
| <code>enum</code>, <code>array</code>, <code>map</code>, |
| <code>union</code>, and <code>fixed</code>). You can learn more about |
| Avro schemas and types from the specification, but for now let's start |
| with a simple schema example, <em>user.avsc</em>: |
| </p> |
| <source> |
| {"namespace": "example.avro", |
| "type": "record", |
| "name": "User", |
| "fields": [ |
| {"name": "name", "type": "string"}, |
| {"name": "favorite_number", "type": ["int", "null"]}, |
| {"name": "favorite_color", "type": ["string", "null"]} |
| ] |
| } |
| </source> |
| <p> |
| This schema defines a record representing a hypothetical user. (Note |
| that a schema file can only contain a single schema definition.) At |
| minimum, a record definition must include its type (<code>"type": |
| "record"</code>), a name (<code>"name": "User"</code>), and fields, in |
| this case <code>name</code>, <code>favorite_number</code>, and |
| <code>favorite_color</code>. We also define a namespace |
| (<code>"namespace": "example.avro"</code>), which together with the name |
| attribute defines the "full name" of the schema |
| (<code>example.avro.User</code> in this case). |
| |
| </p> |
| <p> |
| Fields are defined via an array of objects, each of which defines a name |
| and type (other attributes are optional, see the <a |
| href="spec.html#schema_record">record specification</a> for more |
| details). The type attribute of a field is another schema object, which |
| can be either a primitive or complex type. For example, the |
| <code>name</code> field of our User schema is the primitive type |
| <code>string</code>, whereas the <code>favorite_number</code> and |
| <code>favorite_color</code> fields are both <code>union</code>s, |
| represented by JSON arrays. <code>union</code>s are a complex type that |
| can be any of the types listed in the array; e.g., |
| <code>favorite_number</code> can either be an <code>int</code> or |
| <code>null</code>, essentially making it an optional field. |
| </p> |
| </section> |
| |
| <section> |
| <title>Serializing and deserializing without code generation</title> |
| <p> |
| Data in Avro is always stored with its corresponding schema, meaning we |
| can always read a serialized item, regardless of whether we know the |
| schema ahead of time. This allows us to perform serialization and |
| deserialization without code generation. Note that the Avro Python |
| library does not support code generation. |
| </p> |
| <p> |
| Try running the following code snippet, which serializes two users to a |
| data file on disk, and then reads back and deserializes the data file: |
| </p> |
| <source> |
| import avro.schema |
| from avro.datafile import DataFileReader, DataFileWriter |
| from avro.io import DatumReader, DatumWriter |
| |
| schema = avro.schema.parse(open("user.avsc", "rb").read()) |
| |
| writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema) |
| writer.append({"name": "Alyssa", "favorite_number": 256}) |
| writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"}) |
| writer.close() |
| |
| reader = DataFileReader(open("users.avro", "rb"), DatumReader()) |
| for user in reader: |
| print user |
| reader.close() |
| </source> |
| <p>This outputs:</p> |
| <source> |
| {u'favorite_color': None, u'favorite_number': 256, u'name': u'Alyssa'} |
| {u'favorite_color': u'red', u'favorite_number': 7, u'name': u'Ben'} |
| </source> |
| <p> |
| Do make sure that you open your files in binary mode (i.e. using the modes |
| <code>wb</code> or <code>rb</code> respectively). Otherwise you might |
| generate corrupt files due to |
| <a href="https://docs.python.org/library/functions.html#open"> |
| automatic replacement</a> of newline characters with the |
| platform-specific representations. |
| </p> |
| <p> |
| Let's take a closer look at what's going on here. |
| </p> |
| <source> |
| schema = avro.schema.parse(open("user.avsc", "rb").read()) |
| </source> |
| <p> |
| <code>avro.schema.parse</code> takes a string containing a JSON schema |
| definition as input and outputs a <code>avro.schema.Schema</code> object |
| (specifically a subclass of <code>Schema</code>, in this case |
| <code>RecordSchema</code>). We're passing in the contents of our |
| user.avsc schema file here. |
| </p> |
| <source> |
| writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema) |
| </source> |
| <p> |
| We create a <code>DataFileWriter</code>, which we'll use to write |
| serialized items to a data file on disk. The |
| <code>DataFileWriter</code> constructor takes three arguments: |
| </p> |
| <ul> |
| <li>The file we'll serialize to</li> |
| <li>A <code>DatumWriter</code>, which is responsible for actually |
| serializing the items to Avro's binary format |
| (<code>DatumWriter</code>s can be used separately from |
| <code>DataFileWriter</code>s, e.g., to perform IPC with Avro).</li> |
| <li>The schema we're using. The <code>DataFileWriter</code> needs the |
| schema both to write the schema to the data file, and to verify that |
| the items we write are valid items and write the appropriate |
| fields.</li> |
| </ul> |
| <source> |
| writer.append({"name": "Alyssa", "favorite_number": 256}) |
| writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"}) |
| </source> |
| <p> |
| We use <code>DataFileWriter.append</code> to add items to our data |
| file. Avro records are represented as Python <code>dict</code>s. |
| Since the field <code>favorite_color</code> has type <code>["int", |
| "null"]</code>, we are not required to specify this field, as shown in |
| the first append. Were we to omit the required <code>name</code> |
| field, an exception would be raised. Any extra entries not |
| corresponding to a field are present in the <code>dict</code> are |
| ignored. |
| </p> |
| <source> |
| reader = DataFileReader(open("users.avro", "rb"), DatumReader()) |
| </source> |
| <p> |
| We open the file again, this time for reading back from disk. We use |
| a <code>DataFileReader</code> and <code>DatumReader</code> analagous |
| to the <code>DataFileWriter</code> and <code>DatumWriter</code> above. |
| </p> |
| <source> |
| for user in reader: |
| print user |
| </source> |
| <p> |
| The <code>DataFileReader</code> is an iterator that returns |
| <code>dict</code>s corresponding to the serialized items. |
| </p> |
| </section> |
| </body> |
| </document> |