blob: f6216b116d8bd984d5cbd8e0f40f8ca7f7ce6fd8 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN"
"https://forrest.apache.org/dtd/document-v20.dtd" [
<!ENTITY % avro-entities PUBLIC "-//Apache//ENTITIES Avro//EN"
"../../../../build/avro.ent">
%avro-entities;
]>
<document>
<header>
<title>Apache Avro&#153; &AvroVersion; Getting Started (Python)</title>
</header>
<body>
<p>
This is a short guide for getting started with Apache Avro&#153; using
Python. This guide only covers using Avro for data serialization; see
Patrick Hunt's <a href="https://github.com/phunt/avro-rpc-quickstart">Avro
RPC Quick Start</a> for a good introduction to using Avro for RPC.
</p>
<section id="notice_python3">
<title>Notice for Python 3 users</title>
<p>
A package called "avro-python3" had been provided to support
Python 3 previously, but the codebase was consolidated into
the "avro" package that supports Python 3 now.
The avro-python3 package will be removed in the near future,
so users should use the "avro" package instead.
They are mostly API compatible, but there's a few minor difference
(e.g., function name capitalization,
such as avro.schema.Parse vs avro.schema.parse).
</p>
</section>
<section id="download_install">
<title>Download and Install</title>
<p>
The easiest way to get started in Python is to install <a href="https://pypi.org/project/avro/">avro from PyPI</a>
using <a href="https://pip.pypa.io/en/stable/">pip</a>, the Python Package Installer.
</p>
<source>
$ python3 -m pip install avro
</source>
<p>Consider doing a local install or using a virtualenv to avoid permissions problems and interfering with system packages:</p>
<source>
$ python3 -m pip install --user install avro
</source>
<p>or</p>
<source>
$ python3 -m venv avro-venv
$ avro-venv/bin/pip install avro
</source>
<p>
The official releases of the Avro implementations for C, C++, C#, Java,
PHP, Python, and Ruby can be downloaded from the <a
href="https://avro.apache.org/releases.html">Apache Avro&#153;
Releases</a> page. This guide uses Avro &AvroVersion;, the latest
version at the time of writing. Download and install
<em>avro-&AvroVersion;-py2.py3-none-any.whl</em> or
<em>avro-&AvroVersion;.tar.gz</em> via
<code>python -m pip avro-&AvroVersion;-py2.py3-none-any.whl</code>
or
<code>python -m pip avro-&AvroVersion;.tar.gz</code>.
(As above, consider using a virtualenv or user-local install.)
</p>
<p>Check that you can import avro from a Python prompt.</p>
<source>
$ python3 -c 'import avro; print(avro.__version__)'
</source>
<p>The above should print &AvroVersion;. It should not raise an <code>ImportError</code>.</p>
<p>
Alternatively, you may build the Avro Python library from source. From
your the root Avro directory, run the commands
</p>
<source>
$ cd lang/py/
$ python3 -m pip install -e .
$ python3
</source>
</section>
<section>
<title>Defining a schema</title>
<p>
Avro schemas are defined using JSON. Schemas are composed of <a
href="spec.html#schema_primitive">primitive types</a>
(<code>null</code>, <code>boolean</code>, <code>int</code>,
<code>long</code>, <code>float</code>, <code>double</code>,
<code>bytes</code>, and <code>string</code>) and <a
href="spec.html#schema_complex">complex types</a> (<code>record</code>,
<code>enum</code>, <code>array</code>, <code>map</code>,
<code>union</code>, and <code>fixed</code>). You can learn more about
Avro schemas and types from the specification, but for now let's start
with a simple schema example, <em>user.avsc</em>:
</p>
<source>
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
</source>
<p>
This schema defines a record representing a hypothetical user. (Note
that a schema file can only contain a single schema definition.) At
minimum, a record definition must include its type (<code>"type":
"record"</code>), a name (<code>"name": "User"</code>), and fields, in
this case <code>name</code>, <code>favorite_number</code>, and
<code>favorite_color</code>. We also define a namespace
(<code>"namespace": "example.avro"</code>), which together with the name
attribute defines the "full name" of the schema
(<code>example.avro.User</code> in this case).
</p>
<p>
Fields are defined via an array of objects, each of which defines a name
and type (other attributes are optional, see the <a
href="spec.html#schema_record">record specification</a> for more
details). The type attribute of a field is another schema object, which
can be either a primitive or complex type. For example, the
<code>name</code> field of our User schema is the primitive type
<code>string</code>, whereas the <code>favorite_number</code> and
<code>favorite_color</code> fields are both <code>union</code>s,
represented by JSON arrays. <code>union</code>s are a complex type that
can be any of the types listed in the array; e.g.,
<code>favorite_number</code> can either be an <code>int</code> or
<code>null</code>, essentially making it an optional field.
</p>
</section>
<section>
<title>Serializing and deserializing without code generation</title>
<p>
Data in Avro is always stored with its corresponding schema, meaning we
can always read a serialized item, regardless of whether we know the
schema ahead of time. This allows us to perform serialization and
deserialization without code generation. Note that the Avro Python
library does not support code generation.
</p>
<p>
Try running the following code snippet, which serializes two users to a
data file on disk, and then reads back and deserializes the data file:
</p>
<source>
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
schema = avro.schema.parse(open("user.avsc", "rb").read())
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})
writer.close()
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
print user
reader.close()
</source>
<p>This outputs:</p>
<source>
{u'favorite_color': None, u'favorite_number': 256, u'name': u'Alyssa'}
{u'favorite_color': u'red', u'favorite_number': 7, u'name': u'Ben'}
</source>
<p>
Do make sure that you open your files in binary mode (i.e. using the modes
<code>wb</code> or <code>rb</code> respectively). Otherwise you might
generate corrupt files due to
<a href="https://docs.python.org/library/functions.html#open">
automatic replacement</a> of newline characters with the
platform-specific representations.
</p>
<p>
Let's take a closer look at what's going on here.
</p>
<source>
schema = avro.schema.parse(open("user.avsc", "rb").read())
</source>
<p>
<code>avro.schema.parse</code> takes a string containing a JSON schema
definition as input and outputs a <code>avro.schema.Schema</code> object
(specifically a subclass of <code>Schema</code>, in this case
<code>RecordSchema</code>). We're passing in the contents of our
user.avsc schema file here.
</p>
<source>
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
</source>
<p>
We create a <code>DataFileWriter</code>, which we'll use to write
serialized items to a data file on disk. The
<code>DataFileWriter</code> constructor takes three arguments:
</p>
<ul>
<li>The file we'll serialize to</li>
<li>A <code>DatumWriter</code>, which is responsible for actually
serializing the items to Avro's binary format
(<code>DatumWriter</code>s can be used separately from
<code>DataFileWriter</code>s, e.g., to perform IPC with Avro).</li>
<li>The schema we're using. The <code>DataFileWriter</code> needs the
schema both to write the schema to the data file, and to verify that
the items we write are valid items and write the appropriate
fields.</li>
</ul>
<source>
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})
</source>
<p>
We use <code>DataFileWriter.append</code> to add items to our data
file. Avro records are represented as Python <code>dict</code>s.
Since the field <code>favorite_color</code> has type <code>["int",
"null"]</code>, we are not required to specify this field, as shown in
the first append. Were we to omit the required <code>name</code>
field, an exception would be raised. Any extra entries not
corresponding to a field are present in the <code>dict</code> are
ignored.
</p>
<source>
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
</source>
<p>
We open the file again, this time for reading back from disk. We use
a <code>DataFileReader</code> and <code>DatumReader</code> analagous
to the <code>DataFileWriter</code> and <code>DatumWriter</code> above.
</p>
<source>
for user in reader:
print user
</source>
<p>
The <code>DataFileReader</code> is an iterator that returns
<code>dict</code>s corresponding to the serialized items.
</p>
</section>
</body>
</document>