 <?xml version="1.0" encoding="UTF-8"?>
 <!--
    Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements.  See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License.  You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
   -->
 <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN"
    "https://forrest.apache.org/dtd/document-v20.dtd" [
   <!ENTITY % avro-entities PUBLIC "-//Apache//ENTITIES Avro//EN"
 	   "../../../../build/avro.ent">
   %avro-entities;
 ]>
 <document>
   <header>
     <title>Apache Avro&#8482; &AvroVersion; Getting Started (Python)</title>
   </header>
   <body>
     <p>
       This is a short guide for getting started with Apache Avro&#8482; using
       Python.  This guide only covers using Avro for data serialization; see
       Patrick Hunt's <a href="https://github.com/phunt/avro-rpc-quickstart">Avro
       RPC Quick Start</a> for a good introduction to using Avro for RPC.
     </p>

     <section id="notice_python3">
       <title>Notice for Python 3 users</title>
       <p>
         A package called "avro-python3" was previously provided to support
         Python 3, but its codebase has been consolidated into the "avro"
         package, which now supports Python 3.
         The avro-python3 package will be removed in the near future,
         so users should use the "avro" package instead.
         The two packages are mostly API compatible, but there are a few
         minor differences (e.g., function name capitalization, such as
         <code>avro.schema.Parse</code> vs <code>avro.schema.parse</code>).
       </p>
     </section>

     <section id="download_install">
       <title>Download and Install</title>
       <p>
         The easiest way to get started in Python is to install <a href="https://pypi.org/project/avro/">avro from PyPI</a>
         using <a href="https://pip.pypa.io/en/stable/">pip</a>, the Python Package Installer.
       </p>
       <source>
 $ python3 -m pip install avro
       </source>
       <p>Consider doing a local install or using a virtualenv to avoid permissions problems and interfering with system packages:</p>
       <source>
 $ python3 -m pip install --user avro
       </source>
       <p>or</p>
       <source>
 $ python3 -m venv avro-venv
 $ avro-venv/bin/pip install avro
       </source>
       <p>
         The official releases of the Avro implementations for C, C++, C#, Java,
         PHP, Python, and Ruby can be downloaded from the <a
         href="https://avro.apache.org/releases.html">Apache Avro&#8482;
         Releases</a> page. This guide uses Avro &AvroVersion;, the latest
         version at the time of writing. Download
         <em>avro-&AvroVersion;-py2.py3-none-any.whl</em> or
         <em>avro-&AvroVersion;.tar.gz</em> and install it via
         <code>python3 -m pip install avro-&AvroVersion;-py2.py3-none-any.whl</code>
         or
         <code>python3 -m pip install avro-&AvroVersion;.tar.gz</code>.
         (As above, consider using a virtualenv or user-local install.)
       </p>
       <p>Check that you can import avro from a Python prompt.</p>
       <source>
 $ python3 -c 'import avro; print(avro.__version__)'
       </source>
       <p>The above should print &AvroVersion;. It should not raise an <code>ImportError</code>.</p>
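      <p>
        If you would rather check from inside a script without triggering an
        <code>ImportError</code>, a minimal sketch using only the standard
        library could look like the following. The helper name
        <code>is_installed</code> is hypothetical and used only for
        illustration.
      </p>
      <source>
```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


print(is_installed("json"))  # True: json ships with Python
```
      </source>
      <p>
        Calling <code>is_installed("avro")</code> should return
        <code>True</code> once the package has been installed into the
        interpreter you are running.
      </p>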
       <p>
         Alternatively, you may build the Avro Python library from source. From
         the root Avro directory, run the commands:
       </p>
       <source>
 $ cd lang/py/
 $ python3 -m pip install -e .
 $ python3
       </source>
     </section>

     <section>
       <title>Defining a schema</title>
       <p>
         Avro schemas are defined using JSON.  Schemas are composed of <a
         href="spec.html#schema_primitive">primitive types</a>
         (<code>null</code>, <code>boolean</code>, <code>int</code>,
         <code>long</code>, <code>float</code>, <code>double</code>,
         <code>bytes</code>, and <code>string</code>) and <a
         href="spec.html#schema_complex">complex types</a> (<code>record</code>,
         <code>enum</code>, <code>array</code>, <code>map</code>,
         <code>union</code>, and <code>fixed</code>).  You can learn more about
         Avro schemas and types from the specification, but for now let's start
         with a simple schema example, <em>user.avsc</em>:
       </p>
       <source>
 {"namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
      {"name": "name", "type": "string"},
      {"name": "favorite_number",  "type": ["int", "null"]},
      {"name": "favorite_color", "type": ["string", "null"]}
  ]
 }
       </source>
       <p>
         This schema defines a record representing a hypothetical user.  (Note
         that a schema file can only contain a single schema definition.)  At
         minimum, a record definition must include its type (<code>"type":
         "record"</code>), a name (<code>"name": "User"</code>), and fields, in
         this case <code>name</code>, <code>favorite_number</code>, and
         <code>favorite_color</code>.  We also define a namespace
         (<code>"namespace": "example.avro"</code>), which together with the name
         attribute defines the "full name" of the schema
         (<code>example.avro.User</code> in this case).

       </p>
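      <p>
        As a rough illustration of how the full name is formed, the sketch
        below parses the schema text with the standard-library
        <code>json</code> module and joins the namespace and name with a dot.
        The helper <code>full_name</code> is a hypothetical name, not part of
        the Avro library, and it ignores the case where the name itself
        already contains dots.
      </p>
      <source>
```python
import json

# The same schema as user.avsc, inlined so the snippet is self-contained.
USER_SCHEMA_JSON = """
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
"""


def full_name(schema):
    """Join namespace and name with a dot (hypothetical helper, not an Avro API)."""
    namespace = schema.get("namespace")
    return f"{namespace}.{schema['name']}" if namespace else schema["name"]


schema = json.loads(USER_SCHEMA_JSON)
print(full_name(schema))                      # example.avro.User
print([f["name"] for f in schema["fields"]])  # ['name', 'favorite_number', 'favorite_color']
```
      </source>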
       <p>
         Fields are defined via an array of objects, each of which defines a name
         and type (other attributes are optional; see the <a
         href="spec.html#schema_record">record specification</a> for more
         details).  The type attribute of a field is another schema object, which
         can be either a primitive or complex type.  For example, the
         <code>name</code> field of our User schema is the primitive type
         <code>string</code>, whereas the <code>favorite_number</code> and
         <code>favorite_color</code> fields are both <code>union</code>s,
         represented by JSON arrays.  <code>union</code>s are a complex type that
         can be any of the types listed in the array; e.g.,
         <code>favorite_number</code> can either be an <code>int</code> or
         <code>null</code>, essentially making it an optional field.
       </p>
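      <p>
        To make the "optional field" idea concrete, here is a small sketch
        that works on the parsed JSON only and reports which fields are unions
        that include <code>null</code>. The helper <code>is_optional</code> is
        a hypothetical name used for illustration; the Avro library does not
        need it.
      </p>
      <source>
```python
def is_optional(field_type):
    """Return True if the type is a union (a JSON array) that includes "null"."""
    return isinstance(field_type, list) and "null" in field_type


# Field definitions copied from the User schema above.
fields = [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]},
]

optional = [f["name"] for f in fields if is_optional(f["type"])]
print(optional)  # ['favorite_number', 'favorite_color']
```
      </source>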
     </section>

     <section>
       <title>Serializing and deserializing without code generation</title>
       <p>
         Data in Avro is always stored with its corresponding schema, meaning we
         can always read a serialized item, regardless of whether we know the
         schema ahead of time.  This allows us to perform serialization and
         deserialization without code generation.  Note that the Avro Python
         library does not support code generation.
       </p>
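      <p>
        To see that the schema really does travel with the data, the sketch
        below reads it back out of a data file's header using only the
        standard library. It assumes the object container file layout from
        the Avro specification (the magic bytes <code>Obj</code> followed by
        the byte 1, then a metadata map whose <code>avro.schema</code> entry
        holds the schema as JSON). All function names here are hypothetical;
        in practice the Avro library does this for you.
      </p>
      <source>
```python
import json

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one zigzag-encoded, variable-length Avro long."""
    value, factor = 0, 1
    while True:
        byte = stream.read(1)[0]
        value += (byte % 128) * factor  # the low seven bits carry data
        factor *= 128
        if byte // 128 == 0:  # high bit clear: this was the last byte
            break
    # Undo zigzag: even values are non-negative, odd values are negative.
    return value // 2 if value % 2 == 0 else -(value // 2) - 1


def read_bytes(stream):
    """Read a length-prefixed run of bytes (also used for strings)."""
    return stream.read(read_long(stream))


def read_embedded_schema(stream):
    """Return the schema stored in the file header, parsed from JSON."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the metadata map
            break
        if count > 0:
            pairs = count
        else:
            pairs = -count
            read_long(stream)  # a negative count is followed by the block size in bytes
        for _ in range(pairs):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return json.loads(meta["avro.schema"])
```
      </source>
      <p>
        Given a data file such as the <em>users.avro</em> file used in this
        guide, opened with mode <code>rb</code>, calling
        <code>read_embedded_schema</code> on the file object should return
        the User schema as a Python <code>dict</code>.
      </p>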
       <p>
         Try running the following code snippet, which serializes two users to a
         data file on disk, and then reads back and deserializes the data file:
       </p>
       <source>
 import avro.schema
 from avro.datafile import DataFileReader, DataFileWriter
 from avro.io import DatumReader, DatumWriter

 schema = avro.schema.parse(open("user.avsc", "rb").read())

 writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
 writer.append({"name": "Alyssa", "favorite_number": 256})
 writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})
 writer.close()

 reader = DataFileReader(open("users.avro", "rb"), DatumReader())
 for user in reader:
     print(user)
 reader.close()
       </source>
       <p>This outputs:</p>
       <source>
 {'name': 'Alyssa', 'favorite_number': 256, 'favorite_color': None}
 {'name': 'Ben', 'favorite_number': 7, 'favorite_color': 'red'}
       </source>
       <p>
         Do make sure that you open your files in binary mode (i.e. using the modes
         <code>wb</code> or <code>rb</code> respectively). Otherwise you might
         generate corrupt files due to
         <a href="https://docs.python.org/library/functions.html#open">
         automatic replacement</a> of newline characters with the
         platform-specific representations.
       </p>
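      <p>
        The effect of newline translation can be reproduced on any platform
        with a short standard-library sketch. Passing
        <code>newline="\r\n"</code> explicitly is an assumption made here to
        imitate what text mode does by default on Windows; the file name is
        arbitrary.
      </p>
      <source>
```python
import os
import tempfile

payload = "a\nb"  # three characters, one of them a newline

path = os.path.join(tempfile.mkdtemp(), "text_mode.bin")

# Text mode with newline="\r\n" imitates the translation Windows applies by default.
with open(path, "w", newline="\r\n") as f:
    f.write(payload)

# Reading in binary mode shows the bytes that actually reached the disk.
with open(path, "rb") as f:
    on_disk = f.read()

print(on_disk)       # b'a\r\nb'
print(len(on_disk))  # 4, one byte more than was written
```
      </source>
      <p>
        An extra byte like this inside Avro's binary encoding would corrupt
        the data, which is why the examples in this guide use
        <code>wb</code> and <code>rb</code>.
      </p>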
       <p>
         Let's take a closer look at what's going on here.
       </p>
       <source>
 schema = avro.schema.parse(open("user.avsc", "rb").read())
       </source>
       <p>
         <code>avro.schema.parse</code> takes a string containing a JSON schema
         definition as input and outputs an <code>avro.schema.Schema</code> object
         (specifically a subclass of <code>Schema</code>, in this case
         <code>RecordSchema</code>).  We're passing in the contents of our
         user.avsc schema file here.
       </p>
       <source>
 writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
       </source>
       <p>
         We create a <code>DataFileWriter</code>, which we'll use to write
         serialized items to a data file on disk.  The
         <code>DataFileWriter</code> constructor takes three arguments:
       </p>
         <ul>
           <li>The file we'll serialize to</li>
           <li>A <code>DatumWriter</code>, which is responsible for actually
           serializing the items to Avro's binary format
           (<code>DatumWriter</code>s can be used separately from
           <code>DataFileWriter</code>s, e.g., to perform IPC with Avro).</li>
           <li>The schema we're using.  The <code>DataFileWriter</code> needs the
           schema both to write the schema to the data file, and to verify that
           the items we write are valid items and write the appropriate
           fields.</li>
         </ul>
         <source>
 writer.append({"name": "Alyssa", "favorite_number": 256})
 writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})
         </source>
         <p>
           We use <code>DataFileWriter.append</code> to add items to our data
           file.  Avro records are represented as Python <code>dict</code>s.
           Since the field <code>favorite_color</code> has type <code>["string",
           "null"]</code>, we are not required to specify this field, as shown in
           the first append.  Were we to omit the required <code>name</code>
           field, an exception would be raised.  Any extra entries in the
           <code>dict</code> that do not correspond to a field are
           ignored.
         </p>
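      <p>
        The behaviour described above can be pictured with a rough stand-in
        written in plain Python. This is only a sketch for the User schema,
        not the Avro library's validation code: the names
        <code>check_user</code>, <code>FIELDS</code>, and
        <code>PYTHON_TYPES</code> are hypothetical. It treats a missing key as
        <code>None</code>, which matches the <code>null</code> branch of a
        union but not the plain <code>string</code> type of
        <code>name</code>.
      </p>
      <source>
```python
# Hypothetical stand-in for the check described above; not the Avro library's code.
PYTHON_TYPES = {"string": str, "int": int, "null": type(None)}

FIELDS = [
    ("name", ["string"]),
    ("favorite_number", ["int", "null"]),
    ("favorite_color", ["string", "null"]),
]


def check_user(datum):
    """Raise ValueError unless every field's value matches one allowed type."""
    for field, allowed in FIELDS:
        value = datum.get(field)  # a missing key behaves like None
        if not any(isinstance(value, PYTHON_TYPES[t]) for t in allowed):
            raise ValueError(f"invalid value for field {field!r}: {value!r}")
    return datum


check_user({"name": "Alyssa", "favorite_number": 256})  # accepted: favorite_color may be absent

try:
    check_user({"favorite_number": 7})  # rejected: name is required
except ValueError as error:
    print(error)  # invalid value for field 'name': None
```
      </source>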
         <source>
 reader = DataFileReader(open("users.avro", "rb"), DatumReader())
         </source>
         <p>
           We open the file again, this time for reading back from disk.  We use
           a <code>DataFileReader</code> and <code>DatumReader</code> analogous
           to the <code>DataFileWriter</code> and <code>DatumWriter</code> above.
         </p>
         <source>
 for user in reader:
     print(user)
         </source>
         <p>
           The <code>DataFileReader</code> is an iterator that returns
           <code>dict</code>s corresponding to the serialized items.
         </p>
     </section>
   </body>
 </document>