 .. Licensed to the Apache Software Foundation (ASF) under one
 .. or more contributor license agreements.  See the NOTICE file
 .. distributed with this work for additional information
 .. regarding copyright ownership.  The ASF licenses this file
 .. to you under the Apache License, Version 2.0 (the
 .. "License"); you may not use this file except in compliance
 .. with the License.  You may obtain a copy of the License at

 ..   http://www.apache.org/licenses/LICENSE-2.0

 .. Unless required by applicable law or agreed to in writing,
 .. software distributed under the License is distributed on an
 .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 .. KIND, either express or implied.  See the License for the
 .. specific language governing permissions and limitations
 .. under the License.

 .. currentmodule:: pyarrow.json
 .. _json:

 Reading JSON files
 ==================

 Arrow supports reading columnar data from line-delimited JSON files.
 In this context, a JSON file consists of multiple JSON objects, one per line,
 representing individual data rows.  For example, this file represents
 two rows of data with four columns "a", "b", "c", "d":

 .. code-block:: json

    {"a": 1, "b": 2.0, "c": "foo", "d": false}
    {"a": 4, "b": -5.5, "c": null, "d": true}
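
 A file in this format can be produced with Python's standard library alone.
 The following is a minimal sketch, not part of Arrow: the ``records`` list and
 the ``stdlib_json`` alias are illustrative names, and the alias is only there
 to avoid confusion with :mod:`pyarrow.json`.

 .. code-block:: python

    import json as stdlib_json

    # Hypothetical in-memory rows; each dict becomes one line of the file.
    records = [
        {"a": 1, "b": 2.0, "c": "foo", "d": False},
        {"a": 4, "b": -5.5, "c": None, "d": True},
    ]

    # One JSON object per line, each terminated by a newline.
    lines = "".join(stdlib_json.dumps(record) + "\n" for record in records)

    print(lines, end="")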

 The features currently offered are the following:

 * multi-threaded or single-threaded reading
 * automatic decompression of input files (based on the filename extension,
   such as ``my_data.json.gz``)
 * automatic type inference, including nested lists and structs (see below)

 .. note::
    Currently only the line-delimited JSON format is supported.


 Usage
 -----

 JSON reading functionality is available through the :mod:`pyarrow.json` module.
 In many cases, you will simply call the :func:`read_json` function
 with the file path you want to read from::

    >>> from pyarrow import json
    >>> fn = 'my_data.json'
    >>> table = json.read_json(fn)
    >>> table
    pyarrow.Table
    a: int64
    b: double
    c: string
    d: bool
    >>> table.to_pandas()
       a    b     c      d
    0  1  2.0   foo  False
    1  4 -5.5  None   True
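
 For a self-contained variant of the session above, the sketch below first
 writes the sample rows to a temporary directory, once as plain text and once
 gzip-compressed, and then reads both files by path. The temporary directory
 and the variable names are assumptions made for illustration.

 .. code-block:: python

    import gzip
    import os
    import tempfile

    from pyarrow import json

    # Same two rows as the example file shown earlier.
    rows = (b'{"a": 1, "b": 2.0, "c": "foo", "d": false}\n'
            b'{"a": 4, "b": -5.5, "c": null, "d": true}\n')

    with tempfile.TemporaryDirectory() as tmp_dir:
        plain_path = os.path.join(tmp_dir, "my_data.json")
        gzip_path = os.path.join(tmp_dir, "my_data.json.gz")

        with open(plain_path, "wb") as f:
            f.write(rows)
        with gzip.open(gzip_path, "wb") as f:
            f.write(rows)

        table = json.read_json(plain_path)
        # The ".gz" filename extension selects gzip decompression.
        table_from_gzip = json.read_json(gzip_path)

    print(table.schema)
    print(table.equals(table_from_gzip))

 Both calls load the data fully into memory, so the tables remain usable after
 the temporary directory has been deleted.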


 Automatic Type Inference
 ------------------------

 Arrow :ref:`data types <data.types>` are inferred from the JSON types and
 values of each column:

 * JSON null values convert to the ``null`` type, but can fall back to any
   other type.
 * JSON booleans convert to ``bool_``.
 * JSON numbers convert to ``int64``, falling back to ``float64`` if a
   non-integer is encountered.
 * JSON strings of the kind "YYYY-MM-DD" and "YYYY-MM-DD hh:mm:ss" convert
   to ``timestamp[s]``, falling back to ``utf8`` if a conversion error occurs.
 * JSON arrays convert to a ``list`` type, and inference proceeds recursively
   on the JSON arrays' values.
 * Nested JSON objects convert to a ``struct`` type, and inference proceeds
   recursively on the JSON objects' values.

 Thus, reading this JSON file:

 .. code-block:: json

    {"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
    {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}

 returns the following data::

    >>> table = json.read_json("my_data.json")
    >>> table
    pyarrow.Table
    a: list<item: int64>
      child 0, item: int64
    b: struct<c: bool, d: timestamp[s]>
      child 0, c: bool
      child 1, d: timestamp[s]
    >>> table.to_pandas()
               a                                       b
    0     [1, 2]   {'c': True, 'd': 1991-02-03 00:00:00}
    1  [3, 4, 5]  {'c': False, 'd': 2019-04-01 00:00:00}
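
 The fallback rules for nulls and numbers can be checked with a small
 experiment. This is a sketch under assumptions: the rows are invented for
 illustration, they are passed as an in-memory ``io.BytesIO`` object rather
 than a file path, and the exact inferred types may differ between pyarrow
 versions.

 .. code-block:: python

    import io

    import pyarrow as pa
    from pyarrow import json

    # Hypothetical rows chosen to exercise the fallback rules.
    rows = (b'{"a": 1, "b": 2, "c": null, "d": null, "e": null}\n'
            b'{"a": null, "b": -5, "c": "foo", "d": null, "e": true}\n'
            b'{"a": 4.5, "b": null, "c": "bar", "d": null, "e": false}\n')

    table = json.read_json(io.BytesIO(rows))

    # a: integers mixed with 4.5, so int64 falls back to float64
    # b: integers and nulls only, so it stays int64
    # c: nulls followed by strings, so null falls back to string
    # d: nothing but nulls, so it keeps the null type
    # e: nulls followed by booleans, so null falls back to bool
    print(table.schema)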


 Customized parsing
 ------------------

 To change the default parsing settings, for example when reading JSON files
 with an unusual structure, create a :class:`ParseOptions` instance
 and pass it to :func:`read_json`.  For example, you can pass an explicit
 :ref:`schema <data.schema>` to bypass automatic type inference.
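
 As a minimal sketch of passing an explicit schema, the example below forces
 column ``a`` to ``float64`` instead of the inferred ``int64``. The sample rows
 and variable names are illustrative, and the ``unexpected_field_behavior``
 argument is assumed to be available, which may not hold for older pyarrow
 releases.

 .. code-block:: python

    import io

    import pyarrow as pa
    from pyarrow import json

    rows = (b'{"a": 1, "b": 2.0, "c": "foo", "d": false}\n'
            b'{"a": 4, "b": -5.5, "c": null, "d": true}\n')

    # Declare only column "a"; the other columns are not part of the schema.
    schema = pa.schema([("a", pa.float64())])

    # Fields missing from the explicit schema are still inferred here.
    parse_options = json.ParseOptions(explicit_schema=schema)
    table = json.read_json(io.BytesIO(rows), parse_options=parse_options)

    # Alternatively, drop every field that the schema does not mention.
    ignore_options = json.ParseOptions(
        explicit_schema=schema,
        unexpected_field_behavior="ignore",
    )
    narrow_table = json.read_json(io.BytesIO(rows), parse_options=ignore_options)

    print(table.schema)
    print(narrow_table.schema)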

 Similarly, you can choose performance settings by passing a
 :class:`ReadOptions` instance to :func:`read_json`.
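
 A possible way to set these options is sketched below. The values are
 arbitrary choices for illustration, not recommendations: ``use_threads=False``
 requests single-threaded reading, and ``block_size`` sets how many bytes are
 processed at a time.

 .. code-block:: python

    import io

    from pyarrow import json

    rows = (b'{"a": 1, "b": 2.0, "c": "foo", "d": false}\n'
            b'{"a": 4, "b": -5.5, "c": null, "d": true}\n')

    read_options = json.ReadOptions(
        use_threads=False,    # read on a single thread
        block_size=1 << 20,   # process the input in 1 MiB blocks
    )
    table = json.read_json(io.BytesIO(rows), read_options=read_options)

    print(table.num_rows)

 Very small block sizes are best avoided, since a row that does not fit in a
 single block may cause the read to fail.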