| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| ================= |
| Data Manipulation |
| ================= |
| |
| Recipes related to filtering or transforming data in |
| arrays and tables. |
| |
| .. contents:: |
| |
| See :ref:`compute` for a complete list of all available compute functions |
| |
| Computing Mean/Min/Max values of an array |
| ========================================= |
| |
| Arrow provides compute functions that can be applied to arrays. |
| Those compute functions are exposed through the :mod:`pyarrow.compute` |
| module. |
| |
| .. testsetup:: |
| |
| import numpy as np |
| import pyarrow as pa |
| |
| arr = pa.array(np.arange(100)) |
| |
| Given an array with 100 numbers, from 0 to 99 |
| |
| .. testcode:: |
| |
| print(f"{arr[0]} .. {arr[-1]}") |
| |
| .. testoutput:: |
| |
| 0 .. 99 |
| |
| We can compute the ``mean`` using the :func:`pyarrow.compute.mean` |
| function |
| |
| .. testcode:: |
| |
| import pyarrow.compute as pc |
| |
| mean = pc.mean(arr) |
| print(mean) |
| |
| .. testoutput:: |
| |
| 49.5 |
| |
| And the ``min`` and ``max`` using the :func:`pyarrow.compute.min_max` |
| function |
| |
| .. testcode:: |
| |
| import pyarrow.compute as pc |
| |
| min_max = pc.min_max(arr) |
| print(min_max) |
| |
| .. testoutput:: |
| |
| [('min', 0), ('max', 99)] |
| |
| Counting Occurrences of Elements |
| ================================ |
| |
| Arrow provides compute functions that can be applied to arrays, |
| those compute functions are exposed through the :mod:`pyarrow.compute` |
| module. |
| |
| .. testsetup:: |
| |
| import pyarrow as pa |
| |
| nums_arr = pa.array(list(range(10))*10) |
| |
| Given an array with all numbers from 0 to 9 repeated 10 times |
| |
| .. testcode:: |
| |
| print(f"LEN: {len(nums_arr)}, MIN/MAX: {nums_arr[0]} .. {nums_arr[-1]}") |
| |
| .. testoutput:: |
| |
| LEN: 100, MIN/MAX: 0 .. 9 |
| |
| We can count occurrences of all entries in the array using the |
| :func:`pyarrow.compute.value_counts` function |
| |
| .. testcode:: |
| |
| import pyarrow.compute as pc |
| |
| counts = pc.value_counts(nums_arr) |
| for pair in counts: |
| print(pair) |
| |
| .. testoutput:: |
| |
| [('values', 0), ('counts', 10)] |
| [('values', 1), ('counts', 10)] |
| [('values', 2), ('counts', 10)] |
| [('values', 3), ('counts', 10)] |
| [('values', 4), ('counts', 10)] |
| [('values', 5), ('counts', 10)] |
| [('values', 6), ('counts', 10)] |
| [('values', 7), ('counts', 10)] |
| [('values', 8), ('counts', 10)] |
| [('values', 9), ('counts', 10)] |
| |
| Applying arithmetic functions to arrays. |
| ========================================= |
| |
| The compute functions in :mod:`pyarrow.compute` also include |
| common transformations such as arithmetic functions. |
| |
| Given an array with 100 numbers, from 0 to 99 |
| |
| .. testcode:: |
| |
| print(f"{arr[0]} .. {arr[-1]}") |
| |
| .. testoutput:: |
| |
| 0 .. 99 |
| |
| We can multiply all values by 2 using the :func:`pyarrow.compute.multiply` |
| function |
| |
| .. testcode:: |
| |
| import pyarrow.compute as pc |
| |
| doubles = pc.multiply(arr, 2) |
| print(f"{doubles[0]} .. {doubles[-1]}") |
| |
| .. testoutput:: |
| |
| 0 .. 198 |
| |
| Appending tables to an existing table |
| ===================================== |
| |
| If you have data split across two different tables, it is possible |
| to concatenate their rows into a single table. |
| |
| If we have the list of Oscar nominations divided between two different tables: |
| |
| .. testcode:: |
| |
| import pyarrow as pa |
| |
| oscar_nominations_1 = pa.table([ |
| ["Meryl Streep", "Katharine Hepburn"], |
| [21, 12] |
| ], names=["actor", "nominations"]) |
| |
| oscar_nominations_2 = pa.table([ |
| ["Jack Nicholson", "Bette Davis"], |
| [12, 10] |
| ], names=["actor", "nominations"]) |
| |
| We can combine them into a single table using :func:`pyarrow.concat_tables`: |
| |
| .. testcode:: |
| |
| oscar_nominations = pa.concat_tables([oscar_nominations_1, |
| oscar_nominations_2]) |
| print(oscar_nominations) |
| |
| .. testoutput:: |
| |
| pyarrow.Table |
| actor: string |
| nominations: int64 |
| ---- |
| actor: [["Meryl Streep","Katharine Hepburn"],["Jack Nicholson","Bette Davis"]] |
| nominations: [[21,12],[12,10]] |
| |
| .. note:: |
| |
| By default, appending two tables is a zero-copy operation that doesn't need to |
| copy or rewrite data. As tables are made of :class:`pyarrow.ChunkedArray`, |
| the result will be a table with multiple chunks, each pointing to the original |
| data that has been appended. Under some conditions, Arrow might have to |
| cast data from one type to another (if `promote=True`). In such cases the data |
| will need to be copied and an extra cost will occur. |
| |
| Adding a column to an existing Table |
| ==================================== |
| |
| If you have a table it is possible to extend its columns using |
| :meth:`pyarrow.Table.append_column` |
| |
| Suppose we have a table with oscar nominations for each actress |
| |
| .. testcode:: |
| |
| import pyarrow as pa |
| |
| oscar_nominations = pa.table([ |
| ["Meryl Streep", "Katharine Hepburn"], |
| [21, 12] |
| ], names=["actor", "nominations"]) |
| |
| print(oscar_nominations) |
| |
| .. testoutput:: |
| |
| pyarrow.Table |
| actor: string |
| nominations: int64 |
| ---- |
| actor: [["Meryl Streep","Katharine Hepburn"]] |
| nominations: [[21,12]] |
| |
| it's possible to append an additional column to track the years the |
| nomination was won using :meth:`pyarrow.Table.append_column` |
| |
| .. testcode:: |
| |
| oscar_nominations = oscar_nominations.append_column( |
| "wonyears", |
| pa.array([ |
| [1980, 1983, 2012], |
| [1934, 1968, 1969, 1982] |
| ]) |
| ) |
| |
| print(oscar_nominations) |
| |
| .. testoutput:: |
| |
| pyarrow.Table |
| actor: string |
| nominations: int64 |
| wonyears: list<item: int64> |
| child 0, item: int64 |
| ---- |
| actor: [["Meryl Streep","Katharine Hepburn"]] |
| nominations: [[21,12]] |
| wonyears: [[[1980,1983,2012],[1934,1968,1969,1982]]] |
| |
| |
| Replacing a column in an existing Table |
| ======================================= |
| |
| If you have a table it is possible to replace an existing column using |
| :meth:`pyarrow.Table.set_column` |
| |
| Suppose we have a table with information about items sold at a supermarket |
| on a particular day. |
| |
| .. testcode:: |
| |
| import pyarrow as pa |
| |
| sales_data = pa.table([ |
| ["Potato", "Bean", "Cucumber", "Eggs"], |
| [21, 12, 10, 30] |
| ], names=["item", "amount"]) |
| |
| print(sales_data) |
| |
| .. testoutput:: |
| |
| pyarrow.Table |
| item: string |
| amount: int64 |
| ---- |
| item: [["Potato","Bean","Cucumber","Eggs"]] |
| amount: [[21,12,10,30]] |
| |
| it's possible to replace the existing column `amount` |
| in index `1` to update the sales |
| using :meth:`pyarrow.Table.set_column` |
| |
| .. testcode:: |
| |
| new_sales_data = sales_data.set_column( |
| 1, |
| "new_amount", |
| pa.array([30, 20, 15, 40]) |
| ) |
| |
| print(new_sales_data) |
| |
| .. testoutput:: |
| |
| pyarrow.Table |
| item: string |
| new_amount: int64 |
| ---- |
| item: [["Potato","Bean","Cucumber","Eggs"]] |
| new_amount: [[30,20,15,40]] |
| |
| .. data_group_a_table: |
| |
| Group a Table |
| ================ |
| |
| If you have a table which needs to be grouped by a particular key, |
| you can use :meth:`pyarrow.Table.group_by` followed by an aggregation |
| operation :meth:`pyarrow.TableGroupBy.aggregate`. Learn more about |
| groupby operations `here <https://arrow.apache.org/docs/python/compute.html#grouped-aggregations>`_. |
| |
| For example, let’s say we have some data with a particular set of keys |
| and values associated with that key. And we want to group the data by |
| those keys and apply an aggregate function like sum to evaluate |
| how many items are for each unique key. |
| |
| .. testcode:: |
| |
| import pyarrow as pa |
| |
| table = pa.table([ |
| pa.array(["a", "a", "b", "b", "c", "d", "e", "c"]), |
| pa.array([11, 20, 3, 4, 5, 1, 4, 10]), |
| ], names=["keys", "values"]) |
| |
| print(table) |
| |
| .. testoutput:: |
| |
| pyarrow.Table |
| keys: string |
| values: int64 |
| ---- |
| keys: [["a","a","b","b","c","d","e","c"]] |
| values: [[11,20,3,4,5,1,4,10]] |
| |
| Now we let's apply a groupby operation. The table will be grouped |
| by the field ``key`` and an aggregation operation, ``sum`` is applied |
| on the column ``values``. Note that, an aggregation operation pairs with a column name. |
| |
| .. testcode:: |
| |
| aggregated_table = table.group_by("keys").aggregate([("values", "sum")]) |
| |
| print(aggregated_table) |
| |
| .. testoutput:: |
| |
| pyarrow.Table |
| keys: string |
| values_sum: int64 |
| ---- |
| keys: [["a","b","c","d","e"]] |
| values_sum: [[31,7,15,1,4]] |
| |
| If you observe carefully, the new table returns the aggregated column |
| as ``values_sum`` which is formed by the column name and aggregation operation name. |
| |
| Aggregation operations can be applied with options. Let's take a case where |
| we have null values included in our dataset, but we want to take the |
| count of the unique groups excluding the null values. |
| |
| A sample dataset can be formed as follows. |
| |
| .. testcode:: |
| |
| import pyarrow as pa |
| |
| table = pa.table([ |
| pa.array(["a", "a", "b", "b", "b", "c", "d", "d", "e", "c"]), |
| pa.array([None, 20, 3, 4, 5, 6, 10, 1, 4, None]), |
| ], names=["keys", "values"]) |
| |
| print(table) |
| |
| .. testoutput:: |
| |
| pyarrow.Table |
| keys: string |
| values: int64 |
| ---- |
| keys: [["a","a","b","b","b","c","d","d","e","c"]] |
| values: [[null,20,3,4,5,6,10,1,4,null]] |
| |
| Let's apply an aggregation operation ``count`` with the option to exclude |
| null values. |
| |
| .. testcode:: |
| |
| import pyarrow.compute as pc |
| |
| grouped_table = table.group_by("keys").aggregate( |
| [("values", |
| "count", |
| pc.CountOptions(mode="only_valid"))] |
| ) |
| |
| print(grouped_table) |
| |
| .. testoutput:: |
| |
| pyarrow.Table |
| keys: string |
| values_count: int64 |
| ---- |
| keys: [["a","b","c","d","e"]] |
| values_count: [[1,3,1,2,1]] |
| |
| |
| Sort a Table |
| ============ |
| |
| Let's discusse how to sort a table. We can sort a table, |
| based on values of a given column. Data can be either sorted ``ascending`` |
| or ``descending``. |
| |
| Prepare data; |
| |
| .. testcode:: |
| |
| import pyarrow as pa |
| |
| table = pa.table([ |
| pa.array(["a", "a", "b", "b", "b", "c", "d", "d", "e", "c"]), |
| pa.array([15, 20, 3, 4, 5, 6, 10, 1, 14, 123]), |
| ], names=["keys", "values"]) |
| |
| print(table) |
| |
| .. testoutput:: |
| |
| pyarrow.Table |
| keys: string |
| values: int64 |
| ---- |
| keys: [["a","a","b","b","b","c","d","d","e","c"]] |
| values: [[15,20,3,4,5,6,10,1,14,123]] |
| |
| Then applying sort with :meth:`pyarrow.Table.sort_by`; |
| |
| .. testcode:: |
| |
| sorted_table = table.sort_by([("values", "ascending")]) |
| |
| print(sorted_table) |
| |
| .. testoutput:: |
| |
| pyarrow.Table |
| keys: string |
| values: int64 |
| ---- |
| keys: [["d","b","b","b","c","d","e","a","a","c"]] |
| values: [[1,3,4,5,6,10,14,15,20,123]] |
| |
| |
| Searching for values matching a predicate in Arrays |
| =================================================== |
| |
| If you have to look for values matching a predicate in Arrow arrays |
| the :mod:`pyarrow.compute` module provides several methods that |
| can be used to find the values you are looking for. |
| |
| For example, given an array with numbers from 0 to 9, if we |
| want to look only for those greater than 5 we could use the |
| :func:`pyarrow.compute.greater` method and get back the elements |
| that fit our predicate |
| |
| .. testcode:: |
| |
| import pyarrow as pa |
| import pyarrow.compute as pc |
| |
| arr = pa.array(range(10)) |
| gtfive = pc.greater(arr, 5) |
| |
| print(gtfive.to_string()) |
| |
| .. testoutput:: |
| |
| [ |
| false, |
| false, |
| false, |
| false, |
| false, |
| false, |
| true, |
| true, |
| true, |
| true |
| ] |
| |
| Furthermore we can filter the array to get only the entries |
| that match our predicate with :func:`pyarrow.compute.filter` |
| |
| .. testcode:: |
| |
| filtered_array = pc.filter(arr, gtfive) |
| print(filtered_array) |
| |
| .. testoutput:: |
| |
| [ |
| 6, |
| 7, |
| 8, |
| 9 |
| ] |
| |
| Filtering Arrays using a mask |
| ============================= |
| |
| In many cases, when you are searching for something in an array |
| you will end up with a mask that tells you the positions at which |
| your search matched the values. |
| |
| For example in an array of four items, we might have a mask that |
| matches the first and the last items only: |
| |
| .. testcode:: |
| |
| import pyarrow as pa |
| |
| array = pa.array([1, 2, 3, 4]) |
| mask = pa.array([True, False, False, True]) |
| |
| We can then filter the array according to the mask using |
| :meth:`pyarrow.Array.filter` to get back a new array with |
| only the values matching the mask: |
| |
| .. testcode:: |
| |
| filtered_array = array.filter(mask) |
| print(filtered_array) |
| |
| .. testoutput:: |
| |
| [ |
| 1, |
| 4 |
| ] |
| |
| Most search functions in :mod:`pyarrow.compute` will produce |
| a mask as the output, so you can use them to filter your arrays |
| for the values that have been found by the function. |
| |
| For example we might filter our arrays for the values equal to ``2`` |
| using :func:`pyarrow.compute.equal`: |
| |
| .. testcode:: |
| |
| import pyarrow.compute as pc |
| |
| filtered_array = array.filter(pc.equal(array, 2)) |
| print(filtered_array) |
| |
| .. testoutput:: |
| |
| [ |
| 2 |
| ] |