blob: 0ef194afeedfd36b4739087bfde1834a7909a570 [file] [log] [blame] [view]
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# DataFusion Python Examples
Some examples rely on data which can be downloaded from the following site:
- https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Here is a direct link to the file used in the examples:
- https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet
### Creating a SessionContext
- [Creating a SessionContext](./create-context.py)
### Executing Queries with DataFusion
- [Query a Parquet file using SQL](./sql-parquet.py)
- [Query a Parquet file using the DataFrame API](./dataframe-parquet.py)
- [Run a SQL query and store the results in a Pandas DataFrame](./sql-to-pandas.py)
- [Query PyArrow Data](./query-pyarrow-data.py)
### Running User-Defined Python Code
- [Register a Python UDF with DataFusion](./python-udf.py)
- [Register a Python UDAF with DataFusion](./python-udaf.py)
### Substrait Support
- [Serialize query plans using Substrait](./substrait.py)
### Executing SQL against DataFrame Libraries (Experimental)
- [Executing SQL on Polars](./sql-on-polars.py)
- [Executing SQL on Pandas](./sql-on-pandas.py)
- [Executing SQL on cuDF](./sql-on-cudf.py)
## TPC-H Examples
Within the subdirectory `tpch` there are 22 examples that reproduce queries in
the TPC-H specification. These include realistic data that can be generated at
arbitrary scale and allow the user to see use cases for a variety of data frame
operations.
In the list below we describe which new operations can be found in the examples.
The queries are designed to be of increasing complexity, so it is recommended to
review them in order. For brevity, the following list does not include operations
found in previous examples.
- [Convert CSV to Parquet](./tpch/convert_data_to_parquet.py)
- Read from a CSV files where the delimiter is something other than a comma
- Specify schema during CVS reading
- Write to a parquet file
- [Pricing Summary Report](./tpch/q01_pricing_summary_report.py)
- Aggregation computing the maximum value, average, sum, and number of entries
- Filter data by date and interval
- Sorting
- [Minimum Cost Supplier](./tpch/q02_minimum_cost_supplier.py)
- Window operation to find minimum
- Sorting in descending order
- [Shipping Priority](./tpch/q03_shipping_priority.py)
- [Order Priority Checking](./tpch/q04_order_priority_checking.py)
- Aggregating multiple times in one data frame
- [Local Supplier Volume](./tpch/q05_local_supplier_volume.py)
- [Forecasting Revenue Change](./tpch/q06_forecasting_revenue_change.py)
- Using collect and extracting values as a python object
- [Volume Shipping](./tpch/q07_volume_shipping.py)
- Finding multiple distinct and mutually exclusive values within one dataframe
- Using `case` and `when` statements
- [Market Share](./tpch/q08_market_share.py)
- The operations in this query are similar to those in the prior examples, but
it is a more complex example of using filters, joins, and aggregates
- Using left outer joins
- [Product Type Profit Measure](./tpch/q09_product_type_profit_measure.py)
- Extract year from a date
- [Returned Item Reporting](./tpch/q10_returned_item_reporting.py)
- [Important Stock Identification](./tpch/q11_important_stock_identification.py)
- [Shipping Modes and Order](./tpch/q12_ship_mode_order_priority.py)
- Finding non-null values using a boolean operation in a filter
- Case statement with default value
- [Customer Distribution](./tpch/q13_customer_distribution.py)
- [Promotion Effect](./tpch/q14_promotion_effect.py)
- [Top Supplier](./tpch/q15_top_supplier.py)
- [Parts/Supplier Relationship](./tpch/q16_part_supplier_relationship.py)
- Using anti joins
- Using regular expressions (regex)
- Creating arrays of literal values
- Determine if an element exists within an array
- [Small-Quantity-Order Revenue](./tpch/q17_small_quantity_order.py)
- [Large Volume Customer](./tpch/q18_large_volume_customer.py)
- [Discounted Revenue](./tpch/q19_discounted_revenue.py)
- Creating a user defined function (UDF)
- Convert pyarrow Array to python values
- Filtering based on a UDF
- [Potential Part Promotion](./tpch/q20_potential_part_promotion.py)
- Extracting part of a string using substr
- [Suppliers Who Kept Orders Waiting](./tpch/q21_suppliers_kept_orders_waiting.py)
- Using array aggregation
- Determining the size of array elements
- [Global Sales Opportunity](./tpch/q22_global_sales_opportunity.py)