DataFusion tables can be queried with SQL or with the Python API.
Let's create a small table and show the different types of queries that can be run.
df = ctx.from_pydict( { "first_name": ["li", "wang", "ron", "amanda"], "age": [25, 75, 68, 18], "country": ["china", "china", "us", "us"], }, name="some_people", )
Here's the data in the table:
+------------+-----+---------+ | first_name | age | country | +------------+-----+---------+ | li | 25 | china | | wang | 75 | china | | ron | 68 | us | | amanda | 18 | us | +------------+-----+---------+
Here's how to find all individuals that are older than 65 years old in the data with SQL:
ctx.sql("select * from some_people where age > 65")
+------------+-----+---------+
| first_name | age | country |
+------------+-----+---------+
| wang | 75 | china |
| ron | 68 | us |
+------------+-----+---------+
Here's how to run the same query with Python:
df.filter(col("age") > lit(65))
+------------+-----+---------+ | first_name | age | country | +------------+-----+---------+ | wang | 75 | china | | ron | 68 | us | +------------+-----+---------+
Here's how to select the first_name and country columns from the DataFrame with SQL:
ctx.sql("select first_name, country from some_people")
+------------+---------+
| first_name | country |
+------------+---------+
| li | china |
| wang | china |
| ron | us |
| amanda | us |
+------------+---------+
Here's how to run the same query with Python:
df.select(col("first_name"), col("country"))
+------------+---------+ | first_name | country | +------------+---------+ | li | china | | wang | china | | ron | us | | amanda | us | +------------+---------+
Here's how to run a group by aggregation query:
ctx.sql("select country, count(*) as num_people from some_people group by country")
+---------+------------+
| country | num_people |
+---------+------------+
| china | 2 |
| us | 2 |
+---------+------------+