docs/source/user-guide/sql/struct_coercion.md - datafusion - Git at Google

 <!---
   Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.
 -->

 # Struct Type Coercion and Field Mapping

 DataFusion uses **name-based field mapping** when coercing struct types across different operations. This document explains how struct coercion works, when it applies, and how to handle NULL fields.

 ## Overview: Name-Based vs Positional Mapping

 When combining structs from different sources (e.g., in UNION, array construction, or JOINs), DataFusion matches struct fields by **name** rather than by **position**. This provides more robust and predictable behavior compared to positional matching.

 ### Example: Field Reordering is Handled Transparently

 ```sql
 -- These two structs have the same fields in different order
 SELECT [{a: 1, b: 2}, {b: 3, a: 4}];

 -- Result: Field names matched, values unified
 -- [{"a": 1, "b": 2}, {"a": 4, "b": 3}]
 ```

 ## Coercion Paths Using Name-Based Matching

 The following query operations use name-based field mapping for struct coercion:

 ### 1. Array Literal Construction

 When creating array literals with struct elements that have different field orders:

 ```sql
 -- Structs with reordered fields in array literal
 SELECT [{x: 1, y: 2}, {y: 3, x: 4}];

 -- Unified type: List(Struct("x": Int32, "y": Int32))
 -- Values: [{"x": 1, "y": 2}, {"x": 4, "y": 3}]
 ```

 **When it applies:**

 - Array literals with struct elements: `[{...}, {...}]`
 - Nested arrays with structs: `[[{x: 1}, {x: 2}]]`

 ### 2. Array Construction from Columns

 When constructing arrays from table columns with different struct schemas:

 ```sql
 CREATE TABLE t_left (s struct(x int, y int)) AS VALUES ({x: 1, y: 2});
 CREATE TABLE t_right (s struct(y int, x int)) AS VALUES ({y: 3, x: 4});

 -- Dynamically constructs unified array schema
 SELECT [t_left.s, t_right.s] FROM t_left JOIN t_right;

 -- Result: [{"x": 1, "y": 2}, {"x": 4, "y": 3}]
 ```

 **When it applies:**

 - Array construction with column references: `[col1, col2]`
 - Array construction in joins with matching field names

 ### 3. UNION Operations

 When combining query results with different struct field orders:

 ```sql
 SELECT {a: 1, b: 2} as s
 UNION ALL
 SELECT {b: 3, a: 4} as s;

 -- Result: {"a": 1, "b": 2} and {"a": 4, "b": 3}
 ```

 **When it applies:**

 - UNION ALL with structs: field names matched across branches
 - UNION (deduplicated) with structs

 ### 4. Common Table Expressions (CTEs)

 When multiple CTEs produce structs with different field orders that are combined:

 ```sql
 WITH
   t1 AS (SELECT {a: 1, b: 2} as s),
   t2 AS (SELECT {b: 3, a: 4} as s)
 SELECT s FROM t1
 UNION ALL
 SELECT s FROM t2;

 -- Result: Field names matched across CTEs
 ```

 ### 5. VALUES Clauses

 When creating tables or temporary results with struct values in different field orders:

 ```sql
 CREATE TABLE t AS VALUES ({a: 1, b: 2}), ({b: 3, a: 4});

 -- Table schema unified: struct(a: int, b: int)
 -- Values: {a: 1, b: 2} and {a: 4, b: 3}
 ```

 ### 6. JOIN Operations

 When joining tables where the JOIN condition involves structs with different field orders:

 ```sql
 CREATE TABLE orders (customer struct(name varchar, id int));
 CREATE TABLE customers (info struct(id int, name varchar));

 -- Join matches struct fields by name
 SELECT * FROM orders
 JOIN customers ON orders.customer = customers.info;
 ```

 ### 7. Aggregate Functions

 When collecting structs with different field orders using aggregate functions like `array_agg`:

 ```sql
 SELECT array_agg(s) FROM (
   SELECT {x: 1, y: 2} as s
   UNION ALL
   SELECT {y: 3, x: 4} as s
 ) t
 GROUP BY category;

 -- Result: Array of structs with unified field order
 ```

 ### 8. Window Functions

 When using window functions with struct expressions having different field orders:

 ```sql
 SELECT
   id,
   row_number() over (partition by s order by id) as rn
 FROM (
   SELECT {category: 1, value: 10} as s, 1 as id
   UNION ALL
   SELECT {value: 20, category: 1} as s, 2 as id
 );

 -- Fields matched by name in PARTITION BY clause
 ```

 ## NULL Handling for Missing Fields

 When structs have different field sets, missing fields are filled with **NULL** values during coercion.

 ### Example: Partial Field Overlap

 ```sql
 -- Struct in first position has fields: a, b
 -- Struct in second position has fields: b, c
 -- Unified schema includes all fields: a, b, c

 SELECT [
   CAST({a: 1, b: 2} AS STRUCT(a INT, b INT, c INT)),
   CAST({b: 3, c: 4} AS STRUCT(a INT, b INT, c INT))
 ];

 -- Result:
 -- [
 --   {"a": 1, "b": 2, "c": NULL},
 --   {"a": NULL, "b": 3, "c": 4}
 -- ]
 ```

 ### Limitations

 **Field count must match exactly.** If structs have different numbers of fields and their field names don't completely overlap, the query will fail:

 ```sql
 -- This fails because field sets don't match:
 -- t_left has {x, y} but t_right has {x, y, z}
 SELECT [t_left.s, t_right.s] FROM t_left JOIN t_right;
 -- Error: Cannot coerce struct with mismatched field counts
 ```

 **Workaround: Use explicit CAST**

 To handle partial field overlap, explicitly cast structs to a unified schema:

 ```sql
 SELECT [
   CAST(t_left.s AS STRUCT(x INT, y INT, z INT)),
   CAST(t_right.s AS STRUCT(x INT, y INT, z INT))
 ] FROM t_left JOIN t_right;
 ```

 ## Migration Guide: From Positional to Name-Based Matching

 If you have existing code that relied on **positional** struct field matching, you may need to update it.

 ### Example: Query That Changes Behavior

 **Old behavior (positional):**

 ```sql
 -- These would have been positionally mapped (left-to-right)
 SELECT [{x: 1, y: 2}, {y: 3, x: 4}];
 -- Old result (positional): [{"x": 1, "y": 2}, {"y": 3, "x": 4}]
 ```

 **New behavior (name-based):**

 ```sql
 -- Now uses name-based matching
 SELECT [{x: 1, y: 2}, {y: 3, x: 4}];
 -- New result (by name): [{"x": 1, "y": 2}, {"x": 4, "y": 3}]
 ```

 ### Migration Steps

 1. **Review struct operations** - Look for queries that combine structs from different sources
 2. **Check field names** - Verify that field names match as expected (not positions)
 3. **Test with new coercion** - Run queries and verify the results match your expectations
 4. **Handle field reordering** - If you need specific field orders, use explicit CAST operations

 ### Using Explicit CAST for Compatibility

 If you need precise control over struct field order and types, use explicit `CAST`:

 ```sql
 -- Guarantee specific field order and types
 SELECT CAST({b: 3, a: 4} AS STRUCT(a INT, b INT));
 -- Result: {"a": 4, "b": 3}
 ```

 ## Best Practices

 ### 1. Be Explicit with Schema Definitions

 When joining or combining structs, define target schemas explicitly:

 ```sql
 -- Good: explicit schema definition
 SELECT CAST(data AS STRUCT(id INT, name VARCHAR, active BOOLEAN))
 FROM external_source;
 ```

 ### 2. Use Named Struct Constructors

 Prefer named struct constructors for clarity:

 ```sql
 -- Good: field names are explicit
 SELECT named_struct('id', 1, 'name', 'Alice', 'active', true);

 -- Or using struct literal syntax
 SELECT {id: 1, name: 'Alice', active: true};
 ```

 ### 3. Test Field Mappings

 Always verify that field mappings work as expected:

 ```sql
 -- Use arrow_typeof to verify unified schema
 SELECT arrow_typeof([{x: 1, y: 2}, {y: 3, x: 4}]);
 -- Result: List(Struct("x": Int32, "y": Int32))
 ```

 ### 4. Handle Partial Field Overlap Explicitly

 When combining structs with partial field overlap, use explicit CAST:

 ```sql
 -- Instead of relying on implicit coercion
 SELECT [
   CAST(left_struct AS STRUCT(x INT, y INT, z INT)),
   CAST(right_struct AS STRUCT(x INT, y INT, z INT))
 ];
 ```

 ### 5. Document Struct Schemas

 In complex queries, document the expected struct schemas:

 ```sql
 -- Expected schema: {customer_id: INT, name: VARCHAR, age: INT}
 SELECT {
   customer_id: c.id,
   name: c.name,
   age: c.age
 } as customer_info
 FROM customers c;
 ```

 ## Error Messages and Troubleshooting

 ### "Cannot coerce struct with different field counts"

 **Cause:** Trying to combine structs with different numbers of fields.

 **Solution:**

 ```sql
 -- Use explicit CAST to handle missing fields
 SELECT [
   CAST(struct1 AS STRUCT(a INT, b INT, c INT)),
   CAST(struct2 AS STRUCT(a INT, b INT, c INT))
 ];
 ```

 ### "Field X not found in struct"

 **Cause:** Referencing a field name that doesn't exist in the struct.

 **Solution:**

 ```sql
 -- Verify field names match exactly (case-sensitive)
 SELECT s['field_name'] FROM my_table;  -- Use bracket notation for access
 -- Or use get_field function
 SELECT get_field(s, 'field_name') FROM my_table;
 ```

 ### Unexpected NULL values after coercion

 **Cause:** Struct coercion added NULL for missing fields.

 **Solution:** Check that all structs have the required fields, or explicitly handle NULLs:

 ```sql
 SELECT COALESCE(s['field'], default_value) FROM my_table;
 ```

 ## Related Functions

 - `arrow_typeof()` - Returns the Arrow type of an expression
 - `struct()` / `named_struct()` - Creates struct values
 - `get_field()` - Extracts field values from structs
 - `CAST()` - Explicitly casts structs to specific schemas
	<!---
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->

	# Struct Type Coercion and Field Mapping

	DataFusion uses name-based field mapping when coercing struct types across different operations. This document explains how struct coercion works, when it applies, and how to handle NULL fields.

	## Overview: Name-Based vs Positional Mapping

	When combining structs from different sources (e.g., in UNION, array construction, or JOINs), DataFusion matches struct fields by name rather than by position. This provides more robust and predictable behavior compared to positional matching.

	### Example: Field Reordering is Handled Transparently

	```sql
	-- These two structs have the same fields in different order
	SELECT [{a: 1, b: 2}, {b: 3, a: 4}];

	-- Result: Field names matched, values unified
	-- [{"a": 1, "b": 2}, {"a": 4, "b": 3}]
	```

	## Coercion Paths Using Name-Based Matching

	The following query operations use name-based field mapping for struct coercion:

	### 1. Array Literal Construction

	When creating array literals with struct elements that have different field orders:

	```sql
	-- Structs with reordered fields in array literal
	SELECT [{x: 1, y: 2}, {y: 3, x: 4}];

	-- Unified type: List(Struct("x": Int32, "y": Int32))
	-- Values: [{"x": 1, "y": 2}, {"x": 4, "y": 3}]
	```

	When it applies:

	- Array literals with struct elements: `[{...}, {...}]`
	- Nested arrays with structs: `[[{x: 1}, {x: 2}]]`

	### 2. Array Construction from Columns

	When constructing arrays from table columns with different struct schemas:

	```sql
	CREATE TABLE t_left (s struct(x int, y int)) AS VALUES ({x: 1, y: 2});
	CREATE TABLE t_right (s struct(y int, x int)) AS VALUES ({y: 3, x: 4});

	-- Dynamically constructs unified array schema
	SELECT [t_left.s, t_right.s] FROM t_left JOIN t_right;

	-- Result: [{"x": 1, "y": 2}, {"x": 4, "y": 3}]
	```

	When it applies:

	- Array construction with column references: `[col1, col2]`
	- Array construction in joins with matching field names

	### 3. UNION Operations

	When combining query results with different struct field orders:

	```sql
	SELECT {a: 1, b: 2} as s
	UNION ALL
	SELECT {b: 3, a: 4} as s;

	-- Result: {"a": 1, "b": 2} and {"a": 4, "b": 3}
	```

	When it applies:

	- UNION ALL with structs: field names matched across branches
	- UNION (deduplicated) with structs

	### 4. Common Table Expressions (CTEs)

	When multiple CTEs produce structs with different field orders that are combined:

	```sql
	WITH
	t1 AS (SELECT {a: 1, b: 2} as s),
	t2 AS (SELECT {b: 3, a: 4} as s)
	SELECT s FROM t1
	UNION ALL
	SELECT s FROM t2;

	-- Result: Field names matched across CTEs
	```

	### 5. VALUES Clauses

	When creating tables or temporary results with struct values in different field orders:

	```sql
	CREATE TABLE t AS VALUES ({a: 1, b: 2}), ({b: 3, a: 4});

	-- Table schema unified: struct(a: int, b: int)
	-- Values: {a: 1, b: 2} and {a: 4, b: 3}
	```

	### 6. JOIN Operations

	When joining tables where the JOIN condition involves structs with different field orders:

	```sql
	CREATE TABLE orders (customer struct(name varchar, id int));
	CREATE TABLE customers (info struct(id int, name varchar));

	-- Join matches struct fields by name
	SELECT * FROM orders
	JOIN customers ON orders.customer = customers.info;
	```

	### 7. Aggregate Functions

	When collecting structs with different field orders using aggregate functions like `array_agg`:

	```sql
	SELECT array_agg(s) FROM (
	SELECT {x: 1, y: 2} as s
	UNION ALL
	SELECT {y: 3, x: 4} as s
	) t
	GROUP BY category;

	-- Result: Array of structs with unified field order
	```

	### 8. Window Functions

	When using window functions with struct expressions having different field orders:

	```sql
	SELECT
	id,
	row_number() over (partition by s order by id) as rn
	FROM (
	SELECT {category: 1, value: 10} as s, 1 as id
	UNION ALL
	SELECT {value: 20, category: 1} as s, 2 as id
	);

	-- Fields matched by name in PARTITION BY clause
	```

	## NULL Handling for Missing Fields

	When structs have different field sets, missing fields are filled with NULL values during coercion.

	### Example: Partial Field Overlap

	```sql
	-- Struct in first position has fields: a, b
	-- Struct in second position has fields: b, c
	-- Unified schema includes all fields: a, b, c

	SELECT [
	CAST({a: 1, b: 2} AS STRUCT(a INT, b INT, c INT)),
	CAST({b: 3, c: 4} AS STRUCT(a INT, b INT, c INT))
	];

	-- Result:
	-- [
	-- {"a": 1, "b": 2, "c": NULL},
	-- {"a": NULL, "b": 3, "c": 4}
	-- ]
	```

	### Limitations

	Field count must match exactly. If structs have different numbers of fields and their field names don't completely overlap, the query will fail:

	```sql
	-- This fails because field sets don't match:
	-- t_left has {x, y} but t_right has {x, y, z}
	SELECT [t_left.s, t_right.s] FROM t_left JOIN t_right;
	-- Error: Cannot coerce struct with mismatched field counts
	```

	Workaround: Use explicit CAST

	To handle partial field overlap, explicitly cast structs to a unified schema:

	```sql
	SELECT [
	CAST(t_left.s AS STRUCT(x INT, y INT, z INT)),
	CAST(t_right.s AS STRUCT(x INT, y INT, z INT))
	] FROM t_left JOIN t_right;
	```

	## Migration Guide: From Positional to Name-Based Matching

	If you have existing code that relied on positional struct field matching, you may need to update it.

	### Example: Query That Changes Behavior

	Old behavior (positional):

	```sql
	-- These would have been positionally mapped (left-to-right)
	SELECT [{x: 1, y: 2}, {y: 3, x: 4}];
	-- Old result (positional): [{"x": 1, "y": 2}, {"y": 3, "x": 4}]
	```

	New behavior (name-based):

	```sql
	-- Now uses name-based matching
	SELECT [{x: 1, y: 2}, {y: 3, x: 4}];
	-- New result (by name): [{"x": 1, "y": 2}, {"x": 4, "y": 3}]
	```

	### Migration Steps

	1. Review struct operations - Look for queries that combine structs from different sources
	2. Check field names - Verify that field names match as expected (not positions)
	3. Test with new coercion - Run queries and verify the results match your expectations
	4. Handle field reordering - If you need specific field orders, use explicit CAST operations

	### Using Explicit CAST for Compatibility

	If you need precise control over struct field order and types, use explicit `CAST`:

	```sql
	-- Guarantee specific field order and types
	SELECT CAST({b: 3, a: 4} AS STRUCT(a INT, b INT));
	-- Result: {"a": 4, "b": 3}
	```

	## Best Practices

	### 1. Be Explicit with Schema Definitions

	When joining or combining structs, define target schemas explicitly:

	```sql
	-- Good: explicit schema definition
	SELECT CAST(data AS STRUCT(id INT, name VARCHAR, active BOOLEAN))
	FROM external_source;
	```

	### 2. Use Named Struct Constructors

	Prefer named struct constructors for clarity:

	```sql
	-- Good: field names are explicit
	SELECT named_struct('id', 1, 'name', 'Alice', 'active', true);

	-- Or using struct literal syntax
	SELECT {id: 1, name: 'Alice', active: true};
	```

	### 3. Test Field Mappings

	Always verify that field mappings work as expected:

	```sql
	-- Use arrow_typeof to verify unified schema
	SELECT arrow_typeof([{x: 1, y: 2}, {y: 3, x: 4}]);
	-- Result: List(Struct("x": Int32, "y": Int32))
	```

	### 4. Handle Partial Field Overlap Explicitly

	When combining structs with partial field overlap, use explicit CAST:

	```sql
	-- Instead of relying on implicit coercion
	SELECT [
	CAST(left_struct AS STRUCT(x INT, y INT, z INT)),
	CAST(right_struct AS STRUCT(x INT, y INT, z INT))
	];
	```

	### 5. Document Struct Schemas

	In complex queries, document the expected struct schemas:

	```sql
	-- Expected schema: {customer_id: INT, name: VARCHAR, age: INT}
	SELECT {
	customer_id: c.id,
	name: c.name,
	age: c.age
	} as customer_info
	FROM customers c;
	```

	## Error Messages and Troubleshooting

	### "Cannot coerce struct with different field counts"

	Cause: Trying to combine structs with different numbers of fields.

	Solution:

	```sql
	-- Use explicit CAST to handle missing fields
	SELECT [
	CAST(struct1 AS STRUCT(a INT, b INT, c INT)),
	CAST(struct2 AS STRUCT(a INT, b INT, c INT))
	];
	```

	### "Field X not found in struct"

	Cause: Referencing a field name that doesn't exist in the struct.

	Solution:

	```sql
	-- Verify field names match exactly (case-sensitive)
	SELECT s['field_name'] FROM my_table; -- Use bracket notation for access
	-- Or use get_field function
	SELECT get_field(s, 'field_name') FROM my_table;
	```

	### Unexpected NULL values after coercion

	Cause: Struct coercion added NULL for missing fields.

	Solution: Check that all structs have the required fields, or explicitly handle NULLs:

	```sql
	SELECT COALESCE(s['field'], default_value) FROM my_table;
	```

	## Related Functions

	- `arrow_typeof()` - Returns the Arrow type of an expression
	- `struct()` / `named_struct()` - Creates struct values
	- `get_field()` - Extracts field values from structs
	- `CAST()` - Explicitly casts structs to specific schemas