Real-world data is often imperfect, containing formatting errors, missing values, or inconsistent data types. Apache Cloudberry provides robust error handling mechanisms that allow you to load correctly formatted data while isolating and managing problematic rows, ensuring your ETL processes are resilient and reliable.
By default, if external table data contains any error, the entire load operation fails and no data is loaded. With error handling enabled, you can load valid data and deal with problematic rows separately.
Apache Cloudberry supports two error handling approaches:
Enable single row error isolation by adding SEGMENT REJECT LIMIT to your external table definition:
CREATE EXTERNAL TABLE sales_data_with_errors ( transaction_id int, product_name text, sale_date date, amount decimal(10,2) ) LOCATION ('gpfdist://etl-server:8081/sales/*.csv') FORMAT 'CSV' (HEADER) SEGMENT REJECT LIMIT 100;
This configuration allows up to 100 rows with errors per segment before the operation fails.
You can also specify error limits as a percentage:
CREATE EXTERNAL TABLE sales_data ( transaction_id int, product_name text, sale_date date, amount decimal(10,2) ) LOCATION ('gpfdist://etl-server:8081/sales/*.csv') FORMAT 'CSV' (HEADER) SEGMENT REJECT LIMIT 5 PERCENT;
This allows up to 5% of rows to contain errors before failing.
Use LOG ERRORS to capture detailed information about rejected rows:
CREATE EXTERNAL TABLE sales_data_logged ( transaction_id int, product_name text, sale_date date, amount decimal(10,2) ) LOCATION ('gpfdist://etl-server:8081/sales/*.csv') FORMAT 'CSV' (HEADER) LOG ERRORS SEGMENT REJECT LIMIT 50;
For long-term error analysis, use persistent error logging:
CREATE EXTERNAL TABLE sales_data_persistent ( transaction_id int, product_name text, sale_date date, amount decimal(10,2) ) LOCATION ('gpfdist://etl-server:8081/sales/*.csv') FORMAT 'CSV' (HEADER) LOG ERRORS PERSISTENTLY SEGMENT REJECT LIMIT 25;
When error logging is enabled, Apache Cloudberry creates error log tables that you can query:
-- View recent errors from the current session SELECT * FROM gp_read_error_log('sales_data_logged');
The error log contains these columns:
| Column | Description |
|---|---|
cmdtime | Timestamp when the error occurred |
relname | Name of the external table |
filename | Source file containing the error |
linenum | Line number in the source file |
bytenum | Byte position in the source file |
errmsg | Error message description |
rawdata | Raw data that caused the error |
rawbytes | Raw bytes of the problematic data |
-- Find the most common error types SELECT errmsg, COUNT(*) as error_count FROM gp_read_error_log('sales_data_logged') GROUP BY errmsg ORDER BY error_count DESC; -- View specific error details SELECT cmdtime, filename, linenum, errmsg, rawdata FROM gp_read_error_log('sales_data_logged') WHERE errmsg LIKE '%invalid input syntax%' ORDER BY cmdtime DESC;
Error: invalid input syntax for type numeric Cause: Non-numeric data in numeric columns Solution: Clean data or use text columns with post-processing
-- Original problematic data: "N/A" in amount column -- Solution: Use text type and handle conversion later CREATE EXTERNAL TABLE sales_flexible ( transaction_id int, product_name text, sale_date text, -- Use text for flexible parsing amount text -- Use text to handle "N/A" values ) LOCATION ('gpfdist://etl-server:8081/sales/*.csv') FORMAT 'CSV' (HEADER) LOG ERRORS SEGMENT REJECT LIMIT 10 PERCENT;
Error: invalid input syntax for type date
Cause: Inconsistent date formats Solution: Standardize date formats or use flexible parsing
-- Handle multiple date formats in post-processing SELECT transaction_id, product_name, CASE WHEN sale_date ~ '^\d{4}-\d{2}-\d{2}$' THEN sale_date::date WHEN sale_date ~ '^\d{2}/\d{2}/\d{4}$' THEN to_date(sale_date, 'MM/DD/YYYY') ELSE NULL END as parsed_date, amount::decimal(10,2) FROM sales_flexible WHERE sale_date IS NOT NULL;
Error: extra data after last expected column
Cause: Inconsistent number of columns Solution: Use more flexible table definition
-- Add extra optional columns to handle variable column counts CREATE EXTERNAL TABLE flexible_sales ( transaction_id int, product_name text, sale_date date, amount decimal(10,2), extra_field1 text, -- Optional fields extra_field2 text, extra_field3 text ) LOCATION ('gpfdist://etl-server:8081/sales/*.csv') FORMAT 'CSV' (HEADER) LOG ERRORS SEGMENT REJECT LIMIT 20 PERCENT;
Error: invalid byte sequence Cause: Character encoding mismatch Solution: Specify correct encoding
CREATE EXTERNAL TABLE encoded_data ( id int, description text ) LOCATION ('gpfdist://etl-server:8081/data/*.txt') FORMAT 'TEXT' (DELIMITER '|') ENCODING 'LATIN1' -- Specify encoding LOG ERRORS SEGMENT REJECT LIMIT 5 PERCENT;
Use a staging approach for complex data cleaning:
-- Phase 1: Load into staging table with flexible types CREATE EXTERNAL TABLE sales_staging ( transaction_id text, product_name text, sale_date text, amount text, raw_line text -- Store entire row for complex cases ) LOCATION ('gpfdist://etl-server:8081/sales/*.csv') FORMAT 'CSV' LOG ERRORS SEGMENT REJECT LIMIT 20 PERCENT; -- Phase 2: Clean and insert into final table INSERT INTO sales_final (transaction_id, product_name, sale_date, amount) SELECT transaction_id::int, product_name, sale_date::date, amount::decimal(10,2) FROM sales_staging WHERE transaction_id ~ '^\d+$' -- Validate numeric ID AND sale_date ~ '^\d{4}-\d{2}-\d{2}$' -- Validate date format AND amount ~ '^\d+\.?\d*$'; -- Validate amount format
Set up monitoring for error rates:
-- Function to check error rate CREATE OR REPLACE FUNCTION check_error_rate(table_name text, threshold_percent numeric) RETURNS boolean AS $$ DECLARE error_count int; total_count int; error_rate numeric; BEGIN SELECT COUNT(*) INTO error_count FROM gp_read_error_log(table_name); -- Estimate total processed rows (depends on your tracking method) SELECT reltuples INTO total_count FROM pg_class WHERE relname = table_name; IF total_count > 0 THEN error_rate := (error_count::numeric / total_count) * 100; RETURN error_rate <= threshold_percent; END IF; RETURN true; END; $$ LANGUAGE plpgsql; -- Usage SELECT check_error_rate('sales_data_logged', 5.0); -- Check if error rate is under 5%
| Data Quality | Suggested Reject Limit | Use Case |
|---|---|---|
| High quality | 1-10 rows | Production systems |
| Medium quality | 1-5% | Development/testing |
| Poor quality | 10-20% | Initial data exploration |
| Unknown quality | 50% | Data discovery phase |
If you are experiencing high error rates:
If error handling is impacting performance: