| --- |
| title: Reading and Writing Fixed-Width Text Data in an Object Store |
| --- |
| |
| The PXF object store connectors support reading and writing fixed-width text using the Greenplum Database [fixed width custom formatter](https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/admin_guide-load-topics-g-importing-and-exporting-fixed-width-data.html). This section describes how to use PXF to access fixed-width text, including how to create, query, and insert data into an external table that references files in the object store. |
| |
| **Note**: Accessing fixed-width text data from an object store is very similar to accessing such data in HDFS. |
| |
| ## <a id="prereq"></a>Prerequisites |
| |
| Ensure that you have met the PXF Object Store [Prerequisites](access_objstore.html#objstore_prereq) before you attempt to read data from or write data to an object store. |
| |
| ## <a id="profile_fixedwidth"></a>Reading Text Data with Fixed Widths |
| |
| Use the `<objstore>:fixedwidth` profile when you read fixed-width text from an object store where each line is a single record. PXF supports the following `<objstore>` profile prefixes: |
| |
| | Object Store | Profile Prefix | |
| |-------|-------------------------------------| |
| | Azure Blob Storage | wasbs | |
| | Azure Data Lake Storage Gen2 | abfss | |
| | Google Cloud Storage | gs | |
| | MinIO | s3 | |
| | AWS S3 | s3 | |
| |
| The following syntax creates a Greenplum Database readable external table that references such a text file in an object store: |
| |
| ``` sql |
| CREATE EXTERNAL TABLE <table_name> |
| ( <column_name> <data_type> [, ...] | LIKE <other_table> ) |
| LOCATION ('pxf://<path-to-file>?PROFILE=<objstore>:fixedwidth[&SERVER=<server_name>][&NEWLINE=<bytecode>][&IGNORE_MISSING_PATH=<boolean>]') |
| FORMAT 'CUSTOM' (FORMATTER='fixedwidth_in', <field_name>='<width>' [, ...] [, line_delim[=|<space>][E]'<delim_value>']); |
| ``` |
| |
| The specific keywords and values used in the Greenplum Database [CREATE EXTERNAL TABLE](https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/ref_guide-sql_commands-CREATE_EXTERNAL_TABLE.html) command are described in the table below. |
| |
| | Keyword | Value | |
| |-------|-------------------------------------| |
| | \<path‑to‑file\> | The path to the directory or file in the object store. When the `<server_name>` configuration includes a [`pxf.fs.basePath`](cfg_server.html#pxf-fs-basepath) property setting, PXF considers \<path‑to‑file\> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. \<path‑to‑file\> must not specify a relative path nor include the dollar sign (`$`) character. | |
| | PROFILE=\<objstore\>:fixedwidth | The `PROFILE` must identify the specific object store. For example, `s3:fixedwidth`. | |
| | SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. PXF uses the `default` server if not specified. | |
| | NEWLINE=\<bytecode\> | When the `line_delim` formatter option contains `\r`, `\r\n`, or a set of custom escape characters, you must set `<bytecode>` to `CR`, `CRLF`, or the set of bytecode characters, respectively. | |
| | IGNORE_MISSING_PATH=\<boolean\> | Specify the action to take when \<path-to-file\> is missing or invalid. The default value is `false`, PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. | |
| | FORMAT 'CUSTOM' | Use `FORMAT` '`CUSTOM`' with `FORMATTER='fixedwidth_in'` (read). | |
| | \<field_name>='\<width>' | The name and the width of the field. For example: `first_name='15'` specifies that the `first_name` field is `15` characters long. By default, when the field value is less than `<width>` size, Greenplum Database expects the field to be right-padded with spaces to that size. | |
| | line_delim | The line delimiter character in the data. Preface the \<delim_value\> with an `E` when the value is an escape sequence. Examples: `line_delim=E'\n'`, `line_delim 'aaa'`. The default value is `'\n'`.| |
| |
| **Note**: PXF does not support the `(HEADER)` formatter option in the `CREATE EXTERNAL TABLE` command. |
| |
| If you are accessing an S3 object store, you can provide S3 credentials via custom options in the `CREATE EXTERNAL TABLE` command as described in [Overriding the S3 Server Configuration for External Tables](access_s3.html#s3_override_ext_ext). |
| |
| ## <a id="about_fields"></a>About Specifying field_name and width |
| |
| Greenplum Database loads all fields in a line of fixed-width data in their physical order. The `<field_name>`s that you specify in the `FORMAT` options must match the order that you define the columns in the `CREATE [WRITABLE] EXTERNAL TABLE` command. You specify the size of each field in the `<width>` value. |
| |
| Refer to the Greenplum Database [fixed width custom formatter documentation](https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/admin_guide-load-topics-g-importing-and-exporting-fixed-width-data.html) for more information about the formatter options. |
| |
| |
| ## <a id="about_lineend"></a>About the line_delim and NEWLINE Formatter Options |
| |
| By default, Greenplum Database uses the `\n` (LF) character for the new line delimiter. When the line delimiter for the external file is also `\n`, you need not specify the `line_delim` option. If the `line_delim` formatter option is provided and contains `\r` (CR), `\r\n` (CRLF), or a set of custom escape characters, you must specify the `NEWLINE` option in the external table `LOCATION` clause, and set the value to `CR`, `CRLF`, or the set of bytecode characters, respectively. |
| |
| Refer to the Greenplum Database [fixed width custom formatter documentation](https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/admin_guide-load-topics-g-importing-and-exporting-fixed-width-data.html) for more information about the formatter options. |
| |
| ## <a id="fixedwidth_read_example"></a>Example: Reading Fixed-Width Text Data on S3 |
| |
| Perform the following procedure to create a sample text file, copy the file to S3, and use the `s3:fixedwidth` profile to create a PXF external table to query the data. |
| |
| To run this example, you must: |
| |
| - Have the AWS CLI tools installed on your system |
| - Know your AWS access ID and secret key |
| - Have write permission to an S3 bucket |
| |
| Procedure: |
| |
| 1. Create a directory in S3 for PXF example data files. For example, if you have write access to an S3 bucket named `BUCKET`: |
| |
| ``` shell |
| $ aws s3 mb s3://BUCKET/pxf_examples |
| ``` |
| 1. Locally create a plain text data file named `pxf_s3_fixedwidth.txt`: |
| |
| ``` shell |
| $ echo 'Prague Jan 101 4875.33 |
| Rome Mar 87 1557.39 |
| Bangalore May 317 8936.99 |
| Beijing Jul 411 11600.67 ' > /tmp/pxf_s3_fixedwidth.txt |
| ``` |
| |
| In this sample file, the first field is 15 characters long, the second is 4 characters, the third is 6 characters, and the last field is 10 characters long. |
| |
| > **Note** Open the `/tmp/pxf_s3_fixedwidth.txt` file in the editor of your choice, and ensure that the last field is right-padded with spaces to 10 characters in size. |
| |
| 1. Copy the data file to the S3 directory that you created in Step 1: |
| |
| ``` shell |
| $ aws s3 cp /tmp/pxf_s3_fixedwidth.txt s3://BUCKET/pxf_examples/ |
| ``` |
| |
| 1. Verify that the file now resides in S3: |
| |
| ``` shell |
| $ aws s3 ls s3://BUCKET/pxf_examples/pxf_s3s_fixedwidth.txt |
| ``` |
| |
| 1. Start the `psql` subsystem: |
| |
| ``` shell |
| $ psql -d postgres |
| ``` |
| |
| 1. Use the PXF `s3:fixedwidth` profile to create a Greenplum Database external table that references the `pxf_s3_fixedwidth.txt` file that you just created and added to S3. For example, if your server name is `s3srvcfg`: |
| |
| ``` sql |
| postgres=# CREATE EXTERNAL TABLE pxf_s3_fixedwidth_r(location text, month text, num_orders int, total_sales float8) |
| LOCATION ('pxf://data/pxf_examples/pxf_s3_fixedwidth.txt?PROFILE=s3:fixedwidth&SERVER=s3srvcfg&NEWLINE=CRLF') |
| FORMAT 'CUSTOM' (formatter='fixedwidth_in', location='15', month='4', num_orders='6', total_sales='10', line_delim=E'\r\n'); |
| ``` |
| |
| 2. Query the external table: |
| |
| ``` sql |
| postgres=# SELECT * FROM pxf_s3_fixedwidth_r; |
| ``` |
| |
| ``` pre |
| location | month | num_orders | total_sales |
| ---------------+-------+------------+------------- |
| Prague | Jan | 101 | 4875.33 |
| Rome | Mar | 87 | 1557.39 |
| Bangalore | May | 317 | 8936.99 |
| Beijing | Jul | 411 | 11600.67 |
| (4 rows) |
| ``` |
| |
| ## <a id="s3write_text"></a>Writing Fixed-Width Text Data |
| |
| The `<objstore>:fixedwidth` profiles support writing fixed-width text to an object store. When you create a writable external table with PXF, you specify the name of a directory. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified. |
| |
| **Note**: External tables that you create with a writable profile can only be used for `INSERT` operations. If you want to query the data that you inserted, you must create a separate readable external table that references the directory. |
| |
| Use the following syntax to create a Greenplum Database writable external table that references an object store directory: |
| |
| ``` sql |
| CREATE WRITABLE EXTERNAL TABLE <table_name> |
| ( <column_name> <data_type> [, ...] | LIKE <other_table> ) |
| LOCATION ('pxf://<path-to-dir> |
| ?PROFILE=<objstore>:fixedwidth[&SERVER=<server_name>][&NEWLINE=<bytecode>][&<write-option>=<value>[...]]') |
| FORMAT 'CUSTOM' (FORMATTER='fixedwidth_out' [, <field_name>='<width>'] [, ...] [, line_delim[=|<space>][E]'<delim_value>']); |
| [DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY]; |
| ``` |
| |
| The specific keywords and values used in the [CREATE EXTERNAL TABLE](https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/ref_guide-sql_commands-CREATE_EXTERNAL_TABLE.html) command are described in the table below. |
| |
| | Keyword | Value | |
| |-------|-------------------------------------| |
| | \<path‑to‑dir\> | The path to the directory in the data store. When the `<server_name>` configuration includes a [`pxf.fs.basePath`](cfg_server.html#pxf-fs-basepath) property setting, PXF considers \<path‑to‑dir\> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. \<path‑to‑dir\> must not specify a relative path nor include the dollar sign (`$`) character. | |
| | PROFILE=\<objstore\>:fixedwidth | The `PROFILE` must identify the specific object store. For example, `s3:fixedwidth`. | |
| | SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. PXF uses the `default` server if not specified. | |
| | NEWLINE=\<bytecode\> | When the `line_delim` formatter option contains `\r`, `\r\n`, or a set of custom escape characters, you must set `<bytecode>` to `CR`, `CRLF`, or the set of bytecode characters, respectively. | |
| | \<write‑option\>=\<value\> | \<write-option\>s are described below.| |
| | FORMAT 'CUSTOM' | Use `FORMAT` '`CUSTOM`' with `FORMATTER='fixedwidth_out'` (write). | |
| | \<field_name>='\<width>' | The name and the width of the field. For example: `first_name='15'` specifies that the `first_name` field is `15` characters long. By default, when writing to the external file and the field value is less than `<width>` size, Greenplum Database right-pads the field with spaces to `<width>` size. | |
| | line_delim | The line delimiter character in the data. Preface the \<delim_value\> with an `E` when the value is an escape sequence. Examples: `line_delim=E'\n'`, `line_delim 'aaa'`. The default value is `'\n'`. | |
| | DISTRIBUTED BY | If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or `<column_name>` on both tables. Doing so will avoid extra motion of data between segments on the load operation. | |
| |
| Writable external tables that you create using the `<objstore>:fixedwidth` profile can optionally use record or block compression. You specify the compression codec via an option in the `CREATE WRITABLE EXTERNAL TABLE` `LOCATION` clause: |
| |
| | Write Option | Value Description | |
| |-------|-------------------------------------| |
| | COMPRESSION_CODEC | The compression codec alias. Supported compression codecs for writing fixed-width text data include: `default`, `bzip2`, `gzip`, and `uncompressed`. If this option is not provided, Greenplum Database performs no data compression. | |
| |
| ## <a id="fixedwidth_write_example"></a>Example: Writing Fixed-Width Text Data to S3 |
| |
| This example utilizes the data schema introduced in [Example: Reading Fixed-Width Text Data on S3](#fixedwidth_read_example). |
| |
| | Column Name | Width | Data Type | |
| |-------|-------------------------------------| |
| | location | 15 | text | |
| | month | 4 | text | |
| | number_of_orders | 6 | int | |
| | total_sales | 10 | float8 | |
| |
| #### <a id="fixedwidth_write_proc" class="no-quick-link"></a>Procedure |
| |
| Perform the following procedure to create a Greenplum Database writable external table utilizing the same data schema as described above. You will use the PXF `s3:fixedwidth` profile to write data to S3. You will also create a separate, readable external table to read the data that you wrote to #3. |
| |
| 1. Create a Greenplum Database writable external table utilizing the data schema described above. Write to the S3 directory `BUCKET/pxf_examples/fixedwidth_write`. Create the table specifying `\n` as the line delimiter. For example, if your server name is `s3srvcfg`: |
| |
| ``` sql |
| postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_s3_fixedwidth_w(location text, month text, num_orders int, total_sales float8) |
| LOCATION ('pxf://BUCKET/pxf_examples/fixedwidth_write?PROFILE=s3:fixedwidth&SERVER=s3srvcfg') |
| FORMAT 'CUSTOM' (formatter='fixedwidth_out', location='15', month='4', num_orders='6', total_sales='10'); |
| ``` |
| |
| 2. Write a few individual records to the `fixedwidth_write` S3 directory by using the `INSERT` command on the `pxf_s3_fixedwidth_w` table: |
| |
| ``` sql |
| postgres=# INSERT INTO pxf_s3_fixedwidth_w VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 ); |
| postgres=# INSERT INTO pxf_s3_fixedwidth_w VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 ); |
| ``` |
| |
| 5. Greenplum Database does not support directly querying a writable external table. To query the data that you just added to S3, you must create a readable external Greenplum Database table that references the S3 directory: |
| |
| ``` sql |
| postgres=# CREATE EXTERNAL TABLE pxf_s3_fixedwidth_r2(location text, month text, num_orders int, total_sales float8) |
| LOCATION ('pxf://BUCKET/pxf_examples/fixedwidth_write?PROFILE=s3:fixedwidth&SERVER=s3srvcfg') |
| FORMAT 'CUSTOM' (formatter='fixedwidth_in', location='15', month='4', num_orders='6', total_sales='10'); |
| ``` |
| |
| 6. Query the readable external table: |
| |
| ``` sql |
| postgres=# SELECT * FROM pxf_s3_fixedwidth_r2 ORDER BY total_sales; |
| ``` |
| |
| ``` pre |
| location | month | num_orders | total_sales |
| -----------+-------+------------+------------- |
| Frankfurt | Mar | 777 | 3956.98 |
| Cleveland | Oct | 3812 | 96645.37 |
| (2 rows) |
| ``` |
| |