External Scan in Apache Cloudberry
===================================
External scan is used to scan a special kind of database table called an
"external table". Unlike an ordinary database table, the data of an external
table lives outside the database directories. It can be in a flat file in some
other directory or come from an http server, for example, and the external
access layer is in charge of reaching that data source and interpreting it. A
user issuing a query against an external table should not notice the difference
between an external table and a base table, as the query returns rows as if
they came from a regular table.
One of the main differences in the access layer is the inability to have indexes
or issue any DML on an external table. This is because the data is not really
inside the database directory and is not in the form of tuples (it is plain
text).
Data access
-----------
The external access layer may get the table data from three different sources
(the actual source is defined when the external table is created with the CREATE
EXTERNAL TABLE command):
1. Data is in a file on the server.
2. Data is served by an http server.
3. Data is served by a gpfdist process (basically a special-case http server).
We use libcurl to access the data. If the data is in a file on the server, the
file is opened with fopen(); if it comes from an http server, other libcurl
functions issue an http request to that server and read the response as if it
were a file.
See open_external_file() in fileam.c for specifics.
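
As a rough illustration of the three sources listed above, the following sketch
classifies a location string by its URL scheme. It is not the logic of
open_external_file(). The enum, the function names, and the assumption that
the sources are distinguished by the prefixes "file://", "http://" and
"gpfdist://" are all hypothetical and used here only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical source kinds, mirroring the three sources listed above. */
typedef enum
{
    EXT_SOURCE_FILE,
    EXT_SOURCE_HTTP,
    EXT_SOURCE_GPFDIST,
    EXT_SOURCE_UNKNOWN
} ExtSourceKind;

/* Return nonzero if 'url' begins with 'prefix'. */
static int
has_prefix(const char *url, const char *prefix)
{
    return strncmp(url, prefix, strlen(prefix)) == 0;
}

/*
 * Map a location string to a source kind. The scheme prefixes are an
 * assumption made for this sketch.
 */
ExtSourceKind
classify_location(const char *url)
{
    assert(url != NULL);

    if (has_prefix(url, "file://"))
        return EXT_SOURCE_FILE;
    if (has_prefix(url, "http://"))
        return EXT_SOURCE_HTTP;
    if (has_prefix(url, "gpfdist://"))
        return EXT_SOURCE_GPFDIST;
    return EXT_SOURCE_UNKNOWN;
}
```

A caller could then pick fopen() for EXT_SOURCE_FILE and an http request for
the other two kinds.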
Data processing
---------------
Data is retrieved from the data source via the libcurl-based url_fread()
function. It is then parsed in a way that is identical to how the COPY command
parses data. When a complete row is parsed together with its attributes, it is
converted to a heap tuple and passed up to the next node.
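
To make the row-parsing step more concrete, here is a minimal sketch that
splits one text row into attributes on a single-character delimiter. It is a
deliberate simplification and not the COPY parser: quoting, escaping and NULL
markers are ignored. The function name split_row() and its signature are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split 'line' in place on 'delim' and store a pointer to each attribute in
 * 'attrs'. Returns the number of attributes found, or -1 if the row holds
 * more than 'max_attrs' attributes.
 */
int
split_row(char *line, char delim, char **attrs, int max_attrs)
{
    int   nattrs = 0;
    char *cur = line;

    assert(line != NULL && attrs != NULL);
    assert(max_attrs > 0 && delim != '\0');

    for (;;)
    {
        char *sep;

        if (nattrs == max_attrs)
            return -1;          /* too many attributes in this row */

        attrs[nattrs++] = cur;
        sep = strchr(cur, delim);
        if (sep == NULL)
            break;              /* last attribute reached */
        *sep = '\0';            /* terminate the current attribute */
        cur = sep + 1;
    }
    return nattrs;
}
```

Each attribute string would then be handed to the input function of its
column type before the tuple is formed.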
Process
-------
This is roughly what the call stack looks like, with the final result at the top
and each step relying on the one listed below it:
return new tuple
  --> form a tuple
    --> convert row attributes to internal format with input functions
      --> parse a data row according to COPY rules
        --> get data from file
          --> open external file (or remote source)
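
The steps above can be sketched as a single "fetch next tuple" routine. This is
a toy model under stated assumptions, not the code in fileam.c: the source is a
local FILE that the caller has already opened, rows have exactly two
'|'-separated attributes, lines are shorter than 256 bytes, and strtol() stands
in for a type input function. DemoTuple and demo_next_tuple() are hypothetical
names.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical two-column "tuple": (id int, name text). */
typedef struct
{
    int  id;
    char name[64];
} DemoTuple;

/*
 * Fetch the next tuple from an already opened source. Returns 1 and fills
 * 'tup' on success, 0 at end of data, and -1 on a malformed row.
 */
int
demo_next_tuple(FILE *src, DemoTuple *tup)
{
    char  line[256];
    char *sep;
    char *end;
    long  id;

    assert(src != NULL && tup != NULL);

    /* get data from file */
    if (fgets(line, sizeof line, src) == NULL)
        return 0;
    line[strcspn(line, "\r\n")] = '\0';

    /* parse a data row: exactly two attributes are expected */
    sep = strchr(line, '|');
    if (sep == NULL || strchr(sep + 1, '|') != NULL)
        return -1;
    *sep = '\0';

    /* convert row attributes to internal format */
    errno = 0;
    id = strtol(line, &end, 10);
    if (end == line || *end != '\0' || errno == ERANGE)
        return -1;
    if (id < INT_MIN || id > INT_MAX)
        return -1;
    if (strlen(sep + 1) >= sizeof tup->name)
        return -1;

    /* form a tuple and return it to the caller */
    tup->id = (int) id;
    strcpy(tup->name, sep + 1);
    return 1;
}
```

A scan node would call such a routine repeatedly until it reports end of data.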
Errors
------
A query against an external table fails when the access layer cannot parse the
user data because of "bad data" (e.g., a data row with too many attributes). In
that case a COPY-style error is reported and the external scan is cancelled.
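
As an illustration of the "too many attributes" case, the sketch below checks
the attribute count of one parsed row and builds an error message. The function
check_attr_count() is hypothetical, and the message wording is only modeled on
the style of COPY errors; the actual text reported by the external scan may
differ.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Check the attribute count of one parsed row. Returns 0 if the row is
 * well-formed. Otherwise writes a message into 'errbuf' and returns -1, at
 * which point a caller would report the error and stop the scan.
 */
int
check_attr_count(int found, int expected, long lineno,
                 char *errbuf, size_t errbuf_len)
{
    assert(errbuf != NULL && errbuf_len > 0);

    if (found > expected)
    {
        snprintf(errbuf, errbuf_len,
                 "line %ld: extra data after last expected column", lineno);
        return -1;
    }
    if (found < expected)
    {
        snprintf(errbuf, errbuf_len,
                 "line %ld: missing data for column %d", lineno, found + 1);
        return -1;
    }

    errbuf[0] = '\0';           /* no error */
    return 0;
}
```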