 External Scan in Apache Cloudberry
 ===================================

 External scan is used to scan a special kind of database table called an
 "external table". Unlike an ordinary database table, the data of an external
 table lives outside the database directories. It can be in a flat file in some
 other directory or come from an HTTP server, for example, and the external
 access layer is in charge of reaching that data source and interpreting it. A
 user issuing a query against an external table should not notice any difference
 between an external table and a base table: the query returns rows as if they
 came from a regular table.

 One of the main differences in the access layer is that an external table
 cannot have indexes and cannot be the target of DML. This is because the data
 is not stored inside the database directory and is not in the form of tuples
 (it is plain text).


 Data access
 -----------

 The external access layer may get the table data from three different sources
 (the actual source is defined when the external table is created with the
 CREATE EXTERNAL TABLE command):

  1. The data is in a file on the server.
  2. The data is served by an HTTP server.
  3. The data is served by a gpfdist process (basically a special-case HTTP
     server).

 We use libcurl to access the data. If the data is in a file on the server,
 libcurl will open it with fopen(); if the data comes from an HTTP server, other
 libcurl functions issue an HTTP request to that server and read the response as
 if it were a file.

 See open_external_file() in fileam.c for specifics.
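
 As a rough illustration of the three sources listed above, the sketch below
 classifies a location string by its URL prefix. It assumes the location is
 written as a URL starting with file://, http:// or gpfdist://. The names
 ExtSourceKind, has_prefix() and classify_location() are hypothetical and are
 not the identifiers used in fileam.c.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names: these are not the identifiers used in fileam.c. */
typedef enum {
    SRC_FILE,      /* 1. a file on the server */
    SRC_HTTP,      /* 2. an http server */
    SRC_GPFDIST,   /* 3. a gpfdist process */
    SRC_UNKNOWN    /* anything else is rejected by this sketch */
} ExtSourceKind;

/* Return 1 if url begins with prefix, 0 otherwise. */
static int has_prefix(const char *url, const char *prefix)
{
    return strncmp(url, prefix, strlen(prefix)) == 0;
}

/* Map a location URL to one of the three sources listed above. */
ExtSourceKind classify_location(const char *url)
{
    assert(url != NULL);
    if (has_prefix(url, "file://"))
        return SRC_FILE;
    if (has_prefix(url, "http://"))
        return SRC_HTTP;
    if (has_prefix(url, "gpfdist://"))
        return SRC_GPFDIST;
    return SRC_UNKNOWN;
}
```

 A caller would switch on the returned kind to pick the matching open and read
 routines; the real dispatch lives behind open_external_file().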


 Data processing
 ---------------

 Data is retrieved from the data source via the libcurl-based url_fread()
 function. It is then parsed exactly as the COPY command parses data. Once a
 complete row and its attributes have been parsed, the row is converted to a
 heap tuple and propagated up to the next node.
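
 To make the row parsing step more concrete, here is a minimal sketch that
 splits one line of text into attributes on a single-character delimiter. It is
 a deliberate simplification: the real COPY rules also cover escapes, NULL
 markers and CSV quoting, none of which are handled here. The function name
 split_row() and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split one line into attributes, in place, on a single-character delimiter.
 * Each delimiter is overwritten with '\0' so every attribute becomes its own
 * C string. Up to max_attrs pointers are stored in attrs. The return value is
 * the number of attributes found in the line, which may exceed max_attrs.
 */
int split_row(char *line, char delim, char **attrs, int max_attrs)
{
    int   n = 0;
    char *start = line;
    char *p;

    assert(line != NULL && attrs != NULL);
    for (p = line; ; p++) {
        if (*p == delim || *p == '\0') {
            int at_end = (*p == '\0');

            *p = '\0';              /* terminate the current attribute */
            if (n < max_attrs)
                attrs[n] = start;   /* remember where it starts */
            n++;
            if (at_end)
                break;
            start = p + 1;          /* next attribute begins after delim */
        }
    }
    return n;
}
```

 Returning the real attribute count, even when it exceeds max_attrs, lets the
 caller detect rows with too many attributes instead of silently truncating
 them.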


 Process
 -------

 This is roughly what the call stack looks like:

     return new tuple
     --> form a tuple
         --> convert row attributes to internal format with input functions
             --> parse a data row according to COPY rules
                 --> get data from file
                     --> open external file (or remote source)


 Errors
 ------

 Inevitably, some queries against an external table will fail because the
 access layer cannot parse the user data properly due to "bad data" (e.g., a
 data row with too many attributes). In that case a COPY-style error is reported
 and the external scan is cancelled.
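
 The "too many attributes" case can be sketched as a simple count check. The
 function check_attr_count() below is hypothetical, and the message wording only
 imitates the style of COPY errors; it is an assumption that the external scan
 reports text along these lines.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Compare the number of attributes parsed from a row with the number of
 * columns the table expects. Returns 0 when they match. Otherwise writes a
 * COPY-style message into errbuf and returns -1, at which point a caller
 * would cancel the scan.
 */
int check_attr_count(int found, int expected, long lineno,
                     char *errbuf, size_t errlen)
{
    assert(errbuf != NULL && errlen > 0);
    if (found > expected) {
        snprintf(errbuf, errlen,
                 "line %ld: extra data after last expected column", lineno);
        return -1;
    }
    if (found < expected) {
        /* found + 1 is the position of the first column with no data */
        snprintf(errbuf, errlen,
                 "line %ld: missing data for column %d", lineno, found + 1);
        return -1;
    }
    errbuf[0] = '\0';   /* no error: leave an empty message */
    return 0;
}
```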