Tuple Representations
=====================
What is a tuple representation?
-------------------------------
Tuples are passed around in generic TupleTableSlot objects. Interested parties
need to be able to decode the data and see the values. The current tuple
representation provides information on how to decode the data into typed
attributes.
Why do we need more than one tuple representation?
---------------------------------------------------
Data comes from disk in a disk-oriented representation: it is serialized in a
buffer and, for heap tables, includes MVCC columns. That is sub-optimal for fast
processing. The execution engine does not usually need system columns, so it
can drop the MVCC info. Some operators don't need all the attributes (they
execute projections), but creating a new tuple means copying all the columns
into a new buffer. Variable-length attributes make non-sequential access to
attributes slow. We have different in-memory representations of tuples that can
improve performance in these cases.
What different representations do we have?
------------------------------------------
* HeapTuple
- in a buffer or
- palloc'ed
* MinimalTuple
- same as HeapTuple, but no system columns
* VirtualTuple
- optimization to avoid copying data for projections, etc.
* MemTuple
- Cloudberry addition: similar to MinimalTuple but "CPU" optimized
TupleTableSlot
--------------
A TupleTableSlot is a central data structure used to store tuples. The most
important fields are:
* TupleDesc: tuple descriptor that contains the schema.
* HeapTuple: if available, physical representation of the tuple. This is
usually just a pointer to a buffer and a length.
* MemTuple: if available, memtuple representation of the tuple.
* Datum *values: Array of the values (datums) for all the attributes. Might
not be available.
* bool *is_null: Array of booleans indicating whether each attribute is null.
Might not be available.
* nvalid: Number of attributes already extracted into the (values, is_null)
arrays.
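The interplay between the physical tuple, the (values, is_null) arrays and
nvalid can be illustrated with a minimal sketch. Everything below is
hypothetical: DemoSlot, demo_slot_getsomeattrs and the toy serialized format
(one null-bitmap byte followed by one int32 per non-null attribute) are
assumptions made for illustration and are not the real TupleTableSlot layout
or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SLOT_MAX_ATTRS 8

/* Hypothetical, heavily simplified stand-in for a TupleTableSlot. */
typedef struct DemoSlot
{
    int         natts;          /* attribute count, from the descriptor */
    const unsigned char *physical;  /* serialized tuple (toy format) */
    intptr_t    values[DEMO_SLOT_MAX_ATTRS];   /* extracted datums */
    bool        is_null[DEMO_SLOT_MAX_ATTRS];  /* extracted null flags */
    int         nvalid;         /* number of valid leading array entries */
    int         decode_calls;   /* demo-only counter of decoded attributes */
} DemoSlot;

/* Make sure the first 'upto' attributes are extracted; decode lazily. */
static void
demo_slot_getsomeattrs(DemoSlot *slot, int upto)
{
    size_t      off = 1;        /* skip the one-byte null bitmap */

    assert(slot->physical != NULL);
    assert(slot->natts <= DEMO_SLOT_MAX_ATTRS);
    assert(upto >= 0 && upto <= slot->natts);

    if (upto <= slot->nvalid)
        return;                 /* already extracted, nothing to do */

    /* Skip over the attributes that were decoded by earlier calls. */
    for (int i = 0; i < slot->nvalid; i++)
        if (!slot->is_null[i])
            off += sizeof(int32_t);

    for (int i = slot->nvalid; i < upto; i++)
    {
        bool        isnull = (slot->physical[0] >> i) & 1;

        slot->is_null[i] = isnull;
        slot->values[i] = 0;
        if (!isnull)
        {
            int32_t     v;

            memcpy(&v, slot->physical + off, sizeof(v));
            slot->values[i] = v;
            off += sizeof(v);
        }
        slot->decode_calls++;
    }
    slot->nvalid = upto;
}
```

In this sketch a second request for attributes that are already covered by
nvalid returns immediately, which is the point of tracking nvalid at all.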
Tuple Representations Stored in a TupleTableSlot
------------------------------------------------
HeapTuple
This is identical to the on-disk representation in a heap table. It's a
serialized (or "materialized") tuple, stored either in a buffer that belongs to
the buffer manager, or allocated with palloc onto the process' heap. Where a
heap tuple is stored is important for resource management as shared buffers
need to be correctly pinned.
MinimalTuple
It's similar to a palloc'ed HeapTuple, but the system columns are
stripped. This means there's no useless MVCC information carried around in the
executor.
VirtualTuple
A "virtual" tuple is an optimization used to minimize physical data copying in
between plan nodes. Any pass-by-reference Datums in the tuple point to storage
that is not directly associated with the TupleTableSlot. They can point to (1)
part of a tuple stored in a lower plan node's output TupleTableSlot or (2) a
function result constructed in a plan node's per-tuple econtext. It is the
responsibility of the generating plan node to be sure these resources are not
released for as long as the virtual tuple needs to be valid.
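The borrowed-pointer idea, and what copying the data out of a virtual tuple
involves, can be sketched as follows. This is only an illustration under
stated assumptions: DemoVirtualTuple, demo_materialize and demo_release are
hypothetical names, C strings stand in for pass-by-reference Datums, and
malloc stands in for palloc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_VT_MAX_ATTRS 4

/* Hypothetical virtual tuple: values point at storage owned elsewhere. */
typedef struct DemoVirtualTuple
{
    int         natts;
    const char *values[DEMO_VT_MAX_ATTRS];  /* borrowed pointers */
    char       *owned;      /* NULL while the tuple is still virtual */
} DemoVirtualTuple;

/* Copy all referenced data into one buffer owned by the tuple itself. */
static void
demo_materialize(DemoVirtualTuple *tup)
{
    size_t      total = 0;
    size_t      off = 0;

    assert(tup->natts >= 0 && tup->natts <= DEMO_VT_MAX_ATTRS);
    if (tup->owned != NULL)
        return;             /* already materialized */

    for (int i = 0; i < tup->natts; i++)
        total += strlen(tup->values[i]) + 1;

    tup->owned = malloc(total > 0 ? total : 1);
    assert(tup->owned != NULL);

    for (int i = 0; i < tup->natts; i++)
    {
        size_t      len = strlen(tup->values[i]) + 1;

        memcpy(tup->owned + off, tup->values[i], len);
        tup->values[i] = tup->owned + off;  /* repoint to our own copy */
        off += len;
    }
}

/* Free the owned buffer; the values pointers are invalid afterwards. */
static void
demo_release(DemoVirtualTuple *tup)
{
    free(tup->owned);
    tup->owned = NULL;
}
```

Before demo_materialize is called, the tuple is only valid for as long as
the storage it borrows from; afterwards the source buffers may be reused
without affecting it.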
Tuples that need to be copied anywhere (e.g. Motion, or DML) must be
"materialized" into physical tuples.
MemTuple
In Cloudberry, MemTuple replaces the MinimalTuple. MemTuple is also a
serialized tuple format. The main design goal is to reduce CPU usage so that
getting to a certain column does not require decoding of the columns preceding
it.
Non-sequential access to attributes (columns) of a HeapTuple or MinimalTuple is
done using slot_getsomeattrs or slot_getallattrs to populate some or all of the
Datum* value pointers. Accessing the M-th attribute involves iterating through
the first M-1 attributes, calculating their actual length (fixed or variable,
null or not), then adding it to an offset.
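The cost of that sequential walk can be seen in a minimal sketch. The names
DemoAttrDesc and demo_heap_attr_offset are hypothetical, and the format is a
simplification assumed for illustration: fixed-width attributes have a
positive attlen, variable-length attributes have attlen -1 and are stored as
a one-byte length followed by the payload, and nulls occupy no space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-attribute schema information. */
typedef struct DemoAttrDesc
{
    int         attlen;     /* > 0: fixed width; -1: variable length */
    int         attalign;   /* required alignment in bytes */
} DemoAttrDesc;

/* Round 'off' up to the next multiple of 'align'. */
static size_t
demo_align_up(size_t off, size_t align)
{
    return (off + align - 1) / align * align;
}

/*
 * Return the byte offset of attribute 'target' (0-based) in 'data'.
 * Every preceding attribute has to be inspected, because its length and
 * nullness determine where the following attributes start.
 */
static size_t
demo_heap_attr_offset(const DemoAttrDesc *desc, const bool *is_null,
                      const unsigned char *data, int target)
{
    size_t      off = 0;

    assert(target >= 0);
    assert(!is_null[target]);

    for (int i = 0; i <= target; i++)
    {
        if (is_null[i])
            continue;       /* nulls take no space in this toy format */

        off = demo_align_up(off, (size_t) desc[i].attalign);
        if (i == target)
            break;

        if (desc[i].attlen > 0)
            off += (size_t) desc[i].attlen;
        else
            off += 1 + (size_t) data[off];  /* length byte + payload */
    }
    return off;
}
```

Note that the result for the same target changes with the length of an
earlier variable-length value and with the null flags, so it cannot be
computed once per schema in this format.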
MemTuple is an optimized representation which speeds up non-sequential access
to attributes. At query start time, for each slot (or each place we use MemTuple), we
compute a binding data structure from the schema (TupleDesc). For each
attribute in the tuple, the binding structure contains information that
determines where the attribute is located in the MemTuple. It consists of the
following:
1. The physical attribute number. Physical attribute numbers may differ from
the schema order because we reorder attributes as follows: 8-byte-aligned
attributes go first, then 4-byte-aligned, then 2, then 1, and variable-length
fields last.
2. The offset where the attribute starts and the length of the attribute.
3. Information to adjust the offset if there are null columns before (in
physical order) the attribute.
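Items 1 and 2 of the binding can be sketched as below. This is a
hypothetical illustration, not the code in access/memtup.h: DemoBinding and
demo_build_bindings are made-up names, fixed-width attributes are assumed to
have an alignment equal to their width (8, 4, 2 or 1), variable-length
attributes are marked with attlen -1, and the null adjustment of item 3 as
well as any tuple header are left out.

```c
#include <assert.h>

/* Hypothetical binding entry for one attribute, indexed by schema order. */
typedef struct DemoBinding
{
    int         physical_no;    /* position in the reordered layout */
    int         offset;         /* byte offset, -1 if variable length */
    int         length;         /* byte length, -1 if variable length */
} DemoBinding;

/*
 * Fill 'out' (one entry per schema attribute) and return the total size
 * of the fixed-width area. Computed once from the schema, not per tuple.
 */
static int
demo_build_bindings(const int *attlen, int natts, DemoBinding *out)
{
    static const int widths[] = {8, 4, 2, 1};
    int         next_physical = 0;
    int         offset = 0;

    /* Widest alignment first, so every offset is naturally aligned. */
    for (int w = 0; w < 4; w++)
    {
        for (int i = 0; i < natts; i++)
        {
            if (attlen[i] != widths[w])
                continue;
            out[i].physical_no = next_physical++;
            out[i].offset = offset;
            out[i].length = widths[w];
            offset += widths[w];
        }
    }

    /* Variable-length attributes go last; no fixed offset in this sketch. */
    for (int i = 0; i < natts; i++)
    {
        if (attlen[i] != -1)
            continue;
        out[i].physical_no = next_physical++;
        out[i].offset = -1;
        out[i].length = -1;
    }

    assert(next_physical == natts); /* each attlen was 8, 4, 2, 1 or -1 */
    return offset;
}
```

With such a binding, locating a fixed-width attribute is a single array
lookup by schema attribute number, independent of how many attributes
precede it.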
There are many functions to convert the tuple stored in a TupleTableSlot from
one representation to another. Refer to access/memtup.h and
executor/tuptable.h for more details.
Toasted attributes
------------------
Values that are too large to fit into a heap tuple in-line are stored
externally in a toast table. A large value is broken down into smaller
fixed-size chunks. Each chunk is stored as a tuple in the toast table. More
details can be found in access/tuptoaster.c.
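The chunking arithmetic can be sketched as follows. DEMO_CHUNK_SIZE and both
function names are hypothetical; the real chunk size is defined by the
server and is not 2000 bytes. The sketch also assumes that the final chunk
holds whatever remains and may therefore be shorter than the others.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed chunk size for illustration only; not the real constant. */
#define DEMO_CHUNK_SIZE 2000

/* Number of chunks needed to store a value of 'value_len' bytes. */
static size_t
demo_toast_chunk_count(size_t value_len)
{
    return (value_len + DEMO_CHUNK_SIZE - 1) / DEMO_CHUNK_SIZE;
}

/* Number of payload bytes in chunk 'chunk_seq' (0-based). */
static size_t
demo_toast_chunk_len(size_t value_len, size_t chunk_seq)
{
    size_t      start = chunk_seq * DEMO_CHUNK_SIZE;
    size_t      remaining;

    assert(chunk_seq < demo_toast_chunk_count(value_len));

    remaining = value_len - start;
    return remaining < DEMO_CHUNK_SIZE ? remaining : DEMO_CHUNK_SIZE;
}
```

For example, under these assumptions a 5000-byte value needs three chunks
of 2000, 2000 and 1000 bytes.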