| What is Just-in-Time Compilation? |
| ================================= |
| |
| Just-in-Time compilation (JIT) is the process of turning some form of |
| interpreted program evaluation into a native program, and doing so at |
| runtime. |
| |
| For example, instead of using a facility that can evaluate arbitrary |
| SQL expressions to evaluate an SQL predicate like WHERE a.col = 3, it |
is possible to generate a function that can be natively executed by
| the CPU that just handles that expression, yielding a speedup. |
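
As a purely illustrative sketch (the names and structure below are
invented, not PostgreSQL's actual expression machinery), the
difference is roughly the following:

    #include <stdbool.h>

    /*
     * Hypothetical, simplified sketch -- not actual PostgreSQL code.  A
     * generic evaluator has to call a comparison function resolved at
     * runtime through a pointer, while the generated equivalent of
     * "WHERE a.col = 3" can hard-code both the operator and the constant.
     */
    typedef struct ExampleRow { int col; } ExampleRow;

    /* generic: operator implementation is called indirectly */
    static bool
    eval_generic(bool (*cmp)(int, int), ExampleRow *row, int constant)
    {
        return cmp(row->col, constant);
    }

    /* roughly what a generated function for this one predicate boils down to */
    static bool
    eval_jitted(ExampleRow *row)
    {
        return row->col == 3;
    }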
| |
| This is JIT, rather than ahead-of-time (AOT) compilation, because it |
| is done at query execution time, and perhaps only in cases where the |
| relevant task is repeated a number of times. Given the way JIT |
| compilation is used in PostgreSQL, the lines between interpretation, |
| AOT and JIT are somewhat blurry. |
| |
| Note that the interpreted program turned into a native program does |
| not necessarily have to be a program in the classical sense. E.g. it |
| is highly beneficial to JIT compile tuple deforming into a native |
| function just handling a specific type of table, despite tuple |
| deforming not commonly being understood as a "program". |
| |
| |
| Why JIT? |
| ======== |
| |
| Parts of PostgreSQL are commonly bottlenecked by comparatively small |
| pieces of CPU intensive code. In a number of cases that is because the |
| relevant code has to be very generic (e.g. handling arbitrary SQL |
| level expressions, over arbitrary tables, with arbitrary extensions |
| installed). This often leads to a large number of indirect jumps and |
| unpredictable branches, and generally a high number of instructions |
| for a given task. E.g. just evaluating an expression comparing a |
| column in a database to an integer ends up needing several hundred |
| cycles. |
| |
By generating native code, large numbers of indirect jumps can be
removed, either by turning them into direct branches (e.g. replacing
the indirect call to an SQL operator's implementation with a direct
call to that function), or by removing them entirely (e.g. by
evaluating the branch at compile time because the input is
constant). Similarly, a lot of conditional branches can be removed
entirely in the same way. The latter is particularly beneficial for
removing branches during tuple deforming.
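
As a hypothetical sketch of the tuple deforming case (invented names;
the real deforming code and its JITed counterpart are considerably
more involved), specializing for one known table layout removes both
the loop and the per-attribute branches:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct ExampleAttr
    {
        int     len;        /* attribute length in bytes (fixed-width only) */
        bool    byval;      /* passed by value? */
    } ExampleAttr;

    /* generic: must loop and branch on per-attribute metadata */
    static void
    deform_generic(const char *tup, const ExampleAttr *attrs, int natts,
                   long *values)
    {
        int     off = 0;

        for (int i = 0; i < natts; i++)
        {
            if (attrs[i].byval && attrs[i].len == 4)
                values[i] = *(const int *) (tup + off);
            else
                values[i] = (long) (uintptr_t) (tup + off);
            off += attrs[i].len;
        }
    }

    /* specialized for one table with two fixed-width int4 columns */
    static void
    deform_specialized(const char *tup, long *values)
    {
        values[0] = *(const int *) tup;
        values[1] = *(const int *) (tup + 4);
    }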
| |
| |
| How to JIT |
| ========== |
| |
| PostgreSQL, by default, uses LLVM to perform JIT. LLVM was chosen |
| because it is developed by several large corporations and therefore |
| unlikely to be discontinued, because it has a license compatible with |
| PostgreSQL, and because its IR can be generated from C using the Clang |
| compiler. |
| |
| |
| Shared Library Separation |
| ------------------------- |
| |
| To avoid the main PostgreSQL binary directly depending on LLVM, which |
| would prevent LLVM support being independently installed by OS package |
| managers, the LLVM dependent code is located in a shared library that |
| is loaded on-demand. |
| |
| An additional benefit of doing so is that it is relatively easy to |
| evaluate JIT compilation that does not use LLVM, by changing out the |
| shared library used to provide JIT compilation. |
| |
| To achieve this, code intending to perform JIT (e.g. expression evaluation) |
| calls an LLVM independent wrapper located in jit.c to do so. If the |
| shared library providing JIT support can be loaded (i.e. PostgreSQL was |
| compiled with LLVM support and the shared library is installed), the task |
| of JIT compiling an expression gets handed off to the shared library. This |
| obviously requires that the function in jit.c is allowed to fail in case |
| no JIT provider can be loaded. |
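
The shape of that hand-off is sketched below with invented names (the
real callback struct and wrapper functions live in jit.h and jit.c):
the wrapper attempts to load a provider once, and simply reports
failure when none is available.

    #include <stdbool.h>

    typedef struct ExampleJitCallbacks
    {
        bool    (*compile_expr)(void *exprstate);
        void    (*release_context)(void *context);
    } ExampleJitCallbacks;

    /* hypothetical loader, standing in for loading the provider library */
    extern bool example_load_provider(ExampleJitCallbacks *cb);

    static ExampleJitCallbacks provider_cb;
    static bool provider_loaded = false;
    static bool provider_load_attempted = false;

    static bool
    example_compile_expr(void *exprstate)
    {
        if (!provider_load_attempted)
        {
            provider_loaded = example_load_provider(&provider_cb);
            provider_load_attempted = true;
        }

        /* allowed to fail: the caller falls back to interpreted execution */
        if (!provider_loaded)
            return false;

        return provider_cb.compile_expr(exprstate);
    }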
| |
| Which shared library is loaded is determined by the jit_provider GUC, |
| defaulting to "llvmjit". |
| |
Cloistering the code performing JIT into a shared library
unfortunately also means that the JIT compilation code for the various
parts of the backend has to be located separately from the
corresponding non-JIT code. E.g. the JIT version of execExprInterp.c
is located in jit/llvm/ rather than executor/.
| |
| |
| JIT Context |
| ----------- |
| |
| For performance and convenience reasons it is useful to allow JITed |
| functions to be emitted and deallocated together. It is e.g. very |
| common to create a number of functions at query initialization time, |
| use them during query execution, and then deallocate all of them |
| together at the end of the query. |
| |
| Lifetimes of JITed functions are managed via JITContext. Exactly one |
such context should be created for a body of work in which all
created JITed functions should have the same lifetime. E.g. there's
exactly one
| JITContext for each query executed, in the query's EState. Only the |
| release of a JITContext is exposed to the provider independent |
| facility, as the creation of one is done on-demand by the JIT |
| implementations. |
| |
| Emitting individual functions separately is more expensive than |
| emitting several functions at once, and emitting them together can |
| provide additional optimization opportunities. To facilitate that, the |
LLVM provider separates defining functions from optimizing them and
emitting them in executable form.
| |
Functions are created in the current mutable module (a module is
essentially LLVM's equivalent of a translation unit in C), which is
obtained using
extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context);
into which the caller can then emit as much code using the LLVM APIs
as it wants. Whenever a function actually needs to be called,
extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);
returns a pointer to it.
| |
E.g. in the expression evaluation case this setup allows most
functions in a query to be defined during ExecInitNode(), delaying
their optimization and emission until the first time one of the
functions is actually used.
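
A minimal sketch of how a caller might use these two entry points is
shown below. It is not actual backend code: it builds a trivial
function that returns a constant, uses LLVM's global context for
simplicity (the real code manages its own LLVMContextRef), and real
callers such as the expression compiler emit far more involved IR.

    #include <llvm-c/Core.h>
    #include "jit/llvmjit.h"

    static void *
    example_emit_and_get(LLVMJitContext *context)
    {
        LLVMModuleRef       mod = llvm_mutable_module(context);
        LLVMTypeRef         fnty = LLVMFunctionType(LLVMInt32Type(), NULL, 0, 0);
        LLVMValueRef        fn = LLVMAddFunction(mod, "example_answer", fnty);
        LLVMBasicBlockRef   entry = LLVMAppendBasicBlock(fn, "entry");
        LLVMBuilderRef      b = LLVMCreateBuilder();

        /* emit "return 42" into the new function */
        LLVMPositionBuilderAtEnd(b, entry);
        LLVMBuildRet(b, LLVMConstInt(LLVMInt32Type(), 42, 0));
        LLVMDisposeBuilder(b);

        /* optimization and emission of the whole module happen here */
        return llvm_get_function(context, "example_answer");
    }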
| |
| |
| Error Handling |
| -------------- |
| |
| There are two aspects of error handling. Firstly, generated (LLVM IR) |
| and emitted functions (mmap()ed segments) need to be cleaned up both |
| after a successful query execution and after an error. This is done by |
| registering each created JITContext with the current resource owner, |
| and cleaning it up on error / end of transaction. If it is desirable |
| to release resources earlier, jit_release_context() can be used. |
| |
| The second, less pretty, aspect of error handling is OOM handling |
| inside LLVM itself. The above resowner based mechanism takes care of |
| cleaning up emitted code upon ERROR, but there's also the chance that |
| LLVM itself runs out of memory. LLVM by default does *not* use any C++ |
| exceptions. Its allocations are primarily funneled through the |
| standard "new" handlers, and some direct use of malloc() and |
| mmap(). For the former a 'new handler' exists: |
| http://en.cppreference.com/w/cpp/memory/new/set_new_handler |
| For the latter LLVM provides callbacks that get called upon failure |
| (unfortunately mmap() failures are treated as fatal rather than OOM errors). |
What we've chosen to do for now is to provide two functions that code
using LLVM must call around its interactions with LLVM:
extern void llvm_enter_fatal_on_oom(void);
extern void llvm_leave_fatal_on_oom(void);
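
The expected usage pattern is simply to bracket any section that
interacts with LLVM, e.g. (illustrative only):

    static void
    example_llvm_section(void)
    {
        llvm_enter_fatal_on_oom();

        /* ... build IR, optimize and emit code via the LLVM APIs ... */

        llvm_leave_fatal_on_oom();
    }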
| |
| When a libstdc++ new or LLVM error occurs, the handlers set up by the |
| above functions trigger a FATAL error. We have to use FATAL rather |
| than ERROR, as we *cannot* reliably throw ERROR inside a foreign |
| library without risking corrupting its internal state. |
| |
Users of the above sections do *not* have to use PG_TRY/CATCH blocks;
the handlers are instead reset at the toplevel sigsetjmp() level.
| |
| Using a relatively small enter/leave protected section of code, rather |
| than setting up these handlers globally, avoids negative interactions |
| with extensions that might use C++ such as PostGIS. As LLVM code |
| generation should never execute arbitrary code, just setting these |
| handlers temporarily ought to suffice. |
| |
| |
| Type Synchronization |
| -------------------- |
| |
| To be able to generate code that can perform tasks done by "interpreted" |
| PostgreSQL, it obviously is required that code generation knows about at |
| least a few PostgreSQL types. While it is possible to inform LLVM about |
| type definitions by recreating them manually in C code, that is failure |
| prone and labor intensive. |
| |
| Instead there is one small file (llvmjit_types.c) which references each of |
| the types required for JITing. That file is translated to bitcode at |
| compile time, and loaded when LLVM is initialized in a backend. |
| |
| That works very well to synchronize the type definition, but unfortunately |
| it does *not* synchronize offsets as the IR level representation doesn't |
| know field names. Instead, required offsets are maintained as defines in |
| the original struct definition, like so: |
| #define FIELDNO_TUPLETABLESLOT_NVALID 9 |
| int tts_nvalid; /* # of valid values in tts_values */ |
While such defines still have to be maintained by hand, they are only
required for a relatively small number of fields, and they sit right
next to the struct definition, so they are easily kept synchronized.
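
A hypothetical fragment showing how such a define might be used from
code generation (the builder, the slot value, and the IR types for
TupleTableSlot and the field are assumed to come from the types loaded
via llvmjit_types.c; the actual deforming code in jit/llvm/ is more
involved):

    #include <llvm-c/Core.h>

    static LLVMValueRef
    example_load_nvalid(LLVMBuilderRef b, LLVMTypeRef slot_type,
                        LLVMTypeRef field_type, LLVMValueRef v_slot)
    {
        LLVMValueRef v_nvalidp;

        /* reuse the FIELDNO_* define so the field index stays in sync */
        v_nvalidp = LLVMBuildStructGEP2(b, slot_type, v_slot,
                                        FIELDNO_TUPLETABLESLOT_NVALID, "");
        return LLVMBuildLoad2(b, field_type, v_nvalidp, "");
    }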
| |
| |
| Inlining |
| -------- |
| |
| One big advantage of JITing expressions is that it can significantly |
| reduce the overhead of PostgreSQL's extensible function/operator |
| mechanism, by inlining the body of called functions/operators. |
| |
| It obviously is undesirable to maintain a second implementation of |
| commonly used functions, just for inlining purposes. Instead we take |
| advantage of the fact that the Clang compiler can emit LLVM IR. |
| |
The ability to do so allows us to get the LLVM IR for all operators
(e.g. int8eq, float8pl, etc.) without maintaining two copies. These
bitcode files get installed into the server's
$pkglibdir/bitcode/postgres/
Using existing LLVM functionality (for parallel LTO compilation), an
index over these files is additionally stored in
$pkglibdir/bitcode/postgres.index.bc
| |
| Similarly extensions can install code into |
| $pkglibdir/bitcode/[extension]/ |
| accompanied by |
| $pkglibdir/bitcode/[extension].index.bc |
| |
just alongside the actual library. An extension's index will be used
to look up symbols located in the corresponding shared library.
Symbols that are used inside the extension will, when inlined, first
be looked up in the main binary's index and then in the extension's.
| |
| |
| Caching |
| ------- |
| |
| Currently it is not yet possible to cache generated functions, even |
| though that'd be desirable from a performance point of view. The |
| problem is that the generated functions commonly contain pointers into |
| per-execution memory. The expression evaluation machinery needs to |
be redesigned a bit to avoid that. Basically, all per-execution memory
needs to be referenced as offsets into one block of memory stored in
an ExprState, rather than as absolute pointers into memory.
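
A hypothetical illustration of that difference (invented names, not a
design that exists today): the first variant depends on the address of
one execution's state being baked into the code, while the second only
bakes in a constant offset and receives the per-execution block as an
argument, which would make the generated code reusable.

    #include <stddef.h>

    typedef struct ExampleExecBlock { long param0; } ExampleExecBlock;

    static ExampleExecBlock *this_executions_block;    /* per-execution memory */

    /*
     * The address of this_executions_block->param0 is effectively a
     * constant embedded in the generated code.
     */
    static long
    eval_absolute(void)
    {
        return this_executions_block->param0 + 1;
    }

    /* only the constant offset is embedded; the block is handed in */
    static long
    eval_relative(char *block)
    {
        return *(long *) (block + offsetof(ExampleExecBlock, param0)) + 1;
    }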
| |
| Once that is addressed, adding an LRU cache that's keyed by the |
| generated LLVM IR will allow the usage of optimized functions even for |
| faster queries. |
| |
| A longer term project is to move expression compilation to the planner |
| stage, allowing e.g. to tie compiled expressions to prepared |
| statements. |
| |
| An even more advanced approach would be to use JIT with few |
| optimizations initially, and build an optimized version in the |
| background. But that's even further off. |
| |
| |
| What to JIT |
| =========== |
| |
| Currently expression evaluation and tuple deforming are JITed. Those |
| were chosen because they commonly are major CPU bottlenecks in |
| analytics queries, but are by no means the only potentially beneficial cases. |
| |
| For JITing to be beneficial a piece of code first and foremost has to |
| be a CPU bottleneck. But also importantly, JITing can only be |
| beneficial if overhead can be removed by doing so. E.g. in the tuple |
| deforming case the knowledge about the number of columns and their |
| types can remove a significant number of branches, and in the |
| expression evaluation case a lot of indirect jumps/calls can be |
| removed. If neither of these is the case, JITing is a waste of |
| resources. |
| |
| Future avenues for JITing are tuple sorting, COPY parsing/output |
| generation, and later compiling larger parts of queries. |
| |
| |
| When to JIT |
| =========== |
| |
| Currently there are a number of GUCs that influence JITing: |
| |
| - jit_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost |
| get JITed, *without* optimization (expensive part), corresponding to |
| -O0. This commonly already results in significant speedups if |
| expression/deforming is a bottleneck (removing dynamic branches |
| mostly). |
| - jit_optimize_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost |
| get JITed, *with* optimization (expensive part). |
- jit_inline_above_cost = -1, 0-DBL_MAX - inlining is tried if the query
  has a higher total cost.
| |
| Whenever a query's total cost is above these limits, JITing is |
| performed. |
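
Roughly, the decision boils down to the following (a simplified sketch
with invented names, not the actual planner code): all three GUCs are
compared against the same total plan cost, so increasingly expensive
queries first become eligible for unoptimized JIT, then for inlining
and full optimization. A value of -1 disables the respective step.

    #include <stdbool.h>

    typedef struct ExampleJitDecision
    {
        bool    perform;    /* JIT at all, without optimization (~ -O0) */
        bool    inlining;   /* inline function bodies from installed bitcode */
        bool    optimize;   /* run the expensive optimization passes */
    } ExampleJitDecision;

    static ExampleJitDecision
    example_decide_jit(double total_cost,
                       double jit_above_cost,
                       double jit_optimize_above_cost,
                       double jit_inline_above_cost)
    {
        ExampleJitDecision d = {false, false, false};

        if (jit_above_cost >= 0 && total_cost > jit_above_cost)
        {
            d.perform = true;
            if (jit_inline_above_cost >= 0 &&
                total_cost > jit_inline_above_cost)
                d.inlining = true;
            if (jit_optimize_above_cost >= 0 &&
                total_cost > jit_optimize_above_cost)
                d.optimize = true;
        }
        return d;
    }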
| |
Alternative costing models, e.g. generating separate paths for parts
of a query with lower cpu_* costs, are also a possibility, but it's
doubtful the benefit would justify the overhead of doing so. Another
| alternative would be to count the number of times individual |
| expressions are estimated to be evaluated, and perform JITing of these |
| individual expressions. |
| |
The obvious-seeming approach of JITing expressions individually after
a number of executions turns out not to work too well, primarily
because emitting many small functions individually has significant
overhead, and secondarily because the time until JITing occurs causes
relative slowdowns that eat into the gain of JIT compilation.