 Cluster File Encryption
 =======================

 This directory contains support functions and sample scripts to be used
 for cluster file encryption.

 Architecture
 ------------

 Fundamentally, cluster file encryption must store data in a file system
 in such a way that the keys required to decrypt the file system data can
 only be accessed using something outside of the file system itself.  The
 external requirement can be someone typing in a passphrase, getting a
 key from a key management server (KMS), or decrypting a key stored in
 the file system using a hardware security module (HSM).  The current
 architecture supports all of these methods, and includes sample scripts
 for them.

 The simplest method for accessing data keys using some external
 requirement would be to retrieve all data encryption keys from a KMS.
 However, retrieved keys would still need to be verified as valid.  This
 method also introduces unacceptable complexity for simpler use-cases,
 like user-supplied passphrases or HSM usage.  External key rotation
 would also be very hard since it would require re-encrypting all the
 file system data with the new externally-stored keys.

 For these reasons, a two-tiered architecture is used, which uses two
 types of encryption keys: a key encryption key (KEK) and data encryption
 keys (DEK).  The KEK should not be present unencrypted in the file system
 --- it should be supplied by the user, stored externally (e.g., in a KMS),
 or stored in the file system encrypted with an HSM (e.g., PIV device).
 The DEK is used to encrypt database files and is stored in the same file
 system as the database but is encrypted using the KEK.  Because the DEK
 is encrypted, its storage in the file system is no more of a security
 weakness than the storage of the encrypted database files in the same
 file system.

 Implementation
 --------------

 To enable cluster file encryption, the initdb option
 --cluster-key-command must be used, which specifies a command to
 retrieve the KEK.  initdb records the cluster_key_command in
 postgresql.conf.  Every time the KEK is needed, the command is run and
 must return 64 hex characters which are decoded into the KEK.  The
 command is called twice during initdb, and every time the server starts.
 initdb also sets the encryption method in controldata during server
 bootstrap.
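
 As a minimal sketch of the "64 hex characters decoded into the KEK"
 step, the following Python shows one way the command output could be
 validated and decoded.  The function name decode_kek and the decision to
 strip surrounding whitespace are assumptions made for illustration; they
 are not taken from kmgr.c.

```python
import string

KEK_HEX_LENGTH = 64  # 64 hex characters, as stated above


def decode_kek(command_output: str) -> bytes:
    """Decode the output of a cluster key command into raw KEK bytes."""
    # Assumption: a trailing newline or surrounding whitespace is ignored.
    hex_key = command_output.strip()
    if len(hex_key) != KEK_HEX_LENGTH:
        raise ValueError(
            f"expected {KEK_HEX_LENGTH} hex characters, got {len(hex_key)}"
        )
    if any(c not in string.hexdigits for c in hex_key):
        raise ValueError("cluster key contains non-hex characters")
    return bytes.fromhex(hex_key)


# Example: a command that printed 64 hex characters followed by a newline.
kek = decode_kek("00112233445566778899aabbccddeeff" * 2 + "\n")
print(len(kek))  # 32 bytes, i.e. a 256-bit key
```

 Rejecting anything other than exactly 64 hex characters means a
 misconfigured command is reported immediately instead of yielding a
 truncated key.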

 initdb runs "postgres --boot", which calls function
 kmgr.c::BootStrapKmgr(), which calls the cluster key command.  The
 cluster key command returns a KEK, which is used to encrypt (wrap)
 random bytes for each DEK; the wrapped DEKs are written to the file
 system by kmgr.c::KmgrWriteCryptoKeys() (unless --copy-encryption-keys
 is used).  Currently the DEK files are 0 and 1 and are stored in
 $PGDATA/pg_cryptokeys/live.  The wrapped DEK files use Key Wrapping with
 Padding, which verifies the validity of the KEK.

 initdb also does a non-boot backend start which calls
 kmgr.c::InitializeKmgr(), which calls the cluster key command a second
 time.  This decrypts/unwraps the DEKs and stores them in the shared
 memory structure KmgrShmem. This step also happens every time the server
 starts. Later patches will use the keys stored in KmgrShmem to
 encrypt/decrypt database files.  KmgrShmem is erased via
 explicit_bzero() on server shutdown.

 Limitations
 -----------

 There doesn't seem to be a reasonable way to detect all malicious data
 modification or key extraction if a user has write permission on the
 files in PGDATA. It might be possible to limit the key extraction risk
 if postgresql.auto.conf were able to be moved to a directory outside of
 PGDATA, and if postmaster.opts could be moved or ignored when cluster
 file encryption is used.  (postmaster.opts is used by pg_ctl restart.)

 It doesn't appear possible to detect all malicious writes --- even if
 you add message authentication code (MAC) checks to encrypted files,
 modifying non-encrypted files could still affect encrypted ones, e.g.,
 modifying files in pg_xact could affect how heap rows are interpreted.
 Basically you would need to encrypt all files, and at that point you
 might as well just use an encrypted file system. There also doesn't seem
 to be a way to prevent key extraction if someone has read permission on
 postgres process memory.

 Initialization Vector
 ---------------------

 Nonce means "number used once".  An Initialization Vector (IV) is a
 specific type of nonce: it must be unique, but is not necessarily random
 or secret, as specified by NIST
 (https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf).
 To generate unique IVs, NIST recommends two methods:

 	The first method is to apply the forward cipher function, under
 	the same key that is used for the encryption of the plaintext,
 	to a nonce. The nonce must be a data block that is unique to
 	each execution of the encryption operation. For example, the
 	nonce may be a counter, as described in Appendix B, or a message
 	number. The second method is to generate a random data block
 	using a FIPS-approved random number generator.

 We will use the first method to generate IVs.  That is, select the nonce
 carefully and use a cipher with the key to make it unique enough to use
 as an IV.  The nonce selection for buffer encryption and WAL encryption
 is described below.

 If the IV were used more than once with the same key (and we only use one
 data encryption key), changes in the unencrypted data would be visible
 in the encrypted data.
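
 The effect of IV reuse can be illustrated with a toy example.  This
 README does not state the cipher mode, so the sketch assumes a
 stream-style (counter) mode and, to stay within the Python standard
 library, uses a SHA-256 based keystream as a stand-in for AES.  The
 names keystream and encrypt are hypothetical, and this is not the
 cipher used by the server.

```python
import hashlib


def keystream(key: bytes, iv: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256(key || iv || counter); a stand-in for AES."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + iv + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    """XOR the plaintext with the keystream derived from (key, iv)."""
    ks = keystream(key, iv, len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, ks))


key = bytes(32)  # all-zero demo key
iv = bytes(16)   # the same IV is (incorrectly) used for both versions

old = b"balance=0100"
new = b"balance=0900"
c_old = encrypt(key, iv, old)
c_new = encrypt(key, iv, new)

# With a reused IV, the ciphertexts differ exactly where the plaintexts do.
changed = [i for i, (a, b) in enumerate(zip(c_old, c_new)) if a != b]
print(changed)  # [9]
```

 An observer who sees both ciphertexts learns which byte changed and the
 XOR of the old and new values, without ever knowing the key.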

 IV for Heap/Index Encryption
 - - - - - - - - - - - - - -

 To create the 16-byte IV needed by AES for each page version, we will
 use the page LSN (8 bytes) and page number (4 bytes).  In the remaining
 four bytes, one bit will be used to indicate if the LSN is WAL (real) or
 fake (see below). The LSN is ideal for use in the IV because it is
 always increasing, and is changed every time a page is updated.  The
 same LSN is never used for two relations with different page contents.

 However, the same LSN can be used in multiple pages in the same relation
 --- this can happen when a heap update expires an old tuple and adds a
 new tuple to another page.  By adding the page number to the IV, we keep
 the IV unique.
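
 A minimal sketch of the 16-byte layout described above follows.  The
 field order, the big-endian byte order, and the position of the
 real/fake bit (FAKE_LSN_FLAG) are assumptions chosen for illustration;
 the actual layout is defined by the C code and may differ.  Any
 forward-cipher step applied to these bytes afterwards is omitted.

```python
import struct

FAKE_LSN_FLAG = 0x01  # hypothetical bit within the trailing 4 bytes


def build_page_iv(lsn: int, page_number: int, fake_lsn: bool) -> bytes:
    """Pack LSN (8 bytes), page number (4 bytes) and flags (4 bytes)."""
    flags = FAKE_LSN_FLAG if fake_lsn else 0
    # ">QII": unsigned 64-bit, unsigned 32-bit, unsigned 32-bit, no padding.
    return struct.pack(">QII", lsn, page_number, flags)


# Two pages touched by the same WAL record share an LSN but not an IV.
iv_page_7 = build_page_iv(0x16B3A28, 7, fake_lsn=False)
iv_page_8 = build_page_iv(0x16B3A28, 8, fake_lsn=False)
print(len(iv_page_7), iv_page_7 != iv_page_8)  # 16 True
```

 Because the page number occupies its own field, equal LSNs on different
 pages cannot collide, and the flag bit separates real from fake LSNs
 that happen to have the same numeric value.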

 By not using the database id in the IV, CREATE DATABASE can copy the
 heap/index files from the old database to a new one without
 decryption/encryption.  Both page copies are valid.  Once a database
 changes its pages, it gets new LSNs, and hence new IVs.  Using only the
 LSN and page number also avoids requiring pg_upgrade to preserve
 database oids, tablespace oids, and relfilenodes.

 As part of WAL logging, every change of a WAL-logged page gets a new
 LSN, and therefore a new IV automatically.

 However, the LSN must then be visible on encrypted pages, so we will not
 encrypt the LSN on the page. We will also not encrypt the CRC so
 pg_checksums can still check pages offline without access to the keys.

 Non-Permanent Relations
 - - - - - - - - - - - -

 To avoid the overhead of generating WAL for non-permanent (unlogged and
 temporary) relations, we assign fake LSNs that are derived from a
 counter via xlog.c::GetFakeLSNForUnloggedRel().  (GiST also uses this
 counter for LSNs.)  We also set a bit in the IV so the use of the same
 value for WAL (real) and fake LSNs will still generate unique IVs.  Only
 main forks are encrypted, not init, vm, or fsm files.

 In the code, we need to identify if a page uses WAL or fake LSNs in
 four places, when:

 1.  Reading a page from the file system and decrypting
 2.  Setting the WAL or fake LSN on a page
 3.  Hint bit changes requiring new LSNs for the encryption IV
 4.  Encrypting and writing a page to the file system

 For all these cases, we have access to the fork number and either the
 relation's persistence state or the buffer state.  If it is a "main"
 fork and the relation persistence state is RELPERSISTENCE_PERMANENT, or
 if it is an "init" fork, we use a real LSN.  If it is a main fork and
 the persistence state is not RELPERSISTENCE_PERMANENT, we use a fake
 LSN.  The buffer state BM_PERMANENT is true if the relation is permanent
 or the page belongs to an init fork.
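
 The rule above can be written as a small decision function.  This is a
 sketch only: the Fork and Persistence enums and the function name
 uses_real_lsn are illustrative stand-ins, not the server's C
 definitions.

```python
from enum import Enum


class Fork(Enum):
    MAIN = "main"
    INIT = "init"


class Persistence(Enum):
    PERMANENT = "permanent"
    UNLOGGED = "unlogged"
    TEMP = "temp"


def uses_real_lsn(fork: Fork, persistence: Persistence) -> bool:
    """Return True for a WAL (real) LSN, False for a fake LSN."""
    if fork is Fork.INIT:
        # Init forks always carry real LSNs, whatever the relation's state.
        return True
    # Main fork: real LSN only for permanent relations.
    return persistence is Persistence.PERMANENT


print(uses_real_lsn(Fork.MAIN, Persistence.PERMANENT))  # True
print(uses_real_lsn(Fork.MAIN, Persistence.UNLOGGED))   # False
print(uses_real_lsn(Fork.INIT, Persistence.UNLOGGED))   # True
```

 For main and init forks this gives the same answer as the BM_PERMANENT
 buffer state, which is why either source of information can be used.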

 Init Forks
 - - - - -

 Init forks for unlogged relations get permanent LSNs because unlogged
 relation creation is WAL logged/crash safe, even though the relation's
 contents are not.  When the init fork is copied to represent an empty
 relation during crash recovery, it becomes a non-permanent page and must
 be successfully decrypted as such.  Therefore, when it is copied, its
 LSN is changed to a fake LSN and then encrypted.  This prevents a real
 LSN from being encrypted with the fake nonce bit.

 LSN Assignment, GiST, & Non-Permanent Relations
 - - - - - - - - - - - - - - - - - - - - - - - -

 LSN assignment has to be slightly modified for encryption.  In normal,
 non-encryption mode, LSNs are assigned to pages following these rules:

 1.  During GiST builds, some pages are assigned fixed LSNs (GistBuildLSN)

 2.  During GiST builds, non-permanent pages not assigned fixed LSNs in
 #1 are assigned fake LSNs, via gistutil.c::gistGetFakeLSN().

 3.  All other permanent pages are assigned WAL-based LSNs based on the
 WAL position of their WAL records.

 4.  All other non-permanent pages have LSNs of zero.

 When encryption is enabled:

 1.  During GiST builds, permanent pages are assigned WAL-based LSNs
 generated by xloginsert.c::LSNForEncryption().

 2.  During GiST builds, non-permanent pages are assigned fake LSNs.
 (No constant LSNs are used in #1 or #2.)

 3.  Same as #3 above.

 4.  All other non-permanent pages are assigned fake LSNs before page
 encryption.
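
 The four rules that apply when encryption is enabled can be condensed
 into a lookup, sketched below.  The function name and the returned
 labels are hypothetical and exist only to make the rules easy to
 compare.

```python
def lsn_source_with_encryption(in_gist_build: bool, permanent: bool) -> str:
    """Return a label for where a page's LSN comes from under encryption."""
    if permanent:
        if in_gist_build:
            # Rule 1: WAL-based LSN generated by LSNForEncryption().
            return "wal (LSNForEncryption)"
        # Rule 3: WAL position of the page's own WAL record.
        return "wal (record position)"
    # Rules 2 and 4: every non-permanent page gets a fake LSN.
    return "fake"


for gist in (True, False):
    for perm in (True, False):
        print(gist, perm, lsn_source_with_encryption(gist, perm))
```

 No branch returns a constant or zero LSN, which is the property the
 encrypted case depends on.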

 When switching to an encrypted replica from a non-encrypted primary,
 GiST indexes will be using fixed LSNs for permanent tables, so it is
 recommended to rebuild GiST indexes.  Non-permanent relations are not
 replicated, so they are not an issue.

 Hint Bits
 - - - - -

 For hint bit changes, the LSN normally doesn't change, which is a
 problem.  By enabling wal_log_hints, you get full page writes to the WAL
 after the first hint bit change of the checkpoint.  This is useful for
 two reasons.  First, it generates a new LSN, which is needed for the IV
 to be secure.  Second, full page images protect against torn pages,
 which is an even bigger requirement for encryption because the new LSN
 causes the entire page to be re-encrypted, not just the hint bit changes.  You
 can safely lose the hint bit changes, but you need to use the same LSN
 to decrypt the entire page, so a torn page with an LSN change cannot be
 decrypted.  To prevent this, wal_log_hints guarantees that the
 pre-hint-bit version (and previous LSN version) of the page is restored.

 However, if a hint-bit-modified page is written to the file system
 during a checkpoint, and there is a later hint bit change switching the
 same page from clean to dirty during the same checkpoint, we need a new
 LSN, and wal_log_hints doesn't give us a new LSN here.  The fix for this
 is to update the page LSN by writing a dummy WAL record via
 xloginsert.c::LSNForEncryption() in such cases.
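
 The two cases above can be summarized as a decision made when a hint
 bit is set.  This is one reading of the text, not the server's code:
 the function name, both parameters, and the returned labels are
 hypothetical.

```python
def hint_bit_lsn_action(page_dirty: bool,
                        fpi_logged_since_checkpoint: bool) -> str:
    """Decide how a hint bit change obtains a new LSN for the IV."""
    if page_dirty:
        # Assumption: a page that is already dirty has not yet been written
        # with its current LSN, so no new LSN is required.
        return "none"
    if not fpi_logged_since_checkpoint:
        # First hint bit change of the checkpoint: wal_log_hints writes a
        # full page image, which supplies a new LSN.
        return "full_page_image"
    # Clean page whose full page image was already logged in this
    # checkpoint: wal_log_hints gives no new LSN, so a dummy WAL record
    # is written instead.
    return "dummy_wal_record"


print(hint_bit_lsn_action(False, False))  # full_page_image
print(hint_bit_lsn_action(False, True))   # dummy_wal_record
print(hint_bit_lsn_action(True, True))    # none
```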