doc/src/sgml/hash.sgml - cloudberry - Git at Google

 <!-- doc/src/sgml/hash.sgml -->

 <chapter id="hash-index">
 <title>Hash Indexes</title>

    <indexterm>
     <primary>index</primary>
     <secondary>Hash</secondary>
    </indexterm>

 <sect1 id="hash-intro">
  <title>Overview</title>

  <para>
   <productname>PostgreSQL</productname>
   includes an implementation of persistent on-disk hash indexes,
   which are fully crash recoverable. Any data type can be indexed by a
   hash index, including data types that do not have a well-defined linear
   ordering. Hash indexes store only the hash value of the data being
   indexed, thus there are no restrictions on the size of the data column
   being indexed.
  </para>

  <para>
   Hash indexes support only single-column indexes and do not allow
   uniqueness checking.
  </para>

  <para>
   Hash indexes support only the <literal>=</literal> operator,
   so WHERE clauses that specify range operations will not be able to take
   advantage of hash indexes.
  </para>

  <para>
   Each hash index tuple stores just the 4-byte hash value, not the actual
   column value. As a result, hash indexes may be much smaller than B-trees
   when indexing longer data items such as UUIDs, URLs, etc. The absence of
   the column value also makes all hash index scans lossy. Hash indexes may
   take part in bitmap index scans and backward scans.
  </para>

  <para>
   Hash indexes are best optimized for SELECT and UPDATE-heavy workloads
   that use equality scans on larger tables. In a B-tree index, searches must
   descend through the tree until the leaf page is found. In tables with
   millions of rows, this descent can increase access time to data. The
   equivalent of a leaf page in a hash index is referred to as a bucket page. In
   contrast, a hash index allows accessing the bucket pages directly,
   thereby potentially reducing index access time in larger tables. This
   reduction in "logical I/O" becomes even more pronounced on indexes/data
   larger than shared_buffers/RAM.
  </para>

  <para>
   Hash indexes have been designed to cope with uneven distributions of
   hash values. Direct access to the bucket pages works well if the hash
   values are evenly distributed. When inserts mean that the bucket page
   becomes full, additional overflow pages are chained to that specific
   bucket page, locally expanding the storage for index tuples that match
   that hash value. When scanning a hash bucket during queries, we need to
   scan through all of the overflow pages. Thus an unbalanced hash index
   might actually be worse than a B-tree in terms of number of block
   accesses required, for some data.
  </para>

  <para>
   As a result of the overflow cases, we can say that hash indexes are
   most suitable for unique, nearly unique data or data with a low number
   of rows per hash bucket.
   One possible way to avoid problems is to exclude highly non-unique
   values from the index using a partial index condition, but this may
   not be suitable in many cases.
  </para>

  <para>
   Like B-Trees, hash indexes perform simple index tuple deletion. This
   is a deferred maintenance operation that deletes index tuples that are
   known to be safe to delete (those whose item identifier's LP_DEAD bit
   is already set). If an insert finds no space is available on a page we
   try to avoid creating a new overflow page by attempting to remove dead
   index tuples. Removal cannot occur if the page is pinned at that time.
   Deletion of dead index pointers also occurs during VACUUM.
  </para>

  <para>
   If it can, VACUUM will also try to squeeze the index tuples onto as
   few overflow pages as possible, minimizing the overflow chain. If an
   overflow page becomes empty, overflow pages can be recycled for reuse
   in other buckets, though we never return them to the operating system.
   There is currently no provision to shrink a hash index, other than by
   rebuilding it with REINDEX.
   There is no provision for reducing the number of buckets, either.
  </para>

  <para>
   Hash indexes may expand the number of bucket pages as the number of
   rows indexed grows. The hash key-to-bucket-number mapping is chosen so that
   the index can be incrementally expanded. When a new bucket is to be added to
   the index, exactly one existing bucket will need to be "split", with some of
   its tuples being transferred to the new bucket according to the updated
   key-to-bucket-number mapping.
  </para>

  <para>
   The expansion occurs in the foreground, which could increase execution
   time for user inserts. Thus, hash indexes may not be suitable for tables
   with rapidly increasing number of rows.
  </para>

 </sect1>

 <sect1 id="hash-implementation">
  <title>Implementation</title>

  <para>
   There are four kinds of pages in a hash index: the meta page (page zero),
   which contains statically allocated control information; primary bucket
   pages; overflow pages; and bitmap pages, which keep track of overflow
   pages that have been freed and are available for re-use. For addressing
   purposes, bitmap pages are regarded as a subset of the overflow pages.
  </para>

  <para>
   Both scanning the index and inserting tuples require locating the bucket
   where a given tuple ought to be located. To do this, we need the bucket
   count, highmask, and lowmask from the metapage; however, it's undesirable
   for performance reasons to have to have to lock and pin the metapage for
   every such operation. Instead, we retain a cached copy of the metapage
   in each backend's relcache entry. This will produce the correct bucket
   mapping as long as the target bucket hasn't been split since the last
   cache refresh.
  </para>

  <para>
   Primary bucket pages and overflow pages are allocated independently since
   any given index might need more or fewer overflow pages relative to its
   number of buckets. The hash code uses an interesting set of addressing
   rules to support a variable number of overflow pages while not having to
   move primary bucket pages around after they are created.
  </para>

  <para>
   Each row in the table indexed is represented by a single index tuple in
   the hash index. Hash index tuples are stored in bucket pages, and if
   they exist, overflow pages. We speed up searches by keeping the index entries
   in any one index page sorted by hash code, thus allowing binary search to be
   used within an index page. Note however that there is *no* assumption about
   the relative ordering of hash codes across different index pages of a bucket.
  </para>

  <para>
   The bucket splitting algorithms to expand the hash index are too complex to
   be worthy of mention here, though are described in more detail in
   <filename>src/backend/access/hash/README</filename>.
   The split algorithm is crash safe and can be restarted if not completed
   successfully.
  </para>

 </sect1>

 </chapter>
	<!-- doc/src/sgml/hash.sgml -->

	<chapter id="hash-index">
	<title>Hash Indexes</title>

	<indexterm>
	<primary>index</primary>
	<secondary>Hash</secondary>
	</indexterm>

	<sect1 id="hash-intro">
	<title>Overview</title>

	<para>
	<productname>PostgreSQL</productname>
	includes an implementation of persistent on-disk hash indexes,
	which are fully crash recoverable. Any data type can be indexed by a
	hash index, including data types that do not have a well-defined linear
	ordering. Hash indexes store only the hash value of the data being
	indexed, thus there are no restrictions on the size of the data column
	being indexed.
	</para>

	<para>
	Hash indexes support only single-column indexes and do not allow
	uniqueness checking.
	</para>

	<para>
	Hash indexes support only the <literal>=</literal> operator,
	so WHERE clauses that specify range operations will not be able to take
	advantage of hash indexes.
	</para>

	<para>
	Each hash index tuple stores just the 4-byte hash value, not the actual
	column value. As a result, hash indexes may be much smaller than B-trees
	when indexing longer data items such as UUIDs, URLs, etc. The absence of
	the column value also makes all hash index scans lossy. Hash indexes may
	take part in bitmap index scans and backward scans.
	</para>

	<para>
	Hash indexes are best optimized for SELECT and UPDATE-heavy workloads
	that use equality scans on larger tables. In a B-tree index, searches must
	descend through the tree until the leaf page is found. In tables with
	millions of rows, this descent can increase access time to data. The
	equivalent of a leaf page in a hash index is referred to as a bucket page. In
	contrast, a hash index allows accessing the bucket pages directly,
	thereby potentially reducing index access time in larger tables. This
	reduction in "logical I/O" becomes even more pronounced on indexes/data
	larger than shared_buffers/RAM.
	</para>

	<para>
	Hash indexes have been designed to cope with uneven distributions of
	hash values. Direct access to the bucket pages works well if the hash
	values are evenly distributed. When inserts mean that the bucket page
	becomes full, additional overflow pages are chained to that specific
	bucket page, locally expanding the storage for index tuples that match
	that hash value. When scanning a hash bucket during queries, we need to
	scan through all of the overflow pages. Thus an unbalanced hash index
	might actually be worse than a B-tree in terms of number of block
	accesses required, for some data.
	</para>

	<para>
	As a result of the overflow cases, we can say that hash indexes are
	most suitable for unique, nearly unique data or data with a low number
	of rows per hash bucket.
	One possible way to avoid problems is to exclude highly non-unique
	values from the index using a partial index condition, but this may
	not be suitable in many cases.
	</para>

	<para>
	Like B-Trees, hash indexes perform simple index tuple deletion. This
	is a deferred maintenance operation that deletes index tuples that are
	known to be safe to delete (those whose item identifier's LP_DEAD bit
	is already set). If an insert finds no space is available on a page we
	try to avoid creating a new overflow page by attempting to remove dead
	index tuples. Removal cannot occur if the page is pinned at that time.
	Deletion of dead index pointers also occurs during VACUUM.
	</para>

	<para>
	If it can, VACUUM will also try to squeeze the index tuples onto as
	few overflow pages as possible, minimizing the overflow chain. If an
	overflow page becomes empty, overflow pages can be recycled for reuse
	in other buckets, though we never return them to the operating system.
	There is currently no provision to shrink a hash index, other than by
	rebuilding it with REINDEX.
	There is no provision for reducing the number of buckets, either.
	</para>

	<para>
	Hash indexes may expand the number of bucket pages as the number of
	rows indexed grows. The hash key-to-bucket-number mapping is chosen so that
	the index can be incrementally expanded. When a new bucket is to be added to
	the index, exactly one existing bucket will need to be "split", with some of
	its tuples being transferred to the new bucket according to the updated
	key-to-bucket-number mapping.
	</para>

	<para>
	The expansion occurs in the foreground, which could increase execution
	time for user inserts. Thus, hash indexes may not be suitable for tables
	with rapidly increasing number of rows.
	</para>

	</sect1>

	<sect1 id="hash-implementation">
	<title>Implementation</title>

	<para>
	There are four kinds of pages in a hash index: the meta page (page zero),
	which contains statically allocated control information; primary bucket
	pages; overflow pages; and bitmap pages, which keep track of overflow
	pages that have been freed and are available for re-use. For addressing
	purposes, bitmap pages are regarded as a subset of the overflow pages.
	</para>

	<para>
	Both scanning the index and inserting tuples require locating the bucket
	where a given tuple ought to be located. To do this, we need the bucket
	count, highmask, and lowmask from the metapage; however, it's undesirable
	for performance reasons to have to have to lock and pin the metapage for
	every such operation. Instead, we retain a cached copy of the metapage
	in each backend's relcache entry. This will produce the correct bucket
	mapping as long as the target bucket hasn't been split since the last
	cache refresh.
	</para>

	<para>
	Primary bucket pages and overflow pages are allocated independently since
	any given index might need more or fewer overflow pages relative to its
	number of buckets. The hash code uses an interesting set of addressing
	rules to support a variable number of overflow pages while not having to
	move primary bucket pages around after they are created.
	</para>

	<para>
	Each row in the table indexed is represented by a single index tuple in
	the hash index. Hash index tuples are stored in bucket pages, and if
	they exist, overflow pages. We speed up searches by keeping the index entries
	in any one index page sorted by hash code, thus allowing binary search to be
	used within an index page. Note however that there is no assumption about
	the relative ordering of hash codes across different index pages of a bucket.
	</para>

	<para>
	The bucket splitting algorithms to expand the hash index are too complex to
	be worthy of mention here, though are described in more detail in
	<filename>src/backend/access/hash/README</filename>.
	The split algorithm is crash safe and can be restarted if not completed
	successfully.
	</para>

	</sect1>

	</chapter>