// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
= Data Partitioning
Data partitioning is a method of subdividing large sets of data into smaller chunks and distributing them across all server nodes in a balanced manner.
== How Data is Partitioned
When a table is created, it is always assigned to a link:administrators-guide/distribution-zones[distribution zone]. Based on the distribution zone parameters, the table's data is split into `PARTITIONS` parts, called *partitions*, and each partition is stored `REPLICAS` times across the cluster. Each partition is identified by a number from a limited set (0 to 24 by default). Each individual copy of a partition is called a *replica* and is stored on a separate node, if possible. Ignite uses the *Fair* partition distribution algorithm, which means that Ignite stores information on partition distribution and uses it when assigning new partitions. This information is preserved in the cluster metastorage and is recalculated only when necessary.
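For example, the following statement creates a zone that splits the data of its tables into 50 partitions and keeps 3 copies of each partition (the zone name and the values below are illustrative):
[source, sql]
----
CREATE ZONE IF NOT EXISTS myZone WITH STORAGE_PROFILES='default', PARTITIONS=50, REPLICAS=3
----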
Once partitions and all their replicas are created, they are distributed across the available cluster nodes that are included in the distribution zone (as determined by the `DATA_NODES_FILTER` parameter) according to the *partition distribution algorithm*. Thus, each key is mapped to a list of nodes owning the corresponding partition and is stored on those nodes. When data is added, it is distributed evenly across partitions.
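For example, assuming the nodes were started with a custom `region` attribute, a zone can be restricted to a subset of nodes with a filter (the attribute name and the JSONPath expression below are illustrative):
[source, sql]
----
CREATE ZONE IF NOT EXISTS usZone WITH STORAGE_PROFILES='default', DATA_NODES_FILTER='$[?(@.region == "US")]'
----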
image::images/partitioning.png[Ignite Partitions]
Nodes store partitions in the folder specified by the `ignite.system.partitionsBasePath` property or, by default, in the `work/partitions` folder. Nodes also store partition-specific RAFT logs, with information on RAFT elections and consensus, in the folder specified by `ignite.system.partitionsLogPath`.
=== Default Partition Number
When creating a distribution zone, you can manually set the number of partitions with the `PARTITIONS` parameter, for example:
[source, sql]
----
CREATE ZONE IF NOT EXISTS exampleZone WITH STORAGE_PROFILES='default', PARTITIONS=10
----
Otherwise, Ignite automatically calculates the recommended number of partitions as follows:
----
dataNodesCount * coresOnNode * 2 / replicas
----
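For example, a zone that spans 4 data nodes with 8 CPU cores each and `REPLICAS=3` gets `4 * 8 * 2 / 3 ≈ 21` partitions.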
== Primary Replicas and Leases
Once the partitions are distributed across the nodes, Ignite forms *replication groups* for all partitions of the table, and each group elects its leader. To linearize writes to partitions, Ignite designates one replica of each partition as the *primary replica*.
To designate a primary replica, Ignite uses a process of granting a *lease*. Leases are granted by the *lease placement driver* and identify the node that houses the primary replica, called the *leaseholder*. Once a lease is granted, information about it is written to the link:administrators-guide/lifecycle#cluster-metastorage-group[metastorage] and provided to all nodes in the cluster. Usually, the primary replica is the same replica as the replication group leader.
Granted leases are valid for a short period of time and are extended every couple of seconds, preserving the continuity of each lease. A lease cannot be revoked until it expires. In exceptional situations (for example, when the primary replica can no longer serve as primary, the leaseholder node goes offline, or the replication group becomes inoperable), the placement driver waits for the lease to expire and then initiates the negotiation of a new one.
Only the primary replica can handle the operations of read-write transactions. Other replicas of the partition can be read by using read-only transactions.
If a new replica is chosen to receive the lease, it first makes sure that its stored data is up-to-date with its replication group. In scenarios where the replication group is no longer operable (for example, a node unexpectedly leaves the cluster and the group loses its majority), Ignite follows the link:administrators-guide/disaster-recovery[disaster recovery] procedure, and you may need to reset the partitions manually.
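To check whether the partitions of your tables are healthy, you can query the partition states. The sketch below assumes that the `SYSTEM.GLOBAL_PARTITION_STATES` system view and its `STATE` column are available in your version; see the disaster recovery documentation for the exact views and states:
[source, sql]
----
SELECT ZONE_NAME, TABLE_NAME, PARTITION_ID, STATE
FROM SYSTEM.GLOBAL_PARTITION_STATES
WHERE STATE <> 'AVAILABLE'
----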
== Version Storage
As new data is written to a partition, Ignite does not immediately delete the old data. Instead, Ignite keeps old versions of keys in a *version chain* within the same partition.
Older key versions can only be accessed by link:developers-guide/transactions#read-only-transactions[read-only transactions], while the up-to-date version can be accessed by any transaction.
Older key versions are kept until the *low watermark* point is reached. By default, the low watermark is 600000 ms (10 minutes), and it can be changed in the link:administrators-guide/config/cluster-config#garbage-collection-configuration[cluster configuration]. Increasing the data availability time means that old key versions are stored and available for longer; however, keeping them may require extra storage space, depending on cluster load.
In a similar manner, link:sql-reference/ddl#drop-table[dropped tables] are not removed from disk until the low watermark point is reached; however, you can no longer write to these tables. Read-only transactions that try to read from a dropped table succeed if they read data at a timestamp before the table was dropped, and delay the low watermark point if that is necessary to complete the transaction.
Once the low watermark is reached, old versions of data are considered garbage and are cleaned up by the garbage collector during the next cleanup. Until then, this data may or may not still be available, as garbage collection is not an immediate process. If a transaction was already started before the low watermark was reached, the data it requires is kept available until the end of the transaction, even if garbage collection happens in the meantime. Additionally, Ignite checks that old data is not required anywhere on the cluster before cleaning it up.
== Distribution Reset
SQL query performance can deteriorate in a cluster where tables have been created over a long period of time, alongside topology changes, due to suboptimal data colocation. To resolve this issue, you can reset (recalculate) the partition distribution using the link:ignite-cli-tool#distribution-commands[CLI] or the link:developers-guide/rest/rest-api[REST API].
NOTE: A reset is likely to trigger a <<Partition Rebalance>>, which may take a long time.
== Partition Rebalance
When the cluster size changes, Ignite waits for the timeout specified in the `DATA_NODES_AUTO_ADJUST_SCALE_UP` or `DATA_NODES_AUTO_ADJUST_SCALE_DOWN` distribution zone properties, and then redistributes partitions according to the partition distribution algorithm and transfers data so that the replicas are up-to-date with their replication groups. This process is called *data rebalance*.
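As a minimal sketch (the zone name and the timeout value are illustrative), the scale-up timeout of an existing zone can be changed with `ALTER ZONE`, so that rebalance is triggered only after the new topology has been stable for the specified time:
[source, sql]
----
ALTER ZONE IF EXISTS exampleZone SET DATA_NODES_AUTO_ADJUST_SCALE_UP=300
----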