doc/modules/cassandra/pages/data_modeling/data_modeling_refining.adoc - cassandra - Git at Google

 = Evaluating and Refining Data Models

 Once you’ve created a physical model, there are some steps you’ll want
 to take to evaluate and refine table designs to help ensure optimal
 performance.

 == Calculating Partition Size

 The first thing that you want to look for is whether your tables will
 have partitions that will be overly large, or to put it another way, too
 wide. Partition size is measured by the number of cells (values) that
 are stored in the partition. Cassandra’s hard limit is 2 billion cells
 per partition, but you’ll likely run into performance issues before
 reaching that limit.

 In order to calculate the size of partitions, use the following formula:

 [latexmath]
 ++++
 \[N_v = N_r (N_c - N_{pk} - N_s) + N_s\]
 ++++

 The number of values (or cells) in the partition (N~v~) is equal to the
 number of static columns (N~s~) plus the product of the number of rows
 (N~r~) and the number of of values per row. The number of values per row
 is defined as the number of columns (N~c~) minus the number of primary
 key columns (N~pk~) and static columns (N~s~).

 The number of columns tends to be relatively static, although it is
 possible to alter tables at runtime. For this reason, a primary driver
 of partition size is the number of rows in the partition. This is a key
 factor that you must consider in determining whether a partition has the
 potential to get too large. Two billion values sounds like a lot, but in
 a sensor system where tens or hundreds of values are measured every
 millisecond, the number of values starts to add up pretty fast.

 Let’s take a look at one of the tables to analyze the partition size.
 Because it has a wide partition design with one partition per hotel,
 look at the `available_rooms_by_hotel_date` table. The table has four
 columns total (N~c~ = 4), including three primary key columns (N~pk~ =
 3) and no static columns (N~s~ = 0). Plugging these values into the
 formula, the result is:

 [latexmath]
 ++++
 \[N_v = N_r (4 - 3 - 0) + 0 = 1N_r\]
 ++++

 Therefore the number of values for this table is equal to the number of
 rows. You still need to determine a number of rows. To do this, make
 estimates based on the application design. The table is storing a record
 for each room, in each of hotel, for every night. Let's assume the
 system will be used to store two years of inventory at a time, and there
 are 5,000 hotels in the system, with an average of 100 rooms in each
 hotel.

 Since there is a partition for each hotel, the estimated number of rows
 per partition is as follows:

 [latexmath]
 ++++
 \[N_r = 100 rooms/hotel \times 730 days = 73,000 rows\]
 ++++

 This relatively small number of rows per partition is not going to get
 you in too much trouble, but if you start storing more dates of
 inventory, or don’t manage the size of the inventory well using TTL, you
 could start having issues. You still might want to look at breaking up
 this large partition, which you'll see how to do shortly.

 When performing sizing calculations, it is tempting to assume the
 nominal or average case for variables such as the number of rows.
 Consider calculating the worst case as well, as these sorts of
 predictions have a way of coming true in successful systems.

 == Calculating Size on Disk

 In addition to calculating the size of a partition, it is also an
 excellent idea to estimate the amount of disk space that will be
 required for each table you plan to store in the cluster. In order to
 determine the size, use the following formula to determine the size S~t~
 of a partition:

 [latexmath]
 ++++
 \[S_t = \displaystyle\sum_i sizeOf\big (c_{k_i}\big) + \displaystyle\sum_j sizeOf\big(c_{s_j}\big) + N_r\times \bigg(\displaystyle\sum_k sizeOf\big(c_{r_k}\big) + \displaystyle\sum_l sizeOf\big(c_{c_l}\big)\bigg) +\]
 ++++

 [latexmath]
 ++++
 \[N_v\times sizeOf\big(t_{avg}\big)\]
 ++++

 This is a bit more complex than the previous formula, but let's break it
 down a bit at a time. Let’s take a look at the notation first:

 * In this formula, c~k~ refers to partition key columns, c~s~ to static
 columns, c~r~ to regular columns, and c~c~ to clustering columns.
 * The term t~avg~ refers to the average number of bytes of metadata
 stored per cell, such as timestamps. It is typical to use an estimate of
 8 bytes for this value.
 * You'll recognize the number of rows N~r~ and number of values N~v~
 from previous calculations.
 * The *sizeOf()* function refers to the size in bytes of the CQL data
 type of each referenced column.

 The first term asks you to sum the size of the partition key columns.
 For this example, the `available_rooms_by_hotel_date` table has a single
 partition key column, the `hotel_id`, which is of type `text`. Assuming
 that hotel identifiers are simple 5-character codes, you have a 5-byte
 value, so the sum of the partition key column sizes is 5 bytes.

 The second term asks you to sum the size of the static columns. This
 table has no static columns, so the size is 0 bytes.

 The third term is the most involved, and for good reason—it is
 calculating the size of the cells in the partition. Sum the size of the
 clustering columns and regular columns. The two clustering columns are
 the `date`, which is 4 bytes, and the `room_number`, which is a 2-byte
 short integer, giving a sum of 6 bytes. There is only a single regular
 column, the boolean `is_available`, which is 1 byte in size. Summing the
 regular column size (1 byte) plus the clustering column size (6 bytes)
 gives a total of 7 bytes. To finish up the term, multiply this value by
 the number of rows (73,000), giving a result of 511,000 bytes (0.51 MB).

 The fourth term is simply counting the metadata that that Cassandra
 stores for each cell. In the storage format used by Cassandra 3.0 and
 later, the amount of metadata for a given cell varies based on the type
 of data being stored, and whether or not custom timestamp or TTL values
 are specified for individual cells. For this table, reuse the number of
 values from the previous calculation (73,000) and multiply by 8, which
 gives 0.58 MB.

 Adding these terms together, you get a final estimate:

 [latexmath]
 ++++
 \[Partition size = 16 bytes + 0 bytes + 0.51 MB + 0.58 MB = 1.1 MB\]
 ++++

 This formula is an approximation of the actual size of a partition on
 disk, but is accurate enough to be quite useful. Remembering that the
 partition must be able to fit on a single node, it looks like the table
 design will not put a lot of strain on disk storage.

 Cassandra’s storage engine was re-implemented for the 3.0 release,
 including a new format for SSTable files. The previous format stored a
 separate copy of the clustering columns as part of the record for each
 cell. The newer format eliminates this duplication, which reduces the
 size of stored data and simplifies the formula for computing that size.

 Keep in mind also that this estimate only counts a single replica of
 data. You will need to multiply the value obtained here by the number of
 partitions and the number of replicas specified by the keyspace’s
 replication strategy in order to determine the total required total
 capacity for each table. This will come in handy when you plan your
 cluster.

 == Breaking Up Large Partitions

 As discussed previously, the goal is to design tables that can provide
 the data you need with queries that touch a single partition, or failing
 that, the minimum possible number of partitions. However, as shown in
 the examples, it is quite possible to design wide partition-style tables
 that approach Cassandra’s built-in limits. Performing sizing analysis on
 tables may reveal partitions that are potentially too large, either in
 number of values, size on disk, or both.

 The technique for splitting a large partition is straightforward: add an
 additional column to the partition key. In most cases, moving one of the
 existing columns into the partition key will be sufficient. Another
 option is to introduce an additional column to the table to act as a
 sharding key, but this requires additional application logic.

 Continuing to examine the available rooms example, if you add the `date`
 column to the partition key for the `available_rooms_by_hotel_date`
 table, each partition would then represent the availability of rooms at
 a specific hotel on a specific date. This will certainly yield
 partitions that are significantly smaller, perhaps too small, as the
 data for consecutive days will likely be on separate nodes.

 Another technique known as *bucketing* is often used to break the data
 into moderate-size partitions. For example, you could bucketize the
 `available_rooms_by_hotel_date` table by adding a `month` column to the
 partition key, perhaps represented as an integer. The comparision with
 the original design is shown in the figure below. While the `month`
 column is partially duplicative of the `date`, it provides a nice way of
 grouping related data in a partition that will not get too large.

 image::data_modeling_hotel_bucketing.png[image]

 If you really felt strongly about preserving a wide partition design,
 you could instead add the `room_id` to the partition key, so that each
 partition would represent the availability of the room across all dates.
 Because there was no query identified that involves searching
 availability of a specific room, the first or second design approach is
 most suitable to the application needs.

 _Material adapted from Cassandra, The Definitive Guide. Published by
 O'Reilly Media, Inc. Copyright © 2020 Jeff Carpenter, Eben Hewitt. All
 rights reserved. Used with permission._
	= Evaluating and Refining Data Models

	Once you’ve created a physical model, there are some steps you’ll want
	to take to evaluate and refine table designs to help ensure optimal
	performance.

	== Calculating Partition Size

	The first thing that you want to look for is whether your tables will
	have partitions that will be overly large, or to put it another way, too
	wide. Partition size is measured by the number of cells (values) that
	are stored in the partition. Cassandra’s hard limit is 2 billion cells
	per partition, but you’ll likely run into performance issues before
	reaching that limit.

	In order to calculate the size of partitions, use the following formula:

	[latexmath]
	++++
	\[N_v = N_r (N_c - N_{pk} - N_s) + N_s\]
	++++

	The number of values (or cells) in the partition (N~v~) is equal to the
	number of static columns (N~s~) plus the product of the number of rows
	(N~r~) and the number of of values per row. The number of values per row
	is defined as the number of columns (N~c~) minus the number of primary
	key columns (N~pk~) and static columns (N~s~).

	The number of columns tends to be relatively static, although it is
	possible to alter tables at runtime. For this reason, a primary driver
	of partition size is the number of rows in the partition. This is a key
	factor that you must consider in determining whether a partition has the
	potential to get too large. Two billion values sounds like a lot, but in
	a sensor system where tens or hundreds of values are measured every
	millisecond, the number of values starts to add up pretty fast.

	Let’s take a look at one of the tables to analyze the partition size.
	Because it has a wide partition design with one partition per hotel,
	look at the `available_rooms_by_hotel_date` table. The table has four
	columns total (N~c~ = 4), including three primary key columns (N~pk~ =
	3) and no static columns (N~s~ = 0). Plugging these values into the
	formula, the result is:

	[latexmath]
	++++
	\[N_v = N_r (4 - 3 - 0) + 0 = 1N_r\]
	++++

	Therefore the number of values for this table is equal to the number of
	rows. You still need to determine a number of rows. To do this, make
	estimates based on the application design. The table is storing a record
	for each room, in each of hotel, for every night. Let's assume the
	system will be used to store two years of inventory at a time, and there
	are 5,000 hotels in the system, with an average of 100 rooms in each
	hotel.

	Since there is a partition for each hotel, the estimated number of rows
	per partition is as follows:

	[latexmath]
	++++
	\[N_r = 100 rooms/hotel \times 730 days = 73,000 rows\]
	++++

	This relatively small number of rows per partition is not going to get
	you in too much trouble, but if you start storing more dates of
	inventory, or don’t manage the size of the inventory well using TTL, you
	could start having issues. You still might want to look at breaking up
	this large partition, which you'll see how to do shortly.

	When performing sizing calculations, it is tempting to assume the
	nominal or average case for variables such as the number of rows.
	Consider calculating the worst case as well, as these sorts of
	predictions have a way of coming true in successful systems.

	== Calculating Size on Disk

	In addition to calculating the size of a partition, it is also an
	excellent idea to estimate the amount of disk space that will be
	required for each table you plan to store in the cluster. In order to
	determine the size, use the following formula to determine the size S~t~
	of a partition:

	[latexmath]
	++++
	\[S_t = \displaystyle\sum_i sizeOf\big (c_{k_i}\big) + \displaystyle\sum_j sizeOf\big(c_{s_j}\big) + N_r\times \bigg(\displaystyle\sum_k sizeOf\big(c_{r_k}\big) + \displaystyle\sum_l sizeOf\big(c_{c_l}\big)\bigg) +\]
	++++

	[latexmath]
	++++
	\[N_v\times sizeOf\big(t_{avg}\big)\]
	++++

	This is a bit more complex than the previous formula, but let's break it
	down a bit at a time. Let’s take a look at the notation first:

	* In this formula, c~k~ refers to partition key columns, c~s~ to static
	columns, c~r~ to regular columns, and c~c~ to clustering columns.
	* The term t~avg~ refers to the average number of bytes of metadata
	stored per cell, such as timestamps. It is typical to use an estimate of
	8 bytes for this value.
	* You'll recognize the number of rows N~r~ and number of values N~v~
	from previous calculations.
	* The sizeOf() function refers to the size in bytes of the CQL data
	type of each referenced column.

	The first term asks you to sum the size of the partition key columns.
	For this example, the `available_rooms_by_hotel_date` table has a single
	partition key column, the `hotel_id`, which is of type `text`. Assuming
	that hotel identifiers are simple 5-character codes, you have a 5-byte
	value, so the sum of the partition key column sizes is 5 bytes.

	The second term asks you to sum the size of the static columns. This
	table has no static columns, so the size is 0 bytes.

	The third term is the most involved, and for good reason—it is
	calculating the size of the cells in the partition. Sum the size of the
	clustering columns and regular columns. The two clustering columns are
	the `date`, which is 4 bytes, and the `room_number`, which is a 2-byte
	short integer, giving a sum of 6 bytes. There is only a single regular
	column, the boolean `is_available`, which is 1 byte in size. Summing the
	regular column size (1 byte) plus the clustering column size (6 bytes)
	gives a total of 7 bytes. To finish up the term, multiply this value by
	the number of rows (73,000), giving a result of 511,000 bytes (0.51 MB).

	The fourth term is simply counting the metadata that that Cassandra
	stores for each cell. In the storage format used by Cassandra 3.0 and
	later, the amount of metadata for a given cell varies based on the type
	of data being stored, and whether or not custom timestamp or TTL values
	are specified for individual cells. For this table, reuse the number of
	values from the previous calculation (73,000) and multiply by 8, which
	gives 0.58 MB.

	Adding these terms together, you get a final estimate:

	[latexmath]
	++++
	\[Partition size = 16 bytes + 0 bytes + 0.51 MB + 0.58 MB = 1.1 MB\]
	++++

	This formula is an approximation of the actual size of a partition on
	disk, but is accurate enough to be quite useful. Remembering that the
	partition must be able to fit on a single node, it looks like the table
	design will not put a lot of strain on disk storage.

	Cassandra’s storage engine was re-implemented for the 3.0 release,
	including a new format for SSTable files. The previous format stored a
	separate copy of the clustering columns as part of the record for each
	cell. The newer format eliminates this duplication, which reduces the
	size of stored data and simplifies the formula for computing that size.

	Keep in mind also that this estimate only counts a single replica of
	data. You will need to multiply the value obtained here by the number of
	partitions and the number of replicas specified by the keyspace’s
	replication strategy in order to determine the total required total
	capacity for each table. This will come in handy when you plan your
	cluster.

	== Breaking Up Large Partitions

	As discussed previously, the goal is to design tables that can provide
	the data you need with queries that touch a single partition, or failing
	that, the minimum possible number of partitions. However, as shown in
	the examples, it is quite possible to design wide partition-style tables
	that approach Cassandra’s built-in limits. Performing sizing analysis on
	tables may reveal partitions that are potentially too large, either in
	number of values, size on disk, or both.

	The technique for splitting a large partition is straightforward: add an
	additional column to the partition key. In most cases, moving one of the
	existing columns into the partition key will be sufficient. Another
	option is to introduce an additional column to the table to act as a
	sharding key, but this requires additional application logic.

	Continuing to examine the available rooms example, if you add the `date`
	column to the partition key for the `available_rooms_by_hotel_date`
	table, each partition would then represent the availability of rooms at
	a specific hotel on a specific date. This will certainly yield
	partitions that are significantly smaller, perhaps too small, as the
	data for consecutive days will likely be on separate nodes.

	Another technique known as bucketing is often used to break the data
	into moderate-size partitions. For example, you could bucketize the
	`available_rooms_by_hotel_date` table by adding a `month` column to the
	partition key, perhaps represented as an integer. The comparision with
	the original design is shown in the figure below. While the `month`
	column is partially duplicative of the `date`, it provides a nice way of
	grouping related data in a partition that will not get too large.

	image::data_modeling_hotel_bucketing.png[image]

	If you really felt strongly about preserving a wide partition design,
	you could instead add the `room_id` to the partition key, so that each
	partition would represent the availability of the room across all dates.
	Because there was no query identified that involves searching
	availability of a specific room, the first or second design approach is
	most suitable to the application needs.

	_Material adapted from Cassandra, The Definitive Guide. Published by
	O'Reilly Media, Inc. Copyright © 2020 Jeff Carpenter, Eben Hewitt. All
	rights reserved. Used with permission._