| = Evaluating and Refining Data Models |
| |
| Once you’ve created a physical model, there are some steps you’ll want |
| to take to evaluate and refine table designs to help ensure optimal |
| performance. |
| |
| == Calculating Partition Size |
| |
| The first thing that you want to look for is whether your tables will |
| have partitions that will be overly large, or to put it another way, too |
| wide. Partition size is measured by the number of cells (values) that |
| are stored in the partition. Cassandra’s hard limit is 2 billion cells |
| per partition, but you’ll likely run into performance issues before |
| reaching that limit. |
| |
| In order to calculate the size of partitions, use the following formula: |
| |
| [latexmath] |
| ++++ |
| \[N_v = N_r (N_c - N_{pk} - N_s) + N_s\] |
| ++++ |
| |
| The number of values (or cells) in the partition (N~v~) is equal to the |
| number of static columns (N~s~) plus the product of the number of rows |
| (N~r~) and the number of of values per row. The number of values per row |
| is defined as the number of columns (N~c~) minus the number of primary |
| key columns (N~pk~) and static columns (N~s~). |
| |
| The number of columns tends to be relatively static, although it is |
| possible to alter tables at runtime. For this reason, a primary driver |
| of partition size is the number of rows in the partition. This is a key |
| factor that you must consider in determining whether a partition has the |
| potential to get too large. Two billion values sounds like a lot, but in |
| a sensor system where tens or hundreds of values are measured every |
| millisecond, the number of values starts to add up pretty fast. |
| |
| Let’s take a look at one of the tables to analyze the partition size. |
| Because it has a wide partition design with one partition per hotel, |
| look at the `available_rooms_by_hotel_date` table. The table has four |
| columns total (N~c~ = 4), including three primary key columns (N~pk~ = |
| 3) and no static columns (N~s~ = 0). Plugging these values into the |
| formula, the result is: |
| |
| [latexmath] |
| ++++ |
| \[N_v = N_r (4 - 3 - 0) + 0 = 1N_r\] |
| ++++ |
| |
| Therefore the number of values for this table is equal to the number of |
| rows. You still need to determine a number of rows. To do this, make |
| estimates based on the application design. The table is storing a record |
| for each room, in each of hotel, for every night. Let's assume the |
| system will be used to store two years of inventory at a time, and there |
| are 5,000 hotels in the system, with an average of 100 rooms in each |
| hotel. |
| |
| Since there is a partition for each hotel, the estimated number of rows |
| per partition is as follows: |
| |
| [latexmath] |
| ++++ |
| \[N_r = 100 rooms/hotel \times 730 days = 73,000 rows\] |
| ++++ |
| |
| This relatively small number of rows per partition is not going to get |
| you in too much trouble, but if you start storing more dates of |
| inventory, or don’t manage the size of the inventory well using TTL, you |
| could start having issues. You still might want to look at breaking up |
| this large partition, which you'll see how to do shortly. |
| |
| When performing sizing calculations, it is tempting to assume the |
| nominal or average case for variables such as the number of rows. |
| Consider calculating the worst case as well, as these sorts of |
| predictions have a way of coming true in successful systems. |
| |
| == Calculating Size on Disk |
| |
| In addition to calculating the size of a partition, it is also an |
| excellent idea to estimate the amount of disk space that will be |
| required for each table you plan to store in the cluster. In order to |
| determine the size, use the following formula to determine the size S~t~ |
| of a partition: |
| |
| [latexmath] |
| ++++ |
| \[S_t = \displaystyle\sum_i sizeOf\big (c_{k_i}\big) + \displaystyle\sum_j sizeOf\big(c_{s_j}\big) + N_r\times \bigg(\displaystyle\sum_k sizeOf\big(c_{r_k}\big) + \displaystyle\sum_l sizeOf\big(c_{c_l}\big)\bigg) +\] |
| ++++ |
| |
| [latexmath] |
| ++++ |
| \[N_v\times sizeOf\big(t_{avg}\big)\] |
| ++++ |
| |
| This is a bit more complex than the previous formula, but let's break it |
| down a bit at a time. Let’s take a look at the notation first: |
| |
| * In this formula, c~k~ refers to partition key columns, c~s~ to static |
| columns, c~r~ to regular columns, and c~c~ to clustering columns. |
| * The term t~avg~ refers to the average number of bytes of metadata |
| stored per cell, such as timestamps. It is typical to use an estimate of |
| 8 bytes for this value. |
| * You'll recognize the number of rows N~r~ and number of values N~v~ |
| from previous calculations. |
| * The *sizeOf()* function refers to the size in bytes of the CQL data |
| type of each referenced column. |
| |
| The first term asks you to sum the size of the partition key columns. |
| For this example, the `available_rooms_by_hotel_date` table has a single |
| partition key column, the `hotel_id`, which is of type `text`. Assuming |
| that hotel identifiers are simple 5-character codes, you have a 5-byte |
| value, so the sum of the partition key column sizes is 5 bytes. |
| |
| The second term asks you to sum the size of the static columns. This |
| table has no static columns, so the size is 0 bytes. |
| |
| The third term is the most involved, and for good reason—it is |
| calculating the size of the cells in the partition. Sum the size of the |
| clustering columns and regular columns. The two clustering columns are |
| the `date`, which is 4 bytes, and the `room_number`, which is a 2-byte |
| short integer, giving a sum of 6 bytes. There is only a single regular |
| column, the boolean `is_available`, which is 1 byte in size. Summing the |
| regular column size (1 byte) plus the clustering column size (6 bytes) |
| gives a total of 7 bytes. To finish up the term, multiply this value by |
| the number of rows (73,000), giving a result of 511,000 bytes (0.51 MB). |
| |
| The fourth term is simply counting the metadata that that Cassandra |
| stores for each cell. In the storage format used by Cassandra 3.0 and |
| later, the amount of metadata for a given cell varies based on the type |
| of data being stored, and whether or not custom timestamp or TTL values |
| are specified for individual cells. For this table, reuse the number of |
| values from the previous calculation (73,000) and multiply by 8, which |
| gives 0.58 MB. |
| |
| Adding these terms together, you get a final estimate: |
| |
| [latexmath] |
| ++++ |
| \[Partition size = 16 bytes + 0 bytes + 0.51 MB + 0.58 MB = 1.1 MB\] |
| ++++ |
| |
| This formula is an approximation of the actual size of a partition on |
| disk, but is accurate enough to be quite useful. Remembering that the |
| partition must be able to fit on a single node, it looks like the table |
| design will not put a lot of strain on disk storage. |
| |
| Cassandra’s storage engine was re-implemented for the 3.0 release, |
| including a new format for SSTable files. The previous format stored a |
| separate copy of the clustering columns as part of the record for each |
| cell. The newer format eliminates this duplication, which reduces the |
| size of stored data and simplifies the formula for computing that size. |
| |
| Keep in mind also that this estimate only counts a single replica of |
| data. You will need to multiply the value obtained here by the number of |
| partitions and the number of replicas specified by the keyspace’s |
| replication strategy in order to determine the total required total |
| capacity for each table. This will come in handy when you plan your |
| cluster. |
| |
| == Breaking Up Large Partitions |
| |
| As discussed previously, the goal is to design tables that can provide |
| the data you need with queries that touch a single partition, or failing |
| that, the minimum possible number of partitions. However, as shown in |
| the examples, it is quite possible to design wide partition-style tables |
| that approach Cassandra’s built-in limits. Performing sizing analysis on |
| tables may reveal partitions that are potentially too large, either in |
| number of values, size on disk, or both. |
| |
| The technique for splitting a large partition is straightforward: add an |
| additional column to the partition key. In most cases, moving one of the |
| existing columns into the partition key will be sufficient. Another |
| option is to introduce an additional column to the table to act as a |
| sharding key, but this requires additional application logic. |
| |
| Continuing to examine the available rooms example, if you add the `date` |
| column to the partition key for the `available_rooms_by_hotel_date` |
| table, each partition would then represent the availability of rooms at |
| a specific hotel on a specific date. This will certainly yield |
| partitions that are significantly smaller, perhaps too small, as the |
| data for consecutive days will likely be on separate nodes. |
| |
| Another technique known as *bucketing* is often used to break the data |
| into moderate-size partitions. For example, you could bucketize the |
| `available_rooms_by_hotel_date` table by adding a `month` column to the |
| partition key, perhaps represented as an integer. The comparision with |
| the original design is shown in the figure below. While the `month` |
| column is partially duplicative of the `date`, it provides a nice way of |
| grouping related data in a partition that will not get too large. |
| |
| image::data_modeling_hotel_bucketing.png[image] |
| |
| If you really felt strongly about preserving a wide partition design, |
| you could instead add the `room_id` to the partition key, so that each |
| partition would represent the availability of the room across all dates. |
| Because there was no query identified that involves searching |
| availability of a specific room, the first or second design approach is |
| most suitable to the application needs. |
| |
| _Material adapted from Cassandra, The Definitive Guide. Published by |
| O'Reilly Media, Inc. Copyright © 2020 Jeff Carpenter, Eben Hewitt. All |
| rights reserved. Used with permission._ |