doc/modules/cassandra/pages/data_modeling/data_modeling_logical.adoc - cassandra - Git at Google

 = Logical Data Modeling

 Now that you have defined your queries, you’re ready to begin designing
 Cassandra tables. First, create a logical model containing a table for
 each query, capturing entities and relationships from the conceptual
 model.

 To name each table, you’ll identify the primary entity type for which
 you are querying and use that to start the entity name. If you are
 querying by attributes of other related entities, append those to the
 table name, separated with `_by_`. For example, `hotels_by_poi`.

 Next, you identify the primary key for the table, adding partition key
 columns based on the required query attributes, and clustering columns
 in order to guarantee uniqueness and support desired sort ordering.

 The design of the primary key is extremely important, as it will
 determine how much data will be stored in each partition and how that
 data is organized on disk, which in turn will affect how quickly
 Cassandra processes reads.

 Complete each table by adding any additional attributes identified by
 the query. If any of these additional attributes are the same for every
 instance of the partition key, mark the column as static.

 Now that was a pretty quick description of a fairly involved process, so
 it will be worthwhile to work through a detailed example. First, let’s
 introduce a notation that you can use to represent logical models.

 Several individuals within the Cassandra community have proposed
 notations for capturing data models in diagrammatic form. This document
 uses a notation popularized by Artem Chebotko which provides a simple,
 informative way to visualize the relationships between queries and
 tables in your designs. This figure shows the Chebotko notation for a
 logical data model.

 image::data_modeling_chebotko_logical.png[image]

 Each table is shown with its title and a list of columns. Primary key
 columns are identified via symbols such as *K* for partition key columns
 and **C**↑ or **C**↓ to represent clustering columns. Lines are shown
 entering tables or between tables to indicate the queries that each
 table is designed to support.

 == Hotel Logical Data Model

 The figure below shows a Chebotko logical data model for the queries
 involving hotels, points of interest, rooms, and amenities. One thing
 you'll notice immediately is that the Cassandra design doesn’t include
 dedicated tables for rooms or amenities, as you had in the relational
 design. This is because the workflow didn’t identify any queries
 requiring this direct access.

 image::data_modeling_hotel_logical.png[image]

 Let’s explore the details of each of these tables.

 The first query Q1 is to find hotels near a point of interest, so you’ll
 call this table `hotels_by_poi`. Searching by a named point of interest
 is a clue that the point of interest should be a part of the primary
 key. Let’s reference the point of interest by name, because according to
 the workflow that is how users will start their search.

 You’ll note that you certainly could have more than one hotel near a
 given point of interest, so you’ll need another component in the primary
 key in order to make sure you have a unique partition for each hotel. So
 you add the hotel key as a clustering column.

 An important consideration in designing your table’s primary key is
 making sure that it defines a unique data element. Otherwise you run the
 risk of accidentally overwriting data.

 Now for the second query (Q2), you’ll need a table to get information
 about a specific hotel. One approach would have been to put all of the
 attributes of a hotel in the `hotels_by_poi` table, but you added only
 those attributes that were required by the application workflow.

 From the workflow diagram, you know that the `hotels_by_poi` table is
 used to display a list of hotels with basic information on each hotel,
 and the application knows the unique identifiers of the hotels returned.
 When the user selects a hotel to view details, you can then use Q2,
 which is used to obtain details about the hotel. Because you already
 have the `hotel_id` from Q1, you use that as a reference to the hotel
 you’re looking for. Therefore the second table is just called `hotels`.

 Another option would have been to store a set of `poi_names` in the
 hotels table. This is an equally valid approach. You’ll learn through
 experience which approach is best for your application.

 Q3 is just a reverse of Q1—looking for points of interest near a hotel,
 rather than hotels near a point of interest. This time, however, you
 need to access the details of each point of interest, as represented by
 the `pois_by_hotel` table. As previously, you add the point of interest
 name as a clustering key to guarantee uniqueness.

 At this point, let’s now consider how to support query Q4 to help the
 user find available rooms at a selected hotel for the nights they are
 interested in staying. Note that this query involves both a start date
 and an end date. Because you’re querying over a range instead of a
 single date, you know that you’ll need to use the date as a clustering
 key. Use the `hotel_id` as a primary key to group room data for each
 hotel on a single partition, which should help searches be super fast.
 Let’s call this the `available_rooms_by_hotel_date` table.

 To support searching over a range, use `clustering columns
 <clustering-columns>` to store attributes that you need to access in a
 range query. Remember that the order of the clustering columns is
 important.

 The design of the `available_rooms_by_hotel_date` table is an instance
 of the *wide partition* pattern. This pattern is sometimes called the
 *wide row* pattern when discussing databases that support similar
 models, but wide partition is a more accurate description from a
 Cassandra perspective. The essence of the pattern is to group multiple
 related rows in a partition in order to support fast access to multiple
 rows within the partition in a single query.

 In order to round out the shopping portion of the data model, add the
 `amenities_by_room` table to support Q5. This will allow users to view
 the amenities of one of the rooms that is available for the desired stay
 dates.

 == Reservation Logical Data Model

 Now let's switch gears to look at the reservation queries. The figure
 shows a logical data model for reservations. You’ll notice that these
 tables represent a denormalized design; the same data appears in
 multiple tables, with differing keys.

 image::data_modeling_reservation_logical.png[image]

 In order to satisfy Q6, the `reservations_by_guest` table can be used to
 look up the reservation by guest name. You could envision query Q7 being
 used on behalf of a guest on a self-serve website or a call center agent
 trying to assist the guest. Because the guest name might not be unique,
 you include the guest ID here as a clustering column as well.

 Q8 and Q9 in particular help to remind you to create queries that
 support various stakeholders of the application, not just customers but
 staff as well, and perhaps even the analytics team, suppliers, and so
 on.

 The hotel staff might wish to see a record of upcoming reservations by
 date in order to get insight into how the hotel is performing, such as
 what dates the hotel is sold out or undersold. Q8 supports the retrieval
 of reservations for a given hotel by date.

 Finally, you create a `guests` table. This provides a single location
 that used to store guest information. In this case, you specify a
 separate unique identifier for guest records, as it is not uncommon for
 guests to have the same name. In many organizations, a customer database
 such as the `guests` table would be part of a separate customer
 management application, which is why other guest access patterns were
 omitted from the example.

 == Patterns and Anti-Patterns

 As with other types of software design, there are some well-known
 patterns and anti-patterns for data modeling in Cassandra. You’ve
 already used one of the most common patterns in this hotel model—the
 wide partition pattern.

 The *time series* pattern is an extension of the wide partition pattern.
 In this pattern, a series of measurements at specific time intervals are
 stored in a wide partition, where the measurement time is used as part
 of the partition key. This pattern is frequently used in domains
 including business analysis, sensor data management, and scientific
 experiments.

 The time series pattern is also useful for data other than measurements.
 Consider the example of a banking application. You could store each
 customer’s balance in a row, but that might lead to a lot of read and
 write contention as various customers check their balance or make
 transactions. You’d probably be tempted to wrap a transaction around
 writes just to protect the balance from being updated in error. In
 contrast, a time series–style design would store each transaction as a
 timestamped row and leave the work of calculating the current balance to
 the application.

 One design trap that many new users fall into is attempting to use
 Cassandra as a queue. Each item in the queue is stored with a timestamp
 in a wide partition. Items are appended to the end of the queue and read
 from the front, being deleted after they are read. This is a design that
 seems attractive, especially given its apparent similarity to the time
 series pattern. The problem with this approach is that the deleted items
 are now `tombstones <asynch-deletes>` that Cassandra must scan past in
 order to read from the front of the queue. Over time, a growing number
 of tombstones begins to degrade read performance.

 The queue anti-pattern serves as a reminder that any design that relies
 on the deletion of data is potentially a poorly performing design.

 _Material adapted from Cassandra, The Definitive Guide. Published by
 O'Reilly Media, Inc. Copyright © 2020 Jeff Carpenter, Eben Hewitt. All
 rights reserved. Used with permission._
	= Logical Data Modeling

	Now that you have defined your queries, you’re ready to begin designing
	Cassandra tables. First, create a logical model containing a table for
	each query, capturing entities and relationships from the conceptual
	model.

	To name each table, you’ll identify the primary entity type for which
	you are querying and use that to start the entity name. If you are
	querying by attributes of other related entities, append those to the
	table name, separated with `_by_`. For example, `hotels_by_poi`.

	Next, you identify the primary key for the table, adding partition key
	columns based on the required query attributes, and clustering columns
	in order to guarantee uniqueness and support desired sort ordering.

	The design of the primary key is extremely important, as it will
	determine how much data will be stored in each partition and how that
	data is organized on disk, which in turn will affect how quickly
	Cassandra processes reads.

	Complete each table by adding any additional attributes identified by
	the query. If any of these additional attributes are the same for every
	instance of the partition key, mark the column as static.

	Now that was a pretty quick description of a fairly involved process, so
	it will be worthwhile to work through a detailed example. First, let’s
	introduce a notation that you can use to represent logical models.

	Several individuals within the Cassandra community have proposed
	notations for capturing data models in diagrammatic form. This document
	uses a notation popularized by Artem Chebotko which provides a simple,
	informative way to visualize the relationships between queries and
	tables in your designs. This figure shows the Chebotko notation for a
	logical data model.

	image::data_modeling_chebotko_logical.png[image]

	Each table is shown with its title and a list of columns. Primary key
	columns are identified via symbols such as K for partition key columns
	and C↑ or C↓ to represent clustering columns. Lines are shown
	entering tables or between tables to indicate the queries that each
	table is designed to support.

	== Hotel Logical Data Model

	The figure below shows a Chebotko logical data model for the queries
	involving hotels, points of interest, rooms, and amenities. One thing
	you'll notice immediately is that the Cassandra design doesn’t include
	dedicated tables for rooms or amenities, as you had in the relational
	design. This is because the workflow didn’t identify any queries
	requiring this direct access.

	image::data_modeling_hotel_logical.png[image]

	Let’s explore the details of each of these tables.

	The first query Q1 is to find hotels near a point of interest, so you’ll
	call this table `hotels_by_poi`. Searching by a named point of interest
	is a clue that the point of interest should be a part of the primary
	key. Let’s reference the point of interest by name, because according to
	the workflow that is how users will start their search.

	You’ll note that you certainly could have more than one hotel near a
	given point of interest, so you’ll need another component in the primary
	key in order to make sure you have a unique partition for each hotel. So
	you add the hotel key as a clustering column.

	An important consideration in designing your table’s primary key is
	making sure that it defines a unique data element. Otherwise you run the
	risk of accidentally overwriting data.

	Now for the second query (Q2), you’ll need a table to get information
	about a specific hotel. One approach would have been to put all of the
	attributes of a hotel in the `hotels_by_poi` table, but you added only
	those attributes that were required by the application workflow.

	From the workflow diagram, you know that the `hotels_by_poi` table is
	used to display a list of hotels with basic information on each hotel,
	and the application knows the unique identifiers of the hotels returned.
	When the user selects a hotel to view details, you can then use Q2,
	which is used to obtain details about the hotel. Because you already
	have the `hotel_id` from Q1, you use that as a reference to the hotel
	you’re looking for. Therefore the second table is just called `hotels`.

	Another option would have been to store a set of `poi_names` in the
	hotels table. This is an equally valid approach. You’ll learn through
	experience which approach is best for your application.

	Q3 is just a reverse of Q1—looking for points of interest near a hotel,
	rather than hotels near a point of interest. This time, however, you
	need to access the details of each point of interest, as represented by
	the `pois_by_hotel` table. As previously, you add the point of interest
	name as a clustering key to guarantee uniqueness.

	At this point, let’s now consider how to support query Q4 to help the
	user find available rooms at a selected hotel for the nights they are
	interested in staying. Note that this query involves both a start date
	and an end date. Because you’re querying over a range instead of a
	single date, you know that you’ll need to use the date as a clustering
	key. Use the `hotel_id` as a primary key to group room data for each
	hotel on a single partition, which should help searches be super fast.
	Let’s call this the `available_rooms_by_hotel_date` table.

	To support searching over a range, use `clustering columns
	<clustering-columns>` to store attributes that you need to access in a
	range query. Remember that the order of the clustering columns is
	important.

	The design of the `available_rooms_by_hotel_date` table is an instance
	of the wide partition pattern. This pattern is sometimes called the
	wide row pattern when discussing databases that support similar
	models, but wide partition is a more accurate description from a
	Cassandra perspective. The essence of the pattern is to group multiple
	related rows in a partition in order to support fast access to multiple
	rows within the partition in a single query.

	In order to round out the shopping portion of the data model, add the
	`amenities_by_room` table to support Q5. This will allow users to view
	the amenities of one of the rooms that is available for the desired stay
	dates.

	== Reservation Logical Data Model

	Now let's switch gears to look at the reservation queries. The figure
	shows a logical data model for reservations. You’ll notice that these
	tables represent a denormalized design; the same data appears in
	multiple tables, with differing keys.

	image::data_modeling_reservation_logical.png[image]

	In order to satisfy Q6, the `reservations_by_guest` table can be used to
	look up the reservation by guest name. You could envision query Q7 being
	used on behalf of a guest on a self-serve website or a call center agent
	trying to assist the guest. Because the guest name might not be unique,
	you include the guest ID here as a clustering column as well.

	Q8 and Q9 in particular help to remind you to create queries that
	support various stakeholders of the application, not just customers but
	staff as well, and perhaps even the analytics team, suppliers, and so
	on.

	The hotel staff might wish to see a record of upcoming reservations by
	date in order to get insight into how the hotel is performing, such as
	what dates the hotel is sold out or undersold. Q8 supports the retrieval
	of reservations for a given hotel by date.

	Finally, you create a `guests` table. This provides a single location
	that used to store guest information. In this case, you specify a
	separate unique identifier for guest records, as it is not uncommon for
	guests to have the same name. In many organizations, a customer database
	such as the `guests` table would be part of a separate customer
	management application, which is why other guest access patterns were
	omitted from the example.

	== Patterns and Anti-Patterns

	As with other types of software design, there are some well-known
	patterns and anti-patterns for data modeling in Cassandra. You’ve
	already used one of the most common patterns in this hotel model—the
	wide partition pattern.

	The time series pattern is an extension of the wide partition pattern.
	In this pattern, a series of measurements at specific time intervals are
	stored in a wide partition, where the measurement time is used as part
	of the partition key. This pattern is frequently used in domains
	including business analysis, sensor data management, and scientific
	experiments.

	The time series pattern is also useful for data other than measurements.
	Consider the example of a banking application. You could store each
	customer’s balance in a row, but that might lead to a lot of read and
	write contention as various customers check their balance or make
	transactions. You’d probably be tempted to wrap a transaction around
	writes just to protect the balance from being updated in error. In
	contrast, a time series–style design would store each transaction as a
	timestamped row and leave the work of calculating the current balance to
	the application.

	One design trap that many new users fall into is attempting to use
	Cassandra as a queue. Each item in the queue is stored with a timestamp
	in a wide partition. Items are appended to the end of the queue and read
	from the front, being deleted after they are read. This is a design that
	seems attractive, especially given its apparent similarity to the time
	series pattern. The problem with this approach is that the deleted items
	are now `tombstones <asynch-deletes>` that Cassandra must scan past in
	order to read from the front of the queue. Over time, a growing number
	of tombstones begins to degrade read performance.

	The queue anti-pattern serves as a reminder that any design that relies
	on the deletion of data is potentially a poorly performing design.

	_Material adapted from Cassandra, The Definitive Guide. Published by
	O'Reilly Media, Inc. Copyright © 2020 Jeff Carpenter, Eben Hewitt. All
	rights reserved. Used with permission._