| // Licensed to the Apache Software Foundation (ASF) under one or more |
| // contributor license agreements. See the NOTICE file distributed with |
| // this work for additional information regarding copyright ownership. |
| // The ASF licenses this file to You under the Apache License, Version 2.0 |
| // (the "License"); you may not use this file except in compliance with |
| // the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, software |
| // distributed under the License is distributed on an "AS IS" BASIS, |
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| // See the License for the specific language governing permissions and |
| // limitations under the License. |
| = Getting Started with SQL |
| |
| This guide walks you through using Ignite 3 SQL capabilities via the command-line interface. You'll set up a distributed Apache Ignite cluster, create and manipulate the Chinook database (a sample database representing a digital media store), and learn to leverage Ignite's powerful SQL features. |
| |
| == Prerequisites |
| |
| * Docker and Docker Compose installed on your system; |
| * Basic familiarity with SQL; |
| * Command-line terminal access; |
| * 8GB+ of available RAM for running the containers; |
| * SQL directory with Chinook Database files downloaded. |
| |
| == Before Starting |
| |
| This tutorial uses prepared files to streamline deployment. Make sure you have downloaded the required files: |
| |
| - link:quick-start/sql-files/docker-compose.yml[docker-compose.yml^] |
| - link:quick-start/sql-files/sql.zip[sql archive^] |
| |
| Unpack the archive into a new folder, and place it and the docker compose file in the same directory where you will be running the Docker CLI commands. The tutorial expects these SQL files to be available and mounted to the container. |
| |
| CAUTION: Without these files, you will be unable to load the sample data needed for the exercises. |
| |
| == Setting Up a Ignite 3 Cluster |
| |
| Before we can start using SQL, we need to set up a Ignite cluster. We will use Docker Compose to create a three-node cluster. |
| |
| === Starting the Cluster |
| |
| Open a terminal in the directory containing the docker compose file `docker-compose.yml` file and start the cluster with Docker: |
| |
| [source, bash] |
| ---- |
| docker compose up -d |
| ---- |
| |
| This command starts the cluster in detached mode. You should see startup messages from all three nodes. When they are ready, you will see messages indicating that the servers have started successfully. |
| |
| [source, bash] |
| ---- |
| docker compose up -d |
| |
| [+] Running 4 |
| ✔ Network ignite3_default Created |
| ✔ Container ignite3-node2-1 Started |
| ✔ Container ignite3-node3-1 Started |
| ✔ Container ignite3-node1-1 Started |
| ---- |
| |
| You can check that all containers are running with the following command: |
| |
| [source, bash] |
| ---- |
| docker compose ps |
| ---- |
| |
| You should see all three nodes with a "running" status. |
| |
| [TIP] |
| Verify that all three nodes are "running" before continuing. |
| |
| == Connecting to the Cluster Using Ignite CLI |
| |
| Now we will connect to our running cluster using Ignite command-line interface (CLI). |
| |
| === Starting the CLI |
| |
| In your terminal, run: |
| |
| [source, bash] |
| ---- |
| docker run --rm -it --network=host -e LANG=C.UTF-8 -e LC_ALL=C.UTF-8 -v ./sql/:/opt/ignite/downloads/ apacheignite/ignite:3.0.0 cli |
| ---- |
| |
| This starts an interactive CLI container connected to the same Docker network as our cluster and mounts a volume containing the sql files for the Chinook Database. When prompted, connect to the default node. If you refused the connection, you can do it manually with the following command: |
| |
| [source, bash] |
| ---- |
| connect http://localhost:10300 |
| ---- |
| |
| You should see a message that you're connected to `http://localhost:10300` and a note that the cluster is not initialized. |
| |
| [NOTE] |
| The CLI container runs separately from the cluster nodes, but connects to them over the Docker network. |
| |
| === Initializing the Cluster |
| |
| Before we can use the cluster, we need to initialize it: |
| |
| [source, bash] |
| ---- |
| cluster init --name=ignite3 --metastorage-group=node1,node2,node3 |
| ---- |
| |
| NOTE: If the license is not available, make sure the license file was mounted correctly. |
| |
| [source, text] |
| ---- |
| # ___ __ |
| ### / | ____ ____ _ _____ / /_ ___ |
| # ##### / /| | / __ \ / __ `// ___// __ \ / _ \ |
| ### ###### / ___ | / /_/ // /_/ // /__ / / / // ___/ |
| ##### ####### /_/ |_|/ .___/ \__,_/ \___//_/ /_/ \___/ |
| ####### ###### /_/ |
| ######## #### ____ _ __ _____ |
| # ######## ## / _/____ _ ____ (_)/ /_ ___ |__ / |
| #### ####### # / / / __ `// __ \ / // __// _ \ /_ < |
| ##### ##### _/ / / /_/ // / / // // /_ / ___/ ___/ / |
| #### ## /___/ \__, //_/ /_//_/ \__/ \___/ /____/ |
| ## /____/ |
| |
| Apache Ignite CLI version 3.0.0 |
| |
| |
| You appear to have not connected to any node yet. Do you want to connect to the default node http://localhost:10300? [Y/n] y |
| Connected to http://localhost:10300 |
| The cluster is not initialized. Run cluster init command to initialize it. |
| [node1]> cluster init --name=ignite3 --metastorage-group=node1,node2,node3 |
| Cluster was initialized successfully |
| [node1]> |
| ---- |
| |
| == Creating the Chinook Database Schema |
| |
| Now that our cluster is running and initialized, we can start using SQL to create and work with data in Ignite. The Chinook database is a digital music store dataset, with tables for artists, albums, tracks, customers, and sales. |
| |
| === Entering SQL Mode |
| |
| To start working with SQL, enter SQL mode in the CLI: |
| |
| [source, text] |
| ---- |
| sql |
| ---- |
| |
| Your prompt should change to `sql-cli>` indicating you're now in SQL mode. |
| |
| [source,text] |
| ---- |
| [node1]> sql |
| sql-cli> |
| ---- |
| |
| === Creating Distribution Zones |
| |
| Before we create tables, let's set up distribution zones to control how our data is distributed and replicated across the cluster: |
| |
| [source, sql] |
| ---- |
| CREATE ZONE IF NOT EXISTS Chinook WITH replicas=2, storage_profiles='default'; |
| CREATE ZONE IF NOT EXISTS ChinookReplicated WITH replicas=3, partitions=25, storage_profiles='default'; |
| ---- |
| |
| These commands create two zones: |
| |
| * `Chinook` - Standard zone with 2 replicas for most tables; |
| * `ChinookReplicated` - Zone with 3 replicas for frequently accessed reference data. |
| |
| === Database Entity Relationship |
| |
| Here's the entity relationship diagram for our Chinook database: |
| |
| ++++ |
| <pre class="mermaid"> |
| erDiagram |
| ARTIST ||--o{ ALBUM : creates |
| ALBUM ||--o{ TRACK : contains |
| GENRE ||--o{ TRACK : categorizes |
| MEDIATYPE ||--o{ TRACK : formats |
| CUSTOMER ||--o{ INVOICE : places |
| INVOICE ||--o{ INVOICELINE : contains |
| TRACK ||--o{ INVOICELINE : purchased-in |
| EMPLOYEE ||--o{ CUSTOMER : supports |
| PLAYLIST ||--o{ PLAYLISTTRACK : contains |
| TRACK ||--o{ PLAYLISTTRACK : appears-in |
| ++++ |
| |
| |
| === Creating Core Tables |
| |
| Now let's create the main tables for the Chinook database. We will start with the Artist and Album tables. |
| |
| NOTE: Copy and paste the following SQL blocks at the `sql-cli>` prompt then hit enter. |
| |
| [source, sql] |
| ---- |
| CREATE TABLE Artist ( |
| ArtistId INT NOT NULL, |
| Name VARCHAR(120), |
| PRIMARY KEY (ArtistId) |
| ) ZONE Chinook; |
| |
| CREATE TABLE Album ( |
| AlbumId INT NOT NULL, |
| Title VARCHAR(160) NOT NULL, |
| ArtistId INT NOT NULL, |
| ReleaseYear INT, |
| PRIMARY KEY (AlbumId, ArtistId) |
| ) COLOCATE BY (ArtistId) ZONE Chinook; |
| ---- |
| |
| The `COLOCATE BY` clause in the *Album* table ensures that albums by the same artist are stored on the same nodes. This optimizes joins between Artist and Album tables by eliminating the need for network transfers during queries. |
| |
| Next, let's create the Genre and MediaType reference tables: |
| |
| [source, sql] |
| ---- |
| CREATE TABLE Genre ( |
| GenreId INT NOT NULL, |
| Name VARCHAR(120), |
| PRIMARY KEY (GenreId) |
| ) ZONE ChinookReplicated; |
| |
| CREATE TABLE MediaType ( |
| MediaTypeId INT NOT NULL, |
| Name VARCHAR(120), |
| PRIMARY KEY (MediaTypeId) |
| ) ZONE ChinookReplicated; |
| ---- |
| |
| These reference tables are placed in the `ChinookReplicated` zone with 3 replicas because they contain static data that is frequently joined with other tables. Having a copy on each node improves read performance. |
| |
| Now, let's create the Track table, which references the Album, Genre, and MediaType tables: |
| |
| [source, sql] |
| ---- |
| CREATE TABLE Track ( |
| TrackId INT NOT NULL, |
| Name VARCHAR(200) NOT NULL, |
| AlbumId INT, |
| MediaTypeId INT NOT NULL, |
| GenreId INT, |
| Composer VARCHAR(220), |
| Milliseconds INT NOT NULL, |
| Bytes INT, |
| UnitPrice NUMERIC(10,2) NOT NULL, |
| PRIMARY KEY (TrackId, AlbumId) |
| ) COLOCATE BY (AlbumId) ZONE Chinook; |
| ---- |
| |
| Tracks are colocated by AlbumId, not by TrackId, because most queries join tracks with their albums. This colocation optimizes these common join patterns. |
| |
| Let's also create tables to manage customers, employees, and sales: |
| |
| [source, sql] |
| ---- |
| CREATE TABLE Employee ( |
| EmployeeId INT NOT NULL, |
| LastName VARCHAR(20) NOT NULL, |
| FirstName VARCHAR(20) NOT NULL, |
| Title VARCHAR(30), |
| ReportsTo INT, |
| BirthDate DATE, |
| HireDate DATE, |
| Address VARCHAR(70), |
| City VARCHAR(40), |
| State VARCHAR(40), |
| Country VARCHAR(40), |
| PostalCode VARCHAR(10), |
| Phone VARCHAR(24), |
| Fax VARCHAR(24), |
| Email VARCHAR(60), |
| PRIMARY KEY (EmployeeId) |
| ) ZONE Chinook; |
| |
| CREATE TABLE Customer ( |
| CustomerId INT NOT NULL, |
| FirstName VARCHAR(40) NOT NULL, |
| LastName VARCHAR(20) NOT NULL, |
| Company VARCHAR(80), |
| Address VARCHAR(70), |
| City VARCHAR(40), |
| State VARCHAR(40), |
| Country VARCHAR(40), |
| PostalCode VARCHAR(10), |
| Phone VARCHAR(24), |
| Fax VARCHAR(24), |
| Email VARCHAR(60) NOT NULL, |
| SupportRepId INT, |
| PRIMARY KEY (CustomerId) |
| ) ZONE Chinook; |
| |
| CREATE TABLE Invoice ( |
| InvoiceId INT NOT NULL, |
| CustomerId INT NOT NULL, |
| InvoiceDate DATE NOT NULL, |
| BillingAddress VARCHAR(70), |
| BillingCity VARCHAR(40), |
| BillingState VARCHAR(40), |
| BillingCountry VARCHAR(40), |
| BillingPostalCode VARCHAR(10), |
| Total NUMERIC(10,2) NOT NULL, |
| PRIMARY KEY (InvoiceId, CustomerId) |
| ) COLOCATE BY (CustomerId) ZONE Chinook; |
| |
| CREATE TABLE InvoiceLine ( |
| InvoiceLineId INT NOT NULL, |
| InvoiceId INT NOT NULL, |
| TrackId INT NOT NULL, |
| UnitPrice NUMERIC(10,2) NOT NULL, |
| Quantity INT NOT NULL, |
| PRIMARY KEY (InvoiceLineId, TrackId) |
| ) COLOCATE BY (TrackId) ZONE Chinook; |
| ---- |
| |
| Invoices are colocated by CustomerId and InvoiceLines are colocated by InvoiceId. This creates an efficient chain of locality: Customer → Invoice → InvoiceLine, optimizing queries that analyze customer purchase history. |
| |
| Finally, let's create the playlist-related tables: |
| |
| [source, sql] |
| ---- |
| CREATE TABLE Playlist ( |
| PlaylistId INT NOT NULL, |
| Name VARCHAR(120), |
| PRIMARY KEY (PlaylistId) |
| ) ZONE Chinook; |
| |
| CREATE TABLE PlaylistTrack ( |
| PlaylistId INT NOT NULL, |
| TrackId INT NOT NULL, |
| PRIMARY KEY (PlaylistId, TrackId) |
| ) ZONE Chinook; |
| ---- |
| |
| Note that PlaylistTrack is not colocated with Track. This is a design decision that prioritizes playlist operations over joining with track details. In a real-world scenario, you might make a different colocation choice depending on your most common query patterns. |
| |
| === Verifying Table Creation |
| |
| Let's confirm that all our tables were created successfully: |
| |
| [source, sql] |
| ---- |
| SELECT * FROM system.tables WHERE schema = 'PUBLIC'; |
| ---- |
| |
| This query checks the system tables to verify that our tables exist. You should see a list of all the tables we've created. |
| |
| [source, bash] |
| ---- |
| sql-cli> SELECT * FROM system.tables WHERE schema = 'PUBLIC'; |
| ╔════════╤═══════════════╤════╤═════════════╤═══════════════════╤═════════════════╤══════════════════════╗ |
| ║ SCHEMA │ NAME │ ID │ PK_INDEX_ID │ ZONE │ STORAGE_PROFILE │ COLOCATION_KEY_INDEX ║ |
| ╠════════╪═══════════════╪════╪═════════════╪═══════════════════╪═════════════════╪══════════════════════╣ |
| ║ PUBLIC │ ALBUM │ 20 │ 21 │ CHINOOK │ default │ ARTISTID ║ |
| ╟────────┼───────────────┼────┼─────────────┼───────────────────┼─────────────────┼──────────────────────╢ |
| ║ PUBLIC │ GENRE │ 22 │ 23 │ CHINOOKREPLICATED │ default │ GENREID ║ |
| ╟────────┼───────────────┼────┼─────────────┼───────────────────┼─────────────────┼──────────────────────╢ |
| ║ PUBLIC │ ARTIST │ 18 │ 19 │ CHINOOK │ default │ ARTISTID ║ |
| ╟────────┼───────────────┼────┼─────────────┼───────────────────┼─────────────────┼──────────────────────╢ |
| ║ PUBLIC │ TRACK │ 26 │ 27 │ CHINOOK │ default │ ALBUMID ║ |
| ╟────────┼───────────────┼────┼─────────────┼───────────────────┼─────────────────┼──────────────────────╢ |
| ║ PUBLIC │ PLAYLIST │ 36 │ 37 │ CHINOOK │ default │ PLAYLISTID ║ |
| ╟────────┼───────────────┼────┼─────────────┼───────────────────┼─────────────────┼──────────────────────╢ |
| ║ PUBLIC │ PLAYLISTTRACK │ 38 │ 39 │ CHINOOK │ default │ PLAYLISTID, TRACKID ║ |
| ╟────────┼───────────────┼────┼─────────────┼───────────────────┼─────────────────┼──────────────────────╢ |
| ║ PUBLIC │ MEDIATYPE │ 24 │ 25 │ CHINOOKREPLICATED │ default │ MEDIATYPEID ║ |
| ╟────────┼───────────────┼────┼─────────────┼───────────────────┼─────────────────┼──────────────────────╢ |
| ║ PUBLIC │ INVOICELINE │ 34 │ 35 │ CHINOOK │ default │ TRACKID ║ |
| ╟────────┼───────────────┼────┼─────────────┼───────────────────┼─────────────────┼──────────────────────╢ |
| ║ PUBLIC │ EMPLOYEE │ 28 │ 29 │ CHINOOK │ default │ EMPLOYEEID ║ |
| ╟────────┼───────────────┼────┼─────────────┼───────────────────┼─────────────────┼──────────────────────╢ |
| ║ PUBLIC │ CUSTOMER │ 30 │ 31 │ CHINOOK │ default │ CUSTOMERID ║ |
| ╟────────┼───────────────┼────┼─────────────┼───────────────────┼─────────────────┼──────────────────────╢ |
| ║ PUBLIC │ INVOICE │ 32 │ 33 │ CHINOOK │ default │ CUSTOMERID ║ |
| ╚════════╧═══════════════╧════╧═════════════╧═══════════════════╧═════════════════╧══════════════════════╝ |
| ---- |
| |
| TIP: **Checkpoint**: Verify that all tables appear in the `system.tables` output with their proper zones and colocation settings before proceeding to the next section. |
| |
| == Inserting Sample Data |
| |
| Now that we have our tables set up, let's populate them with sample data. |
| |
| === Adding Artists and Albums |
| |
| Let's start by adding some artists. |
| |
| - Exit the interactive sql mode by typing `exit;`. |
| - Then, load the current store catalog from the sql data file. |
| |
| [source, bash] |
| ---- |
| sql --file=/opt/ignite/downloads/current_catalog.sql |
| ---- |
| |
| [source, bash] |
| ---- |
| sql-cli> exit; |
| [node1]> sql --file=/opt/ignite/downloads/current_catalog.sql |
| Updated 275 rows. |
| Updated 347 rows. |
| ---- |
| |
| === Adding Genres and Media Types |
| |
| Let's populate our reference tables the same way: |
| |
| [source, bash] |
| ---- |
| sql --file=/opt/ignite/downloads/media_and_genre.sql |
| ---- |
| |
| [source, bash] |
| ---- |
| [node1]> sql --file=/opt/ignite/downloads/media_and_genre.sql |
| Updated 25 rows. |
| Updated 5 rows. |
| ---- |
| |
| === Adding Tracks |
| |
| Now let's add some tracks to our albums: |
| |
| [source, bash] |
| ---- |
| sql --file=/opt/ignite/downloads/tracks.sql |
| ---- |
| |
| [source, bash] |
| ---- |
| [node1]> sql --file=/opt/ignite/downloads/tracks.sql |
| Updated 1000 rows. |
| Updated 1000 rows. |
| Updated 1000 rows. |
| Updated 503 rows. |
| ---- |
| |
| === Adding Employees and Customers |
| |
| Let's add some employee and customer data: |
| |
| [source, bash] |
| ---- |
| sql --file=/opt/ignite/downloads/ee_and_cust.sql |
| ---- |
| |
| [source, bash] |
| ---- |
| [node1]> sql --file=/opt/ignite/downloads/ee_and_cust.sql |
| Updated 8 rows. |
| Updated 59 rows. |
| ---- |
| |
| === Adding Invoices and Invoice Lines |
| |
| Finally, let's add some sales data: |
| |
| [source, bash] |
| ---- |
| sql --file=/opt/ignite/downloads/invoices.sql |
| ---- |
| |
| [source, bash] |
| ---- |
| [node1]> sql --file=/opt/ignite/downloads/invoices.sql |
| Updated 412 rows. |
| Updated 1000 rows. |
| Updated 1000 rows. |
| Updated 240 rows. |
| Updated 18 rows. |
| Updated 1000 rows. |
| Updated 1000 rows. |
| Updated 1000 rows. |
| Updated 1000 rows. |
| Updated 1000 rows. |
| Updated 1000 rows. |
| Updated 1000 rows. |
| Updated 1000 rows. |
| Updated 715 rows. |
| ---- |
| |
| TIP: **Checkpoint**: Verify that all the data has been loaded successfully by checking that the "Updated X rows" messages match the expected row counts for each file. |
| |
| == Querying Data in Ignite SQL |
| |
| Now that we have data in our tables, let's run some SQL queries to explore the Chinook database. |
| |
| === Basic Queries |
| |
| Let's return to the `sql-cli>` and start with some simple SELECT queries: |
| |
| [source, bash] |
| ---- |
| sql |
| ---- |
| |
| [source, sql] |
| ---- |
| -- Get all artists |
| SELECT * FROM Artist; |
| |
| -- Get all albums for a specific artist |
| SELECT * FROM Album WHERE ArtistId = 3; |
| |
| -- Get all tracks for a specific album |
| SELECT * FROM Track WHERE AlbumId = 133; |
| ---- |
| |
| === Joins |
| |
| Now let's try some more complex queries with joins: |
| |
| [source, sql] |
| ---- |
| -- Get all tracks with artist and album information |
| SELECT |
| t.Name AS TrackName, |
| a.Title AS AlbumTitle, |
| ar.Name AS ArtistName |
| FROM |
| Track t |
| JOIN Album a ON t.AlbumId = a.AlbumId |
| JOIN Artist ar ON a.ArtistId = ar.ArtistId |
| LIMIT 10; |
| ---- |
| |
| This complex query uses multiple CTEs to achieve what would typically be done with the RANK() window function. We first count tracks per album, then determine the maximum track count per artist, and finally join these results to identify albums with the most tracks for each artist. |
| |
| == Data Manipulation in Ignite SQL |
| |
| Let's explore how to modify data using SQL in Ignite. |
| |
| === Understanding Distributed Updates |
| |
| When you update data in a distributed database, the changes need to be coordinated across multiple nodes: |
| |
| ++++ |
| <pre class="mermaid"> |
| sequenceDiagram |
| participant Client |
| participant Node1 |
| participant Node2 |
| participant Node3 |
| |
| Client->>Node1: UPDATE request |
| Node1->>Node1: Update local primary copy |
| Node1->>Node2: Propagate changes to backup copy |
| Node1-->>Client: Confirm update completed |
| ++++ |
| |
| === Inserting New Data |
| |
| Let's add a new artist and album: |
| |
| [source, sql] |
| ---- |
| -- Insert a new artist |
| INSERT INTO Artist (ArtistId, Name) |
| VALUES (276, 'New Discovery Band'); |
| |
| -- Insert a new album for this artist |
| INSERT INTO Album (AlbumId, Title, ArtistId, ReleaseYear) |
| VALUES (348, 'First Light', 276, 2023); |
| |
| -- Verify the insertions |
| SELECT * FROM Artist WHERE ArtistId = 276; |
| SELECT * FROM Album WHERE AlbumId = 348; |
| ---- |
| |
| === Updating Existing Data |
| |
| Now let's update some of our existing data: |
| |
| [source, sql] |
| ---- |
| -- Update the album release year |
| UPDATE Album |
| SET ReleaseYear = 2024 |
| WHERE AlbumId = 348; |
| |
| -- Update the artist name |
| UPDATE Artist |
| SET Name = 'New Discovery Ensemble' |
| WHERE ArtistId = 276; |
| |
| -- Verify the updates |
| SELECT * FROM Artist WHERE ArtistId = 276; |
| SELECT * FROM Album WHERE AlbumId = 348; |
| ---- |
| |
| In a distributed database like Ignite, these updates are automatically propagated to all replicas. The primary copy is updated first, then the changes are sent to the backup copies on other nodes. |
| |
| === Deleting Data |
| |
| Finally, let's clean up by deleting the data we added: |
| |
| [source, sql] |
| ---- |
| -- Delete the album |
| DELETE FROM Album WHERE AlbumId = 348; |
| |
| -- Delete the artist |
| DELETE FROM Artist WHERE ArtistId = 276; |
| |
| -- Verify the deletions |
| SELECT * FROM Artist WHERE ArtistId = 276; |
| SELECT * FROM Album WHERE AlbumId = 348; |
| ---- |
| |
| == Advanced SQL Features |
| |
| Let's explore some of Ignite's more advanced SQL features. |
| |
| === Querying System Views |
| |
| Ignite provides system views that let you inspect cluster metadata: |
| |
| [source, sql] |
| ---- |
| -- View all tables in the cluster |
| SELECT * FROM system.tables; |
| |
| -- View all zones |
| SELECT * FROM system.zones; |
| |
| -- View all columns for a specific table |
| SELECT * FROM system.table_columns WHERE TABLE_NAME = 'TRACK'; |
| ---- |
| |
| System views provide important metadata about your cluster configuration. They are essential for monitoring and troubleshooting in production environments. |
| |
| === Creating Indexes for Better Performance |
| |
| Let's add some indexes to improve query performance: |
| |
| [source, sql] |
| ---- |
| -- Create an index on the Name column of the Track table |
| CREATE INDEX idx_track_name ON Track (Name); |
| |
| -- Create a composite index on Artist and Album |
| CREATE INDEX idx_album_artist ON Album (ArtistId, Title); |
| |
| -- Create a composite index on Track's AlbumId and Name columns to optimize joins with Album table |
| -- and to improve performance when filtering or sorting by track name within an album |
| CREATE INDEX idx_track_albumid_name ON Track(AlbumId, Name); |
| |
| -- Create an index on Album Title to speed up searches and sorts by album title |
| CREATE INDEX idx_album_title ON Album(Title); |
| |
| -- Create a composite index on InvoiceLine connecting TrackId and InvoiceId |
| -- This supports efficient queries that join InvoiceLine with Track while filtering by InvoiceId |
| CREATE INDEX idx_invoiceline_trackid_invoiceid ON InvoiceLine(TrackId, InvoiceId); |
| |
| -- Create a hash index for lookups by email |
| CREATE INDEX idx_customer_email ON Customer USING HASH (Email); |
| |
| -- Check index information |
| SELECT * FROM system.indexes; |
| ---- |
| |
| Indexes improve query performance, but come with maintenance costs. Each write operation must also update all indexes. Choose indexes that support your most common query patterns rather than indexing everything. |
| |
| == Creating a Dashboard Using SQL |
| |
| Let's create SQL queries that could be used for a music store dashboard. These queries could be saved and run periodically to generate reports. |
| |
| === Monthly Sales Summary |
| |
| [source, sql] |
| ---- |
| -- Monthly sales summary for the last 12 months |
| SELECT |
| CAST(EXTRACT(YEAR FROM i.InvoiceDate) AS VARCHAR) || '-' || |
| CASE |
| WHEN EXTRACT(MONTH FROM i.InvoiceDate) < 10 |
| THEN '0' || CAST(EXTRACT(MONTH FROM i.InvoiceDate) AS VARCHAR) |
| ELSE CAST(EXTRACT(MONTH FROM i.InvoiceDate) AS VARCHAR) |
| END AS YearMonth, |
| COUNT(DISTINCT i.InvoiceId) AS InvoiceCount, |
| COUNT(DISTINCT i.CustomerId) AS CustomerCount, |
| SUM(i.Total) AS MonthlyRevenue, |
| AVG(i.Total) AS AverageOrderValue |
| FROM |
| Invoice i |
| GROUP BY |
| EXTRACT(YEAR FROM i.InvoiceDate), EXTRACT(MONTH FROM i.InvoiceDate) |
| ORDER BY |
| YearMonth DESC; |
| ---- |
| |
| This query formats the year and month into a sortable string (YYYY-MM) while calculating several key business metrics. |
| |
| === Top Selling Genres |
| |
| [source, sql] |
| ---- |
| -- Top selling genres by revenue |
| SELECT |
| g.Name AS Genre, |
| SUM(il.UnitPrice * il.Quantity) AS Revenue |
| FROM |
| InvoiceLine il |
| JOIN Track t ON il.TrackId = t.TrackId |
| JOIN Genre g ON t.GenreId = g.GenreId |
| GROUP BY |
| g.Name |
| ORDER BY |
| Revenue DESC; |
| ---- |
| |
| === Sales Performance by Employee |
| |
| [source, sql] |
| ---- |
| -- Sales performance by employee |
| SELECT |
| e.EmployeeId, |
| e.FirstName || ' ' || e.LastName AS EmployeeName, |
| COUNT(DISTINCT i.InvoiceId) AS TotalInvoices, |
| COUNT(DISTINCT i.CustomerId) AS UniqueCustomers, |
| SUM(i.Total) AS TotalSales |
| FROM |
| Employee e |
| JOIN Customer c ON e.EmployeeId = c.SupportRepId |
| JOIN Invoice i ON c.CustomerId = i.CustomerId |
| GROUP BY |
| e.EmployeeId, e.FirstName, e.LastName |
| ORDER BY |
| TotalSales DESC; |
| ---- |
| |
| === Top 20 Longest Tracks with Genres |
| |
| [source, sql] |
| ---- |
| -- Top 20 longest tracks with genre information |
| SELECT |
| t.trackid, |
| t.name AS track_name, |
| g.name AS genre_name, |
| ROUND(t.milliseconds / (1000 * 60), 2) AS duration_minutes |
| FROM |
| track t |
| JOIN genre g ON t.genreId = g.genreId |
| WHERE |
| t.genreId < 17 |
| ORDER BY |
| duration_minutes DESC |
| LIMIT |
| 20; |
| ---- |
| |
| === Customer Purchase Patterns by Month |
| |
| [source, sql] |
| ---- |
| -- Customer purchase patterns by month |
| SELECT |
| c.CustomerId, |
| c.FirstName || ' ' || c.LastName AS CustomerName, |
| CAST(EXTRACT(YEAR FROM i.InvoiceDate) AS VARCHAR) || '-' || |
| CASE |
| WHEN EXTRACT(MONTH FROM i.InvoiceDate) < 10 |
| THEN '0' || CAST(EXTRACT(MONTH FROM i.InvoiceDate) AS VARCHAR) |
| ELSE CAST(EXTRACT(MONTH FROM i.InvoiceDate) AS VARCHAR) |
| END AS YearMonth, |
| COUNT(DISTINCT i.InvoiceId) AS NumberOfPurchases, |
| SUM(i.Total) AS TotalSpent, |
| SUM(i.Total) / COUNT(DISTINCT i.InvoiceId) AS AveragePurchaseValue |
| FROM |
| Customer c |
| JOIN Invoice i ON c.CustomerId = i.CustomerId |
| GROUP BY |
| c.CustomerId, c.FirstName, c.LastName, |
| EXTRACT(YEAR FROM i.InvoiceDate), EXTRACT(MONTH FROM i.InvoiceDate) |
| ORDER BY |
| c.CustomerId, YearMonth; |
| ---- |
| |
| == Performance Tuning with Colocated Tables |
| |
| One of the key advantages of Ignite is its ability to optimize joins through data colocation. Let's explore this with our existing colocated tables. |
| |
| === Colocated Queries |
| |
| Let's start by looking at a query where there is a mismatch in the colocation strategy. |
| |
| [source, sql] |
| ---- |
| --This is an example of poorly created table. |
| CREATE TABLE InvoiceLine ( |
| InvoiceLineId INT NOT NULL, |
| InvoiceId INT NOT NULL, |
| TrackId INT NOT NULL, |
| UnitPrice NUMERIC(10,2) NOT NULL, |
| Quantity INT NOT NULL, |
| PRIMARY KEY (InvoiceLineId, InvoiceId) |
| ) COLOCATE BY (InvoiceId) ZONE Chinook; |
| ---- |
| |
| If we create the `InvoiceLine` table to be colocated by InvoiceId, we end up with a mismatch for our query. |
| |
| * Album is colocated by ArtistId |
| * Track is colocated by AlbumId |
| * InvoiceLine is colocated by InvoiceId |
| |
| This means that when you run a query joining InvoiceLine, Track, and Album, the data might be spread across different nodes because they're colocated on different keys. Our query is looking for invoice ID 1, then joining with Track and Album, but these tables are colocated on different keys. |
| |
| [source, sql] |
| ---- |
| EXPLAIN PLAN FOR |
| SELECT |
| il.InvoiceId, |
| COUNT(il.InvoiceLineId) AS LineItemCount, |
| SUM(il.UnitPrice * il.Quantity) AS InvoiceTotal, |
| t.Name AS TrackName, |
| a.Title AS AlbumTitle |
| FROM |
| InvoiceLine il |
| JOIN Track t ON il.TrackId = t.TrackId |
| JOIN Album a ON t.AlbumId = a.AlbumId |
| WHERE |
| il.InvoiceId = 1 |
| GROUP BY |
| il.InvoiceId, t.Name, a.Title; |
| ---- |
| |
| [source, text] |
| ---- |
| ╔═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗ |
| ║ PLAN ║ |
| ╠═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣ |
| ║ Project(INVOICEID=[$0], LINEITEMCOUNT=[$3], INVOICETOTAL=[$4], TRACKNAME=[$1], ALBUMTITLE=[$2]): rowcount = 4484471.100479999, cumulative cost = IgniteCost [rowCount=2.3054813220479995E7, cpu=2.3643376967575923E7, memory=9.866772781055996E7, io=2.0, network=50190.0], id = 23843 ║ |
| ║ ColocatedHashAggregate(group=[{0, 1, 2}], LINEITEMCOUNT=[COUNT()], INVOICETOTAL=[SUM($3)]): rowcount = 4484471.100479999, cumulative cost = IgniteCost [rowCount=1.8570341119999997E7, cpu=1.9158904867095925E7, memory=9.866772681055996E7, io=1.0, network=50189.0], id = 23842 ║ |
| ║ Project(INVOICEID=[$3], TRACKNAME=[$1], ALBUMTITLE=[$8], $f4=[*($5, $6)]): rowcount = 9189489.959999999, cumulative cost = IgniteCost [rowCount=9380851.159999998, cpu=9969414.907095924, memory=9362.6, io=1.0, network=50189.0], id = 23841 ║ |
| ║ MergeJoin(condition=[=($2, $7)], joinType=[inner], leftCollation=[[2]], rightCollation=[[0]]): rowcount = 9189489.959999999, cumulative cost = IgniteCost [rowCount=191360.19999999998, cpu=779923.9470959246, memory=9361.6, io=0.0, network=50188.0], id = 23840 ║ |
| ║ HashJoin(condition=[=($4, $0)], joinType=[inner]): rowcount = 176551.19999999998, cumulative cost = IgniteCost [rowCount=13421.0, cpu=65201.0, memory=6585.6, io=0.0, network=47412.0], id = 23836 ║ |
| ║ Exchange(distribution=[single]): rowcount = 3503.0, cumulative cost = IgniteCost [rowCount=7006.0, cpu=17515.0, memory=0.0, io=0.0, network=42036.0], id = 23833 ║ |
| ║ IndexScan(table=[[PUBLIC, TRACK]], tableId=[26], index=[IDX_TRACK_ALBUMID_NAME], type=[SORTED], requiredColumns=[{0, 1, 2}], collation=[[2, 1]]): rowcount = 3503.0, cumulative cost = IgniteCost [rowCount=3503.0, cpu=14012.0, memory=0.0, io=0.0, network=0.0], id = 23832 ║ |
| ║ Exchange(distribution=[single]): rowcount = 336.0, cumulative cost = IgniteCost [rowCount=2576.0, cpu=9296.0, memory=0.0, io=0.0, network=5376.0], id = 23835 ║ |
| ║ TableScan(table=[[PUBLIC, INVOICELINE]], tableId=[34], filters=[=($t0, 1)], requiredColumns=[{1, 2, 3, 4}]): rowcount = 336.0, cumulative cost = IgniteCost [rowCount=2240.0, cpu=8960.0, memory=0.0, io=0.0, network=0.0], id = 23834 ║ |
| ║ Exchange(distribution=[single]): rowcount = 347.0, cumulative cost = IgniteCost [rowCount=1041.0, cpu=7130.147095924681, memory=2776.0, io=0.0, network=2776.0], id = 23839 ║ |
| ║ Sort(sort0=[$0], dir0=[ASC]): rowcount = 347.0, cumulative cost = IgniteCost [rowCount=694.0, cpu=6783.147095924681, memory=2776.0, io=0.0, network=0.0], id = 23838 ║ |
| ║ TableScan(table=[[PUBLIC, ALBUM]], tableId=[20], requiredColumns=[{0, 1}]): rowcount = 347.0, cumulative cost = IgniteCost [rowCount=347.0, cpu=347.0, memory=0.0, io=0.0, network=0.0], id = 23837 ║ |
| ╚═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝ |
| ---- |
| |
| ==== Key Observations in the Execution Plan |
| |
| *ColocatedHashAggregate Operation*: The plan uses a `ColocatedHashAggregate` operation, which indicates Ignite recognizes that portions of the aggregation can happen on colocated data before results are combined. This reduces network transfer during the `GROUP BY` operation. |
| |
| *Exchange Operations*: Several `Exchange(distribution=[single])` operations appear in the plan, indicating data movement between nodes is still necessary. These operations are applied to the Album table, Track table, and InvoiceLine filtered results. |
| |
| *Join Implementation*: The plan shows a combination of `HashJoin` and `MergeJoin` operations rather than nested loop joins. The optimizer has determined these join types are more efficient for the data volumes involved: |
| |
| * HashJoin for joining Track with Album |
| * MergeJoin for joining the above result with InvoiceLine |
| |
| *Efficient Data Access*: The query uses an `IndexScan` with the `IDX_INVOICELINE_INVOICE_TRACK` index rather than a full table scan on InvoiceLine. This provides: |
| |
| * Efficient filtering with `searchBounds: [ExactBounds [bound=1], null]` for InvoiceId = 1 |
| * Pre-sorted results with `collation: [INVOICEID ASC, TRACKID ASC]` |
| |
| *Row Count Estimation*: There appears to be a significant increase in estimated row counts after joins: |
| |
| * Initial InvoiceLine filtered rows: 746 |
| * After HashJoin with Album: 182,331 |
| * After MergeJoin with Track: 20,400,668 |
| |
| === Improved Cololocation Strategy |
| |
| However, if we create the `InvoiceLine` table to be colocated by `TrackId`, we dramaticly optimize our query. |
| |
| [source, sql] |
| ---- |
| --This table was already created on an earlier step. |
| CREATE TABLE InvoiceLine ( |
| InvoiceLineId INT NOT NULL, |
| InvoiceId INT NOT NULL, |
| TrackId INT NOT NULL, |
| UnitPrice NUMERIC(10,2) NOT NULL, |
| Quantity INT NOT NULL, |
| PRIMARY KEY (InvoiceLineId, TrackId) |
| ) COLOCATE BY (TrackId) ZONE Chinook; |
| ---- |
| |
| And run `EXPLAIN PLAN FOR` again... |
| |
| [source, sql] |
| ---- |
| EXPLAIN PLAN FOR |
| SELECT |
| il.InvoiceId, |
| COUNT(il.InvoiceLineId) AS LineItemCount, |
| SUM(il.UnitPrice * il.Quantity) AS InvoiceTotal, |
| t.Name AS TrackName, |
| a.Title AS AlbumTitle |
| FROM |
| Track t |
| JOIN Album a ON t.AlbumId = a.AlbumId |
| JOIN InvoiceLine il ON t.TrackId = il.TrackId |
| WHERE |
| il.InvoiceId = 1 |
| GROUP BY |
| il.InvoiceId, t.Name, a.Title; |
| ---- |
| |
| [source, text] |
| ---- |
| ╔════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗ |
| ║ PLAN ║ |
| ╠════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣ |
| ║ Project(INVOICEID=[$0], LINEITEMCOUNT=[$3], INVOICETOTAL=[$4], TRACKNAME=[$1], ALBUMTITLE=[$2]): rowcount = 2.0019960269999995E9, cumulative cost = IgniteCost [rowCount=1.020839200715E10, cpu=1.0214411135647097E10, memory=4.404685537199999E10, io=2.0, network=2444814.0], id = 25112 ║ |
| ║ ColocatedHashAggregate(group=[{0, 1, 2}], LINEITEMCOUNT=[COUNT()], INVOICETOTAL=[SUM($3)]): rowcount = 2.0019960269999995E9, cumulative cost = IgniteCost [rowCount=8.20639597915E9, cpu=8.212415107647097E9, memory=4.404685537099999E10, io=1.0, network=2444813.0], id = 25111 ║ |
| ║ Project(INVOICEID=[$5], TRACKNAME=[$1], ALBUMTITLE=[$4], $f4=[*($7, $8)]): rowcount = 4.102450875E9, cumulative cost = IgniteCost [rowCount=4.10394510415E9, cpu=4.109964232647096E9, memory=2942777.0, io=1.0, network=2444813.0], id = 25110 ║ |
| ║ HashJoin(condition=[=($0, $6)], joinType=[inner]): rowcount = 4.102450875E9, cumulative cost = IgniteCost [rowCount=1494228.15, cpu=7513356.647095924, memory=2942776.0, io=0.0, network=2444812.0], id = 25109 ║ |
| ║ MergeJoin(condition=[=($2, $3)], joinType=[inner], leftCollation=[[2, 1]], rightCollation=[[0]]): rowcount = 182331.15, cumulative cost = IgniteCost [rowCount=11897.0, cpu=40045.14709592468, memory=2776.0, io=0.0, network=44812.0], id = 25106 ║ |
| ║ Exchange(distribution=[single]): rowcount = 3503.0, cumulative cost = IgniteCost [rowCount=7006.0, cpu=17515.0, memory=0.0, io=0.0, network=42036.0], id = 25102 ║ |
| ║ IndexScan(table=[[PUBLIC, TRACK]], tableId=[26], index=[IDX_TRACK_ALBUMID_NAME], type=[SORTED], requiredColumns=[{0, 1, 2}], collation=[[2, 1]]): rowcount = 3503.0, cumulative cost = IgniteCost [rowCount=3503.0, cpu=14012.0, memory=0.0, io=0.0, network=0.0], id = 25101 ║ |
| ║ Exchange(distribution=[single]): rowcount = 347.0, cumulative cost = IgniteCost [rowCount=1041.0, cpu=7130.147095924681, memory=2776.0, io=0.0, network=2776.0], id = 25105 ║ |
| ║ Sort(sort0=[$0], dir0=[ASC]): rowcount = 347.0, cumulative cost = IgniteCost [rowCount=694.0, cpu=6783.147095924681, memory=2776.0, io=0.0, network=0.0], id = 25104 ║ |
| ║ TableScan(table=[[PUBLIC, ALBUM]], tableId=[20], requiredColumns=[{0, 1}]): rowcount = 347.0, cumulative cost = IgniteCost [rowCount=347.0, cpu=347.0, memory=0.0, io=0.0, network=0.0], id = 25103 ║ |
| ║ Exchange(distribution=[single]): rowcount = 150000.0, cumulative cost = IgniteCost [rowCount=1150000.0, cpu=4150000.0, memory=0.0, io=0.0, network=2400000.0], id = 25108 ║ |
| ║ TableScan(table=[[PUBLIC, INVOICELINE]], tableId=[46], filters=[=($t0, 1)], requiredColumns=[{1, 2, 3, 4}]): rowcount = 150000.0, cumulative cost = IgniteCost [rowCount=1000000.0, cpu=4000000.0, memory=0.0, io=0.0, network=0.0], id = 25107 ║ |
| ╚════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝ |
| ---- |
| |
| ==== Key Observations in the Execution Plan |
| |
| *ColocatedHashAggregate Operation*: The plan uses a `ColocatedHashAggregate` operation, which indicates Ignite recognizes that portions of the aggregation can happen on colocated data before results are combined. This reduces network transfer during the `GROUP BY` operation. |
| |
| *Improved Row Count Estimates*: Notice the dramatic improvement in row count estimates, which now show just 1 row at each step. This indicates the optimizer has much better statistics and understanding of the actual data distribution compared to the original plan that estimated millions of rows. |
| |
| *Join Implementation*: The plan shows a combination of `HashJoin` and `MergeJoin` operations: |
| |
| * HashJoin for joining Track with InvoiceLine |
| * MergeJoin for joining the above result with Album |
| |
| *Efficient Index Usage*: The query now uses the composite index `IDX_TRACK_ALBUMID_NAME` on the Track table, providing: |
| |
| * Efficient sorted access by AlbumId and Name |
| * Direct access to the fields needed for the join and select operations |
| |
| *Exchange Operations*: While Exchange operations still appear in the plan, the estimated row counts are now minimal (just 1 row per exchange). This suggests much less data movement between nodes compared to the original plan where millions of rows were estimated to be transferred. |
| |
| ==== Colocation Impact |
| |
| The substantial improvement in this execution plan demonstrates the power of proper data colocation in Ignite. By: |
| |
| 1. Structuring the query to join the tables in the optimal order (Track → Album → InvoiceLine) |
| 2. Creating appropriate supporting indexes |
| 3. Ensuring proper colocation between related tables |
| |
| We've achieved a dramatic reduction in estimated row counts and data movement. The execution plan now shows streamlined operations with minimal row estimates at each step, indicating an efficient execution path that takes advantage of data locality. |
| |
| This optimization approach highlights three key principles for optimal performance in distributed SQL databases: |
| |
| * Proper colocation of related data |
| * Supporting indexes aligned with join patterns |
| * Query structure that follows the colocation model |
| |
| == Cleaning Up |
| |
| When you are finished with the Ignite SQL CLI, you can exit by typing: |
| |
| [source, sql] |
| ---- |
| exit; |
| ---- |
| |
| This will return you to the Ignite CLI. To exit the Ignite CLI, type: |
| |
| [source, bash] |
| ---- |
| exit |
| ---- |
| |
| To stop the Ignite cluster, run the following command in your terminal: |
| |
| [source, bash] |
| ---- |
| docker compose down |
| ---- |
| |
| This will stop and remove the Docker containers for your Ignite cluster. |
| |
| == Best Practices for Ignite SQL |
| |
| To get the most out of Ignite SQL, follow these best practices: |
| |
| === Schema Design |
| |
| * Use appropriate colocation for tables that are frequently joined; |
| * Choose primary keys that distribute data evenly across the cluster; |
| * Design with query patterns in mind, especially for large-scale deployments. |
| |
| === Query Optimization |
| |
| * Create indexes for columns used in `WHERE`, `JOIN`, and `ORDER BY` clauses; |
| * Use the `EXPLAIN` statement to analyze and optimize your queries; |
| * Avoid cartesian products and inefficient join conditions. |
| |
| === Transaction Management |
| |
| * Keep transactions as short as possible; |
| * Do not hold transactions open during user think time; |
| * Group related operations into a single transaction for atomicity. |
| |
| === Resource Management |
| |
| * Monitor query performance in production; |
| * Consider partitioning strategies for very large tables; |
| * Use appropriate data types to minimize storage requirements. |
| |
| == What's Next |
| |
| Ignite's SQL capabilities make it a powerful platform for building distributed applications that require high throughput, low latency, and strong consistency. By following the patterns and practices in this guide, you can leverage Ignite SQL to build scalable, resilient systems. |
| |
| Remember that Ignite is not just a SQL database—it's a comprehensive distributed computing platform with capabilities beyond what we've covered here. As you become more comfortable with Ignite SQL, you may want to explore other features such as compute grid, machine learning, and stream processing. |
| |
| Happy querying! |