| = Bulk Loading |
| |
| Bulk loading Apache Cassandra data is supported by several tools. |
| The data to bulk load must be in the form of SSTables. |
| Cassandra does not directly support loading data in any other format, |
| such as CSV, JSON, or XML. |
| Although the cqlsh `COPY` command can load CSV data, it is not a good option |
| for large amounts of data. |
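| |
| For reference, loading a small CSV file with the cqlsh `COPY` command might look like this (the CSV file name is illustrative): |
| |
| [source,cql] |
| ---- |
| COPY catalogkeyspace.magazine (id, name, publisher) FROM 'magazine.csv' WITH HEADER = true; |
| ---- |
| |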
| Bulk loading is used to: |
| |
| * Restore incremental backups and snapshots. Backups and snapshots are |
| already in the form of SSTables. |
| * Load existing SSTables into another cluster. The target cluster can have a |
| different number of nodes or a different replication strategy. |
| * Load external data to a cluster. |
| |
| == Tools for Bulk Loading |
| |
| Cassandra provides two tools for bulk loading data: |
| |
| * The Cassandra bulk loader, also called `sstableloader` |
| * The `nodetool import` command |
| |
| The `sstableloader` and `nodetool import` commands are accessible if the |
| Cassandra installation `bin` directory is in the `PATH` environment |
| variable; otherwise, they can be run directly from the `bin` directory. |
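| |
| For example, assuming Cassandra is installed at `~/cassandra`: |
| |
| [source,bash] |
| ---- |
| $ export PATH="$PATH:$HOME/cassandra/bin" |
| ---- |
| |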
| The examples use the keyspaces and tables created in xref:cql/operating/backups.adoc[Backups]. |
| |
| == Using sstableloader |
| |
| The `sstableloader` is the main tool for bulk uploading data. |
| `sstableloader` streams SSTable data files to a running cluster, |
| conforming to the replication strategy and replication factor. |
| The table to upload data into does not need to be empty. |
| |
| The only requirements to run `sstableloader` are: |
| |
| * One or more comma-separated initial hosts to connect to for ring |
| information |
| * A directory path for the SSTables to load |
| |
| [source,bash] |
| ---- |
| sstableloader [options] <dir_path> |
| ---- |
| |
| `sstableloader` bulk loads the SSTables found in the directory |
| `<dir_path>` into the configured cluster. |
| The last two levels of `<dir_path>` are used as the target _keyspace/table_ name. |
| For example, to load an SSTable named `Standard1-g-1-Data.db` into `Keyspace1/Standard1`, |
| you will need to have the files `Standard1-g-1-Data.db` and `Standard1-g-1-Index.db` in a |
| directory `/path/to/Keyspace1/Standard1/`. |
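| |
| With those files in place, the load might be run as follows (the node address is illustrative): |
| |
| [source,bash] |
| ---- |
| $ sstableloader --nodes 10.0.2.238 /path/to/Keyspace1/Standard1/ |
| ---- |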
| |
| === sstableloader Option to Accept a Target Keyspace Name |
| |
| As part of a backup strategy, some Cassandra DBAs store an entire data directory. |
| When data corruption is found, restoring the data to the same cluster, but under a |
| different keyspace name, is common practice, especially for large clusters (for example, 200 nodes). |
| |
| By default, `sstableloader` derives the keyspace name from the directory structure. |
| To specify the target keyspace name explicitly, |
| version 4.0 adds support for the `--target-keyspace` option |
| (https://issues.apache.org/jira/browse/CASSANDRA-13884[CASSANDRA-13884]). |
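| |
| As a sketch, loading the same SSTables under a different keyspace name (`catalogkeyspace2` is an illustrative name) would look like: |
| |
| [source,bash] |
| ---- |
| $ sstableloader --nodes 10.0.2.238 --target-keyspace catalogkeyspace2 /catalogkeyspace/magazine/ |
| ---- |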
| |
| The following options are supported, with `-d,--nodes <initial hosts>` required: |
| |
| [source,none] |
| ---- |
| -alg,--ssl-alg <ALGORITHM>                        Client SSL: algorithm |
| -ap,--auth-provider <auth provider>               Custom AuthProvider class name for |
|                                                   cassandra authentication |
| -ciphers,--ssl-ciphers <CIPHER-SUITES>            Client SSL: comma-separated list of |
|                                                   encryption suites to use |
| -cph,--connections-per-host <connectionsPerHost>  Number of concurrent |
|                                                   connections-per-host |
| -d,--nodes <initial hosts>                        Required. Try to connect to these hosts |
|                                                   (comma separated) initially for ring |
|                                                   information |
| -f,--conf-path <path to config file>              cassandra.yaml file path for streaming |
|                                                   throughput and client/server SSL |
| -h,--help                                         Display this help message |
| -i,--ignore <NODES>                               Don't stream to this (comma separated) |
|                                                   list of nodes |
| -idct,--inter-dc-throttle <inter-dc-throttle>     Inter-datacenter throttle speed in Mbits |
|                                                   (default unlimited) |
| -k,--target-keyspace <target keyspace name>       Target keyspace name |
| -ks,--keystore <KEYSTORE>                         Client SSL: full path to keystore |
| -kspw,--keystore-password <KEYSTORE-PASSWORD>     Client SSL: password of the keystore |
| --no-progress                                     Don't display progress |
| -p,--port <native transport port>                 Port used for native connection |
|                                                   (default 9042) |
| -prtcl,--ssl-protocol <PROTOCOL>                  Client SSL: connections protocol to use |
|                                                   (default: TLS) |
| -pw,--password <password>                         Password for cassandra authentication |
| -sp,--storage-port <storage port>                 Port used for internode communication |
|                                                   (default 7000) |
| -spd,--server-port-discovery <allow server port discovery> |
|                                                   Use ports published by server to decide |
|                                                   how to connect. With SSL requires |
|                                                   StartTLS to be used. |
| -ssp,--ssl-storage-port <ssl storage port>        Port used for TLS internode communication |
|                                                   (default 7001) |
| -st,--store-type <STORE-TYPE>                     Client SSL: type of store |
| -t,--throttle <throttle>                          Throttle speed in Mbits (default unlimited) |
| -ts,--truststore <TRUSTSTORE>                     Client SSL: full path to truststore |
| -tspw,--truststore-password <TRUSTSTORE-PASSWORD> Client SSL: password of the truststore |
| -u,--username <username>                          Username for cassandra authentication |
| -v,--verbose                                      Verbose output |
| ---- |
| |
| The `cassandra.yaml` file can be provided on the command line with the `-f` option to set streaming throughput and client and server encryption |
| options. |
| Only `stream_throughput_outbound_megabits_per_sec`, `server_encryption_options` and `client_encryption_options` are read |
| from the `cassandra.yaml` file. |
| You can override options read from `cassandra.yaml` with corresponding command line options. |
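| |
| For example (the configuration path and throttle value are illustrative), SSL and throughput settings can be read from `cassandra.yaml` while the throttle is overridden on the command line: |
| |
| [source,bash] |
| ---- |
| $ sstableloader --nodes 10.0.2.238 -f /etc/cassandra/cassandra.yaml \ |
|     --throttle 300 /catalogkeyspace/magazine/ |
| ---- |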
| |
| === An sstableloader Demo |
| |
| This example shows how to use `sstableloader` to upload incremental backup data for the table `catalogkeyspace.magazine`. |
| In addition, a snapshot of the same table is bulk uploaded, also with `sstableloader`. |
| |
| The backups and snapshots for the `catalogkeyspace.magazine` table are listed as follows: |
| |
| [source,bash] |
| ---- |
| $ cd ./cassandra/data/data/catalogkeyspace/magazine-446eae30c22a11e9b1350d927649052c && ls -l |
| ---- |
| |
| results in |
| |
| [source,none] |
| ---- |
| total 0 |
| drwxrwxr-x. 2 ec2-user ec2-user 226 Aug 19 02:38 backups |
| drwxrwxr-x. 4 ec2-user ec2-user 40 Aug 19 02:45 snapshots |
| ---- |
| |
| The directory path of the SSTables to be uploaded with |
| `sstableloader` determines the target keyspace/table. |
| You could upload directly from the `backups` and `snapshots` |
| directories if their paths were in the format |
| expected by `sstableloader`. |
| However, the directory paths of the backup and snapshot SSTables are |
| `/catalogkeyspace/magazine-446eae30c22a11e9b1350d927649052c/backups` and |
| `/catalogkeyspace/magazine-446eae30c22a11e9b1350d927649052c/snapshots` |
| respectively, which cannot be used to upload SSTables to the |
| `catalogkeyspace.magazine` table. |
| The directory path structure must be `/catalogkeyspace/magazine/` for `sstableloader` to use it. |
| Create a new directory structure to upload SSTables with `sstableloader` |
| located at `/catalogkeyspace/magazine`, and set appropriate permissions: |
| |
| [source,bash] |
| ---- |
| $ sudo mkdir -p /catalogkeyspace/magazine |
| $ sudo chmod -R 777 /catalogkeyspace/magazine |
| ---- |
| |
| ==== Bulk Loading from an Incremental Backup |
| |
| An incremental backup does not include the DDL for a table; the table must already exist. |
| If the table was dropped, it can be re-created using the `schema.cql` file generated with every snapshot of a table. |
| The table does not need to be empty, but an empty table is used here, |
| as a CQL query indicates: |
| |
| [source,cql] |
| ---- |
| SELECT * FROM magazine; |
| ---- |
| results in |
| [source,cql] |
| ---- |
| id | name | publisher |
| ----+------+----------- |
| |
| (0 rows) |
| ---- |
| |
| After creating the table to upload to, copy the SSTable files from the `backups` directory to the `/catalogkeyspace/magazine/` directory. |
| |
| [source,bash] |
| ---- |
| $ sudo cp ./cassandra/data/data/catalogkeyspace/magazine-446eae30c22a11e9b1350d927649052c/backups/* \ |
| /catalogkeyspace/magazine/ |
| ---- |
| |
| Run the `sstableloader` to upload SSTables from the |
| `/catalogkeyspace/magazine/` directory. |
| |
| [source,bash] |
| ---- |
| $ sstableloader --nodes 10.0.2.238 /catalogkeyspace/magazine/ |
| ---- |
| |
| The output from the `sstableloader` command should be similar to this listing: |
| |
| |
| [source,none] |
| ---- |
| Opening SSTables and calculating sections to stream |
| Streaming relevant part of /catalogkeyspace/magazine/na-1-big-Data.db |
| /catalogkeyspace/magazine/na-2-big-Data.db to [35.173.233.153:7000, 10.0.2.238:7000, |
| 54.158.45.75:7000] |
| progress: [35.173.233.153:7000]0:1/2 88 % total: 88% 0.018KiB/s (avg: 0.018KiB/s) |
| progress: [35.173.233.153:7000]0:2/2 176% total: 176% 33.807KiB/s (avg: 0.036KiB/s) |
| progress: [35.173.233.153:7000]0:2/2 176% total: 176% 0.000KiB/s (avg: 0.029KiB/s) |
| progress: [35.173.233.153:7000]0:2/2 176% [10.0.2.238:7000]0:1/2 39 % total: 81% 0.115KiB/s |
| (avg: 0.024KiB/s) |
| progress: [35.173.233.153:7000]0:2/2 176% [10.0.2.238:7000]0:2/2 78 % total: 108% |
| 97.683KiB/s (avg: 0.033KiB/s) |
| progress: [35.173.233.153:7000]0:2/2 176% [10.0.2.238:7000]0:2/2 78 % |
| [54.158.45.75:7000]0:1/2 39 % total: 80% 0.233KiB/s (avg: 0.040KiB/s) |
| progress: [35.173.233.153:7000]0:2/2 176% [10.0.2.238:7000]0:2/2 78 % |
| [54.158.45.75:7000]0:2/2 78 % total: 96% 88.522KiB/s (avg: 0.049KiB/s) |
| progress: [35.173.233.153:7000]0:2/2 176% [10.0.2.238:7000]0:2/2 78 % |
| [54.158.45.75:7000]0:2/2 78 % total: 96% 0.000KiB/s (avg: 0.045KiB/s) |
| progress: [35.173.233.153:7000]0:2/2 176% [10.0.2.238:7000]0:2/2 78 % |
| [54.158.45.75:7000]0:2/2 78 % total: 96% 0.000KiB/s (avg: 0.044KiB/s) |
| ---- |
| |
| After the `sstableloader` has finished loading the data, run a query on the `magazine` table to check: |
| |
| [source,cql] |
| ---- |
| SELECT * FROM magazine; |
| ---- |
| results in |
| [source,cql] |
| ---- |
| id | name | publisher |
| ----+---------------------------+------------------ |
| 1 | Couchbase Magazine | Couchbase |
| 0 | Apache Cassandra Magazine | Apache Cassandra |
| |
| (2 rows) |
| ---- |
| |
| ==== Bulk Loading from a Snapshot |
| |
| Restoring a snapshot of a table to the same table is accomplished as follows. |
| |
| If the directory structure needed to load SSTables into `catalogkeyspace.magazine` does not exist, create the |
| directories and set appropriate permissions: |
| |
| [source,bash] |
| ---- |
| $ sudo mkdir -p /catalogkeyspace/magazine |
| $ sudo chmod -R 777 /catalogkeyspace/magazine |
| ---- |
| |
| Remove any files from the directory, so that the snapshot files can be copied without interference: |
| |
| [source,bash] |
| ---- |
| $ sudo rm /catalogkeyspace/magazine/* |
| $ cd /catalogkeyspace/magazine/ |
| $ ls -l |
| ---- |
| |
| results in |
| |
| [source,none] |
| ---- |
| total 0 |
| ---- |
| |
| Copy the snapshot files to the `/catalogkeyspace/magazine` directory. |
| |
| [source,bash] |
| ---- |
| $ sudo cp ./cassandra/data/data/catalogkeyspace/magazine-446eae30c22a11e9b1350d927649052c/snapshots/magazine/* \ |
| /catalogkeyspace/magazine |
| ---- |
| |
| List the files in the `/catalogkeyspace/magazine` directory. |
| The `schema.cql` will also be listed. |
| |
| [source,bash] |
| ---- |
| $ cd /catalogkeyspace/magazine && ls -l |
| ---- |
| |
| results in |
| |
| [source,none] |
| ---- |
| total 44 |
| -rw-r--r--. 1 root root 31 Aug 19 04:13 manifest.json |
| -rw-r--r--. 1 root root 47 Aug 19 04:13 na-1-big-CompressionInfo.db |
| -rw-r--r--. 1 root root 97 Aug 19 04:13 na-1-big-Data.db |
| -rw-r--r--. 1 root root 10 Aug 19 04:13 na-1-big-Digest.crc32 |
| -rw-r--r--. 1 root root 16 Aug 19 04:13 na-1-big-Filter.db |
| -rw-r--r--. 1 root root 16 Aug 19 04:13 na-1-big-Index.db |
| -rw-r--r--. 1 root root 4687 Aug 19 04:13 na-1-big-Statistics.db |
| -rw-r--r--. 1 root root 56 Aug 19 04:13 na-1-big-Summary.db |
| -rw-r--r--. 1 root root 92 Aug 19 04:13 na-1-big-TOC.txt |
| -rw-r--r--. 1 root root 815 Aug 19 04:13 schema.cql |
| ---- |
| |
| Alternatively, create symlinks to the snapshot folder instead of copying |
| the data: |
| |
| [source,bash] |
| ---- |
| $ mkdir <keyspace_name> |
| $ ln -s <path_to_snapshot_folder> <keyspace_name>/<table_name> |
| ---- |
| |
| If the `magazine` table was dropped, run the DDL in the `schema.cql` to |
| create the table. |
| Run the `sstableloader` with the following command: |
| |
| [source,bash] |
| ---- |
| $ sstableloader --nodes 10.0.2.238 /catalogkeyspace/magazine/ |
| ---- |
| |
| As the output from the command indicates, SSTables get streamed to the |
| cluster: |
| |
| [source,none] |
| ---- |
| Established connection to initial hosts |
| Opening SSTables and calculating sections to stream |
| Streaming relevant part of /catalogkeyspace/magazine/na-1-big-Data.db to |
| [35.173.233.153:7000, 10.0.2.238:7000, 54.158.45.75:7000] |
| progress: [35.173.233.153:7000]0:1/1 176% total: 176% 0.017KiB/s (avg: 0.017KiB/s) |
| progress: [35.173.233.153:7000]0:1/1 176% total: 176% 0.000KiB/s (avg: 0.014KiB/s) |
| progress: [35.173.233.153:7000]0:1/1 176% [10.0.2.238:7000]0:1/1 78 % total: 108% 0.115KiB/s |
| (avg: 0.017KiB/s) |
| progress: [35.173.233.153:7000]0:1/1 176% [10.0.2.238:7000]0:1/1 78 % |
| [54.158.45.75:7000]0:1/1 78 % total: 96% 0.232KiB/s (avg: 0.024KiB/s) |
| progress: [35.173.233.153:7000]0:1/1 176% [10.0.2.238:7000]0:1/1 78 % |
| [54.158.45.75:7000]0:1/1 78 % total: 96% 0.000KiB/s (avg: 0.022KiB/s) |
| progress: [35.173.233.153:7000]0:1/1 176% [10.0.2.238:7000]0:1/1 78 % |
| [54.158.45.75:7000]0:1/1 78 % total: 96% 0.000KiB/s (avg: 0.021KiB/s) |
| ---- |
| |
| Some other requirements of `sstableloader` to keep in mind are: |
| |
| * The SSTables loaded must be compatible with the Cassandra |
| version being loaded into. |
| * Repairing tables that have been loaded into a different cluster does |
| not repair the source tables. |
| * `sstableloader` makes use of port 7000 for internode communication. |
| * Before restoring incremental backups, run `nodetool flush` to back up |
| any data in memtables, as shown after this list. |
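| |
| For example, to flush the memtables of the demo table before restoring an incremental backup: |
| |
| [source,bash] |
| ---- |
| $ nodetool flush catalogkeyspace magazine |
| ---- |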
| |
| == Using nodetool import |
| |
| Importing SSTables into a table using the `nodetool import` command is recommended instead of the deprecated |
| `nodetool refresh` command. |
| The `nodetool import` command has an option to load new SSTables from a separate directory. |
| |
| The command usage is as follows: |
| |
| [source,none] |
| ---- |
| nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)] |
| [(-pp | --print-port)] [(-pw <password> | --password <password>)] |
| [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)] |
| [(-u <username> | --username <username>)] import |
| [(-c | --no-invalidate-caches)] [(-e | --extended-verify)] |
| [(-l | --keep-level)] [(-q | --quick)] [(-r | --keep-repaired)] |
| [(-t | --no-tokens)] [(-v | --no-verify)] [--] <keyspace> <table> |
| <directory> ... |
| ---- |
| |
| The `keyspace`, `table` name, and `directory` arguments are required. |
| |
| The following options are supported: |
| |
| [source,none] |
| ---- |
| -c, --no-invalidate-caches |
| Don't invalidate the row cache when importing |
| |
| -e, --extended-verify |
| Run an extended verify, verifying all values in the new SSTables |
| |
| -h <host>, --host <host> |
| Node hostname or ip address |
| |
| -l, --keep-level |
| Keep the level on the new SSTables |
| |
| -p <port>, --port <port> |
| Remote jmx agent port number |
| |
| -pp, --print-port |
| Operate in 4.0 mode with hosts disambiguated by port number |
| |
| -pw <password>, --password <password> |
| Remote jmx agent password |
| |
| -pwf <passwordFilePath>, --password-file <passwordFilePath> |
| Path to the JMX password file |
| |
| -q, --quick |
| Do a quick import without verifying SSTables, clearing row cache or |
| checking in which data directory to put the file |
| |
| -r, --keep-repaired |
| Keep any repaired information from the SSTables |
| |
| -t, --no-tokens |
| Don't verify that all tokens in the new SSTable are owned by the |
| current node |
| |
| -u <username>, --username <username> |
| Remote jmx agent username |
| |
| -v, --no-verify |
| Don't verify new SSTables |
| |
| -- |
| This option can be used to separate command-line options from the |
| list of arguments (useful when arguments might be mistaken for |
| command-line options) |
| ---- |
| |
| Because the keyspace and table are specified on the command line for |
| `nodetool import`, there is not the same requirement as with |
| `sstableloader` to have the SSTables in a specific directory path. |
| When importing snapshots or incremental backups with |
| `nodetool import`, the SSTables don't need to be copied to another |
| directory. |
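| |
| As a sketch, an import that also runs extended verification, using the backups directory from the example in the next subsection: |
| |
| [source,bash] |
| ---- |
| $ nodetool import --extended-verify -- cqlkeyspace t \ |
|     ./cassandra/data/data/cqlkeyspace/t-d132e240c21711e9bbee19821dcea330/backups |
| ---- |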
| |
| === Importing Data from an Incremental Backup |
| |
| This example uses `nodetool import` to import SSTables from an incremental backup |
| to restore a table. First, drop the table: |
| |
| [source,cql] |
| ---- |
| DROP TABLE t; |
| ---- |
| |
| An incremental backup for a table does not include the schema definition for the table. |
| If the schema definition is not kept as a separate |
| backup, the `schema.cql` from a snapshot of the table may be used to |
| create the table as follows: |
| |
| [source,cql] |
| ---- |
| CREATE TABLE IF NOT EXISTS cqlkeyspace.t ( |
| id int PRIMARY KEY, |
| k int, |
| v text) |
| WITH ID = d132e240-c217-11e9-bbee-19821dcea330 |
| AND bloom_filter_fp_chance = 0.01 |
| AND crc_check_chance = 1.0 |
| AND default_time_to_live = 0 |
| AND gc_grace_seconds = 864000 |
| AND min_index_interval = 128 |
| AND max_index_interval = 2048 |
| AND memtable_flush_period_in_ms = 0 |
| AND speculative_retry = '99p' |
| AND additional_write_policy = '99p' |
| AND comment = '' |
| AND caching = { 'keys': 'ALL', 'rows_per_partition': 'NONE' } |
| AND compaction = { 'max_threshold': '32', 'min_threshold': '4', |
| 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' } |
| AND compression = { 'chunk_length_in_kb': '16', 'class': |
| 'org.apache.cassandra.io.compress.LZ4Compressor' } |
| AND cdc = false |
| AND extensions = { } |
| ; |
| ---- |
| |
| Initially the table may be empty, but it does not have to be: |
| |
| [source,cql] |
| ---- |
| SELECT * FROM t; |
| ---- |
| [source,cql] |
| ---- |
| id | k | v |
| ----+---+--- |
| |
| (0 rows) |
| ---- |
| |
| Run the `nodetool import` command, providing the keyspace, table and |
| the backups directory. |
| Unlike with `sstableloader`, the table backups do not need to be copied to another directory first. |
| |
| [source,bash] |
| ---- |
| $ nodetool import -- cqlkeyspace t \ |
| ./cassandra/data/data/cqlkeyspace/t-d132e240c21711e9bbee19821dcea330/backups |
| ---- |
| |
| The SSTables are imported into the table. Run a query in cqlsh to check: |
| |
| [source,cql] |
| ---- |
| SELECT * FROM t; |
| ---- |
| [source,cql] |
| ---- |
| id | k | v |
| ----+---+------ |
| 1 | 1 | val1 |
| 0 | 0 | val0 |
| |
| (2 rows) |
| ---- |
| |
| === Importing Data from a Snapshot |
| |
| Importing SSTables from a snapshot with the `nodetool import` command is |
| similar to importing SSTables from an incremental backup. |
| Shown here is an import of a snapshot for table `catalogkeyspace.journal`, after |
| dropping the table to demonstrate the restore. |
| |
| [source,cql] |
| ---- |
| USE catalogkeyspace; |
| DROP TABLE journal; |
| ---- |
| |
| Use the `catalog-ks` snapshot for the `journal` table. |
| Check the files in the snapshot, and note the existence of the `schema.cql` file. |
| |
| [source,bash] |
| ---- |
| $ cd ./cassandra/data/data/catalogkeyspace/journal-296a2d30c22a11e9b1350d927649052c/snapshots/catalog-ks && ls -l |
| ---- |
| [source,none] |
| ---- |
| total 44 |
| -rw-rw-r--. 1 ec2-user ec2-user 31 Aug 19 02:44 manifest.json |
| -rw-rw-r--. 3 ec2-user ec2-user 47 Aug 19 02:38 na-1-big-CompressionInfo.db |
| -rw-rw-r--. 3 ec2-user ec2-user 97 Aug 19 02:38 na-1-big-Data.db |
| -rw-rw-r--. 3 ec2-user ec2-user 10 Aug 19 02:38 na-1-big-Digest.crc32 |
| -rw-rw-r--. 3 ec2-user ec2-user 16 Aug 19 02:38 na-1-big-Filter.db |
| -rw-rw-r--. 3 ec2-user ec2-user 16 Aug 19 02:38 na-1-big-Index.db |
| -rw-rw-r--. 3 ec2-user ec2-user 4687 Aug 19 02:38 na-1-big-Statistics.db |
| -rw-rw-r--. 3 ec2-user ec2-user 56 Aug 19 02:38 na-1-big-Summary.db |
| -rw-rw-r--. 3 ec2-user ec2-user 92 Aug 19 02:38 na-1-big-TOC.txt |
| -rw-rw-r--. 1 ec2-user ec2-user 814 Aug 19 02:44 schema.cql |
| ---- |
| |
| Copy the DDL from the `schema.cql` and run it in cqlsh to create the |
| `catalogkeyspace.journal` table: |
| |
| [source,cql] |
| ---- |
| CREATE TABLE IF NOT EXISTS catalogkeyspace.journal ( |
| id int PRIMARY KEY, |
| name text, |
| publisher text) |
| WITH ID = 296a2d30-c22a-11e9-b135-0d927649052c |
| AND bloom_filter_fp_chance = 0.01 |
| AND crc_check_chance = 1.0 |
| AND default_time_to_live = 0 |
| AND gc_grace_seconds = 864000 |
| AND min_index_interval = 128 |
| AND max_index_interval = 2048 |
| AND memtable_flush_period_in_ms = 0 |
| AND speculative_retry = '99p' |
| AND additional_write_policy = '99p' |
| AND comment = '' |
| AND caching = { 'keys': 'ALL', 'rows_per_partition': 'NONE' } |
| AND compaction = { 'min_threshold': '4', 'max_threshold': |
| '32', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' } |
| AND compression = { 'chunk_length_in_kb': '16', 'class': |
| 'org.apache.cassandra.io.compress.LZ4Compressor' } |
| AND cdc = false |
| AND extensions = { } |
| ; |
| ---- |
| |
| Run the `nodetool import` command to import the SSTables for the |
| snapshot: |
| |
| [source,bash] |
| ---- |
| $ nodetool import -- catalogkeyspace journal \ |
|     ./cassandra/data/data/catalogkeyspace/journal-296a2d30c22a11e9b1350d927649052c/snapshots/catalog-ks/ |
| ---- |
| |
| Subsequently, run a CQL query on the `journal` table to check the imported data: |
| |
| [source,cql] |
| ---- |
| SELECT * FROM journal; |
| ---- |
| [source,cql] |
| ---- |
| id | name | publisher |
| ----+---------------------------+------------------ |
| 1 | Couchbase Magazine | Couchbase |
| 0 | Apache Cassandra Magazine | Apache Cassandra |
| |
| (2 rows) |
| ---- |
| |
| == Bulk Loading External Data |
| |
| Bulk loading external data directly is not supported by either of the tools |
| discussed above, `sstableloader` and `nodetool import`; |
| both require the data to be in the form of SSTables. |
| Apache Cassandra provides a Java API for generating SSTables from input data: the |
| `org.apache.cassandra.io.sstable.CQLSSTableWriter` Java class. |
| Subsequently, either `sstableloader` or `nodetool import` is used to bulk load the SSTables. |
| |
| === Generating SSTables with CQLSSTableWriter Java API |
| |
| To generate SSTables using the `CQLSSTableWriter` class, the following are required: |
| |
| * An output directory to generate the SSTable in |
| * The schema for the SSTable |
| * A prepared statement for the `INSERT` |
| * A partitioner |
| |
| The output directory must exist before starting. Create a directory |
| (`/sstables` as an example) and set appropriate permissions. |
| |
| [source,bash] |
| ---- |
| $ sudo mkdir /sstables |
| $ sudo chmod 777 -R /sstables |
| ---- |
| |
| To use `CQLSSTableWriter` in a Java application, create a Java constant for the output directory. |
| |
| [source,java] |
| ---- |
| public static final String OUTPUT_DIR = "./sstables"; |
| ---- |
| |
| The `CQLSSTableWriter` Java API can create a user-defined type. Create a new type to store `int` data: |
| |
| [source,java] |
| ---- |
| String type = "CREATE TYPE CQLKeyspace.intType (a int, b int)"; |
| // Define a String variable for the SSTable schema. |
| String schema = "CREATE TABLE CQLKeyspace.t (" |
|                 + "  id int PRIMARY KEY," |
|                 + "  k int," |
|                 + "  v1 text," |
|                 + "  v2 intType" |
|                 + ")"; |
| ---- |
| |
| Define a `String` variable for the prepared statement to use: |
| |
| [source,java] |
| ---- |
| String insertStmt = "INSERT INTO CQLKeyspace.t (id, k, v1, v2) VALUES (?, ?, ?, ?)"; |
| ---- |
| |
| The partitioner only needs to be set if the cluster does not use the default partitioner, `Murmur3Partitioner`. |
| |
| All these variables or settings are used by the builder class |
| `CQLSSTableWriter.Builder` to create a `CQLSSTableWriter` object. |
| |
| Create a `File` object for the output directory. |
| |
| [source,java] |
| ---- |
| File outputDir = new File(OUTPUT_DIR + File.separator + "CQLKeyspace" + File.separator + "t"); |
| ---- |
| |
| Obtain a `CQLSSTableWriter.Builder` object using the `static` method `CQLSSTableWriter.builder()`. |
| Set the following items: |
| |
| * output directory `File` object |
| * user-defined type |
| * SSTable schema |
| * buffer size |
| * prepared statement |
| * optionally any of the other builder options |
| |
| and invoke the `build()` method to create a `CQLSSTableWriter` object: |
| |
| [source,java] |
| ---- |
| CQLSSTableWriter writer = CQLSSTableWriter.builder() |
| .inDirectory(outputDir) |
| .withType(type) |
| .forTable(schema) |
| .withBufferSizeInMB(256) |
| .using(insertStmt).build(); |
| ---- |
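| |
| If the target cluster did not use the default `Murmur3Partitioner`, the partitioner would also be set on the builder; a minimal sketch, assuming a cluster on `ByteOrderedPartitioner`: |
| |
| [source,java] |
| ---- |
| import org.apache.cassandra.dht.ByteOrderedPartitioner; |
| |
| CQLSSTableWriter writer = CQLSSTableWriter.builder() |
|                                           .inDirectory(outputDir) |
|                                           .withType(type) |
|                                           .forTable(schema) |
|                                           .withBufferSizeInMB(256) |
|                                           // Only needed when the cluster's partitioner |
|                                           // is not the default Murmur3Partitioner. |
|                                           .withPartitioner(ByteOrderedPartitioner.instance) |
|                                           .using(insertStmt).build(); |
| ---- |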
| |
| Set the SSTable data. If any user-defined types are used, obtain a |
| `UserType` object for each type: |
| |
| [source,java] |
| ---- |
| UserType userType = writer.getUDType("intType"); |
| ---- |
| |
| Add data rows for the resulting SSTable: |
| |
| [source,java] |
| ---- |
| writer.addRow(0, 0, "val0", userType.newValue().setInt("a", 0).setInt("b", 0)); |
| writer.addRow(1, 1, "val1", userType.newValue().setInt("a", 1).setInt("b", 1)); |
| writer.addRow(2, 2, "val2", userType.newValue().setInt("a", 2).setInt("b", 2)); |
| ---- |
| |
| Close the writer, finalizing the SSTable: |
| |
| [source,java] |
| ---- |
| writer.close(); |
| ---- |
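| |
| Putting the pieces together, a minimal sketch of a complete generator program follows. The class name is illustrative, and the `UserType` import matches the driver classes bundled with Cassandra 4.0 (later versions relocate these types): |
| |
| [source,java] |
| ---- |
| import java.io.File; |
| import java.io.IOException; |
| |
| import org.apache.cassandra.io.sstable.CQLSSTableWriter; |
| import com.datastax.driver.core.UserType; |
| |
| public class SSTableGenerator |
| { |
|     public static final String OUTPUT_DIR = "./sstables"; |
| |
|     public static void main(String[] args) throws IOException |
|     { |
|         String type = "CREATE TYPE CQLKeyspace.intType (a int, b int)"; |
|         String schema = "CREATE TABLE CQLKeyspace.t (" |
|                         + "  id int PRIMARY KEY," |
|                         + "  k int," |
|                         + "  v1 text," |
|                         + "  v2 intType" |
|                         + ")"; |
|         String insertStmt = "INSERT INTO CQLKeyspace.t (id, k, v1, v2) VALUES (?, ?, ?, ?)"; |
| |
|         // The keyspace/table directory layout expected by sstableloader. |
|         File outputDir = new File(OUTPUT_DIR + File.separator + "CQLKeyspace" + File.separator + "t"); |
|         outputDir.mkdirs(); // the output directory must exist and be writable |
| |
|         CQLSSTableWriter writer = CQLSSTableWriter.builder() |
|                                                   .inDirectory(outputDir) |
|                                                   .withType(type) |
|                                                   .forTable(schema) |
|                                                   .withBufferSizeInMB(256) |
|                                                   .using(insertStmt) |
|                                                   .build(); |
| |
|         // Obtain the UDT from the writer and add rows to the SSTable. |
|         UserType userType = writer.getUDType("intType"); |
|         writer.addRow(0, 0, "val0", userType.newValue().setInt("a", 0).setInt("b", 0)); |
|         writer.addRow(1, 1, "val1", userType.newValue().setInt("a", 1).setInt("b", 1)); |
|         writer.addRow(2, 2, "val2", userType.newValue().setInt("a", 2).setInt("b", 2)); |
| |
|         writer.close(); // finalize the SSTable on disk |
|     } |
| } |
| ---- |
| |
| The SSTables generated under `./sstables/CQLKeyspace/t` can then be bulk loaded, for example with `sstableloader --nodes 10.0.2.238 ./sstables/CQLKeyspace/t`. |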
| |
| Other public methods the `CQLSSTableWriter` class provides are: |
| |
| [cols=",",options="header",] |
| |=== |
| |Method |Description |
| |
| |addRow(java.util.List<java.lang.Object> values) |Adds a new row to the |
| writer. Returns a CQLSSTableWriter object. Each provided value type |
| should correspond to the type of the CQL column the value is for. The |
| correspondence between Java type and CQL type is the same as the one |
| documented at |
| www.datastax.com/drivers/java/2.0/apidocs/com/datastax/driver/core/DataType.Name.html#asJavaClass(). |
| |
| |addRow(java.util.Map<java.lang.String,java.lang.Object> values) |Adds a |
| new row to the writer. Returns a CQLSSTableWriter object. This is |
| equivalent to the other addRow methods, but takes a map whose keys are |
| the names of the columns to add instead of taking a list of the values |
| in the order of the insert statement used during construction of this |
| SSTable writer. The column names in the map keys must be in lowercase |
| unless the declared column name is a case-sensitive quoted identifier, in |
| which case the map key must use the exact case of the column. The values |
| parameter is a map of column name to column values representing the new |
| row to add. If a column is not included in the map, its value will be |
| null. If the map contains keys that do not correspond to one of the |
| columns of the insert statement used when creating this SSTable writer, |
| the corresponding value is ignored. |
| |
| |addRow(java.lang.Object... values) |Adds a new row to the writer. |
| Returns a CQLSSTableWriter object. |
| |
| |CQLSSTableWriter.builder() |Returns a new builder for a |
| CQLSSTableWriter. |
| |
| |close() |Closes the writer. |
| |
| |rawAddRow(java.nio.ByteBuffer... values) |Adds a new row to the writer |
| given already serialized binary values. Returns a CQLSSTableWriter |
| object. The row values must correspond to the bind variables of the |
| insertion statement used when creating this SSTable writer. |
| |
| |rawAddRow(java.util.List<java.nio.ByteBuffer> values) |Adds a new row |
| to the writer given already serialized binary values. Returns a |
| CQLSSTableWriter object. The row values must correspond to the bind |
| variables of the insertion statement used when creating this SSTable |
| writer. |
| |
| |rawAddRow(java.util.Map<java.lang.String, java.nio.ByteBuffer> values) |
| |Adds a new row to the writer given already serialized binary values. |
| Returns a CQLSSTableWriter object. The row values must correspond to the |
| bind variables of the insertion statement used when creating this |
| SSTable writer. |
| |
| |getUDType(String dataType) |Returns the User Defined type used in this |
| SSTable Writer that can be used to create UDTValue instances. |
| |=== |
| |
| Other public methods the `CQLSSTableWriter.Builder` class provides are: |
| |
| [cols=",",options="header",] |
| |=== |
| |Method |Description |
| |inDirectory(String directory) |The directory in which to write the |
| SSTables. This is a mandatory option. The directory to use should |
| already exist and be writable. |
| |
| |inDirectory(File directory) |The directory in which to write the SSTables. |
| This is a mandatory option. The directory to use should already exist |
| and be writable. |
| |
| |forTable(String schema) |The schema (CREATE TABLE statement) for the |
| table for which the SSTable is to be created. The provided CREATE TABLE |
| statement must use a fully-qualified table name, one that includes the |
| keyspace name. This is a mandatory option. |
| |
| |withPartitioner(IPartitioner partitioner) |The partitioner to use. By |
| default, Murmur3Partitioner will be used. If this is not the partitioner |
| used by the cluster for which the SSTables are created, the correct |
| partitioner needs to be provided. |
| |
| |using(String insert) |The INSERT or UPDATE statement defining the order |
| of the values to add for a given CQL row. The provided INSERT statement |
| must use a fully-qualified table name, one that includes the keyspace |
| name. Moreover, said statement must use bind variables since these |
| variables will be bound to values by the resulting SSTable writer. This |
| is a mandatory option. |
| |
| |withBufferSizeInMB(int size) |The size of the buffer to use. This |
| defines how much data will be buffered before being written as a new |
| SSTable. This corresponds roughly to the data size the created SSTable |
| will have. The default is 128MB, which should be reasonable for a |
| 1GB heap. If an OutOfMemory exception is thrown while using the |
| SSTable writer, lower this value. |
| |
| |sorted() |Creates a CQLSSTableWriter that expects sorted inputs. If |
| this option is used, the resulting SSTable writer will expect rows to be |
| added in SSTable sorted order (and an exception will be thrown if that |
| is not the case during row insertion). The SSTable sorted order means |
| that rows are added such that their partition keys respect the |
| partitioner order. This option should only be used if the rows can be |
| provided in order, which is rarely the case. If the rows can be provided |
| in order, however, using this sorted mode might be more efficient. If this |
| option is used, some options, like withBufferSizeInMB, will be ignored. |
| |
| |build() |Builds a CQLSSTableWriter object. |
| |=== |