OA Data Migration to Spot 1.0

This document is intended for developers and sysadmins who want to migrate their existing OA data to Spot 1.0. In previous Spot releases, OA data was stored in CSV files at a location on the OA server (specified in spot.conf during the original installation). In Spot 1.0, OA data is stored in Impala tables. These scripts migrate each use case (flow, proxy and dns) independently from those CSV files into the new Impala tables.

This migration is optional. It is only needed by users who want to keep using the OA data (scores, edges, chords, dendros, ingest summaries, storyboards, threat investigations and timelines) generated by the previous Spot version.

Requirements

  • You must first run the new Spot setup installation so that the new tables are created in Impala.
  • You must log in to the OA server and run these scripts from there.
  • You must run these scripts from the spot-setup/migration folder in the new Spot 1.0 location.

CSV to Impala Tables Mapping

Flow

CSV File                                Impala Table
flow_scores.csv                         flow_scores
chord-*.tsv                             flow_chords
edge-*.tsv                              flow_edge
is_*.csv                                flow_ingest_summary
threats.csv                             flow_storyboard
flow_scores.csv (only scored values)    flow_threat_investigation
sbdet-*.tsv                             flow_timeline

DNS

CSV File                                Impala Table
dns_scores.csv                          dns_scores
edge-*.csv                              dns_edge
dendro-*.csv                            dns_dendro
threat-dendro-*.csv                     dns_threat_dendro
is_*.csv                                dns_ingest_summary
threats.csv                             dns_storyboard
dns_scores.csv (only scored values)     dns_threat_investigation

Proxy

CSV File                                Impala Table
edge-*.csv                              proxy_edge
is_*.csv                                proxy_ingest_summary
proxy_scores.csv                        proxy_scores
threats.csv                             proxy_storyboard
proxy_scores.csv (only scored values)   proxy_threat_investigation
timeline-*.csv                          proxy_timeline
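
The mappings above can be expressed as a small lookup, which may help when checking which old files feed which tables. The following is only a sketch and not the migration script's own code: the names CSV_TO_TABLES and target_tables are hypothetical, and the DNS scores file is assumed to be dns_scores.csv, matching the dns_threat_investigation row.

```python
from fnmatch import fnmatchcase

# File-name patterns per pipeline, transcribed from the mapping tables.
# Assumption: the DNS scores file is named dns_scores.csv.
# The scores file of each pipeline feeds two tables (scores and
# threat investigation), so it appears twice.
CSV_TO_TABLES = {
    "flow": [
        ("flow_scores.csv", "flow_scores"),
        ("chord-*.tsv", "flow_chords"),
        ("edge-*.tsv", "flow_edge"),
        ("is_*.csv", "flow_ingest_summary"),
        ("threats.csv", "flow_storyboard"),
        ("flow_scores.csv", "flow_threat_investigation"),
        ("sbdet-*.tsv", "flow_timeline"),
    ],
    "dns": [
        ("dns_scores.csv", "dns_scores"),
        ("edge-*.csv", "dns_edge"),
        ("dendro-*.csv", "dns_dendro"),
        ("threat-dendro-*.csv", "dns_threat_dendro"),
        ("is_*.csv", "dns_ingest_summary"),
        ("threats.csv", "dns_storyboard"),
        ("dns_scores.csv", "dns_threat_investigation"),
    ],
    "proxy": [
        ("edge-*.csv", "proxy_edge"),
        ("is_*.csv", "proxy_ingest_summary"),
        ("proxy_scores.csv", "proxy_scores"),
        ("threats.csv", "proxy_storyboard"),
        ("proxy_scores.csv", "proxy_threat_investigation"),
        ("timeline-*.csv", "proxy_timeline"),
    ],
}


def target_tables(pipeline, filename):
    """Return the Impala tables that an old OA file maps to (hypothetical helper)."""
    return [
        table
        for pattern, table in CSV_TO_TABLES[pipeline]
        if fnmatchcase(filename, pattern)
    ]


print(target_tables("flow", "flow_scores.csv"))
```

A file that matches no pattern yields an empty list, which is one way to spot files the migration would not pick up.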

Data Flow

A single launcher script migrates all the specified pipelines. For each pipeline, it reads the CSV files from their existing location and imports the data into Impala: it first creates a staging database and staging tables, loads the CSV records into them, and then inserts that data into the new Spot 1.0 tables. You must run the migration from the server where the old Spot release is located. You may migrate a single pipeline or all of them (flow, dns and proxy), depending on your needs and your existing data. At the end of each script, the old pipeline data folder is moved from its original location to a backup folder, and the staging tables and their respective HDFS paths are removed.
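
The order of operations described above can be sketched as a simple plan. This is an illustration only: migration_plan is a hypothetical helper, the step wording is ours, and whether the staging database is created once or per pipeline is an assumption rather than something this document specifies.

```python
def migration_plan(pipelines, staging_db, new_db):
    """Return the ordered (pipeline, action) steps for the given pipelines.

    Hypothetical helper: it only describes the sequence, it runs nothing.
    """
    plan = []
    for pipeline in pipelines:
        plan.extend([
            (pipeline, f"create staging database {staging_db} (if missing) and its tables"),
            (pipeline, f"load old CSV files into {staging_db}"),
            (pipeline, f"insert staged records into {new_db}"),
            (pipeline, "move old data folder to backup"),
            (pipeline, "remove staging tables and their HDFS paths"),
        ])
    return plan


for pipeline, action in migration_plan(["flow", "dns"], "spot_migration", "migrated"):
    print(f"{pipeline}: {action}")
```

Each pipeline runs through the same five steps in turn, so a failure in one pipeline can be located by the last step it completed.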

Execution

./migrate_to_spot_1_0.py PIPELINES OLD_OA_PATH STAGING_DB_NAME STAGING_DB_HDFS_PATH NEW_SPOT_IMPALA_DB IMPALA_DAEMON

where the arguments are:

  • PIPELINES - Comma-separated list of the pipelines to be migrated
  • OLD_OA_PATH - Path to the old Spot-OA release directory in the local filesystem
  • STAGING_DB_NAME - Name of the staging database to be created to temporarily store these records
  • STAGING_DB_HDFS_PATH - HDFS path of the staging database to be created to temporarily store these records
  • NEW_SPOT_IMPALA_DB - Database name of the Spot 1.0 Impala tables. Use the same database name that was set in spot.conf when the new Spot release was installed
  • IMPALA_DAEMON - Impala daemon to be used to run the scripts' queries
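
As an illustration of how the six positional arguments fit together, here is a minimal sketch using Python's argparse. It is not the parsing code of migrate_to_spot_1_0.py; the names build_parser and parse_pipelines are hypothetical, and the validation against flow, dns and proxy is an assumption based on the pipelines named in this document.

```python
import argparse

VALID_PIPELINES = ("flow", "dns", "proxy")


def parse_pipelines(value):
    """Split a comma-separated PIPELINES value and reject unknown names."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in VALID_PIPELINES]
    if not names or unknown:
        raise argparse.ArgumentTypeError(
            f"expected a comma-separated subset of {VALID_PIPELINES}, got {value!r}"
        )
    return names


def build_parser():
    """Build a parser mirroring the documented argument order (sketch only)."""
    parser = argparse.ArgumentParser(prog="migrate_to_spot_1_0.py")
    parser.add_argument("pipelines", type=parse_pipelines, metavar="PIPELINES")
    parser.add_argument("old_oa_path", metavar="OLD_OA_PATH")
    parser.add_argument("staging_db_name", metavar="STAGING_DB_NAME")
    parser.add_argument("staging_db_hdfs_path", metavar="STAGING_DB_HDFS_PATH")
    parser.add_argument("new_spot_impala_db", metavar="NEW_SPOT_IMPALA_DB")
    parser.add_argument("impala_daemon", metavar="IMPALA_DAEMON")
    return parser


# Values taken from the example invocation in this document.
example_args = build_parser().parse_args([
    "flow,dns,proxy",
    "/home/spotuser/incubator-spot_old/spot-oa",
    "spot_migration",
    "/user/spotuser/spot_migration/",
    "migrated",
    "node01",
])
print(example_args.pipelines, example_args.impala_daemon)
```

Validating PIPELINES up front means a typo such as 'flows' is reported before any staging database is created.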

Example:

./migrate_to_spot_1_0.py 'flow,dns,proxy' '/home/spotuser/incubator-spot_old/spot-oa' 'spot_migration' '/user/spotuser/spot_migration/' 'migrated' 'node01'