Spot ingest for the Open Data Model uses Envelope to enable a configuration-driven Spark Streaming ingest application.
From this directory (spot-ingest/odm):
git clone https://github.com/cloudera-labs/envelope.git
cd envelope/
wget https://raw.githubusercontent.com/curtishoward/incubator-spot/SPOT-181_files/spot-ingest/odm/workers/envelope_mods/0001-ENV-252-Add-Hive-output-option-to-align-step-schema-.patch
wget https://raw.githubusercontent.com/curtishoward/incubator-spot/SPOT-181_files/spot-ingest/odm/workers/envelope_mods/0001-ENV-256-Add-an-option-to-the-delimited-translator-to.patch
wget https://raw.githubusercontent.com/curtishoward/incubator-spot/SPOT-181_files/spot-ingest/odm/workers/envelope_mods/0001-ENV-258-Fix-delimited-translator-to-handle-missing-f.patch
patch -p1 < 0001-ENV-258-Fix-delimited-translator-to-handle-missing-f.patch
patch -p1 < 0001-ENV-256-Add-an-option-to-the-delimited-translator-to.patch
patch -p1 < 0001-ENV-252-Add-Hive-output-option-to-align-step-schema-.patch
mvn clean && mvn -DskipTests package
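If the build succeeds, the patched Envelope jar should be present under target/ (the exact artifact name and version are assumptions and depend on the Envelope release being built):

# still inside the envelope/ directory from the steps above
ls target/envelope-*.jar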
Required Roles
The following roles are required on all nodes where the Ingest Workers will be running:
Ingest Configuration
kafka-topics --zookeeper zookeeper-host:2181 --create --topic spot_dns --replication-factor 3 --partitions 4
kafka-topics --zookeeper zookeeper-host:2181 --create --topic spot_flow --replication-factor 3 --partitions 4
kafka-topics --zookeeper zookeeper-host:2181 --create --topic spot_proxy --replication-factor 3 --partitions 4
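To confirm that each topic was created with the expected partition and replica counts (shown here for the DNS topic):

kafka-topics --zookeeper zookeeper-host:2181 --describe --topic spot_dns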
Update the broker and topic parameters in the workers/spot_*.conf Envelope configuration files.
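A quick way to locate these settings before editing (a sketch; the exact parameter names inside each worker config may vary slightly):

grep -n -E 'broker|topic' workers/spot_*.conf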
Starting the Ingest
Start the Spark Streaming application defined by the Envelope configuration (Spark driver logs will be in the working directory):
bash start_ingest.sh [dns|flow|proxy]
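For example, to start the flow pipeline and follow the driver output (the log file name is an assumption; as noted above, the script writes driver logs to the working directory):

bash start_ingest.sh flow
# the exact log file name produced by the script is an assumption
tail -f ./*.log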
Collector Examples
While the collector can be any application that acts as a Kafka producer for the source topic, examples are provided that use nfdump, tshark, and unzip (for flow, DNS, and proxy data, respectively) to dissect files and then forward records to the relevant Kafka topics using Flume.
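For reference, the kind of dissection these tools perform can be sketched as follows (file names and field selections are illustrative, not the exact commands used by the collector scripts):

# dissect DNS records from a pcap into comma-delimited fields
tshark -r sample.pcap -Y dns -T fields -E separator=, \
    -e frame.time_epoch -e ip.src -e ip.dst -e dns.qry.name -e dns.qry.type
# dump netflow records from an nfcapd capture file as CSV
nfdump -r nfcapd.sample -o csv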
The following are required on all (Edge) nodes where the collector examples will be running:
To run the collector example (from the spot-ingest/collectors directory):
Update the brokerList parameter in each spot_flume_*.conf Flume configuration file, then start the collector script, which watches for files placed in the <source_type>/new directory:
bash process_files.sh [dns|flow|proxy]
For example, once process_files.sh has been started for DNS:
mv sample.pcap dns/new
Parsed records will then be available in the spot.event Hive table.
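To verify ingest end to end, query the table once records begin arriving (the HiveServer2 host and port below are assumptions for your environment):

beeline -u jdbc:hive2://hiveserver-host:10000 -e 'SELECT COUNT(*) FROM spot.event;'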