A forensic hash, such as TLSH, is a useful tool in cybersecurity. In short, the notion is that semantically similar documents should hash to values which are also similar. Contrast this with standard cryptographic hashes, such as the SHA and MD families, where small deviations in the input data yield large deviations in the hashes.
The traditional use-case is to hash input documents or binaries and compare against a known blacklist of malicious hashes. A sufficiently similar hash indicates a match. This prevents malicious parties from evading detection by fuzzing their input data.
While this is interesting, it still requires metric-space searches in a blacklist. I envisioned a slightly more interesting streaming use-case: on-the-fly clustering of data. While TLSH hashes of similar documents do not necessarily come out as precisely the same value, more traditional, non-forensic locality-sensitive hashes (LSH) do collide when their inputs are sufficiently similar. Namely, applying the Hamming distance LSH to the TLSH hash gives us a way to bin semantic hashes such that hashes that are similar (by Hamming distance) land in the same bin.
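To make the binning idea concrete, here is a minimal Python sketch of a Hamming-distance LSH by bit sampling: a fixed set of randomly chosen bit positions is read from each digest, and the sampled bits form the bin. This is an illustration, not Metron's implementation; the function names and the 64-bit digest are made up for the example and are not real TLSH output.

```python
import random


def sample_positions(total_bits, num_samples, seed=0):
    """Pick the bit positions to sample; a fixed seed keeps them identical for every digest."""
    rng = random.Random(seed)
    return [rng.randrange(total_bits) for _ in range(num_samples)]


def to_bits(hex_digest):
    """Expand a hex digest into a string of '0'/'1' characters, keeping leading zeros."""
    return bin(int(hex_digest, 16))[2:].zfill(len(hex_digest) * 4)


def similarity_bin(hex_digest, positions):
    """The bin is simply the digest's bits at the sampled positions."""
    bits = to_bits(hex_digest)
    return "".join(bits[p] for p in positions)


def flip_bit(hex_digest, index):
    """Return a copy of the digest with one bit flipped (simulates a near-duplicate)."""
    bits = list(to_bits(hex_digest))
    bits[index] = "1" if bits[index] == "0" else "0"
    return format(int("".join(bits), 2), "0{}x".format(len(hex_digest)))


if __name__ == "__main__":
    digest = "a1b2c3d4e5f60718"  # hypothetical 64-bit digest, not real TLSH output
    positions = sample_positions(64, 16)
    unsampled = next(i for i in range(64) if i not in positions)
    near = flip_bit(digest, unsampled)
    # The two digests differ by one bit, yet they fall into the same bin.
    print(similarity_bin(digest, positions) == similarity_bin(near, positions))
```

Two digests collide whenever they agree on every sampled position, so the fewer bits that differ, the more likely they share a bin.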
Inspired by a good talk by Andy LoPresto and Andre Fucs de Miranda from Apache NiFi, we will take logs from the Cowrie honeypot and compute TLSH hashes and semantic bins so that users can easily find activity in the logs that resembles known threats.
Consider the following excerpts from the Cowrie logs the authors above have shared:
{ "eventid": "cowrie.command.success" , "timestamp": "2017-09-18T11:45:25.028091Z" , "message": "Command found: /bin/busybox LSUCT" , "system": "CowrieTelnetTransport,787,121.237.129.163" , "isError": 0 , "src_ip": "121.237.129.163" , "session": "21caf72c6358" , "input": "/bin/busybox LSUCT" , "sensor": "a927e8b28666" }
and
{ "eventid": "cowrie.command.success" , "timestamp": "2017-09-17T04:06:39.673206Z" , "message": "Command found: /bin/busybox XUSRH" , "system": "CowrieTelnetTransport,93,94.51.110.74" , "isError": 0 , "src_ip": "94.51.110.74" , "session": "4c047bbc016c" , "input": "/bin/busybox XUSRH" , "sensor": "a927e8b28666" }
You will note the /bin/busybox call followed by a random string.
Excerpting from an analysis of an IoT exploit:
The use of the command "busybox ECCHI" appears to have two functions. First of all, cowrie, and more "complete" Linux distrubtions then commonly found on DVRs will respond with a help screen if a wrong module is used. So this way, "ECCHI" can be used to detect honeypots and irrelevant systems if the reply isn't simply "ECCHI: applet not found". Secondly, the command is used as a market to indicate that the prior command finished. Later, the attacker adds "/bin/busybox ECCHI" at the end of each line, following the actual command to be executed.
We have a few options at our disposal:
- If we looked only for the call to /bin/busybox, we would include false positives.
- If we looked for /bin/busybox XUSRH, we'd miss many attempts with a different value, as XUSRH can be swapped out for another random sequence to foil overly strict rules.
- If we looked for /bin/busybox * then we'd capture this scenario well, but it would be nice not to be specific to detecting the /bin/busybox style of exploits.

Indeed, this is precisely what semantic hashing and binning allow us: the ability to group by semantic similarity without being too specific about what we mean by "semantic" or "similar". We want to cast a wide net, but not pull back every fish in the sea.
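The trade-off between these three options can be seen in a small Python sketch. The first two commands come from the log excerpts; the third is a hypothetical command that merely mentions busybox, added here for illustration. The rule functions are likewise illustrative names, not part of Metron.

```python
from fnmatch import fnmatchcase

COMMANDS = [
    "/bin/busybox LSUCT",   # from the Cowrie excerpts
    "/bin/busybox XUSRH",   # from the Cowrie excerpts
    "ls -l /bin/busybox",   # hypothetical: mentions busybox but is not the probe
]


def substring_rule(command):
    """Option 1: anything that mentions /bin/busybox."""
    return "/bin/busybox" in command


def exact_rule(command):
    """Option 2: one specific marker string."""
    return command == "/bin/busybox XUSRH"


def glob_rule(command):
    """Option 3: /bin/busybox followed by anything."""
    return fnmatchcase(command, "/bin/busybox *")


if __name__ == "__main__":
    for rule in (substring_rule, exact_rule, glob_rule):
        print(rule.__name__, [c for c in COMMANDS if rule(c)])
```

The substring rule also flags the hypothetical listing command, the exact rule misses the LSUCT variant, and the glob catches both probes but remains tied to the busybox pattern.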
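For this demonstration, we will ingest Cowrie logs, enrich each message with a TLSH hash and a similarity bin, raise an alert when a message comes from an IP address on a threat intelligence blacklist, and use the Alerts UI to find activity similar to those alerts.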
We assume that the following environment variables are set:
- METRON_HOME - the home directory for Metron
- ZOOKEEPER - the ZooKeeper quorum (comma separated with port specified, e.g. node1:2181 for full-dev)
- BROKERLIST - the Kafka broker list (comma separated with port specified, e.g. node1:6667 for full-dev)
- ES_HOST - the Elasticsearch master (and port), e.g. node1:9200 for full-dev

Also, this does not assume that you are using a kerberized cluster. If you are, then the parser start command will adjust slightly to include the security protocol.
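Since every later command depends on these variables, a quick check before starting can save some confusion. The following is a small, optional Python sketch; the helper name `missing_vars` is made up for this example.

```python
import os

# The four variables this walkthrough assumes are set.
REQUIRED_VARS = ("METRON_HOME", "ZOOKEEPER", "BROKERLIST", "ES_HOST")


def missing_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Please set: " + ", ".join(missing))
    else:
        print("All required variables are set.")
```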
Before editing configurations, be sure to pull the configs from zookeeper locally via
$METRON_HOME/bin/zk_load_configs.sh --mode PULL -z $ZOOKEEPER -o $METRON_HOME/config/zookeeper/ -f
First we must set up the Cowrie log data on our cluster's access node.
- With the Cowrie log tarball downloaded into ~, create a directory called cowrie in ~ and untar the tarball into that directory via:

    mkdir ~/cowrie
    cd ~/cowrie
    tar xzvf ~/180424243034750.tar.gz
The Cowrie data is coming in as simple JSON blobs, so it's easy to parse. We really just need to adjust the timestamp and a few fields and we have valid data.
- Create $METRON_HOME/config/zookeeper/parsers/cowrie.json with the following content:

    {
      "parserClassName": "org.apache.metron.parsers.json.JSONMapParser",
      "sensorTopic": "cowrie",
      "fieldTransformations": [
        {
          "transformation": "STELLAR",
          "output": [ "timestamp" ],
          "config": {
            "timestamp": "TO_EPOCH_TIMESTAMP( timestamp, 'yyyy-MM-dd\\'T\\'HH:mm:ss.SSS')"
          }
        }
      ]
    }
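The timestamp adjustment is the only real transformation here: the ISO-style string in each Cowrie record becomes an epoch value in milliseconds. The Python sketch below shows what that conversion is meant to accomplish. It assumes the trailing Z means UTC and truncates to milliseconds; Stellar's TO_EPOCH_TIMESTAMP with the pattern above may treat the fractional digits and time zone differently, so treat this as an approximation rather than a reimplementation.

```python
import calendar
from datetime import datetime


def to_epoch_millis(timestamp):
    """Convert a Cowrie timestamp such as 2017-09-18T11:45:25.028091Z to epoch milliseconds."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    seconds = calendar.timegm(parsed.timetuple())  # interprets the fields as UTC
    return seconds * 1000 + parsed.microsecond // 1000  # drop sub-millisecond digits


if __name__ == "__main__":
    print(to_epoch_millis("2017-09-18T11:45:25.028091Z"))
```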
Before we start, we will want to install ES mappings so ES knows how to interpret our fields:
    # The URL is double-quoted so that the shell expands $ES_HOST.
    curl -XPUT "http://$ES_HOST/cowrie*/_mapping/cowrie_doc" -d '
    {
      "properties" : {
        "adapter:stellaradapter:begin:ts" : { "type" : "string" },
        "adapter:stellaradapter:end:ts" : { "type" : "string" },
        "blacklisted" : { "type" : "boolean" },
        "compCS" : { "type" : "string" },
        "data" : { "type" : "string" },
        "dst_ip" : { "type" : "string" },
        "dst_port" : { "type" : "long" },
        "duration" : { "type" : "double" },
        "encCS" : { "type" : "string" },
        "enrichmentjoinbolt:joiner:ts" : { "type" : "string" },
        "enrichmentsplitterbolt:splitter:begin:ts" : { "type" : "string" },
        "enrichmentsplitterbolt:splitter:end:ts" : { "type" : "string" },
        "eventid" : { "type" : "string" },
        "guid" : { "type" : "string" },
        "input" : { "type" : "string" },
        "isError" : { "type" : "long" },
        "is_alert" : { "type" : "string" },
        "kexAlgs" : { "type" : "string" },
        "keyAlgs" : { "type" : "string" },
        "macCS" : { "type" : "string" },
        "message" : { "type" : "string" },
        "original_string" : { "type" : "string" },
        "password" : { "type" : "string" },
        "sensor" : { "type" : "string" },
        "session" : { "type" : "string" },
        "similarity_bin" : { "type" : "string" },
        "size" : { "type" : "long" },
        "source:type" : { "type" : "string" },
        "src_ip" : { "type" : "string" },
        "src_port" : { "type" : "long" },
        "system" : { "type" : "string" },
        "threat:triage:rules:0:comment" : { "type" : "string" },
        "threat:triage:rules:0:name" : { "type" : "string" },
        "threat:triage:rules:0:reason" : { "type" : "string" },
        "threat:triage:rules:0:score" : { "type" : "long" },
        "threat:triage:score" : { "type" : "double" },
        "threatinteljoinbolt:joiner:ts" : { "type" : "string" },
        "threatintelsplitterbolt:splitter:begin:ts" : { "type" : "string" },
        "threatintelsplitterbolt:splitter:end:ts" : { "type" : "string" },
        "timestamp" : { "type" : "long" },
        "tlsh" : { "type" : "string" },
        "ttylog" : { "type" : "string" },
        "username" : { "type" : "string" },
        "version" : { "type" : "string" },
        "alert" : { "type" : "nested" }
      }
    }
    '
- Create the cowrie Kafka topic via:

    /usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper $ZOOKEEPER --create --topic cowrie --partitions 1 --replication-factor 1
Here, to build out a scenario, we will assume that we have a blacklist of known malicious hosts. For our purposes, we'll choose one particular host IP to be malicious.
- Create ~/blacklist.csv to contain the following:

    94.51.110.74

- Create ~/blacklist_extractor.json to contain the following:

    {
      "config" : {
        "columns" : { "ip" : 0 },
        "indicator_column" : "ip",
        "type" : "blacklist",
        "separator" : ","
      },
      "extractor" : "CSV"
    }

- Load the data via:

    $METRON_HOME/bin/flatfile_loader.sh -i ~/blacklist.csv -t threatintel -c t -e ~/blacklist_extractor.json
This will create a new enrichment type “blacklist” with a single entry “94.51.110.74”.
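We will want to do the following:
- Construct a representation of each message that is sufficient for clustering, built from:
  - the message field
  - the input field
  - the isError field
- Compute the TLSH hash of this representation, called tlsh
- Compute the locality-sensitive hash of the TLSH hash, suitable for binning, called similarity_bin
- Set is_alert when the source IP is blacklisted
- Score blacklisted messages as 10. In production, this would be more complex.

Now we can create the enrichments by creating $METRON_HOME/config/zookeeper/enrichments/cowrie.json with the following content: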
    {
      "enrichment": {
        "fieldMap": {
          "stellar" : {
            "config" : [
              "characteristic_rep := JOIN([ 'message', exists(message)?message:'', 'input', exists(input)?input:'', 'isError', exists(isError)?isError:''], '|')",
              "forensic_hashes := HASH(characteristic_rep, 'tlsh', { 'hashes' : 16, 'bucketSize' : 128 })",
              "similarity_bin := MAP_GET('tlsh_bin', forensic_hashes)",
              "tlsh := MAP_GET('tlsh', forensic_hashes)",
              "forensic_hashes := null",
              "characteristic_rep := null"
            ]
          }
        },
        "fieldToTypeMap": { }
      },
      "threatIntel": {
        "fieldMap": {
          "stellar" : {
            "config" : [
              "blacklisted := ENRICHMENT_EXISTS( 'blacklist', src_ip, 'threatintel', 't')",
              "is_alert := (exists(is_alert) && is_alert) || blacklisted"
            ]
          }
        },
        "fieldToTypeMap": { },
        "triageConfig" : {
          "riskLevelRules" : [
            {
              "name" : "Blacklisted Host",
              "comment" : "Determine if a host is blacklisted",
              "rule" : "blacklisted != null && blacklisted",
              "score" : 10,
              "reason" : "FORMAT('IP %s is blacklisted', src_ip)"
            }
          ],
          "aggregator" : "MAX"
        }
      }
    }
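To see what actually gets hashed, the following Python sketch mirrors the characteristic_rep expression from the configuration: each field name is followed by its value (or an empty string when the field is absent), joined with a pipe. The function is an illustrative stand-in for the Stellar JOIN, and it assumes values are simply converted to strings.

```python
def characteristic_rep(record):
    """Join the field names and values used for clustering, as the Stellar JOIN does."""
    parts = []
    for field in ("message", "input", "isError"):
        value = record.get(field)
        parts.append(field)
        parts.append("" if value is None else str(value))  # missing fields become ''
    return "|".join(parts)


if __name__ == "__main__":
    record = {
        "message": "Command found: /bin/busybox LSUCT",
        "input": "/bin/busybox LSUCT",
        "isError": 0,
    }
    print(characteristic_rep(record))
```

Two busybox probes with different random markers produce strings that differ in only a handful of characters, which is the kind of near-duplicate input the TLSH hash and the similarity bin are meant to group together.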
Notice that we have specified 16 hash functions when constructing the similarity bin. I arrived at that by trial and error, which, frankly, is not always tenable. It is likely more sensible to construct multiple similarity bins, of sizes 8, 16, and 32 at minimum.
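One way to reason about the choice between 8, 16, and 32 is to look at how likely two hashes are to share a bin. The Python sketch below assumes a bit-sampling LSH in which each of the k sampled positions is drawn independently and uniformly from an n-bit hash; under that assumption, two hashes at Hamming distance d collide with probability (1 - d/n)^k. The 256-bit length and the distance of 16 are hypothetical values chosen for illustration, not measurements from the Cowrie data.

```python
def collision_probability(distance, total_bits, num_samples):
    """Probability that two hashes at the given Hamming distance share a bin."""
    if not 0 <= distance <= total_bits:
        raise ValueError("distance must be between 0 and total_bits")
    # Each sampled position agrees with probability 1 - d/n, independently.
    return (1 - distance / total_bits) ** num_samples


if __name__ == "__main__":
    for num_samples in (8, 16, 32):
        p = collision_probability(distance=16, total_bits=256, num_samples=num_samples)
        print(num_samples, round(p, 3))
```

Fewer sampled bits give coarser bins that group more loosely related messages, while more sampled bits give stricter bins, which is why keeping several bin sizes side by side is attractive.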
We want to pull a snapshot of the Cowrie logs, so create ~/load_data.sh with the following content:

    #!/bin/bash
    # Replay a fixed set of Cowrie log files into the cowrie Kafka topic.
    COWRIE_HOME=~/cowrie
    for i in cowrie.1626302-1636522.json \
             cowrie.16879981-16892488.json \
             cowrie.21312194-21331475.json \
             cowrie.698260-710913.json \
             cowrie.762933-772239.json \
             cowrie.929866-939552.json \
             cowrie.1246880-1248235.json \
             cowrie.19285959-19295444.json \
             cowrie.16542668-16581213.json \
             cowrie.5849832-5871517.json \
             cowrie.6607473-6609163.json; do
      echo $i
      # node1:6667 is the full-dev broker; adjust for your cluster.
      cat $COWRIE_HOME/$i | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list node1:6667 --topic cowrie
      sleep 2
    done

- Set the +x bit on the executable via:

    chmod +x ~/load_data.sh
From here, we've set up our configuration and can push the configs, start the parser, and load the data:
- Push the configs via:

    $METRON_HOME/bin/zk_load_configs.sh --mode PUSH -z $ZOOKEEPER -i $METRON_HOME/config/zookeeper/

- Start the parser via:

    $METRON_HOME/bin/start_parser_topology.sh -k $BROKERLIST -z $ZOOKEEPER -s cowrie

- Push the Cowrie data into the cowrie topic via:

    ~/load_data.sh
Once this data is loaded, we can use the Alerts UI, starting from known malicious actors, to find others doing similar things.
First we can look at the alerts directly and find an instance of our /bin/busybox activity.
We can now pivot and look for messages that have the same similarity_bin but are not alerts.
Doing so surfaces a few more malicious actors.
Now we can look at other things that they're doing to build and refine our definition of what an alert is without resorting to hard-coding of rules. Note that nothing in our enrichments actually used the string busybox, so this is a more general-purpose way of navigating similar things.
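The pivot performed in the Alerts UI amounts to a simple grouping, sketched below in Python over a few in-memory records. The two source IPs come from the log excerpts earlier in this walkthrough; the third IP, the bin values, and the function name are made up for illustration, and is_alert is modeled as a boolean for simplicity.

```python
from collections import defaultdict


def similar_non_alerts(records):
    """Map each similarity bin seen in an alert to the non-alert source IPs sharing it."""
    alert_bins = {
        r["similarity_bin"]
        for r in records
        if r.get("is_alert") and "similarity_bin" in r
    }
    matches = defaultdict(list)
    for r in records:
        bin_id = r.get("similarity_bin")
        if bin_id in alert_bins and not r.get("is_alert"):
            matches[bin_id].append(r.get("src_ip"))
    return dict(matches)


if __name__ == "__main__":
    records = [
        # Blacklisted host from the walkthrough; the bin value is hypothetical.
        {"src_ip": "94.51.110.74", "similarity_bin": "16abcd", "is_alert": True},
        # Same hypothetical bin, but not on the blacklist.
        {"src_ip": "121.237.129.163", "similarity_bin": "16abcd", "is_alert": False},
        # Hypothetical unrelated traffic in a different bin.
        {"src_ip": "192.0.2.10", "similarity_bin": "16ffff", "is_alert": False},
    ]
    print(similar_non_alerts(records))
```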