The spot-ml jar
The spot-ml jar contains a single main routine, in the class SuspiciousConnects.
The Suspicious Connects analyses are invoked by submitting this class to Spark.
Command line arguments
- analysis The analysis to perform. One of flow, proxy, or dns.
- input Path to data on HDFS. Data is expected to be stored in Parquet, with a schema consistent with the one used by the Suspicious Connects analyses.
- feedback Local path of file containing feedback scores.
- dupfactor Duplication factor that controls how non-threatening events from the feedback file are down-rated.
- ldatopiccount Number of topics in the topic model.
- userdomain The user domain of the network being analyzed.
- scored The HDFS path where results will be stored.
- threshold Threshold for determination of anomalies. Records with scores above the threshold will not be returned.
- maxresults Maximum number of records to return. If -1, all records are returned. Results are filtered by the threshold, then sorted so that the most suspicious (lowest score) records are returned first.
- delimiter Separation character used for CSVs containing most suspicious results.
- prgseed Seed for the pseudorandom generator used in topic modelling.
- ldamaxiteration Maximum number of iterations to execute the LDA topic modelling procedure.
- ldaalpha Document concentration for LDA. Default: 1.02.
- ldabeta Topic concentration for LDA. Default: 1.001.
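
The way threshold and maxresults interact can be illustrated with a small sketch. This is not the spot-ml implementation (which runs on Spark); it is a plain Python illustration of the behavior described above, and the function name `select_most_suspicious` and the `(record, score)` tuple layout are assumptions made for the example.

```python
def select_most_suspicious(scored, threshold, max_results):
    """Illustrative only: mimic the documented threshold/maxresults behavior.

    scored      -- list of (record, score) tuples (hypothetical layout)
    threshold   -- records with a score above this value are dropped
    max_results -- number of records to keep; -1 means keep all
    """
    # Drop records whose score is above the threshold.
    kept = [pair for pair in scored if pair[1] <= threshold]
    # Lowest score means most suspicious, so sort ascending.
    kept.sort(key=lambda pair: pair[1])
    # -1 is the sentinel for "return everything that passed the filter".
    return kept if max_results == -1 else kept[:max_results]
```

For example, with a threshold of 1e-3 and maxresults of 2, only the two lowest-scoring records at or below 1e-3 are returned, lowest score first.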
Example