The spot-ml jar

The spot-ml jar contains a single main routine, located in the class SuspiciousConnects. The Suspicious Connects analyses are invoked by submitting this class to Spark.

Command line arguments

  • analysis The analysis to perform. One of flow, proxy, dns.
  • input Path to data on HDFS. Data is expected to be stored in Parquet with a schema consistent with the schema used by the Suspicious Connects analyses.
  • feedback Local path of file containing feedback scores.
  • dupfactor Duplication factor controlling how strongly non-threatening events from the feedback file are down-rated.
  • ldatopiccount Number of topics in the topic model.
  • userdomain The user domain of the network being analyzed.
  • scored The HDFS path where results will be stored.
  • threshold Threshold for determination of anomalies. Records with scores above the threshold will not be returned.
  • maxresults Maximum number of records to return. If -1, all records are returned. Results are first filtered by the threshold, then sorted so that the most suspicious records (those with the lowest scores) are returned first.
  • delimiter Separation character used for CSVs containing most suspicious results.
  • prgseed Seed for the pseudorandom generator used in topic modelling.
  • ldamaxiteration Maximum number of iterations to execute the LDA topic modelling procedure.
  • ldaalpha Document concentration for LDA. Default: 1.02.
  • ldabeta Topic concentration for LDA. Default: 1.001.
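
The interaction between threshold and maxresults described above can be illustrated with a small sketch. This is not the code in the spot-ml jar, which runs on Spark; it is a plain Python illustration in which the function name most_suspicious and the (record, score) tuple layout are assumptions made for the example. Following the wording above, records whose scores are strictly above the threshold are dropped, and records exactly at the threshold are kept.

```python
from typing import List, Tuple

# Hypothetical record layout for illustration: (record id, score).
ScoredRecord = Tuple[str, float]


def most_suspicious(records: List[ScoredRecord],
                    threshold: float,
                    max_results: int) -> List[ScoredRecord]:
    """Illustrative sketch of the threshold / maxresults semantics."""
    # Drop records whose score is above the threshold.
    kept = [r for r in records if r[1] <= threshold]
    # Lowest score first: lower scores are more suspicious.
    kept.sort(key=lambda r: r[1])
    # A max_results of -1 means "return everything that passed the filter".
    if max_results == -1:
        return kept
    return kept[:max_results]


sample = [("a", 0.5), ("b", 1e-6), ("c", 1e-3), ("d", 1e-9)]
top_two = most_suspicious(sample, threshold=1e-2, max_results=2)
everything = most_suspicious(sample, threshold=1e-2, max_results=-1)
```

With this sample input, record "a" is removed by the threshold, and the remaining records are ordered from lowest to highest score before the maxresults cap is applied.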