The Flow sub-module extracts and transforms Flow data already ranked by spot-ml and loads it into Impala tables for the presentation layer.
Flow spot-oa main script executes the following steps:
1. Creates the required folder structure for output files if it does not exist:

        data: data/flow/<date>/
        ipython Notebooks: ipynb/flow/<date>/

2. Creates a copy of the iPython notebooks from the templates in the ipynb_templates folder into the output folder.
3. Reads Flow spot-ml results for a given date and loads only the requested limit.
4. Adds network context to source and destination IPs.
5. Adds geolocation to source and destination IPs.
6. Stores transformed data in the selected database.
7. Generates details and chord diagram data. These details include information about additional connections and some additional information to draw chord diagrams in the UI.
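Steps 1 and 3 above can be sketched in a few lines of Python. This is a minimal illustration, not the spot-oa implementation: the function names `create_output_dirs` and `read_top_results` are hypothetical, the results file is assumed to be a CSV without a header row that is already ordered by rank, and `exist_ok` requires Python 3.

```python
import csv
import os
from itertools import islice


def create_output_dirs(base_dir, date):
    """Create data/flow/<date>/ and ipynb/flow/<date>/ if they are missing."""
    paths = [
        os.path.join(base_dir, "data", "flow", date),
        os.path.join(base_dir, "ipynb", "flow", date),
    ]
    for path in paths:
        # exist_ok=True makes the call a no-op when the folder already exists.
        os.makedirs(path, exist_ok=True)
    return paths


def read_top_results(csv_path, limit):
    """Return at most `limit` rows from the start of the results file.

    Assumption: the file has no header row and is already sorted by rank.
    """
    with open(csv_path, newline="") as handle:
        return list(islice(csv.reader(handle), limit))
```

Using `islice` stops reading as soon as the limit is reached, so a large results file is never loaded in full.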
The following files and modules are already included but some of them require configuration. See the following sections for more information:
The following files are not included:
Before running Flow OA, users need to configure components for the first time. Note that configuring these components also makes them work for other data sources such as DNS and Proxy.
Main results for Flow OA. The data stored in this table is limited by the number of rows the user selected when running oa/start_oa.py.
Table schema:

    0. tstart: string
    1. srcip: string
    2. dstip: string
    3. sport: int
    4. dport: int
    5. proto: string
    6. ipkt: int
    7. ibyt: int
    8. opkt: int
    9. obyt: int
    10. score: float
    11. rank: int
    12. srcip_internal: bit
    13. destip_internal: bit
    15. src_geoloc: string
    16. dst_geoloc: string
    17. src_domain: string
    18. dst_domain: string
    19. src_rep: string
    20. dst_rep: string
A query is executed for each suspicious connection detected, to find the details of every connection that occurred during the same minute between the given source IP and destination IP.
Table schema:

    0. tstart: string
    1. srcip: string
    2. dstip: string
    3. sport: int
    4. dport: int
    5. proto: string
    6. flags: string
    7. tos: int
    8. ibyt: bigint
    9. ipkt: bigint
    10. pkts: bigint
    11. input: int
    12. output: int
    13. rip: string
    14. obyt: bigint
    15. opkt: bigint
    16. hh: int
    17. md: int
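The per-connection detail lookup can be illustrated in plain Python over in-memory records. This is only a sketch of the selection logic, not the query spot-oa runs: the function name `same_minute_details` is hypothetical, timestamps are assumed to be in `YYYY-MM-DD HH:MM:SS` format, and matching the IP pair in either direction is an assumption.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def to_minute(tstart):
    """Truncate a timestamp string to the start of its minute."""
    return datetime.strptime(tstart, TS_FORMAT).replace(second=0)


def same_minute_details(records, tstart, srcip, dstip):
    """Select records in the same minute between the two given IPs.

    Assumption: traffic in either direction between the pair is a match.
    """
    minute = to_minute(tstart)
    pair = {srcip, dstip}
    return [
        rec for rec in records
        if to_minute(rec["tstart"]) == minute
        and {rec["srcip"], rec["dstip"]} == pair
    ]
```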
A query is executed for each distinct client IP that has connections to 2 or more other suspicious IPs. This query retrieves the sum of input packets and bytes transferred between the client IP and every other suspicious IP it connected to.
Table schema:

    0. ip_threat: string
    1. srcip: string
    2. dstip: string
    3. ibyt: bigint
    4. ipkt: bigint
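The chord data selection described above can be sketched as follows. This is an illustrative aggregation over in-memory records rather than the actual query: the function name `chord_rows` and the `min_peers` parameter are hypothetical, and the input is assumed to be a list of suspicious connections carrying `srcip`, `dstip`, `ibyt` and `ipkt`.

```python
from collections import defaultdict


def chord_rows(connections, min_peers=2):
    """Build (ip_threat, srcip, dstip, ibyt, ipkt) rows for chord diagrams."""
    # Collect, for every IP, the set of other IPs it exchanged traffic with.
    peers = defaultdict(set)
    for conn in connections:
        if conn["srcip"] != conn["dstip"]:
            peers[conn["srcip"]].add(conn["dstip"])
            peers[conn["dstip"]].add(conn["srcip"])

    rows = []
    for ip in sorted(peers):
        # Only IPs connected to at least `min_peers` other IPs qualify.
        if len(peers[ip]) < min_peers:
            continue
        # Sum input bytes and packets per (srcip, dstip) pair involving this IP.
        totals = defaultdict(lambda: [0, 0])
        for conn in connections:
            if ip in (conn["srcip"], conn["dstip"]):
                total = totals[(conn["srcip"], conn["dstip"])]
                total[0] += conn["ibyt"]
                total[1] += conn["ipkt"]
        for (src, dst), (ibyt, ipkt) in sorted(totals.items()):
            rows.append({"ip_threat": ip, "srcip": src, "dstip": dst,
                         "ibyt": ibyt, "ipkt": ipkt})
    return rows
```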
This table is populated with the number of connections ingested per minute during that day.
Table schema:

    0. tdate: string
    1. total: bigint
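As a rough illustration of how per-minute totals like these could be derived, the sketch below counts connections by minute. It assumes timestamps are strings in `YYYY-MM-DD HH:MM:SS` format; the function name `connections_per_minute` is hypothetical and the real table is populated by a database query, not by this code.

```python
from collections import Counter


def connections_per_minute(timestamps):
    """Return sorted (tdate, total) pairs, one per minute seen.

    Assumption: the first 16 characters of each timestamp
    ("YYYY-MM-DD HH:MM") identify the minute.
    """
    counts = Counter(ts[:16] for ts in timestamps)
    return sorted(counts.items())
```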
Flow spot-oa configuration. Contains column names and indexes for input and output files. This JSON file contains 3 main arrays:
- flow_results_fields: list of column names and indexes of the ML flow_results.csv file. Flow OA uses this mapping to reference columns by name.
- column_indexes_filter: the list of indices to take out of flow_results_fields for the OA process.
- flow_score_fields: list of column names and indexes for flow_scores.csv. After the OA process completes, more columns are added.
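The snippet below sketches how such a mapping can be used to reference columns by name and to keep only the filtered indexes. The JSON shown is a made-up, shortened example: the exact layout of the real configuration file and the column positions are assumptions, and `select_columns` is a hypothetical helper.

```python
import json

# Hypothetical, shortened configuration; the real file has more columns.
conf = json.loads("""
{
    "flow_results_fields": {"tstart": 0, "srcip": 1, "dstip": 2,
                            "sport": 3, "dport": 4, "score": 5},
    "column_indexes_filter": [0, 1, 2, 5]
}
""")


def select_columns(row, indexes):
    """Keep only the values at the given positions, in the given order."""
    return [row[i] for i in indexes]


# One example row laid out according to flow_results_fields above.
row = ["2016-07-08 00:01:02", "10.0.0.1", "10.0.0.2", "1234", "80", "0.0001"]

# Reference a column by name instead of by a hard-coded position.
fields = conf["flow_results_fields"]
source_ip = row[fields["srcip"]]

# Apply the index filter to get the reduced row used downstream.
filtered = select_columns(row, conf["column_indexes_filter"])
```

Keeping the positions in configuration means a change in the column order of the results file only requires editing the JSON, not the code that reads it.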
Templates for iPython notebooks. After the OA process completes, each iPython notebook is copied to the ipynb/<pipeline>/<date> path. With these notebooks, users can perform further analysis and score connections. Users can also experiment by adding or modifying code. If new functionality is required in a notebook, the templates need to be modified so that later executions include it. For further reference on how to work with this notebook, you can read:
To reset all scored connections for a day, a specific cell with a preloaded function is included in the Advanced Mode Notebook. The cell is commented out to avoid accidental execution, but it is properly labeled.