Analyzing Security ML Models with Imperfect Data in Production

The landing page of the dashboard, showing the Filter panel on the left and the Visualization panel with three tabs on the right, with the Model Metrics tab selected. The first two charts show the Detection (True Positive) Rates and False Positive Rates of our selected models and other anonymized detection engines for detecting malicious portable executable (PE) files. Next comes the Ratio of Samples scanned per engine, followed by TPR, FPR, and finally the ROC curve.

The SophosAI team develops numerous machine learning models that are integrated directly into our products. We currently have more than 30 models deployed in production that detect malicious files, URLs, and emails in our client environments. When we develop new machine learning models, we typically train and evaluate them on manually curated, fixed data sets so that we can accurately track research progress.

Once a model is deployed into production, we need to move away from static dataset evaluation to automatically evaluating performance on new incoming data. This sounds simpler than it is in practice, mainly because in a fully automatic pipeline it is difficult to tell whether an observed performance issue is due to the model itself, pipeline issues, emerging data distribution biases, or some combination of these.

In our organization, the data infrastructure, data science, and malware research teams work together to develop, deliver, and continuously improve production models. We need a shared source of truth that gives all the teams common situational awareness of what is going on with our models. With these use cases in mind, we built a real-time visualization dashboard, AI Total, that helps our teams achieve these goals:

  • Monitor the deployed models’ regular performance, observe time trends, and detect anomalies 
  • Detect issues with the evaluation data, models, or labeling
  • Investigate the reasons behind the issues 
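As a rough illustration of the first goal, the sketch below computes per-day true positive and false positive rates from labeled scan results and flags days whose rate deviates strongly from the rest. It is not AI Total's implementation: the record layout `(day, label, prediction)`, the function names `daily_rates` and `flag_anomalies`, and the z-score threshold are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def daily_rates(records):
    """Compute TPR and FPR per day.

    `records` is an iterable of (day, label, prediction) tuples, where
    label and prediction are 1 for malicious and 0 for benign.
    (Hypothetical layout, chosen for this sketch.)
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for day, label, pred in records:
        if label == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "fp" if pred == 1 else "tn"
        counts[day][key] += 1

    rates = {}
    for day in sorted(counts):
        c = counts[day]
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        rates[day] = {
            # A rate is undefined (None) when a day has no samples of that class.
            "tpr": c["tp"] / positives if positives else None,
            "fpr": c["fp"] / negatives if negatives else None,
        }
    return rates


def flag_anomalies(rates, metric, threshold=2.0):
    """Return the days whose `metric` lies more than `threshold`
    population standard deviations from the mean over all days."""
    points = [(day, r[metric]) for day, r in rates.items() if r[metric] is not None]
    values = [v for _, v in points]
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [day for day, v in points if abs(v - mu) / sigma > threshold]
```

A flagged day only says that something changed. As noted above, the cause could be the model, the pipeline, the labels, or a shift in the incoming data, which is why the flag is a starting point for investigation rather than a verdict.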

Our web-based visualization system, AI Total, lets users quickly gather headline performance numbers while maintaining confidence that the underlying data pipeline is functioning properly. It also helps us quickly trace the root cause of an issue when something goes wrong. We introduce a novel way to analyze performance under data issues using a data coverage equalizer. Our paper, AI Total: Analyzing Security ML Models with Imperfect Data in Production, was published at the IEEE Symposium on Visualization for Cyber Security, 2021*.

Check out our paper to see how visualization met the operational needs of our industry research team in detecting and resolving the issues we frequently see in our production security models. In it, we describe the full step-by-step design of the user interface, share the lessons we learned, and demonstrate how we use the system. We added multiple simple views rather than one complex view to support data scientists' workflows while keeping the dashboard simple for high-level users. We focused on finding trends and anomalies in the data feeds relevant to the models. A combination of several charts enables the team to ask questions, verify hypotheses, and generate insights. If you would like to know more about our behind-the-scenes data infrastructure work, check out this talk.

*The IEEE Symposium on Visualization for Cyber Security (VizSec) is a forum that brings together researchers and practitioners from academia, government, and industry to address the needs of the cyber security community through new and insightful visualization and analysis techniques. VizSec provides an excellent venue for fostering greater exchange and new collaborations on a broad range of security and privacy-related topics.