Introduction

Users may perform thousands of actions (events) per day in their Amazon Web Services (AWS) environments. This generates a massive amount of IAM user activity* data, which makes manual security monitoring infeasible. Anomaly detection is one technique that helps security teams uncover security issues in this huge volume of data, saves time, and makes user activity monitoring feasible. With it in place, security teams can rely on a few anomalies raised daily instead of reviewing thousands of individual actions, and use them to identify both accidental changes and malicious activity from compromised credentials.

Our research in IAM user anomaly detection focuses on detecting deviations from normal user behavior in AWS environments. The Sophos Cloud Optix Anomaly Detection Service continuously analyzes AWS CloudTrail events to learn users’ past activity patterns (kind, type, source, and time of actions, etc.). It sorts through the heaps of user activity events generated daily and flags potentially suspicious patterns when a user’s current activity deviates from their past normal behavior. With a focus on interpretable ML, the outputs of the anomaly detection model are presented in straightforward language along with a confidence level to make follow-up easy.

*Figure 1: Interpretable ML: Reasons for an anomaly explained in layman’s terms*

As an example, let’s consider a user who has privileges to perform several actions inside an AWS environment. On a day-to-day basis, this user does not perform all the actions they have permission for, i.e., the user is overprivileged. The actions the user does perform are completed in a fairly consistent way, one that our model learns. If this user’s credentials are compromised, an attacker who gets into the system might suddenly start executing actions that the user had access to but never performed (new actions). Some of these could be risky, such as AuthorizeSecurityGroupIngress, DeleteGroupPolicy, or AttachRolePolicy. The attacker might also execute actions that the user normally performs, but in a pattern that differs from past behavior, e.g., in the time of the event, the frequency of the event, or the event source. Our approach captures and scores such deviations from normal behavior, and the scores are used to alert security teams for further investigation.

*Figure 2: Sample activity set of a potentially malicious user*

Considerations

While building our anomaly detection models, we took several considerations into account to help the models learn as much as possible from user activity data. The following are some of those considerations, with explanations of how we approached them.

1. Incorporating Events Context in Data Modelling
Single, isolated IAM user events in AWS environments don’t offer much value for learning user behavior. To learn deeper activity patterns, we put these isolated events into a sequence. Through this approach, we capture user activity patterns such as what types of events a user normally performs and at what time. For example, our approach learns the normal working hours of a user and uses the number of events the user performs outside those hours on any given date as one of the inputs.

*Figure 3: Sample event sequence snapshot on a given day*
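
To make the idea of sequencing events and counting out-of-hours activity concrete, here is a minimal sketch. It is not the Cloud Optix implementation: the function names, the 90% coverage rule for "normal working hours", and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def daily_sequence(events):
    """Order one day's (timestamp, action) pairs by time; return the action names."""
    return [action for _ts, action in sorted(events, key=lambda e: e[0])]


def learn_working_hours(past_timestamps, coverage=0.9):
    """Return the busiest hours of day that together cover `coverage` of past events.

    The coverage rule is an assumption of this sketch, not the article's method.
    """
    counts = Counter(ts.hour for ts in past_timestamps)
    total = sum(counts.values())
    hours, covered = set(), 0
    for hour, n in counts.most_common():
        if covered >= coverage * total:
            break
        hours.add(hour)
        covered += n
    return hours


def count_out_of_hours(events, working_hours):
    """Count events whose hour of day falls outside the learned working hours."""
    return sum(1 for ts, _action in events if ts.hour not in working_hours)


# Hypothetical history: activity at 09:00 and 10:00 on ten days, plus one stray event.
past = (
    [datetime(2024, 1, d, 9, 0) for d in range(1, 11)]
    + [datetime(2024, 1, d, 10, 0) for d in range(1, 11)]
    + [datetime(2024, 1, 5, 3, 0)]
)
working_hours = learn_working_hours(past)

# Hypothetical events for one day, as (timestamp, CloudTrail event name) pairs.
today = [
    (datetime(2024, 2, 1, 23, 0), "DeleteGroupPolicy"),
    (datetime(2024, 2, 1, 9, 30), "DescribeInstances"),
    (datetime(2024, 2, 1, 3, 15), "AttachRolePolicy"),
]
sequence = daily_sequence(today)
out_of_hours = count_out_of_hours(today, working_hours)
```

The out-of-hours count would then be one feature among several that a model receives for that user and date.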

2. Customized Approach for Users to Score New Events
Each user has a different probability distribution over the kinds of actions they perform in an environment. Every time a user performs an action, it is either something the user has done before or a new event. For a new event, we have no past data for that user, since the action is being performed for the first time. So instead of treating every new event the same way, we score it using a new event model, which assigns each user their own probability distribution for performing a new kind of event.
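
One simple way to picture a per-user new-event probability is shown below. This is only a sketch under stated assumptions: the article does not describe the actual model, so the estimator here (the smoothed fraction of past days on which the user did something for the first time) and the names `new_event_probability` and `alpha` are hypothetical.

```python
def new_event_probability(daily_actions, alpha=1.0):
    """Estimate how likely a user is to perform a never-before-seen action on a day.

    daily_actions: list of sets of action names, one set per past day, oldest first.
    The first day only seeds the set of known actions. `alpha` is a Laplace
    smoothing constant (an assumption of this sketch).
    """
    seen = set(daily_actions[0]) if daily_actions else set()
    later_days = daily_actions[1:]
    days_with_new = 0
    for actions in later_days:
        if actions - seen:  # at least one action not performed before
            days_with_new += 1
        seen |= actions
    return (days_with_new + alpha) / (len(later_days) + 2 * alpha)


# Hypothetical users: one with a stable routine, one who keeps trying new actions.
stable_user = [{"A", "B"}, {"A"}, {"A", "C"}, {"B"}, {"A"}]
exploring_user = [{"A"}, {"B"}, {"C"}, {"D"}]

p_stable = new_event_probability(stable_user)
p_exploring = new_event_probability(exploring_user)
```

Under this sketch, a first-time action from the stable user is more surprising than the same action from the exploring user, which is the intuition behind not treating all new events alike.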

Another piece of data we use to make the model more robust is the activity of other users in the environment. In any environment, different users have different IAM roles. As a result, some users are more similar to each other than to others, based on the kinds of actions they can perform in the environment. We trained a user-similarity model for each user, through which we identify which users are more likely to perform a certain set of actions. Using this model, we then adjust the probability distribution of a new event for a user if that event has previously been performed by the set of similar users.
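
The peer-based adjustment can be sketched as follows. Jaccard similarity over observed action sets stands in here for the trained user-similarity model, which the article does not detail; the `threshold` and `boost` parameters, the function names, and the sample users are assumptions for illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of action names."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_users(user, history, threshold=0.5):
    """Return the other users whose action sets overlap enough with `user`'s."""
    return [
        other
        for other in history
        if other != user and jaccard(history[user], history[other]) >= threshold
    ]


def adjusted_new_event_probability(base_prob, action, user, history,
                                   threshold=0.5, boost=2.0):
    """Raise the probability of a new action if a similar user has performed it."""
    peers = similar_users(user, history, threshold)
    if any(action in history[peer] for peer in peers):
        return min(1.0, base_prob * boost)
    return base_prob


# Hypothetical per-user sets of previously observed actions.
history = {
    "alice": {"RunInstances", "StopInstances", "DescribeInstances"},
    "bob": {"RunInstances", "StopInstances", "DescribeInstances", "CreateTags"},
    "carol": {"PutObject", "GetObject"},
}

# CreateTags is new for alice but known for her peer bob, so it is less surprising.
p_peer_action = adjusted_new_event_probability(0.1, "CreateTags", "alice", history)
# PutObject has only been performed by carol, who is not similar to alice.
p_other_action = adjusted_new_event_probability(0.1, "PutObject", "alice", history)
```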

3. Tackling Events from Automation IAM Roles
Inside AWS environments, certain roles are created to execute automated processes. These could be scheduled processes that run at a fixed interval (say, daily), on-demand processes that perform actions at a much higher rate than a normal user, or processes with a limited event set. Such IAM roles are scored differently, through a combination of rule-based and model-based approaches. As an example, when a role running a scheduled automation process performs an action outside its normal working hours, it is penalized more than a normal user would be. We detect such roles using an automation user detection model and flag anomalies corresponding to them as well.
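
The three automation signals described above (a fixed schedule, an exceptionally high event rate, a limited event set) and the heavier out-of-hours penalty can be sketched with simple rules. The article uses a detection model for this step, so the rule-only version below is a stand-in, and every threshold, field name, and factor in it is a hypothetical value chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ActivityProfile:
    events_per_day: float    # average number of events per active day
    distinct_actions: int    # number of different action names observed
    active_hours: frozenset  # hours of day in which activity was observed


def looks_automated(profile, rate_threshold=5000, max_actions=5, max_hours=2):
    """Rule-based stand-in for an automation detection model (thresholds assumed)."""
    high_rate = profile.events_per_day >= rate_threshold
    limited_event_set = profile.distinct_actions <= max_actions
    fixed_schedule = len(profile.active_hours) <= max_hours
    return high_rate or (limited_event_set and fixed_schedule)


def out_of_hours_penalty(profile, hour, base=1.0, automation_factor=3.0):
    """Penalize activity outside the usual hours, more heavily for automation."""
    if hour in profile.active_hours:
        return 0.0
    return base * automation_factor if looks_automated(profile) else base


# Hypothetical profiles: a nightly scheduled job and an interactive user.
nightly_job = ActivityProfile(events_per_day=24, distinct_actions=3,
                              active_hours=frozenset({2}))
engineer = ActivityProfile(events_per_day=150, distinct_actions=40,
                           active_hours=frozenset(range(9, 18)))

job_penalty = out_of_hours_penalty(nightly_job, hour=14)
engineer_penalty = out_of_hours_penalty(engineer, hour=22)
```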

4. Ability to Raise Proactive Alerts
Security teams need a way to decide which alerts to prioritize. So, we trained an anomaly detection model for each user that scores new activity against past activity and assigns a confidence level to each flagged anomaly. It buckets anomalies into three categories: Low Confidence, Medium Confidence, and High Confidence. We raise proactive alerts for the anomalies that the model flags with high confidence, and security teams can then prioritize and investigate them further.

*Figure 5: Alert raised for a sample High Confidence anomaly*
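
The bucketing and alerting step might look like the sketch below. It assumes the per-user model emits an anomaly score between 0 and 1; the cut-off values, the `score` field, and the function names are hypothetical, since the article does not state how the confidence levels are derived.

```python
from enum import Enum


class Confidence(Enum):
    LOW = "Low Confidence"
    MEDIUM = "Medium Confidence"
    HIGH = "High Confidence"


def bucket(score, medium_cutoff=0.6, high_cutoff=0.85):
    """Map an anomaly score in [0, 1] to a confidence level (cut-offs assumed)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= high_cutoff:
        return Confidence.HIGH
    if score >= medium_cutoff:
        return Confidence.MEDIUM
    return Confidence.LOW


def proactive_alerts(anomalies):
    """Keep only the anomalies that should trigger a proactive alert."""
    return [a for a in anomalies if bucket(a["score"]) is Confidence.HIGH]


# Hypothetical flagged anomalies for three users.
anomalies = [
    {"user": "alice", "score": 0.30},
    {"user": "bob", "score": 0.70},
    {"user": "build-role", "score": 0.93},
]
alerts = proactive_alerts(anomalies)
```

Low and medium confidence anomalies remain available for review; only the high confidence ones generate an alert.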

5. Explainable Anomalies
When our model flags an anomaly, the output is displayed in accessible language. Moving away from the black box approach helps security teams instantly pinpoint unusual activity and investigate anomalies without ambiguity. Actions that contribute to the anomaly are highlighted separately in a timeline to provide deeper context for any anomaly raised. All of this helps security teams build a clear and detailed picture of individual user and role activity. Example detections include when a user:

Performs actions they have never done before

Completes actions outside of their normal working hours

Executes riskier actions they have never done before

*Figure 6: Sample Action Timeline for a raised anomaly*
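
The example detections listed above can be turned into plain-language reasons with a few lines of code. The sketch below is illustrative only: the `explain` function, its message wording, and the `RISKY_ACTIONS` set (seeded with the risky actions named earlier in this article) are assumptions, not the product's actual logic.

```python
# Risky actions named earlier in the article; a real list would be longer.
RISKY_ACTIONS = {"AuthorizeSecurityGroupIngress", "DeleteGroupPolicy", "AttachRolePolicy"}


def explain(action, hour, known_actions, working_hours):
    """Return plain-language reasons why an event contributes to an anomaly."""
    reasons = []
    if action not in known_actions:
        if action in RISKY_ACTIONS:
            reasons.append(f"performed the risky action {action} for the first time")
        else:
            reasons.append(f"performed {action} for the first time")
    if hour not in working_hours:
        reasons.append(f"was active at {hour:02d}:00, outside their normal working hours")
    return reasons


# Hypothetical user profile: known actions and learned working hours (09:00 to 17:59).
known = {"DescribeInstances", "RunInstances"}
hours = set(range(9, 18))

suspicious = explain("AttachRolePolicy", 3, known, hours)
routine = explain("DescribeInstances", 10, known, hours)
```

Events with a non-empty list of reasons are the ones that would be highlighted in the timeline.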

Conclusion

The Sophos AWS Activity Anomaly Detection models, available with Sophos Cloud Optix, enhance and extend AWS CloudTrail events by flagging user activities that deviate from a user’s normal behavior. Clear anomaly explanations and a detailed timeline view allow for faster investigation by bringing the flagged anomalous activities to life.

*IAM user activity refers to both actions performed directly by IAM users and actions performed by IAM roles that are assumed by authorized entities.