Since 2018, MITRE has conducted its annual ATT&CK evaluations to set the industry standard for measuring the threat detection and prevention capabilities of vendors in the endpoint security market.
The experts at MITRE study some of the most sophisticated real-world threat groups, analyze their methods, and develop custom tools that replicate the techniques and tactics these adversaries use to extort hundreds of millions of dollars from businesses annually.
This year’s evaluations saw significant changes that more closely simulate today’s evolving threat landscape. These changes brought a much higher degree of rigor to the process, making the results far more meaningful.
Let’s explore how the MITRE evaluation works and how this year was different.
What’s different this year?
Participants in MITRE’s 2024 evaluations faced new challenges and more operating system platforms to defend. Together, these changes produced a more accurate portrayal of real-world effectiveness.
False positives
In this year’s evaluation, vendors faced the challenge of not alerting on 20 benign (false-positive) test events in the detection phase and 30 in the protection phase. These false-positive signals replicated normal business activity that should not have been reported or prevented as a threat.
Anyone who has worked in or adjacent to a SOC knows the importance of assessing false positives. Tools that don’t create unnecessary alerts on benign activity make life infinitely easier for analysts, allowing them to focus response efforts on what truly matters and prioritize those incidents most effectively. False positives at the prevention stage have the most significant impact, potentially disrupting business-critical processes running on the endpoint.
Measuring false positives also guards against spamming tactics from participants — that is, ratcheting up sensitivity to catch malicious behavior in an evaluation but in practice creating too much noise to be useful to a real analyst.
Continuous evaluation model
This year, MITRE’s evaluation ran over several days without breaks, rather than in the neatly defined stages of years past. By adopting this model, MITRE aims to assess how security products perform under sustained pressure, similar to the relentless nature of modern cyber threats.
A continuous evaluation model tests the product’s ability to correlate events over time. In addition to mounting an immediate response to individual attack techniques, the product must adapt to the evolving threat and create the most complete picture possible of a dynamic, ongoing attack campaign.
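To make that idea concrete, here is a minimal, hypothetical sketch of the kind of correlation a continuous evaluation rewards. Nothing below comes from MITRE’s methodology or any vendor’s product; the alert fields, the six-hour window, and the chosen technique IDs are assumptions made purely for illustration. The idea is simply that alerts on the same host occurring close together in time are stitched into one incident, so an analyst sees a single evolving campaign rather than disconnected events.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical alert record; the fields are illustrative, not any product's schema.
@dataclass
class Alert:
    host: str
    technique: str   # e.g., an ATT&CK technique ID such as "T1059.001"
    timestamp: datetime

@dataclass
class Incident:
    host: str
    alerts: list = field(default_factory=list)

def correlate(alerts, window=timedelta(hours=6)):
    """Group alerts per host into incidents when each alert falls within
    `window` of the previous one, approximating a campaign timeline."""
    incidents = []
    open_incidents = {}  # host -> (incident, timestamp of last alert)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_incidents.get(alert.host)
        if current and alert.timestamp - current[1] <= window:
            incident = current[0]          # extend the ongoing incident
        else:
            incident = Incident(host=alert.host)
            incidents.append(incident)     # start a new incident
        incident.alerts.append(alert)
        open_incidents[alert.host] = (incident, alert.timestamp)
    return incidents

if __name__ == "__main__":
    start = datetime(2024, 11, 1, 9, 0)
    demo = [
        Alert("srv-01", "T1566.001", start),                        # phishing attachment
        Alert("srv-01", "T1059.001", start + timedelta(hours=2)),   # PowerShell execution
        Alert("srv-01", "T1486", start + timedelta(days=2)),        # later encryption activity
    ]
    for incident in correlate(demo):
        print(incident.host, [a.technique for a in incident.alerts])
```

Running this sketch groups the first two alerts into one incident and treats the encryption activity two days later as a separate one; a product tested under the continuous model is rewarded for tying all three back to the same ongoing campaign.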
Expanded scope
MITRE challenged vendors with a more diverse array of attack types and adversary techniques in this year’s evaluation, with attacks targeting Windows, Linux, and macOS operating systems. Such a broadened scope assesses the participants’ abilities to stand up to the speed, sophistication, and scale of modern attacks and the most well-resourced threat actors.
Inclusion of cloud environments
This year’s evaluations included more cloud-based attack scenarios, reflecting the growing importance of cloud security and the unique challenges posed by cloud environments. Considering today’s hybrid and multi-cloud landscapes, this change helps organizations better understand how these solutions protect their cloud assets and data, especially given the focus on out-of-the-box performance.
The MITRE Evaluation Structure
To test the performance of participating tool vendors, MITRE emulated two types of adversaries.
- Ransomware-as-a-Service attacks against Windows and Linux systems. These emulations showcased common features across high-profile ransomware campaigns, like abusing legitimate tools, encrypting data, and disabling critical services or processes.
- Attack patterns demonstrated by the Democratic People’s Republic of Korea (DPRK, North Korea) against macOS. These emulations mimicked multi-staged and modular malware in operations involving elevated privileges and credential targeting.
Why these two emulations?
Ransomware remains one of the most prevalent and rapidly evolving attack types worldwide, fueled by increasingly advanced tooling and ransomware-as-a-service models.
As for the DPRK, North Korea is one of the most dynamic and sophisticated adversaries out there, regularly targeting high-value systems and organizations.
Phase 1: Detection
The detection phase of the evaluation assesses how well the solution autonomously identifies malicious and suspicious events, meets the detection criteria for the tactic, and details the technique with which that action was performed.
Performance categories are determined by the level of detail provided.
- General detection identifies the malicious event and answers the “who, what, when, and where” of that event.
- Tactic-level detection identifies the four Ws but also addresses the fifth W: “why” the adversary might perform this tactic. Tactic-grade detections link a description at the ATT&CK tactic level with the behavior under test, e.g., spear phishing.
- Technique-level detection, the highest-performing category, details the five Ws and explicitly links a description at the ATT&CK (sub-)technique level with the behavior under test, e.g., spear phishing via attachment.
For example, a general detection would flag “powershell.exe Invoke-Mimikatz*” as suspicious activity. A tactic-level detection would additionally relate the activity to credential access. A technique-level detection would go even further, specifying how the behavior maps to what occurred, such as OS credential dumping of credentials stored in the process memory of the Local Security Authority Subsystem Service (LSASS).
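To show how those three categories differ in the context they attach, the hypothetical snippet below enriches the same raw event with progressively more ATT&CK metadata. The event fields and labels are invented for illustration and are not MITRE’s scoring format or any vendor’s output; the ATT&CK IDs (TA0006 for Credential Access, T1003.001 for LSASS memory dumping) are the standard ones for this example.

```python
# Hypothetical illustration only: the same raw event annotated at each
# detection level. The event schema is invented; the ATT&CK IDs are real.
raw_event = {
    "who": "CORP\\svc-web",
    "what": "powershell.exe Invoke-Mimikatz",
    "when": "2024-11-01T09:14:03Z",
    "where": "srv-01",
}

# General detection: the four Ws plus a verdict that the activity is suspicious.
general_detection = {**raw_event, "verdict": "suspicious"}

# Tactic-level detection: adds the fifth W, why the adversary acted.
tactic_detection = {
    **general_detection,
    "tactic": "Credential Access (TA0006)",
}

# Technique-level detection: names the specific (sub-)technique and detail.
technique_detection = {
    **tactic_detection,
    "technique": "OS Credential Dumping: LSASS Memory (T1003.001)",
    "detail": "Credentials read from LSASS process memory",
}

for label, detection in [("General", general_detection),
                         ("Tactic", tactic_detection),
                         ("Technique", technique_detection)]:
    print(f"{label}: {detection}")
```

Each level is a superset of the one before it, which is why technique-level detections score highest: they carry the most context an analyst can act on without further investigation.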
The evaluations also note “detection modifiers.” Vendors are given a chance to adjust their performance, testing whether, with some additional manual intervention, they can detect an attack they initially missed. In the real world, security teams can’t try again on the same attack step, so these types of detections should be treated as informational only, not as a true measure of the solution’s effectiveness.
- A delayed detection modifier means an autonomously generated event required manual human augmentation to meet the documented Detection Criteria.
- A configuration change modifier means the vendor’s solution was changed while the evaluation was still in progress. Often, this modifier shows that additional data can be collected and/or processed with new detection content. Detections with a configuration change modifier are identified during an optional fourth day of testing, when vendors can redo steps in the evaluation in hopes of achieving a better detection result.
Phase 2: Protection
The Protection phase evaluates whether the tool blocked the malicious behavior.
The performance categories for the Protection phase are binary:
- The None category indicates the solution did not block the behavior under test.
- The Blocked category indicates the solution successfully blocked the behavior under test.
The Protection phase also assesses how far the malicious activity progressed before the tool stopped the attack.
MITRE: A barometer for the cybersecurity world
The MITRE evaluation stands as a thorough, modern assessment of whether a vendor is keeping pace with the leading-edge techniques adversaries deploy against enterprise businesses. We at Palo Alto Networks are grateful to MITRE for helping us see how we stack up against the industry.
For decision-makers choosing products, MITRE provides a valuable scorecard to guide their search.
Join us for an in-depth look at the just-released MITRE ATT&CK Round 6 evaluations and learn how Palo Alto Networks excels in stopping advanced cyberattacks. Hear from experts on key changes, real-world adversary testing, and how vendors performed in this challenging evaluation. RSVP now!