The cybersecurity landscape is evolving rapidly, and with it, the tools and techniques used to defend against sophisticated cyber threats. Microsoft researchers have recently unveiled ExCyTIn-Bench, a unique open-source benchmarking framework designed to evaluate the performance of AI systems, particularly large language models (LLMs), in conducting real-world cybersecurity investigations. Unlike traditional benchmarks that focus on static knowledge or threat intelligence trivia, ExCyTIn-Bench immerses AI agents in simulated Security Operations Center (SOC) environments, challenging them to navigate complex, noisy, multi-stage cyberattack scenarios using rich datasets derived from Microsoft Sentinel and related Azure services. This benchmark not only tests an AI agent's ability to answer security questions but also evaluates its investigative reasoning, query formulation, and evidence synthesis, providing a holistic view of AI-driven cyber defense potential.
The Challenge: AI as an Autonomous Cybersecurity Investigator
The primary objective of ExCyTIn-Bench is to assess whether AI can effectively answer the kinds of questions security analysts face daily when investigating alerts. This is no trivial task. The AI must connect to the dataset, comprehend the context of the incident, formulate and execute SQL queries against multiple log tables, analyze partial results, pivot to related data sources, and iterate this process until it identifies the correct indicators of compromise or attack patterns. It must navigate complex database schemas without prior knowledge, manage ambiguous or noisy data, and synthesize findings into coherent conclusions. This level of autonomous reasoning, multi-hop data exploration, and dynamic problem-solving represents a significant challenge in AI capability for cybersecurity.
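The query-pivot-iterate loop described above can be sketched in a few lines. The example below is purely illustrative, not from the benchmark: it uses an in-memory SQLite database as a stand-in for SIEM log tables, and the table names, column names, and values are assumptions chosen to show the pattern of discovering a schema, querying one table, and pivoting on a shared indicator into another.

```python
import sqlite3

def build_demo_db():
    # Stand-in for two SIEM log tables; names and values are illustrative.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE SigninLogs (user TEXT, ip TEXT)")
    db.execute("CREATE TABLE DeviceEvents (ip TEXT, process TEXT)")
    db.execute("INSERT INTO SigninLogs VALUES ('u141', '10.0.0.5')")
    db.execute("INSERT INTO DeviceEvents VALUES ('10.0.0.5', 'powershell.exe')")
    return db

def investigate(db, user):
    # Step 1: discover the schema, since the agent has no prior knowledge of it.
    tables = [r[0] for r in db.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    # Step 2: query the first table for the entity of interest.
    ip = db.execute(
        "SELECT ip FROM SigninLogs WHERE user = ?", (user,)).fetchone()[0]
    # Step 3: pivot on the shared indicator (the IP) into a related table.
    proc = db.execute(
        "SELECT process FROM DeviceEvents WHERE ip = ?", (ip,)).fetchone()[0]
    return {"tables": tables, "pivot_ip": ip, "process": proc}

finding = investigate(build_demo_db(), "u141")
print(finding["process"])  # prints "powershell.exe", reached via the pivot
```

In the benchmark the agent must carry out each of these steps itself, over many more tables and with far noisier data, iterating until the evidence supports a conclusion.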
Bridging the Gap: From Purple Team Attack Simulations to Blue Team Investigation Benchmarks
At Security Risk Advisors (SRA), we have long championed rigorous, objective testing of cybersecurity defenses through our free, shared Threat Simulation Index program using VECTR. This initiative curates realistic attack scenarios based on current threat intelligence, enabling organizations to assess and benchmark their detection and prevention capabilities against peers using their own security tools. This yields a discrete, quantitative success score. The Threat Simulation Index focuses on simulating attacker behaviors and testing how well security tools detect and respond to them.
ExCyTIn-Bench complements this by providing a standardized, data-driven framework to evaluate incident investigation and response capabilities. When used together, these frameworks form a comprehensive evaluation ecosystem, allowing organizations to measure their resilience across the full threat lifecycle, from attack simulation and detection to in-depth investigation and response. This allows security teams to identify gaps and prioritize improvements with confidence, knowing they are benchmarking both attack readiness and investigative proficiency against industry standards.
Inside ExCyTIn-Bench: Realistic Datasets and Structured Q&A for AI Evaluation
ExCyTIn-Bench is built on eight distinct datasets, each simulating a full SIEM environment populated with authentic logs from a controlled Azure tenant. These datasets encompass a broad spectrum of multi-step cyberattacks, capturing the complexity and noise typical of real-world SOC data. For each dataset, the benchmark provides a rich set of context-driven questions, paired with definitive answers and detailed solution paths that trace the investigative steps needed to reach the conclusion.
For example, consider this question from the benchmark:
Context:
“An automated investigation was initiated manually on the host `vnevado-win10e.vnevado.alpineskihouse.co` by user `u141(u141@ash.alpineskihouse.co)`. This investigation aims to identify and review threat artifacts for possible remediation. As part of this investigation, attention is drawn to a suspicious PowerShell activity where a file was downloaded or an encoded command was executed. This activity is often associated with attackers trying to bypass security mechanisms.”
Question:
“Can you identify the security identifier (SID) of the account associated with the suspicious PowerShell download or encoded command execution?”
Answer:
“S-1-5-21-1870156660-7534030218-135086563-1193”
Solution Path:
- An automated investigation was started manually by user `u141@ash.alpineskihouse.co` on the host `vnevado-win10e`.
- Suspicious PowerShell activity was detected involving a process that executed PowerShell to download a file or run an encoded command. This activity was associated with the user account having SID `S-1-5-21-1870156660-7534030218-135086563-1193`.
This structured approach allows AI agents to be evaluated not just on final answers, but on their ability to follow investigative logic and reasoning akin to a human analyst.
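The context/question/answer/solution-path structure shown above could be modeled as a small record type. The field names below are our own assumptions, not the benchmark's actual schema, and the exact-match grader is the simplest possible baseline check on a final answer.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    # Field names are illustrative; the benchmark's actual schema may differ.
    context: str
    question: str
    answer: str
    solution_path: list[str] = field(default_factory=list)

item = BenchmarkItem(
    context="An automated investigation was initiated manually on the host ...",
    question="Can you identify the security identifier (SID) of the account ...?",
    answer="S-1-5-21-1870156660-7534030218-135086563-1193",
    solution_path=[
        "Automated investigation started by u141@ash.alpineskihouse.co",
        "Suspicious PowerShell activity tied to the SID above",
    ],
)

def exact_match(item: BenchmarkItem, predicted: str) -> bool:
    # Simplest baseline grading: exact match on the final answer string.
    return predicted.strip() == item.answer

print(exact_match(item, "S-1-5-21-1870156660-7534030218-135086563-1193"))  # True
```

Because each item also carries a solution path, graders can go beyond exact-match and score the reasoning itself.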
What ExCyTIn-Bench Does Not Cover
It is important to clarify that ExCyTIn-Bench is not designed to evaluate AI capabilities in threat detection, prevention, or real-time response actions. It does not test the AI's ability to discover new threats, generate prevention rules, or execute automated remediation. Instead, it focuses squarely on the investigative phase: how well AI can analyze existing security data, reason through complex attack narratives, and provide accurate, explainable answers to analyst queries. This distinction helps organizations understand exactly where AI can augment human defenders and where further development is needed.
Insights from ExCyTIn-Bench: Evaluating and Enhancing AI Investigation Skills
The benchmark yields valuable insights into the raw investigative capabilities of various LLMs. Microsoft’s initial evaluations tested a wide range of models, from proprietary to open-source, revealing that even state-of-the-art models leave significant room for improvement. Notably, Microsoft implemented a nuanced scoring methodology that awards partial credit when AI agents follow part of the correct investigative path, recognizing incremental progress rather than binary success or failure.
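A simplified sketch of partial-credit scoring follows. This is not Microsoft's actual grading implementation; it merely illustrates the idea the source describes, awarding a fraction of credit for each reference investigative fact the agent's trace covers.

```python
def partial_credit(reference_steps, agent_trace):
    """Fraction of reference investigative facts evidenced in the agent's
    trace. A simplified stand-in for the benchmark's real grader."""
    if not reference_steps:
        return 0.0
    trace_text = " ".join(agent_trace).lower()
    hits = sum(1 for step in reference_steps if step.lower() in trace_text)
    return hits / len(reference_steps)

# Illustrative reference facts and agent trace (not benchmark data).
reference = ["s-1-5-21-1870156660", "powershell"]
trace = [
    "Queried DeviceEvents for PowerShell activity",
    "Found SID S-1-5-21-1870156660-7534030218-135086563-1193",
]
print(partial_credit(reference, trace))  # 1.0: both reference facts appear
```

An agent that found the PowerShell activity but never recovered the SID would score 0.5 here rather than 0, which is the point: incremental progress along the investigative path is recognized.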
Beyond this baseline, Microsoft explored advanced techniques such as integrating training data, employing multiple query attempts (Best-of-N), and enabling self-reflection to learn from mistakes. These optimizations significantly boost performance, demonstrating the importance of iterative reasoning and memory in complex investigations. Such findings guide future research and development in AI-driven cybersecurity, highlighting promising directions like reinforcement learning and multi-agent collaboration. In most cases, however, building the technology to enable these techniques requires significant custom development or configuration. In a future post, we will show how we can use SCALR AI to build new and better techniques that significantly improve on the scores Microsoft achieved.
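To make Best-of-N and self-reflection concrete, here is a minimal sketch of both. The model and grader are stubs (assumptions for illustration); a real implementation would wire in an actual LLM call and the benchmark's grader.

```python
def best_of_n(ask_model, score, question, n=4):
    """Best-of-N sketch: sample several candidate answers and keep the one
    the scorer prefers. Both callables are placeholders for a real LLM call
    and a real grader."""
    candidates = [ask_model(question) for _ in range(n)]
    return max(candidates, key=score)

def reflect_and_retry(ask_model, check, question, max_tries=3):
    """Self-reflection sketch: feed the rejected answer back into the next
    prompt so the model can learn from its mistake within the episode."""
    answer, feedback = "", ""
    for _ in range(max_tries):
        answer = ask_model(question + feedback)
        if check(answer):
            return answer
        feedback = f"\nPrevious answer '{answer}' was rejected; try a different approach."
    return answer

# Demo with a stubbed "model" that answers correctly on its third sample.
SID = "S-1-5-21-1870156660-7534030218-135086563-1193"
answers = iter(["wrong-sid", "another-wrong-sid", SID, "wrong-sid"])
stub = lambda prompt: next(answers)
looks_like_sid = lambda a: a.startswith("S-1-5-21")
print(best_of_n(stub, looks_like_sid, "Which SID ran the encoded command?", n=4))
```

Even with these trivial stubs the structure shows why the techniques help: Best-of-N trades compute for a better chance of landing on the right path, and reflection gives the agent a memory of what already failed.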
Scores
The following table is an updated set of benchmark results achieved by Microsoft across different LLMs. It shows the relative performance of each model and gives a good sense of general capabilities. You'll notice that the newest models tend to have the best overall success rates, but as we'll show in future posts, the LLM is not the sole driver of success. Evaluation and training techniques can make as much of a difference as the model itself.
Looking Ahead: SRA’s Vision for ExCyTIn-Bench in 2026
At Security Risk Advisors (SRA), we are excited about the potential of ExCyTIn-Bench to transform how organizations evaluate and improve their cybersecurity investigation capabilities. In 2026, we plan to publish a series of in-depth blog posts that will:
- Analyze Microsoft’s benchmark results in detail, exploring the effectiveness of different AI techniques and reasoning strategies.
- Release an automated testing framework enabling security teams to load all ExCyTIn log data and conduct their own AI evaluation experiments seamlessly.
- Provide practical examples demonstrating how to leverage Microsoft Security Copilot and other third-party or custom AI agents to tackle the benchmark challenges.
- Showcase how SRA’s SCALR AI consistently outperforms the best models by 30-50%, illustrating the power of specialized AI tailored for cybersecurity investigations.
Through these efforts, SRA aims to empower security professionals with the tools, knowledge, and confidence to integrate AI effectively into their SOC workflows, enhancing detection, investigation, and response in an increasingly complex threat landscape.
Stay tuned to the SRA blog for upcoming posts in this series, and join us in exploring the cutting edge of AI-driven cybersecurity defense. For more on our Threat Simulation Index and purple team initiatives, visit https://VECTR.io and learn how we help organizations benchmark and elevate their security posture from attack simulation to incident investigation.
Mike Pinch
Mike is Security Risk Advisors’ Chief Technology Officer, heading innovation, software development, AI research & development and architecture for SRA’s platforms. Mike is a thought leader in security data lake-centric capabilities design. He develops in Azure and AWS, and in emerging use cases and tools surrounding LLMs. Mike is certified across cloud platforms and is a Microsoft MVP in AI Security.
Prior to joining Security Risk Advisors in 2018, Mike served as the CISO at the University of Rochester Medical Center. Mike is nationally recognized as a leader in the field of cybersecurity, has spoken at conferences including HITRUST, H-ISAC, RSS, and has contributed to national standards for health care cybersecurity frameworks.