Examining the ExCyTIn-Bench Approach for Benchmarking AI Incident Response Capabilities

by Mike Pinch | Jan 20, 2026

Introduction: The Growing Need for AI in Cybersecurity

As cybersecurity professionals, we understand the growing complexity and volume of threats that Security Operations Centers (SOCs) face daily. Microsoft researchers have recently introduced ExCyTIn-Bench, a pioneering benchmarking framework designed to rigorously evaluate how well AI systems, especially large language models (LLMs), can perform in real-world cyber threat investigations. Unlike traditional tests that measure memorized knowledge or simple Q&A, ExCyTIn-Bench immerses AI agents in realistic SOC environments, challenging them to navigate complex, multi-stage cyberattack scenarios using authentic security logs. This testing program matters because it provides a standardized, data-driven way to assess AI’s investigative reasoning, query formulation, and evidence synthesis: skills critical for augmenting human analysts and enhancing automated incident response. In this post, we take a closer look at how the test works.

 

What is ExCyTIn-Bench?

The creation of ExCyTIn-Bench is grounded in realistic data and a thoughtful question generation process. The dataset originates from a controlled Azure tenant called “Alpine Ski House,” a fictional company environment populated with authentic security logs from Microsoft Sentinel and related services. To protect privacy, personally identifiable information (PII) was meticulously anonymized using a combination of pattern-based substitutions and large language model-driven replacements, ensuring data utility without compromising confidentiality. The benchmark’s questions are generated through a novel approach: constructing bipartite incident graphs that link alerts and entities, mimicking how SOC analysts explore relationships during investigations. The ‘graph’ is really just a series of linkages between incidents, alerts, devices, users, and other IOCs. As happens in the real world, to correctly answer a question you must first answer intermediary questions along the way; this chain is what the graph represents. LLMs then craft context-rich, specific questions anchored to these graph paths, ensuring each query has a deterministic answer and reflects realistic investigative challenges.
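To make the graph idea concrete, here is a minimal sketch of how an alert/entity bipartite graph could be represented and how a path through it yields a chain of intermediary questions. It uses the common networkx library; the alert names, entities, and edges are purely hypothetical and not the benchmark’s actual schema.

```python
# Minimal, hypothetical bipartite incident graph: alerts on one side,
# entities (users, hosts, IPs) on the other. Not the real ExCyTIn-Bench data.
import networkx as nx

g = nx.Graph()

alerts = ["A1: Suspicious sign-in", "A2: Malicious PowerShell", "A3: Data exfiltration"]
entities = ["user: jdoe", "host: WKSTN-042", "ip: 203.0.113.7"]
g.add_nodes_from(alerts, kind="alert")
g.add_nodes_from(entities, kind="entity")

# Edges connect each alert to the entities it references.
g.add_edges_from([
    ("A1: Suspicious sign-in", "user: jdoe"),
    ("A1: Suspicious sign-in", "ip: 203.0.113.7"),
    ("A2: Malicious PowerShell", "user: jdoe"),
    ("A2: Malicious PowerShell", "host: WKSTN-042"),
    ("A3: Data exfiltration", "host: WKSTN-042"),
])

# A question is anchored to a path: the start node is given in the question,
# the end node is the ground-truth answer, and every node in between is an
# intermediary the agent must uncover first.
path = nx.shortest_path(g, "A1: Suspicious sign-in", "A3: Data exfiltration")
print(" -> ".join(path))
# A1: Suspicious sign-in -> user: jdoe -> A2: Malicious PowerShell
#   -> host: WKSTN-042 -> A3: Data exfiltration
```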

 

Scoring Methodology: Partial Credit for Investigative Progress

ExCyTIn-Bench employs a nuanced scoring methodology that goes beyond simple pass/fail judgments. Instead of treating answers as binary outcomes, it assigns partial credit based on how closely an AI agent’s submitted response aligns with the ground truth. This is achieved by evaluating the agent’s intermediate investigative steps along the shortest path in the incident graph, with decayed rewards applied for partial progress. For example, if an agent uncovers some but not all key indicators of compromise (IoCs) or intermediate evidence, it receives proportional credit. This fine-grained evaluation framework not only reflects real-world investigative workflows but also provides valuable feedback for training and improving AI agents through reinforcement learning. This scoring process is difficult to emulate without the original graphs being made available, so in future efforts SRA will rely on stricter but simpler pass/fail scoring.
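To illustrate the idea (not the published formula), here is a toy partial-credit scorer: the agent gets full credit for reaching the final answer, and otherwise the reward decays for every hop on the ground-truth path it failed to uncover. The decay factor, path representation, and scoring function are all assumptions made for the sake of the example.

```python
# Toy partial-credit scorer: NOT the published ExCyTIn-Bench formula,
# just a sketch of decayed rewards along the ground-truth shortest path.

def partial_credit(gt_path: list[str], found: set[str], decay: float = 0.5) -> float:
    """Full credit (1.0) for reaching the final answer; otherwise the reward
    decays by `decay` for each hop on the ground-truth path left undiscovered."""
    if gt_path[-1] in found:
        return 1.0
    # Deepest ground-truth node the agent actually surfaced.
    deepest = max((i for i, node in enumerate(gt_path) if node in found), default=-1)
    missing_hops = (len(gt_path) - 1) - deepest
    return decay ** missing_hops

gt = ["A1: Suspicious sign-in", "user: jdoe", "A2: Malicious PowerShell",
      "host: WKSTN-042", "A3: Data exfiltration"]

# Agent found the pivot user and the second alert, but never reached the answer.
print(partial_credit(gt, {"user: jdoe", "A2: Malicious PowerShell"}))  # 0.25
```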

 

Advanced Prompting and Reasoning Strategies

Several advanced prompting and reasoning strategies have been explored to enhance AI performance on ExCyTIn-Bench. Key approaches include:

  • Base Prompting – The foundational method where the AI is instructed to think aloud and generate SQL queries iteratively to explore the database and answer questions. While straightforward, this method sets the baseline for comparison.
  • Strategy Prompting – Enhances the base prompt by explicitly guiding the AI to start from alert tables and use them as anchors for investigation, emulating human analyst workflows. This typically improves the relevance and efficiency of queries.
  • ReAct (Reasoning and Acting) – Incorporates few-shot examples demonstrating how to combine reasoning with actions (SQL queries) in a stepwise manner. This approach helps the model better plan its investigative steps and reduces errors.
  • ExpeL (Experiential Learning) – This technique distills rules from successful past investigations (supplied as ‘training’ data) and uses them as external memory during inference, guiding the AI agent towards more effective querying patterns and reducing redundant or irrelevant searches.
  • Best-of-N Sampling – The AI tries to answer each question several times in different ways and then picks the best answer from those attempts. This approach increases the chances of finding the right solution.
  • Reflection – Builds on Best-of-N by allowing the AI to critique its failed attempts, learn from mistakes, and retry with improved strategies, thereby iteratively refining its approach.

These methods collectively illustrate how incorporating memory, self-reflection, and structured reasoning can significantly boost AI investigative performance. The sketch below shows how the base reasoning-and-querying loop and Best-of-N sampling fit together in practice.
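As a rough illustration only: the `llm`, `run_sql`, and `score` callables below are placeholders for whatever model client, log store, and answer judge you use; this is not Microsoft’s harness, just a minimal sketch of an agent that alternates reasoning with SQL actions and then keeps the best of several attempts.

```python
# Minimal sketch of a Base/ReAct-style investigation loop plus Best-of-N
# sampling. `llm`, `run_sql`, and `score` are caller-supplied placeholders,
# not real APIs from ExCyTIn-Bench or Microsoft.
import re

MAX_STEPS = 15

def investigate(question: str, llm, run_sql) -> str:
    """Alternate free-form reasoning and SQL actions until a final answer."""
    transcript = (
        f"Question: {question}\n"
        "Think step by step. Reply with either SQL: <query> or FINAL: <answer>."
    )
    for _ in range(MAX_STEPS):
        reply = llm(transcript)
        if reply.strip().startswith("FINAL:"):
            return reply.split("FINAL:", 1)[1].strip()
        match = re.search(r"SQL:\s*(.+)", reply, re.DOTALL)
        if match:
            rows = run_sql(match.group(1))  # query the security log tables
            transcript += f"\n{reply}\nObservation: {rows}"
        else:
            transcript += f"\n{reply}\n(No action detected; try again.)"
    return "no answer"

def best_of_n(question: str, llm, run_sql, score, n: int = 5) -> str:
    """Run the loop n times and keep the attempt the judge scores highest."""
    attempts = [investigate(question, llm, run_sql) for _ in range(n)]
    return max(attempts, key=score)
```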

 

Key Takeaways: Process Engineering Over Model Fixation

One minor criticism of the overall benchmarking process is its primary focus on specific LLMs. It is a valid focus, but one that will be largely irrelevant in a year’s time, as all of these models will have been replaced. An underrepresented takeaway from this framework is the difference in results achieved with the same LLM when you apply the different prompting and reasoning strategies shown above. These techniques and their relative performance are likely to remain relevant far longer, and to improve proportionally as you upgrade your LLMs going forward. For example, the researchers report that with the same LLM (gpt-4o), they were able to move from a 26% (0.26) success (‘reward’) score using the Base approach all the way to a 56% (0.563) success score using a combination of multiple methods. The key takeaway should be less fixation on a specific type or brand of LLM, and instead an understanding of the process engineering needed to take advantage of the best LLMs available at the time.

 

Noteworthy Results: Trends in AI Performance

Regarding noteworthy results, the latest evaluations of ExCyTIn-Bench reveal encouraging trends. Proprietary models such as GPT-5 with high reasoning settings lead the pack, achieving average rewards above 56%, indicating strong investigative capabilities. GPT-4.1 and GPT-4o variants also perform well, with average rewards around 30% to 34%. Importantly, open-source models are slowly closing the gap: Llama4-17b-Mav and similar models delivered competitive results, with average rewards nearing 30%, showcasing the maturation of accessible AI technologies for cybersecurity. These findings highlight that while model architecture and size matter, prompting strategies and training approaches play critical roles in maximizing AI effectiveness. It should be noted that Gemini-2.5-flash was removed from the updated results after initial publication because Google’s terms of service do not allow benchmarking (it scored about the same as the Llama4 model mentioned above).

 

Conclusion: Pushing the Frontier of AI-Driven Cybersecurity

In summary, ExCyTIn-Bench represents a significant leap forward in benchmarking AI for cybersecurity investigations, providing a realistic, rigorous, and granular evaluation framework. It underscores the challenges AI faces in navigating complex, noisy security data and the promise of advanced reasoning and memory-augmented techniques. At Security Risk Advisors, we are excited to build on this foundation. In our next blog post, we will release open-source tools on the SRA GitHub repository to help security teams set up their own ExCyTIn-Bench testing environments, enabling evaluation of any AI solutions integrated into their SOC workflows. Following that, we’ll showcase how SRA’s proprietary SCALR AI framework leverages sophisticated workflow orchestration and training methods to outperform Microsoft’s benchmarks using the same underlying models. Stay tuned as we continue to explore and push the frontier of AI-driven cybersecurity defense.

Mike Pinch
Chief Technology Officer

Mike is Security Risk Advisors’ Chief Technology Officer, heading innovation, software development, AI research & development, and architecture for SRA’s platforms. Mike is a thought leader in security data lake-centric capabilities design. He develops in Azure and AWS and works on emerging use cases and tools surrounding LLMs. Mike is certified across cloud platforms and is a Microsoft MVP in AI Security.

Prior to joining Security Risk Advisors in 2018, Mike served as the CISO at the University of Rochester Medical Center. Mike is nationally recognized as a leader in the field of cybersecurity, has spoken at conferences including HITRUST, H-ISAC, RSS, and has contributed to national standards for health care cybersecurity frameworks.