Measuring how often an AI agent succeeds at a task can help us assess its capabilities – but it doesn’t tell the whole story. We’ve been experimenting with transcript analysis to better understand not just how often agents succeed, but why they fail.

Our model evaluations generate thousands of transcripts, each of which can contain an entire novel’s worth of text. A transcript is a record of everything the model did during a task, including the external tools it accessed and its outputs at each step.

In a recent case study, we analysed almost 6,400 transcripts from AISI evaluations of nine models on 71 cyber tasks. We studied several features of these transcripts, including overall length and composition, and the agent’s commentary throughout. We found that there are many reasons a model may fail to complete a task beyond capability limitations, including safety refusals, lack of compliance with scaffolding instructions, and difficulty using tools.

We’re sharing our analysis to encourage others conducting safety evaluations to review their own transcripts in a systematic and quantitative way. This can help foster more accurate and robust claims about agent capabilities.

Read more on our blog: https://lnkd.in/eiCn6zkP
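For readers who want to try something similar on their own evaluation logs, here is a minimal sketch of the kind of quantitative tally the post describes. It assumes a hypothetical JSON transcript schema (field names such as "steps", "task_passed" and "tool_call_failed" are illustrative, not AISI’s actual format), and a crude keyword heuristic standing in for careful, rubric-based labelling:

```python
import json
from collections import Counter
from pathlib import Path

# Illustrative refusal phrases; a real analysis would use a much richer rubric.
REFUSAL_MARKERS = ("i can't assist", "i cannot help", "i won't provide")

def label_transcript(transcript: dict) -> str:
    """Toy heuristic labeller for a single transcript (hypothetical schema)."""
    steps = transcript.get("steps", [])
    agent_text = " ".join(s.get("output", "").lower() for s in steps)
    if transcript.get("task_passed"):
        return "success"
    if any(marker in agent_text for marker in REFUSAL_MARKERS):
        return "safety_refusal"
    if any(s.get("tool_call_failed") for s in steps):
        return "tool_use_difficulty"
    if transcript.get("format_violations", 0) > 0:
        return "scaffolding_noncompliance"
    return "capability_limit_or_other"

def tally_failures(transcript_dir: str) -> Counter:
    """Count outcome categories across all transcript JSON files in a directory."""
    counts = Counter()
    for path in Path(transcript_dir).glob("*.json"):
        with open(path) as f:
            counts[label_transcript(json.load(f))] += 1
    return counts

if __name__ == "__main__":
    for category, n in tally_failures("transcripts/").most_common():
        print(f"{category}: {n}")
```

In practice, heuristic labels like these are best treated as a first pass that flags which transcripts deserve a closer manual read, rather than as a final failure taxonomy.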
Chi Zhang, PhD
👏 Vital work, AI Security Institute – the commercial realm needs to know whether it can trust AI agents. Vendor performance claims vary, mostly because they are not tested in a uniform way by unbiased parties. Your work will help organisations make decisions and move beyond proof of concept to realising the benefits faster and with reduced risk.