Measuring how often an AI agent succeeds at a task can help us assess its capabilities – but it doesn’t tell the whole story. We’ve been experimenting with transcript analysis to better understand not just how often agents succeed, but why they fail.

Our model evaluations generate thousands of transcripts, each of which can contain an entire novel’s worth of text. A transcript is a record of everything the model did during a task, including the external tools it accessed and its outputs at each step.

In a recent case study, we analysed almost 6,400 transcripts from AISI evaluations of nine models on 71 cyber tasks. We studied several features of these transcripts, including overall length and composition, and the agent’s commentary throughout. We found that there are many reasons a model may fail to complete a task beyond capability limitations, including safety refusals, lack of compliance with scaffolding instructions, and difficulty using tools.

We’re sharing our analysis to encourage others conducting safety evaluations to review their own transcripts in a systematic and quantitative way. This can help foster more accurate and robust claims about agent capabilities.

Read more on our blog: https://lnkd.in/eiCn6zkP
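For readers who want to try something similar on their own evaluation logs, here is a minimal sketch of the kind of quantitative tally the post describes. It assumes a hypothetical JSON transcript schema (field names such as "steps", "task_passed" and "tool_call_failed" are illustrative, not AISI’s actual format), and a crude keyword heuristic standing in for careful, rubric-based labelling:

```python
import json
from collections import Counter
from pathlib import Path

# Illustrative refusal phrases; a real analysis would use a much richer rubric.
REFUSAL_MARKERS = ("i can't assist", "i cannot help", "i won't provide")

def label_transcript(transcript: dict) -> str:
    """Toy heuristic labeller for a single transcript (hypothetical schema)."""
    steps = transcript.get("steps", [])
    agent_text = " ".join(s.get("output", "").lower() for s in steps)
    if transcript.get("task_passed"):
        return "success"
    if any(marker in agent_text for marker in REFUSAL_MARKERS):
        return "safety_refusal"
    if any(s.get("tool_call_failed") for s in steps):
        return "tool_use_difficulty"
    if transcript.get("format_violations", 0) > 0:
        return "scaffolding_noncompliance"
    return "capability_limit_or_other"

def tally_failures(transcript_dir: str) -> Counter:
    """Count outcome categories across all transcript JSON files in a directory."""
    counts = Counter()
    for path in Path(transcript_dir).glob("*.json"):
        with open(path) as f:
            counts[label_transcript(json.load(f))] += 1
    return counts

if __name__ == "__main__":
    for category, n in tally_failures("transcripts/").most_common():
        print(f"{category}: {n}")
```

In practice, heuristic labels like these are best treated as a first pass that flags which transcripts deserve a closer manual read, rather than as a final failure taxonomy.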
Chi Zhang, PhD
👏 Vital work, AI Security Institute – the commercial realm needs to know whether it can trust AI agents. Vendor performance claims vary, mostly because they are not tested in a uniform way by unbiased parties. Your work will help organisations make decisions and move beyond proof of concept to realising the benefits faster and with reduced risk.