How LLMs that reason step by step resist attacks better

15,774 followers

𝗢𝗻𝗲 𝗼𝗳 𝘁𝗵𝗲 𝗰𝗹𝗲𝗮𝗿𝗲𝘀𝘁 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀 𝗳𝗿𝗼𝗺 𝘁𝗵𝗲 𝗻𝗲𝘄 𝗕𝗮𝗰𝗸𝗯𝗼𝗻𝗲 𝗕𝗿𝗲𝗮𝗸𝗲𝗿 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸 𝘀𝘂𝗿𝗽𝗿𝗶𝘀𝗲𝗱 𝗲𝘃𝗲𝗻 𝘂𝘀. 🧩 Models that reason step by step are harder to break. When we evaluated 31 popular LLMs using threat snapshots built from 194,000 real human attack attempts, a consistent pattern emerged: LLMs that “think out loud” were about 𝟭𝟱% 𝗹𝗲𝘀𝘀 𝘃𝘂𝗹𝗻𝗲𝗿𝗮𝗯𝗹𝗲 to injection-based attacks. 𝗪𝗵𝘆? Reasoning gives models a brief moment to evaluate malicious context instead of acting on it immediately. That single pause changes how they handle adversarial pressure, and makes exploitation noticeably harder. As AI agents take on more autonomy, this matters. The way a model reasons shapes how it behaves under attack, not just how well it performs on tasks. The full analysis is here, including how different backbones responded under real adversarial conditions: 👉 https://lnkd.in/dZ38Tpjp

The Backbone Breaker Benchmark: Testing the Real Security of AI Agents | Lakera – Protecting AI teams that disrupt the world. lakera.ai

To view or add a comment, sign in

More from this author

Why AI observability in computer vision matters from day one.

Explore content categories