Most AI tools firehose data. Then wonder why users drown. This paper asks: what if confidence could be 𝘧𝘦𝘭𝘵, not just read? Spoiler: decisions get 8.2% more accurate. Min Hun Lee and Martyn Zhe Yu Tok from Singapore Management University tested this in a high-stakes setting: physical stroke rehabilitation. Their AI assistant assessed patient motion quality—but instead of just showing a probability score (e.g. "91% confident"), it visualized why the model was confident. Here’s what they changed: 1️⃣ 𝗔𝗱𝗱𝗲𝗱 𝘃𝗶𝘀𝘂𝗮𝗹 𝗲𝗺𝗯𝗲𝗱𝗱𝗶𝗻𝗴𝘀: Cases were shown in a mapped space, clustered by similarity to known outcomes. 2️⃣ 𝗠𝗮𝗿𝗸𝗲𝗱 𝗶𝗻𝘁𝗲𝗿𝗮𝗰𝘁𝗶𝘃𝗲 𝘁𝗵𝗿𝗲𝘀𝗵𝗼𝗹𝗱𝘀: Users set the cutoff for when to trust the AI—before seeing its output. 3️⃣ 𝗚𝗮𝘃𝗲 𝗲𝘅𝗮𝗺𝗽𝗹𝗲-𝗯𝗮𝘀𝗲𝗱 𝗲𝘅𝗽𝗹𝗮𝗻𝗮𝘁𝗶𝗼𝗻𝘀: Users could explore real, similar cases to understand the AI’s reasoning. 📊 Compared to traditional tools: ▶ 8.2% more correct decisions ▶ 7.15% more course corrections ▶ 7.14% fewer wrong changes after seeing AI suggestions 𝗠𝘆 𝘁𝗮𝗸𝗲𝗮𝘄𝗮𝘆 Confidence shouldn’t be a lonely number—like “0.83” or “92%”—floating beneath a prediction. Too often, interfaces assume users are fluent in machine learning. But most people are visual thinkers—we process icons faster than text, and maps faster than tables. It’s got me rethinking how we present certainty in AI. Numbers are easy to output—but not always easy to trust. What if confidence was something users could navigate, not just read? 🤔 𝗞𝗲𝘆 𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲𝘀 𝘁𝗼 𝗱𝗶𝗴 𝗶𝗻𝘁𝗼 𝘁𝗵𝗲 𝘁𝗼𝗽𝗶𝗰 ▶ "One explanation does not fit all: A toolkit and taxonomy of ai explainability techniques" by Vijay Arya, Rachel Bellamy, Pin-Yu Chen, Amit Dhurandhar, Michael Hind et al. (IBM Research) https://lnkd.in/epa-PWnp ▶ "Does the whole exceed its parts? the effect of ai explanations on complementary team performance" by Gagan Bansal, Sherry Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Túlio Ribeiro, and Daniel Weld https://lnkd.in/eTJCm2WD ▶ "How Much Decision Power Should (A) I Have?: Investigating Patients’ Preferences Towards AI Autonomy in Healthcare Decision Making" by Dajung Kim, Niko Vegt, Valentijn Visch, and Marina Bos-de Vos. https://lnkd.in/e66CDJi4 ▶ "The effects of example-based explanations in a machine learning interface." by Carrie J Cai, Jonas Jongejan, and Jess Holbrook (Google) https://lnkd.in/e9U76gqm
Measuring trust in high-stakes tools
Explore top LinkedIn content from expert professionals.
Summary
Measuring trust in high-stakes tools means assessing how much users can depend on technologies like AI when the consequences of decisions are significant, such as in healthcare, transportation, or software development. Trust isn’t just about technical accuracy—it’s about transparency, accountability, and the emotional confidence people feel when using these systems.
- Prioritize transparency: Make sure your tool explains its decisions in a clear and accessible way, helping users understand not just the outcome but also the reasoning behind it.
- Build accountability: Incorporate checks, safeguards, and human oversight to ensure users feel someone is responsible for the results and any unexpected issues that arise.
- Balance quality and trust: Regularly evaluate both technical performance and user confidence by choosing benchmarks that reflect real-world needs and adapting them as your system evolves.
-
-
✩░▒▓▆▅▃▂▁𝐂𝐚𝐧 𝐖𝐞 𝐓𝐫𝐮𝐥𝐲 𝐓𝐫𝐮𝐬𝐭 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐀𝐈?▁▂▃▅▆▓▒░✩ 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝘃𝗲 𝗙𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻 𝗠𝗼𝗱𝗲𝗹𝘀 (GenFMs) are advancing at an unprecedented pace, but can we trust them in high-stakes applications? Excited to share our latest research, a 231-page deep dive titled: 📖 "𝐎𝐧 𝐭𝐡𝐞 𝐓𝐫𝐮𝐬𝐭𝐰𝐨𝐫𝐭𝐡𝐢𝐧𝐞𝐬𝐬 𝐨𝐟 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐅𝐨𝐮𝐧𝐝𝐚𝐭𝐢𝐨𝐧 𝐌𝐨𝐝𝐞𝐥𝐬 – 𝐆𝐮𝐢𝐝𝐞𝐥𝐢𝐧𝐞, 𝐀𝐬𝐬𝐞𝐬𝐬𝐦𝐞𝐧𝐭, 𝐚𝐧𝐝 𝐏𝐞𝐫𝐬𝐩𝐞𝐜𝐭𝐢𝐯𝐞" 👉 A massive collaboration across 33 𝒊𝒏𝒔𝒕𝒊𝒕𝒖𝒕𝒆𝒔, led by the amazing Yue Huang, to tackle one of AI’s biggest challenges: 𝒕𝒓𝒖𝒔𝒕𝒘𝒐𝒓𝒕𝒉𝒊𝒏𝒆𝒔𝒔. Despite their breakthroughs, GenFMs—from LLMs to text-to-image and vision-language models (VLMs)—continue to face critical risks. Yet, there’s no universal trustworthiness standard across academia, enterprises, and developers. ⚠️ How do we ensure that GenFMs are not just powerful, but also safe, fair, and reliable?💡 In this work, we introduce a holistic, evolving framework to standardize and evaluate trustworthiness in GenFMs. ▀▄▀▄▀🔍 𝐊𝐄𝐘 𝐂𝐎𝐍𝐓𝐑𝐈𝐁𝐔𝐓𝐈𝐎𝐍𝐒 🔍 ▄▀▄▀▄ 🏛 1️⃣ 𝑬𝒔𝒕𝒂𝒃𝒍𝒊𝒔𝒉𝒊𝒏𝒈 𝑮𝒐𝒗𝒆𝒓𝒏𝒂𝒏𝒄𝒆 𝑷𝒓𝒊𝒏𝒄𝒊𝒑𝒍𝒆𝒔 We propose a multidisciplinary framework integrating: ✅ AI Governance ✅ AI Law ✅ Computer Science Research This provides a flexible, evolving foundation for developing, evaluating, and governing GenFMs across different applications. 🛠 2️⃣ 𝑰𝒏𝒕𝒓𝒐𝒅𝒖𝒄𝒊𝒏𝒈 𝑻𝑹𝑼𝑺𝑻𝑮𝑬𝑵: The First Dynamic Benchmarking Platform Traditional benchmarks fail to capture the rapidly evolving trustworthiness threats like: 🚨 Jailbreaks 🚨 Prompt Injections 🚨 Adversarial Attacks TRUSTGEN enables: ✅ Modular, Adaptive Evaluation – covering text-to-image, LLMs, and VLMs ✅ Metadata Curation & Test Case Generation – going beyond static testing ✅ Contextual Variants – evaluating models in real-world-like scenarios. 📊 3️⃣ 𝑺𝒚𝒔𝒕𝒆𝒎𝒂𝒕𝒊𝒄 𝑬𝒗𝒂𝒍𝒖𝒂𝒕𝒊𝒐𝒏 𝒐𝒇 𝑺𝑶𝑻𝑨 𝑮𝒆𝒏𝑭𝑴𝒔 We benchmarked leading generative models and found: ✔️ Significant improvements in trustworthiness compared to last year’s evaluations ⚠️ Persistent challenges remain – including hidden vulnerabilities in open-source models & excessive safety constraints ⚖️ Balancing Utility & Safety – Models must navigate trade-offs between security and usability »»»💡 𝗪𝗵𝗮𝘁’𝘀 𝗡𝗲𝘅𝘁?»»» Trustworthiness is not static—it evolves as models become more powerful and complex. Our work highlights the most urgent challenges and sets a roadmap for the future of trustworthy AI in: 🔹 Healthcare 🏥 🔹 AI for Science 🧬 🔹 Cybersecurity 🔒 🔹 Human-AI Interaction 🤖 🔹 Transportation 🚃 , and more interdisciplinary applications ... 📜 𝗥𝗲𝗮𝗱 𝗠𝗼𝗿𝗲: https://lnkd.in/gvNZ2vts 🛠 𝗧𝗿𝘆 𝗢𝘂𝗿 𝗧𝗼𝗼𝗹𝗸𝗶𝘁: https://lnkd.in/gwDVEkJx Let’s build safer, more reliable GenAI systems together! 🚀 #TrustworthyAI #GenerativeAI #LLMs #VLMs #AIResearch #AIEthics #AIRegulation #Benchmarking #MachineLearning #AI4Science
-
Can We Trust AI to Write Code? Here’s What the Latest Research Says 👉 WHY TRUST MATTERS MORE THAN EVER Software engineering has always been a team sport. But what happens when your newest teammate isn’t human? The paper "AI Software Engineer: Programming with Trust" (arXiv:2502.13767v1) highlights a critical shift: as AI-generated code becomes commonplace, trust—not just functionality—will define success. Traditional software practices rely on human accountability (e.g., code reviews, developer reputation). 🚨 With AI, we lack that accountability. 👉. Developers hesitate to adopt AI tools not because of technical limitations, but because they can’t verify why the code works—or when it might fail. This isn’t hypothetical. Studies show developers reject AI-generated code not for errors, but for opaque reasoning. The barrier isn’t capability—it’s confidence. 👉 WHAT CHANGES WHEN AI JOINS THE TEAM? AI software engineers, like Devin, promise autonomy in tasks like bug fixes and feature development. But autonomy without accountability creates risk. The paper argues LLM agents—AI systems augmented with testing, analysis, and formal verification tools—offer a solution. These agents: - Validate code integrity through automated testing and static analysis. - Explain decisions by linking code changes to inferred specifications (e.g., pre/post-conditions). - Enforce guardrails to block vulnerabilities or malicious inputs, similar to frameworks like Guardrails AI. Unlike standalone LLMs, agents integrate tools humans already trust. For example, an agent might run a generated patch through a test suite and a formal verifier before deployment. 👉 HOW TO BUILD TRUST AT SCALE At Quantalogic, we’re applying these principles to tackle software debt and modernization. Here’s how: 1. Shift from output to process: Measure AI success not by lines of code, but by traceability. Every AI-generated change includes metadata (e.g., tests passed, vulnerabilities scanned). 2. Prioritize hybrid workflows: Use AI for repetitive tasks (e.g., retro documentation), but require human sign-off for critical systems. 3. Leverage guardrails: Implement runtime checks (e.g., data leakage prevention, SQL injection detection) to filter unsafe outputs. The paper warns against over-reliance on metrics like “code velocity.” Instead, focus on qualitative trust signals: - Can developers understand AI’s reasoning? - Does the AI adapt to feedback? - Are safeguards in place to prevent cascading errors? 👉 Final Thought AI won’t replace developers—but it will redefine their role. The future belongs to teams that treat AI as a collaborator, not a tool. This means investing in systems that make AI’s decisions transparent, verifiable, and aligned with human expertise. For Quantalogic, this research validates our mission: building agentic platforms where trust is engineered into every interaction. Let’s engineer not just code, but confidence.
-
🚨 𝐖𝐡𝐚𝐭 𝐫𝐞𝐚𝐥𝐥𝐲 𝐦𝐚𝐭𝐭𝐞𝐫𝐬 𝐟𝐨𝐫 𝐀𝐈 𝐚𝐝𝐨𝐩𝐭𝐢𝐨𝐧 𝐢𝐬 𝐭𝐫𝐮𝐬𝐭. So argues this great paper by Natalia Vuori, Barbara Burkhard, and Leena Pitkäranta. They ran a qualitative study where they tracked the introduction and use of a new AI technology (a competency mapping tool) in a company. They looked at 𝐭𝐰𝐨 𝐭𝐫𝐮𝐬𝐭 𝐝𝐢𝐦𝐞𝐧𝐬𝐢𝐨𝐧𝐬: 1️⃣ Cognitive: the rational evaluation of AI. 2️⃣ Emotional: people's feelings towards AI. And they identified 𝐟𝐨𝐮𝐫 𝐝𝐢𝐬𝐭𝐢𝐧𝐜𝐭 𝐜𝐨𝐧𝐟𝐢𝐠𝐮𝐫𝐚𝐭𝐢𝐨𝐧𝐬: 1️⃣ Full trust (high cognitive/high emotional). 2️⃣ Full distrust (low cognitive/low emotional). 3️⃣ Uncomfortable trust (high cognitive/low emotional). 4️⃣ Blind trust (low cognitive/high emotional). They found that employees exhibited 𝐝𝐢𝐬𝐭𝐢𝐧𝐜𝐭 𝐛𝐞𝐡𝐚𝐯𝐢𝐨𝐮𝐫𝐬 under these configurations: ➡️ Some people responded by "detailing" their digital footprints. ➡️ Others engaged in manipulating or withdrawing them. (𝘛𝘩𝘪𝘴 𝘪𝘴 𝘴𝘰𝘮𝘦𝘵𝘩𝘪𝘯𝘨 𝘐 𝘩𝘦𝘢𝘳 𝘢 𝘭𝘰𝘵 𝘰𝘧 𝘸𝘩𝘦𝘯 𝘳𝘶𝘯𝘯𝘪𝘯𝘨 𝘦𝘮𝘱𝘭𝘰𝘺𝘦𝘦 𝘧𝘰𝘤𝘶𝘴 𝘨𝘳𝘰𝘶𝘱𝘴, 𝘣𝘵𝘸. 𝘗𝘦𝘰𝘱𝘭𝘦 𝘢𝘳𝘦 𝘷𝘦𝘳𝘺 𝘸𝘢𝘳𝘺 𝘰𝘧 𝘵𝘩𝘦 𝘪𝘯𝘧𝘰𝘳𝘮𝘢𝘵𝘪𝘰𝘯 𝘵𝘩𝘦𝘺 𝘳𝘦𝘤𝘰𝘳𝘥 𝘪𝘯 𝘴𝘺𝘴𝘵𝘦𝘮𝘴, 𝘊𝘢𝘭𝘦𝘯𝘥𝘢𝘳𝘴 𝘰𝘳 𝘛𝘦𝘢𝘮𝘴.) These behaviours triggered a ‘𝐯𝐢𝐜𝐢𝐨𝐮𝐬 𝐜𝐲𝐜𝐥𝐞’ where "manipulated" data degraded the tool's performance, further eroding trust and stalling adoption. 𝐖𝐡𝐚𝐭'𝐬 𝐭𝐨 𝐛𝐞 𝐝𝐨𝐧𝐞? 🔺 Leaders need to develop a 𝐩𝐞𝐨𝐩𝐥𝐞-𝐜𝐞𝐧𝐭𝐫𝐢𝐜 𝐬𝐭𝐫𝐚𝐭𝐞𝐠𝐲 for AI adoption. 🔺 Training, clear principles, 𝐫𝐞𝐚𝐥𝐢𝐬𝐭𝐢𝐜 𝐞𝐱𝐩𝐞𝐜𝐭𝐚𝐭𝐢𝐨𝐧𝐬 can help to build cognitive trust. 🔺 Role-modelling, 𝐩𝐬𝐲𝐜𝐡𝐨𝐥𝐨𝐠𝐢𝐜𝐚𝐥 𝐬𝐚𝐟𝐞𝐭𝐲, and strong ethical guidelines can help to build emotional trust. 🎤 𝐖𝐡𝐚𝐭 𝐝𝐨 𝐲𝐨𝐮 𝐭𝐡𝐢𝐧𝐤? 𝐖𝐡𝐚𝐭 𝐰𝐨𝐮𝐥𝐝 𝐲𝐨𝐮 𝐚𝐝𝐝? 👉 Find a summary of the research here: https://buff.ly/4f9fHFG 📰 Reference: Vuori, N., Burkhard, B. and Pitkäranta, L. (2025), It's Amazing – But Terrifying!: Unveiling the Combined Effect of Emotional and Cognitive Trust on Organizational Member' Behaviours, AI Performance, and Adoption. J. Manage. Stud.. https://buff.ly/aq7ZFkW #AI #BehavioralScience #PsychologicalSafety #Trust #EmployeeExperience 🚴♂️ Enjoy posts like this? 👉 Get my 𝐪𝐮𝐚𝐫𝐭𝐞𝐫𝐥𝐲 𝐫𝐨𝐮𝐧𝐝𝐮𝐩: https://buff.ly/E7Ta0aB
-
𝔼𝕍𝔸𝕃 field note (2 of 3): Finding the benchmarks that matter for your own use cases is one of the biggest contributors to AI success. Let's dive in. AI adoption hinges on two foundational pillars: quality and trust. Like the dual nature of a superhero, quality and trust play distinct but interconnected roles in ensuring the success of AI systems. This duality underscores the importance of rigorous evaluation. Benchmarks, whether automated or human-centric, are the tools that allow us to measure and enhance quality while systematically building trust. By identifying the benchmarks that matter for your specific use case, you can ensure your AI system not only performs at its peak but also inspires confidence in its users. 🦸♂️ Quality is the superpower—think Superman—able to deliver remarkable feats like reasoning and understanding across modalities to deliver innovative capabilities. Evaluating quality involves tools like controllability frameworks to ensure predictable behavior, performance metrics to set clear expectations, and methods like automated benchmarks and human evaluations to measure capabilities. Techniques such as red-teaming further stress-test the system to identify blind spots. 👓 But trust is the alter ego—Clark Kent—the steady, dependable force that puts the superpower into the right place at the right time, and ensures these powers are used wisely and responsibly. Building trust requires measures that ensure systems are helpful (meeting user needs), harmless (avoiding unintended harm), and fair (mitigating bias). Transparency through explainability and robust verification processes further solidifies user confidence by revealing where a system excels—and where it isn’t ready yet. For AI systems, one cannot thrive without the other. A system with exceptional quality but no trust risks indifference or rejection - a collective "shrug" from your users. Conversely, all the trust in the world without quality reduces the potential to deliver real value. To ensure success, prioritize benchmarks that align with your use case, continuously measure both quality and trust, and adapt your evaluation as your system evolves. You can get started today: map use case requirements to benchmark types, identify critical metrics (accuracy, latency, bias), set minimum performance thresholds (aka: exit criteria), and choose complementary benchmarks (for better coverage of failure modes, and to avoid over-fitting to a single number). By doing so, you can build AI systems that not only perform but also earn the trust of their users—unlocking long-term value.
-
𝗟𝗟𝗠 𝗛𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗶𝗼𝗻𝘀? Google Cloud 𝘁𝗮𝗰𝗸𝗹𝗲𝘀 𝘁𝗵𝗲 𝗽𝗿𝗼𝗯𝗹𝗲𝗺 𝗵𝗲𝗮𝗱 𝗼𝗻! I'm happy to share Gemini Hallcheck, a new open-source toolkit for evaluating model trustworthiness! A groundbreaking paper by Kalai et al. at OpenAI and Georgia Tech explains why hallucinations are a statistically inevitable result of pre-training. Our work provides the first open-source implementation of their core proposal to manage this reality. Existing benchmarks measure accuracy, but often reward models for confident guessing. This hides the real-world risk of hallucinations and makes it difficult to choose the truly most reliable model for high-stakes tasks. Building on the theoretical framework from the paper, we've created a practical evaluation suite that moves beyond simple accuracy to measure Behavioural Calibration. Here are the highlights: 🎯 Confidence-Targeted Prompting: A new evaluation method that tests if a model can follow risk/reward rules. ⚖️ Abstention-Aware Scoring: Implements the paper's novel penalty scheme to reward honest "I don't know" answers instead of penalizing them. 📈 Trustworthiness Curves: Generates a trade-off curve between a model's answer coverage and its conditional accuracy, revealing its true reliability. Our initial tests show that some models that look best on traditional accuracy benchmarks are not the most behaviourally calibrated. Choosing the right model for your enterprise use case just got a lot clearer 🤗 We're open-sourcing our work to help the community build and select more trustworthy AI. Feel free to explore the GitHub repo and run the evaluation on your own models, link to the code in the comments below!
-
If you can't measure the reasoning chain, you can't trust the agent. We’re entering the age of agentic AI - networks of autonomous agents collaborating to solve scientific, technical, and operational challenges. But there’s a problem: → Most evaluation methods only check the final output. → They ignore how agents exchange information, build hypotheses, and plan together. → They offer no window into why the system made its decisions. The “Agent-as-a-Judge” approach is a step toward fixing this. This isn’t a new paper, but it’s becoming more and more relevant as agentic systems move from labs to real-world deployments. Instead of humans grading black boxes or LLMs scoring final answers, it uses agentic systems to evaluate agentic systems. This means: → Mapping how tasks, information, and reasoning flow across the agent network. → Checking if the agents’ communication and distributed planning are coherent. → Auditing how coordination impacts outcomes. The result? Transparent, auditable, and trusted agentic systems capable of tackling novel, high-stakes problems and producing verifiable, interpretable outputs. For leaders building agentic AI: → This shifts evaluation from pass/fail to process-level diagnostics. → It enables safer deployment in data-scarce, mission-critical environments. → It lays the groundwork for self-improving agent collectives. As agentic AI scales, how you evaluate will determine how you trust and deploy. It’s time to measure the chain, not just the outcome.