This paper by Apple researchers introduces TOOLSANDBOX, a benchmark for evaluating how well LLMs handle stateful, conversational, and interactive tasks using tools, and for offering new insights into the capabilities and limitations of these models. The tools can be anything from APIs and databases to simple functions such as checking the weather or making a restaurant reservation.

Key concepts in the paper:
- Stateful tasks: the tasks require the model to remember actions or decisions it made earlier in the conversation. For example, if the model turned on the internet in a previous step, it should remember that the internet is now on and not try to turn it on again (a toy illustration follows after this post).
- Tool use: the model needs to know when and how to use different tools. Some tasks require multiple tools in sequence, and the model has to work out the correct order and timing.
- Evaluation: the benchmark tests the model on scenarios that require multiple steps, state management, and decision-making with limited information.

The paper concludes that while models are getting better at simple tasks, they still struggle with more complex scenarios that require using multiple tools, remembering previous actions, and making decisions from incomplete information. This helps pinpoint where current models fall short and where improvements are needed. In particular, the authors highlight how hard it is for models such as Mistral and Hermes to recognize when to issue a tool call; Mistral, for example, often treats a tool-use scenario as a code-generation task, which hurts its performance. GPT-4o and Claude-3-Opus are also evaluated, with GPT-4o achieving the highest similarity score, though both models struggle with complex tool-call sequences. The recurring challenges are managing tasks that depend on prior state, using tools consistently across contexts, and handling incomplete data or on-the-fly decision-making.

TOOLSANDBOX is compared with other benchmarks such as BFCL (Berkeley Function Calling Leaderboard), ToolEval, and API-Bank. While those benchmarks also target tool-use capabilities, TOOLSANDBOX is distinguished by its focus on stateful, interactive, and conversational tool use, along with a human-authored ground truth for evaluation. Even state-of-the-art LLMs struggle with the complex tasks it poses, underscoring how hard it is to make LLMs effective tool users in real-world scenarios.

Paper authors (Apple): Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu (Janet) Li, Guoli Yin, Zirui Wang, Ruoming Pang
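To make the stateful tool-use idea concrete, here is a minimal toy sketch in Python. It is not the paper's actual code: the world state, the tool names (enable_internet, check_weather, book_restaurant), and the milestone-style scoring function are hypothetical stand-ins meant only to show why a model must track state and call tools in a sensible order.

```python
# Toy, hypothetical sketch of a stateful tool-use scenario (NOT the actual
# TOOLSANDBOX code; all names and the scoring scheme are illustrative only).

world_state = {"internet_enabled": False, "reservations": []}

def enable_internet() -> str:
    """Tool: turn the internet on (mutates shared world state)."""
    world_state["internet_enabled"] = True
    return "internet is now on"

def check_weather(city: str) -> str:
    """Tool with an implicit state dependency: fails unless the internet is on."""
    if not world_state["internet_enabled"]:
        return "ERROR: no connection -- enable_internet must be called first"
    return f"Sunny in {city}"  # canned answer for the sketch

def book_restaurant(name: str) -> str:
    """Tool that writes to state so later turns can refer back to the booking."""
    world_state["reservations"].append(name)
    return f"Booked a table at {name}"

# A human-authored "ground truth" trajectory the agent is compared against;
# the model has to discover the weather -> internet dependency on its own.
expected_calls = ["enable_internet", "check_weather", "book_restaurant"]

def score(actual_calls: list[str]) -> float:
    """Fraction of expected calls that appear in the right relative order."""
    pos, hit = 0, 0
    for expected in expected_calls:
        try:
            pos = actual_calls.index(expected, pos) + 1
            hit += 1
        except ValueError:
            continue
    return hit / len(expected_calls)

if __name__ == "__main__":
    print(score(["enable_internet", "check_weather", "book_restaurant"]))  # 1.0
    print(score(["check_weather", "book_restaurant"]))  # ~0.67: the state step was skipped
```

The second trajectory scores lower because the state-changing step was skipped, which is exactly the kind of dependency that stateful, multi-step tasks of this sort are designed to probe.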
Understanding AI Limitations in Research
Summary
Understanding the limitations of AI in research is crucial as it helps us address areas where these models fall short, such as reasoning, decision-making, and adapting to nuanced, real-world contexts. While AI tools excel at analyzing patterns and automating tasks, they lack true understanding, creative innovation, and the capacity to navigate ambiguity or environmental complexities.
- Embrace human oversight: Use AI as a supportive tool but rely on human judgment to interpret data, resolve conflicts, and address complexities that AI cannot handle.
- Focus on contextual evaluation: Evaluate AI performance in real-world scenarios to better grasp its practical limitations and areas for improvement.
- Prioritize safety and governance: Implement robust safety protocols, transparent practices, and continuous evaluations to ensure responsible and reliable use of AI in research.
-
What AI Still Can’t Do in UX Research (YET!)

In my HCI class, we discussed LLMs and their expanding role in UXR. I had a mix of students from psychology and computer science, and the contrast in their perspectives was one of the most engaging parts of the discussion. While the CS students were excited by AI’s analytical power, the psychology students were quick to point out what gets lost when you remove the human element. That tension sparked an honest and nuanced conversation, one I think many UX teams are navigating right now.

We often hear how AI can streamline processes and surface insights faster than ever before. But the real question is: where does AI stop, and where must human insight step in? That led us to areas where AI still can’t match human capabilities.

For example, while AI can analyze transcripts or summarize sentiment, it doesn’t understand people. It won’t notice a pause before someone answers, frustration in a participant’s tone, or the subtle body language that signals confusion. These small cues are often where the richest insights emerge, and they’re simply not available to models trained on words alone.

Creativity and innovation were another focus. AI can recombine existing ideas, but it can’t invent something fundamentally new or rethink a problem from the ground up. In UX, this matters. We’re not just validating existing solutions; we’re often trying to uncover unmet needs or explore new directions. That requires intuition, original thinking, and the ability to question the assumptions behind the product itself.

When it comes to strategy and synthesis, the limitations become even more apparent. Sure, AI can cluster feedback or generate dashboards. But it can’t resolve contradictions, weigh trade-offs, or prioritize conflicting user needs against business goals. Those are messy, context-dependent decisions that rely on human judgment and experience.

Another key gap is environmental awareness. AI doesn’t experience screen glare, background noise, or a laggy connection. It won’t notice how someone adjusts their behavior when a feature breaks or when a multitasking parent is using an app one-handed. These real-world conditions often reveal the most meaningful usability issues, things that can’t be spotted in clean, structured data alone.

We also talked about trust. AI can confidently generate summaries or highlight patterns, but it doesn’t know when it’s wrong. It can reproduce bias, oversimplify nuance, and give the illusion of certainty. This is why human oversight isn’t just helpful, it’s necessary. Researchers need to validate, question, and interpret results, not just take them at face value.

These are the kinds of conversations we need to be having as AI becomes more integrated into our research practice. AI is a powerful assistant, but it’s not a researcher. The most meaningful UX insights still come from empathy, critical thinking, and a deep understanding of context. And for now, those remain human strengths.
-
Let me translate neural networks and LLMs into actual reality:

NEURAL NETWORKS EXPLAINED:
Not:
• Digital brains
• Thinking machines
• Intelligent systems
• Magic boxes
But:
• Math functions
• Pattern matchers
• Statistical models
• Probability calculators

Real Example: Like a complicated voting system:
• Neurons = voters
• Weights = voting power
• Training = adjusting votes
• Output = election result

LLMs DECODED:
Not:
• Understanding language
• Having conversations
• Being intelligent
• Thinking thoughts
But:
• Advanced autocomplete
• Pattern recognition
• Token prediction
• Probability distribution

How They Actually Work (toy example after this post):
1. Training Phase:
• Ingest massive text data
• Find statistical patterns
• Map token relationships
• Create prediction models
2. Usage Phase:
• Get input tokens
• Calculate probabilities
• Predict next tokens
• Generate responses

WHY IT'S NOT AGI:
Real Limitations:
1. No Understanding
• Can't reason about new concepts
• Doesn't understand causation
• No real-world model
• Pure pattern matching
2. No Learning Framework
• Can't learn from single examples
• No conceptual transfer
• No true adaptation
• Static after training
3. No Internal Model
• No real consciousness
• No actual thinking
• No true reasoning
• No understanding

Example of Limitation: LLM can write about making coffee but:
• Doesn't understand what coffee is
• Can't learn from burning tongue
• Doesn't know why hot liquid hurts
• Just matches patterns about coffee

WHY AGI IS FAR: We're missing:
1. Core Intelligence
• Real understanding
• Causal reasoning
• Conceptual learning
• True adaptation
2. Consciousness Framework
• Self-awareness
• Original thought
• Real reasoning
• Actual understanding
3. Learning Architecture
• One-shot learning
• Knowledge transfer
• Adaptive reasoning
• True intelligence

(From someone who's built both neural networks and real solutions)

#AIReality #TechTruth #NoBS
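To ground the "advanced autocomplete / token prediction" framing above, here is a deliberately tiny Python sketch of next-token prediction using a bigram count table. Real LLMs learn these statistics with neural networks over enormous corpora rather than a counting dictionary, but the "calculate probabilities, predict the next token" loop is the same idea; the corpus and function names here are made up for illustration.

```python
# Toy illustration (not any production LLM): next-"token" prediction as pure
# counting and probability, echoing the "advanced autocomplete" framing above.
from collections import Counter, defaultdict
import random

corpus = "the cat sat on the mat the cat ate the fish".split()

# "Training phase": count which word follows which. This bigram table stands in
# for the billions of learned weights in a real model.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_token_distribution(prev: str) -> dict[str, float]:
    """Turn raw counts into a probability distribution over the next token."""
    counts = follows[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def generate(start: str, length: int = 5) -> str:
    """'Usage phase': repeatedly sample a statistically likely continuation."""
    out = [start]
    for _ in range(length):
        dist = next_token_distribution(out[-1])
        if not dist:  # no known continuation for this word
            break
        tokens, probs = zip(*dist.items())
        out.append(random.choices(tokens, weights=probs)[0])
    return " ".join(out)

print(next_token_distribution("the"))  # e.g. {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
print(generate("the"))                 # pattern-matched text, no understanding behind it
```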
-
"Our analysis of eleven case studies from AI-adjacent industries reveals three distinct categories of failure: institutional, procedural, and performance... By studying failures across sectors, we uncover critical lessons about risk assessment, safety protocols, and oversight mechanisms that can guide AI innovators in this era of rapid development. One of the most prominent risks is the tendency to prioritize rapid innovation and market dominance over safety. The case studies demonstrated a crucial need for transparency, robust third-party verification and evaluation, and comprehensive data governance practices, among other safety measures. Additionally, by investigating ongoing litigation against companies that deploy AI systems, we highlight the importance of proactively implementing measures that ensure safe, secure, and responsible AI development... Though today’s AI regulatory landscape remains fragmented, we identified five main sources of AI governance—laws and regulations, guidance, norms, standards, and organizational policies—to provide AI builders and users with a clear direction for the safe, secure, and responsible development of AI. In the absence of comprehensive, AI-focused federal legislation in the United States, we define compliance failure in the AI ecosystem as the failure to align with existing laws, government-issued guidance, globally accepted norms, standards, voluntary commitments, and organizational policies–whether publicly announced or confidential–that focus on responsible AI governance. The report concludes by addressing AI’s unique compliance issues stemming from its ongoing evolution and complexity. Ambiguous AI safety definitions and the rapid pace of development challenge efforts to govern it and potentially even its adoption across regulated industries, while problems with interpretability hinder the development of compliance mechanisms, and AI agents blur the lines of liability in the automated world. As organizations face risks ranging from minor infractions to catastrophic failures that could ripple across sectors, the stakes for effective oversight grow higher. Without proper safeguards, we risk eroding public trust in AI and creating industry practices that favor speed over safety—ultimately affecting innovation and society far beyond the AI sector itself. As history teaches us, highly complex systems are prone to a wide array of failures. We must look to the past to learn from these failures and to avoid similar mistakes as we build the ever more powerful AI systems of the future." Great work from Mariami Tkeshelashvili and Tiffany Saade at the Institute for Security and Technology (IST). Glad I could support alongside Chloe Autio, Alyssa Lefaivre Škopac, Matthew da Mota, Ph.D., Hadassah Drukarch, Avijit Ghosh, PhD, Alexander Reese, Akash Wasil and others!
-
The Ada Lovelace Institute has released an in-depth report on the limitations of evaluations of foundation models (e.g., LLMs). There's a lot here, but it's worth reading! This work is important, as model evaluations (as well as 'red teaming') are the primary mechanism policy makers are relying on for mitigating "systemic risk" from these systems. We therefore need to carefully explore their benefits and limitations.

The key takeaways are:
🔍 Evaluations are valuable for understanding AI models, but face theoretical, practical, and social challenges. Governments, companies, and researchers must collaborate to improve their effectiveness in AI governance.
🛠️ While useful, evaluations alone can't guarantee AI safety in real-world conditions. They need to be combined with other governance tools like codes of practice, incident reporting, and post-market monitoring.
🎯 Current evaluation methods (e.g., red teaming, benchmarking) have limitations and can be manipulated. Developers might train models on evaluation datasets or cherry-pick favorable tests, compromising assessment integrity (see the sketch after this post).
🔄 Model versions matter significantly. Even small changes or fine-tuning can cause unpredictable behavior shifts and potentially override safety features, complicating evaluation efforts.
🌍 Assessing AI safety requires considering the broader context, including users, interfaces, tool access, and environmental impacts. Lab tests are valuable but insufficient; context-specific evaluations are crucial.
🏢 Many evaluations seem designed for corporate or academic purposes rather than public or regulatory needs. Limited model transparency from developers hinders meaningful third-party assessments.

Link in the comments. h/t Borhane Blili-Hamelin, PhD
#llms #chatgpt #airegulation Khoa Lam, Dinah Rabe, Jeffery Recker, Bryan Ilg
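On the point about developers training models on evaluation datasets, below is a minimal sketch of the simplest kind of train/test contamination check: naive word n-gram overlap between an eval item and training documents. Real contamination audits are considerably more sophisticated; the function names, the n-gram size, and the example strings here are assumptions for illustration only.

```python
# Minimal sketch (illustrative assumptions, not any benchmark's real tooling) of
# a naive contamination check: how much of an eval item's text already appears
# verbatim, as word n-grams, in the training corpus.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word n-grams, used as a crude fingerprint of a passage."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(eval_item: str, training_docs: list[str], n: int = 8) -> float:
    """Fraction of the eval item's n-grams that also occur in the training data."""
    item_grams = ngrams(eval_item, n)
    if not item_grams:
        return 0.0
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)

# Hypothetical usage: a high overlap fraction suggests the benchmark item may be
# memorized rather than genuinely solved, so its score says little about capability.
score = contamination_score(
    "The quick brown fox jumps over the lazy dog near the river bank today",
    ["training text ... the quick brown fox jumps over the lazy dog near the river bank today ..."],
)
print(f"overlap fraction: {score:.2f}")
```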