When To Use Binary vs. Score Evals
It's your friendly monthly content roundup from the team at Arize. Check out the latest on evals, agent engineering, and more.
Teams define their LLM evals in very different ways: some use strictly boolean labels, while others use a variation of binary or multi-categorical values, score ranges, explanations, and other techniques. Are LLMs equally competent at all these approaches? This blog by Aparna Dhinakaran, Srilakshmi Chavali, and Elizabeth Hutton dives into best practices based on our testing. Read it.
Inspired by Anthropic's "Building Effective AI Agents," we dive into orchestrator-worker agents and compare how leading frameworks – including Agno, Autogen, CrewAI, OpenAI, LangGraph, and Mastra – approach and implement this pattern. Learn more about orchestrator-worker agents in this blog by Sanjana Yeddula, Aparna Dhinakaran, and Srilakshmi Chavali.
AI data use-cases demand an interface that can handle both large files (like custom datasets) and highly scaled real-time events (like traces and spans). The Arize AX platform is designed to handle both, consistently. See adb benchmarks in this piece by Jason Lopatecki.
Useful Guides & Updates
📦 Freshly Shipped: What's new in Arize AX in September
📚 AI Researcher Show-and-Tell: Atropos Health's Arjun Mukerji, PhD, explains RWESummary: a framework for using LLMs to summarize real-world evidence.
📊 Learn Something: When to use CoT, reasoning & explanations for LLM-as-a-judge.
Upcoming Events
Get in the room with other agent engineers and builders.
- Tonight, San Francisco | #sftechweek Happy Hour
- October 8, Virtual | Paper Reading: Why Language Models Hallucinate
- October 8-9, Austin | 6th Annual MLOps World | GenAI Summit David Scharbach Faraz Thambi
- October 15, Virtual | PagerDuty AI Drives Modern Tech Stacks—But What Happens When It Fails?
- November 6, London | Building & Shipping Reliable Agents, featuring Google DeepMind
Build. Learn. Connect.
Want a personal walk-through of the Arize AX platform? Book a demo.