Evals to Improve Consistency in LLM-based Classification

Authored by: Ritvvij Parrikh and Devesh Gobind

Co-Authored by: Ashish Jaiswal 

LLMs are powerful but inherently non-deterministic. Their variability can introduce engineering and product risks. In sensitive areas like news distribution or editorial judgement, even small inconsistencies can affect user trust and brand reputation.

This post explains how we improved consistency in one of our key systems - the Editorial Judgement Agent - by building a systematic evaluation framework that strengthens editorial decision-making.


Problem

As a team, we have been building Machine Learning models for over 3 years now. However, using LLMs in production often feels like writing fragile rule engines. Here’s why:

How to converge? Unlike in ML, there is no native convergence loop - no loss function, no backpropagation, no built-in feedback loop - so improvement requires external scaffolding.

How to bring in reproducibility? LLM outputs keep varying: the same prompt on the same (minor version of a) model can produce different results across runs. Behind the scenes, the provider can also ship minor upgrades within the same major model, so behaviour shifts even when nothing in our stack changes. Switching across major versions raises the risk further.

How to optimize across iterations? Prompt iteration is inevitable. As requirements evolve or as editors cover new niches, we have to expand the prompt to handle more scenarios. However, prompt writers tend to overfit to narrow test data, changes can break regression scenarios, and small prompt edits cascade globally, breaking prior assumptions like brittle glue code.

How to maintain it? As the prompt grows, it becomes harder to debug, reason about, or refactor. Fixes are also fragile: you patch one issue, and other regression scenarios break.

Why it matters

In editorial contexts, inconsistent model behaviour can lead to systematic errors that are difficult to trace. The Editorial Judgement Agent underpins several downstream systems - from algorithmic feeds to push notifications - where precision and transparency are essential. Ensuring auditability and traceability helps maintain editorial consistency and user trust.


A quick note on Evals

Evals are the systematic estimation of application quality. The goal isn’t to be deterministic but to be systematic about continuous improvement. There are three parts to building evals (a minimal harness is sketched after the list):

  • Labeled dataset: Evals require domain knowledge and taste. This is requirements-writing as much as QC. It forces clarity on what “acceptable” means.
  • Iteration Loop: Once an eval fails, we have to look at the output from LLMs, notice what feels off, and then make changes to the prompt or system to fix it. 
  • Measurements: Without measurement, iteration is blind. Evals give a trend of whether product quality is improving.
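For illustration, a minimal eval harness could look like the sketch below; `run_agent` and the JSON-lines dataset format are assumptions for the example, not our actual implementation.

```python
# Minimal sketch of an eval loop. `run_agent` is a hypothetical wrapper around
# the LLM pipeline; the dataset is assumed to be JSON lines of {input, expected}.
import json

def run_eval(labeled_path, run_agent):
    with open(labeled_path) as f:
        dataset = [json.loads(line) for line in f]

    failures = []
    correct = 0
    for example in dataset:
        predicted = run_agent(example["input"])        # call the agent
        if predicted == example["expected"]:
            correct += 1
        else:
            failures.append({**example, "predicted": predicted})

    accuracy = correct / len(dataset)
    # Failures drive the iteration loop: inspect them, adjust the prompt, re-run.
    print(f"accuracy={accuracy:.2%}, failures={len(failures)}")
    return accuracy, failures
```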


Evals for Editorial Judgement Agent

Decomposition for Auditability

We broke the monolithic prompt into more than sixteen discrete, observable variables. Each variable is independently testable and auditable, which localizes failure, reduces variance, and keeps reasoning legible. It also made the prompt maintainable and cheaper to evolve.

Avoiding Combinatorial Explosion

These variables are stitched together into a sequential pipeline. This creates hierarchical epistemic layering: higher-variance variables rest on auditable primitive variables rather than on opaque end-to-end inference.

The image below shows how each variable’s prompt depends on other variables.

[Image: dependency graph of the classification variables]
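A simplified sketch of how such a dependency-aware pipeline can be wired is below; the variable names are illustrative, not our real taxonomy.

```python
# Illustrative sketch: each variable is classified by its own small prompt,
# and later variables receive earlier variables' outputs as context.
from typing import Callable, Dict

# classify(variable_name, article_text, context) -> one approved class value;
# in our system this wraps an LLM call, here it is just a placeholder signature.
Classifier = Callable[[str, str, Dict[str, str]], str]

PIPELINE = [
    # (variable, variables it depends on)
    ("topic", []),
    ("immediacy", []),
    ("editorial_relevance", ["topic"]),
    ("feed_eligibility", ["immediacy", "editorial_relevance"]),
]

def run_pipeline(article_text: str, classify: Classifier) -> Dict[str, str]:
    results: Dict[str, str] = {}
    for variable, dependencies in PIPELINE:
        context = {dep: results[dep] for dep in dependencies}  # auditable inputs
        results[variable] = classify(variable, article_text, context)
    return results
```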

Class Design for Control

We recognize that news and journalism use cases, especially cultural and political ones, are subjective in nature, so quality cannot be measured in exact-match accuracy alone. Hence, the eval grades each of the 16 variables against three checks (a small grading sketch follows the list):

  • “Accurate” for exact matches. For the Editorial Judgement Agent, accurate values represent the editor’s choice.
  • “Acceptable” for near-matches. For the Editorial Judgement Agent, these values are editorially okay and give a slightly less than desirable judgement. 
  • “Not acceptable” for failures. For the Editorial Judgement Agent, these values are absolutely unacceptable.
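In code, this three-way check reduces to a simple lookup per variable; the sketch below is illustrative, with hypothetical variable and value names.

```python
# Sketch of the three-way check. ACCEPTABLE maps (variable, editor_label) to
# the near-miss values editors consider okay; the entries here are illustrative.
ACCEPTABLE = {
    ("immediacy", "breaking"): {"developing"},
}

def grade(variable: str, predicted: str, editor_label: str) -> str:
    if predicted == editor_label:
        return "accurate"            # exact match with the editor's choice
    if predicted in ACCEPTABLE.get((variable, editor_label), set()):
        return "acceptable"          # near-match, slightly less desirable
    return "not_acceptable"          # editorially unacceptable
```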

This decision influenced our class design, which was engineered to minimize ambiguity and enforce control over LLM outputs (an enum-enforcement sketch follows the list):

  • All classes were mutually exclusive and collectively exhaustive to prevent overlap or omission.
  • Edge cases were handled through manual override rules and conflict-resolution logic, avoiding reliance on probabilistic inference.
  • Classification followed either a “best fit” strategy or sequential matching, stopping at the first valid match to reduce noise.
  • All class names were in snake_case to eliminate string-matching errors during post-processing.
  • Class names were semantically grouped to allow fuzzy rule logic - for example, using prefixes like “yes_1”, “yes_2”, “yes_3” enabled partial matches while retaining intent.
  • To ensure the LLM returns only one of the approved valid values, we enforce tool calls using enums.  
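One way to enforce “only approved values” is to define the classes as an enum and validate the tool-call arguments against it. The sketch below uses Pydantic with illustrative class names; it is an assumption about implementation detail, not our production code.

```python
# Sketch: snake_case classes as an Enum, validated via a Pydantic schema that
# backs the tool call. Class names are illustrative, not our real taxonomy.
from enum import Enum
from pydantic import BaseModel, ValidationError

class Immediacy(str, Enum):
    yes_1_breaking = "yes_1_breaking"
    yes_2_developing = "yes_2_developing"
    yes_3_follow_up = "yes_3_follow_up"
    no_evergreen = "no_evergreen"

class ImmediacyCall(BaseModel):
    immediacy: Immediacy   # the tool-call argument must be one of the enum values

def parse_tool_call(arguments: dict) -> Immediacy:
    try:
        return ImmediacyCall(**arguments).immediacy
    except ValidationError:
        # Fall back to a safe default (or retry) instead of trusting free text.
        return Immediacy.no_evergreen
```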

Maintaining Deterministic Fallbacks Outside the LLM

The final decision on whether a story appears in a feed is made through deterministic logic implemented outside the LLM. These checks aggregate multiple model signals but keep editorial authority with human editors.
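For instance, the final feed decision can be plain rule logic over the agent’s signals, roughly like the sketch below; the field names are illustrative.

```python
# Sketch of deterministic post-LLM logic: the agent's signals are aggregated
# with plain rules, and editors' flags always win. Fields are illustrative.
def should_appear_in_feed(signals: dict) -> bool:
    if signals.get("editor_override") is not None:
        return bool(signals["editor_override"])        # human editors have final say
    if signals.get("judgement") == "not_acceptable":
        return False                                   # hard block on failed checks
    # Simple aggregation of model signals; no probabilistic inference here.
    return signals.get("immediacy_ok", False) and signals.get("relevance_ok", False)
```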

Measures for each variable and each valid value

Each classification variable has different accuracy and acceptability benchmarks, depending on its nature.


Each variable has 2-30 different valid values. We generate an accuracy, acceptable, and not acceptable score per valid value. This tells us what we need to improve on.
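These per-value scores can be produced by grouping graded predictions by variable and editor label; a minimal sketch, assuming the grading function from the earlier sketch:

```python
# Sketch: per-value breakdown of accurate / acceptable / not_acceptable,
# built from (variable, editor_label, grade) tuples produced by the eval.
from collections import Counter, defaultdict

def per_value_scores(graded):
    """graded: iterable of (variable, editor_label, grade_name) tuples."""
    buckets = defaultdict(Counter)
    for variable, editor_label, grade_name in graded:
        buckets[(variable, editor_label)][grade_name] += 1

    report = {}
    for key, counts in buckets.items():
        total = sum(counts.values())
        report[key] = {g: counts[g] / total
                       for g in ("accurate", "acceptable", "not_acceptable")}
    return report
```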


Debugging and Traceability

To accelerate correction cycles, each variable produces structured reasoning traces. Editors can review these traces to see where the model inferred or skipped logic. Cross-variable correlations help identify hidden dependencies - for example, when changes in one classification disproportionately affect another. This leads to targeted refactors rather than blind prompt iteration.
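A structured trace can be as simple as a typed record stored per variable per run; the fields below are illustrative.

```python
# Sketch of a structured reasoning trace, one per variable per run.
# Field names are illustrative; the point is that traces are queryable.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ReasoningTrace:
    article_id: str
    variable: str
    predicted_value: str
    reasoning: str                      # the model's step-by-step rationale
    inputs_used: Dict[str, str] = field(default_factory=dict)  # upstream variables seen

    def cites(self, upstream_variable: str) -> bool:
        # Quick check used when looking for hidden cross-variable dependencies.
        return upstream_variable in self.inputs_used
```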


Measuring Precision

We need to measure how stable the predicted output is across multiple LLM calls.

Why it matters: LLM outputs are not intrinsically deterministic. Additionally, if our instructions are vague or contradictory, the model might interpret them differently across two runs, creating divergence in the output. The model also becomes less deterministic as we increase its temperature.

Here’s how we measure it (a minimal flip-rate sketch follows the list):

  • Flip rate: % of outputs that change across reruns.
  • Flip rate acceptable: % of outputs drifting beyond acceptable thresholds.
  • Temperature sweeps: Identify stability vs. accuracy sweet spots.
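A minimal flip-rate computation over N reruns of the same inputs could look like this; `run_agent` is the same hypothetical wrapper used in the earlier sketches.

```python
# Sketch: rerun the same inputs N times and measure how often outputs flip.
def flip_rate(inputs, run_agent, n_runs: int = 5) -> float:
    flipped = 0
    for item in inputs:
        outputs = [run_agent(item) for _ in range(n_runs)]
        if len(set(outputs)) > 1:        # any disagreement across reruns counts
            flipped += 1
    return flipped / len(inputs)         # fraction of inputs whose output changes
```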

By quantifying stability, we reduce the product risk of drift across environments or model upgrades.

Coverage of labeled data

Building labeled datasets is costly (editor time plus API runs). The product question is: how much is enough?

Data scientists can guide how many examples are required to estimate accuracy with the desired confidence. Intuitively, though, the labeled dataset must cover as many editorial use cases as possible, and as we discover edge cases, we add more labeled examples.
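As a rough rule of thumb, the standard proportion-estimate formula gives the order of magnitude; for example, estimating accuracy within ±5% at 95% confidence needs on the order of a few hundred labeled examples per variable.

```python
# Back-of-the-envelope sample size for estimating an accuracy (a proportion)
# within a margin of error at a given confidence level, using the normal
# approximation n = z^2 * p(1-p) / e^2. Worst case assumes p = 0.5.
import math

def sample_size(margin: float = 0.05, z: float = 1.96, p: float = 0.5) -> int:
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

# sample_size() -> 385 labeled examples per variable, before stratifying
# across editorial use cases and edge cases.
```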

Decisions we make based on evals

Evals give us clarity to make product decisions, such as:

  • Removing unreliable valid values that can’t hit acceptable thresholds.
  • Splitting classes until edge cases are reliably caught.


How output from Editorial Judgement Agent became an eval for ‘Algorithmic Distribution’

The real test of this system is not in isolation, but in how it powers downstream systems. Output from the Editorial Judgement Agent now acts as the eval for Algorithmic Distribution - checking whether the feed being predicted right now broadly has a good mix of immediacy and editorial relevance.


What next

We are continuing to evolve this system, migrating from LangChain to DSPy and GEPA to further improve stability and consistency across models - while keeping human judgement firmly at the center.


Credits
