
Building Custom LLM Judges: Going Beyond RAGAS

RAGAS is a reasonable starting point. It gives you faithfulness, answer relevance, and context precision out of the box, and for a first pass those metrics tell you something useful. The problem is that most teams treat RAGAS as the finish line rather than the starting line. They wire it up, see faithfulness scores in the 0.70-0.85 range, and ship — then watch user satisfaction plateau or decline while the dashboard stays green.

The gap between generic eval scores and actual user satisfaction is not a tooling problem. It is a measurement design problem. A faithfulness score of 0.82 tells you that the model stayed within the retrieved context most of the time. It does not tell you whether the answer was actually useful, whether a critical clause was omitted from a legal summary, or whether the tone was appropriate for the audience. Those distinctions require a judge that understands your domain, your users, and your definition of correct.

This guide covers how to build LLM-as-judge pipelines that actually correlate with user satisfaction. If you have not yet hit the ceiling of off-the-shelf evals, start with our Production RAG failure modes article — it explains why evaluation rigor matters before you get into the mechanics of custom judges.

Why generic eval metrics fail in production

RAGAS faithfulness is an NLI-style metric: it decomposes the generated answer into atomic claims and checks whether each claim is entailed by the retrieved context. This is a well-defined and useful measurement. The issue is what it does not measure.
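
For intuition, here is a minimal sketch of that decompose-and-check loop. This is illustrative, not the RAGAS implementation itself; the model name and prompts are assumptions:

# Sketch of NLI-style faithfulness: split the answer into atomic claims,
# then check each claim against the retrieved context.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1"  # assumed judge model

def decompose_claims(answer: str) -> list[str]:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
            f"List each atomic factual claim in this text, one per line:\n\n{answer}"}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("-• ").strip() for line in lines if line.strip()]

def claim_supported(claim: str, context: str) -> bool:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
            f"CONTEXT:\n{context}\n\nCLAIM: {claim}\n\n"
            "Is the claim entailed by the context? Answer YES or NO."}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def faithfulness(answer: str, context: str) -> float:
    claims = decompose_claims(answer)
    if not claims:
        return 0.0
    return sum(claim_supported(c, context) for c in claims) / len(claims)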

Consider a legal contract assistant. The retrieved context includes a liability clause with a standard carve-out. The model summarizes the clause accurately — every sentence is grounded in the retrieved text — but omits the carve-out. Faithfulness score: 1.0. User satisfaction: the client signed a contract without understanding their exposure. The metric was correct. The measurement was wrong.

This is the core failure of generic evals: they measure whether the model did the task in a technically correct way, not whether the output serves the actual business purpose. The mismatch shows up across domains:

  • Legal: Faithfulness misses omission errors. A fully grounded summary that drops a critical exception is worse than a hallucination, because it looks reliable.
  • Medical: Answer relevance misses contraindication coverage. The question gets answered, the relevant drug interaction does not get mentioned.
  • Financial: Context precision misses temporal reasoning. The retrieved chunk is the right document but from the wrong quarter — the model uses outdated numbers.
  • Customer support: All three RAGAS metrics can be high while tone is wrong for the audience — technically correct answers delivered in a way that makes users feel dismissed.

The fix is not replacing RAGAS — it is building domain-specific judges on top of it (or instead of it for the dimensions that matter most). RAGAS gives you the baseline; custom judges give you the signal that actually predicts whether users come back.

Lesson learned

On a legal document assistant we audited, RAGAS faithfulness was consistently above 0.80 across all test cases. Human lawyers reviewing the same outputs flagged 31% of summaries as "incomplete in a material way." The discrepancy was entirely explained by omission — the model was faithful to what it said, but what it chose to omit was the problem. Faithfulness alone cannot detect errors of omission by design. You need a separate completeness dimension in your rubric.

Building a golden dataset that reflects your domain

The single most valuable investment in your eval infrastructure is a high-quality golden dataset. This is the ground truth your judges are calibrated against, your CI gate, and your production drift detector. Everything else depends on it.

Size: 50 examples beat 500 mediocre ones. Most teams over-engineer their eval sets before they have 50 real production queries to learn from. A 500-example set assembled from synthetic queries before you've seen real traffic is largely noise. Start with 50 examples and make them excellent. Grow the set as production traffic reveals the actual query distribution.

The right composition for 50 examples (a minimal record schema is sketched after the list):

  • 20 canonical queries: The bread-and-butter questions your system exists to answer, with known-correct reference answers. These are your regression tests — they should never degrade.
  • 15 edge cases: Queries at the boundary of your knowledge base (documents from 2 years ago, niche topics, ambiguous phrasing). These are where systems fail silently.
  • 10 adversarial queries: Questions designed to surface known failure modes — multi-hop reasoning, numerical precision, temporal reasoning, questions whose answer is "this information is not in the knowledge base."
  • 5 negative examples: Queries where the correct answer is a confident refusal or redirect. These test whether your judge correctly penalizes hallucinated confidence.
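
Concretely, one record in the golden set might look like the sketch below. Field names are illustrative, not a prescribed schema; store the set as JSONL so domain experts can edit it by hand:

# One golden-set record. Adapt the fields to your domain.
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    query: str                       # the user question, verbatim where possible
    reference_answer: str            # expert-written; encodes what "excellent" means
    category: str                    # "canonical" | "edge" | "adversarial" | "negative"
    must_mention: list[str] = field(default_factory=list)    # e.g. carve-outs, contraindications
    must_not_claim: list[str] = field(default_factory=list)  # known hallucination traps
    source: str = "production"       # provenance: production query vs synthetic

example = GoldenExample(
    query="What is our liability exposure under clause 7?",
    reference_answer="Clause 7 caps liability at 12 months of fees, except for the "
                     "gross-negligence carve-out, which is uncapped.",
    category="canonical",
    must_mention=["gross-negligence carve-out"],
)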

When growing to 150-200 examples, the additional examples should come almost entirely from real production traffic. Every week, add the 5-10 most interesting production queries (the ones that caused user friction, got escalated, or surprised the team) to the golden set with annotated reference answers. This is how your eval set stays aligned with actual usage.

Reference answer quality matters more than quantity. A golden answer should not just be "correct" — it should encode the specific criteria that distinguish an excellent answer from a merely acceptable one. For a medical knowledge assistant, that means explicitly including the contraindication the model is expected to mention. For a legal assistant, it means flagging which clauses must appear in a complete summary. Your domain experts write these. Plan for 20-30 minutes per example if you want them done right.

Lesson learned

One team we worked with spent three weeks building a 400-example golden set from GPT-4-generated synthetic Q&A pairs before shipping to production. When real users arrived, the query distribution looked almost nothing like the synthetic set — users asked shorter, vaguer questions with implicit domain assumptions the synthetic generator never produced. The first 6 weeks of production traffic were more useful for calibration than the entire synthetic dataset. Build 50 real examples first, then scale.

Designing a scoring rubric that judges can follow

A rubric is the contract between you and your judge. A vague rubric produces inconsistent scores. A well-specified rubric produces scores that two different judges (or the same judge run twice) will agree on. The goal is not to constrain the judge's reasoning — it is to eliminate ambiguity about what "4 out of 5" means.

Three rubric formats are worth knowing:

  • Binary (pass/fail): Simplest, most reproducible, best for CI gates. "Does the answer contain a citation to a source document? Yes/No." Cohen's kappa is easy to compute and typically high because the criteria are concrete.
  • Likert (1-5 scale): Captures nuance for continuous quality dimensions like helpfulness or tone. More variance, harder to calibrate. Requires anchor descriptions for each point on the scale.
  • Reference-based: The judge compares the candidate answer to a gold reference and scores the deviation. Highest signal for domains where "correct" is well-defined (legal, medical, factual). Requires high-quality references, so it compounds the golden dataset investment.

For most production RAG systems, the right rubric combines two or three dimensions rather than collapsing everything into a single score. A single helpfulness score of 3.2/5.0 tells you almost nothing about what to fix. A three-dimension rubric — completeness, faithfulness, tone — with separate scores tells you exactly which dimension is degrading and where to focus.

Here is a concrete faithfulness rubric prompt you can adapt. The key design decisions: chain-of-thought reasoning before scoring, explicit anchor descriptions for each score, and a forced output format for easy parsing.

You are a strict evaluation judge. Your task is to assess whether a generated
answer is faithful to the provided source context.

FAITHFULNESS RUBRIC:
- 1 (Unfaithful): The answer makes claims directly contradicted by the context,
  or introduces fabricated facts not present in any retrieved document.
- 2 (Mostly unfaithful): More than 30% of the claims in the answer cannot be
  traced to the source context.
- 3 (Partially faithful): The answer is mostly grounded but contains 1-2 claims
  that go beyond what the context supports.
- 4 (Faithful): All claims are supported by the context. Minor paraphrasing is
  acceptable.
- 5 (Strictly faithful): Every claim is directly traceable to a specific sentence
  in the context. No extrapolation.

IMPORTANT: Before giving your score, reason step by step through each major
claim in the answer and identify its source in the context (or note its absence).

QUERY: {query}
RETRIEVED CONTEXT: {context}
GENERATED ANSWER: {answer}

Provide your response in this exact format:
REASONING: [your step-by-step claim analysis]
SCORE: [integer 1-5]
VERDICT: [one sentence summary]

And a helpfulness rubric for when faithfulness alone is insufficient:

You are an expert evaluator for a {domain} knowledge assistant.
Assess how helpful the generated answer is for a {user_persona} asking this question.

HELPFULNESS RUBRIC:
- 1 (Not helpful): The answer does not address the question, is factually wrong,
  or would mislead the user into an incorrect action.
- 2 (Minimally helpful): The answer partially addresses the question but omits
  critical information the user needs to act on it.
- 3 (Somewhat helpful): The answer addresses the main question but lacks depth,
  specificity, or actionable guidance.
- 4 (Helpful): The answer fully addresses the question with appropriate detail
  and actionable next steps.
- 5 (Exceptionally helpful): The answer addresses the question, anticipates
  follow-up needs, and includes relevant caveats or considerations the user
  should be aware of.

DOMAIN-SPECIFIC NOTE: {domain_note}
For this domain, an answer scoring below 3 on the completeness of {critical_dimension}
should automatically receive a score of 2 or lower, regardless of other dimensions.

QUERY: {query}
ANSWER: {answer}
REFERENCE ANSWER (for comparison): {reference}

Provide your response in this exact format:
REASONING: [your analysis]
SCORE: [integer 1-5]
CRITICAL_DIMENSION_COVERED: [yes/no]

Note the domain_note and critical_dimension placeholders. These are where domain-specific expertise enters the judge prompt — the place where a generic RAGAS metric cannot follow you. For a legal assistant you might set critical_dimension to "exception clauses and carve-outs." For a medical assistant it might be "contraindications and dosage warnings." This single parameterization is the difference between a generic judge and one that catches domain-critical failures.
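
Wiring either rubric into code is straightforward. Here is a minimal sketch of a judge call plus a parser for the forced output format; the model name is an assumption, and temperature 0 keeps scores reproducible:

# Run a rubric prompt through a judge model and parse REASONING/SCORE/VERDICT.
import re
from openai import OpenAI

client = OpenAI()

def judge(prompt_template: str, **fields) -> dict:
    prompt = prompt_template.format(**fields)
    resp = client.chat.completions.create(
        model="gpt-4.1",   # assumed judge model
        temperature=0,     # reproducible scores
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    score = re.search(r"SCORE:\s*([1-5])", text)
    if score is None:
        # Malformed judge output should surface as an eval failure, not a silent 0.
        raise ValueError(f"Judge output did not match the expected format:\n{text}")
    verdict = re.search(r"VERDICT:\s*(.+)", text)
    return {
        "score": int(score.group(1)),
        "verdict": verdict.group(1).strip() if verdict else "",
        "raw": text,
    }

# Usage: judge(FAITHFULNESS_PROMPT, query=q, context=ctx, answer=ans)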

Pairwise vs pointwise judging

Pointwise judging scores a single response against a rubric and returns an absolute score. Pairwise judging presents the judge with two responses and asks which is better. Both have legitimate use cases and neither is universally superior.

When pointwise wins: You need absolute scores for CI thresholds ("faithfulness must be at least 3.5 to pass"), you're running production sampling where you don't have a comparison baseline, or you need scores that are meaningful in isolation. Pointwise scales linearly with the number of responses — evaluating 1,000 responses costs 1,000 judge calls.

When pairwise wins: You're comparing two model versions and want fine-grained discrimination, running an A/B test on prompt variants, or building a preference dataset for RLHF/DPO. Pairwise is better at detecting subtle quality differences that fall within the same pointwise score band. The tradeoff: ranking N candidate responses against each other requires N*(N-1)/2 comparisons, which becomes expensive fast, and you lose absolute score interpretability.

A practical finding from recent research: pairwise preferences flip in approximately 35% of cases when the positions of the two responses are swapped, compared to about 9% instability for absolute pointwise scores. This is a concrete reason to prefer pointwise for anything where reproducibility matters — CI gates, regression tests, SLA monitoring. Reserve pairwise for offline comparative analysis where you can run both orderings and aggregate.

There is also a hybrid worth knowing: best-of-N selection. Generate N candidate responses, use pointwise to score each, return the highest-scoring one. This is a generation strategy rather than an evaluation strategy, but it shares the same judge infrastructure and often delivers a better quality floor than any individual response, especially for high-stakes outputs.
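
A sketch, reusing the same pointwise judge infrastructure. Here generate and score_fn are hypothetical helpers: a generation call and a pointwise judge returning a numeric score:

# Best-of-N selection: generate N candidates, score each pointwise, keep the best.
def best_of_n(query: str, context: str, generate, score_fn, n: int = 4) -> str:
    candidates = [generate(query, context) for _ in range(n)]
    return max(candidates, key=lambda c: score_fn(query, context, c))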

Common biases and how to mitigate them

LLM judges inherit the biases of their training data and exhibit systematic preferences that are independent of response quality. Not understanding these biases is how you end up with an eval pipeline that produces confidently wrong results.

Position bias: In pairwise evaluation, LLM judges prefer whichever response appears first in the prompt — regardless of quality. The effect is well-documented in the MT-Bench paper and has been reproduced across judge models. Mitigation: always run both orderings (A then B, B then A) and aggregate. If the judge flips its preference when you swap positions, the difference is not statistically meaningful.
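
In code, the both-orderings mitigation looks like this. pairwise_judge is a hypothetical call returning "A" or "B" for whichever response it prefers as presented:

# Run both orderings; only count a preference that survives the position swap.
def debiased_preference(query: str, resp_a: str, resp_b: str, pairwise_judge) -> str:
    first = pairwise_judge(query, resp_a, resp_b)    # A shown first
    second = pairwise_judge(query, resp_b, resp_a)   # B shown first
    # Map the swapped call's verdict back to the original labels.
    second_mapped = "A" if second == "B" else "B"
    if first == second_mapped:
        return first     # consistent preference under both orderings
    return "tie"         # verdict flipped on swap: difference is not meaningful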

Length bias: Judges systematically prefer longer responses. A 400-word answer to a question that deserves 80 words will often outscore the concise answer on helpfulness dimensions. Mitigation: explicitly instruct the judge that length is not a proxy for quality and add a rubric criterion penalizing unnecessary verbosity.

Self-enhancement bias: When GPT-4 judges responses from GPT-4 vs Claude, it tends to prefer GPT-4 outputs even when humans prefer Claude's outputs — and vice versa for Claude-as-judge. Mitigation: use a different family of models for judge vs system under test, or calibrate against human labels to measure the magnitude of the bias before relying on cross-model evaluation.

Rubric position bias: In pointwise Likert evaluation, judges tend to prefer score options that appear at specific positions in the rubric description. A rubric that lists scores from 5 (best) down to 1 (worst) will produce different score distributions than one listed 1 to 5, even with identical criteria. Mitigation: run calibration experiments with multiple rubric orderings, or use a balanced permutation strategy that randomizes the score ordering across evaluation calls.

Verbosity of reasoning bias: Judges rate responses with longer, more elaborate reasoning chains higher, independent of whether the reasoning is correct. This is particularly dangerous for domains where confident wrong reasoning sounds more impressive than a terse correct answer. Mitigation: separate evaluation of "is the answer correct" from "is the reasoning well-articulated" into distinct rubric dimensions.

Lesson learned

We ran a helpfulness eval where GPT-4.1 judged outputs from both GPT-4.1 and Claude Sonnet 4.6 on the same customer support golden set. GPT-4.1 gave its own outputs a mean helpfulness of 4.1 and Claude's outputs a mean of 3.7. When we flipped to Claude as judge, the scores reversed: Claude-generated answers scored 4.2, GPT-4.1 answers scored 3.6. Human annotators rated both models at 3.9 on average, with no statistically significant difference. Neither judge was reliable for cross-model comparison. We switched to a smaller open-source judge (Prometheus-7B) fine-tuned specifically for rubric-following and got kappa 0.71 against humans — usable, consistent, no systematic model preference.

Calibrating judges against humans (Cohen's kappa)

An LLM judge is only as trustworthy as its agreement with human judgment on your specific domain and rubric. Calibration is the process of measuring and improving that agreement before you hand the judge authority over CI gates or production quality metrics.

The right metric for calibration is Cohen's kappa, not raw percentage agreement. Raw percentage agreement is misleading on imbalanced distributions — if 85% of your outputs are "acceptable," a judge that always predicts acceptable hits 85% agreement while providing zero signal. Cohen's kappa corrects for chance agreement and gives you the real discriminative ability of the judge. Target kappa of 0.7 or above. Below 0.6, the rubric is ambiguous and needs revision.

Here is what the calibration iteration cycle looks like in practice, with real numbers from a document Q&A project we ran:

Iteration | What changed | % Agreement | Cohen's Kappa
v1 (baseline) | RAGAS faithfulness, no custom rubric | 74% | 0.41
v2 | Added custom 5-point rubric, chain-of-thought reasoning | 79% | 0.54
v3 | Added anchor examples for scores 2, 3, 4 (most ambiguous) | 83% | 0.63
v4 | Added domain-specific completeness dimension, split faithfulness from completeness | 86% | 0.71
v5 | Randomized rubric score ordering, switched to open-source judge fine-tuned on domain | 88% | 0.78

Table: Inter-judge agreement between LLM judge and human raters across rubric iterations. N=120 examples rated by 2 domain experts and the LLM judge. Cohen's kappa computed on collapsed 3-class labels (poor / acceptable / good) to reduce noise from boundary cases.

The jump from v1 to v4 is almost entirely attributable to rubric quality, not to the judge model. This is the most important calibration lesson: before you upgrade your judge model, fix your rubric. Ambiguous criteria are the root cause of low kappa, not judge model capacity. The rubric itself is a serious prompt — the patterns in advanced prompt engineering for production apply directly.

Practically, the calibration process is as follows (a kappa computation sketch follows the list):

  1. Sample 80-120 examples from your golden set.
  2. Have 2 domain experts annotate each example independently with your rubric.
  3. Compute inter-annotator kappa between the two humans. If it is below 0.7, the rubric is ambiguous even for humans — fix the rubric before involving the LLM judge at all.
  4. Run the LLM judge on the same 80-120 examples.
  5. Compute kappa between the LLM judge and the average human score.
  6. Identify disagreement patterns — look at examples where judge and humans diverge by 2+ points — and update the rubric with clearer anchor descriptions for those cases.
  7. Repeat until kappa exceeds 0.7.
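
Steps 3 and 5 are a few lines with scikit-learn. A sketch using the collapsed 3-class labels from the table above; averaging the judge's agreement with each expert is one simple convention for step 5:

# Inter-annotator kappa first, then judge-vs-human kappa.
from sklearn.metrics import cohen_kappa_score

human_a   = ["good", "acceptable", "poor", "good", "acceptable"]   # expert 1
human_b   = ["good", "acceptable", "acceptable", "good", "poor"]   # expert 2
llm_judge = ["good", "good", "poor", "good", "acceptable"]         # LLM judge

inter_human = cohen_kappa_score(human_a, human_b)
if inter_human < 0.7:
    print(f"Humans disagree (kappa={inter_human:.2f}); fix the rubric first.")

judge_kappa = (cohen_kappa_score(llm_judge, human_a)
               + cohen_kappa_score(llm_judge, human_b)) / 2
print(f"Judge-human kappa: {judge_kappa:.2f}")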

Run this calibration exercise whenever you update the rubric, change the judge model, or add a new evaluation dimension. It takes half a day but it is the only way to know whether your automated scores mean anything.

Judge model selection: accuracy vs cost

The judge model choice is a tradeoff between agreement with humans, cost per eval call, latency, and whether the judge has the domain knowledge to understand your content. The provider-by-provider trade-offs are covered in detail in Mistral vs OpenAI vs Anthropic.

Here is a practical breakdown of the main options as of mid-2026:

GPT-4.1 and Claude Sonnet 4.6: Strong out-of-the-box agreement with humans on general-purpose rubrics. GPT-4.1 costs approximately $0.002-0.008 per eval call for a typical 1,500-token judge prompt; Claude Sonnet 4.6 is in a similar range. Both reach kappa 0.65-0.75 on well-designed rubrics without fine-tuning. The self-enhancement bias noted above is a real concern when judging same-family outputs. Use these as your calibration baseline before considering alternatives.

Smaller open-source judges (Prometheus-2, JudgeLM, fine-tuned Mistral 7B/8x7B): Purpose-built judge models fine-tuned specifically for rubric following. Prometheus-2 (7B) reaches kappa competitive with GPT-4 on structured rubrics while costing approximately $0.0001-0.001 per eval call on a self-hosted or dedicated inference endpoint — 10-50x cheaper than frontier models. The tradeoff: it requires domain-specific fine-tuning or prompt engineering to handle niche vocabulary, and it is less reliable on free-form, open-ended rubrics.

Embedding-based classifiers: For binary pass/fail criteria with enough labeled examples (200+), a fine-tuned classifier on text-embedding-3-large or similar can reach kappa 0.70-0.80 at under $0.0001 per call. Low latency, fully deterministic, trivially reproducible. This approach is massively underused. If your binary criterion has 200+ labeled examples, train a classifier first and use an LLM judge only for the cases where the classifier has low confidence.
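
A sketch of that pattern; the embedding model, confidence threshold, and labels are assumptions:

# Binary pass/fail classifier on embeddings, with low-confidence fallback
# to the LLM judge. Requires 200+ human-labeled examples.
import numpy as np
from openai import OpenAI
from sklearn.linear_model import LogisticRegression

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

def train_checker(outputs: list[str], labels: list[int]) -> LogisticRegression:
    # labels: 1 = pass, 0 = fail, from human annotation
    return LogisticRegression(max_iter=1000).fit(embed(outputs), labels)

def check(clf: LogisticRegression, output: str, confidence: float = 0.8) -> str:
    p = clf.predict_proba(embed([output]))[0, 1]
    if p >= confidence:
        return "pass"
    if p <= 1 - confidence:
        return "fail"
    return "escalate"   # low confidence: fall back to the LLM judge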

The practical recommendation:

  • Use GPT-4.1 or Claude Sonnet 4.6 during rubric development and calibration.
  • Once kappa is above 0.7, evaluate whether a smaller fine-tuned judge or classifier can match it at lower cost.
  • For high-volume production sampling (10,000+ evals/day), the cost difference between frontier and fine-tuned open-source is significant: frontier costs $20-80/day, fine-tuned open-source costs $1-8/day. At scale, the investment in fine-tuning a domain-specific judge pays back within weeks.
  • Never use the same model family as judge and system under test for cross-model comparison. The self-enhancement bias will invalidate your results.

For teams using LangSmith or Langfuse, both platforms have native LLM-as-judge evaluation built in with configurable rubrics. Promptfoo and DeepEval are worth evaluating if you need CI integration with custom rubric support and multi-model judge comparison. These tools do not eliminate the calibration work — they make it faster to iterate.

Continuous eval: CI gates and production sampling

Building a good judge and a good golden set is only useful if you actually run them. The mechanics of making evaluation continuous are straightforward but require discipline to maintain as the codebase evolves.

Eval in CI. Your golden set should run on every PR that touches the retrieval pipeline, the system prompt, the chunking logic, or the embedding model. A faithfulness regression blocks the merge. This is not optional sophistication — it is the minimum viable engineering discipline for an LLM system. The tooling is mature: Promptfoo has a GitHub Actions integration, LangSmith has CI hooks, DeepEval has pytest-compatible assertions. Pick one and commit to it.

The CI eval should be fast. A 50-example golden set run against a cached judge model should complete in under 60 seconds if you parallelize the judge calls (most eval frameworks do this by default). If your CI eval takes 10 minutes, it will be skipped. Target under 2 minutes for the full golden set run.
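
A framework-agnostic sketch of that gate as a pytest test. load_golden_set and run_pipeline are hypothetical stand-ins for your own loader and RAG pipeline, judge is the parser helper sketched earlier, FAITHFULNESS_PROMPT is the rubric prompt stored as a template string, and the thresholds are illustrative:

# CI gate: run the golden set through the pipeline and judge in parallel;
# a regression fails the build.
from concurrent.futures import ThreadPoolExecutor

FAITHFULNESS_FLOOR = 3.5   # mean judge score below this blocks the merge

def test_golden_set_faithfulness():
    examples = load_golden_set("golden.jsonl")
    def score(ex):
        answer, context = run_pipeline(ex.query)   # the RAG system under test
        return judge(FAITHFULNESS_PROMPT, query=ex.query,
                     context=context, answer=answer)["score"]
    with ThreadPoolExecutor(max_workers=10) as pool:   # parallel calls keep CI fast
        scores = list(pool.map(score, examples))
    mean = sum(scores) / len(scores)
    assert mean >= FAITHFULNESS_FLOOR, f"Faithfulness regressed: mean {mean:.2f}"
    # Canonical queries are regression tests: none may fall below 3.
    canonical = [s for s, ex in zip(scores, examples) if ex.category == "canonical"]
    assert min(canonical) >= 3, "A canonical query degraded"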

Production sampling cadence. Every week, sample 5-15% of real production queries and run them through your judge pipeline. For most systems this means 50-200 examples per week — enough to detect drift without drowning the team in review. The weekly sample should feed three outputs:

  • A time-series dashboard of your key eval metrics. You want to see the trend, not just the point-in-time score. A faithfulness score of 0.78 that was 0.84 three weeks ago is a signal. A faithfulness score of 0.78 with no historical context is noise.
  • A flag on the 10 lowest-scoring examples from the week for human review. These are your highest-signal candidates for golden set expansion.
  • An alert if the weekly median score drops by more than a defined threshold from the 4-week rolling average. This is your early warning system for silent degradation (a minimal threshold check is sketched after this list).
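
The alert logic is simple enough to write directly. The 0.3-point threshold here is illustrative and should be tuned per metric:

# Drift alert: compare this week's median judge score to the 4-week rolling average.
import statistics

def drift_alert(weekly_scores: list[float], prior_medians: list[float],
                threshold: float = 0.3) -> bool:
    median_now = statistics.median(weekly_scores)
    rolling_avg = sum(prior_medians[-4:]) / len(prior_medians[-4:])
    return (rolling_avg - median_now) > threshold

# This week dipped against a stable baseline, so the alert fires.
assert drift_alert([3.1, 3.4, 2.9, 3.2], prior_medians=[3.8, 3.7, 3.9, 3.8])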

Human-in-the-loop sampling. At 5-15% production sampling, you cannot have humans review every query. What you can do is have domain experts review 20-30 examples per week — the lowest-scoring flagged examples from the LLM judge plus a random sample for calibration. This serves two purposes: it catches what the automated judge misses (LLM judges have systematic blind spots, especially for domain-specific correctness), and it keeps the eval infrastructure honest by continuously measuring whether judge-human kappa is holding.

For the eval infrastructure itself, RAGAS works well as a baseline metric provider even when you have custom judges on top — use it for the generic dimensions (faithfulness, answer relevance) and reserve your custom judges for the domain-specific dimensions. There is no value in rebuilding faithfulness from scratch when RAGAS already implements it reasonably well.

Lesson learned

A customer support RAG we maintained had stable eval scores for 8 weeks after launch. In week 9, a product update introduced a new pricing tier. The golden set did not cover this topic. Production sampling flagged it: 34% of queries about the new tier scored below the faithfulness threshold, compared to 6% overall. We caught it in week 9 because we were sampling. Without production sampling, this would have been discovered from user complaints 3-4 weeks later. Golden set coverage of new topics always lags production by 2-4 weeks — sampling is the bridge.

When to ditch LLM judges and use deterministic checks

LLM judges are powerful and expensive. Deterministic checks are trivial and free. The mistake is using an LLM judge for things that have a correct and verifiable answer.

Use deterministic checks for:

  • Format compliance: Is the output valid JSON? Does it match the expected schema? Does it contain a required field? These are regex or schema validation checks — zero cost, 100% reproducible.
  • Citation presence: Does the answer reference document IDs from the retrieved context? Simple string matching.
  • Forbidden phrase detection: Does the output contain any string from your disallowed list (competitor names, deprecated product names, legally sensitive phrases)? Another regex check.
  • Numeric range validation: If the answer contains a date, price, or numeric claim, is it within an expected range? Parse and validate.
  • Refusal detection: For queries outside scope, does the system refuse rather than hallucinate? A simple intent classifier or keyword match catches most cases.

The rule of thumb: if you could write a unit test for it in 10 lines of Python, write the unit test. Do not spend $0.005 per eval call asking an LLM to check whether the output is valid JSON. Running an LLM judge on mechanically verifiable criteria adds noise (LLMs occasionally get these wrong), cost, and latency for zero additional signal over a deterministic check.
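
A sketch of those layer-one checks; the forbidden list and required schema fields are illustrative:

# Deterministic checks: free, fast, 100% reproducible. Run on every output.
import json
import re

FORBIDDEN = ["CompetitorCo", "old-product-name"]   # illustrative disallowed list

def valid_json_with_fields(output: str, required: set[str]) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required <= data.keys()

def cites_retrieved_docs(output: str, retrieved_doc_ids: list[str]) -> bool:
    return any(doc_id in output for doc_id in retrieved_doc_ids)

def contains_forbidden_phrase(output: str) -> bool:
    return any(re.search(re.escape(p), output, re.IGNORECASE) for p in FORBIDDEN)

def run_hard_checks(output: str, retrieved_doc_ids: list[str]) -> bool:
    # Failures here are hard failures; no LLM judge call is needed.
    return (valid_json_with_fields(output, {"answer", "citations"})
            and cites_retrieved_docs(output, retrieved_doc_ids)
            and not contains_forbidden_phrase(output))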

The right architecture is a layered eval stack:

  1. Fast deterministic checks run first — format, forbidden phrases, citation presence. These run on 100% of outputs in under 1ms. Failures are hard failures.
  2. Domain-specific LLM judge runs second — faithfulness, completeness, tone, helpfulness. These run on the golden set in CI and on 5-15% production samples. Each call costs $0.001-0.01 depending on judge model and prompt length.
  3. Human review runs last — a weekly sample of the lowest-scoring LLM-judged examples, plus random sampling to keep calibration honest.

This stack gives you broad coverage at low cost, deep quality signal where it matters, and a human feedback loop that prevents systematic judge drift. The teams that over-index on LLM judges for everything end up with eval infrastructure that is expensive to run and slow to iterate — and they often discover too late that a simple regex would have caught 40% of their production failures faster and cheaper.

For context on how this connects to the broader RAG engineering picture, see our piece on Agentic RAG — in agentic pipelines, eval becomes even more critical because you are judging multi-step reasoning chains, not just individual responses. The same principles apply but the rubric complexity increases significantly. You may also find our RAG technical guide useful for grounding the retrieval context that feeds your evaluation inputs.


Frequently asked questions

What is an LLM-as-judge pipeline?
An LLM-as-judge pipeline uses a language model (the judge) to evaluate the outputs of another language model (the system under test). The judge receives a prompt containing the input, the system output, and an optional reference answer, then returns a score and reasoning. It replaces manual human annotation for high-volume evaluation, provided the judge's scores are calibrated against human ground truth.

Why is RAGAS faithfulness not enough for domain-specific evaluation?
RAGAS faithfulness measures whether the generated answer is entailed by the retrieved context, using a generic NLI-style decomposition. In domain-specific contexts — legal, medical, financial — faithfulness alone is insufficient. A legally correct answer can be faithful to the retrieved text but omit a critical exception clause. A medical answer can be fully grounded yet miss a contraindication present in the same document. Generic metrics do not encode the domain-specific notion of correctness your users actually care about.

How many examples does a golden dataset need?
50 high-quality examples outperform 500 poorly curated ones. Start with 50 examples drawn from real production queries or realistic simulations, with a mix of easy cases, known failure modes, and edge cases. Grow to 150-200 as production traffic matures. Do not build a 500-example golden set before you have 50 real production queries to learn from — you will spend 80% of the curation effort on distributions that do not match actual usage.

What is the difference between pointwise and pairwise judging?
Pointwise judging scores a single response against a rubric (e.g., 1-5 Likert scale). Pairwise judging presents the judge with two responses side by side and asks which is better. Pairwise is better at detecting fine-grained quality differences but introduces position bias (the judge prefers whichever response appears first in about 30-40% of cases) and does not produce absolute scores needed for CI thresholds. For most production pipelines, pointwise with a well-designed rubric and randomized score permutations is the right default.

What Cohen's kappa should an LLM judge reach before you trust it?
Target Cohen's kappa of 0.7 or above before trusting the judge for automated decisions. A kappa above 0.8 is excellent. Raw percentage agreement above 80% sounds good but is misleading — on skewed distributions where most outputs are acceptable, you can hit 85% agreement by always predicting acceptable. Kappa corrects for chance agreement. If your kappa is below 0.6, your rubric is ambiguous and needs more granular criteria or anchor examples.

When should you use deterministic checks instead of an LLM judge?
Use deterministic checks for anything that has a provably correct answer: format compliance (valid JSON, correct date format), citation presence, numeric range validation, forbidden phrase detection, and regex-verifiable assertions. These checks are free, fast, 100% reproducible, and do not suffer from position or length bias. Reserve LLM judges for criteria that genuinely require semantic understanding: coherence, tone, helpfulness, domain correctness. Running an LLM judge on a JSON schema check is wasteful and adds variance.
Anas Rabhi, Data Scientist & Founder, Tensoria

I am a data scientist specializing in generative AI and LLM systems. I help engineering teams ship production-grade AI with rigorous evaluation infrastructure — from golden dataset design and judge calibration to continuous eval in CI. I have seen what happens when teams skip this work and what it takes to fix it.