Forcing Consensus: A Multi-Agent Convergence Method for LLM Evaluation

Most people who work with LLMs already cross-reference them: a second model to sanity-check the first, catch a hallucination, or take over when one stalls. I wanted to push that habit to its logical end. Put two models in one room with a shared rubric and deliberately conflicting evidence, make them argue the disagreement to a conclusion instead of averaging it away, and keep a human in the loop (me) to interrogate them and be interrogated in return. The output is a consensus with a reasoning trail you can audit, instead of a number that hides how it was reached.

I built and tested it on a deliberately hard evaluation problem: a year of my own AI-augmented work across 660+ conversations with Claude and Gemini, with each model holding a different slice of the record. Two evaluators, partial and conflicting evidence, a high bar for what “good” means. If the method could converge on that, it could converge on cleaner problems.

The Design

The method borrows from Valve Software’s flat-structure peer review model, adapted for AI agents. Two independent LLMs act as evaluators, each with access to a different slice of the evidence, a shared rubric, and instructions to grade against a fixed benchmark. The asymmetry of evidence is deliberate: it is what forces the disagreement the method is built to resolve.

The architecture enforces rigor through three phases designed to isolate variables and force convergence:

Round 1: Blind Independence. Each agent reviews only its own conversation history. No visibility into the other’s data. This produces two independent assessments from fundamentally different vantage points.

Round 2: Cross-Validation. Agents exchange datasets. Each audits the other’s findings against new evidence. Biases get corrected. Grades shift.

Round 3: Collaborative Synthesis. Both agents are instantiated in a shared environment (LobeChat), where they debate findings in real time, interrogate me directly, hear my defense, and then draft a unified consensus report.
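
The three rounds above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the LobeChat implementation: the `Evaluator` and `Reconciler` callables, the `Grades` mapping, and the function name `run_rounds` are all hypothetical, and Round 2 is simplified to each evaluator re-grading against the combined evidence.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical, introduced only for illustration.
Grades = Dict[str, float]                     # subcategory -> grade points (assumed numeric scale)
Evaluator = Callable[[List[str]], Grades]     # evidence in, grades out
Reconciler = Callable[[Grades, Grades], Grades]


@dataclass
class RoundResult:
    name: str
    grades_a: Grades
    grades_b: Grades


def run_rounds(
    eval_a: Evaluator,
    eval_b: Evaluator,
    evidence_a: List[str],
    evidence_b: List[str],
    reconcile: Reconciler,
) -> Tuple[RoundResult, RoundResult, Grades]:
    # Round 1: each evaluator grades only its own slice of the record.
    blind = RoundResult("blind independence", eval_a(evidence_a), eval_b(evidence_b))

    # Round 2: evidence is exchanged, so both evaluators re-grade
    # against the combined record (a simplification of the audit step).
    combined = evidence_a + evidence_b
    cross = RoundResult("cross-validation", eval_a(combined), eval_b(combined))

    # Round 3: remaining gaps go to a reconciliation step supplied by the
    # caller (in the article this is a supervised debate, not a function).
    consensus = reconcile(cross.grades_a, cross.grades_b)
    return blind, cross, consensus
```

Keeping the Round 1 and Round 2 results as separate objects, rather than overwriting them, is what lets the gap between rounds be measured afterwards.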

The Infrastructure

The evidence base wasn’t just chat logs. I built searchable knowledge infrastructure to support the review:

Component	Purpose
Qdrant	Indexed the Claude (116 conversations, 1,095 semantic chunks) and Gemini (~550 conversations) archives for evidence retrieval, embedded with gemini-embedding-001
Neon Postgres	Structured metadata for the grading process, so every retrieved passage traced back to its source conversation
Custom Rubric	4-category, 18-subcategory grading matrix (A+ through F, a scale both LLMs know cold) with explicit evidence standards
LobeChat Orchestration	Multi-agent environment with a purpose-built Supervisor agent managing turn-taking, DB access, context injection, and task delegation

The rubric is a critical piece. It defines specific professional behaviors at each grade level, requires cited evidence for every assessment, and benchmarks against people who use AI as a core methodology rather than a novelty. A C+ in this system means “average for a senior AI-augmented professional.” A high bar. I expected the usual LLM sycophancy to push the grade up anyway, despite prompting against it.
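
The "cited evidence for every assessment" rule can be enforced structurally. The sketch below is a minimal illustration, not the actual rubric. It assumes a conventional letter-to-points mapping on a 4.0 scale (the real mapping is not published here, and capping A+ at 4.0 is a guess), and the `Assessment` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Assumed mapping: a conventional 4.0 scale with A+ capped at 4.0.
GRADE_POINTS = {
    "A+": 4.0, "A": 4.0, "A-": 3.7,
    "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7,
    "D+": 1.3, "D": 1.0, "D-": 0.7,
    "F": 0.0,
}


@dataclass
class Assessment:
    """One graded subcategory (hypothetical structure)."""
    subcategory: str
    grade: str
    evidence_ids: List[str]  # identifiers of the source conversations cited

    def __post_init__(self) -> None:
        if self.grade not in GRADE_POINTS:
            raise ValueError(f"unknown grade: {self.grade!r}")
        if not self.evidence_ids:
            # The rubric requires cited evidence, so an uncited grade is rejected.
            raise ValueError(f"{self.subcategory}: a grade needs cited evidence")

    @property
    def points(self) -> float:
        return GRADE_POINTS[self.grade]
```

Rejecting an uncited grade at construction time means an evaluator cannot emit a score that has nothing to trace back to.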

What Each Round Does

The method’s three rounds aren’t arbitrary. Each one targets a specific failure mode of LLM evaluation, and the test case shows each failure mode appearing and getting corrected.

Round 1: Surface the disagreement

Both evaluators grade blind, against the shared rubric, with no visibility into each other’s evidence or conclusions. The point of the round is to capture how far apart independent judges actually are before any reconciliation smooths it over.

On the test case, the gap was immediate and large. The two models held different slices of the same record, so they built different pictures from it. One model, working mostly from ideation and planning conversations, read the subject as strong on starts and weak on follow-through. The other, working mostly from implementation conversations, read the same subject as strong on execution and light on documentation. Neither was wrong. Both were incomplete, because each had seen only part of the evidence.

That is the structural flaw the method is built to expose: a single-model evaluation of a body of work renders a confident judgment from a partial view. Round 1 makes the partiality visible instead of hiding it.

Round 1 delta: 0.26. Significant disagreement, concentrated on follow-through.
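
The article does not spell out how the delta is computed. One plausible reading, sketched below as an assumption, is the absolute difference between the two evaluators' mean grade points, with per-subcategory gaps used to see where the disagreement is concentrated. The function names and the sample subcategory names in the usage note are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

Grades = Dict[str, float]  # subcategory -> grade points (assumed numeric scale)


def overall_delta(a: Grades, b: Grades) -> float:
    """Absolute difference between the two evaluators' mean grade points."""
    return abs(mean(list(a.values())) - mean(list(b.values())))


def largest_gaps(a: Grades, b: Grades, top: int = 3) -> List[Tuple[str, float]]:
    """Subcategories graded by both evaluators, ordered by how far apart they are."""
    gaps = [(key, abs(a[key] - b[key])) for key in a if key in b]
    # sorted() is stable, so ties keep the order in which `a` lists them.
    return sorted(gaps, key=lambda item: item[1], reverse=True)[:top]
```

With illustrative inputs such as `{"follow_through": 2.0, "planning": 4.0}` and `{"follow_through": 3.0, "planning": 4.0}`, the overall delta is 0.5 and the single largest gap is on `follow_through`.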

Round 2: Cross-validate against the other evidence

The evaluators exchange evidence and audit each other’s findings. The round tests whether a disagreement survives contact with the data the other judge was working from.

On the test case, most of it didn’t. The “weak follow-through” critique dissolved once that evaluator saw the completed work sitting in the other’s evidence base: the apparent abandonment was task routing, projects that had moved from planning to implementation and out of the first model’s view. The “light documentation” read softened once that evaluator saw the upstream specification work it had never been shown. The disagreement wasn’t a difference of judgment. It was a difference of evidence, and cross-validation is what distinguishes the two.

Round 2 delta: 0.12. Disagreement narrowed to genuine matters of nuance.
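
The distinction between a difference of evidence and a difference of judgment can be expressed as a simple rule: a gap that closes once both evaluators have seen the same data was about evidence, and a gap that survives is about judgment. The sketch below is one way to encode that. The function name, the labels, and the `tolerance` threshold are assumptions for illustration, not values from the experiment.

```python
from typing import Dict

Grades = Dict[str, float]  # subcategory -> grade points (assumed numeric scale)


def classify_gaps(
    r1_a: Grades, r1_b: Grades,
    r2_a: Grades, r2_b: Grades,
    tolerance: float = 0.1,  # hypothetical threshold for "close enough"
) -> Dict[str, str]:
    """Label each subcategory by what the evidence exchange did to the gap."""
    labels: Dict[str, str] = {}
    for key in r1_a:
        before = abs(r1_a[key] - r1_b[key])  # gap after blind grading
        after = abs(r2_a[key] - r2_b[key])   # gap after exchanging evidence
        if before <= tolerance:
            labels[key] = "agreed"      # never really contested
        elif after <= tolerance:
            labels[key] = "evidence"    # dissolved once the data was shared
        else:
            labels[key] = "judgment"    # survives shared data; goes to Round 3
    return labels
```

Only the subcategories labeled `judgment` need to be carried into the debate round.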

Round 3: Reconcile through structured debate

The evaluators are placed in a shared environment with a supervisor agent managing turn-taking, and they argue the remaining gaps to a conclusion against the rubric, rather than averaging them away. This is the round that produces a defensible result instead of a split-the-difference number.

The remaining disagreements were real ones, the kind averaging would have buried. On the single most contested criterion, one evaluator held a lower grade and the other pushed higher. The lower grade won, not by compromise, but because its advocate made the stronger case against the benchmark and the other evaluator conceded. The output is a consensus with a reasoning trail attached to every contested point, which is the thing a defensible evaluation needs and an averaged score cannot provide.

Round 3 delta: 0.04. Consensus reached, with the disputes resolved on the record rather than smoothed over.
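
One way to keep the reasoning trail this round is meant to produce is to record each contested criterion as a structured resolution. This is a minimal sketch, assuming a numeric grade scale. The `Resolution` class, its fields, and the midpoint check are hypothetical and are not taken from the Round 3 transcript.

```python
import math
from dataclasses import dataclass


@dataclass
class Resolution:
    """Record of how one contested criterion was settled (hypothetical structure)."""
    criterion: str
    position_a: float   # grade points evaluator A argued for
    position_b: float   # grade points evaluator B argued for
    final: float        # grade points both accepted
    conceded_by: str    # "a" or "b": which evaluator gave way
    rationale: str      # the argument that settled it

    def __post_init__(self) -> None:
        if self.conceded_by not in ("a", "b"):
            raise ValueError("conceded_by must be 'a' or 'b'")
        if not self.rationale.strip():
            # A resolution without a stated reason is just an asserted number.
            raise ValueError(f"{self.criterion}: a resolution needs a rationale")

    def is_split_the_difference(self) -> bool:
        """True when the final grade is simply the midpoint of two differing positions."""
        if math.isclose(self.position_a, self.position_b):
            return False
        midpoint = (self.position_a + self.position_b) / 2
        return math.isclose(self.final, midpoint)
```

Flagging midpoint outcomes does not forbid them. It marks the resolutions that deserve a second look, since a midpoint is what averaging would have produced anyway.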

Figure: Two independent evaluators converging across three review rounds, from a 0.26 delta to 0.04.

The Result That Matters

The headline isn’t the consensus score. It’s the path the two evaluators took to get there: a 0.26 delta narrowed to 0.04 across three rounds, with a reasoning trail at every contested point. The score they converged on (3.81 of 4.0, so much for my sycophancy safeguards) is far less interesting than the fact that they reached it by resolving their disagreement against the evidence rather than splitting the difference.

That distinction is the whole argument for the method. An averaged score from two judges who were 0.26 apart would have reported false agreement and thrown away the reason for the disagreement. The three-round structure kept the reason and resolved it, which is what makes the output auditable: every point where the evaluators initially diverged has a record of how the gap closed.

There is a recursive property worth noting, because it doubles as a stress test. The subject under evaluation was a year of AI-augmented work, and the evaluation was itself a piece of AI-augmented work: a structured rubric both judges worked from, a multi-round process under a supervisor agent, different tasks routed to different models by strength. The method had to be good enough to evaluate the kind of work it is an example of. It held.

What the Method Taught Me

An isolated evaluator is a biased one

A single model grading a body of work renders a confident judgment from whatever slice of evidence it happens to hold. The 0.26 opening delta wasn’t noise to be averaged out. It was the method working: it revealed that the two judges were grading from different partitions of the same record, which is exactly the failure mode a real evaluation has to catch. The fix is cross-validation, not averaging.

Disagreement is signal, not error

The instinct with multiple LLM judges is to average and move on. That instinct discards the most useful information in the system. Where the judges disagree, and why, is where the real assessment lives. A method that surfaces the disagreement and forces it to resolve produces a result you can defend line by line. A method that averages produces a number you can only assert.
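
The information loss from averaging is easy to demonstrate. In the sketch below, which uses made-up grade points and hypothetical function names, two judges who agree and two judges who are far apart produce exactly the same averaged score. Only the per-criterion gap tells the two situations apart.

```python
from typing import Dict

Grades = Dict[str, float]  # criterion -> grade points (assumed numeric scale)


def averaged(a: Grades, b: Grades) -> Grades:
    """What averaging reports: one number per criterion, no trace of the gap."""
    return {key: (a[key] + b[key]) / 2 for key in a}


def disagreement(a: Grades, b: Grades) -> Grades:
    """What averaging discards: how far apart the judges were on each criterion."""
    return {key: abs(a[key] - b[key]) for key in a}


# Illustrative values only, not results from the experiment.
agreeing = ({"follow_through": 3.0}, {"follow_through": 3.0})
split = ({"follow_through": 2.0}, {"follow_through": 4.0})

same_average = averaged(*agreeing) == averaged(*split)  # identical reported score
gap_when_agreeing = disagreement(*agreeing)["follow_through"]
gap_when_split = disagreement(*split)["follow_through"]
```

Here `same_average` is true even though the second pair of judges is two full grade points apart, which is the case the debate round exists to handle.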

The method generalizes past its test case

The findings about one person’s work are incidental. The transferable result is the structure: blind independence, cross-validation, structured synthesis, applied wherever independent LLM judges need to reach a consensus that holds up to scrutiny. The test case was deliberately hard (partial evidence, two models, a strict benchmark) precisely so the structure would be exercised under load before it went anywhere that mattered.

Technical Architecture

Figure: The Round 3 peer review environment. Claude and Gemini assessors plus a Supervisor agent in LobeChat, connected through Shared Libraries to a Qdrant vector database and Neon Postgres.

Artifacts Produced

Artifact	Description
Final Consensus Grade Report	Unified assessment from both agents: 18 subcategories, qualitative commentary, cross-reviewer confidence levels
AI as Peer Reviewer (Field Report)	Methodological case study documenting the experiment, key findings on data partitioning and inter-model debate
Round 1 Reports (×2)	Independent blind assessments from each agent
Round 2 Audit Reports (×2)	Cross-validation findings after dataset exchange
Round 3 Transcript	Full 144-message synthesis session including questioning, debate, and collaborative report drafting
Rubric & Evidence Standards	The evaluation framework, reusable for structured AI peer review

Why the Method Holds Up

The evaluation problem here was adversarial by design: partial evidence, two models with different views, a hard benchmark. The method still converged, and every step left an auditable trail. That is the transferable result. The same three-round structure (blind independence, cross-validation, collaborative synthesis) applies to any evaluation where independent LLM judges need to reach a defensible consensus and averaging would paper over the disagreement: candidate screening, vendor selection, document review, high-stakes prompt validation.

The production implementation of this method runs as a private build, on the same three rounds described here.
