· 5 min read
Raising Limner: Notes from a Year of Agent Husbandry
An HR consultant asked for pixel art. The discipline that came out of it turned into something I now call agent husbandry.
Published: · 8 min read
Independent LLM evaluators disagree, and averaging their scores hides the disagreement instead of resolving it. Here is a three-round method that surfaces the conflict, reconciles it through structured debate, and converges on an auditable result.
Ask two LLMs to evaluate the same body of work and you get two different answers. Ask the same model twice and you may still get two different answers. The standard fix is to average the scores, which produces a number that looks like agreement while hiding the fact that the evaluators disagreed and why.
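A small sketch makes the problem concrete. The scores below are invented for illustration (they are not data from this evaluation), and the criterion names and the 4.0 scale are assumptions. Two judges can land on the same overall number while disagreeing on every criterion underneath it.

```python
from statistics import mean

# Hypothetical per-criterion scores on an assumed 4.0 scale.
judge_a = {"follow_through": 2.0, "documentation": 4.0}
judge_b = {"follow_through": 4.0, "documentation": 2.0}


def overall(scores):
    """Overall score for one judge: the plain mean of its criterion scores."""
    return mean(scores.values())


def per_criterion_gap(a, b):
    """Absolute gap between two judges on each criterion they both scored."""
    return {k: abs(a[k] - b[k]) for k in a.keys() & b.keys()}


averaged = mean([overall(judge_a), overall(judge_b)])
gaps = per_criterion_gap(judge_a, judge_b)

print(averaged)  # both judges average 3.0, so the combined score is 3.0
print(gaps)      # yet they sit two full points apart on each criterion
```

The averaged number reports no disagreement at all, which is exactly the information loss described above.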
I wanted a method that did the opposite: surface the disagreement first, then reconcile it through structured dialogue against a shared rubric, and output a result with a reasoning trail you can audit.
I built and tested it on a deliberately hard evaluation problem: a year of my own AI-augmented work across 660+ conversations with Claude and Gemini, with each model holding a different slice of the record. Two evaluators, partial and conflicting evidence, a high bar for what “good” means. If the method could converge on that, it could converge on cleaner problems.
The method borrows from Valve Software’s flat-structure peer review model, adapted for AI agents. Two independent LLMs act as evaluators, each with access to a different slice of the evidence, a shared rubric, and instructions to grade against a fixed benchmark. The asymmetry of evidence is deliberate: it is what forces the disagreement the method is built to resolve.
The architecture enforces rigor through three phases designed to isolate variables and force convergence:
Round 1: Blind Independence. Each agent reviews only its own conversation history. No visibility into the other’s data. This produces two independent assessments from fundamentally different vantage points.
Round 2: Cross-Validation. Agents exchange datasets. Each audits the other’s findings against new evidence. Biases get corrected. Grades shift.
Round 3: Collaborative Synthesis. Both agents are instantiated in a shared environment (LobeChat) to debate findings in real time, interrogate me directly, and draft a unified consensus report.
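The three phases can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the implementation used here: `run_three_rounds`, `toy_evaluator`, and `toy_debate` are hypothetical names, an evaluator is modeled as a plain function instead of an LLM call, and Round 2 is simplified to "each evaluator re-scores with both evidence slices".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Scores = Dict[str, float]                       # criterion -> grade points
Evaluator = Callable[[List[str], str], Scores]  # (evidence, rubric) -> scores
Debate = Callable[[Scores, Scores], Scores]     # reconciles two score sets


@dataclass
class RoundResult:
    name: str
    scores_a: Scores
    scores_b: Scores


def run_three_rounds(eval_a: Evaluator, eval_b: Evaluator,
                     evidence_a: List[str], evidence_b: List[str],
                     rubric: str, debate: Debate) -> List[RoundResult]:
    # Round 1: each evaluator sees only its own slice of the evidence.
    blind = RoundResult("blind independence",
                        eval_a(evidence_a, rubric), eval_b(evidence_b, rubric))
    # Round 2: evidence is exchanged, so both evaluators re-score on all of it.
    shared = evidence_a + evidence_b
    cross = RoundResult("cross-validation",
                        eval_a(shared, rubric), eval_b(shared, rubric))
    # Round 3: remaining gaps go to a debate step that returns one score set.
    consensus = debate(cross.scores_a, cross.scores_b)
    return [blind, cross, RoundResult("collaborative synthesis", consensus, consensus)]


def toy_evaluator(evidence: List[str], rubric: str) -> Scores:
    """Stand-in judge: scores follow-through by the share of finished items."""
    done = sum(1 for item in evidence if item.startswith("done:"))
    return {"follow_through": 4.0 * done / len(evidence)}


def toy_debate(a: Scores, b: Scores) -> Scores:
    """Placeholder only: a real debate argues each gap against the rubric."""
    return {k: min(a[k], b[k]) for k in a}


rounds = run_three_rounds(
    toy_evaluator, toy_evaluator,
    ["plan: site audit", "plan: rubric draft"],   # planning-heavy slice
    ["done: site audit", "done: rubric draft"],   # implementation-heavy slice
    "rubric text", toy_debate,
)
for r in rounds:
    print(r.name, r.scores_a, r.scores_b)
```

Even with identical toy judges, the blind round disagrees completely because the slices differ, and the gap closes once both see the full record.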
The evidence base wasn’t just chat logs. I built searchable knowledge infrastructure to support the review:
| Component | Purpose |
|---|---|
| Cloudflare Vectorize | Indexed 116 Claude conversations (1,095 semantic chunks) for evidence retrieval |
| Qdrant | Indexed ~550 Gemini conversations for the same purpose |
| Custom Rubric | 4-category, 18-subcategory grading matrix (A+ through F) with explicit evidence standards |
| LobeChat Orchestration | Multi-agent environment with a Supervisor agent managing turn-taking, context injection, and task delegation |
The rubric is a critical piece. It defines specific professional behaviors at each grade level, requires cited evidence for every assessment, and benchmarks against people who use AI as a core methodology rather than a novelty. A C+ in this system means “average for a senior AI-augmented professional.” That’s already a high bar.
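The "cited evidence for every assessment" rule is easy to enforce structurally. The sketch below is one assumed way to do it; the `Assessment` class, the category and subcategory names, and the exact grade ladder are hypothetical (the rubric runs A+ through F, but the intermediate steps are not listed here).

```python
from dataclasses import dataclass, field
from typing import List

# Assumed grade ladder: plus/minus steps between A+ and F are a guess.
GRADES = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F"]


@dataclass(frozen=True)
class Assessment:
    category: str
    subcategory: str
    grade: str
    evidence: List[str] = field(default_factory=list)  # citations into the record

    def __post_init__(self):
        # Reject grades outside the ladder and any grade with nothing cited.
        if self.grade not in GRADES:
            raise ValueError(f"unknown grade: {self.grade!r}")
        if not self.evidence:
            raise ValueError("every assessment must cite evidence")


# Hypothetical usage: the names and the citation string are illustrative.
ok = Assessment("Execution", "Follow-through", "B+",
                ["conversation 0412: shipped the migration"])
print(ok.grade, len(ok.evidence))
```

Rejecting an uncited grade at construction time means an evaluator cannot emit a judgment the later rounds have no way to audit.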
The method’s three rounds aren’t arbitrary. Each one targets a specific failure mode of LLM evaluation, and the test case shows each failure mode appearing and getting corrected.
Both evaluators grade blind, against the shared rubric, with no visibility into each other’s evidence or conclusions. The point of the round is to capture how far apart independent judges actually are before any reconciliation smooths it over.
On the test case, the gap was immediate and large. The two models held different slices of the same record, so they built different pictures from it. One model, working mostly from ideation and planning conversations, read the subject as strong on starts and weak on follow-through. The other, working mostly from implementation conversations, read the same subject as strong on execution and light on documentation. Neither was wrong. Both were incomplete, because each had seen only part of the evidence.
That is the structural flaw the method is built to expose: a single-model evaluation of a body of work renders a confident judgment from a partial view. Round 1 makes the partiality visible instead of hiding it.
Round 1 delta: 0.26. Significant disagreement, concentrated on follow-through.
The evaluators exchange evidence and audit each other’s findings. The round tests whether a disagreement survives contact with the data the other judge was working from.
On the test case, most of it didn’t. The “weak follow-through” critique dissolved once that evaluator saw the completed work sitting in the other’s evidence base: the apparent abandonment was task routing, projects that had moved from planning to implementation and out of the first model’s view. The “light documentation” read softened once that evaluator saw the upstream specification work it had never been shown. The disagreement wasn’t a difference of judgment. It was a difference of evidence, and cross-validation is what distinguishes the two.
Round 2 delta: 0.12. Disagreement narrowed to genuine matters of nuance.
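The evidence-versus-judgment distinction can be made mechanical. In this sketch, a gap that closes once both evaluators have seen the same data is labeled an evidence difference, and a gap that survives is labeled a judgment difference. The function name, the tolerance, and all scores are assumptions for illustration.

```python
def classify_disagreements(r1_a, r1_b, r2_a, r2_b, tolerance=0.5):
    """Label each criterion by what happened to the gap after evidence exchange."""
    labels = {}
    for criterion in r1_a:
        before = abs(r1_a[criterion] - r1_b[criterion])  # Round 1 gap
        after = abs(r2_a[criterion] - r2_b[criterion])   # Round 2 gap
        if before <= tolerance:
            labels[criterion] = "agreed"      # never a real dispute
        elif after <= tolerance:
            labels[criterion] = "evidence"    # closed once data was shared
        else:
            labels[criterion] = "judgment"    # survives, goes to Round 3
    return labels


# Hypothetical scores on an assumed 4.0 scale.
round1_a = {"follow_through": 2.0, "documentation": 3.0, "rigor": 3.0}
round1_b = {"follow_through": 4.0, "documentation": 3.0, "rigor": 4.0}
round2_a = {"follow_through": 3.5, "documentation": 3.0, "rigor": 3.0}
round2_b = {"follow_through": 3.5, "documentation": 3.0, "rigor": 4.0}

labels = classify_disagreements(round1_a, round1_b, round2_a, round2_b)
print(labels)
```

Only the criteria labeled "judgment" need to go into the synthesis round, which keeps the debate focused on disputes that sharing data could not settle.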
The evaluators are placed in a shared environment with a supervisor agent managing turn-taking, and they argue the remaining gaps to a conclusion against the rubric, rather than averaging them away. This is the round that produces a defensible result instead of a split-the-difference number.
The remaining disagreements were real ones, the kind averaging would have buried. On the single most contested criterion, one evaluator held a lower grade and the other pushed higher; the lower grade won, not by compromise, but because one side made the stronger case against the benchmark and the other conceded it. The output is a consensus with a reasoning trail attached to every contested point, which is the thing a defensible evaluation needs and an averaged score cannot provide.
Round 3 delta: 0.04. Consensus reached, with the disputes resolved on the record rather than smoothed over.
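A reasoning trail is only auditable if every contested point actually has an entry. The sketch below shows one assumed shape for such a record and a check for gaps; the `Resolution` fields, `audit_gaps`, and the sample values are hypothetical, not the format of the real consensus report.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Resolution:
    criterion: str
    position_a: str    # grade evaluator A argued for
    position_b: str    # grade evaluator B argued for
    final_grade: str
    conceded_by: str   # which side gave way
    rationale: str     # the argument that settled it


def audit_gaps(contested: Iterable[str], trail: List[Resolution]) -> List[str]:
    """Return contested criteria that lack a resolution with a stated rationale."""
    explained = {r.criterion for r in trail if r.rationale.strip()}
    return sorted(set(contested) - explained)


# Hypothetical trail with one entry; grades and wording are illustrative.
trail = [
    Resolution("follow_through", "B", "A-", "B", "evaluator_b",
               "Cited work met the B descriptor but not the A- benchmark."),
]
missing = audit_gaps(["follow_through", "documentation"], trail)
print(missing)  # ['documentation']
```

Recording who conceded and why is what separates a resolved dispute from a midpoint: the final grade here is one side's position, not the mean of the two.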
The most useful output of the run wasn’t the consensus grade. It was a pattern the evaluators surfaced that I hadn’t named, and the way they surfaced it is itself a property of the method worth pointing at.
In the synthesis round, both evaluators independently noted that the same structural pattern ran through the work they were assessing: a habit of building pre-structured environments that let a new contributor, human or AI, get productive fast and produce consistent output. I now call it Context-Station Architecture. A context station has four properties:
The rubric used in this very evaluation is an instance of it: a pre-structured environment that let two independent models produce calibrated, comparable assessments without coordinating. The same pattern shows up in jigs and visual work instructions for trade crews, in global room standards for vendor installation teams, and in cold-start documents for AI agents. Different substrates, one design principle.
The point for the method is narrower and more interesting than the pattern itself: two evaluators working from partial, conflicting evidence independently identified the same structural feature before they reconciled. Independent convergence on a finding, arrived at separately and then confirmed in debate, is a stronger validity signal than either judge asserting it alone. That is the kind of result the three-round structure is built to produce.
The headline isn’t the consensus score. It’s the path the two evaluators took to get there: a 0.26 delta narrowed to 0.04 across three rounds, with a reasoning trail at every contested point. The score they converged on (3.81 of 4.0) is far less interesting than the fact that they reached it by resolving their disagreement against the evidence rather than splitting the difference.
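The convergence path can be checked with a few lines of arithmetic. The three deltas are the ones reported above; how each delta was computed is not specified here, so the sketch treats them as given values, and the helper names are hypothetical.

```python
# Deltas reported for Rounds 1, 2 and 3.
deltas = [0.26, 0.12, 0.04]


def narrows_every_round(values):
    """True when each round's delta is strictly smaller than the one before."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


def total_reduction(values):
    """Fraction of the opening gap that was closed by the final round."""
    return 1 - values[-1] / values[0]


converging = narrows_every_round(deltas)
reduction = total_reduction(deltas)
print(converging, round(reduction, 2))  # True 0.85
```

Roughly 85% of the opening gap closed across the three rounds, with the largest single drop coming from the evidence exchange in Round 2.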
That distinction is the whole argument for the method. An averaged score from two judges who started 0.26 apart would have reported false agreement and thrown away the reason for the disagreement. The three-round structure kept the reason and resolved it, which is what makes the output auditable: every point where the evaluators initially diverged has a record of how the gap closed.
There is a recursive property worth noting, because it doubles as a stress test. The subject under evaluation was a year of AI-augmented work, and the evaluation was itself a piece of AI-augmented work: a rubric acting as a context station, a multi-round process under a supervisor agent, different tasks routed to different models by strength. The method had to be good enough to evaluate the kind of work it is an example of. It held.
A single model grading a body of work renders a confident judgment from whatever slice of evidence it happens to hold. The 0.26 opening delta wasn’t noise to be averaged out. It was the method working: it revealed that the two judges were grading from different partitions of the same record, which is exactly the failure mode a real evaluation has to catch. The fix is cross-validation, not averaging.
The instinct with multiple LLM judges is to average and move on. That instinct discards the most useful information in the system. Where the judges disagree, and why, is where the real assessment lives. A method that surfaces the disagreement and forces it to resolve produces a result you can defend line by line. A method that averages produces a number you can only assert.
The findings about one person’s work are incidental. The transferable result is the structure: blind independence, cross-validation, structured synthesis, applied wherever independent LLM judges need to reach a consensus that holds up to scrutiny. The test case was deliberately hard (partial evidence, two models, a strict benchmark) precisely so the structure would be exercised under load before it went anywhere that mattered.
| Artifact | Description |
|---|---|
| Final Consensus Grade Report | Unified assessment from both agents: 18 subcategories, qualitative commentary, cross-reviewer confidence levels |
| AI as Peer Reviewer (Field Report) | Methodological case study documenting the experiment, key findings on data partitioning and inter-model debate |
| Round 1 Reports (×2) | Independent blind assessments from each agent |
| Round 2 Audit Reports (×2) | Cross-validation findings after dataset exchange |
| Round 3 Transcript | Full 144-message synthesis session including questioning, debate, and collaborative report drafting |
| Rubric & Evidence Standards | The evaluation framework, a reusable context station for structured AI peer review |
The evaluation problem here was adversarial by design: partial evidence, two models with different views, a hard benchmark. The method still converged, and every step left an auditable trail. That is the transferable result. The same three-round structure (blind independence, cross-validation, collaborative synthesis) applies to any evaluation where independent LLM judges need to reach a defensible consensus and averaging would paper over the disagreement: candidate screening, vendor selection, document review, high-stakes prompt validation.
The production implementation of this method is The Permanent Record, now in client testing.
Related Projects: The Permanent Record · PHINEAS