The Permanent Record

Originated as spec work for a client relationship in Assessment and Development.

The Permanent Record and AI Peer Review System

Context

I started building this for a client engagement that needed multi-evaluator grading at scale. When LLMs grade work, independent runs of the same model produce inconsistent results, and the variance compounds when multiple evaluators are involved. Averaging the scores hides the disagreement without resolving it. I wanted a methodology that surfaced the disagreement first, then reconciled it through structured dialogue rather than statistical smoothing.
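
To make the averaging problem concrete, here is a minimal sketch. The scores are invented for illustration and are not pilot data. Two panels land on the same mean even though one of them is sharply split.

```python
from statistics import mean, pstdev

# Hypothetical scores (0-1 scale) from four evaluator runs on one submission.
agreeing = [0.70, 0.72, 0.68, 0.70]
split = [0.95, 0.45, 0.90, 0.50]


def summarize(scores):
    """Return the mean and the population standard deviation, rounded."""
    return {"mean": round(mean(scores), 2), "spread": round(pstdev(scores), 2)}


for label, scores in (("agreeing", agreeing), ("split", split)):
    print(label, summarize(scores))
```

Both panels report the same mean, so a pipeline that keeps only the average cannot tell them apart. The spread is the signal worth surfacing before any reconciliation.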

Approach

A three-round multi-agent system:

  1. Blind Independence. Each evaluator agent grades independently with no access to peer outputs. Variance is captured, not suppressed.
  2. Cross-Validation. Each evaluator sees its peers’ grades and rationales, and can defend or revise its own position.
  3. Collaborative Synthesis. Evaluators reconcile through structured dialogue against a shared rubric, producing a final assessment with a reasoning trail.
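
The three rounds can be sketched as a small orchestration loop. This is an illustration under stated assumptions, not the production system: `make_stub` stands in for an LLM call, `placeholder_synthesis` stands in for the structured dialogue in round 3 (it just takes the median and concatenates rationales), and every name here is hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, median, pstdev
from typing import Callable, Optional


@dataclass
class Grade:
    evaluator: str
    score: float
    rationale: str


# Hypothetical signature: (submission, peer grades or None) -> Grade.
Evaluator = Callable[[str, Optional[list]], Grade]


def make_stub(name: str, initial: float, pull: float) -> Evaluator:
    """Deterministic stand-in for an LLM evaluator (assumption, not a real API)."""
    def evaluate(submission: str, peers: Optional[list] = None) -> Grade:
        if not peers:
            return Grade(name, initial, "blind grade against the rubric")
        # Move part of the way toward the peers' mean score.
        peer_mean = mean(g.score for g in peers)
        revised = round(initial + pull * (peer_mean - initial), 3)
        return Grade(name, revised, f"revised after reading {len(peers)} peer rationales")
    return evaluate


def placeholder_synthesis(grades: list) -> Grade:
    """Placeholder for round 3; the real step is a rubric-guided dialogue."""
    trail = "; ".join(f"{g.evaluator}: {g.rationale}" for g in grades)
    return Grade("panel", median(g.score for g in grades), trail)


def run_rounds(submission: str, evaluators: list, synthesize=placeholder_synthesis) -> dict:
    # Round 1: blind independence, no peer outputs visible.
    blind = [fn(submission, None) for fn in evaluators]
    # Round 2: cross-validation, each evaluator sees everyone else's grade.
    revised = [
        fn(submission, [g for g in blind if g is not own])
        for fn, own in zip(evaluators, blind)
    ]
    # Round 3: synthesis into one assessment with a reasoning trail.
    final = synthesize(revised)
    return {
        "blind": blind,
        "revised": revised,
        "final": final,
        "spread_before": pstdev(g.score for g in blind),
        "spread_after": pstdev(g.score for g in revised),
    }


panel = [make_stub("A", 0.9, 0.5), make_stub("B", 0.5, 0.5), make_stub("C", 0.7, 0.5)]
result = run_rounds("sample essay", panel)
print(result["final"].score, result["spread_before"], result["spread_after"])
```

Keeping the blind grades in the returned record preserves the round 1 disagreement for audit, even after the panel converges.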

The system is built as a serverless RAG pipeline on Cloudflare’s edge stack. Vector search, structured metadata, and processed conversation archives supply the context that the evaluator agents work from.
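
As a rough illustration of how vector search and structured metadata combine at retrieval time, the sketch below filters on metadata and then ranks by cosine similarity. It is a generic in-memory stand-in written in plain Python. It does not use or imitate any Cloudflare API, and the `Chunk` type, the `search` function, the two-dimensional vectors, and the `source` metadata key are all assumptions made for the example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    vector: list
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vector, top_k=2, **filters):
    """Keep chunks whose metadata matches every filter, then rank by similarity."""
    candidates = [
        c for c in index
        if all(c.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda c: cosine(c.vector, query_vector), reverse=True)
    return ranked[:top_k]


# Toy index with made-up embeddings.
index = [
    Chunk("Rubric: thesis clarity", [1.0, 0.0], {"source": "rubric"}),
    Chunk("Rubric: evidence quality", [0.0, 1.0], {"source": "rubric"}),
    Chunk("Archived conversation about thesis", [0.9, 0.1], {"source": "archive"}),
]

hits = search(index, [1.0, 0.1], top_k=1, source="rubric")
print(hits[0].text)
```

Filtering before ranking lets an evaluator pull only rubric text, for example, even when an archived conversation sits closer to the query in embedding space.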

Outcome

Grade variance across evaluator agents dropped from 0.26 to 0.04 in the pilot, roughly a 6.5x reduction in disagreement. The system is now in client testing as a containerized LobeHub distribution.

The methodology generalizes beyond grading. Any decision where multiple LLM evaluators need to reach a defensible, auditable consensus benefits from the same three-round pattern: surface the disagreement, reconcile it through structured dialogue, and output a result with a reasoning trail. That structure applies to candidate evaluation, vendor selection, document review, and high-stakes prompt validation, and to any other setting where averaging would hide the disagreement instead of resolving it.