The Permanent Record & AI Peer Review

The Infrastructure — The Permanent Record

Working across multiple AI interfaces (Claude, Gemini, LobeChat, Claude Code) created a fragmentation problem. Months of accumulated institutional knowledge — architecture decisions, debugging sessions, design rationale — existed only in scattered chat logs with no unified retrieval layer.

Built a serverless RAG system entirely on Cloudflare's edge infrastructure:

  • Vectorize for semantic search across 1,700+ knowledge chunks
  • D1 (SQLite at the edge) for structured metadata
  • R2 for raw conversation archives (S3-compatible, zero egress — the disaster recovery layer)
  • Workers AI (bge-m3) for embedding generation — no external API dependency
  • Workers for the search API — V8 isolates, zero cold starts, sub-50ms global response
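
The pieces above compose into a single Worker. A minimal sketch of the search path, assuming hypothetical binding names (`AI`, `VECTORIZE`) and metadata fields — the interfaces below just mirror the binding shapes so the sketch is self-contained:

```typescript
// Sketch of the edge search path: embed the query with Workers AI (bge-m3),
// then run a top-K semantic lookup against Vectorize.
// Binding names (AI, VECTORIZE) and metadata fields are assumptions.

interface VectorizeMatch {
  id: string;
  score: number;
  metadata?: Record<string, string>;
}

interface Env {
  AI: {
    run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
  };
  VECTORIZE: {
    query(
      vector: number[],
      opts: { topK: number; returnMetadata: boolean },
    ): Promise<{ matches: VectorizeMatch[] }>;
  };
}

export async function searchKnowledge(env: Env, query: string, topK = 5) {
  // bge-m3 returns one embedding per input string.
  const { data } = await env.AI.run("@cf/baai/bge-m3", { text: [query] });
  const { matches } = await env.VECTORIZE.query(data[0], {
    topK,
    returnMetadata: true,
  });
  return matches.map((m) => ({ id: m.id, score: m.score, ...m.metadata }));
}
```

In the deployed Worker this would sit behind a `fetch` handler; keeping embedding and retrieval on the same platform is what removes the external API dependency.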

Key architecture decision: all-Cloudflare stack. Single wrangler.toml, one billing dashboard, one deployment pipeline. A deliberate trade — vendor consolidation in exchange for operational simplicity at this scale, with the R2 archive as the portability hedge.
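
What "single wrangler.toml" means in practice: every storage and compute binding declared in one file. A hypothetical sketch — the worker, index, bucket, and database names here are illustrative, not the real ones:

```toml
# One config file for the whole stack. All names are hypothetical.
name = "permanent-record-search"
main = "src/index.ts"
compatibility_date = "2024-01-01"

[ai]
binding = "AI"                      # Workers AI (bge-m3 embeddings)

[[vectorize]]
binding = "VECTORIZE"
index_name = "knowledge-chunks"     # semantic search index

[[d1_databases]]
binding = "DB"
database_name = "knowledge-metadata"
database_id = "<id>"

[[r2_buckets]]
binding = "ARCHIVE"
bucket_name = "conversation-archive"  # raw logs; the portability hedge
```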

The Application — AI Peer Review System

Hypothesized that AI collaborators possess the most complete dataset of a professional's working behavior — the prompting strategies, debugging loops, and architectural decisions that human peers rarely observe. Designed a multi-agent assessment framework in which two competing LLMs acted as senior colleagues, grading against a structured rubric and resolving disagreements through cross-validated evidence retrieval.
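
One way to make "graded against a structured rubric" concrete. The category names and the equal weighting below are assumptions; only the 4.0 scale comes from the reported results:

```typescript
// Sketch of a rubric record each agent fills in, plus GPA aggregation.
// Category names and equal weighting are hypothetical.

type LetterGrade = "A" | "A-" | "B+" | "B" | "B-";

const GRADE_POINTS: Record<LetterGrade, number> = {
  A: 4.0, "A-": 3.7, "B+": 3.3, B: 3.0, "B-": 2.7,
};

interface RubricScore {
  category: string;   // e.g. "architecture", "debugging rigor"
  grade: LetterGrade;
  evidence: string[]; // conversation IDs retrieved as support
}

// Unweighted mean over rubric categories on the 4.0 scale.
export function gpa(scores: RubricScore[]): number {
  const total = scores.reduce((sum, r) => sum + GRADE_POINTS[r.grade], 0);
  return total / scores.length;
}
```

The `evidence` field is the important part: every grade has to point at retrievable conversations, which is what makes the later cross-validation rounds possible.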

Three-round methodology enforcing convergence:

  1. Round 1 — Blind Independence: Claude reviewed 116 conversations (execution evidence); Gemini reviewed ~550 conversations (architecture evidence). No data sharing. Result: GPA delta of 0.26 — significant divergence driven by data partitioning.
  2. Round 2 — Cross-Validation: Each agent audited the other's evidence. Gemini rescinded its "scope abandonment" critique after reviewing Claude's completed build artifacts. Delta narrowed to 0.12.
  3. Round 3 — Collaborative Synthesis: Both agents instantiated in a shared LobeChat environment with a Supervisor agent managing turn order and context injection. Real-time debate produced final delta of 0.04. Consensus GPA: 3.81/4.0 (A-/A).
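
The three rounds above reduce to a simple control loop: the Supervisor alternates the agents, injecting each one's latest position into the other's context, until the GPA delta falls under a convergence threshold. A sketch under assumptions — the `Agent` interface, the threshold value, and the context format are all hypothetical:

```typescript
// Sketch of the Supervisor's convergence loop. Round 1 is blind
// (empty context); later rounds share each agent's current position.
// Interface shape and threshold are assumptions, not from the write-up.

interface Agent {
  name: string;
  assess(context: string): Promise<number>; // returns a GPA on a 4.0 scale
}

export async function runDebate(
  a: Agent,
  b: Agent,
  maxRounds = 3,
  threshold = 0.05,
): Promise<{ rounds: number; delta: number; consensus: number }> {
  let gpaA = await a.assess(""); // blind independence: no shared context
  let gpaB = await b.assess("");
  let rounds = 1;
  while (Math.abs(gpaA - gpaB) > threshold && rounds < maxRounds) {
    // Cross-validation / synthesis: each agent sees the other's position.
    gpaA = await a.assess(`${b.name} currently grades ${gpaB}`);
    gpaB = await b.assess(`${a.name} currently grades ${gpaA}`);
    rounds++;
  }
  return {
    rounds,
    delta: Math.abs(gpaA - gpaB),
    consensus: (gpaA + gpaB) / 2,
  };
}
```

The reported trajectory (0.26 → 0.12 → 0.04) is exactly what this loop is designed to produce: termination once disagreement is small enough to call consensus.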

The Permanent Record was the retrieval layer that made evidence-based grading possible — semantic search over 660+ indexed conversations replaced unreliable model memory with auditable evidence.

Technical Outcomes

  • 660+ conversations indexed across two vector databases (Cloudflare Vectorize + Qdrant)
  • Sub-50ms semantic search response time at the edge
  • 0.26 → 0.04 GPA delta across three rounds — demonstrating that inter-model debate mitigates individual AI assessment bias
  • 5 published artifacts generated through the system: two independent audits, two grade reports, and a methodology field report
  • Key finding: data partitioning creates systematic bias in single-model evaluation — multi-agent consensus is structurally required for AI-based performance assessment

Read the full case study →