The Permanent Record & AI Peer Review
The Infrastructure — The Permanent Record
Working across multiple AI interfaces (Claude, Gemini, LobeChat, Claude Code) created a fragmentation problem. Months of accumulated institutional knowledge — architecture decisions, debugging sessions, design rationale — existed only in scattered chat logs with no unified retrieval layer.
Built a serverless RAG system entirely on Cloudflare's edge infrastructure:
- Vectorize for semantic search across 1,700+ knowledge chunks
- D1 (SQLite at the edge) for structured metadata
- R2 for raw conversation archives (S3-compatible, zero egress — the disaster recovery layer)
- Workers AI (bge-m3) for embedding generation — no external API dependency
- Workers for the search API — V8 isolates, zero cold starts, sub-50ms global response
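The query path above can be sketched as a minimal Python analogue. The real Workers AI and Vectorize bindings are platform-specific, so the embedding call and vector index here are in-memory stand-ins; every name in this sketch (`embed`, `search`, the chunk metadata) is hypothetical:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def embed(text: str) -> list[float]:
    # Toy deterministic stand-in for the bge-m3 embedding model
    # (Workers AI in production) — NOT a real embedding.
    vec = [0.0] * 8
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch) % 13
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Chunk store: (embedding, metadata). In production the vectors live in
# Vectorize and the structured metadata in D1.
CHUNKS = [
    (embed("edge deployment with wrangler"), {"id": 1, "source": "claude"}),
    (embed("supervisor agent turn order"), {"id": 2, "source": "gemini"}),
]

def search(query: str, k: int = 1) -> list[dict]:
    # Embed the query, rank chunks by similarity, return top-k metadata.
    qv = embed(query)
    ranked = sorted(CHUNKS, key=lambda c: cosine(qv, c[0]), reverse=True)
    return [meta for _, meta in ranked[:k]]
```

The Worker version of this is a thin HTTP handler around the same three steps: embed, query the index, join metadata.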
Key architecture decision: all-Cloudflare stack. Single wrangler.toml, one billing dashboard, one deployment pipeline. Deliberate vendor consolidation trades lock-in risk for operational simplicity at this scale, with the R2 archive as the portability hedge.
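A single config can declare all four bindings. This is a hypothetical sketch of what that wrangler.toml might look like; the binding names, index name, and bucket name are illustrative, not the actual configuration:

```toml
# Hypothetical sketch — binding and resource names are illustrative.
name = "permanent-record"
main = "src/index.ts"
compatibility_date = "2024-09-01"

[[vectorize]]
binding = "VECTOR_INDEX"
index_name = "knowledge-chunks"

[[d1_databases]]
binding = "DB"
database_name = "metadata"
database_id = "<uuid>"

[[r2_buckets]]
binding = "ARCHIVE"
bucket_name = "conversation-archive"

[ai]
binding = "AI"
```

One file, one `wrangler deploy`, and the Worker sees every layer of the stack as an environment binding.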
The Application — AI Peer Review System
Hypothesized that AI collaborators possess the most complete dataset of a professional's working behavior — the prompting strategies, debugging loops, and architectural decisions that human peers rarely observe. Designed a multi-agent assessment framework in which two competing LLMs acted as senior colleagues, grading against a structured rubric and resolving disagreements through cross-validated evidence retrieval.
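A structured rubric reduces each agent's review to a weighted score. The criteria and weights below are illustrative only — the actual assessment instrument is not reproduced here:

```python
# Hypothetical rubric — criteria and weights are illustrative,
# not the actual assessment instrument.
RUBRIC = {
    "architecture_judgment": 0.3,
    "execution_followthrough": 0.3,
    "debugging_rigor": 0.2,
    "communication": 0.2,
}

def grade(scores: dict[str, float]) -> float:
    """Weighted GPA on a 4.0 scale from per-criterion scores."""
    assert set(scores) == set(RUBRIC), "every criterion must be scored"
    return round(sum(RUBRIC[c] * s for c, s in scores.items()), 2)
```

Pinning both agents to the same rubric is what makes their grades comparable: disagreement then localizes to specific criteria rather than to overall impressions.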
Three-round methodology enforcing convergence:
- Round 1 — Blind Independence: Claude reviewed 116 conversations (execution evidence); Gemini reviewed ~550 conversations (architecture evidence). No data sharing. Result: GPA delta of 0.26 — significant divergence driven by data partitioning.
- Round 2 — Cross-Validation: Each agent audited the other's evidence. Gemini rescinded its "scope abandonment" critique after reviewing Claude's completed build artifacts. Delta narrowed to 0.12.
- Round 3 — Collaborative Synthesis: Both agents instantiated in a shared LobeChat environment with a Supervisor agent managing turn order and context injection. Real-time debate produced final delta of 0.04. Consensus GPA: 3.81/4.0 (A-/A).
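The convergence criterion across rounds is simple to state in code: the inter-agent GPA delta must fall below a threshold before a consensus grade is emitted. The per-round scores below are illustrative only, chosen to reproduce the published deltas (0.26, 0.12, 0.04) and consensus (3.81); the actual per-agent grades are not disclosed here:

```python
def convergence(rounds: list[tuple[float, float]], threshold: float = 0.05):
    """Return per-round GPA deltas and the consensus grade once the
    delta falls below the convergence threshold (None if it never does)."""
    deltas = [round(abs(a, ) - b, 2) if False else round(abs(a - b), 2) for a, b in rounds]
    consensus = None
    for (a, b), d in zip(rounds, deltas):
        if d < threshold:
            consensus = round((a + b) / 2, 2)
            break
    return deltas, consensus

# Illustrative (Claude, Gemini) scores — chosen only to reproduce the
# published per-round deltas, not the actual undisclosed grades.
rounds = [
    (3.94, 3.68),  # Round 1: blind independence
    (3.88, 3.76),  # Round 2: cross-validation
    (3.83, 3.79),  # Round 3: collaborative synthesis
]
```

With a 0.05 threshold, only Round 3 triggers consensus — which is exactly why the protocol enforces three rounds rather than averaging the blind Round 1 grades.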
The Permanent Record was the retrieval layer that made evidence-based grading possible — semantic search over 660+ indexed conversations replaced unreliable model memory with auditable evidence.
Technical Outcomes
- 660+ conversations indexed across two vector databases (Cloudflare Vectorize + Qdrant)
- Sub-50ms semantic search response time at the edge
- 0.26 → 0.04 GPA delta across three rounds — demonstrating that inter-model debate mitigates individual AI assessment bias
- 5 published artifacts generated through the system: two independent audits, two grade reports, and a methodology field report
- Key finding: data partitioning creates systematic bias in single-model evaluation — multi-agent consensus is structurally required for AI-based performance assessment