The AI Peer Review System: A Multi-Agent 360-Degree Performance Assessment

Published: · 8 min read

When your primary coworkers are AI systems, who writes your performance review? I built a structured evaluation framework where two competing LLMs acted as senior colleagues, graded my work against a rigorous rubric, and debated the results in real time.

Antique desk covered in grade sheets, ledgers, and assessment documents under warm chiaroscuro lighting

When your primary coworkers are AI systems, who writes your performance review?

Traditional 360-degree feedback relies on human peers who observe your daily work. But for the past year, my daily collaborators have been Claude and Gemini — across 660+ conversations spanning architecture decisions, code execution, product design, research, and strategic planning. Human peers might see the outputs. The AI systems saw the process: the prompting strategies, the debugging loops, the architectural decision-making, the moments I pushed back on their suggestions and the moments I should have.

I hypothesized that my AI collaborators held the most complete dataset of my professional behavior. So I built a system to make that data useful.


The Design

The system borrows from Valve Software’s flat-structure peer review model, adapted for AI agents. Two independent LLMs act as senior colleagues, each with access to their own interaction history with me, a shared rubric, and instructions to grade against a benchmark of senior-level AI-augmented professionals.

The architecture enforces rigor through three phases designed to isolate variables and force convergence:

Round 1 — Blind Independence. Each agent reviews only its own conversation history. No visibility into the other’s data. This produces two independent assessments from fundamentally different vantage points.

Round 2 — Cross-Validation. Agents exchange datasets. Each audits the other’s findings against new evidence. Biases get corrected. Grades shift.

Round 3 — Collaborative Synthesis. Both agents are instantiated in a shared environment (LobeChat) to debate findings in real-time, interrogate me directly, and draft a unified consensus report.
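The three rounds can be sketched as a simple orchestration loop. The Python below is a toy illustration of the protocol, not the actual LobeChat implementation; the `Agent` class, its `review` method, and the report dictionaries are invented stand-ins.

```python
# Toy sketch of the three-round review protocol. "Agent" and its
# review behavior are illustrative stand-ins, not a real API.

class Agent:
    def __init__(self, name):
        self.name = name

    def review(self, evidence, rubric, prior_reports=None):
        # A real agent would grade against the rubric; here we just
        # record which evidence the assessment was based on and which
        # prior reports it audited.
        return {"reviewer": self.name, "evidence": list(evidence),
                "audited": [r["reviewer"] for r in (prior_reports or [])]}

def run_review(agent_a, agent_b, logs_a, logs_b, rubric):
    # Round 1 - Blind Independence: each agent sees only its own history.
    r1a = agent_a.review([logs_a], rubric)
    r1b = agent_b.review([logs_b], rubric)
    # Round 2 - Cross-Validation: datasets are exchanged and each agent
    # audits the other's Round 1 findings against the new evidence.
    r2a = agent_a.review([logs_a, logs_b], rubric, prior_reports=[r1b])
    r2b = agent_b.review([logs_b, logs_a], rubric, prior_reports=[r1a])
    # Round 3 - Collaborative Synthesis: both audits feed one consensus report.
    return {"consensus": [r2a, r2b]}
```

The structural point the sketch captures: no agent ever grades the combined evidence until it has first committed to a blind assessment, which is what makes the later convergence meaningful.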

The Infrastructure

The evidence base wasn’t just chat logs. I built searchable knowledge infrastructure to support the review:

| Component | Purpose |
| --- | --- |
| Cloudflare Vectorize | Indexed 116 Claude conversations (1,095 semantic chunks) for evidence retrieval |
| Qdrant | Indexed ~550 Gemini conversations for the same purpose |
| Custom Rubric | 4-category, 18-subcategory grading matrix (A+ through F) with explicit evidence standards |
| LobeChat Orchestration | Multi-agent environment with a Supervisor agent managing turn-taking, context injection, and task delegation |
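For a sense of what "semantic chunks" means in practice, conversation logs might be split along paragraph boundaries before embedding and indexing. This is a hedged sketch only: the actual chunking strategy behind the 1,095-chunk index isn't documented here, and `chunk_conversation` and its parameters are assumptions.

```python
# Hypothetical paragraph-based chunker for conversation logs; the real
# pipeline's chunking strategy and sizes are not documented here.

def chunk_conversation(text, max_chars=1000, overlap=1):
    """Group paragraphs into chunks of roughly max_chars, carrying
    `overlap` trailing paragraphs forward for continuity."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        if current and sum(len(p) for p in current) + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]  # keep tail paragraphs as context
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk would then be embedded and upserted into Vectorize or Qdrant, so an assessor agent can retrieve evidence by semantic similarity rather than scrolling full transcripts.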

The rubric is a critical piece. It’s not a generic feedback form — it defines specific professional behaviors at each grade level, requires cited evidence for every assessment, and benchmarks against people who use AI as a core methodology, not a novelty. A C+ in this system means “average for a senior AI-augmented professional.” That’s already a high bar.
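The GPA figures and inter-reviewer deltas reported in the rounds can be computed mechanically from the letter grades. A minimal sketch, assuming a standard 4.0 scale where A+ and A both map to 4.0 (whether the actual rubric used exactly these point values is an assumption):

```python
# Grade-to-GPA conversion on an assumed standard 4.0 scale.
GRADE_POINTS = {
    "A+": 4.0, "A": 4.0, "A-": 3.7,
    "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7,
    "D+": 1.3, "D": 1.0, "F": 0.0,
}

def gpa(grades):
    """Mean grade points across the graded subcategories."""
    return sum(GRADE_POINTS[g] for g in grades) / len(grades)

def delta(grades_a, grades_b):
    """Inter-reviewer disagreement: absolute GPA difference."""
    return abs(gpa(grades_a) - gpa(grades_b))
```

Convergence across rounds then reads directly as this delta shrinking toward zero as the two reviewers' subcategory grades line up.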


What Happened

Round 1: The Silo Effect

The most important discovery came immediately. The two agents produced dramatically different assessments because they held different slices of the same career.

Gemini saw the architect. Because I used Gemini primarily for ideation, research, and product design, it perceived me as “all start, no finish.” It flagged “Scope Abandonment” as a critical weakness — too many initiated projects that never resolved within its window.

Claude saw the builder. Because I used Claude for coding, architecture, and implementation, it perceived me as “high execution, low documentation.” It gave higher grades but missed the strategic breadth.

Neither view was wrong. Both were incomplete. This is the single most useful finding of the entire exercise: a single-model evaluation of a multi-model workflow is structurally flawed. Each agent only saw the work I brought to it, and each drew conclusions from that partial view.

Round 1 GPA delta: 0.26 — significant disagreement, particularly on follow-through.

Round 2: The Correction

When Gemini reviewed Claude’s evidence of completed projects — deployed sites, functioning systems, shipped code — it rescinded the “Abandonment” critique entirely. The “abandoned” projects in Gemini’s logs were completed work that had moved to Claude for execution. What looked like abandonment was task routing.

Claude, reviewing Gemini’s logs, discovered the depth of architectural specifications and strategic planning that preceded the code it had helped build. What it had assessed as “the starting point” turned out to be the visible tip of substantial upstream work.

Round 2 GPA delta: 0.12 — disagreement narrowed to nuance.

Round 3: The Synthesis

This was a five-hour session — 144 messages across two AI assessors, a LobeChat supervisor agent, and me answering their questions.

The agents alternated asking me probing questions and challenging each other’s reasoning. Some highlights from the transcript:

On career continuity. Claude asked whether my transition from AV/construction program management to AI systems work was a pivot or an extension. My answer — “It’s both. The AI tools give me more capabilities, but the pivot is only possible because it’s a natural extension of my skill set” — prompted the agents to reframe the entire assessment through a TPM lens rather than a developer lens.

On managing humans vs. managing models. Gemini asked directly: “Which is harder? Which are you better at?” I told them human complexity is far harder to manage — the motivational mix is more varied and variable. But every LLM failure mode has a perfect human coworker analog. They inherited our flaws, just not the points of origin.

On the scope management debate. This was the hardest-fought grade. Claude held firm at B+ (“Jim himself said the breadth is mostly a hindrance — we should listen”). Gemini pushed for A- (“High-throughput prototyping is a feature of AI-augmented work, not a bug”). Claude won the argument by invoking the benchmark: “The senior AI-augmented professionals in our benchmark population can all start 15 things cheaply. The differentiator is selection, convergence, and shipping.”

Gemini conceded. B+ with an upside note.

Round 3 GPA delta: 0.04 — consensus achieved.

Chart showing Claude and Gemini GPA assessments converging across three review rounds, from a 0.26 delta to 0.04


The Discovery: Context-Station Architecture

The most significant outcome wasn’t a grade. It was a named methodology.

During the synthesis, both agents independently converged on the same observation: my work across every domain follows a single pattern. Gemini called it “Narrative as Infrastructure.” Claude proposed “Context-Station Architecture.” They agreed on Claude’s framing.

A Context Station is a pre-structured environment that enables a new agent — human or AI — to achieve productive engagement with minimal ramp-up time and maximal output consistency. Four characteristics:

  1. Declarative State — The station tells the agent where things stand, not just what to do
  2. Embedded Standards — Quality criteria are built into the environment, not delivered as separate instructions
  3. Agent-Agnostic Design — Works regardless of the agent’s skill level, history, or model architecture
  4. Recursive Improvement — Captures outputs that improve the station itself
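As a loose illustration, the four characteristics map onto the skeleton of a cold-start document. This is a hypothetical Python sketch; `build_cold_start` and its section names are invented for illustration, not taken from the actual templates.

```python
# Hypothetical cold-start document builder illustrating the four
# Context Station characteristics; section names are invented.

def build_cold_start(state, standards, tasks, feedback_log):
    sections = [
        # 1. Declarative State: where things stand, not just what to do.
        "## Current State\n" + state,
        # 2. Embedded Standards: quality criteria live in the environment,
        #    not in separate instructions.
        "## Quality Standards\n" + "\n".join(f"- {s}" for s in standards),
        # 3. Agent-Agnostic Design: tasks stated without assuming any
        #    particular agent's history or model architecture.
        "## Open Tasks\n" + "\n".join(f"- {t}" for t in tasks),
        # 4. Recursive Improvement: a slot for outputs that improve
        #    the station itself.
        "## Feedback for This Document\n" + "\n".join(f"- {f}" for f in feedback_log),
    ]
    return "\n\n".join(sections)
```

Any agent, human or AI, handed this document can start producing consistent output without the author re-explaining context each session.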

This pattern runs through everything I’ve built:

| Domain | Implementation |
| --- | --- |
| Design/Build (2008–2018) | Jigs, templates, staging systems, and visual work instructions enabling variable-skill trade crews to deliver consistent quality across deployments |
| AV/IT Program Management (2018–2025) | Global room standards and installation protocols enabling variable-skill vendor teams to deliver consistent quality across Fortune 500 deployments |
| AI Prompt Engineering (2024–present) | Cold-start documents, structured context injections, role-specific system prompts producing consistent output across sessions, models, and context windows |
| This Peer Review | The rubric itself — a context station that enabled two AI systems to independently produce calibrated, comparable assessments without coordination |

The agents graded this A+ and elevated it to a standalone “Signature Methodology” section in the final report. Their reasoning: this isn’t a technique I adopted from a course. It’s a pattern I independently derived from first principles across two distinct career domains, then recognized the isomorphism between them.

As the final report put it:

The innovation is not any single implementation. The innovation is the meta-pattern recognition — seeing that the same environmental-design principle applies whether your agents are humans with variable training or AI models with variable context windows.


Results

Final Grades

Consensus GPA: 3.81 / 4.0 (A-/A) — approximately top 10–15% of the benchmark population.

| Category | GPA | Key Findings |
| --- | --- | --- |
| Technical Skill & Tool Mastery | 3.85 | Fluent cross-model interaction, architectural understanding of context windows, cold-start docs assessed as “graduate-level prompt architecture” |
| Project Management & Execution | 3.50 | Strongest in risk assessment (A-), growth edge in scope management (B+). Both agents noted this grade shifts when viewed through a TPM lens vs. a developer lens. |
| Communication & Collaboration | 3.93 | Highest category. AI Leverage & Delegation elevated to A+ — “unmistakably delegation in the TPM sense” |
| Strategic Thinking & Growth | 3.74 | Recursive Self-Documentation elevated to A+. Professional Positioning (B+) identified as the highest-leverage improvement available. |

Three subcategories hit the A+ ceiling: Context-Station Architecture, AI Leverage & Delegation, and Recursive Self-Documentation. The cross-reviewer convergence on those three elevations — arrived at independently before being confirmed in synthesis — is the strongest validity signal in the dataset.

The Recursive Insight

Here’s the part that makes this project unusual as a portfolio piece: the review validates the skill being reviewed.

The agents explicitly noted this. The process of evaluating my AI collaboration was itself an instance of my AI collaboration. The quality of the output was itself evidence for the assessment. I built the rubric (a context station). I structured the multi-round process (program management). I routed different tasks to different models based on their strengths (AI delegation). And the system produced a rigorous, cross-validated result.

As Gemini observed during Round 3:

He framed the review as a “formal corporate process,” which constrained us to be rigorous and critical rather than sycophantic. The review’s success validates the very skill being reviewed.


What I Learned

About AI evaluation

An isolated AI is a biased observer. Each model only sees the work you bring to it, and it draws conclusions from that partial view. The initial 0.26 GPA delta wasn’t a flaw — it was the system working correctly by revealing how data partitioning creates perception gaps. The fix was cross-validation, not averaging.

About scope management

Both agents, independently, identified scope discipline as my primary growth area. When I confirmed it in the synthesis session, Claude made the case that kept the grade honest: “When the subject of the review identifies the behavior as a weakness, we should listen — especially when he’s explicitly asked us to prioritize honesty.” Fair.

About career narrative

The “Context-Station Architecture” framing didn’t exist before this exercise surfaced it. I knew the pattern — I’ve been doing the same thing across domains for years. But giving it a name, mapping the evidence across career phases, and having two independent systems validate the isomorphism turned an intuitive sense into a demonstrable methodology.

The career continuity narrative that emerged — “the medium changed, the skill didn’t” — came directly from the agents connecting dots I hadn’t explicitly connected.

About building systems that assess themselves

This is the meta-insight: the best way to demonstrate you can build AI-augmented systems is to build one that evaluates whether you can build AI-augmented systems. The recursion isn’t a gimmick. It’s the point.


Technical Architecture

Architecture diagram showing the Round 3 peer review environment: Claude and Gemini assessors plus a Supervisor agent in LobeChat, connected through Shared Libraries to Cloudflare Vectorize and Qdrant vector databases

Artifacts Produced

| Artifact | Description |
| --- | --- |
| Final Consensus Grade Report | Unified assessment from both agents — 18 subcategories, qualitative commentary, cross-reviewer confidence levels |
| AI as Peer Reviewer (Field Report) | Methodological case study documenting the experiment, with key findings on data partitioning and inter-model debate |
| Round 1 Reports (×2) | Independent blind assessments from each agent |
| Round 2 Audit Reports (×2) | Cross-validation findings after dataset exchange |
| Round 3 Transcript | Full 144-message synthesis session including questioning, debate, and collaborative report drafting |
| Rubric & Evidence Standards | The evaluation framework — a reusable context station for structured AI peer review |

The Positioning

The agents’ closing statement, written collaboratively after five hours of debate:

He is not a person who uses AI. He is a person who builds systems in which humans and AI produce better outcomes together than either would alone. That is the job of the future. He is already doing it.

I asked two AI systems to honestly evaluate my professional performance, built the methodology to make that evaluation meaningful, and published the results — including the B+ in scope management. The fact that the evaluation works is the portfolio piece. The grades are a bonus.


Related Projects: The Permanent Record · PHINEAS · ABCD System
