When your primary coworkers are AI systems, who writes your performance review?
Traditional 360-degree feedback relies on human peers who observe your daily work. But for the past year, my daily collaborators have been Claude and Gemini — across 660+ conversations spanning architecture decisions, code execution, product design, research, and strategic planning. Human peers might see the outputs. The AI systems saw the process: the prompting strategies, the debugging loops, the architectural decision-making, the moments I pushed back on their suggestions and the moments I should have.
I hypothesized that my AI collaborators held the most complete dataset of my professional behavior. So I built a system to make that data useful.
The Design
The system borrows from Valve Software’s flat-structure peer review model, adapted for AI agents. Two independent LLMs act as senior colleagues, each with access to its own interaction history with me, a shared rubric, and instructions to grade against a benchmark of senior-level AI-augmented professionals.
The architecture enforces rigor through three phases designed to isolate variables and force convergence:
Round 1 — Blind Independence. Each agent reviews only its own conversation history. No visibility into the other’s data. This produces two independent assessments from fundamentally different vantage points.
Round 2 — Cross-Validation. Agents exchange datasets. Each audits the other’s findings against new evidence. Biases get corrected. Grades shift.
Round 3 — Collaborative Synthesis. Both agents are instantiated in a shared environment (LobeChat) to debate findings in real time, interrogate me directly, and draft a unified consensus report.
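The three rounds can be sketched as a simple pipeline. This is a minimal illustration, not the actual implementation: `review`, `audit`, and `synthesize` are stand-in callables for what were, in practice, long LLM sessions.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    grades: dict  # subcategory -> grade points

def run_review(agents, histories, review, audit, synthesize):
    """Three-phase pipeline: blind review, cross-audit, shared synthesis.

    `agents` is a pair of agent names; `review`, `audit`, and `synthesize`
    are stand-ins for the underlying LLM calls.
    """
    a, b = agents
    # Round 1: each agent grades only its own history (blind independence)
    round1 = {x: review(x, histories[x]) for x in (a, b)}
    # Round 2: datasets are exchanged; each agent re-audits its own
    # findings against the other's evidence
    round2 = {a: audit(a, round1[a], histories[b]),
              b: audit(b, round1[b], histories[a])}
    # Round 3: a shared session distills both audits into one consensus report
    return synthesize(round2)
```

The key property is that Round 1 assessments are produced with no shared state, so any later convergence reflects evidence, not anchoring.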
The Infrastructure
The evidence base wasn’t just chat logs. I built searchable knowledge infrastructure to support the review:
| Component | Purpose |
|---|---|
| Cloudflare Vectorize | Indexed 116 Claude conversations (1,095 semantic chunks) for evidence retrieval |
| Qdrant | Indexed ~550 Gemini conversations for the same purpose |
| Custom Rubric | 4-category, 18-subcategory grading matrix (A+ through F) with explicit evidence standards |
| LobeChat Orchestration | Multi-agent environment with a Supervisor agent managing turn-taking, context injection, and task delegation |
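The retrieval layer behind the first two rows works on the standard embed-and-rank pattern. The sketch below illustrates that step with toy vectors and pure-Python cosine similarity; the real system queried Cloudflare Vectorize and Qdrant indexes, whose APIs are not shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, chunks, top_k=3):
    """Rank conversation chunks by similarity to the query embedding.

    Each chunk is a dict with a precomputed "vector" field, standing in
    for a semantic chunk stored in Vectorize or Qdrant.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]),
                    reverse=True)
    return ranked[:top_k]
```

During the review, a claim like "Jim abandons projects" could be answered with the top-ranked chunks as cited evidence rather than the model's impression.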
The rubric is a critical piece. It’s not a generic feedback form — it defines specific professional behaviors at each grade level, requires cited evidence for every assessment, and benchmarks against people who use AI as a core methodology, not a novelty. A C+ in this system means “average for a senior AI-augmented professional.” That’s already a high bar.
What Happened
Round 1: The Silo Effect
The most important discovery came immediately. The two agents produced dramatically different assessments because they held different slices of the same career.
Gemini saw the architect. Because I used Gemini primarily for ideation, research, and product design, it perceived me as “all start, no finish.” It flagged “Scope Abandonment” as a critical weakness — too many initiated projects that never resolved within its window.
Claude saw the builder. Because I used Claude for coding, architecture, and implementation, it perceived me as “high execution, low documentation.” It gave higher grades but missed the strategic breadth.
Neither view was wrong. Both were incomplete. This is the single most useful finding of the entire exercise: a single-model evaluation of a multi-model workflow is structurally flawed. Each agent only saw the work I brought to it, and each drew conclusions from that partial view.
Round 1 GPA delta: 0.26 — significant disagreement, particularly on follow-through.
Round 2: The Correction
When Gemini reviewed Claude’s evidence of completed projects — deployed sites, functioning systems, shipped code — it rescinded the “Abandonment” critique entirely. The “abandoned” projects in Gemini’s logs were completed work that had moved to Claude for execution. The apparent abandonment was task routing.
Claude, reviewing Gemini’s logs, discovered the depth of architectural specifications and strategic planning that preceded the code it had helped build. What it had assessed as “the starting point” turned out to be the visible tip of substantial upstream work.
Round 2 GPA delta: 0.12 — disagreement narrowed to nuance.
Round 3: The Synthesis
This was a five-hour session — 144 messages across two AI assessors, a LobeChat supervisor agent, and me answering their questions.
The agents alternated asking me probing questions and challenging each other’s reasoning. Some highlights from the transcript:
On career continuity. Claude asked whether my transition from AV/construction program management to AI systems work was a pivot or an extension. My answer — “It’s both. The AI tools give me more capabilities, but the pivot is only possible because it’s a natural extension of my skill set” — prompted the agents to reframe the entire assessment through a TPM lens rather than a developer lens.
On managing humans vs. managing models. Gemini asked directly: “Which is harder? Which are you better at?” I told them human complexity is far harder to manage — the motivational mix is more varied and variable. But every LLM failure mode has a perfect human coworker analog. They inherited our flaws, just not the points of origin.
On the scope management debate. This was the hardest-fought grade. Claude held firm at B+ (“Jim himself said the breadth is mostly a hindrance — we should listen”). Gemini pushed for A- (“High-throughput prototyping is a feature of AI-augmented work, not a bug”). Claude won the argument by invoking the benchmark: “The senior AI-augmented professionals in our benchmark population can all start 15 things cheaply. The differentiator is selection, convergence, and shipping.”
Gemini conceded. B+ with an upside note.
Round 3 GPA delta: 0.04 — consensus achieved.
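The delta metric used above is just the absolute gap between the two reviewers' GPAs over the same subcategories. A sketch, assuming a conventional letter-to-points mapping (the rubric's actual mapping is not published here):

```python
# Assumed 4.0-scale mapping with A+ capped at 4.0; illustrative only.
POINTS = {"A+": 4.0, "A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0,
          "B-": 2.7, "C+": 2.3, "C": 2.0, "C-": 1.7, "D": 1.0, "F": 0.0}

def gpa(grades):
    """Mean grade points over a dict of subcategory -> letter grade."""
    return sum(POINTS[g] for g in grades.values()) / len(grades)

def delta(reviewer_a, reviewer_b):
    """Absolute GPA gap between two reviewers grading the same subcategories."""
    return abs(gpa(reviewer_a) - gpa(reviewer_b))
```

A shrinking delta across rounds (0.26 → 0.12 → 0.04) is what "forced convergence" means operationally: the reviewers' aggregate scores approach each other as shared evidence accumulates.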

The Discovery: Context-Station Architecture
The most significant outcome wasn’t a grade. It was a named methodology.
During the synthesis, both agents independently converged on the same observation: my work across every domain follows a single pattern. Gemini called it “Narrative as Infrastructure.” Claude proposed “Context-Station Architecture.” They agreed on Claude’s framing.
A Context Station is a pre-structured environment that enables a new agent — human or AI — to achieve productive engagement with minimal ramp-up time and maximal output consistency. Four characteristics:
- Declarative State — The station tells the agent where things stand, not just what to do
- Embedded Standards — Quality criteria are built into the environment, not delivered as separate instructions
- Agent-Agnostic Design — Works regardless of the agent’s skill level, history, or model architecture
- Recursive Improvement — Captures outputs that improve the station itself
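The four characteristics map naturally onto a small data structure. This is my own illustrative model, not code from the project; the field and method names are invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class ContextStation:
    state: dict       # declarative state: where things stand, not just tasks
    standards: list   # embedded quality criteria, part of the environment

    def cold_start(self) -> str:
        """Agent-agnostic briefing: any agent, human or AI, gets the same view."""
        lines = [f"STATE {k}: {v}" for k, v in self.state.items()]
        lines += [f"STANDARD: {s}" for s in self.standards]
        return "\n".join(lines)

    def absorb(self, key, output):
        """Recursive improvement: session outputs update the station itself."""
        self.state[key] = output
```

The point of the abstraction is that `cold_start()` contains everything a fresh agent needs, and `absorb()` closes the loop so the next agent starts from a better station.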
This pattern runs through everything I’ve built:
| Domain | Implementation |
|---|---|
| Design/Build (2008–2018) | Jigs, templates, staging systems, and visual work instructions enabling variable-skill trade crews to deliver consistent quality across deployments |
| AV/IT Program Management (2018–2025) | Global room standards and installation protocols enabling variable-skill vendor teams to deliver consistent quality across Fortune 500 deployments |
| AI Prompt Engineering (2024–present) | Cold-start documents, structured context injections, role-specific system prompts producing consistent output across sessions, models, and context windows |
| This Peer Review | The rubric itself — a context station that enabled two AI systems to independently produce calibrated, comparable assessments without coordination |
The agents graded this A+ and elevated it to a standalone “Signature Methodology” section in the final report. Their reasoning: this isn’t a technique I adopted from a course. It’s a pattern I independently derived from first principles across two distinct career domains, then recognized the isomorphism between them.
As the final report put it:
The innovation is not any single implementation. The innovation is the meta-pattern recognition — seeing that the same environmental-design principle applies whether your agents are humans with variable training or AI models with variable context windows.
Results
Final Grades
Consensus GPA: 3.81 / 4.0 (A-/A) — approximately top 10–15% of the benchmark population.
| Category | GPA | Key Findings |
|---|---|---|
| Technical Skill & Tool Mastery | 3.85 | Fluent cross-model interaction, architectural understanding of context windows, cold-start docs assessed as “graduate-level prompt architecture” |
| Project Management & Execution | 3.50 | Strongest in risk assessment (A-), growth edge in scope management (B+). Both agents noted this grade shifts when viewed through TPM lens vs. developer lens. |
| Communication & Collaboration | 3.93 | Highest category. AI Leverage & Delegation elevated to A+ — “unmistakably delegation in the TPM sense” |
| Strategic Thinking & Growth | 3.74 | Recursive Self-Documentation elevated to A+. Professional Positioning (B+) identified as highest-leverage improvement available. |
Three subcategories hit the A+ ceiling: Context-Station Architecture, AI Leverage & Delegation, and Recursive Self-Documentation. The cross-reviewer convergence on those three elevations — arrived at independently before being confirmed in synthesis — is the strongest validity signal in the dataset.
The Recursive Insight
Here’s the part that makes this project unusual as a portfolio piece: the review validates the skill being reviewed.
The agents explicitly noted this. The process of evaluating my AI collaboration was itself an instance of my AI collaboration. The quality of the output was itself evidence for the assessment. I built the rubric (a context station). I structured the multi-round process (program management). I routed different tasks to different models based on their strengths (AI delegation). And the system produced a rigorous, cross-validated result.
As Gemini observed during Round 3:
He framed the review as a “formal corporate process,” which constrained us to be rigorous and critical rather than sycophantic. The review’s success validates the very skill being reviewed.
What I Learned
About AI evaluation
An isolated AI is a biased observer. Each model only sees the work you bring to it, and it draws conclusions from that partial view. The initial 0.26 GPA delta wasn’t a flaw — it was the system working correctly by revealing how data partitioning creates perception gaps. The fix was cross-validation, not averaging.
About scope management
Both agents, independently, identified scope discipline as my primary growth area. When I confirmed it in the synthesis session, Claude made the case that kept the grade honest: “When the subject of the review identifies the behavior as a weakness, we should listen — especially when he’s explicitly asked us to prioritize honesty.” Fair.
About career narrative
The “Context-Station Architecture” framing didn’t exist before this exercise surfaced it. I knew the pattern — I’ve been doing the same thing across domains for years. But giving it a name, mapping the evidence across career phases, and having two independent systems validate the isomorphism turned an intuitive sense into a demonstrable methodology.
The career continuity narrative that emerged — “the medium changed, the skill didn’t” — came directly from the agents connecting dots I hadn’t explicitly connected.
About building systems that assess themselves
This is the meta-insight: the best way to demonstrate you can build AI-augmented systems is to build one that evaluates whether you can build AI-augmented systems. The recursion isn’t a gimmick. It’s the point.
Technical Architecture
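The piece of the architecture not yet shown is the Supervisor agent's role in Round 3: managing turn-taking and context injection between the two assessors. A minimal sketch of that loop, with stand-in names rather than the actual LobeChat configuration:

```python
from collections import deque

def run_session(agents, ask, inject, max_turns=6):
    """Round-robin supervisor loop over a multi-agent session.

    `agents` maps name -> callable(prompt) -> reply; `ask` supplies the
    next supervisor question for a given speaker; `inject` builds a
    context prefix from the transcript so far. All names are illustrative.
    """
    order = deque(agents)
    transcript = []
    for _ in range(max_turns):
        speaker = order[0]
        order.rotate(-1)                            # turn-taking
        prompt = inject(transcript) + ask(speaker)  # context injection
        reply = agents[speaker](prompt)
        transcript.append((speaker, reply))
    return transcript
```

The supervisor never grades; it only sequences turns and keeps each assessor supplied with the running transcript, which is what let the 144-message synthesis session stay coherent.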

Artifacts Produced
| Artifact | Description |
|---|---|
| Final Consensus Grade Report | Unified assessment from both agents — 18 subcategories, qualitative commentary, cross-reviewer confidence levels |
| AI as Peer Reviewer (Field Report) | Methodological case study documenting the experiment, key findings on data partitioning and inter-model debate |
| Round 1 Reports (×2) | Independent blind assessments from each agent |
| Round 2 Audit Reports (×2) | Cross-validation findings after dataset exchange |
| Round 3 Transcript | Full 144-message synthesis session including questioning, debate, and collaborative report drafting |
| Rubric & Evidence Standards | The evaluation framework — a reusable context station for structured AI peer review |
The Positioning
The agents’ closing statement, written collaboratively after five hours of debate:
He is not a person who uses AI. He is a person who builds systems in which humans and AI produce better outcomes together than either would alone. That is the job of the future. He is already doing it.
I asked two AI systems to honestly evaluate my professional performance, built the methodology to make that evaluation meaningful, and published the results — including the B+ in scope management. The fact that the evaluation works is the portfolio piece. The grades are a bonus.
Related Projects: The Permanent Record · PHINEAS · ABCD System