PHINEAS | Jim Vinson - Forward Deployed / Applied AI

Context

The head of English at an ESL academy in Singapore needed a way to produce reading material at defined CEFR levels (the Common European Framework of Reference, the standard for language proficiency), at scale, with deterministic adherence to the level. Off-the-shelf LLM output couldn’t be trusted to stay on-level, even approximately. A passage requested at B1 would drift into B2 vocabulary, slip back to A2 grammar, or include culture-specific references that broke the assessment.

I took on the build on a speculative basis. The head of English championed the work internally and committed beta-trial time from students and faculty.

Approach

The design decision the whole system rests on: a word’s CEFR level is data, not a model guess. PHINEAS keeps a part-of-speech-aware vocabulary database (roughly 14,000 core words and phrases, extended with extrapolated morphologies) where every entry carries its level. When a passage is built, per-word levels are looked up, not inferred, so they stay deterministic. The model is left to do only the two things it is actually good at: identifying the handful of words the database has not seen, and writing the human-facing prose around a fixed vocabulary and grammar window.
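To make the "data, not a model guess" idea concrete, here is a minimal sketch of a part-of-speech-aware lookup. It is not the PHINEAS implementation: the in-memory dictionary, the sample entries and their levels, the POS tag names, and the function names are all assumptions made for illustration. It also checks only the upper edge of the vocabulary window.

```python
from typing import Dict, List, Optional, Tuple

# CEFR levels in ascending order of difficulty.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Hypothetical sample entries keyed by (word, part of speech).
# The levels here are illustrative, not taken from the real database.
VOCAB: Dict[Tuple[str, str], str] = {
    ("book", "NOUN"): "A1",
    ("book", "VERB"): "A2",
    ("travel", "VERB"): "A2",
    ("negotiate", "VERB"): "B2",
}


def lookup_level(word: str, pos: str) -> Optional[str]:
    """Return the stored level for (word, pos), or None if unseen."""
    return VOCAB.get((word.lower(), pos))


def check_passage(
    tokens: List[Tuple[str, str]], target: str
) -> Tuple[List[Tuple[str, str, str]], List[Tuple[str, str]]]:
    """Split tokens into those above the target level and those unknown.

    Known words are judged purely by lookup, so the same input always
    gives the same result. Unknown words are returned separately so a
    model (or a human) can be asked about just those.
    """
    ceiling = CEFR_ORDER.index(target)
    above: List[Tuple[str, str, str]] = []
    unknown: List[Tuple[str, str]] = []
    for word, pos in tokens:
        level = lookup_level(word, pos)
        if level is None:
            unknown.append((word, pos))
        elif CEFR_ORDER.index(level) > ceiling:
            above.append((word, pos, level))
    return above, unknown


tokens = [("book", "VERB"), ("negotiate", "VERB"), ("zeitgeist", "NOUN")]
above, unknown = check_passage(tokens, "B1")
```

With these sample entries, "negotiate" is flagged as above B1 and "zeitgeist" is returned as unknown, which is the only case where a model would need to be consulted.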

Two correction loops keep the system honest as it runs:

Per-word, into the database. When a level is wrong, the fix is written back to the vocabulary database, so the correction is permanent and shared.
Many-shot, into the model context. Approved examples are fed back as in-context guidance, so the prose model learns the house style without retraining.
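
The many-shot loop can be sketched as plain prompt assembly. This is an illustration under assumptions, not the production prompt: the Example record, the build_prompt function, the max_shots cap, and the prompt wording are all hypothetical.

```python
from typing import List, NamedTuple


class Example(NamedTuple):
    level: str      # CEFR level the passage was written for
    text: str       # the passage itself
    approved: bool  # True only once a human has accepted it


def build_prompt(
    examples: List[Example], target_level: str, topic: str, max_shots: int = 20
) -> str:
    """Assemble a prompt from approved examples at the target level.

    Unapproved examples, and examples at other levels, never reach the
    model context.
    """
    shots = [e for e in examples if e.approved and e.level == target_level]
    shots = shots[:max_shots]
    parts = [f"Write a reading passage at CEFR {target_level}."]
    for i, example in enumerate(shots, start=1):
        parts.append(f"Example {i}:\n{example.text}")
    parts.append(f"Topic: {topic}")
    return "\n\n".join(parts)


pool = [
    Example("B1", "Mia takes the bus to school every day.", True),
    Example("B1", "An unreviewed draft passage.", False),
    Example("C1", "A passage written for advanced readers.", True),
]
prompt = build_prompt(pool, "B1", "a weekend trip")
```

Because the examples travel in the context rather than in model weights, adding or withdrawing an approved example takes effect on the next generation.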

Both loops sit behind an approval gate with three roles: user, trainer, and admin. A correction is a proposal until a human with the right role accepts it. The human is in the loop by design, not as a fallback.
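
A minimal sketch of the gate for the per-word loop follows. The text above does not say which roles may accept a correction, so the choice of trainer and admin as approvers is an assumption, as are the Proposal record, the review function, and the dictionary standing in for the vocabulary database.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumption: trainers and admins may accept corrections, users may not.
APPROVER_ROLES = {"trainer", "admin"}


@dataclass
class Proposal:
    word: str
    pos: str
    new_level: str
    proposed_by: str
    status: str = "pending"


def review(
    proposal: Proposal,
    reviewer_role: str,
    accept: bool,
    vocab: Dict[Tuple[str, str], str],
) -> str:
    """Accept or reject a pending proposal.

    The vocabulary store is written only when a reviewer with an
    approver role accepts. Until then the correction is just a proposal.
    """
    if reviewer_role not in APPROVER_ROLES:
        raise PermissionError(f"role {reviewer_role!r} cannot review proposals")
    if proposal.status != "pending":
        raise ValueError("proposal has already been reviewed")
    if accept:
        vocab[(proposal.word, proposal.pos)] = proposal.new_level
        proposal.status = "accepted"
    else:
        proposal.status = "rejected"
    return proposal.status


vocab = {("book", "VERB"): "A2"}
proposal = Proposal("book", "VERB", "B1", proposed_by="user")
```

Calling review(proposal, "user", True, vocab) raises PermissionError and leaves the store untouched; the same call with "trainer" writes the new level and marks the proposal accepted.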

Outcome

PHINEAS is now at phineas.app, in formal product development. The pilot bar was outcome-based: a generated passage had to be classroom-ready without edits beyond formatting, and it had to clear staff review unanimously. The target was 60% of passages clearing that bar; the pilot reached just under 85%, which is what moved it into open beta. Beta-user trials are running with students and faculty at the original academy.

The pattern (deterministic data carrying the load, the model confined to what it does well, corrections gated through a human) generalizes to any domain where output has to hit a precise level or category: medical literacy, legal writing, regulatory compliance, technical documentation tiered by audience.