PHINEAS
Published:
Context
English learners and readers with accessibility needs struggle with text complexity. Existing "simplification" tools are crude — they swap long words for short ones without understanding usage frequency or semantic context. A word isn't hard because it's long; it's hard because learners haven't encountered it yet.
Approach
Built on the COCA (Corpus of Contemporary American English) word frequency database — 1 billion words of real English usage data:
1. Corpus Architecture: COCA frequency database vectorized with OpenAI embeddings to support semantic search across 60,000+ ranked vocabulary items
2. Analysis Engine: Model flags words that are rarer than the frequency threshold set for the target CEFR proficiency level (A1–C2)
3. Rewrite System: Intelligent substitution replaces complex vocabulary with accessible alternatives, preserving meaning and sentence structure
4. SME Workflow: Fine-tuning pipeline designed so that subject matter experts (ESL teachers) can contribute training examples without technical skills; a 120-example training batch is currently being compiled
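To make steps 2 and 3 concrete, here is a minimal sketch of frequency-based flagging and substitution. It is an illustration under stated assumptions, not the deployed system: the rank values, the CEFR cut-offs, the substitution table, and the names FREQ_RANK, CEFR_MAX_RANK, SIMPLER, flag_hard_words, and simplify are all hypothetical. A plain lookup like this also ignores semantic context, which the project handles with a model.

```python
import re

# Hypothetical frequency ranks (1 = most frequent). Placeholder values
# for illustration only; these are not actual COCA ranks.
FREQ_RANK = {
    "the": 1, "use": 2, "help": 3, "people": 4, "start": 5,
    "utilize": 9000, "commence": 12000,
}

# Assumed mapping from CEFR level to the highest rank a learner at that
# level is expected to know. The cut-offs are illustrative, not official.
CEFR_MAX_RANK = {
    "A1": 500, "A2": 1000, "B1": 2000,
    "B2": 4000, "C1": 8000, "C2": 60000,
}

# Hypothetical substitution table (hard word -> more frequent alternative).
SIMPLER = {"utilize": "use", "commence": "start"}

UNKNOWN_RANK = 10**9  # words missing from the list are treated as rare
WORD = re.compile(r"[A-Za-z']+")


def flag_hard_words(text, level):
    """Return the words in text that are rarer than the level's cut-off."""
    max_rank = CEFR_MAX_RANK[level]
    return [
        w for w in WORD.findall(text)
        if FREQ_RANK.get(w.lower(), UNKNOWN_RANK) > max_rank
    ]


def simplify(text, level):
    """Replace flagged words when an in-threshold alternative is known."""
    max_rank = CEFR_MAX_RANK[level]

    def replace(match):
        word = match.group(0)
        key = word.lower()
        if FREQ_RANK.get(key, UNKNOWN_RANK) <= max_rank:
            return word  # already accessible at this level
        easier = SIMPLER.get(key)
        # Substitute only if the alternative is itself within the cut-off.
        if easier is None or FREQ_RANK.get(easier, UNKNOWN_RANK) > max_rank:
            return word
        return easier.capitalize() if word[0].isupper() else easier

    # Substituting word by word leaves spacing and punctuation untouched.
    return WORD.sub(replace, text)


print(flag_hard_words("People utilize help", "B1"))
print(simplify("Utilize the help", "B1"))
```

The same sentence yields different results at different levels: with the assumed cut-offs, "utilize" is flagged for a B1 reader but left alone for a C2 reader. This matches the premise that difficulty depends on exposure rather than word length.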
Outcome
Currently deployed as a Google Gem (prompt + lexical database) in beta testing at a partner ESL academy in Singapore. Shared with select staff and students while training samples are compiled for the first batch fine-tuning run. Core analysis and rewrite functionality validated. Exceeds original project KPIs for accuracy.
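As a rough illustration of how teacher-contributed samples might be compiled into a training batch, the sketch below turns (original, simplified) sentence pairs into JSON Lines. The record fields level, input, and output and the function name to_jsonl are assumptions made for this example; the actual schema depends on the fine-tuning service used.

```python
import json


def to_jsonl(examples, level):
    """Serialize (original, simplified) pairs as one JSON object per line.

    Pairs with an empty side are skipped so incomplete submissions
    do not end up in the training batch.
    """
    lines = []
    for original, simplified in examples:
        original, simplified = original.strip(), simplified.strip()
        if not original or not simplified:
            continue
        # Hypothetical record layout; adapt to the target fine-tuning format.
        record = {"level": level, "input": original, "output": simplified}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


pairs = [
    ("We will commence now.", "We will start now."),
    ("", "This pair is incomplete and is skipped."),
]
print(to_jsonl(pairs, "B1"))
```

Teachers could supply the pairs through a spreadsheet or form, with a script like this handling the conversion so that contributors never touch the file format.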