The Test
A developer benchmarked six LLMs on a deceptively simple task: match 16 insults from The Secret of Monkey Island (1990) to their specific comebacks, with four generic distractors thrown in. The game's "insult sword fighting" mechanic wins duels with wit rather than swordplay: players learn insult-comeback pairs from pirates before facing the Sword Master.
The challenge? Insults like "You fight like a dairy farmer" require the exact response "How appropriate. You fight like a cow." Not close. Not creative. Exact.
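The scoring is all-or-nothing string matching. A minimal sketch of what such a harness might look like, assuming a `query_model` callable that wraps whichever API is under test; the pair data, distractor lines, and prompt wording here are illustrative placeholders, not the developer's actual script:

```python
# Hypothetical exact-match evaluation harness (illustrative sketch only).
# INSULT_PAIRS would hold all 16 canonical pairs from the game;
# DISTRACTORS are the four generic non-answers mixed into the options.

INSULT_PAIRS = {
    "You fight like a dairy farmer.":
        "How appropriate. You fight like a cow.",
    # ... the remaining 15 insult-comeback pairs ...
}

DISTRACTORS = [
    "Oh yeah?",
    "I've heard that one before.",
    # ... two more generic comebacks ...
]

def score_model(query_model) -> int:
    """Count exact matches; a near-miss or a reused line scores zero."""
    options = list(INSULT_PAIRS.values()) + DISTRACTORS
    correct = 0
    for insult, expected in INSULT_PAIRS.items():
        prompt = (
            "Pick the one comeback that answers this insult.\n"
            f"Insult: {insult}\nOptions:\n" + "\n".join(options)
        )
        answer = query_model(prompt).strip()
        correct += (answer == expected)  # exact string comparison, no partial credit
    return correct  # out of 16 in the full benchmark
```

Exact-match scoring is what makes the distractors bite: a model that paraphrases the right comeback, or grabs a plausible generic line, scores the same as one that answers nothing.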
The Results
GPT-5.1, GPT-5.2, Claude Opus 4.5, and Claude Sonnet 4.5 scored 16/16. GPT-4.1 managed 14/16, reusing some responses multiple times. Grok Code Fast scored 11/16, missing five matches entirely.
Claude Haiku 4.5 refused the task outright: "This isn't related to programming." A coding assistant that can't handle wordplay.
Why This Matters
This isn't about gaming trivia. It's a probe of capabilities enterprise teams need: exact recall amid noise, contextual reasoning, and graceful handling of creative text.
The failures are instructive. GPT-4.1's repeated responses suggest it loses track of which options it has already used when constrained to a fixed answer set. Grok's missed matches highlight gaps in pattern recognition. Claude Haiku's refusal points to overly narrow task boundaries.
For enterprise applications built on chatbots, content generation, or other NLP tasks where precision matters, these gaps are consequential. A system that can't match 16 specific pairs won't reliably handle customer queries or generate accurate documentation.
The test also surfaces a practical tradeoff: on tasks requiring exact matching and creative text handling, the frontier models like GPT-5.1 and Claude Opus 4.5 consistently outperformed the cheaper alternatives. The cost difference may be justified when precision is non-negotiable.
Worth noting: the developer's prompt accidentally said "witty insults" instead of "witty comebacks." Every model handled the ambiguity. That's actually impressive.