Open-source WFGY framework claims 91% MMLU accuracy via prompt engineering, no fine-tuning

A prompt-based reasoning framework called WFGY promises major accuracy gains on standard benchmarks without model retraining. Its three versions cover a beginner guide, RAG debugging, and stress testing, but the headline numbers lack independent validation.

What it is

WFGY is an open-source framework that wraps LLMs in structured prompts to improve reasoning and reduce hallucinations. Created by onestardao, it's available on GitHub under the MIT license. Three versions target different use cases: a beginner-friendly PDF (1.0), a RAG debugging toolkit (2.0), and a 131-test stress suite (3.0).

The approach is pure prompt engineering. Four modules (BBMC, BBPF, BBCR, BBAM) create what the author calls "solver loops" that guide model behavior without touching weights or requiring fine-tuning.
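In practice, "pure prompt engineering" means the toolkit is text prepended to your queries. A minimal sketch of what such a wrapper could look like, assuming illustrative instructions for each module (the four module names come from the repo; the instruction text and the wrap_prompt function are our assumptions, not WFGY's actual prompt files):

    # Illustrative sketch only: the module names are real, but the
    # instruction text below is assumed, not taken from the project.
    MODULE_PROMPTS = {
        "BBMC": "Restate the question and list only facts you are sure of.",
        "BBPF": "Propose two candidate reasoning paths before committing.",
        "BBCR": "Check each step for contradictions and revise if any appear.",
        "BBAM": "Give the final answer with a one-line confidence estimate.",
    }

    def wrap_prompt(question: str) -> str:
        # Prepend module instructions so the "solver loop" runs entirely
        # in-context; no weights are touched and no fine-tuning is needed.
        scaffold = "\n".join(f"[{name}] {rule}" for name, rule in MODULE_PROMPTS.items())
        return f"{scaffold}\n\nQuestion: {question}"

The design point is that the scaffold travels with every request, so any chat interface or API client can apply it without infrastructure changes.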

The claims

Benchmark numbers look impressive. The project reports MMLU accuracy jumping from 68.2% to 91.4%, GSM8K from 45.3% to 84.0%, and a 3.6x improvement in time-to-failure across ten standard tests including TruthfulQA and MathBench.

Version 2.0 offers a 16-problem checklist for common RAG failures: retrieval errors, vector database fragmentation, prompt injection, and deployment sequencing issues. It uses a tension metric, delta_s = 1 - cos(I, G), to flag when generated output drifts from intended behavior.
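The metric itself is just cosine distance in an embedding space. A minimal sketch, assuming I and G are embedding vectors of the intended behavior and the generated output (the embedding model and the flagging threshold are our assumptions; the project doesn't specify them here):

    import numpy as np

    def delta_s(i_vec: np.ndarray, g_vec: np.ndarray) -> float:
        # Cosine distance between the embedding of intended behavior (I)
        # and generated output (G): ~0 means aligned, near 1 flags drift.
        cos = float(i_vec @ g_vec / (np.linalg.norm(i_vec) * np.linalg.norm(g_vec)))
        return 1.0 - cos

    # Hypothetical usage; `embed` and the 0.6 threshold are illustrative.
    # if delta_s(embed(intended), embed(generated)) > 0.6:
    #     flag_for_review()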

What to watch

No independent validation exists. Benchmark improvements come from the author's own evaluations, some using simulated GPT-5 behavior via fixed seeds rather than production APIs. The mathematical formulas presented as novel (including a B = I - G + mc² construct) appear only in a self-archived Zenodo paper and haven't faced peer review.

For teams debugging RAG systems or testing vector database accuracy, the 16-problem checklist might offer useful pattern recognition. The framework addresses real issues: data drift in retrieval systems, hallucination detection in structured outputs, semantic accuracy degradation over time.

The pitch is appealing: drop a text file into any LLM and get better reasoning in 30 seconds. Reality will depend on reproducibility. With 1.3k GitHub stars riding on version 3.0, the author invites teams to "try to break it."

Worth testing on your own benchmarks before relying on claimed gains.
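One cheap way to do that is an A/B harness over your own question set. Everything below is illustrative: call_llm stands in for whatever client you use, exact-match scoring only suits short-answer benchmarks, and wrap_prompt is the sketch from earlier, not the project's actual files.

    # Hedged sketch of a before/after check: run the same questions through
    # a bare prompt and through the wrapped prompt, then compare accuracy.
    def compare(questions, gold_answers, call_llm):
        scores = {"baseline": 0, "wrapped": 0}
        for q, gold in zip(questions, gold_answers):
            if call_llm(q).strip() == gold:
                scores["baseline"] += 1
            if call_llm(wrap_prompt(q)).strip() == gold:
                scores["wrapped"] += 1
        n = len(questions) or 1
        return {k: v / n for k, v in scores.items()}

If the wrapped score doesn't clearly beat the baseline on your data, the claimed gains may not transfer to your workload.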