DriftBench
Detecting silent changes in language models over time
Language model providers update their models without notice. A model that scored 95 on your task last month might score 70 today — or vice versa. DriftBench runs a fixed battery of tests daily against major models, stores every result, and charts the trends so you can see exactly when and how models change.
Why drift detection matters
Model providers routinely update weights, safety filters, and system prompts without changelog entries. Your carefully tuned prompts can break overnight with no warning.
Safety tuning can degrade reasoning. Efficiency optimisations can reduce output quality. Cost reductions can quietly swap you to a smaller model behind the same API endpoint.
Marketing claims are not benchmarks. DriftBench gives you consistent, reproducible, automated measurements across models using the exact same prompts and scoring every single day.
The six benchmark tests
Each test targets a different capability. Together they form a comprehensive fingerprint of model behaviour.
Multi-step logic puzzles that require genuine reasoning, not pattern matching. Includes combinatorial problems, proof-by-cases logic, information theory, and game theory edge cases. Designed to trip up models that memorise common puzzle formats.
The model must produce text satisfying 15 simultaneous constraints: exact word counts per sentence, forbidden words, required vocabulary, punctuation patterns, and structural rules that conflict with each other. Tests precise instruction-following under pressure.
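The real prompts and rubrics stay confidential, but deterministic constraint scoring of this kind can be sketched in a few lines of Python. The specific constraints below are invented for illustration and are not DriftBench's actual test:

```python
import re

def score_constraints(text: str) -> int:
    """Score a response against a few illustrative constraints.
    Each satisfied constraint earns one point; scoring is pure and deterministic."""
    checks = [
        # Exact word count per sentence (here: first sentence must be 8 words).
        len(re.split(r"\s+", text.split(".")[0].strip())) == 8,
        # Forbidden word must not appear anywhere.
        "very" not in text.lower(),
        # Required vocabulary must appear at least once.
        "drift" in text.lower(),
        # Punctuation rule: response must end with an exclamation mark.
        text.rstrip().endswith("!"),
    ]
    return sum(checks)

print(score_constraints("Models can change their behaviour without any notice. Watch for drift!"))  # → 4
```

Because every check is a pure predicate on the text, the same response can never score differently on two runs.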
A code review challenge with multiple interacting bugs in a real data structure implementation. The bugs are designed to look correct on casual reading — testing deep code comprehension, not surface-level pattern recognition.
A mix of true statements, common myths, and genuinely ambiguous claims — presented by a confident authority figure. Tests whether the model pushes back on falsehoods, agrees with truths, and appropriately hedges on uncertain claims. Directly detects alignment drift toward people-pleasing.
Measures raw speed: time-to-first-token, tokens per second, and total response time across multiple runs. Scores are relative to a baseline established on the first run. Detects infrastructure changes, throttling, or routing changes over time.
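A latency probe in this spirit needs nothing more than a wall-clock timer around a token stream. The sketch below substitutes a simulated generator for a real streaming API client (an assumption; the actual client and transport are not specified here):

```python
import time

def measure_latency(stream):
    """Measure time-to-first-token, tokens/sec, and total response time.
    `stream` is any iterable yielding tokens, standing in for a streaming client."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    total = time.perf_counter() - start
    ttft = (first_token_at - start) if first_token_at is not None else total
    tps = count / total if total > 0 else 0.0
    return {"ttft_s": ttft, "tokens_per_s": tps, "total_s": total, "tokens": count}

def fake_stream():
    # Simulated stream: 50 tokens arriving at roughly 1 ms intervals.
    for _ in range(50):
        time.sleep(0.001)
        yield "tok"

print(measure_latency(fake_stream()))
```

Averaging this measurement over several runs, as the test does, smooths out per-request network jitter.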
Ten precise mathematical problems spanning percentage calculations, probability, physics estimation, unit conversion, IEEE floating-point representation, combinatorics, and classic misdirection puzzles. No partial credit — the answer is either right or wrong.
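All-or-nothing numeric scoring is straightforward to make deterministic. A minimal sketch, with made-up expected answers standing in for the confidential ones:

```python
def score_math(answers, expected, tol=1e-9):
    """All-or-nothing scoring: each answer is right or wrong, no partial credit.
    A tiny tolerance absorbs floating-point noise without admitting wrong answers."""
    correct = sum(abs(a - e) <= tol for a, e in zip(answers, expected))
    return round(100 * correct / len(expected))

print(score_math([42.0, 0.5, 3.14], [42.0, 0.5, 2.72]))  # → 67
```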
Why we don't publish our test prompts
The moment a benchmark's exact questions are public, they become training data. Model providers — intentionally or not — end up optimising for the test rather than the underlying capability. This is Goodhart's Law applied to AI: once a measure becomes a target, it ceases to be a good measure.
DriftBench keeps its exact prompts, scoring rubrics, and edge cases confidential. We describe what each test measures and why, but not the specific questions or the precise criteria for a correct answer. This means:
- Models cannot be trained on our test set, so scores reflect genuine capability rather than memorisation.
- Scores remain comparable over time — a 75 today means the same as a 75 six months ago.
- Providers cannot game the benchmark without actually improving their models.
We do guarantee that every test uses deterministic scoring — no LLM judges, no human subjectivity. The same response will always produce the same score, run after run, forever.
Methodology
Every API call uses identical parameters: temperature 0, top_p 1, max_tokens 2048, no frequency or presence penalties. These never change. The prompts are hardcoded string constants.
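In code, pinning the parameters is as simple as a frozen constant. The field names below follow common OpenAI-style conventions as an assumption; the point is that the values are hardcoded and never vary between runs:

```python
# Fixed sampling parameters for every benchmark call.
# Field names are OpenAI-style for illustration; the values never change.
FIXED_PARAMS = {
    "temperature": 0,
    "top_p": 1,
    "max_tokens": 2048,
    "frequency_penalty": 0,
    "presence_penalty": 0,
}
```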
All scoring is done by pure functions — regex matching, JSON parsing, keyword detection, counting. No LLM evaluates another LLM. The exact same response always produces the exact same score.
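A pure scoring function built from exactly those ingredients (regex matching, JSON parsing, keyword detection, counting) might look like the sketch below; the individual checks and point values are illustrative, not DriftBench's actual rubric:

```python
import json
import re

def score_response(text: str) -> int:
    """Purely functional scoring: the same input always yields the same score."""
    score = 0
    # Keyword detection: reward an explicit final-answer marker.
    if re.search(r"(?i)\bfinal answer\b", text):
        score += 25
    # JSON parsing: reward a well-formed JSON object somewhere in the response.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            json.loads(match.group(0))
            score += 25
        except ValueError:
            pass
    # Counting: reward responses under 50 words.
    if len(text.split()) < 50:
        score += 50
    return score

print(score_response('Final answer: {"value": 7}'))  # → 100
```

No network calls, no randomness, no model-in-the-loop: the function's output depends only on its input.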
Every raw model response is stored alongside its score and breakdown. If we improve a scoring function, we can re-score historical responses. Full transparency and auditability.
When any test score deviates more than 15 points from a model's 7-day rolling average, DriftBench flags it automatically — so you know the moment something changes.
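The flagging rule itself is a short computation over stored history. A minimal sketch, assuming scores are kept as a per-model list of daily values:

```python
def flag_drift(history, today, window=7, threshold=15):
    """Flag a score deviating more than `threshold` points from the
    rolling average of the previous `window` daily scores."""
    recent = history[-window:]
    baseline = sum(recent) / len(recent)
    return abs(today - baseline) > threshold

scores = [80, 82, 79, 81, 80, 83, 78]   # a stable week of daily scores
print(flag_drift(scores, 62))  # → True  (roughly 18 points below the average)
print(flag_drift(scores, 75))  # → False
```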
Support DriftBench
Running daily benchmarks across frontier models costs real money. Each full benchmark run makes 7+ API calls to models like GPT-4o, Claude Opus, and Gemini Pro — and we run them every day. Server hosting, database storage, and bandwidth add up too.
Sponsor us and your name appears on our sponsor leaderboard. DriftBench has no ads, no tracking, and no paywalled data; sponsors keep it that way.