Benchmarking LLMs on real test-automation work

Coding benchmarks ask whether a model can implement a feature or fix a bug. Authoring test automation is a different job: you choose a structure that survives the application changing, you isolate flakiness, you handle async and network failures, and you keep the suite maintainable as it grows. That is the work I do, and I wanted to measure it directly: not whether a model can write code that runs, but whether it can do the engineering that test automation demands.

So I built llm-qa-benchmark: twenty models, three trials per item, five tracks. Nothing is graded on the text of the answer. Generated suites are executed against mutated source and scored on what they catch; bug-fix patches are applied and run against a hidden test; Playwright scripts are launched against a live app. The headline result is a split. On the foundational tracks the field is saturated. On the work that resembles a real test suite, it falls apart.
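To make "executed against mutated source and scored on what they catch" concrete, here is a minimal sketch of a mutation-kill rate. It is an illustration, not the benchmark's code: the function under test, the three hand-written mutants, and the names `generated_suite`, `is_killed`, and `kill_rate` are all hypothetical.

```python
from typing import Callable, Dict

IntFn = Callable[[int, int], int]


def larger(a: int, b: int) -> int:
    """Hypothetical function under test."""
    return a if a > b else b


# Hand-written mutants of `larger` (a real tool would generate these).
MUTANTS: Dict[str, IntFn] = {
    "flip_comparison": lambda a, b: a if a < b else b,
    "always_first": lambda a, b: a,
    # Equivalent mutant: behaves like the original, so no suite can kill it.
    "ge_instead_of_gt": lambda a, b: a if a >= b else b,
}


def generated_suite(fn: IntFn) -> None:
    """Stand-in for a model-written test suite."""
    assert fn(1, 2) == 2
    assert fn(3, 1) == 3
    assert fn(4, 4) == 4


def is_killed(suite: Callable[[IntFn], None], mutant: IntFn) -> bool:
    """A mutant is killed when the suite fails (or errors) against it."""
    try:
        suite(mutant)
    except Exception:
        return True
    return False


def kill_rate(suite: Callable[[IntFn], None], mutants: Dict[str, IntFn]) -> float:
    """Fraction of mutants the suite catches; 0.0 when there are none."""
    if not mutants:
        return 0.0
    killed = sum(is_killed(suite, m) for m in mutants.values())
    return killed / len(mutants)


generated_suite(larger)  # the suite must pass on the unmutated code first
print(f"kill rate: {kill_rate(generated_suite, MUTANTS):.2f}")
```

The point of scoring this way is that a suite which merely runs scores nothing: it only earns credit for mutants it actually catches.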

The leaderboard

Composite leaderboard (top ten of twenty models)

#	Model	Composite	E2E adv.	$/correct
1	Gemini 3.1 Pro	0.877	0.60	$0.0444
2	GPT-5.4	0.869	0.59	$0.0148
3	GPT-5.5	0.864	0.60	$0.0742
4	Claude Sonnet 4.6	0.860	0.56	$0.0245
5	Claude Opus 4.8	0.852	0.50	$0.0448
6	Kimi K2.6	0.850	0.50	$0.0211
7	Qwen3.7 Max	0.847	0.56	$0.0156
8	GPT-5.1 Codex	0.842	0.47	$0.0354
9	Gemini 3.5 Flash	0.819	0.34	$0.0376
10	GPT-OSS 120B	0.814	0.46	$0.0003

The composite column is misleadingly tight: the top ten land within about six hundredths of each other (0.814 to 0.877). The E2E adv. column, the advanced end-to-end track, is where the same models range from 0.34 to 0.60. The overall number flatters everyone; the advanced number separates them.
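The two spreads can be checked directly from the table. The short script below copies the ten rows shown above and compares the range of each column; the `spread` helper is just an illustrative name.

```python
# (model, composite, E2E adv.) copied from the leaderboard table above
ROWS = [
    ("Gemini 3.1 Pro", 0.877, 0.60),
    ("GPT-5.4", 0.869, 0.59),
    ("GPT-5.5", 0.864, 0.60),
    ("Claude Sonnet 4.6", 0.860, 0.56),
    ("Claude Opus 4.8", 0.852, 0.50),
    ("Kimi K2.6", 0.850, 0.50),
    ("Qwen3.7 Max", 0.847, 0.56),
    ("GPT-5.1 Codex", 0.842, 0.47),
    ("Gemini 3.5 Flash", 0.819, 0.34),
    ("GPT-OSS 120B", 0.814, 0.46),
]


def spread(values):
    """Distance between the best and worst score in a column."""
    return max(values) - min(values)


composite_spread = spread([composite for _, composite, _ in ROWS])
e2e_spread = spread([e2e for _, _, e2e in ROWS])

print(f"composite spread: {composite_spread:.3f}")  # 0.063
print(f"E2E adv. spread:  {e2e_spread:.2f}")        # 0.26
```

The advanced track spreads the same ten models roughly four times wider than the composite does.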

The foundations are strong

Split the five tracks into two groups. Component-level work (generating unit tests, localizing a bug, designing cases from a spec) is mature: fifteen of the twenty models clear 0.88 there, and the strongest sit above 0.93; only two older open coders trail. System-level work (authoring end-to-end automation against a running app) is where the scores drop, in some cases by half.

[Figure: Component vs. system tracks by model]

This is the central finding, and it holds across the field: the component scores on the left are high across the board, while the system scores on the right lag well behind. A model can produce a passing component test efficiently, but producing automation a team can own and maintain is a separate skill that is still underdeveloped. Being a frontier model is enough to score highly on component tasks; it is not enough to be good at the rest.

Models don't reach for the right structure

The advanced track is built to isolate that skill. Each task describes a scenario that implies a technique (a page object model, fixtures, polling, network interception, direct API testing) without naming it, the way a real ticket would. A test can pass without the right structure, so scoring measures both: did it pass, and did it reach for the tool the scenario called for?
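The post does not say how "reached for the tool" is detected, so the following is only one plausible approach: statically inspect the generated script's syntax tree, separately from the executed pass/fail result. It assumes the generated scripts are Python using pytest and Playwright; the pattern names, the `detect_patterns` and `score_task` helpers, and the sample script are all hypothetical.

```python
import ast

SAMPLE = '''
import pytest

class LoginPage:
    def __init__(self, page):
        self.page = page

@pytest.fixture
def login_page(page):
    return LoginPage(page)

def test_login_survives_outage(login_page):
    login_page.page.route("**/api/session", lambda route: route.abort())
'''


def detect_patterns(source: str) -> set:
    """Return the structural patterns visible in a script's AST (heuristic)."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name.endswith("Page"):
            found.add("page_object")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for dec in node.decorator_list:
                # Handle both @pytest.fixture and @pytest.fixture(scope=...)
                target = dec.func if isinstance(dec, ast.Call) else dec
                name = getattr(target, "attr", getattr(target, "id", None))
                if name == "fixture":
                    found.add("fixture")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr == "route":
                found.add("network_interception")
            elif node.func.attr == "wait_for_function":
                found.add("polling")
    return found


def score_task(passed: bool, expected_pattern: str, source: str) -> dict:
    """Two independent signals: did it pass, and did it use the expected pattern?"""
    return {"passed": passed, "adopted": expected_pattern in detect_patterns(source)}


print(sorted(detect_patterns(SAMPLE)))
print(score_task(True, "polling", SAMPLE))
```

Keeping the two signals separate is what lets a passing script still score poorly on structure, which is the distinction this track is after.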

[Figure: Advanced-pattern adoption by model]

The pattern is clear. Models handle the techniques that look like ordinary code: most will reach for a direct API check or simulate a failed network call. Far fewer use a page object, a fixture, or explicit polling, even when the scenario plainly requires it. They default to the shape of a tutorial answer rather than the structure a suite needs to survive maintenance. Design for change, isolate flakiness, do not repeat setup: these are the reflexes a senior SDET (software development engineer in test) brings by default, and the models have not learned them.
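For readers who have not met the less-adopted techniques, here is a rough sketch of two of them: a page object and explicit polling. It is not taken from the benchmark. The selectors and the `LoginPage` and `poll_until` names are hypothetical, and `page` is assumed to behave like Playwright's synchronous `Page` (`fill`, `click`, and `inner_text` with a selector argument).

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool], timeout_s: float = 5.0,
               interval_s: float = 0.1) -> bool:
    """Explicit polling: retry `check` until it is true or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


class LoginPage:
    """Page object: selectors live in one place, tests call intent-level methods."""

    # Hypothetical selectors; a UI change means editing these lines only.
    USERNAME = "#username"
    PASSWORD = "#password"
    SUBMIT = "button[type=submit]"
    BANNER = "[data-testid=welcome]"

    def __init__(self, page):
        self.page = page  # any object with Playwright's sync Page methods

    def log_in(self, user: str, password: str) -> None:
        self.page.fill(self.USERNAME, user)
        self.page.fill(self.PASSWORD, password)
        self.page.click(self.SUBMIT)

    def welcome_text(self) -> str:
        return self.page.inner_text(self.BANNER)
```

A test written against `LoginPage` reads as `login.log_in("ada", "secret")` followed by `poll_until(lambda: "ada" in login.welcome_text())`. The tutorial-shaped alternative repeats raw selectors and fixed sleeps in every test, which is what breaks first when the application changes.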

I read this as a training-data gap, not a reasoning gap. There is an enormous amount of application code and bug-fix data in the world, and very little that captures how a maintained test suite is actually structured. The models are good at what they have seen.

Cost separates the field more than quality does

Because the quality scores are compressed, cost is where the real decisions live.

[Figure: Cost vs. quality]

GPT-OSS 120B lands one rung below Gemini 3.5 Flash on composite (0.814 against 0.819, a gap of 0.005) at about 125× lower cost per passing sample ($0.0003 against $0.0376). The open-weight coders cluster on the value frontier: GPT-OSS 120B and Qwen3 Coder 30B both resolve a correct sample for roughly $0.0003, and DeepSeek V4 Pro for $0.0044. Choosing by composite alone leaves most of that saving on the table.
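The arithmetic behind that comparison is short enough to spell out. The figures are the ones quoted above; `cost_per_correct` is a hypothetical helper that reads "$/correct" as total spend divided by the number of passing samples, which is an assumption about how the column is computed.

```python
def cost_per_correct(total_cost_usd: float, passing_samples: int) -> float:
    """Assumed definition of $/correct: spend divided by passing samples."""
    if passing_samples == 0:
        return float("inf")  # nothing passed, so no finite price per correct sample
    return total_cost_usd / passing_samples


flash_cost = 0.0376  # Gemini 3.5 Flash, $/correct from the table
oss_cost = 0.0003    # GPT-OSS 120B, $/correct from the table

cost_ratio = flash_cost / oss_cost
composite_gap = 0.819 - 0.814

print(f"cost ratio:    {cost_ratio:.0f}x")      # 125x
print(f"composite gap: {composite_gap:.3f}")    # 0.005
```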

How it is scored

Each output passes through two signals. The first is execution: the code runs in a sandbox that measures objective things such as mutation-kill rate, branch coverage, whether a repaired bug now passes its hidden test, and whether the Playwright script reaches its assertion. The second is a dual judge: one model scores craftsmanship against a rubric, another flags calls to functions and APIs that do not exist in the code under test. The two combine into the composite and an A/B/C tier. Everything generated is untrusted, so the runner is Docker with no network, capped memory and CPU, and a read-only root, with a local-subprocess fallback when no daemon is present. Runs are resumable, keyed on (model, track, sample, trial).
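As a sketch of what that sandbox invocation could look like, the function below builds a `docker run` argument list with the restrictions just described. The Docker flags are real; everything else is an assumption for illustration: the function name, the default limits, the `/work` mount point, the image name, and the fallback check, which here only looks for a `docker` binary on `PATH` rather than probing for a running daemon.

```python
import shutil
from typing import List, Optional, Sequence


def sandbox_command(image: str, workdir: str, command: Sequence[str],
                    memory: str = "512m", cpus: str = "1.0",
                    docker_available: Optional[bool] = None) -> List[str]:
    """Build the argv for running untrusted code (illustrative defaults)."""
    if docker_available is None:
        # Simplification: a binary on PATH does not prove the daemon is up.
        docker_available = shutil.which("docker") is not None
    if not docker_available:
        # Local-subprocess fallback: run the command as-is, unsandboxed.
        return list(command)
    return [
        "docker", "run", "--rm",
        "--network", "none",       # no network access
        "--memory", memory,        # capped memory
        "--cpus", cpus,            # capped CPU
        "--read-only",             # read-only root filesystem
        "--tmpfs", "/tmp",         # scratch space, since the root is read-only
        "-v", f"{workdir}:/work",  # hypothetical mount point for the sample
        "-w", "/work",
        image,
        *command,
    ]


argv = sandbox_command("python:3.12-slim", "/tmp/run-001", ["pytest", "-q"],
                       docker_available=True)
print(" ".join(argv))
```

The resulting list can be handed to `subprocess.run(argv, timeout=...)`. Passing a list rather than a shell string keeps model-generated text from being interpreted by a shell.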

What this leaves open

The first half of the test-automation problem is mature enough to be useful today: models write unit tests and simple flows competently. The second half (choosing patterns, designing for maintainability, handling the failure modes that make real automation flaky) remains a challenge, and a model's general capability does not predict it: the strongest models on this board still top out at 0.60 on the advanced track. Until that engineering judgment improves, the right way to deploy them is the way you would onboard a junior: let them draft the obvious tests, and keep the structure of the suite under human ownership.

    uv sync --extra dev --extra dashboard
    cp .env.example .env                                  # add OPENROUTER_API_KEY
    uv run qabench run --track unit_test_gen --models claude-sonnet-4-6 --limit 5
    uv run qabench score <run_id>
    uv run qabench dashboard

Code, data, and the cross-model dashboard: github.com/kidby/llm-qa-benchmark.