The "+16pt harness effect"

Empirical evidence that the harness matters as much as the model. Benchmarks that
hold the model constant and vary the harness show large differences in autonomous
task completion: a gap of roughly 16 points between a strong harness
configuration and a weak one on agentic benchmarks. Students learn to read
agent benchmarks critically (SWE-bench versus Terminal-Bench, contamination
concerns) and to treat harness configuration as a first-class performance lever.
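
The comparison described above (same model, different harnesses) can be sketched
in a few lines. This is only an illustration under stated assumptions: the record
layout, the function names completion_rates and harness_gap, and the toy data are
all hypothetical and do not come from any benchmark's actual tooling or results.

```python
from collections import defaultdict


def completion_rates(results):
    """Return {(model, harness): percent of tasks completed}.

    `results` is an iterable of (model, harness, passed) records,
    a hypothetical layout chosen for this sketch.
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for model, harness, passed in results:
        total[(model, harness)] += 1
        solved[(model, harness)] += int(passed)
    return {key: 100.0 * solved[key] / total[key] for key in total}


def harness_gap(results, model):
    """Gap in points between the best and worst harness for one fixed model."""
    rates = {h: r for (m, h), r in completion_rates(results).items() if m == model}
    if len(rates) < 2:
        raise ValueError("need at least two harnesses for the same model")
    best = max(rates, key=rates.get)
    worst = min(rates, key=rates.get)
    return best, worst, rates[best] - rates[worst]


# Made-up toy records, NOT real benchmark numbers.
results = (
    [("model-a", "harness-strong", True)] * 3
    + [("model-a", "harness-strong", False)] * 1
    + [("model-a", "harness-weak", True)] * 2
    + [("model-a", "harness-weak", False)] * 2
)

best, worst, gap = harness_gap(results, "model-a")
print(f"{best} vs {worst}: {gap:+.1f} points")
```

Filtering on a single model before taking the difference is the point of the
design: any remaining gap is attributable to the harness rather than to the model.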

Sources