Validation
Building a validation harness agents cannot trivially game
External scenarios, twins, and tools beat in-repo tests alone when the implementer and tester share the same blind spots.
This article is AI-assisted and co-authored by Xesca Alabart, co-founder of EasySpecs.
A validation harness answers: “How do we know the factory output is actually right?” Without it, you get plausible code that fails in production.
Reward hacking
If tests live in the codebase the agent edits, you invite reward hacking: satisfy the letter of a narrow assertion without satisfying intent. External scenarios and holdout sets are part of the fix.
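One way to keep a holdout set out of the agent's reach is to run scenarios from a harness the agent never edits, with checks phrased as observable-behavior predicates rather than brittle in-repo asserts. A minimal sketch (the `summarize` function and scenario names are hypothetical, not from the source):

```python
# Sketch of an external holdout runner. Scenario files would live outside
# the repo the agent edits; `summarize` stands in for the code under test.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    input: str
    check: Callable[[str], bool]  # predicate on observable behavior

def run_holdout(fn: Callable[[str], str], scenarios: list[Scenario]) -> dict[str, bool]:
    """Run scenarios the agent never sees; report pass/fail per scenario."""
    return {s.name: s.check(fn(s.input)) for s in scenarios}

# Hypothetical implementation under test.
def summarize(text: str) -> str:
    return text.split(".")[0] + "."

holdout = [
    Scenario("keeps-first-sentence", "A. B. C.", lambda out: out.startswith("A")),
    Scenario("ends-with-period", "Hello world. More.", lambda out: out.endswith(".")),
]
results = run_holdout(summarize, holdout)
```

Because the predicates describe intent ("output starts with the first sentence") rather than exact strings, satisfying the letter without the intent is harder.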
Layers
- Scenarios — natural-language descriptions of observable behavior, often scored by an LLM-as-judge (with known caveats).
- Digital twins — behavioral clones of dependencies for volume and safety (DTU).
- Tool-based verification — test runner, linter, DB checks, browser automation: whatever proves reality.
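A digital twin can be as small as an in-process behavioral clone of a dependency that is unsafe or slow to hammer for real. A minimal sketch, assuming a payments service with a `charge` call (`PaymentsTwin` and all names here are illustrative, not a real API):

```python
# Digital-twin sketch: a behavioral clone of a payments dependency.
# Safe to call at volume; also records traffic for later inspection.
class PaymentsTwin:
    def __init__(self):
        self.calls = []  # record of every charge attempt

    def charge(self, amount_cents: int) -> dict:
        self.calls.append(amount_cents)
        if amount_cents <= 0:
            return {"status": "rejected", "reason": "invalid_amount"}
        return {"status": "ok", "charged": amount_cents}

def checkout(payments, amount_cents: int) -> bool:
    # Code under test talks to the twin exactly as it would the real service.
    return payments.charge(amount_cents)["status"] == "ok"

twin = PaymentsTwin()
ok = checkout(twin, 500)
rejected = checkout(twin, 0)
```

The twin's recorded call log doubles as evidence for tool-based checks: the harness can assert not just on return values but on what the code actually sent to the dependency.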
Compound correctness
The harness closes the loop with diagnosis and repair, so each iteration compounds correctness instead of drifting.
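The loop structure can be sketched in a few lines; `run_harness`, `diagnose`, and `repair` are placeholders standing in for the real harness and agent calls:

```python
# Closed-loop sketch: run harness -> diagnose failures -> repair -> re-run.
# The toy stubs below just simulate one failing pass followed by a fix.
def run_harness(code: str) -> list[str]:
    return [] if "fixed" in code else ["scenario-1 failed"]

def diagnose(failures: list[str]) -> str:
    return "off-by-one in parser"  # placeholder for an agent diagnosis

def repair(code: str, diagnosis: str) -> str:
    return code + " fixed"  # placeholder for an agent patch

def loop(code: str, max_iters: int = 3) -> tuple[str, int]:
    for i in range(max_iters):
        failures = run_harness(code)
        if not failures:
            return code, i  # each pass starts from a greener baseline
        code = repair(code, diagnose(failures))
    return code, max_iters

final, iters = loop("draft")
```

The key property is that the harness, not the agent, decides when the loop exits: repair attempts are only accepted once the external checks go green.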
References
- Justin McCarthy / StrongDM — Scenarios, satisfaction, techniques