Validation
Building a validation harness agents cannot trivially game
External scenarios, twins, and tools beat in-repo tests alone when the implementer and tester share the same blind spots.
This article is AI-assisted and co-authored by Xesca Alabart, co-founder of EasySpecs.
A validation harness answers: “How do we know the factory output is actually right?” Without it, you get plausible code that fails in production.
Reward hacking
If tests live in the codebase the agent edits, you invite reward hacking: satisfy the letter of a narrow assertion without satisfying intent. External scenarios and holdout sets are part of the fix.
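One way to keep a holdout set out of the agent's reach is to run scenarios from a harness the agent never edits, with checks phrased as observable-behavior predicates rather than brittle in-repo asserts. A minimal sketch (the `summarize` function and scenario names are hypothetical, not from the source):

```python
# Sketch of an external holdout runner. Scenario files would live outside
# the repo the agent edits; `summarize` stands in for the code under test.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    input: str
    check: Callable[[str], bool]  # predicate on observable behavior

def run_holdout(fn: Callable[[str], str], scenarios: list[Scenario]) -> dict[str, bool]:
    """Run scenarios the agent never sees; report pass/fail per scenario."""
    return {s.name: s.check(fn(s.input)) for s in scenarios}

# Hypothetical implementation under test.
def summarize(text: str) -> str:
    return text.split(".")[0] + "."

holdout = [
    Scenario("keeps-first-sentence", "A. B. C.", lambda out: out.startswith("A")),
    Scenario("ends-with-period", "Hello world. More.", lambda out: out.endswith(".")),
]
results = run_holdout(summarize, holdout)
```

Because the predicates describe intent ("output starts with the first sentence") rather than exact strings, satisfying the letter without the intent is harder.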
Layers
- Scenarios — natural-language descriptions of observable behavior, often scored by an LLM-as-judge (with known caveats).
- Digital twins — behavioral clones of dependencies for volume and safety (DTU).
- Tool-based verification — test runner, linter, DB checks, browser automation: whatever proves reality.
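A digital twin can be as small as an in-process behavioral clone of a dependency that is unsafe or slow to hammer for real. A minimal sketch, assuming a payments service with a `charge` call (`PaymentsTwin` and all names here are illustrative, not a real API):

```python
# Digital-twin sketch: a behavioral clone of a payments dependency.
# Safe to call at volume; also records traffic for later inspection.
class PaymentsTwin:
    def __init__(self):
        self.calls = []  # record of every charge attempt

    def charge(self, amount_cents: int) -> dict:
        self.calls.append(amount_cents)
        if amount_cents <= 0:
            return {"status": "rejected", "reason": "invalid_amount"}
        return {"status": "ok", "charged": amount_cents}

def checkout(payments, amount_cents: int) -> bool:
    # Code under test talks to the twin exactly as it would the real service.
    return payments.charge(amount_cents)["status"] == "ok"

twin = PaymentsTwin()
ok = checkout(twin, 500)
rejected = checkout(twin, 0)
```

The twin's recorded call log doubles as evidence for tool-based checks: the harness can assert not just on return values but on what the code actually sent to the dependency.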
Compound correctness
The harness closes the loop with diagnosis and repair, so each iteration compounds correctness instead of drifting.
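The loop structure can be sketched in a few lines; `run_harness`, `diagnose`, and `repair` are placeholders standing in for the real harness and agent calls:

```python
# Closed-loop sketch: run harness -> diagnose failures -> repair -> re-run.
# The toy stubs below just simulate one failing pass followed by a fix.
def run_harness(code: str) -> list[str]:
    return [] if "fixed" in code else ["scenario-1 failed"]

def diagnose(failures: list[str]) -> str:
    return "off-by-one in parser"  # placeholder for an agent diagnosis

def repair(code: str, diagnosis: str) -> str:
    return code + " fixed"  # placeholder for an agent patch

def loop(code: str, max_iters: int = 3) -> tuple[str, int]:
    for i in range(max_iters):
        failures = run_harness(code)
        if not failures:
            return code, i  # each pass starts from a greener baseline
        code = repair(code, diagnose(failures))
    return code, max_iters

final, iters = loop("draft")
```

The key property is that the harness, not the agent, decides when the loop exits: repair attempts are only accepted once the external checks go green.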
References
- Justin McCarthy / StrongDM — Scenarios, satisfaction, techniques