v1.0 · 2026
工程方法论 · 验证篇
ENGINEERING · VERIFICATION CHAPTER

AI-Native 验证方法论

AI-Native Verification & Evals

生成已近免费,确认它正确,依旧昂贵。这道不对称,划定了自动化的边界。当 agent 无人值守地跑,它也在无人值守地犯错——「done」永远是一个声明,不是一个证明。所以验证不再是收尾的一道闸门,而是整个循环里唯一的承重墙,也是未来最大的工程投入方向。一条尺子:某领域的自动化进度,约等于它验证成本的下降速度

Generation is nearly free; confirming that it is correct is still expensive. That asymmetry draws the boundary of automation. When agents run unattended, they also err unattended, and "done" is always a claim, never a proof. So verification is no longer a gate at the end but the one load-bearing wall of the entire loop, and the largest engineering investment ahead. One ruler: a domain automates roughly as fast as its verification cost falls.

SHEET
00
PROLOGUE · 概念
PROLOGUE · The Concept
定义 · 先划界
Definition · Draw the line first

不是「人审更多」,是让验证跟上生成

Not "review more," but verification that keeps up with generation

AI-Native 验证的转向,是从"多写点测试、多审点代码",走向让验证随生成一起规模化。生成的产出在指数增长,人审的带宽却恒定——靠人审更多,是用算术追指数。真正的问题是:怎么让验证随生成一起规模化

AI-Native verification shifts from "write more tests, review more code" to making verification scale with generation. Output grows exponentially while human review bandwidth is constant; reviewing more is arithmetic chasing an exponential. The real question is: how do you make verification scale with generation.

在前 AI 时代,验证是流程末端的一道人审闸门,够用,因为产出本身慢。当 agentic 生成让产出近乎免费且暴涨,这道末端闸门立刻成为整条管线最窄的口。这不是某个团队的疏忽,是一个结构性的不对称:生成的成本塌到接近零,确认正确的成本没有

In the pre-AI era, verification was a human gate at the end of the pipeline, and it sufficed because output was slow. Once agentic generation makes output nearly free and explosive, that end gate instantly becomes the narrowest point in the whole pipeline. This is not one team's oversight; it is a structural asymmetry: the cost of generating collapsed toward zero, the cost of confirming correctness did not.

所以这一卷不教"怎么审得更勤",而是问一个结构问题:当生成充裕,验证该怎么重画——哪些能交给机器机检、哪些必须把验证写进结构、人审到底守在哪几处不可机检的判断上,以及怎么把人从实时盯屏改成异步分诊。

So this volume does not teach "review more diligently." It asks a structural question: when generation is abundant, how should verification be redrawn: what can be machine-checked, what must be written into the structure, which few non-machine-checkable judgments humans must hold, and how to move humans from real-time watching to asynchronous triage.

SHEET
01
THE KERNEL · 内核特化
THE KERNEL · Specialization
命题 · 承重
Thesis · Load-bearing

同一条内核,作用在信任这个面上

The same kernel, on the surface of trust

系列内核:执行变充裕 → 判断退守 → 上下文成基设 → 人回归意义。把四步填上"验证"的具体内容——退守的那个节点,就是验证;它正是工程卷的新瓶颈、谱系卷里唯一的承重墙。

The series kernel: execution becomes abundant → judgment retreats → context becomes infrastructure → people return to meaning. Fill the four steps with the specifics of verification: the node that judgment retreats to is verification itself, which is both the new bottleneck of the engineering volume and the one load-bearing wall of the genealogy volume.

母版 · 特化到验证
MASTER TEMPLATE · specialized to verification
充裕
ABUNDANCE
生成
Generation
agent 让产出近乎免费、暴涨;"造出来"不再稀缺。
Agents make output nearly free and explosive; "producing it" is no longer scarce.
判断
JUDGMENT
验证 = 瓶颈
Verification = bottleneck
稀缺的是确认正确;生成/验证的不对称划定自动化边界。
What is scarce is confirming correctness; the generation/verification asymmetry draws automation's boundary.
上下文
CONTEXT
验证即基础设施
Verification as infra
eval harness、可机检条件、可观测性——让验证随生成规模化。
Eval harnesses, machine-checkable conditions, observability, so verification scales with generation.
意义
MEANING
定义「何为正确」
Define "correct"
人定义什么算对、什么算好,并做终审;例行检查交给机器。
People define what counts as correct and good, and own the final sign-off; routine checking goes to machines.

下面的图纸:SHEET 02 拆解那道不对称,SHEET 03–06 逐一重画——把验证写进结构、做成 eval 评测台、改实时盯屏为异步分诊、守住不可机检的人审,SHEET 07 给落地原则与信号。最后一件仪器,帮你按"可机检 × 错误代价"给每类产出分配验证策略。

The sheets ahead: SHEET 02 unpacks the asymmetry; SHEET 03–06 redraw in turn – write verification into the structure, build it into an eval bench, replace real-time watching with async triage, and hold the non-machine-checkable human sign-off; SHEET 07 gives principles and signals. A final instrument allocates a verification strategy to each output type by "machine-checkable × cost of being wrong."

SHEET
02
MECHANISM · 不对称性
MECHANISM · The asymmetry
机理 · 受力分析
Mechanism · Force analysis

生成塌到零,验证没有

Generation fell to zero, verification did not

这道不对称就是整盘棋。生成 × 暴涨,验证 × 恒定——两条曲线一交叉,验证就成了唯一的瓶颈。编码之所以第一个被 loop 化,正因它验证成本最低(测试、CI,确定且毫秒级)。

This asymmetry is the whole game. Generation explodes while verification stays flat; once the two curves cross, verification becomes the only bottleneck. Coding was the first domain to be turned into a loop precisely because its verification cost is the lowest (tests and CI: deterministic and fast, down to milliseconds for a single test).

旧 · 验证是末端闸门
Before · a gate at the end
产出慢,人审够用;"done" 由人点头确认。
Output was slow, human review sufficed; "done" was confirmed by a human nod.
新 · 验证是承重墙
After · the load-bearing wall
产出暴涨,"done" 必须是可机检的证明,验证随生成规模化。
Output explodes; "done" must be a machine-checkable proof, and verification scales with generation.

这道不对称也给了一把判断领域成熟度的尺子:哪个领域的验证成本降下来,哪个领域就是下一个被自动化的。编码 loop 用测试与 CI 做选择压力——廉价、确定;科研 loop 用实验与自然做选择压力——昂贵、有噪、以周计。auto-research 何时成熟,量的也是同一把尺子。

The asymmetry also hands you a ruler for a domain's maturity: whichever domain's verification cost falls is the next to be automated. A coding loop uses tests and CI as selection pressure (cheap, deterministic); a research loop uses experiments and nature (expensive, noisy, week-scale). When auto-research matures is measured by the same ruler.

承重判据
Load-bearing test

「done」是一个声明,还是一个证明?凡是只能靠人点头说"应该对了"的地方,都还没被真正自动化——把它变成可机检的证明,自动化的边界才向前移一格。

Is "done" a claim or a proof? Anywhere it rests on a human nodding "this should be right" is not truly automated yet; turn it into a machine-checkable proof and automation's boundary moves one notch forward.

SHEET
03
REDRAW · 验证写进结构
REDRAW · Verify in the structure
重画 · 承重墙
Redraw · The wall

重画 · 独立 checker 是唯一的承重墙

Redraw · the independent checker is the one load-bearing wall

写代码的模型给自己打分太宽容。把验证从"事后人看"改成"结构里内置":独立 checker(可换模型)、可机检条件、类型与契约即护栏、CI 即选择压力。

The model that writes the code grades itself too leniently. Move verification from "a human looks afterward" to "built into the structure": an independent checker (ideally a different model), machine-checkable conditions, types and contracts as guardrails, CI as selection pressure.

机制:没有独立验证,一个循环只是在用惊人的速度放大自己的错误——生成者和检查者是同一个模型,等于让考生自己批卷。重画:把"做"和"查"分开。一个独立的 checker(最好换一个模型)是整个结构里唯一的承重墙;它垮,整层就垮。再把能机检的都机检掉:测试、类型、契约、lint、可机检的验收条件——这些是确定、毫秒级、可无限复用的选择压力。

Mechanism: without independent verification, a loop merely amplifies its own mistakes at remarkable speed; generator and checker being the same model is letting the candidate grade their own exam. Redraw: separate "do" from "check." An independent checker (ideally a different model) is the one load-bearing wall in the structure; if it fails, the whole floor fails. Then machine-check everything checkable: tests, types, contracts, lint, machine-checkable acceptance conditions, which are deterministic, millisecond-scale, infinitely reusable selection pressure.
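The separation of "do" from "check" described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `verify`, `run_loop`, `Verdict`, and the toy generator and checks are all hypothetical names introduced here. The point it shows is structural: the generator only proposes, and "done" can be set only by checks it does not control.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type for illustration: a check takes the candidate output
# and returns (passed, reason).
Check = Callable[[str], Tuple[bool, str]]


@dataclass
class Verdict:
    done: bool            # True only if every independent check passed
    failures: List[str]   # reasons reported by the checks that failed


def verify(candidate: str, checks: List[Check]) -> Verdict:
    """Run every check; 'done' counts as proved only if none of them fails."""
    failures = []
    for check in checks:
        passed, reason = check(candidate)
        if not passed:
            failures.append(reason)
    return Verdict(done=not failures, failures=failures)


def run_loop(generate: Callable[[List[str]], str],
             checks: List[Check],
             max_rounds: int = 3) -> Tuple[str, Verdict]:
    """The generator proposes; only the checks can declare 'done'."""
    feedback: List[str] = []
    candidate, verdict = "", Verdict(done=False, failures=["not run"])
    for _ in range(max_rounds):
        candidate = generate(feedback)
        verdict = verify(candidate, checks)
        if verdict.done:
            break
        feedback = verdict.failures  # failures feed the next attempt
    return candidate, verdict


# Toy stand-ins (assumptions): a generator that corrects itself once it
# receives feedback, and two machine checks.
def toy_generate(feedback: List[str]) -> str:
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def check_has_def(src: str) -> Tuple[bool, str]:
    return (src.startswith("def "), "must define a function")


def check_behaviour(src: str) -> Tuple[bool, str]:
    namespace: dict = {}
    exec(src, namespace)  # acceptable for a toy; sandbox real generated code
    return (namespace["add"](2, 3) == 5, "add(2, 3) should be 5")


candidate, verdict = run_loop(toy_generate, [check_has_def, check_behaviour])
```

In a real loop the behavioural check would be a test suite or a second model; what matters is that `toy_generate` has no code path that sets `done`.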

检验信号
Test signal

可机检占比上升——「done」越来越多地由机器证明,而非人点头。生成者从不给自己签字放行。

The machine-checkable fraction goes up: "done" is increasingly proved by machines rather than granted by a human nod. The generator never signs off on its own work.

SHEET
04
REDRAW · 评测台
REDRAW · The eval bench
重画 · 一等工件
Redraw · First-class artifact

重画 · 把 eval 当一等工件

Redraw · treat evals as a first-class artifact

评测不是测试的附属,是和代码同级的资产:数据集、评分器、回归套件、对抗式评审。它把"done"从声明变成可重复、可回归的证明,并随产出一起增长。

Evals are not an appendage of tests; they are an asset on par with code: datasets, graders, regression suites, adversarial review. They turn "done" from a claim into a repeatable proof that can be re-run as a regression check, and they grow alongside output.

机制:一次性人审不可回归——今天对了,明天换个 prompt 又错,没人知道。重画:把验证沉淀成评测台。每发现一类错误,就补一条 eval;每条 eval 都进回归套件,从此机器替你盯住它。配上可观测性(生产里出错的样本自动回流成新 eval),验证就从一次性人力,变成一项随时间复利的基础设施——这正是趋势所指:未来最大的工程投入流向验证基础设施。

Mechanism: a one-off human review cannot be re-run as a regression check; the output is right today, wrong tomorrow under a new prompt, and no one knows. Redraw: settle verification into an eval bench. For each class of error found, add an eval; every eval enters the regression suite, and the machine watches it for you from then on. With observability (production failures flow back automatically as new evals), verification turns from one-off labor into infrastructure that compounds over time, which is exactly where the trend points: the largest engineering investment ahead flows to verification infrastructure.
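A rough sketch of an eval bench that grows with every escape, under stated assumptions: `EvalCase`, `EvalBench`, `add_escape`, and the two toy systems are hypothetical names for illustration, and a real grader would usually be richer than a boolean function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    prompt: str
    grade: Callable[[str], bool]  # grader: True if the output is acceptable


@dataclass
class EvalBench:
    cases: Dict[str, EvalCase] = field(default_factory=dict)

    def add_case(self, case: EvalCase) -> None:
        self.cases[case.name] = case

    def add_escape(self, name: str, prompt: str,
                   grade: Callable[[str], bool]) -> None:
        """A failure seen in production becomes a permanent regression case."""
        self.add_case(EvalCase(name, prompt, grade))

    def run(self, system: Callable[[str], str]) -> List[str]:
        """Return the names of failing cases; an empty list means no regression."""
        return [c.name for c in self.cases.values()
                if not c.grade(system(c.prompt))]


# Toy systems under test (assumptions): v1 forgets to trim whitespace.
def system_v1(prompt: str) -> str:
    return prompt.upper()


def system_v2(prompt: str) -> str:
    return prompt.strip().upper()


bench = EvalBench()
bench.add_case(EvalCase("uppercases", "abc", lambda out: out == "ABC"))

# An escape is observed in production and flows back as a new eval.
bench.add_escape("trims-whitespace", "  hi  ", lambda out: out == "HI")

failing_v1 = bench.run(system_v1)
failing_v2 = bench.run(system_v2)
```

Once the escape is in the bench, any later version of the system is checked against it on every run, so the review effort spent finding it is not spent again.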

检验信号
Test signal

逃逸率下降,且每个逃逸都变成一条新 eval——同一个错误不会犯第二次。评测套件随产出一起长大。

The escape rate falls, and every escape becomes a new eval: the same mistake is not made twice. The eval suite grows alongside output.

SHEET
05
REDRAW · 异步分诊
REDRAW · Async triage
重画 · 人因
Redraw · Human factors

重画 · 设计验证,不要盯屏

Redraw · design verification, don't watch the screen

系统越可靠,监督者警觉性衰减越快——而恰在最该接管的异常时刻,人已丢失情境感知(Bainbridge 1983《自动化的反讽》)。「人在环上」注定盯不住。答案不是盯得更紧,是把人的介入从实时监督改成异步分诊。

The more reliable the system, the faster the supervisor's vigilance decays, and at the very moment of an anomaly, when takeover matters most, the human has already lost situational awareness (Bainbridge 1983, "Ironies of Automation"). A "human on the loop" is bound to fail at watching. The answer is not to watch harder but to move human intervention from real-time supervision to asynchronous triage.

机制:警觉性衰减是物理规律,不是态度问题——让人盯着一个 99% 正确的循环,那 1% 一定会漏过。重画:把验证写进结构(SHEET 03/04)之后,人就不必与机器的时钟同步了。机器机检的归机器;机检不了、又要紧的,攒成一个分诊队列,人异步地、带着完整上下文地处理。介入点从"实时盯屏"前移到"设计时定义验证"、后移到"异常时分诊"——两头都比中间的实时监督可靠。

Mechanism: vigilance decay is physics, not attitude; put a human in front of a 99%-correct loop and the 1% will slip by. Redraw: once verification is written into the structure (SHEET 03/04), the human no longer has to keep pace with the machine's clock. What machines can check goes to machines; what cannot be machine-checked but still matters accumulates in a triage queue that the human works through asynchronously, with full context. Intervention moves earlier, to "define verification at design time," and later, to "triage on exception"; both ends are more reliable than real-time supervision in the middle.
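The routing described above might look like the sketch below. Everything here is an assumption made for illustration: the names `TriageItem`, `TriageQueue`, and `route`, the priority scheme, and the choice to auto-reject outputs that fail a machine check (a team could equally send those to the queue).

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(order=True)
class TriageItem:
    priority: int                          # lower number = more urgent
    seq: int                               # tie-breaker: FIFO within a priority
    output_id: str = field(compare=False)
    context: dict = field(compare=False)   # everything the reviewer needs later


class TriageQueue:
    """Exceptions wait here; nobody has to watch in real time."""

    def __init__(self) -> None:
        self._heap: List[TriageItem] = []
        self._counter = itertools.count()

    def put(self, output_id: str, priority: int, context: dict) -> None:
        item = TriageItem(priority, next(self._counter), output_id, context)
        heapq.heappush(self._heap, item)

    def next_item(self) -> Optional[TriageItem]:
        return heapq.heappop(self._heap) if self._heap else None

    def __len__(self) -> int:
        return len(self._heap)


def route(output_id: str,
          machine_check: Optional[Callable[[], bool]],
          matters: bool,
          queue: TriageQueue,
          context: dict) -> str:
    """Machine-checkable work goes to the machine; the rest is queued or sampled."""
    if machine_check is not None:
        return "auto-accepted" if machine_check() else "auto-rejected"
    if matters:
        queue.put(output_id, priority=0, context=context)
        return "queued"
    return "sampled-later"


queue = TriageQueue()
r1 = route("lint-1", lambda: True, False, queue, {})
r2 = route("migration-7", None, True, queue, {"diff": "ALTER TABLE ..."})
r3 = route("copy-3", None, False, queue, {})
first = queue.next_item()
```

The `context` dictionary is the part that makes asynchronous review workable: the reviewer picks the item up hours later and still has what they need to judge it.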

检验信号
Test signal

人不再实时盯屏,而是在一个分诊队列里异步处理异常;分诊延迟可控,且没人需要全程在场。

Humans no longer watch in real time but handle exceptions asynchronously from a triage queue; triage latency is bounded, and no one needs to be present throughout.

SHEET
06
REDRAW · 人审守在哪
REDRAW · Where humans stay
重画 · 判断节点
Redraw · Judgment node

重画 · 机器查「对不对」,人定「何为对」

Redraw · machines check "is it right," humans define "what is right"

机器能验证"符不符合规格",但定义规格、定义"什么算对、什么算好",是机器替不了的。把人审收敛到不可机检、又要紧的那几处——这与工程卷的 trust-but-verify、架构卷的信任边界,是同一道分界。

Machines can verify "does it meet the spec," but defining the spec, that is, defining what counts as correct and what counts as good, cannot be handed to a machine. Converge human review onto the few things that are non-machine-checkable yet matter; this is the same line as the engineering volume's trust-but-verify and the architecture volume's trust boundaries.

机器查:对不对
Machines: is it right
  • 符不符合规格 / 测试 / 契约
  • Conformance to spec / tests / contracts
  • 回归:以前对的还对不对
  • Regression: does what was right stay right
  • 可机检的验收条件
  • Machine-checkable acceptance conditions
  • 异常与漂移的可观测信号
  • Observable signals of anomaly and drift
人定:何为对
Humans: what is right
  • 「完成」与「好」的定义(质量的规格)
  • The definition of "done" and "good" (the spec of quality)
  • 风险容忍、安全与信任边界
  • Risk tolerance, security and trust boundaries
  • 品味与方向:这是不是我们想要的
  • Taste and direction: is this what we want
  • 承担后果的终审签字
  • The final sign-off that owns the outcome

注意一个对称性:两个人用完全相同的验证结构,会得到相反的结果——一个用它在深刻理解的工作上加速,另一个用它逃避理解本身。结构分不出区别,人分得出。这就是为什么"定义何为正确"这件事,是人无法外包的最后一项资产。

Note a symmetry: two people using the exact same verification structure get opposite results – one uses it to accelerate work they deeply understand, the other to escape understanding itself. The structure cannot tell the difference; the person can. That is why "defining what counts as correct" is the last asset a human cannot outsource.

SHEET
07
PLAYBOOK · 落地
PLAYBOOK · Rollout
行动 · 可执行
Action · Operable

落地 · 四原则 + 四信号

Rollout · four principles + four signals

把验证当成你最该投的基础设施。原则定方向,信号验真伪。起步只需一步:先量出你的验证成本花在哪——再用下面的仪器,给每类产出分配策略。

Treat verification as the infrastructure most worth investing in. Principles set the direction; signals tell you whether it is real. Getting started takes one step: measure where your verification cost goes, then use the instrument below to allocate a strategy to each output type.

四条原则

Four principles

原则 01 · 写进结构
PRINCIPLE 01 · Build it in
能机检的都机检;生成者绝不给自己签字——独立 checker 是承重墙。
Machine-check all you can; the generator never signs off on itself – the independent checker is the wall.
原则 02 · eval 即资产
PRINCIPLE 02 · Evals are assets
每个错误补一条 eval、进回归套件;验证随产出复利。
Each error adds an eval into the regression suite; verification compounds with output.
原则 03 · 异步分诊
PRINCIPLE 03 · Async triage
别盯屏——警觉性衰减是物理;介入前移到设计、后移到异常。
Don't watch; vigilance decay is physics. Move intervention to design-time and to exceptions.
原则 04 · 守住「何为对」
PRINCIPLE 04 · Own "what is right"
机器查对不对,人定何为对——质量的规格不可外包。
Machines check "is it right," humans define "what is right"; the spec of quality is not outsourceable.

四个该追踪的信号

Four signals to track

单位验证成本
Cost per verification
确认一件产出正确,平均要花多少人力。
The average human cost to confirm one output is correct.
可机检占比
Machine-checkable %
「done」由机器证明、而非人点头的比例。
The share of "done" that is proved by machines rather than granted by a human nod.
逃逸率
Escape rate
漏到生产的缺陷;每个逃逸都该回流成 eval。
Defects that reach production; each escape should flow back as an eval.
分诊延迟
Triage latency
异常进队列到被处理的时间——可控即可放手。
Time from an exception entering the queue to being handled; bounded means you can let go.
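One way the four signals could be computed from a log of outputs, as a sketch only. The record fields, the function name `signals`, and the specific definitions are assumptions made here for illustration: escape rate is taken as escapes per output, and triage latency is reported as the worst case so that "bounded" can be checked against a limit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class OutputRecord:
    human_minutes: float                   # human time spent confirming this output
    machine_proved: bool                   # "done" established by machine checks alone
    escaped: bool                          # a defect from this output reached production
    triage_hours: Optional[float] = None   # queue wait, if the output was triaged


def signals(records: List[OutputRecord]) -> Dict[str, float]:
    """Compute the four tracking signals over a batch of outputs."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    waits = [r.triage_hours for r in records if r.triage_hours is not None]
    return {
        "cost_per_verification_min": sum(r.human_minutes for r in records) / n,
        "machine_checkable_share": sum(r.machine_proved for r in records) / n,
        "escape_rate": sum(r.escaped for r in records) / n,
        "max_triage_latency_h": max(waits) if waits else 0.0,
    }


# Made-up sample data, purely to show the shape of the calculation.
sample = [
    OutputRecord(human_minutes=0.0, machine_proved=True, escaped=False),
    OutputRecord(human_minutes=0.0, machine_proved=True, escaped=False),
    OutputRecord(human_minutes=30.0, machine_proved=False, escaped=True,
                 triage_hours=4.0),
    OutputRecord(human_minutes=10.0, machine_proved=False, escaped=False,
                 triage_hours=2.0),
]
report = signals(sample)
```

Tracked over time, the first and third numbers should fall while the second rises; the fourth only needs to stay under whatever limit the team has agreed on.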

怎么起步

How to start

把团队产出的每一类,丢进下面的分配器:按"可机检吗 × 错了代价大吗"两轴,它会告诉你哪些全自动机检、哪些机检+抽样、哪些抽样容错、哪些必须人审承重。人审带宽稀缺,只花在承重那几类上。

Drop every type of output your team produces into the allocator below: on the two axes "is it machine-checkable × is it costly if wrong," it tells you which to auto-check, which to check and sample, which to sample and tolerate, and which a human must sign off. Human review bandwidth is scarce; spend it on the load-bearing few.

INSTRUMENT 09
验证策略分配器
VERIFICATION-STRATEGY ALLOCATOR
产出类型 | 可机检? | 代价大? | 策略
Output type | Checkable? | Costly? | Strategy
  • 单元逻辑
  • Unit logic
  • 代码风格 / lint
  • Code style / lint
  • 数据迁移
  • Data migration
  • 营销文案
  • Marketing copy
  • 安全 / 信任边界
  • Security / trust boundary
  • 产品品味 / 方向
  • Product taste / direction

规律:可机检 × 代价低 → 全自动机检;可机检 × 代价高 → 机检 + 抽样人审;不可机检 × 代价低 → 抽样容错;不可机检 × 代价高 → 必人审承重(安全、信任边界、品味)。模型变强、机检能力上移,这张表要重答:可机检的那条线一直在右移。

The rule: checkable × low cost → auto-check; checkable × high cost → machine-check plus human sampling; not checkable × low cost → sample and tolerate; not checkable × high cost → a human must sign off (security, trust boundaries, taste). As models improve and machine-checking climbs, re-answer the table – the line of what is checkable keeps moving right.
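The four-quadrant rule above is small enough to write down directly. In this sketch the `Strategy` enum and the `allocate` function are hypothetical names; the mapping itself follows the rule as stated. The answers given for each output type are example inputs only (only the security row is stated in the text); each team has to supply its own.

```python
from enum import Enum
from typing import Dict, Tuple


class Strategy(Enum):
    AUTO_CHECK = "auto-check"
    CHECK_AND_SAMPLE = "machine-check + human sampling"
    SAMPLE_AND_TOLERATE = "sample and tolerate"
    HUMAN_SIGN_OFF = "human must sign off"


def allocate(checkable: bool, costly: bool) -> Strategy:
    """Map the two answers (machine-checkable? costly if wrong?) to a strategy."""
    if checkable:
        return Strategy.CHECK_AND_SAMPLE if costly else Strategy.AUTO_CHECK
    return Strategy.HUMAN_SIGN_OFF if costly else Strategy.SAMPLE_AND_TOLERATE


# Example answers (assumptions, not taken from the table): (checkable, costly).
answers: Dict[str, Tuple[bool, bool]] = {
    "Code style / lint": (True, False),
    "Data migration": (True, True),
    "Marketing copy": (False, False),
    "Security / trust boundary": (False, True),
}

plan = {name: allocate(*answer) for name, answer in answers.items()}
```

Because the checkable line keeps moving, the useful habit is to re-run the allocation whenever an output type's answers change, rather than treating `plan` as fixed.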