Generation is nearly free; confirming that the output is correct is still expensive. That asymmetry draws the boundary of automation. When agents run unattended, they also err unattended, and "done" is always a claim, never a proof. Verification is therefore no longer a gate at the end of the loop but its one load-bearing wall, and the largest engineering investment ahead. One ruler: a domain's automation progress roughly tracks how fast its verification cost falls.
AI-Native verification turns away from "write more tests, review more code." Generated output grows exponentially while human review bandwidth stays constant, so asking people to review more is arithmetic chasing an exponential. The real question is how to make verification scale alongside generation.
In the pre-AI era, verification was a human review gate at the end of the process, and it sufficed because output itself was slow. Once agentic generation makes output nearly free and explosive in volume, that end gate immediately becomes the narrowest point in the whole pipeline. This is not one team's oversight; it is a structural asymmetry: the cost of generating has collapsed toward zero, and the cost of confirming correctness has not.
So this volume does not teach "review more diligently." It asks a structural question: when generation is abundant, how should verification be redrawn? What can be handed to machine checks, what verification must be written into the structure, which few non-machine-checkable judgments human review must guard, and how do people move from real-time screen-watching to asynchronous triage?
The series kernel: execution becomes abundant → judgment retreats → context becomes infrastructure → people return to meaning. Fill those four steps with the specifics of verification, and the node that judgment retreats to is verification itself: the new bottleneck named in the engineering volume, and the one load-bearing wall in the genealogy volume.
The sheets ahead: SHEET 02 unpacks the asymmetry; SHEET 03–06 redraw verification one move at a time: write it into the structure, build it into an eval bench, replace real-time watching with asynchronous triage, and hold the human review that cannot be machine-checked; SHEET 07 gives principles for adoption and the signals to watch. A final instrument helps you assign a verification strategy to each type of output by "machine-checkable × cost of error."
This asymmetry is the whole game. Generation explodes while verification stays flat; once the two curves cross, verification becomes the only bottleneck. Coding was the first domain to be turned into a loop precisely because its verification cost is lowest: tests and CI are deterministic, and unit tests run in milliseconds.
The asymmetry also hands you a ruler for judging a domain's maturity: whichever domain sees its verification cost fall is the next to be automated. A coding loop uses tests and CI as selection pressure, which is cheap and deterministic; a research loop uses experiments and nature itself, which is expensive, noisy, and measured in weeks. When auto-research will mature is measured by the same ruler.
Is "done" a claim or a proof? Anywhere the answer still rests on a person nodding and saying "this should be right" has not truly been automated. Turn that nod into a machine-checkable proof, and the boundary of automation moves one notch forward.
The model that writes the code grades itself too leniently. Move verification from "a human looks afterward" to "built into the structure": an independent checker (which can be a different model), machine-checkable conditions, types and contracts as guardrails, and CI as selection pressure.
Mechanism: without independent verification, a loop merely amplifies its own mistakes at remarkable speed. When the generator and the checker are the same model, the candidate is grading their own exam. Redraw: separate "do" from "check." An independent checker (ideally a different model) is the one load-bearing wall in the whole structure; if it fails, the whole floor fails. Then machine-check everything that can be machine-checked: tests, types, contracts, lint, and machine-checkable acceptance conditions. These are deterministic, fast, and endlessly reusable selection pressure.
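The separation of "do" and "check" can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `run_loop`, `Verdict`, and the toy generator and checks are all hypothetical names, and the "independent checker" is a plain function standing in for what would be a separate model or service in practice.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical shapes: a generator takes the task plus prior failure reasons;
# a check returns a failure reason, or None when the candidate passes.
Generator = Callable[[str, List[str]], str]
Check = Callable[[str], Optional[str]]


@dataclass
class Verdict:
    passed: bool
    reasons: List[str]


def run_loop(
    task: str,
    generate: Generator,
    machine_checks: List[Check],
    independent_checker: Check,
    max_rounds: int = 3,
) -> Tuple[Optional[str], Verdict]:
    """Generate, then verify with parties other than the generator."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        # Cheap deterministic checks run first (tests, types, lint, ...).
        reasons = [r for r in (check(candidate) for check in machine_checks) if r]
        # The independent checker only sees candidates that survived them.
        if not reasons:
            reason = independent_checker(candidate)
            if reason:
                reasons.append(reason)
        if not reasons:
            return candidate, Verdict(True, [])
        feedback = reasons  # failures become input to the next round
    # The generator never signs off: without a pass, nothing is returned.
    return None, Verdict(False, feedback)


# Toy stand-ins, for illustration only.
def toy_generate(task: str, feedback: List[str]) -> str:
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): pass  # TODO"


def no_todo(candidate: str) -> Optional[str]:
    return "contains TODO" if "TODO" in candidate else None


def reviewer(candidate: str) -> Optional[str]:
    return None if "return" in candidate else "no return value"


candidate, verdict = run_loop("add two numbers", toy_generate, [no_todo], reviewer)
print(verdict.passed, candidate)
```

The design choice worth noting is that the loop has no path where the generator's own opinion counts as acceptance; a candidate leaves the loop only through the checks.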
The machine-checkable fraction rises: "done" is increasingly proved by machines rather than nodded through by a person. The generator never signs off on its own work.
Evals are not an appendage of tests; they are an asset on par with code: datasets, graders, regression suites, adversarial review. They turn "done" from a claim into a repeatable proof that can be regression-tested, and they grow alongside output.
Mechanism: a one-off human review cannot be regression-tested. The output is right today, wrong tomorrow under a different prompt, and no one knows. Redraw: settle verification into an eval bench. For each class of error you find, add an eval; every eval enters the regression suite, and from then on the machine watches it for you. Add observability, so that samples that fail in production flow back automatically as new evals, and verification turns from one-off labor into infrastructure that compounds over time. That is exactly where the trend points: the largest engineering investment ahead flows to verification infrastructure.
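The "every escape becomes an eval" habit can be sketched as a small data structure. This is a hedged illustration under simple assumptions: `EvalBench`, `EvalCase`, and `add_from_escape` are hypothetical names, the "system under test" is a toy slug function, and a real bench would persist its cases and use richer graders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    grader: Callable[[str], bool]  # True when the output is acceptable


@dataclass
class EvalBench:
    cases: List[EvalCase] = field(default_factory=list)

    def add_from_escape(self, name: str, prompt: str,
                        grader: Callable[[str], bool]) -> None:
        """Turn an escaped failure into a permanent regression case."""
        if any(case.name == name for case in self.cases):
            return  # this error class is already being watched
        self.cases.append(EvalCase(name, prompt, grader))

    def run(self, system: Callable[[str], str]) -> Dict[str, bool]:
        """Re-run every recorded case against the current system."""
        return {case.name: case.grader(system(case.prompt)) for case in self.cases}


# Toy system under test: version 1 mishandles surrounding whitespace.
def slugify_v1(text: str) -> str:
    return text.lower().replace(" ", "-")


def slugify_v2(text: str) -> str:
    return "-".join(text.lower().split())


bench = EvalBench()
# A failure observed in production flows back as a new eval.
bench.add_from_escape(
    name="trim-whitespace",
    prompt=" Hello World ",
    grader=lambda out: out == "hello-world",
)

print(bench.run(slugify_v1))  # the recorded escape is caught
print(bench.run(slugify_v2))  # the fixed version passes
```

Because cases are only ever added, the suite grows with the errors it has seen, which is the compounding effect the paragraph above describes.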
The escape rate falls, and every escape becomes a new eval, so the same mistake is not made twice. The eval suite grows alongside output.
The more reliable the system, the faster the supervisor's vigilance decays, and at exactly the abnormal moment when takeover is most needed, the human has already lost situational awareness (Bainbridge 1983, "Ironies of Automation"). A "human on the loop" is bound to fail at watching. The answer is not to watch harder but to move human intervention from real-time supervision to asynchronous triage.
Mechanism: vigilance decay behaves like a law of nature, not an attitude problem. Put a person in front of a loop that is 99% correct, and the 1% will slip by. Redraw: once verification is written into the structure (SHEET 03/04), the person no longer has to stay in step with the machine's clock. What machines can check goes to machines; what cannot be machine-checked yet still matters accumulates in a triage queue that a person handles asynchronously, with full context. Intervention moves forward to "define verification at design time" and backward to "triage on exception," and both ends are more reliable than real-time supervision in the middle.
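The routing described above can be sketched as a queue that collects only the items machines cannot decide and that matter. This is an illustrative sketch under stated assumptions: `TriageQueue`, `TriageItem`, and `route` are hypothetical names, a check result of `None` is used here to mean "not machine-checkable," and priorities are plain integers where lower means more urgent.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(order=True)
class TriageItem:
    priority: int  # lower number = more urgent
    seq: int       # tie-breaker: first in, first out
    summary: str = field(compare=False)
    context: Dict[str, Any] = field(compare=False, default_factory=dict)


class TriageQueue:
    """Holds exceptions until a person is ready; nobody watches in real time."""

    def __init__(self) -> None:
        self._heap: List[TriageItem] = []
        self._seq = 0

    def escalate(self, summary: str, priority: int, context: Dict[str, Any]) -> None:
        heapq.heappush(self._heap, TriageItem(priority, self._seq, summary, context))
        self._seq += 1

    def next_item(self) -> Optional[TriageItem]:
        return heapq.heappop(self._heap) if self._heap else None

    def __len__(self) -> int:
        return len(self._heap)


def route(summary: str, check_result: Optional[bool], high_stakes: bool,
          queue: TriageQueue, context: Dict[str, Any], priority: int = 1) -> str:
    """check_result: True/False from a machine check, None if not checkable."""
    if check_result is True:
        return "accepted"   # the machine decided; no human involved
    if check_result is False:
        return "rejected"
    if high_stakes:
        # Not checkable but it matters: queue it with its full context.
        queue.escalate(summary, priority, context)
        return "queued"
    return "tolerated"      # not checkable, low stakes


queue = TriageQueue()
route("rename variable", True, False, queue, {})
route("pricing page copy", None, True, queue, {"diff": "..."}, priority=2)
route("auth flow change", None, True, queue, {"diff": "..."}, priority=0)
print(len(queue), queue.next_item().summary)
```

Attaching the context at escalation time is the point of the design: the person who picks the item up later does not need to have been watching when it happened.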
People no longer watch the screen in real time; they handle exceptions asynchronously from a triage queue. Triage latency stays under control, and no one needs to be present throughout.
Machines can verify "does it meet the spec," but defining the spec, defining what counts as correct and what counts as good, is something machines cannot take over. Converge human review onto the few places that are both non-machine-checkable and important. This is the same dividing line as trust-but-verify in the engineering volume and the trust boundary in the architecture volume.
Note a symmetry: two people using exactly the same verification structure can get opposite results. One uses it to accelerate work they deeply understand; the other uses it to escape understanding itself. The structure cannot tell the difference; the person can. That is why "defining what counts as correct" is the last asset a person cannot outsource.
Treat verification as the infrastructure most worth your investment. Principles set the direction; signals tell you whether it is real. Getting started takes one step: first measure where your verification cost is being spent, then use the instrument below to assign a strategy to each type of output.
Drop every type of output your team produces into the allocator below. On the two axes "is it machine-checkable × is it costly if wrong," it tells you which types get fully automated checks, which get machine checks plus sampling, which are sampled and tolerated, and which must carry a load-bearing human review. Human review bandwidth is scarce; spend it only on the load-bearing few.
The rule: machine-checkable × low cost → fully automated machine check; machine-checkable × high cost → machine check plus sampled human review; not machine-checkable × low cost → sample and tolerate; not machine-checkable × high cost → load-bearing human review is mandatory (security, trust boundaries, taste). As models get stronger and machine-checking capability climbs, this table has to be re-answered: the line marking what is machine-checkable keeps moving right.
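The two-axis rule above maps directly onto a small lookup. The sketch below is only an illustration of that table: `Strategy` and `allocate` are hypothetical names, both axes are reduced to booleans, and the example output types are invented for the demonstration.

```python
from enum import Enum


class Strategy(Enum):
    AUTO_CHECK = "fully automated machine check"
    CHECK_PLUS_SAMPLING = "machine check plus sampled human review"
    SAMPLE_AND_TOLERATE = "sample and tolerate"
    HUMAN_REVIEW = "mandatory load-bearing human review"


def allocate(machine_checkable: bool, high_cost_if_wrong: bool) -> Strategy:
    """Map the two axes of the rule onto one of four strategies."""
    if machine_checkable:
        return (Strategy.CHECK_PLUS_SAMPLING if high_cost_if_wrong
                else Strategy.AUTO_CHECK)
    return (Strategy.HUMAN_REVIEW if high_cost_if_wrong
            else Strategy.SAMPLE_AND_TOLERATE)


# Hypothetical output types: (machine_checkable, high_cost_if_wrong).
outputs = {
    "code formatting": (True, False),
    "payment logic": (True, True),
    "internal meeting notes": (False, False),
    "security policy": (False, True),
}

for name, (checkable, costly) in outputs.items():
    print(f"{name}: {allocate(checkable, costly).value}")
```

Since the boundary of what is machine-checkable keeps shifting, the inputs to `allocate` are the part to revisit; the mapping itself stays the same.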