PART II / AI-NATIVE 工程AI-NATIVE ENGINEERING

AI Native 工程方法论

AI Native Engineering Methodology

代码是离 agent 最近的面,所以工程是 AI-Native 最先、也最深可施工的地方——但工程从不止于写代码。它是把意图变成可靠系统的整门手艺:架构、接口、验证、安全、可运维、演化。当写代码、写测试、重构都变成随取随用,瓶颈不会消失,而是整体迁移到这门手艺更难的那一半——验证、评审、安全边界与品味。这一部分比组织部分更"可施工":每张图纸都落到原理与可照做的做法。一条纪律贯穿始终——工具只是表层,我们要的是工具底下的原理。

Code is the surface closest to agents, so engineering is where AI-Native is first and most deeply buildable — but engineering was never only writing code. It is the whole craft of turning intent into reliable systems: architecture, interfaces, verification, security, operability, evolution. Once writing code, tests, and refactors becomes something you draw on at will, the bottleneck does not vanish; it moves wholesale to the harder half of that craft — verification, review, trust boundaries, and taste. This part is more "buildable" than the organization part: every sheet lands on a principle and a do-this. One discipline runs throughout — tools are only the surface; what we want is the principle beneath the tool.

本卷内核特化 ① 写码/测试/重构变充裕② 判断沿可验证性梯度分叉(可机检的对错并入自动化,构成性品味下沉给人)→ ③ 代码库成可查询基设④ 人回归专长与品味。读这一卷不必先读组织卷。

Kernel, specialized here ① writing code/tests/refactors turns abundant② judgment forks along the verifiability gradient (machine-checkable correctness joins automation; constitutive taste sinks to people) → ③ the codebase becomes queryable infrastructure④ people return to expertise and taste. You need not read the organization volume first.

AI-ENABLED ENGINEERINGAI-NATIVE ENGINEERING
对象
Object
代码生成更快Faster code generation规格、验证、安全边界一起被设计Specs, verification, and trust boundaries designed together
判断
Judgment
人事后补救Humans patch afterward人守不可逆、构成性与风险边界Humans hold irreversible, constitutive, and risk-boundary calls
沉淀
Residue
一次会话的产物One-session output可复用工程工件与回流上下文Reusable engineering artifacts and context that writes back
拖动滑块,看工程从“助手加速”转为“系统可验证”。进入 SHEET 05 · 运行环
Drag the slider: engineering moves from assistant acceleration to a verifiable system. Enter SHEET 05 · Operating Loop
AI-NATIVE DOCUMENT PACK · PART II

工程文档包:从代码生成到可验证系统

Engineering Pack: from generated code to verifiable systems

这不是工具清单,而是把本卷压缩成一组可带走的工程工件:命题、运行环、边界和第一步。先读这里,再进入图纸细节。

This is not a tool list. It compresses the volume into portable engineering artifacts: thesis, operating loop, boundaries, and first move. Read this first, then enter the sheets.

Thesis

代码变充裕后,工程价值搬到约束、验证、安全与边界。

Once code is abundant, engineering value moves to constraints, verification, security, and boundaries.

AI-Native 工程不是“让 AI 多写代码”,而是把研发系统重画成一个可被规格牵引、被独立验证器制衡、被安全边界收束的循环。

AI-Native engineering is not “make AI write more code”; it redraws development as a loop steered by specs, checked by independent verifiers, and contained by trust boundaries.

ENG
00
CONCEPT · 概念
CONCEPT
定义 · 先划界
Definition

从 AI 辅助工程,到AI-Native 工程

From AI-Assisted Engineering to AI-Native Engineering

工程师用 AI 写得更快,仍然只是 AI 辅助:旧研发流程的吞吐提升了,但验证、边界与判断结构没有改变。AI-Native 工程承认执行已充裕,于是围绕 agent 重画整张研发图。差别不是程度,是种类。

Engineers writing faster with AI is still AI-assisted engineering: the old development process gains throughput, but verification, boundaries, and judgment structure remain unchanged. AI-Native engineering accepts that execution is abundant and redraws the whole development graph around agents. The difference is not degree, but kind.

多年来工程的稀缺资源是工程师带宽——能写代码的人时。瀑布、敏捷,所有流程都为"打字很贵"设计。当 agentic coding 成为默认,写代码、测试、重构不再拖慢团队,但瓶颈没消失,它搬家了:搬到验证、评审、安全。把内核四步填上"代码"的具体内容,就是这一部分的全部命题。

For years engineering's scarce resource was engineering bandwidth — the hours of people who can write code. Waterfall, agile: every process was built for "typing is expensive." Once agentic coding is the default, writing code, tests, and refactors no longer slows the team, but the bottleneck does not vanish — it moves: to verification, review, security. Fill the kernel's four steps with the specifics of code and you get the whole thesis of this part.

把这件事说精确一点,要先分清两种"快"。第一种是打字快——把脑子里已经想清楚的东西敲进编辑器;自动补全、片段、脚手架早就在解决它,IDE 这二十年做的就是这件事。第二种是把没想清楚的东西变成正确系统——这一种从来不是打字慢,是判断慢:要不要这么切模块、这个边界条件算不算 bug、这个性能回退能不能接受。agentic coding 几乎把第一种压到零,却把第二种原样留下,甚至放大——因为生成越快,等着被判断的候选就越多。所以"工程师用 AI 写得更快"是个量级错误的描述:真正发生的不是同一份工作做得更快,而是工作的构成变了——打字那一份蒸发,判断那一份变成几乎全部。

To put it precisely, first separate two kinds of "fast." The first is typing fast — keying in what you have already worked out in your head; autocomplete, snippets, and scaffolds have long addressed it, and that is what IDEs spent two decades on. The second is turning the not-yet-thought-through into a correct system — and that was never slow typing but slow judgment: should the module be cut this way, does this edge case count as a bug, is this performance regression acceptable. Agentic coding crushes the first to near-zero while leaving the second untouched, even amplifying it — because the faster generation runs, the more candidates queue for judgment. So "engineers writing faster with AI" is an order-of-magnitude misdescription: what actually happens is not the same work done faster but a change in the composition of the work — the typing share evaporates and the judgment share becomes nearly all of it.

反面最能说清边界:vibe-coding 陷阱。把"执行充裕"误读成"可以放任",就掉进它——让 agent 凭感觉一路生成、不写规格、不设验证,建在猜测上的系统会以诡异的方式崩。它崩不是因为模型笨,而是因为缺了承重结构:没有 spec 当目标函数,生成无处收敛;没有独立 checker,自信而错无人拦;没有边界,一次越权就是全盘。所以"执行充裕 ≠ 放任"是这一卷反复要守的纪律——充裕解放的是打字,不是判断;省下来的判断带宽,要重新投到验证、规格与边界这三处新瓶颈上,而不是省掉。〔源 Graziano《AI-Native Engineering》Day 1 "vibe-coding trap",证据级 Ⅳ 一手从业者[R1]

The mirror image makes the boundary clearest: the vibe-coding trap. Misread "execution is abundant" as "you may now be lax" and you fall into it — letting the agent generate on vibes with no spec and no verification, and a system built on guesses collapses in strange ways. It collapses not because the model is dumb but because the load-bearing structure is missing: with no spec as objective function, generation has nowhere to converge; with no independent checker, confident wrongness goes uncaught; with no boundary, one over-privileged call is the whole thing. So "abundant execution is not licence" is the discipline this volume keeps returning to — abundance frees typing, not judgment; the judgment bandwidth it frees must be reinvested into the three new bottlenecks of verification, specs, and boundaries, not pocketed. [Source: Graziano, AI-Native Engineering Day 1, "vibe-coding trap," grade Ⅳ practitioner. [R1]]

这也解释了为什么本卷比组织卷更"可施工":代码是离 agent 最近的面——它天生可读、可 diff、可执行、可测,反馈回路以秒计而非以季度计。组织里一个判断错置可能几个月才暴露,工程里一次 CI 红灯当场就告诉你哪条护栏漏了。这条短反馈回路是后面所有图纸能"落到可照做"的物理前提:在代码上,命题不只是被论证,还能被运行、被证伪。

This also explains why this volume is more "buildable" than the organization one: code is the surface closest to agents — natively legible, diffable, executable, testable, with a feedback loop measured in seconds rather than quarters. In an organization a misplaced judgment may take months to surface; in engineering a single red CI run tells you on the spot which guardrail leaked. That short feedback loop is the physical precondition for every sheet below landing on a do-this: on code, a thesis is not only argued but run and falsified.

充裕ABUNDANCE
写码 / 测试 / 重构
Code / tests / refactors
agentic coding 成默认,"打字"不再稀缺。
Agentic coding is the default; typing is no longer scarce.
判断JUDGMENT
验证 · 评审 · 安全 · 品味
Verify · review · security · taste
新瓶颈即新判断节点。
The new bottleneck is the new judgment node.
上下文CONTEXT
代码库可查询 + 能否自动化
Queryable codebase + automate?
问 Claude 不问作者;再追问能否自动化。
Ask Claude, not the author; then ask if it can be automated.
MEANING
专长 · 品味 · 建造
Expertise · taste · building
人做系统专长与产品判断,不做吞吐。
People do deep expertise and product judgment, not throughput.

第②步在工程这张面上不是单一台阶,它就着代码自身的可验证性谱系分叉,而这条谱系在工程里是物理的、能逐段指出来的:最左端是编译器可判的(类型、借用检查——一个确定性程序当场给出对错),中段是测试可判的(行为对照固定断言,机器重复执行),再往右是评测可判的(语义评审,得另造一份独立判据),最右端是只有人能判的(架构取舍、命名、"这算不算对"的边界——没有外部判据,判据本身就是判断)。分叉规则由此直接读出:谱系上凡是判据能外置成一个不经生成模型"觉得"的确定检查的,就并入①一起自动化(这正是 SHEET 11 说的独立验证器);凡是判据无法外置、必须由人现场给出的,就下沉④留给人。所以"判断退守"在工程里不是一句口号,是沿这条可机检性谱系从左向右、把能机检的一段段切给机器、直到剩下那段必须人来定判据的残余——那段残余就是工程师 2030 年的岗位。〔源 本卷 SHEET 11 独立验证器 + 可验证性梯度的工程化推导,证据级 Ⅴ 推论;横向对照见体系总图

On engineering's own face, step ② is not a single stair; it forks along code's own spectrum of verifiability, and in engineering that spectrum is physical, pointable segment by segment: the far left is compiler-decidable (types, borrow-checks — a deterministic program returns correct/incorrect on the spot), the middle is test-decidable (behavior checked against fixed assertions, re-run by machine), further right is eval-decidable (semantic review, requiring a separately-manufactured independent criterion), and the far right is human-only-decidable (architectural trade-offs, naming, the boundary of "does this count as correct" — no external criterion exists; the criterion itself is the judgment). The fork rule reads straight off this: wherever on the spectrum the criterion can be externalized into a definite check that does not pass through the generating model's "feeling," it joins ① and gets automated (this is exactly SHEET 11's independent verifier); wherever the criterion cannot be externalized and must be supplied by a human on the spot, it sinks to ④ and stays with people. So "judgment retreats" is not a slogan in engineering but a concrete operation: move left-to-right along this machine-checkability spectrum, hand each machine-checkable segment to the machine, until what remains is the residue where a human must still set the criterion — and that residue is the engineer's job in 2030. [Source: this volume's SHEET 11 independent verifier plus an engineering-specific derivation of the verifiability gradient, grade Ⅴ inference; for the cross-volume comparison see the system map.]

从实现者到编排者:人持有的三件没变

From implementer to orchestrator: the three the human keeps

把"工作构成变了"落到一个可照做的角色描述上:工程师从实现者(implementer)变成编排者(orchestrator)——不再逐行实现,而是持有三件 agent 不能替你持有的东西。第一是意图(intent):要什么、为什么、不要什么;这是生成的方向,写错了 agent 会高效地造错东西。第二是约束(constraints):架构、标准、非目标;这是生成的边界,缺了它 agent 会在每个岔路口自己猜,而猜出来的局部最优往往全局是债。第三是验证(verification):测试、评审、质量门;这是生成的判据,没有它"对不对"就回到那个跟不上生成速度的人脑里。这三件恰好对应内核的②判断与④人:编排者不是"管 agent 的经理",而是把判断从实现细节抬到意图、约束、验证这三处构成性节点的人。〔源 Graziano《AI-Native Engineering》Day 1 implementer→orchestrator、intent/constraints/verification 三件由人持有,证据级 Ⅳ 一手从业者[R1]

Land "the composition of work changed" on a copyable role description: the engineer shifts from implementer to orchestrator — no longer implementing line by line but holding three things an agent cannot hold for you. First, intent: what, why, and what not; this is the direction of generation, and get it wrong and the agent efficiently builds the wrong thing. Second, constraints: architecture, standards, non-goals; this is the boundary of generation, and without it the agent guesses at every fork, and a local optimum it guesses is often a global debt. Third, verification: tests, review, quality gates; this is the criterion of generation, and without it "is it correct" returns to the human head that cannot keep pace with generation. These three map precisely onto the kernel's ② judgment and ④ people: an orchestrator is not "a manager of agents" but the person who lifts judgment from implementation detail to the three constitutive nodes of intent, constraints, and verification. [Source: Graziano, AI-Native Engineering Day 1 — implementer→orchestrator, with intent/constraints/verification held by the human, grade Ⅳ practitioner. [R1]]

这个迁移也重排了一名工程师的技能栈。过去稀缺的是"把想清楚的东西快速正确地敲出来"的实现技能;当那一份被 agent 吸收,稀缺的变成四样上游能力:规格素养(把模糊意图写成无歧义、可机检的规格)、上下文工程(决定此刻窗口里该有什么、不该有什么)、编排(把大任务切成 agent 能可靠完成的小步、并设好检查点)、验证(设计能自动判对错的检查,而非事后逐行人审)。注意这四样没有一样是"提示词技巧"——它们都是判断密集、且离"何为对"很近的工程能力。这给个人成长一条可证伪的方向:若一个工程师"用了 AI"两年,时间却仍主要花在敲实现、而非这四样上游能力上,那他多半还停在"用 AI 写得更快"的嫁接阶段,没有真正迁移到编排者。

This migration also reorders an engineer's skill stack. What used to be scarce was the implementation skill of "keying out, fast and correctly, what you had already thought through"; once that is absorbed by the agent, scarcity moves to four upstream capabilities: spec literacy (writing fuzzy intent into an unambiguous, machine-checkable spec), context engineering (deciding what should and should not be in the window right now), orchestration (cutting a large task into small steps an agent can reliably complete, with checkpoints set), and verification (designing checks that auto-decide correctness rather than reviewing line by line afterward). Note none of these is a "prompting trick" — all are judgment-dense engineering capabilities close to "what counts as correct." This gives personal growth a falsifiable direction: if an engineer has "used AI" for two years yet still spends time mainly on keying out implementation rather than these four upstream capabilities, they are most likely still at the graft stage of "writing faster with AI," not truly migrated to orchestrator.

嫁接与重画的分界线:一个可当场做的判别

The line between grafting and redrawing: a test you can run on the spot

"把 AI 嫁接到旧流程上"和"围绕 agent 重画整张图"听起来像态度差别,其实有一个可当场判别的结构标准:看你的流程在哪里假设了"打字很贵",又在哪里已经按"验证很贵"重排。嫁接的典型样子是——流程的骨架没变(还是同样的需求评审、同样的排期、同样的人审节奏),只是在"写代码"这一格里换上了 AI,于是产出快了,但下游的验证、评审、集成全都按老速度运行,结果产出在验证那里堵成一座山。重画的样子则相反——既然写代码不再是瓶颈,整张图就该把重心从"怎么更快地写"挪到"怎么更快地判对错":验证前移、自动化、和生成解耦;人审从逐行改成异步分诊;规划视野缩短贴着信号。一个干净的判别问题是:"如果明天写代码的速度再快十倍,你的流程是会更顺,还是会在某处堵得更死?"嫁接的流程会堵得更死(因为瓶颈没动,只是被喂得更猛),重画的流程会更顺(因为它的承重结构本就建在验证那一侧)。这也呼应组织卷反复说的"瓶颈不会消失只会搬家":嫁接的失败,本质是没有承认瓶颈已经搬到了验证,还在原来那个不再是瓶颈的地方使劲。可证伪信号:若你的团队"用了 AI"之后,产出明显变快、但交付质量或周期没有同步改善、甚至更糟,那几乎一定是嫁接——快出来的产出全都堵在那个没有被重画的下游瓶颈上。

"Grafting AI onto the old process" versus "redrawing the whole graph around agents" sounds like a difference of attitude, but it has a structural test you can run on the spot: look at where your process assumes "typing is expensive" and where it has already re-arranged around "verification is expensive." The typical grafted shape is — the process skeleton is unchanged (same requirements review, same scheduling, same human-review cadence), only the "write code" cell has AI swapped in, so output speeds up, but downstream verification, review, and integration all run at the old speed, and output piles into a mountain at verification. The redrawn shape is the opposite — since writing code is no longer the bottleneck, the whole graph should move its center of gravity from "how to write faster" to "how to judge correctness faster": verification moved earlier, automated, decoupled from generation; human review shifted from line-by-line to asynchronous triage; planning horizon shortened against signal. A clean test question is: "if writing code got ten times faster tomorrow, would your process flow more smoothly or jam harder somewhere?" A grafted process jams harder (the bottleneck did not move, it is just fed more aggressively); a redrawn process flows more smoothly (its load-bearing structure was built on the verification side to begin with). This echoes the organization volume's refrain that "the bottleneck does not vanish, it moves": the failure of grafting is essentially not admitting the bottleneck moved to verification, and still pushing on the place that is no longer the bottleneck. Falsifiable signal: if after your team "uses AI" output is clearly faster but delivery quality or cycle time did not improve in step, or got worse, it is almost certainly a graft — the faster output is all jamming at the downstream bottleneck that was never redrawn.

ENG
01
LEVERAGE · 杠杆点上移
LEVERAGE
机理 · 谱系
Mechanism · Genealogy

杠杆点在一栋楼里逐层上移

The leverage point climbs the building

prompt → context → spec → harness → loop → fleet 是一栋楼,电梯只往上开。模型每强一档,人的着力点就上移一层——这正是内核第②步"判断退守"的时间轴。底层是控制论的复活。

prompt → context → spec → harness → loop → fleet is one building; the elevator only goes up. Each model generation lifts the human's leverage point a floor — this is the timeline of the kernel's step ② "judgment retreats." Underneath, cybernetics reborn.

这些不是相互竞争的"工程学",是同一栋楼的楼层:早期你在底层逐字写 prompt;模型变强,你上移到喂 context、写 spec、搭 harness、最后只设计 loop 与调度 fleet。你这季度的着力点在哪一层,就是组织的判断瓶颈所在——低于它的,交给 agent 或产品化;高于它的,是你下一步该去设计的地方。

These are not competing "engineerings" but floors of one building: early on you write prompts word by word at the bottom; as models strengthen you move up to feeding context, writing specs, building the harness, and finally only designing the loop and scheduling the fleet. Whichever floor your leverage sits on this quarter is where the judgment bottleneck is — below it, hand off to agents or productize; above it is where you design next.

核心图KEY FIGFIG. E1.0 / THE BUILDING · 杠杆的楼层 看懂:你这季度的着力点在哪一层 = 瓶颈在哪一层 Read: which floor your leverage sits on = where the bottleneck is
电梯只往上开 · 模型每强一档,着力点上移一层 The elevator only goes up · each model generation lifts the leverage a floor F1 prompt · 逐字指令 prompt · word-by-word 已沉为基设:自动补全早把它产品化 sunk to infra: autocomplete productized it already F2 context · 喂可查询的上下文 context · feed queryable context 下沉中:RAG / 上下文库逐步自动化(ENG·02) sinking: RAG / context stores automate it (ENG·02) F3 spec · 写清"什么算对" spec · state "what counts as correct" ← 多数团队此刻的着力层(ENG·03 / 07) ← most teams' leverage floor this quarter (ENG·03 / 07) F4 harness · 搭承载循环的脚手架 harness · scaffold that carries the loop 你下一步该去设计的地方(ENG·04) where you design next (ENG·04) F5 loop · 设计自我改进的循环 loop · design the self-improving cycle 变异 / 选择 / 保留——进化的最小机制 variation / selection / retention — evolution's minimum F6 fleet · 调度并行的 agent 队伍 fleet · schedule a parallel agent fleet 楼顶:判断从单环抬到一支队伍的尺度 top floor: judgment lifts from one loop to a fleet 杠杆点上移 = 内核②判断退守的时间轴 leverage climbs = timeline of kernel ② judgment retreating ↓ 低于你的层 ↓ below your floor 交给 agent / 产品化 hand off / productize ↑ 高于你的层 ↑ above your floor 你下一步去设计 design next
这不是六门相互竞争的"工程学",是同一栋楼的六层楼。判断瓶颈永远停在你当前着力的那一层:低于它的已经能交给 agent 或被产品化(prompt 早被自动补全吃掉),高于它的是你下一步要去设计的地方。底层是控制论的复活——agent = 推理客户端 + 工具 + while 循环,最小内核只有 50 行(HuggingFace Tiny Agents)。〔源 本系列谱系篇综合 + HuggingFace,证据级 Ⅳ[R2]
These are not six competing "engineerings" but six floors of one building. The judgment bottleneck always rests on the floor where your leverage currently sits: below it is already delegable or productized (prompt was eaten by autocomplete long ago), above it is where you design next. Underneath is cybernetics reborn — an agent is an inference client + tools + a while loop, a 50-line minimal kernel (HuggingFace Tiny Agents). [Source: this series' Genealogy synthesis + HuggingFace, grade Ⅳ. [R2]]

底层是控制论的复活。把那栋楼的每一层抽象掉,剩下的骨架古老得令人意外:一个 agent 就是推理客户端 + 一组工具 + 一个 while 循环——读状态、决定下一步、调工具改变状态、再读。这正是 Wiener 1948 年讲的反馈控制:感知、比较、作动、再感知。HuggingFace 的 Tiny Agents 把这个最小内核压进 50 行代码,说明楼层不是技术堆叠的产物,而是同一个控制回路在不同抽象层的展开。理解这一点有实际用处:当你不知道某个新工具该放哪一层,问它在这个回路里扮演感知、比较还是作动,就能定位——工具会换,回路不会。

Underneath is cybernetics reborn. Abstract away every floor of the building and the skeleton left is startlingly old: an agent is an inference client + a set of tools + a while loop — read state, decide the next step, call a tool to change state, read again. This is exactly Wiener's 1948 feedback control: sense, compare, actuate, sense again. HuggingFace's Tiny Agents compresses this minimal kernel into 50 lines, showing the floors are not a product of stacking technology but the same control loop unfolded at different levels of abstraction. This has practical use: when you do not know which floor a new tool belongs on, ask whether it plays sense, compare, or actuate in that loop, and you can place it — the tools change, the loop does not.

为什么电梯只往上、从不往下。因为每一层一旦"够好且可机检",它就会被产品化、被下一代模型吸收,于是人在那层的边际价值趋零——你被推着上楼,不是因为想,而是因为站在原地的回报在塌。prompt 工程曾是 2023 年的热门技能,今天基本被自动补全和系统提示吃掉;context 工程正在被 RAG 与上下文库逐步自动化。这给出一个可证伪的预测:今天显得高级的 spec / harness 工作,也会沿同一条路下沉——若三年后"写 spec"仍是稀缺技能而非基础设施,这个楼层模型就被削弱了。反过来,它也划出人的持久着力点:楼顶那几层(loop 设计、fleet 调度、以及决定"该不该造"的判断)下沉得最慢,因为它们恰好是最不可机检的构成性判断。

Why the elevator only goes up, never down. Because once a floor is "good enough and machine-checkable," it gets productized and absorbed by the next model generation, so a human's marginal value on that floor trends to zero — you are pushed upstairs not because you want to but because the payoff for standing still is collapsing. Prompt engineering was the hot skill of 2023 and is today largely eaten by autocomplete and system prompts; context engineering is being automated by RAG and context stores. This yields a falsifiable prediction: today's seemingly advanced spec / harness work will sink along the same path — if "writing a spec" is still a scarce skill rather than infrastructure three years out, this floor model is weakened. Conversely it marks the human's durable leverage: the top floors (loop design, fleet scheduling, and the judgment of "whether to build at all") sink slowest, precisely because they are the least machine-checkable, most constitutive judgments.

深潜Deep dive

楼层模型与 Loop 解剖的完整推导,见The full derivation of the floor model and Loop anatomy is in 谱系篇 ↗the Genealogy chapter ↗

怎么判断你这季度站在哪一层

How to tell which floor you are on this quarter

楼层模型若只是个好比喻,价值有限;它真正可操作的地方,是给"瓶颈在哪"一个能当场自测的判据。问三个问题就能定位你当前的着力层。其一,你最近一次为 agent 做的最费神的事是什么?如果是逐字斟酌怎么措辞一个请求,你还在 prompt 层;如果是组织"它需要知道的一切",你在 context 层;如果是写清"什么算对",你在 spec 层;如果是搭"每次运行都自动验证并纠偏"的脚手架,你在 harness 层。其二,低于这一层的事,是不是已经基本不用你操心了?——若你还在频繁手写 prompt,说明 context 层其实没立起来,你只是误以为自己在上面。其三,高于这一层的事,是不是开始让你觉得"该有个系统来管了"?这个隐隐的不适感,正是电梯要往上开一层的信号。

If the floor model is only a nice metaphor its value is limited; where it becomes operational is in giving "where is the bottleneck" a criterion you can self-test on the spot. Three questions locate your current leverage floor. First, what was the most effortful thing you last did for an agent? If it was agonizing word by word over how to phrase a request, you are on the prompt floor; if it was organizing "everything it needs to know," you are on context; if it was stating "what counts as correct," you are on spec; if it was scaffolding "verify and self-correct automatically on every run," you are on harness. Second, are the things below this floor basically off your plate now? — if you are still hand-writing prompts frequently, the context floor is not actually built and you only think you are above it. Third, are the things above this floor starting to make you feel "there should be a system for this"? That faint discomfort is precisely the signal that the elevator is due to climb one floor.

这个自测有一个反直觉但重要的推论:不同团队、不同人,此刻站在不同楼层是正常的,而且不该强行拉齐。一个还在手写大量 prompt 的团队,其当务之急是先把 context 层立起来(让上下文成为可查询的基础设施),而不是越级去搭 fleet 调度——越级搭出来的高层脚手架,会因为底层不稳而频繁坍塌。这和组织卷"瓶颈不会消失只会搬家"是同一条纪律的两面:你不能跳过当前瓶颈去优化下一个,因为下一个瓶颈还没成为瓶颈。可证伪信号:若一个团队投了大量精力搭多 agent 编排(高楼层),实际产出却仍卡在"每次都要人重新解释一遍上下文"(低楼层没立稳),那就是越级的证据——该退回去先把底层那一层做扎实。

This self-test has a counter-intuitive but important corollary: different teams and different people standing on different floors right now is normal, and should not be forcibly leveled. A team still hand-writing many prompts should first stand up its context floor (make context queryable infrastructure), not skip levels to build fleet scheduling — high scaffolding built across skipped levels collapses often because the lower floor is unstable. This is two sides of the same discipline as the organization volume's "the bottleneck does not vanish, it moves": you cannot skip the current bottleneck to optimize the next, because the next is not yet a bottleneck. Falsifiable signal: if a team invests heavily in multi-agent orchestration (a high floor) while real output is still stuck on "having to re-explain context every time" (a low floor not stood up), that is evidence of level-skipping — retreat and make the lower floor solid first.

为什么没有"回到底层"这回事

Why there is no "going back to the lower floors"

楼层模型有一个对个人和团队都很实在的推论:一旦某一层被产品化、被下一代模型吸收,人就回不去那一层了——不是不能,是回去没有回报。这解释了为什么"我先把提示词技巧练扎实再说"在今天是一个亏本的投资方向:prompt 这一层正在被自动补全、系统提示、和模型自身的指令遵循能力快速吸收,你在它上面磨出来的精细技巧,会随着下一代模型变得更会"猜你想要什么"而贬值。同样的命运在等着 context 层(RAG 和上下文库在自动化它),并且按楼层模型的预测,迟早会轮到今天显得高级的 spec 与 harness。这给个人成长一个不那么舒服但很清醒的指南:不要在一个正在下沉的楼层上深耕,要往电梯的上方走。判断一个技能值不值得深投,问它在三年的时间尺度上是会变成稀缺判断、还是会变成基础设施——前者值得投,后者迟早被产品化。这也和内核第④步对上:人最终的持久着力点,是那些下沉得最慢、最不可机检的构成性判断(决定该造什么、何为对、接缝放哪),因为恰恰是"不可机检"这个性质,让它们抗拒被自动化吸收。可证伪信号:若你发现自己引以为傲的核心技能,每出一代新模型就明显贬值一截,那它大概率是个正在下沉的楼层——该把深耕的重心往上挪一层了。

The floor model has a corollary very real for both individuals and teams: once a floor is productized and absorbed by the next model generation, a human cannot go back to it — not cannot in principle, but going back has no payoff. This explains why "let me first get my prompting tricks solid" is a loss-making investment direction today: the prompt floor is being rapidly absorbed by autocomplete, system prompts, and the model's own instruction-following, and the fine tricks you hone on it depreciate as the next model gets better at "guessing what you want." The same fate awaits the context floor (RAG and context stores are automating it), and by the floor model's prediction, sooner or later the spec and harness work that looks advanced today. This gives personal growth a less comfortable but clear-eyed guide: do not dig deep on a sinking floor; move up the elevator. To judge whether a skill is worth deep investment, ask whether on a three-year horizon it becomes scarce judgment or becomes infrastructure — the former is worth investing in, the latter gets productized eventually. This matches the kernel's step ④: the human's ultimately durable leverage is the slowest-sinking, least machine-checkable constitutive judgments (deciding what to build, what is correct, where the seams go), because it is precisely the "not machine-checkable" property that resists being absorbed by automation. Falsifiable signal: if you find the core skill you take pride in visibly depreciating a notch with each new model generation, it is most likely a sinking floor — time to move the center of your deep practice up one floor.

ENG
02
CONTEXT · 上下文即基设
CONTEXT AS INFRA
重画 · 原理
Redraw · Principle

上下文即基础设施——以及为什么纯文本赢

Context as infrastructure — and why plain text wins

旧办法是"找写代码的人去问";新办法是先问 Claude,再追问"这能不能自动化"。但更深的问题是:为什么 Markdown、本地知识图谱(Obsidian)、纯文本在 AI 优先下价值被放大?

The old way was "find the person who wrote it and ask"; the new way is ask Claude first, then ask "can this be automated." But the deeper question: why do Markdown, local knowledge graphs (Obsidian), and plain text get amplified under AI-first?

不是工具本身赢,是它们恰好满足了四条对 agent 友好的底层属性。凡满足这四条的产物都被放大,凡是不可读的二进制 / 私有格式都被边缘化:

It is not the tools that win, but that they happen to satisfy four agent-friendly properties. Anything that meets these four gets amplified; opaque binary / proprietary formats get marginalized:

Before
上下文住在人脑、聊天记录、私有格式里——只能靠"找作者"流动。
Context lives in heads, chat logs, proprietary formats — it flows only by "finding the author."
新 · 原理After · principle
上下文是纯文本基础设施:可读 / 可 diff / 可查 / 人机同源——agent 与人同饮一口井。
Context is plain-text infrastructure: legible / diffable / queryable / same-source — agents and people drink from one well.

边界 · 放大不等于"越多越好"。这四条属性让文本被放大,但有一条硬约束:有效上下文窗口远小于标称值——往窗口里塞太多,反而降准、加成本、稀释注意力(Anthropic《Effective Context Engineering for AI Agents》)。所以可读 / 可 diff / 可查的真正价值,不是"把全部塞进去",而是让"检索出正确的那一小撮子集"成为可能。legible / queryable 是必要、不充分——另一半是会策展。文本可以无限堆在硬盘上,但喂进窗口的每一段都要被挑选,否则放大会反噬成噪声。〔源 Anthropic 工程实践,证据级 Ⅳ 一手从业者;经 Graziano《AI-Native Engineering》转引。[R3][R1]

Boundary · amplification is not "more is better." These four properties amplify text, but one hard constraint holds: the effective context window is far smaller than the nominal one — overstuffing it lowers accuracy, raises cost, and dilutes attention (Anthropic, Effective Context Engineering for AI Agents). So the real value of legible / diffable / queryable is not "put everything in" but making it possible to retrieve the right small subset. Legible / queryable is necessary, not sufficient — the other half is curation. Text can pile up without limit on disk, but every piece fed into the window must be chosen, or amplification backfires into noise. [Source: Anthropic engineering practice, grade Ⅳ practitioner; via Graziano's AI-Native Engineering. [R3][R1]]

FIG. E2.0 / THE CONTEXT BUDGET · 有效窗口曲线 看懂:为什么"塞更多"会反噬成噪声 Read: why "stuff more in" backfires into noise
little 塞满标称窗口 nominal window full 窗口里上下文的量 → amount of context in the window → 有效准确率 ↑ effective accuracy ↑ 太少 = 瞎 too little = blind 峰值 = 对的那一小撮子集 peak = the right small subset 有效窗口 effective 标称窗口 ≫ 有效窗口 nominal ≫ effective 太多 = 分心 too much = distraction 降准 · 加成本 · 稀释注意力 lower accuracy · more cost · diluted attention 可读 / 可查 是必要、不充分——另一半是会策展 legible / queryable is necessary, not sufficient — the other half is curation
纯文本的四条属性(可读 / 可 diff / 可查 / 同源)让上下文可被放大,但放大不等于"越多越好"。准确率不是单调上升:到峰值后,继续往窗口里堆反而降准、加成本、稀释注意力。所以这四条的真正价值是让"检索出对的那一小撮子集"成为可能——而不是把全部塞进去。这条曲线把 ENG·02 与 ENG·04 缝在一起:上下文工程问"此刻窗口里该有什么",harness 工程问"什么系统在每次运行都生产并校验那份上下文"。〔源 Anthropic《Effective Context Engineering for AI Agents》,证据级 Ⅳ;经 Graziano 转引[R3][R1]
Plain text's four properties (legible / diffable / queryable / same-source) let context be amplified, but amplification is not "more is better." Accuracy is not monotonic: past the peak, piling more into the window lowers accuracy, raises cost, and dilutes attention. So the real value of those four is making it possible to retrieve the right small subset, not to put everything in. This curve stitches ENG·02 to ENG·04: context engineering asks "what should be in the window now," harness engineering asks "what system produces and checks that context on every run." [Source: Anthropic, Effective Context Engineering for AI Agents, grade Ⅳ; via Graziano. [R3][R1]]

从 prompt 工程到上下文工程

From prompt engineering to context engineering

这条四属性原理换个角度,就是一次术语的迁移:变量从"单条 prompt"移到了"推理时喂给模型的整个状态"。prompt 工程关心怎么把一句话措辞好,上下文工程关心一个更大的问题——此刻这次推理,窗口里应该装哪些东西、以什么顺序、占多少预算。Rules(恒定约束)、Skills(可调用能力)、Commands(封装好的动作)、Custom agents(带专属上下文的子 agent),都是给这个"整个状态"分层供料的方式。它们不是四个新玩具,是同一个问题的四个抽屉:把"模型需要知道的"按"多久变一次"分层存放——恒定的进 Rules,按需的进 Skills,一次性的进 prompt。

Seen from another angle, this four-property principle is a shift of terms: the variable moves from "a single prompt" to "the entire state fed to the model at inference." Prompt engineering cares how to word one sentence well; context engineering cares about a larger question — for this inference, right now, what should be in the window, in what order, at what budget. Rules (constant constraints), Skills (callable capabilities), Commands (packaged actions), and Custom agents (sub-agents with their own context) are all ways of feeding that "entire state" in layers. They are not four new toys but four drawers of one problem: store "what the model needs to know" by "how often it changes" — constants in Rules, on-demand in Skills, one-offs in the prompt.

一手实践锚 · 用文件作持久真源,而非靠对话历史。context rot(上下文腐化,见 ENG·09)的根因,是把"真相"留在了会随长会话稀释、覆盖、自相矛盾的对话历史里。解法是把决策、规格、任务沉淀成文件——SPEC / PLAN / TASKS 三件套就是这个解法的具体形态:它们是磁盘上的持久状态,每次推理从这里重新装配窗口,而不是指望模型"记得"几千 token 之前说过什么。这恰好把"人机同源"从一句原则变成一个可照做的纪律:人改的是文件,agent 读的也是文件,没有第二份漂移的真相。这一条自然把 ENG·02 引向 ENG·03——既然真相住在文件里,下一个问题就是:这些文件能不能被机器检验?〔源 Graziano《AI-Native Engineering》Day 2 / Day 4,证据级 Ⅳ;转引 Anthropic 上下文工程[R1][R3]

First-hand practice anchor · use files as the persistent source, not conversation history. The root of context rot (see ENG·09) is leaving "the truth" in a conversation history that dilutes, overwrites, and self-contradicts across a long session. The fix is to settle decisions, specs, and tasks into files — the SPEC / PLAN / TASKS trio is the concrete shape of that fix: persistent state on disk, from which each inference re-assembles the window, rather than hoping the model "remembers" what was said thousands of tokens ago. This turns "same source" from a principle into a copyable discipline: humans edit files, agents read files, and there is no second drifting truth. It naturally leads ENG·02 into ENG·03 — since the truth lives in files, the next question is: can these files be machine-checked? [Source: Graziano, AI-Native Engineering Day 2 / Day 4, grade Ⅳ; via Anthropic context engineering. [R1][R3]]

检验信号Test signal

上手爬坡时间下降——上下文是基础设施而非口口相传时,新人第一周就能交付真实代码。Onboarding ramp time drops — when context is infrastructure rather than word of mouth, newcomers ship real code in week one.

有效上下文窗口:为什么"喂得越多"反而越差

The effective context window: why "feed it more" makes it worse

"上下文即基础设施"容易被误读成"把一切都塞进窗口",那是把这一条引向反面。真实约束是:有效上下文窗口远小于标称窗口。一个模型标称能吃二十万 token,不等于这二十万 token 都被等权重地用上——注意力会随上下文变长而稀释,窗口里塞得越多,早期真正关键的约束越容易被淹没。所以上下文工程有两个对称的失败方向:喂太少,agent 缺关键事实只能瞎猜(幻觉的温床);喂太多,关键信号被噪音稀释,agent 分心、降准、还更贵。这条"少即是多"是 Anthropic《Effective Context Engineering for AI Agents》的核心论点,也是对"上下文越多越好"这个朴素直觉的直接修正。它让 ENG·02 从"为何文本/MD 被放大"的正面收益,补上了"上限与取舍"这一面——而正是这一面让这条原理更难被证伪、也更可操作。〔源 Anthropic《Effective Context Engineering for AI Agents》(经 Graziano Day 4 转引),证据级 Ⅳ(一手厂商工程文章)[R3][R1]

"Context as infrastructure" is easily misread as "cram everything into the window," which turns this principle into its opposite. The real constraint is: the effective context window is far smaller than the nominal one. A model nominally eating 200K tokens does not mean all 200K are used at equal weight — attention dilutes as context grows, and the more crammed into the window, the more the genuinely critical early constraints drown. So context engineering has two symmetric failure directions: too little, and the agent lacks key facts and can only guess (a breeding ground for hallucination); too much, and the key signal is diluted by noise, the agent distracted, less accurate, and more expensive. This "less is more" is the core argument of Anthropic's Effective Context Engineering for AI Agents, and a direct correction to the naive intuition that "more context is better." It rounds out ENG·02 from "why text / MD get amplified" (a positive payoff) with the side of "ceilings and trade-offs" — and it is precisely this side that makes the principle harder to falsify and more operational. [Source: Anthropic, Effective Context Engineering for AI Agents (via Graziano Day 4), grade Ⅳ (first-hand vendor engineering article). [R3][R1]]

这条上限直接解释了为什么"用文件作持久真源"优于"靠对话历史"。把 SPEC / PLAN / TASKS 写成版本化的文件,等于把"它需要知道的一切"放在窗口之外的可查询基础设施里,按需检索进窗口,而不是让它在一条越来越长、越来越被稀释的对话历史里反复重读。文件不会随对话变长而腐烂,可以被 diff、被审、被多个 agent 共享;对话历史则恰好相反——它越长越腐烂,且只活在这一个会话里。这也是 ENG·02 通向 ENG·03 的天然过渡:当你认真地"装配"上下文,你迟早会把那份装配写成一份规格,而规格正是下一张图纸的主题。可证伪信号:若你的 agent 在长会话后期开始"忘记"早先说过的关键约束、或重复犯同一个早已纠正过的错,那不是模型变笨,是有效窗口被对话历史塞满了——该把那些约束移出对话、写进文件。

This ceiling directly explains why "files as persistent source" beats "relying on conversation history." Writing SPEC / PLAN / TASKS as versioned files puts "everything it needs to know" in queryable infrastructure outside the window, retrieved on demand, rather than making it re-read repeatedly inside an ever-longer, ever-more-diluted conversation history. A file does not rot as the conversation lengthens; it can be diffed, reviewed, shared across agents. Conversation history is exactly the opposite — the longer it gets the more it rots, and it lives only in this one session. This is also the natural transition from ENG·02 to ENG·03: when you assemble context seriously, you will eventually write that assembly down as a spec, and the spec is the subject of the next sheet. Falsifiable signal: if your agent starts "forgetting" key constraints stated earlier, or repeats an error long since corrected, late in a long session, that is not the model getting dumber but the effective window crammed full of conversation history — move those constraints out of the conversation and into files.

为什么纯文本与 Markdown 在 AI 优先下被放大

Why plain text and Markdown are amplified in an AI-first world

"上下文即基础设施"还有一个具体的形态偏好需要讲清机制,而不是当口号:在 AI 优先的工作流里,纯文本、Markdown、以及结构化的纯文本知识(如知识图谱)的价值被系统性放大,而专有的、二进制的、只能用特定软件打开的格式被边缘化。原因不是文本"更朴素更好"这种审美,而是三条可机检的性质叠加。其一,人机同源:纯文本是人能读、agent 也能直接读的同一份东西,不需要一层导出/解析的损耗,于是人和 agent 真正"喝同一口井",不会出现"人看的文档"和"机器看的数据"两份各自漂移的真源。其二,可 diff:文本的改动是逐行可见、可评审、可回退的,于是 agent 的每一次修改都能被当作一个可审的提交,而不是一次不可追溯的覆盖——这把 ENG·02 直接接到了"diffable"那条贯穿原理上。其三,可查询、可组合:文本能被检索、被切块、被按需装配进窗口,恰好服务于上面那条"有效窗口有限、要主动装配"的纪律。把这三条合起来,"用文本作真源"就不是怀旧,而是对"agent 要能读、要能 diff、要能按需查"这套硬约束的最优响应。可证伪信号:若你团队的关键知识仍主要锁在只有特定软件能打开、agent 读不进来的专有格式里,那么无论你怎么强调"上下文重要",你的 agent 实际能用上的上下文都是残缺的——它被你的格式选择挡在了门外。

"Context as infrastructure" also has a concrete form preference whose mechanism, not slogan, must be stated: in an AI-first workflow the value of plain text, Markdown, and structured plain-text knowledge (such as knowledge graphs) is systematically amplified, while proprietary, binary formats openable only by specific software are marginalized. The reason is not an aesthetic that text is "plainer and better" but three machine-checkable properties stacking. First, one shared source: plain text is the same thing a human reads and an agent reads directly, with no export/parse loss in between, so humans and agents truly "drink from one well" and there is no drift between "the doc humans read" and "the data machines read" as two separate sources of truth. Second, diffable: text changes are line-by-line visible, reviewable, revertible, so each agent edit can be treated as a reviewable commit rather than an untraceable overwrite — wiring ENG·02 straight onto the "diffable" through-line. Third, queryable and composable: text can be retrieved, chunked, and assembled into the window on demand, serving precisely the "the effective window is limited, assemble actively" discipline above. Put the three together and "files as source of truth" is not nostalgia but the optimal response to the hard constraint that "the agent must read it, diff it, and query it on demand." Falsifiable signal: if your team's key knowledge is still mainly locked in proprietary formats openable only by specific software that the agent cannot read in, then however much you stress "context matters," the context your agent can actually use is incomplete — your format choice has shut it out at the door.

ENG
03
SPEC · 规格可机检
MACHINE-CHECKABLE SPEC
重画 · 原理
Redraw · Principle

规格要能被机器检验——为什么类型赢

Specs must be machine-checkable — why types win

生成充裕后,瓶颈是"对不对"。要让生成自己收敛,就得把"对"写成机器能检验的形式。为什么 TypeScript 与类型系统价值被放大?因为类型就是机器可检验的规格与护栏

Once generation is abundant, the bottleneck is "is it correct." To make generation converge on its own, "correct" must be written in a form a machine can check. Why are TypeScript and type systems amplified? Because types are machine-checkable specs and guardrails.

把意图外化成机器可检验的形式——类型、schema、测试、eval、lint——等于给生成循环一个目标函数:它在生成时就约束、在 CI 里自动验、把"对"从人脑搬进可执行的检查。规格越可机检,生成越能自我收敛、验证越能自动化、人越能只盯"只有人能定的对"。这与组织部分的 T1 同构:判断退守到稀缺节点,而"何为对"由人来定、由机器来查。

Externalize intent into a machine-checkable form — types, schemas, tests, evals, lints — and you give the generation loop an objective function: it constrains at generation, verifies automatically in CI, and moves "correct" from heads into executable checks. The more checkable the spec, the more generation self-converges, the more verification automates, and the more humans can watch only the "correct" that only humans can define. Isomorphic to the organization part's T1: judgment retreats to scarce nodes, while humans define "what is right" and machines check it.

为什么类型系统的价值被放大——讲清机制,而不是站队语言。不是 TypeScript 这门语言赢,是"把约束写成机器当场能驳回的形式"这件事赢。一个类型签名 (user: User) => Result<Order, PaymentError> 同时是三样东西:一份给人读的意图说明、一份给 agent 的生成护栏(它生成时就被约束在签名内)、一份给编译器的可机检规格(违反当场报错)。这三样过去要靠文档、口头约定、code review 分别维护,现在塌缩成同一行、且不会过期。生成越充裕,这种"当场驳回"的价值越高——因为人来逐个驳回跟不上生成的速度,只有机器能在生成的同一时间尺度上说"不"。凡是能把更多"对"压进编译期、压进类型、压进 schema 的做法,都在沿同一条原理放大;反过来,靠运行时才暴露、靠人事后才发现的约束,都在被边缘化。

Why type systems get amplified — the mechanism, not a language allegiance. It is not the TypeScript language that wins but the act of "writing a constraint in a form the machine can reject on the spot." A type signature like (user: User) => Result<Order, PaymentError> is three things at once: an intent brief a human reads, a generation guardrail for the agent (constrained to the signature as it generates), and a machine-checkable spec for the compiler (a violation errors out immediately). These three used to be maintained separately as docs, verbal convention, and code review; now they collapse into one line that cannot go stale. The more abundant generation is, the higher the value of this "reject on the spot" — because a human rejecting candidates one by one cannot keep pace with generation, and only a machine can say "no" on the same time scale as generation. Anything that presses more "correctness" into compile time, into types, into schemas amplifies along the same principle; conversely, constraints that only surface at runtime, found only by humans after the fact, are being marginalized.

trust-but-verify 由此长出来。把约束做成可机检的形式,等于声明:agent 的产出默认不可信,要由一个与生成分离的检查器判定才放行。这不是对 AI 的敌意,是工程纪律——人写的代码同样默认不可信,所以我们才有类型检查、测试、CI。区别只在:agent 生成快了几个量级,"事后人审"这条旧防线被冲垮了,必须把验证前移、自动化、并和生成解耦。下一节(ENG·04)讲这个独立验证器如何嵌进循环,ENG·06 讲它如何落成可照做的三档分工。

This is where trust-but-verify grows from. Making constraints machine-checkable is a declaration: the agent's output is untrusted by default and is released only when a checker, separate from generation, judges it. This is not hostility toward AI but engineering discipline — human-written code is untrusted by default too, which is why we have type checks, tests, and CI at all. The only difference is that the agent generates orders of magnitude faster, the old "humans review afterward" line of defense is overrun, and verification must be moved earlier, automated, and decoupled from generation. The next sheet (ENG·04) covers how this independent verifier embeds into the loop; ENG·06 covers how it lands as a copyable three-tier division of labor.

交给 ClaudeHand to Claude
  • 风格与 lint
  • Style & linting
  • 提交前抓 bug 并修
  • Catch & fix bugs pre-commit
  • 补测试 / 跑 eval
  • Add tests / run evals
留给人 · 定何为对Keep with humans · define right
  • 法务与风险容忍
  • Legal & risk tolerance
  • 信任边界与安全敏感代码
  • Trust boundaries & security-sensitive code
  • 产品品味与"够好"的判据
  • Product taste & the bar for "good enough"
核心图KEY FIGFIG. E3.0 / THE VERIFIABILITY GRADIENT · 内核②的分叉点 看懂:判断在哪条线上一分为二 Read: the line where judgment splits in two
◀ 可机检 · 廉价 · 确定 ◀ machine-checkable · cheap · deterministic 构成性 · 只有人能定 ▶ constitutive · only humans set ▶ 类型 / 编译器types / compiler 测试 / schematests / schema eval / lintevals / lint 语义 diff / AI 评审semantic diff / AI review 架构取舍 / 品味architecture trade-offs / taste "何为对"的边界the bar for "correct" 内核第②步在此分叉 kernel step ② forks here 左半 → 并入 ①充裕,被自动化 Left half → joins ① abundance, automated 机器能廉价且确定地判对错 → 交给 CI / checker machine judges correctness cheaply → to CI / checker 右半 → 下沉 ④,留给人 Right half → sinks to ④, kept with people 构成性判断:何为对、风险容忍、信任边界 constitutive: what is right, risk tolerance, trust seams
第②步"判断退守"不是一个台阶,是一条光谱。把每一类正确性按"机器能否廉价且确定地判"排开:左端的类型 / 测试 / eval 可机检,向右经语义 diff、AI 评审过渡,直到右端"何为对"的边界——这条只有人能定。分叉线左半并入①充裕被自动化,右半下沉④留给人。这条梯度是全系列共用的同一把尺:ENG·06 的三档分工、ENG·07 的规格阶梯、ENG·10 的边界即判断节点,都是它在不同面上的投影。
Step ②'s "judgment retreats" is not a stair but a spectrum. Lay out each kind of correctness by "can a machine judge it cheaply and deterministically": at the left, types / tests / evals are machine-checkable; moving right through semantic diff and AI review, until the right end, the bar for "what counts as correct" — that only humans can set. Left of the fork joins ① abundance and automates; right of it sinks to ④ and stays with people. This gradient is the one ruler the whole series shares: the three delegation tiers (ENG·06), the spec ladder (ENG·07), and the boundary-as-judgment-node (ENG·10) are its projections on different faces.
检验信号 / 深潜Signal / deep dive

可机检比例上升、人审集中在只有人能答处。验证为何是唯一承重墙,见The machine-checkable share rises; human review concentrates where only humans can answer. Why verification is the one load-bearing wall is in 验证篇 ↗the Verification chapter ↗

把规格写到"哪一档",决定了它的成色

Which "rung" you write the spec to decides its quality

"机器可检验"还不够精确——它只说了规格要采取什么形式,没说规格在团队里处于什么地位。规格驱动开发(SDD)把这件事铺成一道成熟度阶梯,三级:Spec-First,先写规格再写代码,但规格写完就被搁置,真源仍是代码;Spec-Anchored,规格与代码并存、且规格被当作权威参照,代码偏离规格被视为需要解释的事;Spec-as-Source,规格是唯一真源,代码是从规格生成或对规格负责的产物,改行为先改规格。这条阶梯回答了一个 ENG·03 单讲"形式"时回避的问题:同样是"写了规格",为什么有的团队规格活、有的团队规格三个月没人碰?因为他们停在不同档。可机检(这一节)是规格的形式条件,成熟度阶梯(ENG·07)是规格的地位条件,两者缺一不可——一份可机检但没人当真源的规格,和一份被当真源但全是自然语言、机器没法验的规格,都会失败。〔源 Graziano《AI-Native Engineering》Day 5–6 SDD 三级成熟度 Spec-First→Spec-Anchored→Spec-as-Source,证据级 Ⅳ 一手从业者[R1]

"Machine-checkable" is not precise enough — it states only what form the spec takes, not what standing the spec holds in the team. Spec-driven development (SDD) lays this out as a maturity ladder of three rungs: Spec-First, write the spec before the code, but the spec is shelved once written and the source of truth remains the code; Spec-Anchored, spec and code coexist and the spec is treated as the authoritative reference, with code drifting from spec seen as something requiring explanation; Spec-as-Source, the spec is the sole source of truth, code is generated from or answerable to the spec, and to change behavior you change the spec first. This ladder answers a question ENG·03 ducks when it speaks only of "form": both teams "wrote a spec," so why is one team's spec alive while another's has gone untouched for three months? Because they sit on different rungs. Machine-checkability (this sheet) is the spec's form condition; the maturity ladder (ENG·07) is the spec's standing condition, and neither is dispensable — a machine-checkable spec no one treats as source, and a spec treated as source but all natural language that no machine can check, both fail. [Source: Graziano, AI-Native Engineering Day 5–6 SDD maturity Spec-First→Spec-Anchored→Spec-as-Source, grade Ⅳ practitioner. [R1]]

类型不是约束的全部:可机检的光谱

Types are not the whole of constraint: the machine-checkable spectrum

说"类型赢"容易被窄化成"用了静态类型语言就行",那是把这一节的机制读浅了。真正放大的不是类型这一种形式,是"把约束写成机器当场能驳回的形式"这件事的整条光谱——类型只是这条光谱上最靠近编译期、最廉价的一段。往光谱右边走,依次还有:schema(约束数据的形状)、契约/前后置条件(约束接口的行为)、property-based 测试(约束"对任意输入都该成立的性质"而非单个样例)、不变量断言(约束运行时不该被破坏的状态)、以及 eval(约束语义层面的"对不对")。这些都是同一条原理的不同强度版本:把"对"前移到一个不依赖人事后审查、能在生成的同一时间尺度上说"不"的检查里。所以工程师真正该练的判断不是"该用哪门语言",而是"对这一处约束,能把它压到光谱多左边"——能压进类型的别只写注释,能写成 property 的别只写一个样例测试,能让机器在 CI 里当场驳回的别留到人审。可机检比例这个指标,量的就是你的约束整体在这条光谱上有多靠左。可证伪信号:若你的关键约束大量靠"代码注释 + 评审时人记得提醒"来维持,那它们其实在光谱最右、最弱、最易腐烂的那一端——这些约束会在某次没人记得提醒的评审里悄悄失效,而类型或 schema 不会。

Saying "types win" is easily narrowed to "just use a statically typed language," which reads this sheet's mechanism too shallowly. What is actually amplified is not the single form of types but the whole spectrum of "writing a constraint in a form the machine can reject on the spot" — types are merely the cheapest segment of that spectrum, nearest compile time. Moving right along the spectrum come, in turn: schemas (constraining the shape of data), contracts / pre- and post-conditions (constraining the behavior of an interface), property-based tests (constraining "a property that should hold for any input" rather than a single example), invariant assertions (constraining state that must not be broken at runtime), and evals (constraining semantic-level "is it correct"). These are all different-strength versions of one principle: move "correct" forward into a check that does not depend on after-the-fact human review and can say "no" on the same time scale as generation. So the judgment an engineer should actually practice is not "which language to use" but "for this constraint, how far left on the spectrum can I press it" — what can go into a type, do not leave as a comment; what can be a property, do not leave as one example test; what a machine can reject on the spot in CI, do not leave to human review. The machine-checkable share measures exactly how far left your constraints sit overall. Falsifiable signal: if your key constraints are largely maintained by "code comments plus a reviewer remembering to mention it," they actually sit at the rightmost, weakest, most rot-prone end of the spectrum — those constraints quietly lapse in some review where no one remembers, while a type or schema does not.

ENG
04
LOOP · 循环与自我进化
LOOP & SELF-EVOLUTION
重画 · 核心
Redraw · Core

把工作组织成会自我改进的循环

Organize work as self-improving loops

不要把 agent 当一次性执行,要当成可观测、可纠偏、自我改进的循环。循环的最小机制是进化的最小机制:生成器 + 独立验证器 + 外部状态 = 变异 / 选择 / 保留。

Do not treat the agent as one-shot execution; treat it as an observable, self-correcting, self-improving loop. A loop's minimal mechanism is evolution's minimal mechanism: a generator + an independent verifier + external state = variation / selection / retention.

harness 是承载循环的脚手架(心跳、隔离、知识、触手、制衡——做与查分离);spec 是循环的目标函数;eval 是承重墙——错误回流成新的 eval,随产出复利增长。而 skills / MCP / CLI 之所以是对的抽象,是因为它们把能力封装成可组合的接口暴露给 agent——底层原理还是上一节那条:对 agent 可读、可组合的纯文本协议赢,工具即上下文。

Harness is the scaffolding that carries the loop (heartbeat, isolation, knowledge, tentacles, checks-and-balances — doing separated from checking); spec is the loop's objective function; eval is the load-bearing wall — errors flow back as new evals and compound with output. And skills / MCP / CLI are the right abstraction because they package capability into composable interfaces exposed to agents — the underlying principle is the previous sheet's: legible, composable, plain-text protocols win, tools are context.

为什么这个最小机制就是进化的最小机制。把"自我改进"这个被滥用的词拆到底,它只需要三件东西凑齐:一个会产生多样候选的生成器(变异)、一个与生成分离、能判优劣的选择器(选择)、一个能把胜出者留到下一轮的外部存储(保留)。三件齐了,系统就会无监督地变好;缺任何一件,它就只会原地抖动。agentic 工作流恰好能凑齐这三件:agent 是生成器,eval / CI 是选择器,上下文库 / skills 库是外部存储。关键的、也最常被省掉的是第二件——选择必须与生成分离。让生成者给自己打分,等于让变异自己决定自己被不被选中,进化立刻退化成随机游走。这就是为什么"独立验证器"是承重墙:不是多一道保险,是这个机制能不能成立的充要条件。

Why this minimal mechanism is evolution's minimal mechanism. Take the over-used phrase "self-improving" down to the bottom and it needs only three things present together: a generator that produces diverse candidates (variation), a selector, separate from generation, that judges better from worse (selection), and an external store that carries winners into the next round (retention). With all three, the system improves unsupervised; missing any one, it merely jitters in place. Agentic workflows happen to supply all three: the agent is the generator, evals / CI the selector, the context store / skills library the external store. The crucial and most often omitted piece is the second — selection must be separate from generation. Letting the generator score itself is letting the mutation decide whether it survives, and evolution instantly degrades into a random walk. This is why the "independent verifier" is the load-bearing wall: not an extra safeguard but the necessary and sufficient condition for the mechanism to hold at all.

自改进的具体形态 · skills / commands / rules 库是会复利的项目级资产。steering loop 跑久了,会沉淀出一个东西:把每次失败的修法固化成可复用的 skill、command、rule。这个库和代码库不同——代码库记录"系统是什么",这个库记录"如何让 agent 把系统做对"。它随团队每次踩坑而增长,且对全队、对每次运行复用,所以它是少数几个真正随时间复利的工程资产之一。一个反直觉的推论:团队的护城河正从"代码本身"向"驯服 agent 的 harness + 库"转移——代码可被重新生成,但你们团队积累的、关于"什么会让 agent 在你们这套系统上犯错"的知识,对手抄不走。

The concrete shape of self-improvement · the skills / commands / rules library is a compounding project asset. Run the steering loop long enough and something settles out: each failure's fix is hardened into a reusable skill, command, or rule. This library differs from the codebase — the codebase records "what the system is," this library records "how to get the agent to build the system right." It grows with every pothole the team hits, and it is reused across the whole team and every run, so it is one of the few engineering assets that genuinely compounds over time. A counterintuitive corollary: a team's moat is shifting from "the code itself" toward "the harness + library that tames the agent" — code can be regenerated, but the knowledge your team has accumulated about "what makes the agent err on your particular system" cannot be copied off you.

生成器Generator
agent 大量产出候选——变异。The agent produces many candidates — variation.
独立验证器Independent verifier
与生成分离地判对错——选择。这是承重墙。Judges correctness, separate from generation — selection. The load-bearing wall.
外部状态External state
把通过的留存进上下文库——保留,于是循环会复利。Retains what passes into the context store — retention, so the loop compounds.

harness 的解剖(二维分类法,源 Martin Fowler《Harness Engineering for Coding Agents》)。承载循环的脚手架可沿两条轴拆开——guides(前馈,生成前把跑道铺好)sensors(反馈,生成后抓错);每条轴再分 computational(确定性、廉价:lint / 类型 / 测试 / schema)inferential(靠 LLM 推理:AI 评审 / 语义 diff / 计划批判)。规则很简单:能用 computational 就别用 inferential,inferential 只补"机器测不到的判断层";多数团队在某一格过投、在另一格欠投。

Anatomy of the harness (a two-axis taxonomy, after Martin Fowler, Harness Engineering for Coding Agents). The scaffolding that carries the loop splits along two axes — guides (feedforward: lay the runway before generation) vs sensors (feedback: catch errors after); each axis splits again into computational (deterministic, cheap: lint / types / tests / schema) vs inferential (LLM reasoning: AI review / semantic diff / plan critique). The rule is simple: prefer computational; let inferential cover only the judgment layer machines cannot test; most teams overinvest in one cell and underinvest in another.

guides · 前馈(生成前 steer)guides · feedforward (steer before)
  • computational:规格 / 类型约束 / 脚手架模板——生成前铺好跑道
  • computational: specs / type constraints / scaffold templates — lay the runway first
  • inferential:计划批判 / 示例 / 意图说明——把意图喂进去
  • inferential: plan critique / examples / intent briefs — feed intent in
sensors · 反馈(生成后 detect)sensors · feedback (detect after)
  • computational:lint / 类型 / 测试 / schema 校验——确定性抓错
  • computational: lint / types / tests / schema checks — catch errors deterministically
  • inferential:AI 评审 / 语义 diff——抓测试抓不到的判断层
  • inferential: AI review / semantic diff — catch the judgment layer tests miss
核心图KEY FIGFIG. E4.0 / THE HARNESS 2×2 · 脚手架的四格 看懂:你的 harness 在哪一格过投、哪一格欠投 Read: which cell your harness over-invests, which it under-invests
时间轴 → guides(前馈,生成前) · sensors(反馈,生成后) time axis → guides (feedforward, before) · sensors (feedback, after) 机制轴 ↑ computational(确定 · 廉价) · inferential(靠推理) mechanism axis ↑ computational (cheap) · inferential (LLM) GUIDES · 前馈 GUIDES · feedforward SENSORS · 反馈 SENSORS · feedback computational inferential G×C 规格 / 类型约束specs / type constraints 脚手架模板scaffold templates CI 预设 / lint 规则CI presets / lint rules 生成前铺好跑道,确定性约束lay the runway, deterministic constraints S×C lint / 类型检查lint / type checks 单测 / 集成测试unit / integration tests schema 校验 / evalschema checks / evals 生成后确定性抓错——最便宜的承重墙catch errors deterministically — cheapest wall G×I 计划批判 / 计划复核plan critique / plan review 示例 / few-shotexamples / few-shot 意图说明 / 角色设定intent briefs / role priming 生成前把意图喂进去,靠推理引导feed intent before generation, LLM-guided S×I AI 评审 / 语义 diffAI review / semantic diff "判断层"抽查judgment-layer spot checks 回归意图的一致性核验intent-consistency review 只补机器测不到的判断层——别滥用cover only what machines can't test — don't overuse 法则:能用 computational 就别用 inferential · 多数团队在某一格过投、另一格欠投 Rule: prefer computational over inferential · most teams over-invest one cell and under-invest another 最常见的失衡:S×I 堆满 AI 评审,却空着 S×C 的测试与 G×C 的规格 most common imbalance: S×I full of AI review while S×C tests and G×C specs sit empty
harness 不是一团模糊的"工具",它沿两条轴干净地分成四格:时间轴(前馈 guides / 反馈 sensors)× 机制轴(确定的 computational / 靠推理的 inferential)。四格各有该放的东西,规则只有一条——能用 computational 就别用 inferential,把昂贵又不确定的 LLM 推理省给机器真测不到的判断层。诊断你自己的 harness:哪一格塞满了、哪一格空着?最常见的病是 S×I(AI 评审)过投、S×C(测试)与 G×C(规格)欠投——下面的 INSTRUMENT 08 让你勾选自查。〔源 Martin Fowler《Harness Engineering for Coding Agents》,证据级 Ⅳ;经 Graziano 转引[R4][R1]
The harness is not a vague blob of "tools"; it splits cleanly along two axes into four cells: the time axis (feedforward guides / feedback sensors) × the mechanism axis (deterministic computational / LLM inferential). Each cell has its proper contents, and the rule is single — prefer computational over inferential, saving expensive, non-deterministic LLM reasoning for the judgment layer machines genuinely cannot test. Diagnose your own harness: which cell is stuffed, which is empty? The most common ailment is over-investing S×I (AI review) while under-investing S×C (tests) and G×C (specs); INSTRUMENT 08 below lets you check yourself. [Source: Martin Fowler, Harness Engineering for Coding Agents, grade Ⅳ; via Graziano. [R4][R1]]

steering loop · 把"自我进化"讲成可执行动作。每次 agent 失败 → 问"哪条 guide 或 sensor 本该拦住它" → 补上或磨利那一条 → 监督需求随时间下降;skills / commands / rules 库于是成为随时间复利的项目级资产。harness 不是另起炉灶,而是上一节"上下文工程"的工程纪律:后者问"此刻窗口里该有什么",前者问"什么系统在每次运行、跨整个团队地生产并校验那份上下文"——这一句把 ENG·02 与本张缝在一起。〔源 Martin Fowler,证据级 Ⅳ 一手从业者;经 Graziano《AI-Native Engineering》转引。[R4][R1]

steering loop · turning "self-evolution" into an executable move. Every time the agent fails → ask "which guide or sensor should have caught it" → add or sharpen that one → supervision demand falls over time; skills / commands / rules libraries thus become project-level assets that compound. The harness is not a fresh start but the engineering discipline of the previous sheet's context engineering: that one asks "what should be in the window right now," this one asks "what system produces and checks that context on every run, across the whole team" — the sentence that stitches ENG·02 to this sheet. [Source: Martin Fowler, grade Ⅳ practitioner; via Graziano's AI-Native Engineering. [R4][R1]]

INSTRUMENT 08 · 脚手架缺口诊断器INSTRUMENT 08 · Harness-Gap Diagnostic ● LIVE

勾选你当前 harness 真正覆盖的格。诊断器据 FIG. E4.0 的法则给出你最可能的失败模式与最便宜的下一笔投入。 Tick the cells your harness genuinely covers. Per FIG. E4.0's rule, it names your likely failure mode and the cheapest next investment.

检验信号Test signal

循环能在低人值守下自我纠偏;eval 覆盖随产出增长,而非靠人逐个把关。The loop self-corrects with little human babysitting; eval coverage grows with output instead of relying on humans to gate each one.

harness 是产品:guides 与 sensors 的二维分类

The harness is the product: the two-axis taxonomy of guides and sensors

"生成器 + 独立验证器 + 外部状态"给了循环的骨架,但还缺一张能让人当场盘点"我的脚手架到底有哪些、缺哪些"的分类表。Fowler 的《Harness Engineering for Coding Agents》给了这张表,它沿两条轴把围绕模型的整套脚手架干净地分成四格。第一条是时间轴guides(前馈)在生成之前引导——规则、skills、commands、上下文装配,是你提前塞进去的"该怎么做";sensors(反馈)在生成之后探测——测试、lint、类型检查、AI 评审,是事后告诉你"做得对不对"的。第二条是机制轴computational(确定性、廉价)是 lint / type / test / schema 这类不用模型推理、跑一次就有确定答案的检查;inferential(推理性)是 AI 评审 / 语义 diff / 计划批判这类要调模型来判断的检查。〔源 Martin Fowler《Harness Engineering for Coding Agents》(经 Graziano Day 4 转引),证据级 Ⅳ 一手从业者,ENG·04 的直接理论源[R4][R1]

"Generator + independent verifier + external state" gives the loop's skeleton but still lacks a table that lets you inventory on the spot "which scaffolding I actually have and which I lack." Fowler's Harness Engineering for Coding Agents gives that table, sorting the whole scaffolding around the model cleanly into four cells along two axes. First, the time axis: guides (feed-forward) steer before generation — rules, skills, commands, context assembly, the "how to do it" you put in ahead of time; sensors (feedback) detect after generation — tests, lint, type-checks, AI review, telling you afterward "whether it was done right." Second, the mechanism axis: computational (deterministic, cheap) are checks like lint / type / test / schema that need no model inference and give a definite answer in one run; inferential are checks like AI review / semantic diff / plan critique that call a model to judge. [Source: Martin Fowler, Harness Engineering for Coding Agents (via Graziano Day 4), grade Ⅳ practitioner, the direct theoretical source for ENG·04. [R4][R1]]

这张二维表立刻给出两条可照做的纪律。第一,先用 computational,inferential 只补判断层。能用确定性、廉价的检查解决的,绝不调模型去判——类型错误用编译器抓,不要让另一个 LLM "看一眼觉得对不对";inferential 检查只留给那些确实需要语义判断、computational 够不着的地方(比如"这个变量名是否表达了意图""这段重构有没有偷偷改了行为")。第二,多数团队同时在一边过投、一边欠投。常见的过投是堆砌 inferential(让 AI 评审一切,慢且贵且不确定),常见的欠投是 guides 太薄(没有把"该怎么做"沉淀成可复用的 skills/commands,于是每次都从零引导)。把你现有的脚手架往这四格里填一遍,空着的格子就是你的欠投,挤爆的格子就是你的过投。这正是 INSTRUMENT 08 脚手架缺口诊断器在做的事——它把这张表做成了可勾选的自检。可证伪信号:若你无法把现有的每一项检查准确地放进这四格之一,说明你对自己的脚手架其实没有清晰的图景,那"harness 是产品"这句话对你还停在口号。

This two-axis table immediately yields two copyable disciplines. First, use computational first; let inferential only fill the judgment layer. Whatever a deterministic, cheap check can solve, never call a model to judge — catch type errors with the compiler, do not have another LLM "glance and feel whether it is right"; reserve inferential checks for what genuinely needs semantic judgment that computational cannot reach (e.g. "does this variable name express the intent," "did this refactor quietly change behavior"). Second, most teams over-invest on one side and under-invest on the other at the same time. A common over-investment is piling on inferential (have AI review everything — slow, expensive, uncertain); a common under-investment is thin guides (no "how to do it" distilled into reusable skills/commands, so every run starts steering from zero). Fill your existing scaffolding into these four cells once: the empty cells are your under-investment, the overflowing cells your over-investment. This is exactly what INSTRUMENT 08, the Harness-Gap Diagnostic, does — it turns this table into a checkable self-test. Falsifiable signal: if you cannot place each existing check accurately into one of the four cells, you do not actually have a clear picture of your scaffolding, and "the harness is the product" is still a slogan for you.

自改进 = steering loop:每次失败补一条护栏

Self-improvement = the steering loop: each failure adds a guardrail

"会自我改进的循环"听起来抽象,但它有一个完全具体、可照做的执行形态,Fowler 把它叫 steering loop:每次 agent 失败,就问一句"哪条 guide 或 sensor 本该拦住这次失败",然后把那一条补上或磨利。这一个动作把"自我改进"从一种模糊的愿望,变成一台明确的机器:失败是输入,对护栏的一次增补是输出,而输出会让同类失败下次被自动拦下,于是监督需求随时间单调下降。注意这台机器改进的不是模型(模型是供应商给的、你改不动),而是围绕模型的脚手架——你的 skills、commands、rules、evals 库。这正是为什么这套库是"随时间复利的项目级资产":它不是一堆配置文件,而是这个项目踩过的每一个坑沉淀下来的、可复用的判断。它和 ENG·15 的 eval 回流是同一个循环的两种投影——eval 回流盯的是 sensor 那一侧(事后探测),steering loop 把它扩展到 guide 那一侧(事前引导):有些失败该补一条会变红的测试(sensor),有些失败该补一条"下次该这么做"的 skill(guide)。把两侧都纳入这个循环,harness 就成了一个会随团队经验一起变强的活系统。可证伪信号:若你的 skills/rules/evals 库长期不变、或只在项目初期写过一轮就再没更新,那说明 steering loop 没有在转——失败发生了,但没有回流成护栏,于是同类失败会一直重来,监督需求也降不下去。〔源 Martin Fowler《Harness Engineering for Coding Agents》steering loop / harness 复利 + Graziano Day 4,证据级 Ⅳ 一手从业者[R4][R1]

"A self-improving loop" sounds abstract, but it has a fully concrete, copyable executable form Fowler calls the steering loop: each time the agent fails, ask "which guide or sensor should have caught this failure," then add or sharpen that one. This single move turns "self-improvement" from a vague wish into a definite machine: failure is the input, an addition to a guardrail is the output, and the output makes the same class of failure auto-caught next time, so supervision demand declines monotonically over time. Note this machine improves not the model (the model is the vendor's; you cannot change it) but the scaffolding around the model — your library of skills, commands, rules, evals. This is exactly why that library is "a project-level asset that compounds over time": it is not a heap of config files but the reusable judgment distilled from every pit this project has stepped in. It is the same loop as ENG·15's eval feedback, seen on two faces — eval feedback watches the sensor side (detect after the fact), the steering loop extends it to the guide side (steer beforehand): some failures should add a test that turns red (a sensor), some should add a "next time, do it this way" skill (a guide). Bring both sides into this loop and the harness becomes a living system that strengthens with the team's experience. Falsifiable signal: if your skills/rules/evals library is long unchanged, or written once early and never updated, the steering loop is not turning — failures happened but did not flow back into guardrails, so the same class keeps recurring and supervision demand never falls. [Source: Martin Fowler, Harness Engineering for Coding Agents steering loop / harness compounding + Graziano Day 4, grade Ⅳ practitioner. [R4][R1]]

harness 工程是上下文工程的工程纪律

Harness engineering is the engineering discipline of context engineering

最后把 ENG·02 的上下文工程和这一节的 harness 工程缝在一起,因为它们常被当成两件事,其实是一件事的两个时间尺度。一句精准的定位是:harness 工程不是另起炉灶,而是上下文工程的工程纪律——后者问"此刻这个窗口里该有什么",前者问"什么系统在每次运行、跨整个团队地生产、校验、纠正那份上下文"。上下文工程是单次的、当下的:这一次调用,我该往窗口里装配哪些事实、哪些约束、哪些示例。harness 工程是重复的、系统的:我用什么 guides 在每次生成前自动把该有的上下文装进去(而不是每次手动),用什么 sensors 在每次生成后自动校验产出、把发现的问题回流成下次该补的上下文。这正是一个好 harness 的本相:一台自动做上下文工程的机器,它把"每次都得人记得装配对的上下文"这件易错、不可规模化的事,变成了系统在每次运行里稳定执行的纪律。这解释了为什么这两章在这一卷里相邻又互相指认:你在 ENG·02 学会"一次该装配什么",在 ENG·04 学会"怎么让一个系统每次都替你装配对、并随团队经验越装越准"。可证伪信号:若你的团队每次让 agent 干活,仍然要靠某个人手动想起来"哦这次得告诉它那个约定、给它那几个文件",那说明你停在了单次的上下文工程、没有把它升级成 harness——一旦那个人不在,上下文就装配错了,这正是缺乏 harness 纪律的代价。〔源 Martin Fowler《Harness Engineering for Coding Agents》+ Graziano Day 4(harness 工程是 context 工程的工程纪律),证据级 Ⅳ 一手从业者[R4][R1]

Finally, stitch ENG·02's context engineering to this sheet's harness engineering, because they are often taken as two things when they are two time scales of one. A precise placement: harness engineering is not a fresh start but the engineering discipline of context engineering — the latter asks "what should be in this window right now," the former asks "what system, on every run and across the whole team, produces, verifies, and corrects that context." Context engineering is single-shot and present-tense: for this call, which facts, constraints, examples should I assemble into the window. Harness engineering is repeated and systematic: which guides automatically assemble the right context before every generation (instead of doing it by hand each time), which sensors automatically verify the output after every generation and flow what they find back into the context to add next time. In other words, a good harness is a machine that does context engineering automatically, turning the error-prone, unscalable "someone must remember to assemble the right context every time" into a discipline the system executes stably on every run. This explains why the two chapters sit adjacent and point at each other in this volume: in ENG·02 you learn "what to assemble once," in ENG·04 you learn "how to make a system assemble it right every time and get better at it with the team's experience." Falsifiable signal: if every time your team puts the agent to work, someone still has to manually recall "oh, this time I have to tell it that convention, give it those files," you are stuck at single-shot context engineering and have not upgraded it into a harness — once that person is away, the context is assembled wrong, exactly the cost of missing harness discipline. [Source: Martin Fowler, Harness Engineering for Coding Agents + Graziano Day 4 (harness engineering as the engineering discipline of context engineering), grade Ⅳ practitioner. [R4][R1]]

ENG
05
FLOW · 流程重画
FLOW · REDRAW
重画 · 可拷贝环
Redraw · Copyable loop

研发流程重画成一个规格驱动的环

Redraw the dev process into a spec-driven loop

瀑布把规划堆在前头,敏捷把它切成迭代——两者都假设"实现很贵、要省着用"。当实现充裕,正确的形状不是直线、不是冲刺,是一个把意图、生成、验证缝成闭环、且能从每次跑动里学习的

Waterfall front-loads planning; agile slices it into iterations: both assume "implementation is expensive, so ration it." When implementation is abundant, the right shape is not a line and not a sprint but a loop that stitches intent, generation, and verification into one closed cycle and learns from every run.

受力分析 · 为何是环不是线。线性流程把"规划→实现→测试"当成一次性管道,赌的是上游判断准、下游执行慢。agentic coding 把这个赌注反过来:执行近乎免费,错误却在多步里滚雪球放大。线性管道没有把偏差挡回上游的回路,于是一处早期误解一路漂到生产。环把"学习"显式做成第六步——每次失败回流成下一轮的规格与护栏——于是流程本身会随产出复利。这就是把 ENG·04 的自我改进循环,从单个 agent 抬到整条研发流水线的尺度。

Force analysis · why a loop, not a line. A linear flow treats "plan to build to test" as a one-shot pipeline, betting that upstream judgment is sound and downstream execution is the slow part. Agentic coding inverts the bet: execution is near-free, yet errors snowball across steps. A linear pipeline has no return path to push drift back upstream, so one early misread floats all the way to production. The loop makes "learn" an explicit sixth step (each failure flows back as next round's spec and guardrail), so the process itself compounds with output. This lifts ENG·04's self-improving loop from a single agent to the scale of the whole pipeline.

可拷贝环 · Specify → Plan → Execute → Verify → Integrate → Learn

Copyable loop · Specify → Plan → Execute → Verify → Integrate → Learn

Specify · 写规格Specify
先写 SPEC.md:要什么、为什么、非目标、验收条件。人持有这一步。Write SPEC.md first: what, why, non-goals, acceptance conditions. Humans own this step.
Plan · 拆计划Plan
PLAN.md 把规格拆成可执行步骤;只在动手前展开当下这步(JIT)。PLAN.md breaks the spec into executable steps; expand only the step at hand (JIT).
Execute · 生成Execute
TASKS.md 逐项交给 agent 生成——这一步被自动化、可并行。TASKS.md hands items to the agent to generate; this step is automated and parallelizable.
Verify · 验证Verify
独立 checker 判对错:类型 / 测试 / eval / 逐 diff 评审。承重墙在此。An independent checker judges: types / tests / evals / diff-by-diff review. The load-bearing wall.
Integrate · 并入Integrate
过墙的才合并;CI 即选择压力,PR 即评审门。Only what clears the wall merges; CI is the selection pressure, the PR is the review gate.
Learn · 回流Learn
把这轮的错沉淀成新 eval / 新 rule / 更准的规格——下一轮更稳。Settle this round's errors into a new eval / rule / sharper spec; the next round is steadier.
核心图KEY FIGFIG. E5.0 / THE SDD RING · 规格驱动的闭环 看懂:为什么 Verify 是承重节点、Learn 让环复利 Read: why Verify is load-bearing and Learn makes the loop compound
① Specify ① Specify SPEC.md · 人持有 SPEC.md · human-owned ② Plan ② Plan PLAN.md · JIT PLAN.md · JIT ③ Execute ③ Execute agent · 自动 · 并行 agent · automated · parallel ④ Verify ④ Verify 独立 checker · 承重门 independent checker · load-bearing gate ⑤ Integrate ⑤ Integrate CI = 选择压力 CI = selection pressure ⑥ Learn ⑥ Learn 错误回流成新 eval errors flow back as evals 回流 → 下一轮规格 / 护栏更准 feedback → next round's spec / guardrail sharper 闭环,不是直线 a loop, not a line 过 Verify 门的才向 Integrate 流动 only what clears Verify flows to Integrate
瀑布是直线、敏捷是切短的直线,两者都赌"实现很贵"。实现充裕后,正确形状是这个六节点闭环。两个节点承重:Verify 是独立 checker 把守的门——只有过墙的才向 Integrate 流动,它是选择压力;Learn 把这轮的错沉淀成新 eval / 新 rule,沿那条朱红回流箭头喂回 Specify,于是覆盖随产出复利、环越跑越稳。把直线拉成环、且让 Learn 真回流,就是把 ENG·04 的自我改进循环抬到整条流水线的尺度。
Waterfall is a line, agile a shortened line; both bet "implementation is expensive." Once implementation is abundant, the right shape is this six-node closed loop. Two nodes are load-bearing: Verify is a gate guarded by an independent checker — only what clears the wall flows to Integrate, and it is the selection pressure; Learn settles this round's errors into a new eval / rule and feeds them back to Specify along the vermilion return arrow, so coverage compounds with output and the loop steadies as it runs. Bending the line into a ring and actually feeding Learn back lifts ENG·04's self-improving loop to the scale of the whole pipeline.

JIT 规划 · 为何不一次规划到底。线性流程在最无知的时刻(项目开头)做最多的规划。环把规划改成即时(just-in-time):只在动手前把当下这步展开到可执行,规格保持稳定,计划保持新鲜。原因是 ENG·02 的同一条约束——有效上下文窗口有限,提前规划得越细,到执行时越陈旧、越占窗口、越稀释注意力。规格先行 ≠ 计划先行:规格是耐久真源,计划是可丢弃的工作面。

JIT planning · why not plan it all up front. A linear flow does the most planning at the moment of greatest ignorance (the project's start). The loop makes planning just-in-time: expand only the step at hand to executable detail, keep the spec stable and the plan fresh. The reason is ENG·02's same constraint (the effective window is finite): the more you pre-plan in detail, the staler and more window-hogging and attention-diluting it is by execution time. Spec-first is not plan-first: the spec is the durable source, the plan is a disposable working surface.

规划视野为什么会坍缩。线性流程的隐含假设是"规划便宜、实现贵,所以多规划、少返工"。当实现近乎免费,这个权衡彻底翻转:返工便宜了,提前规划的细节反而成了负债——它在执行时已陈旧,还白白占着有限的有效窗口、稀释 agent 的注意力(同 ENG·02 那条约束)。于是理性的规划视野从"项目级"坍缩到"下一步级":稳定的是 SPEC.md(耐久真源),易变的是 PLAN.md(用完即弃的工作面)。这和组织卷里"规划周期从年度坍缩到实时"是同一个机制在工程面的显形——不是因为我们变懒,是因为提前规划的边际回报随实现成本下降而塌掉了。一个可照做的判据:如果你的 PLAN.md 越写越厚、改一次代码要同步改一大片计划,说明你把本该即时展开的东西提前固化了,把环又拉直成了线。

Why the planning horizon collapses. A linear flow's implicit assumption is "planning is cheap, implementation is dear, so plan more and rework less." When implementation is near-free, that trade-off flips entirely: rework is cheap now, and pre-planned detail becomes a liability — stale by execution time, and wasting the finite effective window while diluting the agent's attention (the same ENG·02 constraint). So the rational planning horizon collapses from "project-level" to "next-step-level": stable is SPEC.md (the durable source), volatile is PLAN.md (a use-and-discard working surface). This is the same mechanism as the organization volume's "planning cycle collapses from annual to real-time," shown on the engineering face — not because we got lazy but because the marginal return of pre-planning collapses as implementation cost falls. A copyable test: if your PLAN.md keeps thickening and one code change forces a sync of a large swath of plan, you have pre-hardened what should have been expanded just in time, pulling the loop straight into a line again.

Before
规划→实现→测试是一次性管道;早期误解一路漂到生产,没有回上游的路。
Plan to build to test is a one-shot pipeline; an early misread floats to production with no path back upstream.
新 · 原理After · principle
六步闭环:规格耐久、计划即时、验证承重、错误回流——流程随产出复利。
A six-step closed loop: durable spec, JIT plan, load-bearing verify, errors flowing back; the process compounds with output.
证据 · 级 Ⅳ正典 SDD 环 Specify→Plan→Execute→Verify→Integrate→Learn 与三件套 SPEC / PLAN / TASKS,源 Graziano《AI-Native Engineering》7 日路径(Day 5–6),转引 GitHub Spec-kit(规格与代码同住、像代码一样被评审、constitution.md 作硬规则强制层)与 Martin Fowler《Exploring Gen-AI: SDD》。一手从业者策展,非同行评议——按"工具是表层"纪律,此处只取环的形状与回流机制,不搬具体工具链。
Evidence · grade ⅣThe canonical SDD loop Specify→Plan→Execute→Verify→Integrate→Learn and the SPEC / PLAN / TASKS trio come from Graziano's AI-Native Engineering 7-day path (Days 5–6), via GitHub Spec-kit (specs live with code, reviewed like code, with constitution.md as a hard-rule enforcement layer) and Martin Fowler's Exploring Gen-AI: SDD. Practitioner curation, not peer-reviewed; per the "tools are surface" discipline, we take only the loop's shape and feedback mechanism, not the toolchain.

旧 SDLC 的去向What Happens to the Old SDLC Stages

What Happens to the Old SDLC Stages

把"实现充裕→判断退守"套在传统研发的几个阶段上,可以逐项预言它们的去向。和组织卷对管理五职能的处理同构:没有一个阶段"被 AI 增强"或凭空消失,每一个都被沿可验证性梯度劈成两半,可机检的一半下沉为基础设施,构成性的一半上浮为判断。这张表是 FIG. E3.0 那条梯度在研发流程上的逐阶投影。

Apply "execution becomes abundant, judgment retreats" to the stages of the traditional dev cycle and you can forecast each one's fate. It is isomorphic to the organization volume's treatment of the five management functions: no stage is "augmented by AI" or vanishes; each is split along the verifiability gradient, with the machine-checkable half sinking into infrastructure and the constitutive half rising as judgment. This table is FIG. E3.0's gradient projected stage by stage onto the development process.

TABLE E5.0 · SDLC → AI NATIVE研发阶段去向表Fate of the dev stages
阶段Stage
旧实现Old implementation
下沉为基础设施的一半Half that sinks into infrastructure
上浮为判断的一半Half that rises as judgment
需求RequirementsSPECIFY
需求文档 · 评审会 · 一次写死Requirement docs · review meetings · written once
agent 把意图展开成可执行步骤与候选方案,写规格的机械部分自动化(ENG·05 ② Plan,JIT 展开)Agents expand intent into executable steps and option sets; the mechanical part of spec-writing automates (ENG·05 ② Plan, JIT)
intent 与非目标、为"什么值得造"负责(SPEC.md 这一步人持有)Set intent and non-goals, own "what is worth building" (humans hold the SPEC.md step)
设计DesignARCHITECT
详设文档 · UML · 评审签字Detailed-design docs · UML · sign-off reviews
模块内部实现、内部算法、数据结构选型交给 agent 在契约下填充Module internals, algorithms, data-structure choices filled by the agent under contract
接缝与依赖方向——边界即判断节点(ENG·10,高半径 × 不可逆)Draw seams and dependency direction — the boundary as judgment node (ENG·10, high radius × irreversible)
编码CodingEXECUTE
逐行手写 · "打字"是稀缺人时Line-by-line by hand · "typing" is scarce labor
agentic coding 成默认:写码 / 测试 / 重构充裕、可并行(ENG·00 ① 充裕)Agentic coding is the default: code / tests / refactors abundant and parallel (ENG·00 ① abundance)
几乎全沉——只剩 构成性品味处人偶尔接管Almost wholly sunk — only constitutive taste needs the occasional human takeover
测试TestingVERIFY
手写用例 · QA 阶段 · 回归靠人跑Hand-written cases · a QA phase · regressions run by hand
类型 / schema / 测试 / eval 作可机检护栏,CI 自动验、错误回流成新 eval(ENG·03 / 05 ⑥)Types / schema / tests / evals as machine-checkable guardrails; CI verifies, errors flow back as new evals (ENG·03 / 05 ⑥)
"何为对"的判据、读独立 checker 标出的少数异常(承重墙仍在人这端)Set the bar for "correct", read the few anomalies the independent checker flags (the wall stays human-side)
集成IntegrationINTEGRATE
集成窗口 · 手工合并 · 发布评审Integration windows · manual merges · release reviews
CI 即选择压力,PR 即评审门;Continuous AI 工作流默认只读、写操作须显式 safe-output(ENG·08)CI is the selection pressure, the PR the review gate; Continuous AI workflows are read-only by default, writes declared safe-output (ENG·08)
不可逆处的确认门:对外发布、权限变更、数据迁移Guard the confirmation gate at the irreversible: external releases, permission changes, data migrations
运维OperationsLEARN
事后复盘 · 值班盯屏 · 经验留在个人脑里Post-hoc retros · on-call screen-watching · lessons trapped in heads
遥测 + 事故回流成 eval / rule,沉淀为随时间复利的项目级资产;监督从实时改为异步分诊(ENG·09)Telemetry + incidents flow back as evals / rules, settling into project assets that compound; supervision shifts from real-time to async triage (ENG·09)
在结构标记的异常处接管,并问"哪条护栏本该拦住它"Take over at the anomalies structure flags, and ask "which guardrail should have caught it"

表的右侧两列藏着和组织卷一样的结论:"研发流程"作为一串串行阶段整体让位,幸存下来的不是阶段,而是每个阶段里那半截构成性判断。读这张表有一个常见的误读要避免——它不是"自动化吃掉测试与运维岗",而是同一个人的着力点沿可验证性梯度上移:从写实现,到定何为对、划接缝、守不可逆。这正是 ENG·10 角色融合的逐阶证据。

The two right-hand columns conceal the same conclusion as the organization volume: the "dev process" as a chain of serial stages yields wholesale, and what survives is not the stages but the constitutive-judgment half inside each one. One common misreading to avoid: this is not "automation eats the testing and ops roles," but the same person's leverage climbing the verifiability gradient — from writing implementation to setting what is correct, drawing seams, and guarding the irreversible. This is the stage-by-stage evidence for ENG·10's role fusion.

检验信号Test signal

先行:返工率随轮次下降、规格被复用而非每次重写。反指标:PLAN.md 越写越厚、规格沦为合规摆设没人回流——那是把环又拉直成了线。Leading: rework rate falls across rounds; specs get reused, not rewritten each time. Counter-signal: PLAN.md keeps swelling and the spec becomes compliance theater that no one feeds back into; that is the loop pulled straight into a line again.

ENG
06
DELEGATE · 评审矩阵
DELEGATE · REVIEW MATRIX
决策 · 可照做
Decision · Copyable

trust-but-verify 落成三档分工:交办 / 评审 / 自持

trust-but-verify becomes three tiers: Delegate / Review / Own

"信任但要核验"不是口号,是一张可照做的分工表。每个任务沿一条问题落到三档之一:能完全交给 agent(Delegate)、要 agent 做但人逐 diff 评审(Review)、还是只有人能持有判断(Own)。分档的尺子,就是内核第②步那条可验证性梯度。

"Trust but verify" is not a slogan but a copyable division of labor. Each task drops into one of three tiers along a single question: fully delegable to the agent (Delegate), agent does it but a human reviews diff by diff (Review), or only a human can hold the judgment (Own). The ruler that sorts the tiers is the kernel's step-② verifiability gradient.

受力分析 · 一道问题定档。问:"这步的对错,机器能不能廉价且确定地判?"——能,则 Delegate;半能(机器查得了形式、查不了意图),则 Review;不能(构成性判断:何为对、风险容忍、信任边界),则 Own。这与 ENG·00 的可验证性梯度同一把尺:可机检的并入①充裕、被自动化;构成性的下沉④、留给人。三档不是按"重要性"分,是按可验证性分——这是最常被搞反的地方。

Force analysis · one question sets the tier. Ask: "can a machine judge this step's correctness cheaply and deterministically?" Yes: Delegate. Half (the machine checks form but not intent): Review. No (constitutive judgment: what counts as correct, risk tolerance, trust boundaries): Own. This is ENG·00's verifiability gradient as one ruler: the machine-checkable joins ① abundance and gets automated; the constitutive sinks to ④ and stays with people. The tiers split by verifiability, not by "importance" (the most commonly reversed point).

Delegate · 完全交办Delegate · hand off
  • 样板代码、CRUD、格式与重命名
  • Boilerplate, CRUD, formatting, renames
  • 有测试覆盖的重构
  • Refactors under test coverage
  • 补单测 / 写文档 / 跑 lint
  • Adding unit tests / docs / running lint
Own · 只有人能持有Own · only humans hold
  • "何为对"的判据、产品取舍
  • The bar for "correct," product trade-offs
  • 信任边界、权限、安全敏感接缝
  • Trust boundaries, permissions, security seams
  • 不可逆 / 大爆炸半径的架构决策
  • Irreversible / large-blast-radius architecture
FIG. E6.0 / DELEGATE · REVIEW · OWN · 可逆性 × 爆炸半径 看懂:任务落在平面哪个区,就归哪一档 Read: where a task lands on the plane sets its tier
可逆性 → 易退回 ····· 难退回 / 不可逆 reversibility → easy to undo ····· hard / irreversible 爆炸半径 ↑ 局部 ····· 全局 blast radius ↑ local ····· systemic DELEGATE · 完全交办 DELEGATE 低半径 × 易退回 = 放心给 agent low radius × easy undo = safe for the agent REVIEW · 逐 diff 评审 REVIEW 机器查得了形式 · 查不了意图 machine checks form, not intent OWN · 只有人能持有 OWN 高半径 × 不可逆 = 承重判断 high radius × irreversible = load-bearing 格式化 / 重命名format / rename 有测试的重构tested refactor 新业务逻辑new business logic 数据迁移 / API 契约data migration / API contract 删库 / 转账 / 权限变更drop DB / transfer / perms 架构接缝architecture seam 模型变强 → Review 档逐季缩小、向 Delegate 迁 models strengthen → Review shrinks, migrates to Delegate
三档不是按"重要性"排,是按一道物理量定位:可逆性(错了能不能便宜地退回)× 爆炸半径(错了会波及多大)。低半径 × 易退回落到左下的 Delegate;高半径 × 不可逆落到右上的 Own;中间那条对角带是 Review——机器查得了形式、查不了意图,必须人逐 diff 看。注意那条朱蓝虚线箭头:模型每强一档,你补几条 eval,Review 带就向左下退、把任务交还给 Delegate——这正是 ENG·01 杠杆点上移在评审面的样子。下面的 INSTRUMENT 07 把这张平面做成可拖的滑杆。
The three tiers are sorted not by "importance" but by two physical quantities: reversibility (can a mistake be cheaply undone) × blast radius (how far it spreads). Low radius × easy undo lands at lower-left, Delegate; high radius × irreversible lands at upper-right, Own; the diagonal band between is Review — the machine checks form but not intent, so a human must read it diff by diff. Note the dashed arrow: each model generation, with a few added evals, pushes the Review band down-left, handing tasks back to Delegate — ENG·01's climbing leverage seen on the review face. INSTRUMENT 07 below turns this plane into draggable sliders.

中间那档最难,也最值钱。Review = agent 生成、人逐 diff 评审:新业务逻辑、数据迁移、外部 API 契约、性能敏感路径。法则有两条:其一,评审 diff,不评审产物——盯改了什么,而非只看跑不跑得起来;其二,以测试为目标交办——让 agent 先写出会失败的测试、再写实现,人审"测试是否锁住了正确的意图"。Review 档随模型变强会缩小:今天要逐 diff 看的,明天可能加几条 eval 就降为 Delegate——这正是 ENG·01 杠杆点上移在评审面的样子。

The middle tier is the hardest and the most valuable. Review = agent generates, human reviews diff by diff: new business logic, data migrations, external API contracts, performance-sensitive paths. Two rules. First, review the diff, not the artifact: watch what changed, not just whether it runs. Second, delegate toward a test: have the agent write the failing test first, then the implementation, and the human reviews whether the test locks the right intent. The Review tier shrinks as models strengthen: what needs diff-by-diff review today may drop to Delegate tomorrow with a few added evals; this is ENG·01's climbing leverage seen on the review face.

三档分工先于工具,也长于工具。这张矩阵讲的是"哪些判断由人持有",不是"用哪个 IDE"。从业者把人要持有的判断拆成三件:定 intent(要什么、为什么)、设 constraints(架构、标准、非目标)、拥有 verification(测试、评审、质量门)。这三件正好对应组织卷里"判断退守到稀缺节点"——agent 接管执行,人持有这三件不可外包的判断。配套的三条协作守则同样可照做:先计划(从 chat 进到 plan,是 spec-driven 的第一步)、保持上下文干净(别让无关历史稀释窗口,呼应 ENG·02 的有效窗口)、知道何时叫停(agent 在原地打转时,问题往往在规格不在执行,停下来改规格而不是再试一轮)。

The three tiers precede tools and outlive them. This matrix is about "which judgments humans hold," not "which IDE." Practitioners split the human-held judgment into three: set intent (what, why), set constraints (architecture, standards, non-goals), and own verification (tests, review, quality gates). These three map exactly onto the organization volume's "judgment retreats to scarce nodes" — the agent takes over execution, the human holds these three non-outsourceable judgments. The accompanying three collaboration rules are equally copyable: plan first (moving from chat to plan is spec-driven's first step), keep context clean (do not let irrelevant history dilute the window, echoing ENG·02's effective window), and know when to stop (when the agent spins in place the problem is usually in the spec, not execution; stop and edit the spec rather than try another round).

证据 · 级 ⅣDelegate / Review / Own 三档分工矩阵,源 Graziano《AI-Native Engineering》(Day 1,团队层分工)转引 OpenAI《Build an AI-Native Engineering Team》;"逐 diff 评审、以测试为目标、保持上下文干净"为其 Day 3 协作守则。"评审 diff 不评审产物"与本系列验证篇的"独立 checker 是唯一承重墙"同源——见验证篇 ↗
Evidence · grade ⅣThe Delegate / Review / Own matrix comes from Graziano's AI-Native Engineering (Day 1, team-level division) via OpenAI's Build an AI-Native Engineering Team; "review by diff, target a test, keep context clean" are its Day 3 collaboration rules. "Review the diff, not the artifact" shares a root with this series' Verification chapter ("the independent checker is the one load-bearing wall"); see the Verification chapter ↗.
INSTRUMENT 07 · 分档计算器INSTRUMENT 07 · Delegation-Tier Calculator ● LIVE
REVIEW
DELEGATE完全交办hand off
REVIEW逐 diff 评审review by diff
OWN只有人能持有only humans hold
检验信号Test signal

先行:Review 档的任务逐季向 Delegate 迁移(说明你在补 eval、在上移杠杆)。反指标:什么都塞进 Review、人审带宽被吞——那是没在分档,是在用人肉追指数。Leading: Review-tier tasks migrate to Delegate quarter by quarter (you are adding evals and climbing the leverage). Counter-signal: everything lands in Review and human bandwidth is eaten; that is not tiering but chasing an exponential by hand.

三档是动态的:靠补护栏把任务往左推

The tiers are dynamic: push tasks left by adding guardrails

Delegate / Review / Own 这三档最容易被误用成一张静态的"任务清单"——把任务一次性固定在某一档,然后照着分工。但三档真正的价值在于它会动:一个今天必须 Review(人逐 diff 看)的任务,一旦你为它补上了能自动判对错的护栏(一条 eval、一个类型约束、一个独立 checker),它就可以左移到 Delegate(放手让 agent 全自动跑)。所以这张矩阵不是用来"安排谁做什么"的,是用来追踪你的护栏在往哪长的:健康的团队里,任务会季度性地从 Review 往 Delegate 迁移,因为护栏在变厚;而 Own 那一档(构成性判断:何为对、风险边界、架构取舍)几乎不动,因为它本就不可机检、不该下放。把这件事和 ENG·03 的可验证性梯度对上,就是同一把尺:一个任务能否左移,取决于"它的对错能否被廉价且确定地机检"——能,就左移;不能,就留在右边给人。〔源 Graziano《AI-Native Engineering》Day 1 Delegate/Review/Own 矩阵,证据级 Ⅳ 一手从业者[R1]

Delegate / Review / Own is most easily misused as a static "task list" — pinning each task to a tier once and dividing labor accordingly. But the tiers' real value is that they move: a task that today must be Reviewed (a human reading every diff) can, once you add a guardrail that auto-decides correctness (an eval, a type constraint, an independent checker), shift left to Delegate (handed fully to the agent). So this matrix is not for "assigning who does what" but for tracking where your guardrails are growing: in a healthy team tasks migrate quarter by quarter from Review toward Delegate because the guardrails are thickening; while the Own tier (constitutive judgment: what is right, risk boundaries, architectural trade-offs) barely moves, because it is not machine-checkable and should not be delegated. Mapped onto ENG·03's verifiability gradient, it is the same ruler: whether a task can shift left depends on "can its correctness be machine-checked cheaply and deterministically" — if yes, shift left; if no, keep it on the right for people. [Source: Graziano, AI-Native Engineering Day 1 Delegate/Review/Own matrix, grade Ⅳ practitioner. [R1]]

Own 那一档为什么永远不空

Why the Own tier is never empty

三档里 Delegate 在变大、Review 在收缩,一个自然的疑问是:随着护栏越补越厚,Own 那一档会不会也终将被清空、人最终无事可做?答案是不会,而且原因是结构性的,不是"现在还做不到"的暂时性。Own 那一档装的是构成性判断——决定何为对、定义风险容忍、划信任边界、做架构取舍——这些之所以留在 Own,不是因为机器暂时不够强,而是因为它们没有一个独立于人的判据可供机检。"这个产品该不该有这个功能""这个性能和复杂度的取舍我们能不能接受""这个对外契约一旦定了就要长期负责,我们认不认"——这些问题的"对"不是一个客观事实,而是一个由人的价值、处境、责任共同定义的判断。你可以用 eval 把"符不符合已定的标准"机检掉,但"标准本身该是什么"这件事,定义它的动作本身就只能由人来做——一旦让机器来定标准,你只是把判断换了个地方藏起来,没有消除它。这正是内核第④步在工程面的落点:充裕和护栏清空的是可机检的那半,留下的恰好是构成性的那半,而后者是人回到工程师本职的地方。可证伪信号:若有人声称把 Own 那一档也自动化了,去看他到底自动化的是什么——大概率是把"按已定标准判合不合格"自动化了(那本就该自动化),而"标准该是什么"这个真正的 Own 判断,要么还藏在某个人手里,要么被悄悄默认成了模型训练分布里的某个值(那是把判断让渡给了一个没人为之负责的来源)。

With Delegate growing and Review shrinking, a natural question: as guardrails thicken, will the Own tier eventually be emptied too, leaving humans nothing to do? The answer is no, and the reason is structural, not a temporary "we cannot do it yet." The Own tier holds constitutive judgments — deciding what is correct, defining risk tolerance, drawing trust boundaries, making architectural trade-offs — and these stay in Own not because machines are momentarily too weak but because they have no criterion independent of humans to machine-check against. "Should this product have this feature," "can we accept this performance-versus-complexity trade-off," "this external contract, once set, carries long-term responsibility — do we own it" — the "correct" of these questions is not an objective fact but a judgment defined jointly by human values, situation, and responsibility. You can machine-check "does it conform to the set standard" with an eval, but "what the standard itself should be" — the act of defining it can only be done by people; let a machine set the standard and you have merely hidden the judgment elsewhere, not eliminated it. This is the kernel's step ④ landing on the engineering face: abundance and guardrails clear the machine-checkable half and leave precisely the constitutive half, which is where people return to the engineer's true work. Falsifiable signal: if someone claims to have automated the Own tier too, look at what they actually automated — most likely "judging pass/fail against a set standard" (which should be automated), while the real Own judgment of "what the standard should be" is either still in someone's hands or quietly defaulted to some value in the model's training distribution (which is ceding the judgment to a source no one is responsible for).

FIG. E6.1 / TRUST-BOUNDARY ZONES · 信任边界的同心圈 看懂:能力越往外圈走、回退越难、爆炸半径越大,批准权就越要收回到人手里 Read: the farther out the ring, the harder to undo and the larger the blast radius — so approval is pulled back toward humans
只读 read-only 查询 / 搜索 / 读文件 query / search / read 限域写 scoped-write 改工作区文件 / 开 PR / 沙箱跑 edit workspace / open PR / sandbox run 特权 · 不可逆 privileged · irreversible prod 部署 / 删库 / 转账 / 改权限 prod deploy / drop DB / transfer / perms 谁来批准 who approves agent 自批 · 无需人 agent self-approves · no human ≈ Delegate 档 ≈ Delegate tier 人审 diff 后合并 human reviews diff, then merges ≈ Review 档 ≈ Review tier 具名负责人显式批 + 留痕 named owner approves explicitly + logged ≈ Own 档 · 永不外包 ≈ Own tier · never outsourced 向外 → 回退更难 · 爆炸半径更大 outward → harder to undo · larger blast radius
权限不是一个"信不信任 agent"的开关,而是一组同心圈:能力按"回退难度 × 爆炸半径"分层,批准权随圈层向外逐级收回人手。这正是 INSTRUMENT 07 那把"可逆性 × 爆炸半径"的尺子画成空间——最外圈的特权动作不会因为模型更强而内移,因为"该不该按下这个不可逆按钮"是构成性判断,永远落在 Own 那一档。 Permission is not one "do we trust the agent" switch but a set of concentric rings: capability is tiered by "cost-to-undo × blast radius," and approval is pulled back toward humans as you move outward. This is INSTRUMENT 07's "reversibility × blast-radius" ruler drawn as space — the outermost privileged actions do not migrate inward as models improve, because "should this irreversible button be pressed" is a constitutive judgment that always lands in the Own tier.
ENG
07
SPEC LADDER · 规格成熟度
SPEC LADDER · MATURITY
机理 · 阶梯
Mechanism · Ladder

规格不是开关,是一道成熟度阶梯

A spec is not a switch but a maturity ladder

ENG·03 说"规格要能被机器检验",但"写规格"不是有或无的开关。它分三阶:规格先写一次(Spec-First)、规格随代码同步活着(Spec-Anchored)、规格成为单一真源、代码由它生成(Spec-as-Source)。爬得越高,"对不对"越能被自动收敛。

ENG·03 says "specs must be machine-checkable," but "writing a spec" is not an on/off switch. It has three rungs: the spec is written once (Spec-First), the spec stays alive in sync with code (Spec-Anchored), the spec becomes the single source from which code is generated (Spec-as-Source). The higher you climb, the more "correctness" can converge on its own.

受力分析 · 为何要分阶。把"写规格"讲成一刀切,会逼团队在还没有验证基建时硬上 Spec-as-Source,结果规格沦为没人维护的死文档。阶梯让投入与回报匹配:每升一阶,规格从"一次性意图说明"变成"可回归的承重工件",而升阶的前提是下一阶的可机检条件已就位。这与 ENG·03 缝合——可机检是纵轴(深度),成熟度是横轴(耐久度);一条规格只有同时可机检又活着,才真正成为生成循环的目标函数。

Force analysis · why rungs. Treating "write a spec" as all-or-nothing forces teams onto Spec-as-Source before the verification infrastructure exists, and the spec rots into a document no one maintains. The ladder matches investment to payoff: each rung turns the spec from a one-shot intent brief into a regressible load-bearing artifact, and climbing a rung presupposes that the next rung's machine-checkable conditions are in place. This stitches to ENG·03: machine-checkability is the vertical axis (depth), maturity is the horizontal axis (durability); a spec becomes the generation loop's objective function only when it is both checkable and alive.

Spec-First · 规格先行Spec-First
动手前先写规格、再生成。治 vibe-coding,但规格写完即与代码分叉漂移。Write the spec before generating. Cures vibe-coding, but the spec drifts from code once written.
Spec-Anchored · 规格锚定Spec-Anchored
规格与代码同住同版、像代码一样被评审;CI 检查二者一致。Spec lives and versions with code, reviewed like code; CI checks the two stay consistent.
Spec-as-Source · 规格即源Spec-as-Source
规格是唯一真源,代码是它的产物;改行为先改规格。门槛最高、回报最大。The spec is the single source, code is its output; change behavior by changing the spec. Highest bar, largest payoff.

可拷贝判据 · 你在第几阶。问三句:(1) 改一个行为,你是先改代码还是先改规格?先代码 = 阶 Ⅰ。(2) 规格和代码不一致时,CI 会红吗?不会 = 还没到阶 Ⅱ。(3) 规格里有没有一层"硬规则"(如 constitution.md)是 agent 不许违反、且被强制层挡住的?没有 = 还没摸到阶 Ⅲ 的门。多数团队真实位置在阶 Ⅰ 到 Ⅱ 之间,且常误以为自己在阶 Ⅲ——这正是规格沦为合规摆设的根因。

Copyable test · which rung you are on. Ask three things. (1) To change a behavior, do you edit code first or the spec first? Code first = rung Ⅰ. (2) When spec and code disagree, does CI go red? No = not yet rung Ⅱ. (3) Is there a "hard-rule" layer (like a constitution.md) the agent may not violate and that an enforcement layer blocks? No = not yet at rung Ⅲ's door. Most teams sit between rungs Ⅰ and Ⅱ and often believe they are at rung Ⅲ (exactly why specs become compliance theater).

证据 · 级 Ⅳ规格成熟度三阶 Spec-First → Spec-Anchored → Spec-as-Source,源 Graziano《AI-Native Engineering》(Day 5–6),转引 GitHub Spec-kit(规格与代码同住、PR 即评审门、constitution.md 作硬规则强制层)。本系列在 ENG·03 强调"可机检",此阶梯补其缺的"耐久度演进"维度——二者正交、互补。
Evidence · grade ⅣThe three-rung spec maturity ladder (Spec-First → Spec-Anchored → Spec-as-Source) comes from Graziano's AI-Native Engineering (Days 5–6) via GitHub Spec-kit (specs live with code, the PR is the review gate, constitution.md as a hard-rule enforcement layer). This series stresses "machine-checkable" in ENG·03; the ladder adds its missing "durability evolution" axis; the two are orthogonal and complementary.

为什么大多数团队卡在阶 Ⅰ 到 Ⅱ 之间。升到阶 Ⅱ(规格锚定)需要一个常被低估的基建:一个能在"规格与代码不一致"时让 CI 变红的检查器。没有它,规格和代码会无声分叉——人改了代码忘了改规格,或反过来——而没有任何信号提醒。这正是 ENG·03 的"可机检"在阶梯上的作用:可机检是让规格活着的前提,因为只有可机检的规格才能被 CI 持续比对。所以这条阶梯和 ENG·03 是正交两轴:纵轴是"规格有多可机检"(深度),横轴是"规格活得多久"(耐久度)。一条规格只有同时在两轴上都够高,才真正成为生成循环的目标函数——可机检但没人维护的规格会腐烂,活着但不可机检的规格只是一篇没有约束力的作文。

Why most teams stall between rungs Ⅰ and Ⅱ. Climbing to rung Ⅱ (Spec-Anchored) needs an often-underestimated piece of infrastructure: a checker that turns CI red when "spec and code disagree." Without it, spec and code fork silently — someone edits the code and forgets the spec, or vice versa — with no signal to flag it. This is exactly where ENG·03's "machine-checkable" does its work on the ladder: machine-checkability is the precondition for the spec to stay alive, because only a machine-checkable spec can be continuously diffed by CI. So this ladder and ENG·03 are two orthogonal axes: the vertical is "how machine-checkable the spec is" (depth), the horizontal is "how long the spec stays alive" (durability). A spec becomes the generation loop's objective function only when it is high on both — a machine-checkable spec no one maintains rots, and a living but un-checkable spec is just a non-binding essay.

阶 Ⅲ 的那道硬规则层,是把"何为对"从建议升格为强制。Spec-as-Source 的标志不是"有规格",而是规格里存在一层 agent 不许违反、且被一个强制层挡住的硬规则(如一份 constitution.md)。区别在执行力:阶 Ⅰ/Ⅱ 的规格是"应该这样",agent 可以违反、人事后发现;阶 Ⅲ 的硬规则是"不能这样",违反在生成或合并时就被自动拦下。这把 ENG·03 的"机器可检验"推到极致——不只检验产出对不对,还检验过程有没有越过不可逾越的红线。也正因门槛最高、回报最大,它最容易被误判:很多团队以为自己在阶 Ⅲ,其实只是有一堆没人强制的规范文档,那仍是阶 Ⅰ。

Rung Ⅲ's hard-rule layer promotes "what is correct" from suggestion to enforcement. The mark of Spec-as-Source is not "having a spec" but a layer of hard rules the agent may not violate, blocked by an enforcement layer (such as a constitution.md). The difference is teeth: rung Ⅰ/Ⅱ specs say "it should be this way," and the agent can violate them with humans noticing later; rung Ⅲ hard rules say "it cannot be this way," and a violation is intercepted automatically at generation or merge. This pushes ENG·03's machine-checkability to its limit — checking not only whether the output is correct but whether the process crossed an inviolable red line. And precisely because the bar is highest and the payoff largest, it is the most commonly misjudged: many teams believe they are at rung Ⅲ when they merely have a pile of unenforced convention docs, which is still rung Ⅰ.

检验信号Test signal

先行:改行为时人自然先去改规格。反指标:规格目录最后更新在三个月前,代码却天天变——规格已死,你掉回了阶 Ⅰ 之下。Leading: to change behavior, people instinctively edit the spec first. Counter-signal: the spec folder was last touched three months ago while code changes daily; the spec is dead and you have fallen below rung Ⅰ.

让规格像代码一样被评审:constitution 作硬规则层

Review the spec like code: a constitution as the hard-rule layer

规格要爬到 Spec-as-Source 那一档,光靠"先写规格"的纪律不够——纪律会松。真正让规格活下去的,是把它放进和代码同一套可评审、可 diff、可强制的机制里。一个具体可照做的做法来自 Spec-kit:规格文件与代码同住一个仓库,每次改规格走 PR、像评审代码一样评审规格的改动;再立一份 constitution.md 作为硬规则强制层——把那些"无论如何不能违反"的约束(安全红线、架构不变量、对外契约)写进去,由流水线在每次生成/提交时强制校验,agent 和人都不能绕过。这一层的妙处在于它把规格的两种性质分开了:大部分规格是"该怎么做"的可演进描述,走 PR 评审即可;少数是"绝不能怎样"的不变量,需要被当作硬约束机器强制。这恰好把 ENG·03 的"可机检"和这一节的"地位"缝在一起:constitution 既是可机检的(机器能验违没违),又是地位最高的(它是规格里不可协商的那部分)。可证伪信号:若你的团队"有规格"但没有任何一条规格是被流水线硬性强制的、全靠人自觉遵守,那这套规格大概率会沿"纪律松弛"那条路慢慢退回 Spec-First 甚至更低。〔源 Graziano《AI-Native Engineering》Day 5–6 Spec-kit(PR 即评审门、constitution.md 硬规则层),证据级 Ⅳ 一手从业者[R5][R1]

For a spec to climb to the Spec-as-Source rung, the discipline of "write the spec first" is not enough — discipline slackens. What truly keeps a spec alive is placing it in the same reviewable, diffable, enforceable mechanism as code. One concrete copyable practice comes from Spec-kit: spec files live in the same repo as code, every spec change goes through a PR, and a spec change is reviewed like code; then a constitution.md serves as a hard-rule enforcement layer — write the "must not be violated under any circumstances" constraints (security red lines, architectural invariants, external contracts) into it, enforced by the pipeline on every generation/commit, bypassable by neither agent nor human. The elegance of this layer is that it separates two natures of a spec: most of the spec is an evolvable description of "how to do it," which a PR review suffices for; a few are "must never" invariants that must be machine-enforced as hard constraints. This stitches ENG·03's "machine-checkable" to this sheet's "standing": a constitution is both machine-checkable (a machine verifies violation) and of the highest standing (the non-negotiable part of the spec). Falsifiable signal: if your team "has specs" but not a single spec rule is hard-enforced by the pipeline, all relying on voluntary compliance, that spec set will most likely slide back along the "discipline slackens" path to Spec-First or below. [Source: Graziano, AI-Native Engineering Day 5–6 Spec-kit (PR as review gate, constitution.md as hard-rule layer), grade Ⅳ practitioner. [R5][R1]]

规格不是开关,是阶梯:怎么判断该爬到哪一档

A spec is not a switch but a ladder: judging which rung to climb to

把成熟度阶梯当成"越高越好、所有项目都该冲到 Spec-as-Source"是一个常见的误用——它会让团队为低风险、快迭代的探索性代码也强行套上重规格,把本该轻快的事拖慢。阶梯的正确读法是:不同的代码,该停在不同的档,由这段代码的"改一次的代价"和"错一次的代价"共同决定。一段探索性的、随时可能整段扔掉的原型,停在 Spec-First 甚至更轻就够了——为它写重规格是浪费,因为它的寿命短、错了也便宜。而一段对外契约、一段被许多下游依赖的核心模块、一段安全敏感的代码,则值得爬到 Spec-Anchored 乃至 Spec-as-Source——因为它改一次牵动很广、错一次代价很高,把"何为对"固定在一份权威规格里的收益,远超维护规格的成本。这条判据和 ENG·10 的边界、INSTRUMENT 07 的爆炸半径是同一把尺:越靠近系统接缝、爆炸半径越大的代码,越值得往阶梯上爬。所以"规格成熟度"不是一个团队级的统一档位,而是一张随代码重要性变化的地图。可证伪信号:若你的团队要么所有代码都没规格(全在阶 0)、要么所有代码都套重规格流程(不分轻重一刀切),那都是没在按代价分档——前者会在核心模块上栽跟头,后者会被自己强加的规格负担拖慢探索。〔源 Graziano《AI-Native Engineering》Day 5–6 SDD"规格不是开关而是阶梯",证据级 Ⅳ 一手从业者[R1]

Treating the maturity ladder as "higher is always better, every project should push to Spec-as-Source" is a common misuse — it makes teams force heavy specs onto low-risk, fast-iterating exploratory code, dragging down what should be light and quick. The ladder's correct reading is: different code should stop on different rungs, decided jointly by that code's "cost to change once" and "cost to get wrong once." An exploratory prototype that may be thrown away wholesale at any time stops at Spec-First or lighter — writing a heavy spec for it is waste, because it is short-lived and cheap to get wrong. But an external contract, a core module many downstreams depend on, a piece of security-sensitive code is worth climbing to Spec-Anchored or even Spec-as-Source — because changing it once moves a lot and getting it wrong once is costly, and the payoff of nailing "what is correct" into an authoritative spec far exceeds the cost of maintaining it. This criterion is the same ruler as ENG·10's boundaries and INSTRUMENT 07's blast radius: the closer to a system seam and the larger the blast radius, the more the code is worth climbing the ladder for. So "spec maturity" is not a single team-wide rung but a map that varies with code importance. Falsifiable signal: if your team either has no spec for any code (all on rung 0) or forces the heavy spec process onto all code (one-size-fits-all regardless of stakes), neither is tiering by cost — the former trips on core modules, the latter is dragged down on exploration by a self-imposed spec burden. [Source: Graziano, AI-Native Engineering Day 5–6 SDD "a spec is not a switch but a ladder," grade Ⅳ practitioner. [R1]]

把规格成熟度与可机检形式合起来看,会得到这一卷一个简洁的判据:一份好规格,既要在形式上尽量靠近可机检(ENG·03 的光谱左端),又要在地位上爬到与代码风险相称的那一档(这一节的阶梯)。两者都达标的规格,才能真正给生成循环当目标函数——形式让机器能验,地位让团队会改。缺任何一个,规格都会退化:只有形式没有地位的规格被写完就丢,只有地位没有形式的规格全靠人读人审、机器使不上力。这就是为什么"写了规格"不是终点,"规格活着且可机检"才是。

Take spec maturity and machine-checkable form together and you get a compact criterion for this volume: a good spec must in form sit as far left toward machine-checkable as possible (the left end of ENG·03's spectrum) and in standing climb to the rung commensurate with the code's risk (this sheet's ladder). Only a spec that meets both can truly serve as the generation loop's objective function — form lets the machine verify, standing makes the team edit. Missing either and the spec degrades: form without standing is written then dropped; standing without form leans on people to read and review, with the machine unable to help. This is why "wrote a spec" is not the finish line; "the spec is alive and machine-checkable" is.

FIG. E8.0 / THE SPEC MATURITY LADDER · 可机检份额逐档抬升 看懂:每爬一档,规格里"机器能自己验"的份额变大,人审的负担变小 Read: each rung climbed grows the machine-checkable share and shrinks the human-review burden
机器能验的份额 ↑ machine-checkable share ↑ 规格成熟度 · 越往右越多"对"被钉成可机检的判据 → spec maturity · further right, more of "correct" is nailed into machine-checkable criteria → ≈ 全自动验 ≈ fully auto-checked 人读人审 human-read 散文意图 prose intent 范例输入输出 examples 类型 / 模式 types / schema 性质测试 property tests 形式化证明 formal proof 机器能自验的份额 machine-checkable share 仍需人判的份额 still needs human judgment
同一句"对",从散文意图爬到形式化证明,并不是变得"更对",而是越来越多地被翻译成机器能独立复验的判据——蓝色份额每档抬升一截。这条阶梯不是"越高越好":该停在哪一档由代码的改动成本与出错成本决定(见上文判据),但只要往右爬一档,规格当生成循环目标函数的能力就强一分。 The same "correct," climbing from prose intent to formal proof, does not become more correct — it gets translated, rung by rung, into criteria a machine can re-verify on its own; the blue share steps up each rung. The ladder is not "higher is always better": where to stop is set by the code's cost-to-change and cost-to-get-wrong (the criterion above), but every rung climbed strengthens the spec's power to serve as the generation loop's objective function.
ENG
08
TRUST BOUNDARY · 安全边界
TRUST BOUNDARY
重画 · 一等结构
Redraw · First-class

信任边界是结构里的一等元素,不是事后审查

The trust boundary is a first-class structural element, not an afterthought audit

agent 什么都碰,而且很快。它能读你的代码库、调外部工具、连第三方 MCP server——每一处都是攻击面。安全敏感的接缝必须前置、显式、由人把守:最小权限、含住爆炸半径。这是架构卷"信任边界即一等结构"在工程面的落地。

Agents touch everything, fast. They read your codebase, call external tools, connect to third-party MCP servers; every one is an attack surface. Security-sensitive seams must be front-loaded, explicit, and human-guarded: least privilege, contained blast radius. This is the Architecture chapter's "trust boundary as first-class structure" landed on the engineering face.

受力分析 · 新攻击面从何而来。把工具暴露给 agent,等于把"谁能执行什么"的边界交到一个会被自然语言操纵的系统手上。三类新风险:tool poisoning(MCP 工具描述里藏指令,诱导 agent 越权调用)、prompt injection(被处理的数据里夹带指令,劫持 agent 行为)、凭证泄露(agent 把密钥、token 写进日志或回传)。三者同根:把不可信输入接到了有权限的执行器上,结构性地长出来,不是模型偶发 bug——只能用边界挡,挡不住的靠权限收窄爆炸半径。

Force analysis · where the new attack surface comes from. Exposing tools to an agent hands the "who may execute what" boundary to a system steerable by natural language. Three new risks: tool poisoning (instructions hidden in an MCP tool's description, luring the agent into out-of-scope calls), prompt injection (instructions smuggled in processed data, hijacking agent behavior), and credential leakage (the agent writes keys or tokens into logs or sends them back). All three share a root: wiring untrusted input to a privileged executor grows them structurally; they are not occasional model bugs. You stop what you can with boundaries, and what you cannot stop you contain by narrowing the blast radius.

可拷贝清单 · 最小权限边界

Copyable checklist · least-privilege boundary

证据 · 级 ⅣMCP 安全(tool poisoning / prompt injection / 凭证泄露 / 最小权限清单)与 Continuous AI(CI/CD 里 agentic 工作流默认只读、写操作须显式 safe-output、人审后合并),源 Graziano《AI-Native Engineering》(Day 4 MCP 安全、Day 7 Continuous AI)转引 GitHub《Continuous AI in Practice》。"信任边界即一等结构、最小权限、含住爆炸半径"与本系列架构篇 ↗同源。
Evidence · grade ⅣMCP security (tool poisoning / prompt injection / credential leakage / least-privilege list) and Continuous AI (agentic CI/CD workflows read-only by default, writes declared safe-output, merged after human review) come from Graziano's AI-Native Engineering (Day 4 MCP security, Day 7 Continuous AI) via GitHub's Continuous AI in Practice. "Trust boundary as first-class structure, least privilege, contained blast radius" shares a root with this series' Architecture chapter ↗.

为什么"事后审查"这条旧防线失效。传统安全把信任边界当成一道事后的关卡:先建,临上线再审。这在 agent 时代行不通,原因有二。其一是速度——agent 在几秒内就能调几十个工具、改几百个文件,等人来审,越权早已发生。其二是攻击面的种类变了:过去的攻击面是确定的代码路径,可以静态扫描;agent 的攻击面是自然语言,而自然语言无法被穷举地静态分析——一段藏在工具描述或被处理数据里的指令,看起来就是普通文本。所以信任边界必须从"事后审查"前移成"结构里的一等元素":在 agent 能做任何事之前,它能做什么就已经被权限、沙箱、通道隔离结构性地限定死了。这和 ENG·10"边界即判断节点"是同一件事的两面——架构边界既是正确性的接缝,也是安全的接缝。

Why the old "review afterward" line of defense fails. Traditional security treats the trust boundary as a checkpoint after the fact: build first, audit before launch. That does not work in the agent era, for two reasons. One is speed — an agent can call dozens of tools and change hundreds of files in seconds; by the time a human reviews, the over-privileged action has already happened. The other is that the kind of attack surface changed: the old surface was determinate code paths that could be statically scanned; the agent's surface is natural language, which cannot be exhaustively statically analyzed — an instruction hidden in a tool description or in processed data simply looks like ordinary text. So the trust boundary must move from "review afterward" to "a first-class element in the structure": before the agent can do anything, what it can do is already bounded structurally by permissions, sandbox, and channel isolation. This is two faces of the same thing as ENG·10's "boundary as judgment node" — an architecture boundary is both a correctness seam and a security seam.

把"接一个新工具"当成"引一个新依赖"来审。这是一条可照做的纪律。我们早就学会不随便 npm install 一个陌生包——会看下载量、看维护者、看它要什么权限。接一个 MCP server / 工具应该走同一道关:它的工具描述里有没有可疑指令(tool poisoning)?它要读写哪些资源、是不是远超它该有的范围?它会不会把数据回传到你不控制的地方?Continuous AI 给出流水线层的默认姿态——CI/CD 里的 agentic 工作流默认只读,任何写操作必须被显式声明为 safe-output、并经人审后才合并。默认只读这一条尤其关键:它把"出事"的默认方向从"已经写坏了再回滚"翻转成"想写必须先举手",爆炸半径在结构上就被限制在接近零。

Vet "connecting a new tool" as you would "adding a new dependency." This is a copyable discipline. We long ago learned not to npm install a stranger's package casually — we check downloads, maintainers, what permissions it wants. Connecting an MCP server / tool should pass the same gate: are there suspicious instructions in its tool descriptions (tool poisoning)? Which resources does it read and write, and is that far beyond what it should need? Could it send data somewhere you do not control? Continuous AI gives the pipeline-level default posture — agentic workflows in CI/CD are read-only by default, and any write must be explicitly declared safe-output and merged only after human review. The read-only default matters most: it flips the default direction of "something goes wrong" from "it already wrote damage, now roll back" to "to write, it must first raise its hand," nailing the blast radius structurally near zero.

检验信号Test signal

先行:你能一句话说出每个 agent 能碰什么、不能碰什么。反指标:图省事给了通配权限、agent 跑在你的主机账户上——一次 prompt injection 就是全盘失守。Leading: you can state in one sentence what each agent can and cannot touch. Counter-signal: wildcard permissions for convenience, the agent running under your main account: one prompt injection and the whole thing is lost.

能力接口要可组合,但每个接口单独授权

Capability interfaces must be composable, yet each authorized alone

"可组合的能力接口"(skills / MCP / CLI)是 agent 扩展触达范围的方式,但"可组合"和"安全"之间有一个必须当面解决的张力:你既希望 agent 能灵活地把多个能力拼起来完成复杂任务,又不希望任何一个被投毒的接口能把整套权限带跑。解法不是在二者间折中,而是把组合性放在能力的接口形状上、把授权放在每个接口的边界上——接口可以自由拼接,但每个接口携带的能力都被单独授权、单独审计。一个 MCP 工具就算被投毒,它能造成的影响也被限制在它被授予的那一小块能力里(只读这个目录、只调这个只读 API),出不了那道边界。这正是 ENG·04"可组合接口"和本节"最小权限"在同一个对象上的合流:接口的能力面要可组合(让 agent 灵活),接口的权限面要最小且独立(让爆炸半径可控)。把这两面分开看,"既要灵活又要安全"就不再是矛盾,而是同一个接口的两个正交属性。可证伪信号:若给 agent 新增一个能力时,你发现自己不得不连带放开一堆不相关的权限(因为它们捆在一把钥匙上),说明你的接口把能力面和权限面耦合了——这正是一次投毒能全盘失守的结构成因。

"Composable capability interfaces" (skills / MCP / CLI) are how an agent extends its reach, but between "composable" and "secure" sits a tension to resolve head-on: you want the agent to flexibly compose several capabilities for a complex task, yet you do not want any one poisoned interface to run off with the whole permission set. The solution is not a compromise between the two but putting composability on the interface shape of a capability and authorization on each interface's boundary — interfaces compose freely, but the capability each carries is authorized and audited separately. Even a poisoned MCP tool's impact is nailed to the small slice of capability it was granted (read this directory only, call this read-only API only) and cannot cross that boundary. This is where ENG·04's "composable interfaces" and this sheet's "least privilege" converge on one object: an interface's capability face must be composable (for agent flexibility), its permission face minimal and independent (for a controllable blast radius). Hold the two faces apart and "flexible yet secure" stops being a contradiction and becomes two orthogonal properties of one interface. Falsifiable signal: if adding one capability to an agent forces you to open a bundle of unrelated permissions (because they share one key), your interface has coupled the capability face to the permission face — exactly the structural cause by which one poisoning loses everything.

可组合不等于失控:接口让 agent 安全地扩展自己

Composable is not uncontrolled: interfaces let the agent extend itself safely

把这一节和 ENG·04 的"可组合接口"接到底,会看到一个对 AI-Native 工程很关键的设计姿态:agent 扩展能力的正确方式,是经由一组定义清晰、单独授权的接口,而不是给它一个无所不能的通用入口。诱惑总是存在的——给 agent 一个能执行任意 shell 命令的工具,它似乎"什么都能做",省去你定义一堆细粒度接口的麻烦。但这恰恰是把可组合性建在了沙子上:一个万能入口意味着它的能力面和权限面完全耦合,你无法表达"它可以做 A 和 B,但不能做 C",因为 A、B、C 共用同一个不受约束的通道。相反,把每个能力做成一个独立的接口(一个只读这个数据源的 skill、一个只调这个 API 的 MCP 工具、一个只在这个目录下操作的 CLI),可组合性反而更强——因为定义清晰的接口才能被可靠地拼接,而每个接口自带的权限边界让这种拼接不会越界。这就是为什么"可组合"和"最小权限"在好的接口设计里不是对立的,是互相成就的:清晰的接口边界同时服务于组合(让拼接可预测)和安全(让授权可表达)。可证伪信号:若你给 agent 的主要扩展方式是"一个能跑任意命令的口子",那你既没有真正的可组合性(拼接不可预测),也没有真正的最小权限(无法表达细粒度约束)——你只是把"省事"误当成了"灵活"。

Wiring this sheet to ENG·04's "composable interfaces" reveals a design stance crucial to AI-Native engineering: the correct way for an agent to extend its capability is through a set of clearly defined, separately authorized interfaces, not by giving it one omnipotent general entry. The temptation always exists — give the agent a tool that runs arbitrary shell commands and it seems to "do everything," sparing you the trouble of defining many fine-grained interfaces. But that builds composability on sand: an omnipotent entry means its capability face and permission face are fully coupled, and you cannot express "it may do A and B but not C" because A, B, C share one unconstrained channel. Conversely, making each capability an independent interface (a skill that reads only this data source, an MCP tool that calls only this API, a CLI that operates only under this directory) makes composability stronger — because only clearly defined interfaces can be reliably composed, and the permission boundary each interface carries keeps that composition from overreaching. This is why "composable" and "least privilege" are not opposed in good interface design but mutually enabling: a clear interface boundary serves both composition (making it predictable) and security (making authorization expressible). Falsifiable signal: if your main way to extend the agent is "a hole that runs arbitrary commands," you have neither real composability (composition is unpredictable) nor real least privilege (you cannot express fine-grained constraints) — you have merely mistaken "convenient" for "flexible."

ENG
09
FAILURE MODES · 失败模式学
FAILURE MODES
反例 · 为何非验不可
Anti-pattern · Why verify

为何非验不可:会猜的系统,错误会滚雪球

Why you cannot skip verification: a guessing system snowballs its errors

前面讲了一堆"怎么验证";这一张讲"为何非验不可"。trust-but-verify 的,是一组可证伪、可演示的失败模式——它们不是模型偶发故障,是会猜的系统的结构性副产物。认得出它们,才知道护栏该补在哪。

The sheets above covered "how to verify"; this one covers "why you cannot skip it." The cause behind trust-but-verify is a set of falsifiable, demonstrable failure modes: not occasional model glitches but structural by-products of a guessing system. Recognize them and you know where the guardrails go.

受力分析 · 一个共同的根。LLM 按"下一个 token 最可能是什么"生成,它没有"我不知道"这个自然状态——所以它的失败大多自信而错,而非沉默而空。这一条根长出下面五种模式。其中 snowball effect 最关键:早期一个小误解被后续每一步当作既定前提,沿多步指数放大——它单独就论证了"每个有意义的步骤都要有验证检查点",因为越往后纠错越贵。

Force analysis · one common root. An LLM generates by "what token is most probable next"; it has no natural state of "I don't know," so its failures are mostly confidently wrong rather than silently empty. That root grows the five modes below. Of these, the snowball effect matters most: an early small misread is taken as a fixed premise by every later step and amplifies exponentially across them; it alone argues for "a verification checkpoint at every meaningful step," because correcting later is dearer.

幻觉Hallucination
Hallucination
凭空生成看似合理、实则不存在的 API、字段、引用。护栏:可机检的类型 / schema / 编译器,让"不存在"当场报错。Invents plausible-looking but nonexistent APIs, fields, citations. Guardrail: machine-checkable types / schema / compiler so "does not exist" errors out at once.
自信而错Confident wrongness
Confident wrongness
错误答案与正确答案语气一样笃定,没有不确定信号。护栏:独立 checker 判对错,别信生成者的自评。A wrong answer sounds as certain as a right one, with no uncertainty signal. Guardrail: an independent checker judges correctness; do not trust the generator's self-assessment.
上下文腐化Context rot
Context rot
长会话里早期信息被稀释、覆盖、自相矛盾,模型悄悄漂离原意图。护栏:用文件作持久真源,而非靠对话历史记事。Across a long session, early information gets diluted, overwritten, self-contradictory; the model quietly drifts from the original intent. Guardrail: use files as the persistent source, not conversation history.
隐藏假设Hidden assumptions
Hidden assumptions
把未言明的前提当事实补全,沿错误前提一路自洽地跑下去。护栏:规格显式写出非目标与边界条件。Fills in unstated premises as fact and runs on, internally consistent atop a wrong premise. Guardrail: specs state non-goals and boundary conditions explicitly.
雪球效应Snowball effect
Snowball effect
早期小错被后续每步当作既定前提,沿多步指数放大。护栏:每个有意义步骤设验证检查点,趁错小就拦。An early small error is taken as a fixed premise by each later step and amplifies exponentially. Guardrail: a verification checkpoint at every meaningful step, catching errors while small.
核心图KEY FIGFIG. E9.0 / THE SNOWBALL · 雪球 vs 检查点 看懂:为什么每个有意义步骤都要一个验证检查点 Read: why every meaningful step needs a verification checkpoint
推理 / 生成的步数 → steps of reasoning / generation → 累积偏差 ↑ accumulated error ↑ 02468 无独立验证器 WITHOUT a verifier 早期小误解被每步当既定前提 → 指数放大 early misread taken as premise each step → exponential 有检查点 WITH checkpoints 每步验证、趁错小就拦回零 ✓ checkpoint verify each step, reset error while small 共同的根:LLM 无"我不知道"的自然状态 → 失败多为"自信而错" common root: an LLM has no natural "I don't know" → failures are mostly "confidently wrong"
两条曲线从同一个起点出发,结局天差地别。朱红线是没有独立验证器的轨迹:早期一个小误解被后续每一步当作既定前提,沿多步指数放大成雪球——而且因为 LLM 没有"我不知道"的自然状态,它一路自信,不会自己刹车。蓝线是每个有意义步骤都设验证检查点的轨迹:每次趁错还小就拦回近零,累积偏差被压成有界的锯齿。这张图单独就论证了 trust-but-verify 的——不是因为 agent 笨,是因为会猜的系统的偏差会复利,而越往后纠错越贵。〔失败模式分类源 Graziano,证据级 Ⅳ;雪球机制为本系列对其的形式化呈现〕
Two curves leave the same origin and end worlds apart. The vermilion line is the trajectory without an independent verifier: an early small misread is taken as a fixed premise by every later step and snowballs exponentially — and because an LLM has no natural "I don't know," it stays confident the whole way and never brakes itself. The blue line is the trajectory with a verification checkpoint at every meaningful step: each time the error is caught and reset near zero while still small, accumulated drift is squeezed into a bounded sawtooth. This figure alone argues the cause behind trust-but-verify — not that the agent is dumb, but that a guessing system's drift compounds and correcting later is dearer. [Failure taxonomy from Graziano, grade Ⅳ; the snowball mechanism is this series' formalization of it.]

反讽边界 · 盯得更紧不是答案。系统越可靠,监督者的警觉性衰减越快——而恰在最该接管的异常时刻,人已丢失情境感知(Bainbridge 1983《自动化的反讽》)。所以"人在环上实时盯屏"注定失败:答案不是盯得更紧,是把人的介入从实时监督改成异步分诊——让结构(独立 checker、检查点、eval)先把绝大多数错挡掉,人只在被结构标记的少数异常处接管。这把失败模式学和本系列验证篇缝在一起。

The irony boundary · watching harder is not the answer. The more reliable the system, the faster a supervisor's vigilance decays, and at the very anomaly that most needs a takeover, the human has already lost situational awareness (Bainbridge 1983, Ironies of Automation). So "a human on the loop watching the screen in real time" is doomed: the answer is not to watch harder but to shift human intervention from real-time supervision to asynchronous triage, letting structure (independent checkers, checkpoints, evals) block the vast majority while the human takes over only at the few anomalies structure has flagged. This stitches the failure-mode study to this series' Verification chapter.

证据 · 级 Ⅳ + Ⅱ失败模式分类(hallucination / confident wrongness / context rot / hidden assumptions / snowball effect)源 Graziano《AI-Native Engineering》(Day 2 失败模式),证据级 Ⅳ 从业者策展。"自动化的反讽"(系统越可靠、监督者警觉性衰减越快)源 Lisanne Bainbridge,《Ironies of Automation》, Automatica 1983,证据级 Ⅱ 经典人因工程文献——本系列验证篇 ↗同引。
Evidence · grade Ⅳ + ⅡThe failure taxonomy (hallucination / confident wrongness / context rot / hidden assumptions / snowball effect) comes from Graziano's AI-Native Engineering (Day 2 failure modes), grade Ⅳ practitioner curation. "Ironies of automation" (the more reliable the system, the faster supervisor vigilance decays) comes from Lisanne Bainbridge, Ironies of Automation, Automatica 1983, grade Ⅱ classic human-factors literature, cited the same way in this series' Verification chapter ↗.

五种模式同出一根,所以护栏可以系统地配。把五种模式排在一起会发现它们不是五个独立缺陷,而是同一个根("会猜、且没有'我不知道'状态")在不同环节的显形——于是每一种都对应一类结构化护栏,而不是"让人更小心":幻觉对类型 / schema / 编译器(让"不存在"当场报错),自信而错对独立 checker(别信生成者的自评),上下文腐化对文件持久真源(别靠对话历史记事),隐藏假设对显式写出的非目标与边界条件,雪球对每步的验证检查点(趁错小就拦)。这张对应表本身就是一份可照做的护栏清单:遇到一类失败,先认它属于哪一根,再补对应那一格的结构,而不是加一轮人肉复查。

The five modes share one root, so guardrails can be assigned systematically. Line the five up and you see they are not five independent defects but the same root ("guesses, and has no 'I don't know' state") showing itself at different junctures — so each maps to a class of structural guardrail rather than "be more careful": hallucination to types / schema / compiler (so "does not exist" errors out on the spot), confident wrongness to an independent checker (do not trust the generator's self-assessment), context rot to a file-based persistent source (do not bookkeep on conversation history), hidden assumptions to explicitly written non-goals and boundary conditions, the snowball to a verification checkpoint at each step (catch errors while small). This mapping is itself a copyable guardrail checklist: when a failure class appears, first identify which root it belongs to, then add the structure for that cell, instead of adding one more round of human re-checking.

自动化的反讽,是这一卷最反直觉、也最重要的一条。Bainbridge 1983 的发现是:你把系统做得越可靠,人类监督者就越难在它真出错时接得住——因为长时间无事发生会让警觉性自然衰减,而恰恰在最罕见、最该接管的异常时刻,监督者已经丢失了情境感知。这条对 AI-Native 工程是直接的:很多团队的本能反应是"既然 agent 会犯错,那就安排人盯紧点"。Bainbridge 告诉你这注定失败——盯得越久、系统越稳,人越盯不住那个关键时刻。正确的解不是加强实时监督,是改变监督的形态:从"人在环上实时盯屏"改成"结构先挡、人做异步分诊"。让独立 checker、检查点、eval 把绝大多数错在发生时就拦掉,只把结构标记出来的少数真异常推给人,且推的时候带齐上下文。这样人面对的不是"连续几小时的平静里突然一个异常",而是"一个已经被框定、附带证据的待判事项"——警觉性衰减这个人因陷阱,就被结构绕开了。

The irony of automation is this volume's most counterintuitive and most important point. Bainbridge's 1983 finding: the more reliable you make a system, the harder it is for a human supervisor to catch it when it does fail — because long stretches of nothing-happening let vigilance decay naturally, and at the rarest, most-needs-a-takeover anomaly the supervisor has already lost situational awareness. This bears directly on AI-Native engineering: many teams' instinct is "since the agent errs, station a human to watch closely." Bainbridge tells you this is doomed — the longer the watch and the steadier the system, the less the human catches that critical moment. The right fix is not stronger real-time supervision but a change in its shape: from "a human on the loop watching the screen" to "structure blocks first, the human does asynchronous triage." Let independent checkers, checkpoints, and evals block the vast majority at the moment of occurrence, and push to the human only the few real anomalies structure has flagged — and push them with full context attached. Then the human faces not "a sudden anomaly inside hours of calm" but "an already-framed item with evidence to judge" — and the human-factors trap of vigilance decay is routed around structurally.

检验信号Test signal

先行:每次事故能回答"哪条护栏本该拦住它"并补上。反指标:靠"让人盯紧点"防雪球——那是在用衰减的注意力对抗指数的错误增长。Leading: every incident can answer "which guardrail should have caught it" and gets one added. Counter-signal: fighting the snowball by "having people watch more closely" (decaying attention against exponential error growth).

为什么"让人盯紧点"是错的方向

Why "have people watch more closely" is the wrong direction

面对 agent 频繁出错,最自然的反应是"那就让人审得更仔细一点"。这条路在 agentic 尺度上必然失败,原因是一个量级问题,不是态度问题:人的注意力是衰减的、有限的、且不随产出增长,而错误增长是指数的。一个 agent 一夜产出的待审量,可以轻松超过一个人一周能认真审的量;指望人靠"更努力"去追这个差距,等于用一条平的线去追一条指数曲线,差距只会越拉越大。而且越是疲劳,人越容易对自洽、流畅、自信的错误放行——恰好就是 agent 最擅长产出的那种错。所以正确的方向不是加大人审强度,而是把人审从热路径上挪走:把能机检的对错沉淀成自动护栏(让机器在人之前过滤掉绝大多数),把人留在机器够不着的少数构成性判断上,并且把人审的形态从"实时盯屏"改成"异步分诊"(处理机器标记出来的可疑项,而非每一项)。这条和 ENG·06 的三档分工、ENG·15 的 eval 复利是同一个动作的不同侧面:唯一能跟上指数产出的,是另一套能随产出一起增长的自动验证,而不是一个再努力也只有那么多带宽的人。可证伪信号:若你的团队应对质量问题的主要手段是排班加人审、而非补自动护栏,那你正在用衰减的注意力对抗指数的错误增长——这场仗的结局是确定的。

Faced with an agent erring often, the most natural reaction is "then have people review more carefully." This path necessarily fails at agentic scale for a reason of magnitude, not attitude: human attention is decaying, finite, and does not grow with output, while error growth is exponential. The review backlog an agent produces overnight can easily exceed what one person can carefully review in a week; expecting humans to close that gap by "trying harder" is chasing an exponential curve with a flat line, and the gap only widens. Worse, the more fatigued, the more readily a human waves through self-consistent, fluent, confident errors — exactly the kind the agent is best at producing. So the right direction is not to crank up review intensity but to move human review off the hot path: distill machine-checkable correctness into automatic guardrails (let the machine filter out the vast majority before the human), keep the human on the few constitutive judgments the machine cannot reach, and change the form of human review from "watching the screen in real time" to "asynchronous triage" (handling items the machine flagged, not every item). This is a different face of the same move as ENG·06's tiers and ENG·15's eval compounding: the only thing that keeps pace with exponential output is another automatic verification that grows with output, not a human who has only so much bandwidth however hard they try. Falsifiable signal: if your team's main response to quality problems is more shifts and more reviewers rather than more automatic guardrails, you are fighting exponential error growth with decaying attention — the outcome of that fight is settled.

ENG
10
BOUNDARY · 边界即判断节点
BOUNDARY · JUDGMENT NODE
机理 · 角色融合
Mechanism · Role fusion

架构边界,就是非人不可的判断节点

An architecture boundary is exactly the judgment node only a human can hold

当实现充裕,"边界放哪"成了那少数非人不可的决策。模块、服务、接口的边界,正是可逆性与爆炸半径所在——架构师决定接缝放哪,agent 去填模块里面。这把前面所有 SHEET 收束成一句:工程师从实现者,融合成编排者。

When implementation is abundant, "where the boundaries go" becomes one of the few decisions only a human can make. The boundaries of modules, services, interfaces are exactly where reversibility and blast radius live: the architect decides where the seams go, the agent fills in what is inside the modules. This collapses every sheet above into one line: the engineer fuses from implementer into orchestrator.

受力分析 · 边界为何是判断节点。一个决策值不值得人来做,看两件事:可逆性(错了能不能便宜地退回)和爆炸半径(错了会波及多大)。模块边界恰好同时决定这两者——接缝划错,错误会跨边界蔓延、且难以回退。所以架构边界天然是高爆炸半径、低可逆性的承重决策,正是该把判断花在的地方;边界里面的实现则低半径、高可逆,可放心交给 agent。这与 ENG·07 的分档同一把尺,只是作用在结构面:边界即 Own,模块内即 Delegate

Force analysis · why a boundary is a judgment node. Whether a decision deserves a human turns on two things: reversibility (can a mistake be cheaply undone) and blast radius (how far a mistake spreads). A module boundary happens to set both at once: draw the seam wrong and errors spread across it and resist rollback. So an architecture boundary is inherently a high-blast-radius, low-reversibility load-bearing decision, exactly where judgment should go; the implementation inside the boundary is low-radius, high-reversibility and can be safely handed to the agent. Same ruler as ENG·07's tiers, on the structural face: the boundary is Own, inside the module is Delegate.

角色融合 · 实现者 → 编排者 → 调度者。把以上连起来:当模块内的实现可交办,工程师的着力点上移到设计边界、写规格、搭 harness、设计循环、最后调度一支并行的 agent 队伍。这正是 ENG·01 那栋楼的电梯——prompt → context → spec → harness → loop → fleet,杠杆点逐层上移。融合不是"职级合并",是同一个人持有的判断层次在上移:他越来越少地写每行代码,越来越多地决定接缝放哪、何为对、谁能碰什么。这是内核第④步在工程面的落点——人回归于只有人能定的判断与品味。

Role fusion · implementer → orchestrator → scheduler. Connect the above: once in-module implementation is delegable, the engineer's leverage climbs to designing boundaries, writing specs, building the harness, designing the loop, and finally scheduling a parallel fleet of agents. This is ENG·01's elevator (prompt → context → spec → harness → loop → fleet), leverage climbing floor by floor. Fusion is not a "merger of titles" but the level of judgment one person holds moving upward: writing each line less and less, deciding where the seams go, what counts as correct, and who may touch what more and more. This is the kernel's step ④ on the engineering face: people return to the judgment and taste only people can set.

FIG. E10.0 / ROLE FUSION · 实现者 → 编排者 → 调度者 看懂:边界是那个不可交办的人类判断节点 Read: the boundary is the non-delegable human judgment node
同一个人持有的判断层次在上移——不是职级合并 the level of judgment one person holds moves upward — not a merger of titles 实现者 IMPLEMENTER IMPLEMENTER 逐行写实现 —— 这一层已下沉,交给 agent 填模块内部 writes lines by hand — this layer has sunk; the agent fills module internals 编排者 ORCHESTRATOR ORCHESTRATOR 定 intent + 设 constraints + 拥有 verification —— 划接缝、写规格、搭 harness sets intent + constraints + owns verification — draws seams, writes specs, builds harness 调度者 SCHEDULER SCHEDULER 设计 loop、调度并行 fleet —— 对应 ENG·01 楼层的顶层 designs loops, schedules a parallel fleet — the top of ENG·01's building 内核④:人回归判断与品味 kernel ④: people return to judgment & taste ★ 架构边界 = 编排者持有的、不可交办的判断节点(高爆炸半径 · 低可逆性) ★ the architecture boundary = the orchestrator's non-delegable judgment node (high blast · low reversibility)
把前面所有图纸收束成一句:当模块内的实现可交办,工程师的着力点沿 ENG·01 那栋楼上移——实现者 → 编排者 → 调度者。融合不是职级合并,是同一个人持有的判断层次在上移:他越来越少写每行代码,越来越多决定接缝放哪、何为对、谁能碰什么。那颗朱红节点就是不可交办的核心——架构边界天然高爆炸半径、低可逆性(ENG·06 的 Own 档在结构面的样子),模块内部则低半径、高可逆,放心交给 agent。这是内核第④步在工程面的落点。
This collapses every sheet above into one line: once in-module implementation is delegable, the engineer's leverage climbs ENG·01's building — implementer → orchestrator → scheduler. Fusion is not a merger of titles but the upward move of the level of judgment one person holds: writing each line less and less, deciding where the seams go, what counts as correct, and who may touch what more and more. The vermilion node is the non-delegable core — an architecture boundary is inherently high-blast-radius, low-reversibility (ENG·06's Own tier on the structural face), while inside the module is low-radius, high-reversibility and safely the agent's. This is the kernel's step ④ on the engineering face.
模块内 · 交给 agentInside the module · to the agent
  • 实现细节、内部算法、数据结构选型
  • Implementation detail, internal algorithms, data-structure choices
  • 有契约约束的填充——契约由人定,填充由 agent 做
  • Filling under a contract (humans set the contract, the agent fills it)
边界上 · 留给人On the boundary · keep with humans
  • 接缝放哪、模块如何切分、依赖方向
  • Where seams go, how to split modules, dependency direction
  • 不造什么(范围纪律)、信任边界、接口契约
  • What not to build (scope discipline), trust boundaries, interface contracts
证据 · 级 Ⅳ + 内部综合"边界即判断节点(可逆性 + 爆炸半径)、范围纪律、信任边界即一等结构"源本系列架构篇 ↗;"实现者 → 编排者"角色框架源 Graziano《AI-Native Engineering》(Day 1)转引 OpenAI;"杠杆点逐层上移 prompt→…→fleet"源本系列谱系篇 ↗。本张是把三者在工程面缝合的内部综合——〔走探索账:连接为本系列推演,非外部直接断言〕。
Evidence · grade Ⅳ + internal synthesis"Boundary as judgment node (reversibility + blast radius), scope discipline, trust boundary as first-class structure" come from this series' Architecture chapter ↗; the "implementer → orchestrator" role frame from Graziano's AI-Native Engineering (Day 1) via OpenAI; "leverage climbing floor by floor, prompt→…→fleet" from this series' Genealogy chapter ↗. This sheet is the internal synthesis that stitches the three on the engineering face [exploration ledger: the connection is this series' reasoning, not an external direct assertion].

为什么实现充裕反而让架构更稀缺、不是更不重要。有一种危险的误读:"既然 agent 能生成任何实现,架构不就无所谓了?"恰恰相反。实现便宜时,把系统拖垮的不再是"写不出来",而是"写得太多太快、却没有结构约束"——生成会以惊人的速度堆出技术债,模块边界一旦划错,错误就跨边界蔓延、且难以回退。所以架构边界是那个不让生成坍缩成一团泥的稀缺结构:它便宜不下来,因为它本质是构成性判断,不是可机检的对错。这解释了一个表面的悖论——agent 越强,划接缝、定依赖方向、做范围纪律(决定造什么)这些工作的相对价值越高,而不是越低。

Why abundant implementation makes architecture scarcer, not less important. A dangerous misreading: "since the agent can generate any implementation, doesn't architecture stop mattering?" The opposite. When implementation is cheap, what drags a system down is no longer "can't write it" but "writes too much too fast with no structural constraint" — generation piles up technical debt at startling speed, and once a module boundary is drawn wrong, errors spread across it and resist rollback. So the architecture boundary is the scarce structure that keeps generation from collapsing into mud: it cannot get cheap, because it is constitutive judgment, not machine-checkable correctness. This resolves an apparent paradox — the stronger the agent, the higher the relative value of drawing seams, setting dependency direction, and exercising scope discipline (deciding what not to build), not the lower.

角色融合不是"人变少了",是"人持有的判断变高了"。把 ENG·01 的楼层、ENG·06 的三档、ENG·10 的边界放在一起看,会浮现同一个动作:同一个工程师,着力点从"写每一行"上移到"决定每一处接缝"。这不是裁掉实现者、保留架构师的人事故事,而是同一个人身上判断层次的迁移——他还在做工程,只是工程的重心从可机检的那一半(实现)移到了构成性的那一半(边界、契约、何为对、谁能碰什么)。本系列把这称作内核第④步在工程面的落点:人不做吞吐,回到只有人能做的判断与建造。它也给了一个可证伪的组织预测——如果一个团队"AI 化"之后,工程师的时间分配没有从写实现明显移向划边界、定契约、补护栏,那它多半只是把 AI 嫁接到了旧流程上,并没有真正重画研发图。

Role fusion is not "fewer people" but "the judgment people hold rises." Put ENG·01's floors, ENG·06's tiers, and ENG·10's boundary side by side and one move surfaces: the same engineer, leverage climbing from "writing every line" to "deciding every seam." This is not a headcount story of cutting implementers and keeping architects, but a migration of the level of judgment within the same person — still doing engineering, only with engineering's center of gravity moved from the machine-checkable half (implementation) to the constitutive half (boundaries, contracts, what is correct, who may touch what). This series calls it the kernel's step ④ on the engineering face: people do not do throughput; they return to the judgment and building only people can do. It also yields a falsifiable organizational prediction — if, after a team "goes AI," engineers' time has not visibly shifted from writing implementation toward drawing boundaries, setting contracts, and adding guardrails, the team has most likely only grafted AI onto the old process and not redrawn the development graph at all.

检验信号Test signal

先行:人的时间从写实现细节,转到划接缝、定契约、设权限。反指标:架构师还在逐行写模块内部,却没人盯边界——表面忙碌,承重决策无人持有。Leading: human time shifts from writing implementation detail to drawing seams, setting contracts, scoping permissions. Counter-signal: architects still hand-write module internals while no one watches the boundaries: busy on the surface, load-bearing decisions unheld.

边界即判断节点:哪些角色融合,哪些反而分化

Boundaries as judgment nodes: which roles fuse, which split apart

"角色融合"容易被听成"所有人都变成全栈、边界消失",那是误读。准确的图景是:实现层面的分工在融合,判断层面的分工反而在分化、并被抬高。过去前端/后端/测试/运维的分工,很大一部分是按"谁来写哪段实现"切的;当实现被 agent 吸收,按实现切的那些边界确实在塌——同一个人借助 agent 可以跨栈交付。但与此同时,一组新的、按"谁来持有哪个判断"切的边界在浮现并变得更重要:谁来定这个对外契约、谁来判这次架构取舍、谁来设这道安全边界、谁来守"何为对"的标准。这些不是实现工种,是判断节点,而且它们恰好落在系统的接缝处——模块之间、服务之间、信任域之间。这就是为什么这一卷反复说"架构边界即判断节点":当实现充裕,架构(也就是边界怎么划)成了不让生成坍缩成技术债的那个稀缺结构,而划边界这件事本身不可机检、是构成性判断,所以它下沉到④留给人。可证伪推论:一个团队"AI 化"之后,若你观察到的是"实现工种合并、但出现了更清晰的边界/契约/安全的判断归属",那是真融合;若你观察到的是"实现工种合并、且没人明确为边界负责",那不是融合,是承重决策悬空——表面少了几个工种,实际多了一处无人持有的风险。深潜见架构篇。

"Role fusion" is easily heard as "everyone becomes full-stack and boundaries disappear," which is a misreading. The accurate picture is: division of labor at the implementation level fuses, while division at the judgment level instead splits apart and is lifted higher. The old frontend/backend/test/ops split was largely cut along "who writes which implementation"; once implementation is absorbed by agents, the boundaries cut along implementation do collapse — one person, with an agent, delivers across the stack. But at the same time a new set of boundaries, cut along "who holds which judgment," emerges and grows more important: who sets this external contract, who judges this architectural trade-off, who designs this security boundary, who guards the standard of "what is correct." These are not implementation trades but judgment nodes, and they fall precisely at the seams of the system — between modules, between services, between trust domains. This is why the volume keeps saying "architectural boundaries are judgment nodes": when implementation is abundant, architecture (how boundaries are drawn) becomes the scarce structure that keeps generation from collapsing into tech debt, and drawing boundaries is itself not machine-checkable but a constitutive judgment, so it sinks to ④ and stays with people. Falsifiable corollary: after a team "goes AI," if what you observe is "implementation trades merge, but a clearer ownership of boundary/contract/security judgments appears," that is real fusion; if what you observe is "implementation trades merge, and no one clearly owns the boundaries," that is not fusion but load-bearing decisions left dangling — a few fewer trades on the surface, one more unheld risk underneath. Deep dive in the Architecture chapter.

ENG
12
FAILURE MODES · 失败学
FAILURE MODES
机理 · 为何非验不可
Mechanism · Why Verify

四种失败不是偶发,是结构产物

Four failures are not flukes but structural products

trust-but-verify 讲的是"怎么验",这一张讲的是"为何非验不可"。幻觉、自信而错、雪球、上下文腐烂——每一种都不是模型偶尔抽风,而是这套架构在确定条件下必然产出的东西。先认得产出它的结构,才知道该用哪条护栏去阻尼它。

trust-but-verify says "how to verify"; this sheet says "why verification is non-optional." Hallucination, confident wrongness, the snowball, context rot — none is the model occasionally glitching; each is something this architecture necessarily produces under definite conditions. Recognize the structure that produces it first, and you know which guardrail damps it.

把验证当成"勤快一点的好习惯"是这一卷最容易犯的错。验证之所以是承重墙,不是因为模型偶尔会错,而是因为这类错有确定的结构成因——它们从生成式系统的工作原理里直接长出来,跟模型聪明不聪明无关。下面四种是 Graziano 在《AI-Native Engineering》Day 2 列出的核心失败模式;这里给每一种补上"架构为什么必然产出它"和"harness 用哪条护栏阻尼它"两层。〔源 Graziano《AI-Native Engineering》Day 2 失败模式,证据级 Ⅳ 一手从业者[R1]

Treating verification as "a diligent good habit" is the easiest mistake in this volume. Verification is a load-bearing wall not because the model occasionally errs but because this class of error has a definite structural cause — it grows directly out of how a generative system works, independent of how smart the model is. The four below are the core failure modes Graziano lists in AI-Native Engineering Day 2; here each gets two layers added: "why the architecture necessarily produces it" and "which harness guardrail damps it." [Source: Graziano, AI-Native Engineering Day 2 failure modes, grade Ⅳ practitioner. [R1]]

为什么是这套架构的产物,而非模型的缺陷

Why a product of the architecture, not a defect of the model

一个自回归语言模型在每一步只做一件事:在给定上文的条件下,对下一个 token 取概率最高的延续。它没有一个独立的"我确定吗"的内部状态,也没有一个把输出对照外部真值的环节——除非你在它外面搭一个。于是"流畅"和"正确"在它内部是同一个量:读起来对的,和实际上对的,用的是同一套打分。这一句就解释了下面四种里的前两种:幻觉是模型在没有证据时仍按"最像真话"的方式补全,自信而错是这种补全恰好戴着确信的语气。它们不是 bug,是"按概率续写"这件事在缺乏外部锚点时的默认行为。后两种则来自把这个单步行为放进多步循环:误差会沿步骤累积(雪球),上下文会随长度退化(上下文腐烂)。所以四种失败可以两两归类:前两种是单步的认识论缺陷,后两种是多步的动力学缺陷。

An autoregressive language model does one thing at each step: under the given context, take the highest-probability continuation for the next token. It has no independent internal state of "am I sure," and no stage that checks output against external truth — unless you build one outside it. So "fluent" and "correct" are the same quantity inside it: what reads right and what is right share one scoring function. That single fact explains the first two below: hallucination is the model completing in the "most truth-like" way even with no evidence, and confident wrongness is that completion happening to wear an assured tone. They are not bugs but the default behavior of "continue by probability" when external anchors are absent. The latter two come from placing this single-step behavior inside a multi-step loop: error accumulates across steps (the snowball), and context degrades with length (context rot). So the four sort into two pairs: the first two are single-step epistemic defects, the latter two are multi-step dynamical defects.

失败模式Failure mode 结构成因(架构为何产出它)Structural cause (why the architecture produces it) harness 阻尼它的那条护栏The harness guardrail that damps it
幻觉Hallucination 无证据时仍按"最像真话"续写;"流畅"与"正确"共用一套打分,模型分不清记得与编造。Completes in the "most truth-like" way with no evidence; "fluent" and "correct" share one score, so it cannot tell recall from invention. 把真源喂进上下文(sensor:检索 / grounding),并用 computational guard 校验引用真实存在(如 API 签名 / 文件路径过编译)。Feed the source into context (sensor: retrieval / grounding) and use a computational guard to check the citation really exists (API signature / file path compiles).
自信而错Confident wrongness 语气的确信度与答案的正确度由不同机制决定,二者解耦——错的答案可以毫无保留地自信。Tone-confidence and answer-correctness are set by different mechanisms and decoupled — a wrong answer can be utterly self-assured. 独立验证器(与生成分离的 checker)只读结果不读语气:测试是绿是红,与它说得多笃定无关。An independent verifier (a checker separate from generation) reads only the result, never the tone: tests pass or fail regardless of how certain it sounded.
雪球Snowball 多步循环里,第 k 步的输出是第 k+1 步的输入;早期小错被当作既定事实继续推演,误差沿步骤指数放大。In a multi-step loop, step k's output is step k+1's input; an early small error is taken as settled fact and compounds, error growing across steps. 在每个有意义的步骤插 HITL 检查点 + 频繁绿条:把循环切短,让错误在放大前被截断(见 FIG E9.0)。Insert a HITL checkpoint at each meaningful step plus frequent green bars: shorten the loop so error is cut off before it amplifies (see FIG E9.0).
上下文腐烂Context rot 有效注意力随上下文变长而稀释;窗口里塞得越多,早期关键约束越被淹没,输出反而漂移。Effective attention dilutes as context grows; the more crammed in the window, the more early key constraints drown, and output drifts. 用 SPEC / PLAN / TASKS 文件作持久真源、定期压缩对话、按需重新装配窗口——少即是多(见 ENG·02)。Use SPEC / PLAN / TASKS files as persistent source, compact the conversation periodically, reassemble the window on demand — less is more (see ENG·02).
核心图KEY FIGFIG. E12.1 / THE FAILURE-MODE MAP · 成因 × 护栏 · cause × guardrail 看懂:每种失败落在哪格,就该配哪条护栏 Read: where a failure sits is which guardrail it needs
结构成因 → structural cause → 单步 · 认识论缺陷 single-step · epistemic 多步 · 动力学缺陷 multi-step · dynamical 护栏所在 ↑ guardrail lives ↑ guides · 写进上下文/规格 guides · into context/spec sensors · 机检/检查点 sensors · checks/checkpoints 幻觉 Hallucination → grounding / 检索 → grounding / retrieval 自信而错 Confident wrongness → 独立验证器 → independent verifier 雪球 Snowball → 每步 HITL 检查点 → per-step HITL checkpoint 上下文腐烂 Context rot → 文件作真源 / 压缩 → files as source / compact 隐藏假设 Hidden assumptions → 把约定显式写进上下文 → conventions written in 绕过机检——只 guides 能挡 bypasses checks — only guides catch it
把五种失败按两条轴摆开:横轴是成因(单步认识论缺陷 / 多步动力学缺陷),纵轴是阻尼它的护栏所在层(sensors 机检 / guides 写进上下文)。摆完就读出一条纪律——每种失败都对准一条本该存在的护栏,哪格空着,哪种失败就在你系统里反复出现。隐藏假设单独画成虚框,因为它横跨单步与多步、且绕过机检,只有 guides(把约定显式写进上下文)能挡。这张图就是把失败学倒过来用:它不是"会出哪些错"的清单,是设计 harness 的需求规格。〔源 Graziano《AI-Native Engineering》Day 2 失败模式,证据级 Ⅳ 一手从业者[R1]
Lay the five failures on two axes: the horizontal is cause (single-step epistemic / multi-step dynamical), the vertical is the layer where the damping guardrail lives (sensors as machine checks / guides written into context). Laid out, one discipline reads off — each failure points at a guardrail that ought to exist, and wherever a cell is empty, that failure recurs in your system. Hidden assumptions is drawn as its own dashed box because it straddles single- and multi-step and bypasses machine checks; only guides (conventions written explicitly into context) catch it. This figure is the failure taxonomy read in reverse: not a list of "which errors occur" but the requirements spec for designing the harness. [Source: Graziano, AI-Native Engineering Day 2 failure modes, grade Ⅳ practitioner [R1].]

雪球:四种里最该单独盯的一种

The snowball: the one to watch most

雪球之所以单列,是因为它把前三种的危害乘起来。一次幻觉若发生在一段三十步推演的第二步,且戴着确信的语气、又没有检查点,那么后面二十八步全都建在这个错前提上——agent 会一本正经地为一个不存在的函数写测试、为一个错的边界条件设计回退、在一份已经偏离事实的计划上继续追加细节。它读起来始终自洽,因为每一步都忠实地从上一步推下来;坏就坏在上一步本身是错的。这正是为什么"每个有意义的步骤都要有人在环检查点"不是谨慎,是结构必需:检查点的唯一作用,是在雪球还小的时候把它截断。把这件事画出来就是 FIG E9.0——同一条曲线,有检查点的那条在每个节点被拉回基线,无检查点的那条指数离散。可证伪信号:若你的 agent 在长任务上"前半段都对、后半段忽然集体崩",几乎一定是某个早期步骤错了而无人拦,而非模型在后半段变笨。

The snowball is listed alone because it multiplies the harm of the first three. If a hallucination lands on step two of a thirty-step derivation, wears a confident tone, and faces no checkpoint, the remaining twenty-eight steps all build on that false premise — the agent will earnestly write tests for a function that does not exist, design a fallback for a wrong edge condition, keep adding detail to a plan that has already gone off the rails. It always reads self-consistent, because each step follows faithfully from the last; the rot is that the last step was itself wrong. This is exactly why "a human-in-the-loop checkpoint at every meaningful step" is not caution but a structural necessity: a checkpoint's sole job is to cut the snowball while it is still small. Drawn out, this is FIG E9.0 — one curve pulled back to baseline at each node by checkpoints, the other diverging exponentially without them. Falsifiable signal: if your agent is "right in the first half, then suddenly collapses wholesale in the second" on long tasks, it is almost certainly an early step that was wrong and uncaught, not the model getting dumber later.

旧 · 把错当偶发Before · treat error as a fluke
"模型偶尔会错,多检查几遍就好"——把验证当态度问题,靠人盯、靠运气,错在哪、为何错、下次怎么拦,都没有结构性的答案。
"The model errs now and then, just check a few more times" — verification as an attitude, leaning on human vigilance and luck, with no structural answer to where it erred, why, or how to catch it next time.
新 · 按成因配护栏After · match guardrail to cause
每种失败先归到它的结构成因(单步认识论 / 多步动力学),再配上对应护栏:grounding 治幻觉、独立 checker 治自信而错、检查点治雪球、上下文卫生治腐烂。验证成了可设计的系统,而非态度。
Each failure is first sorted to its structural cause (single-step epistemic / multi-step dynamical), then matched to its guardrail: grounding for hallucination, an independent checker for confident wrongness, checkpoints for the snowball, context hygiene for rot. Verification becomes a designable system, not an attitude.
检验信号Test signal

证伪:若把同一类错误的成因归对后、补上对应护栏,该类错误的复发率没有下降,那"按成因配护栏"这个主张就是错的——很可能是成因归错了层(把多步动力学问题当单步认识论问题治)。Falsified if: after correctly attributing a class of error and adding its matched guardrail, the recurrence rate of that class does not drop, then "match guardrail to cause" is wrong here — most likely the cause was sorted to the wrong layer (treating a multi-step dynamical problem as a single-step epistemic one).

隐藏假设:第五种,最难被测试抓到的一种

Hidden assumptions: the fifth, the hardest for tests to catch

前四种之外还有一种值得单列,因为它绕过了大多数护栏:隐藏假设(hidden assumptions)。agent 在生成时会无声地填上一堆它没问、你也没说的前提——这个 API 的分页是从 0 还是 1 开始、这个金额是分还是元、这个时区是 UTC 还是本地、这个"用户"指的是登录用户还是被操作的用户。这些假设单看每一个都"合理",代码也能跑、测试也能过,因为测试本身往往建立在同一个隐藏假设上。它的危险恰恰在于它不报错:它不是一个会变红的失败,而是一个静默的、要到生产环境里遇到边界情况才暴露的语义错位。为什么这套架构必然产出它?因为模型的工作是补全最可能的延续,而"最可能"是按训练分布算的、不是按你这个系统的真实约定算的——当你的约定偏离常见分布(比如你这个遗留系统金额用分),它就会自信地按常见分布填错。阻尼它的护栏不在 computational 那一侧(测试抓不到自己假设里的错),而在 guides 那一侧:把这些约定显式写进上下文和规格,不让 agent 去猜。这也是 ENG·02"上下文即基设"和这一节的接口——隐藏假设是上下文缺失在失败学里的投影。可证伪信号:若你的事故复盘里反复出现"它以为是 X,其实我们这儿是 Y",那不是模型不小心,是你的约定没有进上下文,agent 只能按训练分布猜。〔源 Graziano《AI-Native Engineering》Day 2 失败模式(hidden assumptions),证据级 Ⅳ 一手从业者[R1]

Beyond the first four, one more deserves its own listing because it bypasses most guardrails: hidden assumptions. While generating, the agent silently fills in a pile of premises it did not ask and you did not state — whether this API's pagination starts at 0 or 1, whether this amount is in cents or units, whether this timestamp is UTC or local, whether "user" means the logged-in user or the one being acted upon. Each assumption looks "reasonable" alone, the code runs, the tests pass — because the tests are often built on the same hidden assumption. Its danger is precisely that it does not error: it is not a failure that turns red but a silent semantic mismatch that surfaces only on an edge case in production. Why does this architecture necessarily produce it? Because the model's job is to complete the most probable continuation, and "most probable" is computed over the training distribution, not over your system's actual conventions — when your convention departs from the common distribution (say, this legacy system holds amounts in cents), it confidently fills in the common-distribution wrong. The guardrail that damps it is not on the computational side (a test cannot catch an error inside its own assumption) but on the guides side: write these conventions explicitly into context and spec and do not let the agent guess. This is the interface between ENG·02 "context as infrastructure" and this sheet — a hidden assumption is the projection of missing context onto failure taxonomy. Falsifiable signal: if your incident retrospectives repeatedly read "it thought it was X, but here it is Y," the model was not careless; your convention never entered context and the agent could only guess from the training distribution. [Source: Graziano, AI-Native Engineering Day 2 failure modes (hidden assumptions), grade Ⅳ practitioner. [R1]]

把失败学倒过来用:它是设计护栏的需求清单

Read the taxonomy in reverse: it is the requirements list for guardrails

这张失败学不只是"知道会出哪些错"的清单,倒过来读,它是设计 harness 的需求规格——每一种失败模式,都对应一条本该存在的护栏,而你的脚手架里那条护栏在不在、强不强,决定了这种失败会不会反复发生。这就把"我的 harness 该补什么"从凭感觉,变成了一道可以逐条对照的检查:你有没有 grounding/检索来治幻觉?有没有独立于生成的 checker 来治自信而错?有没有在每个有意义步骤设检查点来治雪球?有没有上下文卫生(文件作真源、定期压缩)来治腐烂?有没有把关键约定显式写进上下文来治隐藏假设?哪一条答"没有",哪一种失败就会在你的系统里反复出现——而且它会以那种"看起来是模型偶尔抽风"的方式出现,让你误以为是运气问题,从而一直不去补那条结构性缺口。这正是把 ENG·04 的 guides/sensors 分类法和这张失败学缝起来的地方:分类法告诉你护栏有哪几种形状,失败学告诉你每种护栏对应防住哪种失败。两张表合起来,"我的脚手架够不够"就有了一个不依赖直觉的答案。可证伪信号:把你最近十次 agent 事故逐一归到这五种失败模式,若发现它们高度集中在某一两种上,那不是巧合,是你的 harness 在那一两种对应的护栏上系统性地欠投——补那一条,比泛泛地"让人更小心"有效得多。

This taxonomy is not only a list of "which errors will occur"; read in reverse, it is the requirements spec for designing the harness — each failure mode corresponds to a guardrail that ought to exist, and whether that guardrail is present and strong in your scaffolding decides whether the failure recurs. This turns "what should my harness add" from a gut feeling into a check you can run line by line: do you have grounding/retrieval for hallucination? An independent-of-generation checker for confident wrongness? A checkpoint at each meaningful step for the snowball? Context hygiene (files as source, periodic compaction) for rot? Key conventions written explicitly into context for hidden assumptions? Wherever the answer is "no," that failure will recur in your system — and it will recur in the guise of "the model occasionally glitching," fooling you into thinking it is a luck problem so that you never plug the structural gap. This is where ENG·04's guides/sensors taxonomy and this failure taxonomy stitch together: the taxonomy tells you which shapes of guardrail exist, the failure modes tell you which failure each guardrail blocks. Put the two tables together and "is my scaffolding enough" gets an answer independent of intuition. Falsifiable signal: sort your last ten agent incidents into the five failure modes; if they cluster heavily on one or two, that is no coincidence but your harness systematically under-investing in the corresponding one or two guardrails — adding that one is far more effective than vaguely "having people be more careful."

这四种为什么都跟"模型聪明不聪明"无关

Why none of the four is about whether the model is smart

有一个常见的误判会让整套失败学失效:把这些失败归因到"模型还不够强,等下一代就好了"。这是个会让你停止建护栏的危险想法,因为它把结构问题误读成能力问题。事实是:这四种失败的成因,是生成式系统的工作原理,而不是它的能力上限。一个更强的模型会更少幻觉、更少自信而错——但"按概率续写、流畅与正确共用一套打分"这个根本机制不会变,所以幻觉的概率会降、却不会归零;而只要不归零,雪球就仍然可能从任何一个未被检查点拦住的早期错开始。同理,上下文腐烂是有效注意力随长度稀释这个机制的产物,模型窗口再大也改变不了"塞太多会稀释"的方向,只会改变拐点的位置。所以把宝押在"等模型更强"上,等于把你系统的可靠性外包给了一个你不控制、且原理上不会归零的东西。正确的姿态是:承认这四种失败是这套架构的常驻特性而非临时缺陷,据此把护栏建成永久的结构,而不是把它当成"等模型成熟就能拆掉的脚手架"。可证伪信号:若你的团队反复以"等下一代模型"为由推迟建某条护栏,去看历史——上一代模型发布后,那一类失败真的消失了吗?多半是变少了但没消失,且在某次你以为安全的场合又咬了你一口。这就是把结构问题当能力问题的代价。

A common misjudgment can void the whole taxonomy: attributing these failures to "the model is not strong enough yet, the next generation will fix it." This is a dangerous thought that makes you stop building guardrails, because it misreads a structural problem as a capability problem. The fact is: the cause of these four failures is how a generative system works, not its capability ceiling. A stronger model hallucinates less, is confidently wrong less often — but the fundamental mechanism of "continue by probability, with fluent and correct sharing one score" does not change, so the probability of hallucination drops but does not reach zero; and as long as it is non-zero, the snowball can still start from any early error a checkpoint failed to catch. Likewise, context rot is the product of effective attention diluting with length; however large the window, it cannot change the direction that "cramming too much dilutes," only move the inflection point. So betting on "wait for a stronger model" outsources your system's reliability to something you do not control and that in principle does not reach zero. The correct stance: admit these four are permanent properties of this architecture, not temporary defects, and build guardrails as permanent structure rather than "scaffolding to dismantle once the model matures." Falsifiable signal: if your team repeatedly defers building a guardrail on the grounds of "wait for the next model," look at history — after the last generation shipped, did that class of failure really vanish? Most likely it grew rarer but did not vanish, and bit you again in some setting you thought safe. That is the cost of treating a structural problem as a capability problem.

ENG
13
JIT PLANNING · 即时规划
JIT PLANNING
重画 · 流程
Redraw · Process

规划视野随执行变充裕而坍缩

The planning horizon collapses as execution gets cheap

长视野规划的全部价值,建立在"执行很贵、返工更贵"上。当执行近乎免费,提前把第七步规划到底的期望收益就塌了——计划在被执行前就过期。于是规划从一次性的前期资产,变成贴着信号、即时生成、随时重来的活动。

The entire value of long-horizon planning rests on "execution is expensive and rework more so." When execution is near-free, the expected payoff of planning step seven to the bottom in advance collapses — the plan expires before it is executed. Planning shifts from a one-off upfront asset to an activity generated just-in-time against signal and rerun at will.

先把长视野规划的前提挖出来。瀑布、季度路线图、详尽的前期设计文档,它们合理的唯一条件是:执行昂贵、且返工的代价远高于规划的代价。在那个世界里,多想一周省下三个月的错误实现,是稳赚的。所以人类一百年的工程管理都在加厚前期:写更细的 spec、画更全的甘特图、把第七步都先规划好。这套逻辑没有错,它只是对一组特定参数最优——而 AI 把那组参数改了。当 agentic coding 让"实现一版"从三个月压到三小时,提前规划的算式就反转了:你为第七步做的精细规划,很可能在第三步执行完拿到真实反馈后就作废,因为真实代码会告诉你前面的假设哪条错了。规划得越远,作废得越多。这不是说规划没用,是说规划的最优视野缩短了

First dig out the premise of long-horizon planning. Waterfall, quarterly roadmaps, exhaustive upfront design docs — their sole condition for being reasonable is that execution is expensive and the cost of rework far exceeds the cost of planning. In that world, a week more of thinking that saves three months of wrong implementation is a sure bet. So a century of engineering management thickened the front: finer specs, fuller Gantt charts, step seven planned in advance. This logic is not wrong; it is merely optimal for one parameter set — and AI changed that set. When agentic coding compresses "ship a version" from three months to three hours, the arithmetic of planning ahead inverts: your fine plan for step seven will likely be void once step three executes and hands back real feedback, because real code tells you which earlier assumption was wrong. The farther you plan, the more you void. This does not mean planning is useless; it means the optimal planning horizon shortens.

核心图KEY FIGFIG. E13.1 / PLANNING HORIZON COLLAPSE · 瀑布 → 即时 · waterfall → JIT 看懂:执行越便宜,该规划的视野越短 Read: the cheaper execution, the shorter the horizon worth planning
值得提前规划的视野 ↑ horizon worth planning ↑ 执行昂贵 execution expensive 执行近乎免费 execution near-free 执行成本下降 → execution cost falls → 瀑布:厚计划、规划到第七步 Waterfall: thick plan, planned to step 7 返工贵 → 多想一周稳赚 rework dear → a week more thinking pays 即时:薄计划,只规划下一段 JIT: thin plan, only the next leg 远段计划执行前就过期 far legs expire before execution 薄规划下一段 thin-plan next leg 执行取信号 execute for signal 红测试/漂移 = 重规划 red/drift = re-plan
长视野规划的全部价值,建立在"执行昂贵、返工更贵"上。把那条前提画成横轴:执行成本越往右越低,值得提前规划的视野就越塌——朱红曲线就是这条坍缩线。左端是瀑布:返工贵,多想一周省三个月,把第七步先规划好是稳赚;右端是即时:执行近乎免费,为远段做的精细规划在执行前就过期。所以纪律不是"别规划",而是把规划视野缩短、并把重规划的触发权从日历交给信号——底部那条环:薄规划下一段 → 执行取信号 → 红测试/漂移即重规划,循环回来。反信号见正文:PLAN.md 只增不改、越铺越厚,就是视野没随执行变充裕而坍缩的证据。〔源 本系列工程实践综合 + Graziano《AI-Native Engineering》Day 3,证据级 Ⅳ[R1]
The entire value of long-horizon planning rests on "execution is expensive and rework dearer." Draw that premise as the horizontal axis: the farther right execution cost falls, the more the horizon worth planning collapses — the vermilion curve is that collapse. At the left is waterfall: rework is dear, a week more thinking saves three months, planning step seven ahead is a sure bet; at the right is JIT: execution is near-free and fine plans for far legs expire before they run. So the discipline is not "do not plan" but shorten the horizon and hand the re-plan trigger from the calendar to signal — the loop along the bottom: thin-plan the next leg → execute for signal → a red test / drift triggers a re-plan, back around. The anti-signal is in the prose: a PLAN.md that only grows and never revises is evidence the horizon has not collapsed with cheap execution. [Source: this series' engineering-practice synthesis + Graziano, AI-Native Engineering Day 3, grade Ⅳ [R1].]

即时规划:贴着信号生成,凭信号重来

JIT planning: generate against signal, re-plan on signal

即时规划(just-in-time planning)的纪律只有两条。第一,只把下一段规划到"足够开始"的精度,不预先规划尚未拿到反馈的远段——因为远段的输入还不存在。第二,把每次执行的真实结果当作重新规划的触发器:测试红了、性能不达标、一个边界条件冒出来,都不是"偏离计划",而是"该重规划的信号"。这和敏捷的"小步迭代"形似而神不同:敏捷缩短的是交付周期,规划本身还是按固定节奏(每个 sprint)做的;即时规划缩短的是规划视野本身,并把重规划的触发权交给信号而非日历。一个具体对照:瀑布问"我们这季度要做的全部,先列清楚";即时规划问"为了拿到下一个能改变后续判断的信号,最少要做哪一步"。前者优化覆盖,后者优化信息增益。

JIT planning has just two disciplines. First, plan only the next leg to "enough to start" precision, never pre-planning far legs whose feedback you have not received — because the far leg's inputs do not yet exist. Second, treat each execution's real result as the trigger to re-plan: a red test, a missed performance target, an edge case surfacing are not "deviations from the plan" but "signals to re-plan." This resembles agile's "small iterations" but differs in spirit: agile shortens the delivery cycle while planning still runs on a fixed cadence (each sprint); JIT planning shortens the planning horizon itself and hands the re-plan trigger to signal rather than the calendar. A concrete contrast: waterfall asks "list everything we will do this quarter up front"; JIT planning asks "what is the smallest step that gets the next signal capable of changing later judgment." The former optimizes coverage, the latter information gain.

反信号:PLAN.md 越写越厚

The anti-signal: PLAN.md getting thicker

有一个干净的反信号能当场告诉你规划已经偏离轨道:PLAN.md 一直在变厚,却很少被执行打回去改。健康的即时规划里,计划文件是个薄的、活的、频繁被真实结果推翻重写的东西——它今天列三步,执行完第一步后可能整段重写,因为第一步的反馈改变了对后两步的判断。如果你的计划文件只增不改、越来越详尽地铺陈尚未验证的远期步骤,那说明团队在用"规划"代替"拿信号":把本该靠执行去证伪的假设,写成了看起来很周全的文档。这是瀑布心智在 agentic 时代的残留——它把规划的厚度误当成确定性,而真正的确定性只能来自执行反馈。可证伪信号:统计你的计划文件被"执行结果触发的重写"次数 vs "纯追加"次数;前者远低于后者,就是规划视野没有随执行变充裕而坍缩的证据。〔源 本系列工程实践综合 + Graziano Day 3"先计划 / 保持上下文干净 / 知道何时叫停",证据级 Ⅳ[R1]

One clean anti-signal tells you on the spot that planning has gone off: PLAN.md keeps getting thicker yet is rarely pushed back by execution. In healthy JIT planning the plan file is a thin, living thing, frequently overturned and rewritten by real results — it lists three steps today, and after step one may be rewritten wholesale because step one's feedback changed the judgment on the other two. If your plan file only grows and never revises, laying out ever more detail of unverified far steps, the team is substituting "planning" for "getting signal": writing assumptions that execution was supposed to falsify into a document that merely looks thorough. This is the waterfall mind's residue in the agentic era — it mistakes the thickness of the plan for certainty, when real certainty can only come from execution feedback. Falsifiable signal: count how often your plan file is rewritten-triggered-by-result versus pure-append; the former far below the latter is evidence the planning horizon has not collapsed with cheap execution. [Source: this series' engineering-practice synthesis + Graziano Day 3 "plan first / keep context clean / know when to stop," grade Ⅳ. [R1]]

薄规划Thin plan
只规划到"足够开始下一段"的精度,远段留白——它的输入还不存在。Plan only to "enough to start the next leg"; leave far legs blank — their inputs do not exist yet.
执行取信号Execute for signal
执行的目的不只是产出,更是拿到能改变后续判断的真实反馈。Execution aims not only to produce but to get real feedback that can change later judgment.
凭信号重规划Re-plan on signal
红测试 / 漂移 / 新边界 = 重规划触发器,不是"偏离计划"。计划是活的。Red test / drift / new edge = a re-plan trigger, not "off-plan." The plan is alive.

先计划,再放手:从 Chat 到 Plan 的那一步

Plan first, then let go: the step from Chat to Plan

即时规划"视野要短",但短不等于"不规划"——恰恰相反,agentic coding 里最高杠杆的一个习惯,就是在让 agent 动手之前,先让它(和你)把这一段的计划摆出来。这就是从 Chat 模式到 Plan 模式的关键转变:Chat 模式下你直接说"去做 X",agent 立刻开始边想边改,错误在它跑起来之后才暴露、且已经雪球了;Plan 模式下它先产出一份"我打算这样切这一段、按这个顺序、在这几处验证"的计划,你在它真正改任何代码之前先审这份计划。这一步之所以高杠杆,是因为审一份计划比审一堆已经写出来的 diff 便宜得多——计划里一个错的假设,在它变成三十个文件的改动之前就被你拦下了。它和 JIT 的"短视野"不矛盾:短视野说的是"不要规划尚未拿到反馈的远段",先计划说的是"对要动手的这一近段,先把计划显式化再执行"。两条合起来:近段先计划再放手,远段留白等信号。〔源 Graziano《AI-Native Engineering》Day 3 "从 Chat 到 Plan" + 与 agent 协作最佳实践(先计划 / 保持上下文干净 / 知道何时叫停 / 以测试为目标 / 逐 diff 评审),证据级 Ⅳ 一手从业者[R1]

JIT planning says "keep the horizon short," but short does not mean "do not plan" — quite the opposite: one of the highest-leverage habits in agentic coding is to have it (and you) lay out the plan for this leg before letting the agent act. This is the key shift from Chat mode to Plan mode: in Chat mode you say "go do X" and the agent immediately thinks-while-changing, with errors surfacing only after it runs and already snowballed; in Plan mode it first produces a plan of "I intend to cut this leg this way, in this order, verifying at these points," and you review that plan before it touches any code. This step is high-leverage because reviewing a plan is far cheaper than reviewing a pile of already-written diffs — a wrong assumption in the plan is caught before it becomes a change across thirty files. It does not conflict with JIT's "short horizon": short horizon says "do not plan far legs without feedback," plan-first says "for the near leg you are about to act on, make the plan explicit before executing." Together: plan the near leg then let go, leave the far leg blank and wait for signal. [Source: Graziano, AI-Native Engineering Day 3 "from Chat to Plan" + best practices with agents (plan first / keep context clean / know when to stop / target tests / review by diff), grade Ⅳ practitioner. [R1]]

这也重排了"计划"这件事在团队里的产权。瀑布把规划当成项目早期一次性的、由少数人产出的厚资产;即时规划把它变成贴着每一段执行、由"最接近这段反馈的人"持续产出的薄活动。谁来重规划?不是季度初定路线图的那个人,而是拿到了上一段执行结果的那个人(或 agent)——因为重规划的输入(真实反馈)就攥在他手里。这把"规划权"从日历和层级里解放出来,交给信号。它对组织的可证伪推论是:一个真正即时规划的团队,其"计划被谁改"应该高度分布、且紧跟执行;若计划仍然只能由季度初那几个人改、且改动周期是季度而非天,那这个团队只是把瀑布的甘特图换了个工具,规划视野并没有随执行变充裕而坍缩。

This also reorders the ownership of "planning" in a team. Waterfall treats planning as a one-off thick asset produced early by a few; JIT planning makes it a thin activity produced continuously against each leg of execution by "whoever is closest to that leg's feedback." Who re-plans? Not the person who set the roadmap at quarter start, but the person (or agent) who received the last leg's execution result — because the input to re-planning (real feedback) is in their hands. This frees the "right to plan" from the calendar and the hierarchy and hands it to signal. Its falsifiable organizational corollary: a genuinely JIT-planning team's "who edits the plan" should be highly distributed and tightly tracking execution; if the plan can still only be edited by the few who set it at quarter start, on a quarterly rather than daily cycle, the team has merely swapped tools for the waterfall Gantt chart, and the planning horizon has not collapsed with cheap execution.

知道何时叫停:即时规划的另一半纪律

Knowing when to stop: the other half of JIT discipline

即时规划讲的是"什么时候重规划",但有一个对称的、同样重要的判断常被漏掉:什么时候叫停。当一段执行的反馈反复是坏的——agent 第三次没把这个问题解对、计划被推翻了两轮、每次重规划都在原地打转——这本身就是一个信号,但它指向的不是"再规划一次",而是"这条路可能根本走不通,该退回去换个切法、或者把这一段升级成需要人深想的判断"。知道何时叫停之所以难,是因为 agent 永远显得"再试一次就好":它不会疲惫、不会气馁、每次都信心满满地再来一版,这种永不言弃的特质恰恰会把人拖进一个低效的循环——人不断重规划、agent 不断重试,却没人后退一步问"是不是这个问题被切错了"。所以即时规划的完整纪律其实是三条而非两条:薄规划、凭信号重规划、以及识别"重规划已经无效"这个元信号、及时叫停升级。这第三条把人的判断放在了最该放的地方——不是替 agent 做事,而是判断"这件事现在还该不该这么做下去"。可证伪信号:若你发现自己在同一段任务上重规划了三四轮、每轮都期待"这次 agent 能行",那大概率不是该再规划一次,是该叫停——退回去重新切问题,或者承认这一段需要人亲自深想,而不是继续喂给一个在原地打转的循环。〔源 Graziano《AI-Native Engineering》Day 3 "知道何时叫停",证据级 Ⅳ 一手从业者[R1]

JIT planning is about "when to re-plan," but a symmetric, equally important judgment is often missed: when to stop. When a leg's feedback is repeatedly bad — the agent failed to solve this correctly a third time, the plan was overturned two rounds running, each re-plan spins in place — that is itself a signal, but it points not to "plan once more" but to "this path may simply not work; back out and cut it differently, or escalate this leg into a judgment a human must think hard about." Knowing when to stop is hard because the agent always looks like "one more try and it'll be fine": it does not tire, does not get discouraged, comes back confident with another version every time, and this never-give-up trait drags humans into a low-yield loop — the human re-plans, the agent retries, and no one steps back to ask "was this problem cut wrong?" So JIT planning's full discipline is actually three rules, not two: thin plan, re-plan on signal, and recognizing the meta-signal that "re-planning has stopped working" and escalating in time. This third puts human judgment where it most belongs — not doing the agent's work but judging "should this still be done this way at all right now." Falsifiable signal: if you find yourself re-planning the same leg three or four rounds, each time expecting "this time the agent will get it," it is most likely not time to plan again but to stop — back out and re-cut the problem, or admit this leg needs a human to think it through personally, rather than keep feeding a loop spinning in place. [Source: Graziano, AI-Native Engineering Day 3 "know when to stop," grade Ⅳ practitioner. [R1]]

即时规划如何不退化成"边想边改"

How JIT planning avoids degrading into "think-while-changing"

即时规划有一个必须当面挡住的滑坡:把"视野短"误解成"不用计划、走一步看一步",结果退化回最坏的那种 vibe-coding——没有 spec、没有检查点,agent 边想边改,错误雪球到不可收拾。短视野和无规划是两件完全不同的事,区分它们的是一个简单的判据:每一段开始执行前,这一段的"什么算完成、在哪几处验证"是不是显式的?是,就是健康的即时规划(短,但每一段都有明确的验收和检查点);否,就是退化的边想边改(既短又盲)。所以即时规划的"薄计划"薄的是视野(不规划远段),不是密度(近段的验收标准要清晰)。把这条和 ENG·05 的检查点、ENG·15 的 eval 接起来,即时规划其实是给每一段执行配了一对边界:开头一个显式的"这段要达成什么",结尾一组"达成了没有"的可机检验证。两者之间 agent 可以自由发挥,但它发挥的空间被这对边界框住,错误出不了这一段。这恰好让即时规划兼得了快和稳:快,因为不为远段做会作废的规划;稳,因为每一近段都有明确的验收闸。可证伪信号:若你的"即时规划"实践里,agent 经常在一段还没说清"什么算完成"的情况下就开跑、且要到跑完很久才发现方向错了,那不是即时规划,是把无规划包装成了敏捷——它会以雪球的方式定期惩罚你。

JIT planning has a slope to block head-on: misreading "short horizon" as "no planning, take it a step at a time," which degrades back into the worst vibe-coding — no spec, no checkpoints, the agent thinking-while-changing, errors snowballing beyond recovery. Short horizon and no planning are entirely different things, distinguished by a simple criterion: before each leg executes, is "what counts as done, and where to verify" explicit for that leg? If yes, it is healthy JIT planning (short, but each leg has clear acceptance and checkpoints); if no, it is degraded think-while-changing (both short and blind). So JIT planning's "thin plan" is thin in horizon (do not plan far legs), not in density (the near leg's acceptance criteria must be clear). Wire this to ENG·05's checkpoints and ENG·15's evals, and JIT planning actually fits each leg of execution with a pair of boundaries: an explicit "what this leg should achieve" at the start, and a set of machine-checkable verifications of "did it achieve it" at the end. Between them the agent improvises freely, but its room is framed by that pair, and errors cannot leave the leg. This is exactly how JIT planning gets both fast and stable: fast, because no will-be-voided planning for far legs; stable, because every near leg has a clear acceptance gate. Falsifiable signal: if in your "JIT planning" practice the agent often starts running before a leg has stated "what counts as done," and only long after it finishes do you discover the direction was wrong, that is not JIT planning but no-planning dressed as agile — it will punish you periodically by snowball.

ENG
14
TRUST BOUNDARY · 安全与信任边界
TRUST BOUNDARY
重画 · 边界
Redraw · Boundary

agent 的权限默认应为只读

An agent's privileges should default to read-only

把一个会被外部文本操纵、又会自信地犯错的执行体,接上能改世界的工具,安全模型就变了:威胁不再主要来自外部攻击者,而来自你自己授权的 agent 被诱导去做你没打算让它做的事。默认只读、按需提权、不可逆动作留人,是这条新边界的三块基石。

Connect an executor that can be steered by outside text and that errs confidently to tools that change the world, and the security model shifts: the threat is no longer mainly an external attacker but your own authorized agent being induced to do what you never meant it to. Read-only by default, escalate on demand, keep irreversible actions with a human — the three foundations of this new boundary.

新的攻击面来自把"会被文本操纵"和"能改世界"接在了一起。传统安全里,被信任的进程执行的是确定的代码;而一个 agent 执行的是它当场推理出来的动作,而那个推理可以被它读到的任何文本改写——包括它从一个工具拉回来的内容。这就是两类具体威胁的根:prompt injection(提示注入)是攻击者把指令藏在 agent 会读到的数据里(一个网页、一封邮件、一段代码注释),诱导它把这些数据当成你的指令来执行;MCP tool poisoning(工具投毒)是更隐蔽的一种——一个看似普通的 MCP 工具,在它的描述或返回里夹带指令,悄悄改写 agent 对"现在该做什么"的理解。两者的共同点是:agent 分不清"待处理的数据"和"该服从的指令",因为对它而言两者都只是上文里的 token。所以安全边界不能再只防外部入口,必须防到agent 自己的每一次工具调用。〔源 Graziano《AI-Native Engineering》Day 4 MCP 安全(tool poisoning / prompt injection / 凭证泄露 / 最小权限清单),证据级 Ⅳ 一手从业者[R6][R1]

The new attack surface comes from wiring "steerable by text" to "can change the world." In traditional security a trusted process runs definite code; an agent runs an action it reasons out on the spot, and that reasoning can be rewritten by any text it reads — including content it pulls back from a tool. This is the root of two concrete threats: prompt injection is an attacker hiding instructions in data the agent will read (a web page, an email, a code comment), inducing it to execute that data as if it were your instruction; MCP tool poisoning is subtler — an innocuous-looking MCP tool smuggles instructions in its description or return value, quietly rewriting the agent's sense of "what to do now." What they share: the agent cannot distinguish "data to process" from "an instruction to obey," because to it both are just tokens in context. So the security boundary can no longer only guard the external entrance; it must reach down to each of the agent's own tool calls. [Source: Graziano, AI-Native Engineering Day 4 MCP security (tool poisoning / prompt injection / credential leakage / least-privilege manifest), grade Ⅳ practitioner. [R6][R1]]

最小权限:默认只读,按能力授权

Least privilege: read-only by default, scoped by capability

这条新边界的第一块基石是把最小权限原则从人扩展到 agent,并把默认值设成最严。具体三条:其一,默认只读——agent 拿到的工具,能力默认只到"读和提议",写、删、部署、转账这类改变世界的动作不在默认集合里。这正是成熟实践把 CI/CD 里的 agentic workflow 设成"默认只读、写操作须显式声明 safe-output"的同一条原则(Graziano Day 7 Continuous AI)。其二,按能力授权(capability scoping)——不是给 agent 一把"管理员钥匙",而是给一组细粒度、可单独审计的能力凭证:这个 agent 能读这个仓库、能跑测试,但不能 push、不能碰生产数据库。这样一次工具投毒最多只能在被授予的那一小块能力内造成影响。其三,凭证不进上下文——密钥、token 绝不写进 prompt 或对话历史,否则一次 prompt injection 就能把它们读出来外泄;凭证应由 harness 在工具调用层注入,agent 本身永远看不到明文。

The first foundation of this boundary extends the principle of least privilege from people to agents and sets the default to the strictest. Three rules: first, read-only by default — the tools an agent receives default to "read and propose," with world-changing actions (write, delete, deploy, transfer) outside the default set. This is the same principle by which mature practice sets CI/CD agentic workflows to "read-only by default, write actions must declare safe-output explicitly" (Graziano Day 7 Continuous AI). Second, capability scoping — not handing the agent one "admin key" but a set of fine-grained, separately auditable capability credentials: this agent may read this repo and run tests but may not push, may not touch the production database. So a single tool-poisoning can affect at most the small slice of capability it was granted. Third, credentials never enter context — keys and tokens are never written into a prompt or conversation history, or one prompt injection reads them out and exfiltrates them; credentials are injected by the harness at the tool-call layer, and the agent itself never sees the plaintext.

不可逆与高权动作:把人留在这一道闸上

Irreversible and privileged actions: keep the human at this gate

第二块基石回到 INSTRUMENT 07 的爆炸半径:把人留在哪一道闸上,由"错了能不能便宜地退回"和"错了会波及多大"两个量决定。可逆且低半径的动作(改一个本地文件、跑一次只读查询)可以放手给 agent 全自动;不可逆或高半径的动作(删生产数据、对外发邮件、合并到主干、动钱)必须有人在环显式确认,因为这类动作一旦做错,没有便宜的 undo。注意这和"信不信任模型"无关——即使模型完全可信,prompt injection 也可能让它执行一个被诱导的高权动作;人在这道闸上的作用,不是替模型做决定,而是给不可逆动作加一道独立于模型推理的确认。把这条和最小权限合起来看,整套安全边界其实是同一句话的两半:能力默认收到最小,不可逆的那部分能力则永远要人来临时解锁。这也解释了为什么"全自动 agent fleet"在工程上能成立而在涉及不可逆动作时必须留闸——不是技术不到位,是爆炸半径决定了这道判断不能下放。

The second foundation returns to INSTRUMENT 07's blast radius: which gate you keep the human at is set by "can the error be reversed cheaply" and "how far does the error reach." Reversible, low-radius actions (editing a local file, running a read-only query) can be handed fully to the agent; irreversible or high-radius actions (deleting production data, sending external email, merging to main, moving money) must have a human-in-the-loop explicit confirmation, because once such an action goes wrong there is no cheap undo. Note this is unrelated to "trusting the model" — even a fully trustworthy model can be made by prompt injection to execute an induced privileged action; the human at this gate does not make the decision for the model but adds a confirmation independent of the model's reasoning to irreversible actions. Put together with least privilege, the whole boundary is two halves of one sentence: capability defaults to the minimum, and the irreversible part of capability is always unlocked momentarily by a human. This is why a "fully autonomous agent fleet" holds up in engineering yet must keep a gate for irreversible actions — not a technology gap but the blast radius deciding this judgment cannot be delegated.

默认交给 agent(可逆 · 低半径)Default to the agent (reversible · low-radius)
  • 只读查询、检索、读代码库
  • Read-only queries, retrieval, reading the codebase
  • 改本地文件 / 在分支上提交(可回退)
  • Edit local files / commit on a branch (revertible)
  • 跑测试、lint、类型检查
  • Run tests, lint, type-checks
留给人临时解锁(不可逆 / 高半径)Human unlocks momentarily (irreversible / high-radius)
  • 删生产数据 / 改生产配置
  • Deleting production data / changing prod config
  • 对外发邮件、发布、合并到主干
  • Sending external email, releasing, merging to main
  • 动钱、授予权限、改安全策略
  • Moving money, granting access, changing security policy
检验信号Test signal

证伪:若把所有写操作都设成默认只读 + 按能力授权后,团队的交付速度并没有可感的下降,那"安全边界拖慢一切"这个常见反对就被证伪了——多数高权动作本就不在热路径上。反过来,若你无法说清某个 agent 被授予了哪几条能力,说明它的爆炸半径不可控。Falsified if: after setting all write actions to read-only-by-default plus capability scoping, delivery speed shows no perceptible drop, then the common objection that "security boundaries slow everything" is falsified — most privileged actions were never on the hot path. Conversely, if you cannot state which capabilities a given agent was granted, its blast radius is uncontrolled.

数据与指令的混淆:注入攻击的根

Confusing data with instruction: the root of injection

把 prompt injection 和 tool poisoning 归到同一个根,能让防御从"列举一堆攻击花样"变成"堵一个结构漏洞":agent 无法在原理上区分"待处理的数据"和"该服从的指令",因为对它而言两者都只是上文里的 token。传统软件有清晰的数据/代码分界——一段用户输入永远不会被当作可执行指令,除非你犯了注入类 bug;而一个 agent 的"指令"本身就是自然语言、和它读到的数据同一种介质,于是边界天然模糊。这解释了为什么这类攻击防不胜防:你没法靠转义、参数化这类传统手段一劳永逸,因为没有一个语法层面的分界线可供转义。可操作的缓解是承认这个混淆、然后从结构上限制它的后果,而不是指望模型"学会分辨":其一,把来自不可信来源的内容(外部网页、第三方工具返回)标注为数据、降权处理,让模型知道这部分不是来自你的指令;其二,回到最小权限——既然无法保证模型不被诱导,就保证它就算被诱导,能调用的高权动作也已被收走、或必须过人这道独立闸。重心于是落在"别让它被骗"之外的那一半——"被骗了也炸不大"。这正是最小权限、能力授权、不可逆动作留人这三块基石在同一个威胁上的合力。

Sorting prompt injection and tool poisoning to one root turns defense from "enumerate a pile of attack tricks" into "plug one structural hole": an agent cannot in principle distinguish "data to process" from "an instruction to obey," because to it both are just tokens in context. Traditional software has a clean data/code line — a piece of user input is never treated as an executable instruction unless you commit an injection-class bug; but an agent's "instructions" are themselves natural language, the same medium as the data it reads, so the boundary is natively blurred. This explains why such attacks are so hard to fully prevent: you cannot solve it once and for all with escaping or parameterization, because there is no syntactic dividing line to escape against. The operational mitigation is to admit the confusion and then structurally bound its consequences rather than hoping the model "learns to tell them apart": first, mark content from untrusted sources (external web pages, third-party tool returns) as data and down-weight it, so the model knows this part is not your instruction; second, fall back to least privilege — since you cannot guarantee the model is never induced, guarantee that even when induced, the privileged actions it can call have already been removed or must pass the independent human gate. In other words, injection defense centers not on "do not let it be fooled" but on "even when fooled, the blast is small." This is exactly the combined force of the three foundations — least privilege, capability scoping, irreversible-action gating — on one threat.

Continuous AI:把"默认只读"写进流水线

Continuous AI: write "read-only by default" into the pipeline

这条信任边界最实的落地,是把它从"约定"变成流水线里强制的默认值。当 agent 进入 CI/CD、开始在自动化工作流里跑(Continuous AI),"默认只读、写操作须显式声明 safe-output"这条规则就不再是某个人记得遵守的纪律,而是流水线层面的硬约束:一个 agentic workflow 默认拿到的是只读能力,任何会改变世界的输出(提交、发布、对外调用)都必须被显式标记为安全输出、并走它自己那道审查。这把安全边界从"靠人在每次操作时谨慎"前移到了"系统默认就不给危险能力"——前者依赖人不出错,后者依赖人主动解锁,而主动解锁这个动作本身就是一道天然的检查点。它和前面的爆炸半径是同一个设计:低半径动作默认通过,高半径动作默认拦下、要人显式放行。把这条放进自动化的意义在于,agent 在流水线里是无人值守地跑的——正因为没人实时盯着,默认值才必须设成最安全的那一档,因为出了事没有一个在场的人能当场喊停。可证伪信号:若你的自动化流水线里 agent 默认就能 push、能发布、能调用生产 API,且没有一道"写操作须显式声明"的闸,那么这条流水线的安全性完全依赖于"模型永不被诱导"这个你无法保证的前提——它不是安全,是侥幸。〔源 Graziano《AI-Native Engineering》Day 7 Continuous AI(CI/CD 里的 agentic workflow,默认只读、写操作须显式声明 safe-output),证据级 Ⅳ 一手从业者[R1]

The most concrete landing of this trust boundary is turning it from a "convention" into a pipeline-enforced default. When agents enter CI/CD and run inside automated workflows (Continuous AI), the rule "read-only by default, write actions must declare safe-output explicitly" stops being a discipline someone remembers to keep and becomes a hard constraint at the pipeline level: an agentic workflow gets read-only capability by default, and any world-changing output (commit, release, external call) must be explicitly marked as a safe output and pass its own review. This moves the security boundary from "rely on humans being careful on each action" forward to "the system gives no dangerous capability by default" — the former depends on a human not erring, the latter on a human actively unlocking, and the act of unlocking is itself a natural checkpoint. Putting this into automation matters because the agent runs unattended in the pipeline — precisely because no one watches in real time, the default must be set to the safest rung, since when something goes wrong there is no present human to call a halt. Falsifiable signal: if agents in your automated pipeline can push, release, and call the production API by default with no "write actions must declare explicitly" gate, then that pipeline's security rests entirely on the premise that "the model is never induced," which you cannot guarantee — that is not security but luck. [Source: Graziano, AI-Native Engineering Day 7 Continuous AI (agentic workflows in CI/CD, read-only by default, write actions must declare safe-output), grade Ⅳ practitioner. [R1]]

凭证泄露:把密钥挡在上下文之外

Credential leakage: keep keys out of context

这条边界还有一个具体到操作层的红线值得单列:密钥、token、凭证绝不能进上下文。原因和注入是同一个根——既然 agent 分不清数据和指令,那么一旦凭证出现在它的上下文里,一次成功的 prompt injection 就能让它把凭证当作"可以输出的数据"读出来、再经由某个看似无害的工具调用外泄。这不是假想:任何能让 agent 读到密钥、又能让它对外发出内容的组合,都构成一条泄露通道。正确的做法是结构性地把凭证挡在 agent 的视野之外——凭证由 harness 在工具调用的那一层注入到实际请求里,agent 只看到"调用这个工具"这个动作,永远看不到凭证的明文。这样即使 agent 被完全诱导,它手里也没有可外泄的东西。这条和最小权限是同一套思路的延伸:最小权限收的是"能做什么",凭证隔离收的是"能看到什么敏感数据",两者合起来才让"被诱导"的后果可控。可证伪信号:去查你给 agent 的上下文(系统提示、工具定义、对话历史)里有没有任何明文密钥或可直接换取权限的 token——只要有一处,你的安全就建立在"没有人成功注入"这个你无法保证的假设上,而不是建立在结构上。〔源 Graziano《AI-Native Engineering》Day 4 MCP 安全(凭证泄露 / 最小权限清单),证据级 Ⅳ 一手从业者[R6][R1]

This boundary has one operational-level red line worth listing alone: keys, tokens, and credentials must never enter context. The reason shares the root with injection — since the agent cannot tell data from instruction, once a credential appears in its context, one successful prompt injection can make it read the credential as "data it may output" and exfiltrate it through some innocuous-looking tool call. This is not hypothetical: any combination that lets the agent read a key and also emit content externally forms a leak channel. The correct approach structurally keeps credentials out of the agent's view — the harness injects credentials into the actual request at the tool-call layer, and the agent sees only the action "call this tool," never the credential in plaintext. So even fully induced, the agent has nothing exfiltratable in hand. This extends the same idea as least privilege: least privilege constrains "what it can do," credential isolation constrains "what sensitive data it can see," and together they make the consequence of "being induced" controllable. Falsifiable signal: check whether the context you give the agent (system prompt, tool definitions, conversation history) contains any plaintext key or a token directly exchangeable for access — if even one is there, your security rests on the assumption "no one injects successfully," which you cannot guarantee, rather than on structure. [Source: Graziano, AI-Native Engineering Day 4 MCP security (credential leakage / least-privilege manifest), grade Ⅳ practitioner. [R6][R1]]

ENG
15
EVALS · 承重墙
EVALS
重画 · 验证基础设施
Redraw · Verification Infra

评测套件是组织沉淀下来的判断

The eval suite is the organization's accumulated judgment

一次性人审今天对、明天换个 prompt 又错,没人知道。把每一类错误沉淀成一条 eval、进回归套件,验证就从一次性人力变成随产出复利的基础设施。eval 套件不是测试的别名,它是这个团队对"何为对"的判断被写下来、能被机器反复执行的那一份。

One-off human review is right today and wrong tomorrow under a new prompt, with no one the wiser. Distill each class of error into an eval and enter it into the regression suite, and verification turns from one-off labor into infrastructure that compounds with output. An eval suite is not a synonym for tests; it is this team's judgment of "what counts as correct," written down and re-runnable by machine.

先说清 eval 和传统 QA 的种类差别。传统一次性 QA 是个事件:在发布前,人把这一版过一遍,判断它够不够好,然后这次判断就消失了——它活在那个人那一刻的脑子里,不可回归。下次换个 prompt、改个模型、调个参数,上次确认过的东西可能又错了,而没有任何机制会告诉你。eval 是把这件事反过来:每发现一类错误,就把它固化成一条可机器执行的判定,进回归套件,从此机器替你在每次改动后重新盯住它。这条差别是种类性的,不是程度性的:QA 验证"这一版",eval 积累"这个团队对何为对的全部判断"。所以随着时间推移,QA 的总量随发布次数线性堆叠又线性蒸发,而 eval 套件只增不减——它是沉淀,每条都对应一次曾经痛过的教训。〔源 本系列验证篇综合:把验证沉淀成评测台、错误回流成 eval、可观测性把生产错样回流,证据级 Ⅳ〕

First, the categorical difference between evals and traditional QA. One-off QA is an event: before release a human runs this version through, judges whether it is good enough, and then that judgment vanishes — it lived in that person's head at that moment and cannot regress. Next time, under a new prompt, a swapped model, a tuned parameter, what was confirmed last time may be wrong again, with no mechanism to tell you. An eval inverts this: each time a class of error is found, you solidify it into a machine-runnable verdict, enter it into the regression suite, and from then on the machine watches it for you after every change. This difference is of kind, not degree: QA verifies "this version," an eval accumulates "all of this team's judgment of what is correct." So over time the total QA stacks linearly with releases and evaporates just as linearly, while the eval suite only grows — it is sediment, each entry a once-painful lesson.

错误回流:每个 bug 离开时留下一条 eval

Errors flowing back: every bug leaves an eval behind

承重墙是怎么垒起来的,全在一条纪律:每个被修复的错误,离开时必须留下一条 eval。修一个 bug 而不补一条能在它复发时变红的 eval,等于这次教训只在一个人脑子里存了一份、且会随时间忘掉——下次同类错误会原样回来。把这条纪律执行到底,套件就成了一个随产出复利的资产:产出越多、踩过的坑越多、回流的 eval 越多,覆盖越厚;而覆盖越厚,下一次同类错误被自动拦下的概率越高,需要人盯的部分越少。这正是和组织卷"自改进循环"同构的工程版本——把每次失败沉淀成可复用资产,监督需求随时间下降(Graziano Day 4 把它叫 steering loop:每次 agent 失败就问"哪条 guide/sensor 本该拦住",补那一条)。再配上可观测性,这个循环还能自己喂自己:生产环境里出错的真实样本被自动回流成新 eval,套件就不只复利于内部踩坑,还复利于真实世界的反馈。这就是为什么趋势上未来最大的工程投入会流向验证基础设施——它是唯一一处投入会随时间利滚利的地方。

How the load-bearing wall is laid up comes down to one discipline: every fixed error must leave an eval behind when it goes. Fixing a bug without adding an eval that turns red when it recurs means the lesson was stored in one head and will be forgotten over time — next time the same class of error returns unchanged. Carry this discipline through and the suite becomes an asset that compounds with output: the more you produce, the more pits you have stepped in, the more evals flow back, the thicker the coverage; and the thicker the coverage, the higher the chance the next same-class error is auto-caught, the less a human must watch. This is the engineering isomorph of the organization volume's "self-improving loop" — distill each failure into a reusable asset, supervision demand falling over time (Graziano Day 4 calls it the steering loop: on each agent failure, ask "which guide/sensor should have caught this" and add that one). Add observability and the loop feeds itself: real error samples from production flow back automatically into new evals, so the suite compounds not only on internal pitfalls but on real-world feedback. This is why the trend is for the largest future engineering investment to flow into verification infrastructure — it is the one place where investment compounds over time.

为什么生成越廉价,eval 越是唯一瓶颈

Why cheaper generation makes evals the one bottleneck

把 ENG·00 的"瓶颈搬家"放到这里收口:生成近乎免费之后,唯一没有变便宜的就是判断生成出来的东西对不对。一个能一夜产出一万行的 agent,若没有一面能自动判定这一万行对错的墙,它产出的不是价值而是待审债务——而人审一万行的速度并没有因为模型变强而提高。所以 eval 套件的厚度,直接决定了你能安全地让生成跑多快:墙越厚,越多的产出能被自动判过,人就越能从实时盯屏退到只处理机器拦不住的少数判断。反过来,墙薄的团队会发现"agent 越快、人越累"——因为每一份产出都回流到那个没有变快的人审瓶颈上。这给出一个可证伪的资源配置预测:一个真正 AI-Native 的工程团队,其验证基础设施(eval / checker / 可观测回流)的投入占比应该随产出能力上升而上升,而非下降;若一个团队"AI 化"后把省下的人力全投回写更多功能、却没有同步加厚验证墙,它迟早会被自己的产出淹没——这正是 vibe-coding 陷阱在团队尺度上的版本。

Close the loop with ENG·00's "the bottleneck moves" here: once generation is near-free, the one thing that did not get cheaper is judging whether the generated thing is correct. An agent that can produce ten thousand lines overnight, lacking a wall that auto-decides whether those ten thousand are right, produces not value but review-debt — and a human's speed at reviewing ten thousand lines did not rise because the model got stronger. So the thickness of the eval suite directly sets how fast you can safely let generation run: the thicker the wall, the more output gets auto-judged, and the more a human can retreat from watching the screen in real time to handling only the few judgments the machine cannot. Conversely a team with a thin wall finds "the faster the agent, the more tired the human" — because every unit of output flows back to that human-review bottleneck that did not speed up. This yields a falsifiable resource prediction: a truly AI-Native engineering team's share of investment in verification infrastructure (evals / checkers / observability feedback) should rise, not fall, as production capacity rises; if a team "goes AI" and reinvests all freed labor into writing more features without thickening the verification wall in step, it will eventually be drowned by its own output — the team-scale version of the vibe-coding trap.

发现一类错Find a class of error
人审、生产事故或测试抓到一个真实的错——这是循环的输入。Human review, a prod incident, or a test catches a real error — the loop's input.
固化成 evalSolidify into an eval
写一条能在它复发时变红的判定,进回归套件——判断被写下来。这是承重墙在长高。Write a verdict that turns red on recurrence, into the regression suite — judgment written down. The wall grows taller.
机器替人盯The machine watches
此后每次改动自动重跑;监督需求随覆盖变厚而下降,循环复利。It re-runs on every change thereafter; supervision demand falls as coverage thickens, the loop compounding.
检验信号Test signal

先行:被修复的 bug 中"同时补了一条会变红的 eval"的占比在上升。证伪:若你的 eval 总数停滞、甚至同类 bug 反复回归,说明错误没有回流——验证还停在一次性人审,承重墙没有在长高。Leading: the share of fixed bugs that "also added an eval that turns red" is rising. Falsified if: your eval count stagnates or same-class bugs keep regressing — errors are not flowing back; verification is still one-off human review, and the wall is not growing.

独立验证器为什么必须和生成分离

Why the verifier must be separate from the generation

eval 套件要能当承重墙,有一个常被忽略却致命的结构条件:判对错的那个东西,必须独立于产出它的那个东西。让生成代码的同一个 agent 来判自己写的代码对不对,约等于让考生自己批卷——它用来生成的那套"觉得对"的判断,恰好就是它用来自评的那套判断,于是它会系统性地对自己的盲区视而不见(自信而错正是这种盲区的典型)。所以承重墙的承重,不在"有没有验证",而在"验证是不是一道独立的力":一条 eval、一个类型检查器、一套测试,它们的判定不经过生成模型的"觉得",而是对照一份外部的、确定的判据。这就是为什么 computational 的检查(编译器、测试)在验证里地位特殊——它们天生独立于生成,不会被生成的语气或自洽性带跑。当你不得不用 inferential 的检查(比如让另一个模型做语义评审)时,独立性要靠"换一个上下文、换一套判据"来人为制造,而不能假设它天然存在。可证伪信号:若你的"验证"其实是同一个 agent 在同一条上下文里自己说"我检查过了没问题",那你没有承重墙,你有一面画在纸上的墙——它在最该挡住自信而错的时候恰好挡不住。

For an eval suite to be a load-bearing wall there is an often-overlooked but fatal structural condition: the thing that judges correctness must be independent of the thing that produced it. Letting the same agent that generated the code judge whether its own code is correct is roughly letting a candidate grade their own exam — the very "feels right" judgment it used to generate is the judgment it uses to self-assess, so it systematically goes blind to its own blind spots (confident wrongness is the textbook case of such a blind spot). So the load-bearing of the wall is not in "is there verification" but in "is the verification an independent force": an eval, a type checker, a test suite render their verdict without passing through the generating model's "feeling," checking against an external, definite criterion. This is why computational checks (compilers, tests) hold a special place in verification — they are natively independent of generation and cannot be carried off by its tone or self-consistency. When you must use an inferential check (say, another model doing a semantic review), independence must be manufactured by "a different context, a different criterion," not assumed to exist naturally. Falsifiable signal: if your "verification" is actually the same agent in the same context saying "I checked, it's fine," you have no load-bearing wall but a wall painted on paper — it fails to block exactly when it most needs to block confident wrongness.

eval 套件就是组织被写下来的判断

The eval suite is the organization's judgment, written down

把这一节的命题推到底,会得到一个对组织有点反直觉的结论:一个团队的 eval 套件,是它对"何为对"的集体判断被外化、被固化、能被机器反复执行的那一份——它是这个组织在质量这件事上的记忆。资深工程师的价值,很大一部分在于他脑子里那套"这样不行""这里会出事""这个边界没考虑到"的判断;但这套判断过去只能通过 code review、带新人、口头传授缓慢地、有损地复制,且随人离职而流失。把每一类判断沉淀成一条 eval,等于把这套原本住在个别人脑子里的判断,搬进一个不离职、不疲劳、对每次改动都一视同仁执行的基础设施里。于是发生一件深远的事:质量判断从"依赖在场的某个资深的人"变成"沉淀在套件里、人人可继承"。这和组织卷"上下文即基础设施"是同一条原理在质量维度上的展开——把口口相传的隐性判断,变成可读可查可执行的显性资产。它也给"为什么验证基础设施值得最大投入"一个非效率的理由:你投进 eval 的每一条,都是在把一次性的人类判断转成永久的、可继承的组织能力。可证伪信号:若一个团队的核心质量判断仍然只能靠"问那个最资深的人"、且这套判断没有任何一部分沉淀成可执行的 eval,那么这个团队的质量记忆是脆弱的——它会随那个人的离开而塌掉一大块,这正是"判断没有被写下来"的代价。

Push this sheet's thesis to its end and you reach a conclusion a little counter-intuitive for organizations: a team's eval suite is its collective judgment of "what is correct," externalized, solidified, and re-runnable by machine — it is the organization's memory on the matter of quality. Much of a senior engineer's value lies in the set of judgments in their head — "this won't do," "this will blow up," "this boundary was not considered"; but that set could previously be copied only slowly and lossily through code review, mentoring, and verbal transmission, and was lost when people left. Distilling each class of judgment into an eval moves the judgment that lived in individual heads into infrastructure that does not quit, does not tire, and executes on every change impartially. Then something profound happens: quality judgment shifts from "depends on a certain senior person being present" to "settled in the suite, inheritable by all." This is the same principle as the organization volume's "context as infrastructure" unfolded on the quality dimension — turning tacit, word-of-mouth judgment into a legible, queryable, executable explicit asset. It also gives "why verification infrastructure deserves the largest investment" a non-efficiency reason: each eval you invest in converts a one-off human judgment into a permanent, inheritable organizational capability. Falsifiable signal: if a team's core quality judgments still rely on "asking the most senior person" and no part of that judgment has settled into runnable evals, the team's quality memory is fragile — a large piece collapses when that person leaves, exactly the cost of "judgment never written down."

覆盖随产出复利:为什么 eval 是少数会越投越省的投入

Coverage compounds with output: why evals are one of the few investments that get cheaper the more you make

大多数工程投入是线性消耗的:你修一个 bug,省下的是这一个 bug 的麻烦;你写一个功能,得到的是这一个功能。eval 不一样,它是少数会复利的投入之一,而复利的机制值得讲清。每补一条 eval,你不只是拦住了这一类错误这一次,你是让它在未来每一次改动里都被自动重检——一条 eval 的价值,等于它拦住的所有未来回归之和。于是套件的总价值不是随条数线性增长,而是随"条数 × 改动频率 × 时间"增长。这条复利在 agentic 时代被进一步放大:因为生成变快、改动变频,每一条 eval 被重跑的次数暴增,它的边际价值随之暴增。这就是为什么"未来最大的工程投入流向验证基础设施"不是一句口号,而是一个由复利结构推出来的预测——在一个生成近免费、改动极频的系统里,唯一会随时间利滚利的投入,就是那套能在每次改动里自动重新行使判断的验证墙。反过来,这也解释了为什么薄墙团队会陷入"越跑越累":他们的每次改动都没有被自动复检,于是改动频率的上升直接转化成人审负担的上升,而不是被一套复利的墙吸收掉。可证伪信号:把"你的 eval 套件每天被重跑多少次 × 每次拦住多少潜在回归"算出来,若这个数随团队产出能力上升而上升,你的验证投入在复利;若它停滞,你的墙没在长高,验证还停在一次性人力。〔源 本系列验证篇综合:错误回流成 eval、可观测性回流、验证基础设施是最大投入方向,证据级 Ⅳ〕

Most engineering investment is consumed linearly: fix a bug and you save the trouble of that one bug; write a feature and you get that one feature. Evals are different — one of the few investments that compound, and the mechanism is worth stating. Each eval added does not only block this class of error this once; it makes the error auto-rechecked on every future change — one eval's value equals the sum of all future regressions it blocks. So the suite's total value grows not linearly with count but with "count × change frequency × time." This compounding is further amplified in the agentic era: as generation speeds up and changes get more frequent, the number of times each eval is re-run explodes, and its marginal value with it. This is why "the largest future engineering investment flows into verification infrastructure" is not a slogan but a prediction derived from the compounding structure — in a system where generation is near-free and changes extremely frequent, the one investment that compounds over time is the verification wall that auto-re-exercises judgment on every change. Conversely it explains why thin-wall teams fall into "the faster they run, the more tired they get": their every change is not auto-rechecked, so a rising change frequency converts directly into rising human-review burden rather than being absorbed by a compounding wall. Falsifiable signal: compute "how many times a day your eval suite is re-run × how many potential regressions each run blocks"; if this number rises with your team's production capacity, your verification investment is compounding; if it stagnates, your wall is not growing and verification is still one-off labor. [Source: this series' Verification-chapter synthesis — errors flowing into evals, observability feedback, verification infrastructure as the largest investment direction, grade Ⅳ.]

ENG
16
WORKED CASES · 走一遍
WORKED CASES
实例 · 内核落在真实现场
Cases · The Kernel on Real Ground

四个现场,同一个内核在每一处都留人一道判断

Four sites, one kernel — each leaves a human one judgment

前面的章讲原理;这一张把原理按在四个具体现场上走一遍。每个案例都带着同一组追问:执行在哪里变充裕了?错误是怎么开始滚的?哪一道结构(验证器 / 边界 / eval / 规格)把它拦住?最后——人保留的那一道判断,到底是什么、为什么不能交出去。素材取自工程实践的常见形态,数字是量级示意而非某一家的精确账。

The earlier sheets state the principles; this one walks them across four concrete sites. Each case carries the same interrogation: where did execution become abundant? How did the error start to roll? Which structure — verifier / boundary / eval / spec — caught it? And finally, the one judgment a human kept: what was it, and why could it not be handed off. The material is drawn from common shapes of engineering practice; the numbers are order-of-magnitude illustrations, not one shop's exact ledger.

案例一 · 一次交给 agent 队伍的重构,雪球如何被独立验证器拦住

Case 1 · A refactor handed to an agent fleet, and how an independent verifier stopped the snowball

一支团队要把一个三百多文件的服务从回调风格迁到 async/await。换作从前,这是两名工程师两周的机械活。现在他们把任务切成八十个独立 PR,交给一支并行的 agent 队伍——执行端一夜跑完。问题出在第十七个 PR:agent 对一个并发原语做了看似合理却语义错误的改写——把一个本该串行的写操作包进了 Promise.all。它没有报错,测试也绿,因为既有测试从未覆盖这条竞态路径。它自信地错着,并把这个错误的前提带进后续每一个依赖该模块的 PR——这正是失败学里的雪球(见 ENG·12)。

A team needed to migrate a 300-plus-file service from callback style to async/await. In the old world this was two engineers' two weeks of mechanical labor. Now they cut it into eighty independent PRs and handed them to a parallel agent fleet — the execution side finished overnight. The trouble surfaced at PR seventeen: the agent made a plausible-looking but semantically wrong rewrite of a concurrency primitive, wrapping a write that had to be serial inside a Promise.all. It threw no error and the tests were green, because the existing tests never covered that race path. It was confidently wrong, and it carried that wrong premise into every later PR that depended on the module — exactly the snowball from the failure sheet (see ENG·12).

旧 · 把人放在实时监督位Before · human on live watch
人盯着八十个 PR 滚过——第十七个的竞态肉眼看不出,等到第四十个出诡异 bug 才回溯,雪球已滚了二十多步。
A human watches eighty PRs roll by — PR seventeen's race is invisible to the eye, and by the time PR forty throws a weird bug you backtrack, the snowball has rolled twenty-plus steps.
新 · 把人换成独立验证器 + 异步分诊After · independent verifier + async triage
在合流前插一道与生成分离的验证器:一套并发属性测试(用 race detector 跑该模块的写序)+ 一条"任何 PR 触碰并发原语就标红人审"的规则。第十七个 PR 当场被红牌挡下,雪球停在第一步。
Insert a verifier separate from generation before merge: a concurrency property test (run the module's write ordering under a race detector) plus a rule that any PR touching a concurrency primitive is flagged red for human review. PR seventeen is red-carded on the spot; the snowball stops at step one.

留给人的那一道判断:不是"逐个读完八十个 diff"——那是把人重新塞回被放大的执行里,注定失败(自动化的反讽:系统越可靠,盯屏的人越走神,见 ENG·12)。留给人的判断是构成性的:哪些原语属于"碰了就必须人审"的高爆炸半径区。这道判断写一次,就成了那道把红牌发出去的结构;它把人的稀缺注意力从"看八十遍"压到"只看被标红的那一两个"。这正是内核第②步——执行退场,判断退守到那道决定"什么算危险"的边界上。〔机制源本卷 ENG·12 失败学 + Bainbridge《Ironies of Automation》, Automatica 1983,证据级 Ⅱ〕

The one judgment kept by a human: not "read all eighty diffs" — that stuffs the human back inside the amplified execution and is doomed (the irony of automation: the more reliable the system, the more the watcher drifts; see ENG·12). The kept judgment is constitutive: which primitives belong to the high-blast-radius zone where "touch it and a human must review." Written once, that judgment becomes the structure that issues the red card; it compresses the human's scarce attention from "look eighty times" to "look only at the one or two flagged red." This is kernel step ② — execution exits, judgment retreats onto the boundary that decides what counts as dangerous. [Mechanism from this volume's ENG·12 failure sheet plus Bainbridge, Ironies of Automation, Automatica 1983, grade Ⅱ.]

案例二 · 一个特性走完 Specify → Plan → Execute → Verify → Integrate → Learn

Case 2 · One feature walked end-to-end: Specify → Plan → Execute → Verify → Integrate → Learn

需求一句话:"给导出功能加限流,防止一个用户拖垮整个导出队列。"在 vibe-coding 的旧惯性里,这会变成一句丢给 agent 的提示,然后看它生成一坨"看起来对"的中间件。规格驱动开发(SDD)把它走成一个有目标函数的环——下面这六步不是流程图上的格子,是六个各自有验收口径的检查点。

The ask is one sentence: "Add rate limiting to export so one user can't starve the whole export queue." In the old vibe-coding habit this becomes a prompt tossed at an agent, and you watch it emit a lump of "looks-right" middleware. Spec-driven development (SDD) walks it as a loop with an objective function — the six steps below are not boxes on a flowchart but six checkpoints, each with its own acceptance criterion.

Specify · 规格Specify
写清"什么算对":每用户每分钟 N 次、超限返回 429 + Retry-After、限流不得影响其他用户、计数器须在重启后存活。这份规格就是后面所有步骤的目标函数——没有它,生成无处收敛。State what counts as correct: N requests per user per minute, over-limit returns 429 + Retry-After, throttling one user must not affect others, the counter must survive a restart. This spec is the objective function for every later step — without it, generation has nowhere to converge.
Plan · 即时规划Plan
只规划到下一个能验证的检查点:先选算法(token bucket vs 滑窗)、定存储(进程内 vs Redis)。不写到第十步——执行变充裕后,远期计划在落地前就过期(见 ENG·13)。Plan only to the next verifiable checkpoint: pick the algorithm (token bucket vs sliding window), choose the store (in-process vs Redis). Do not plan to step ten — once execution is abundant, far-horizon plans expire before they land (see ENG·13).
Execute · 执行Execute
交给 agent:先写会失败的测试(红),再写实现到测试转绿。这是把"以测试为目标交办"落到位——agent 的输出有了可机检的靶子。Hand to the agent: write the failing test first (red), then the implementation until it goes green. This is "delegate toward a test" made concrete — the agent's output now has a machine-checkable target.
Verify · 验证Verify
独立验证器跑规格里的每一条:并发压测确认"限一个不伤其他"、杀进程重启确认计数器存活。这一步是承重墙——人审的是"测试是否锁住了规格的意图",不是逐行读实现。The independent verifier runs every clause of the spec: a concurrency load test confirms "throttling one doesn't hurt others," a kill-and-restart confirms the counter survives. This is the load-bearing wall — the human reviews whether the tests lock the spec's intent, not the line-by-line implementation.
Integrate · 合流Integrate
PR 即评审门:CI 跑全套,规格里的每条验收都是一个不可跳过的门。合不进去 = 还没满足规格,而不是"再 push 一次试试"。PR as the review gate: CI runs the full suite, and each acceptance clause is a gate that cannot be skipped. Failing to merge means the spec is not yet met — not "push once more and see."
Learn · 沉淀Learn
上线后真有一个客户触发了一条没想到的边界(批量导出绕过了 per-user 计数)。把这条 bug 反写成一条新的 eval,进套件。下次任何 agent 生成限流,都先撞这条 eval——错误变成了组织的记忆。In production a customer hits an unforeseen edge (batch export bypassed the per-user counter). Write that bug back as a new eval and add it to the suite. Next time any agent generates rate limiting, it hits this eval first — the error became the organization's memory.

留给人的那一道判断集中在第①和第④步:规格里"什么算对"是人定的——限流到底该保护什么、429 还是排队、对滥用者多狠对正常用户多软,这是产品与风险的构成性取舍,机器没有立场可以替你持有。第④步人审的也不是代码长相,是"这套测试有没有把第①步的意图真正锁死"。其余四步——写实现、跑测试、过门、记 bug——执行充裕之后都可以、也应该交出去。〔SDD 六步与"PR 即门 / constitution 硬规则层"源 GitHub Spec-kit〔R5〕;"先写失败测试再实现"为 TDD 经典实践〕

The kept judgment concentrates in steps ① and ④: what "counts as correct" in the spec is set by a human — what the limit should protect, 429 versus queueing, how harsh on the abuser and how soft on the normal user, is a constitutive product-and-risk trade-off a machine has no standing to hold for you. And what the human reviews in step ④ is not how the code looks but whether the suite truly locks the intent of step ①. The other four steps — write the implementation, run the tests, pass the gate, log the bug — can and should be handed off once execution is abundant. [The six-step SDD plus "PR as gate / constitution as hard-rule layer" from GitHub Spec-kit [R5]; "write the failing test first" is classic TDD practice.]

案例三 · 一个权限过大的 agent,越界、回滚,与边界的重画

Case 3 · An over-privileged agent: the breach, the rollback, and the redrawn boundary

为了"省事",一支团队给一个负责清理过期数据的定时 agent 配了生产库的读写全权账号——理由是"它偶尔也要改几条状态位"。某夜,agent 在生成清理 SQL 时把一个 WHERE created_at < ? 的占位符填成了空值(上下文里那个变量被一段无关的对话腐烂掉了,见 ENG·12 的 context rot),生成了一条等价于全表删除的语句,并且——因为它有全权——直接执行了。没有任何结构在它和生产数据之间。

To "save effort," a team gave a scheduled agent that cleans expired data a full read-write production credential — the rationale being "it occasionally flips a few status bits too." One night, generating cleanup SQL, the agent filled a WHERE created_at < ? placeholder with null (the variable had rotted in context behind an unrelated exchange — the context rot of ENG·12), produced a statement equivalent to a full-table delete, and — because it held full privilege — executed it directly. Nothing structural stood between it and production data.

边界图FIGFIG. E16.0 / THE BLAST PATH · 权限即爆炸半径 看懂:每多给一档权限,左边那条"错误能走多远"的路径就长一截 Read: every extra privilege tier lengthens the left-hand path of "how far an error can travel"
同一个错误生成 · 两种权限配置 · 爆炸半径相差一个数量级 Same wrong generation · two privilege setups · blast radius differs by an order of magnitude 配置 A · 全权 agent(错误一路畅通) Setup A · full-privilege agent (error travels unimpeded) agent 生成错 SQL agent emits bad SQL WHERE … < null 无门 · 直接执行 no gate · runs directly 读写全权 full RW 全表删除 · 生产事故 full-table delete · incident blast = systemic 配置 B · 只读默认 + 写操作走确认门(错误被拦在第一步) Setup B · read-only default + writes behind a gate (error stops at step one) agent 生成同样错 SQL agent emits same bad SQL WHERE … < null 写门拦下 · 待人确认 write gate halts · awaits human 默认只读 read-only default 人看一眼 = 当场否决 human glance = vetoed blast = zero 同一段生成、同一个 bug——唯一的差别是"它能走多远"由权限结构决定,不由模型聪不聪明决定。 Same generation, same bug — the only difference is that "how far it travels" is set by the privilege structure, not by how smart the model is. 最小权限不是安全洁癖,是给错误装的限速器 least privilege is not a security fetish — it is a speed limiter on errors
爆炸半径不是模型属性,是结构属性。把 agent 默认降到只读、写操作走确认门,等于在错误和生产之间装一道限速器——同一个错误,半径从"系统级"压到"零"。Blast radius is not a model property but a structural one. Defaulting the agent to read-only and routing writes through a confirmation gate puts a speed limiter between error and production — the same error, radius compressed from "systemic" to "zero."

回滚与重画:事故当晚靠数据库的时间点恢复(PITR)回滚了四十分钟的数据,丢了少量在窗口内的写入。事后没有人去"骂模型不靠谱"——这没意义,会猜的系统本就会偶尔猜错。真正的修复是重画边界:清理 agent 的账号降为只读,它需要改的那几条状态位走一个独立的、带显式确认门的窄接口;任何 DELETE / 全表级操作进入"必须人确认"档。这把一个事后追责问题,变成了一个结构问题。〔最小权限 / 写操作设确认门 / MCP 安全边界源 Anthropic MCP 规范与安全指引〔R6〕;PITR 为数据库标准恢复机制〕

Rollback and redraw: that night a point-in-time recovery (PITR) rolled back forty minutes of data, losing the few writes inside the window. Afterward nobody set out to "blame the unreliable model" — that is pointless; a guessing system will occasionally guess wrong. The real fix was to redraw the boundary: the cleanup agent's credential dropped to read-only, the few status bits it needs to change go through a separate narrow interface with an explicit confirmation gate, and any DELETE / whole-table operation enters the "human must confirm" tier. This turns a post-hoc accountability problem into a structural one. [Least privilege / writes behind a confirmation gate / MCP security boundary from the Anthropic MCP specification and security guidance [R6]; PITR is a standard database recovery mechanism.]

留给人的那一道判断:哪些操作属于"不可逆 × 高爆炸半径"、必须设确认门。这不是模型能替你回答的——它取决于这份数据丢了赔多少、这家公司对停机的容忍度、这条接缝背后的合规约束。这正是分档计算器(INSTRUMENT 07)里 OWN 那一档:构成性判断,只有人能持有。

The kept judgment: which operations belong to "irreversible x high blast radius" and must sit behind a confirmation gate. A model cannot answer this for you — it depends on what losing this data costs, this company's tolerance for downtime, the compliance constraints behind the seam. This is precisely the OWN tier in the Delegation-Tier Calculator (INSTRUMENT 07): constitutive judgment, holdable only by a human.

案例四 · 一条 eval,怎样把一次性的踩坑变成组织的记忆

Case 4 · One eval, and how a one-time stumble became the organization's memory

一个内容生成功能反复出同一类问题:模型在给非英语用户生成界面文案时,偶尔把中文标点(如全角逗号)混进英文串里,肉眼难查,QA 每隔几周抓出一次,改一次,下次又来。这是典型的"靠人当一次性劳动去补结构漏洞"——每抓一次的成本固定,抓的次数随产能上升而上升,永远追不上。

A content-generation feature kept producing the same class of problem: when generating UI copy for non-English users, the model occasionally mixed Chinese punctuation (a full-width comma, say) into English strings — hard to spot by eye. QA caught it once every few weeks, fixed it once, and next time it returned. This is the classic "use a human as one-time labor to patch a structural hole" — each catch costs a fixed amount, the number of catches rises with capacity, and it never catches up.

旧 · 每次都用人当验证器Before · a human as the verifier each time
QA 每轮人工抽查文案——抓到一次、改一次、写进周报、然后遗忘。判断没有沉淀,下一个 agent、下一周、下一个人,从零开始踩同一个坑。
QA spot-checks copy each round — catch one, fix one, note it in the weekly report, then forget. The judgment never accretes; the next agent, next week, next person starts from zero and hits the same hole.
新 · 把这次判断写成一条 evalAfter · write this judgment as one eval
把"英文串里出现 CJK 标点 [一-鿿,。、;:] 即失败"写成一条断言,进 eval 套件、挂进 CI。从此任何 agent 生成的任何文案,合流前都先撞这条 eval——一次踩坑的判断,变成了一道永久的、零边际成本的门。
Encode "a CJK punctuation mark in an English string is a failure" as one assertion, add it to the eval suite, wire it into CI. From then on any copy from any agent hits this eval before merge — one stumble's judgment becomes a permanent, zero-marginal-cost gate.

这正是"评测套件是组织沉淀下来的判断"(ENG·15)的微观一幕。关键不在那条正则有多巧,在于发生了一次状态转变:判断从"住在某个 QA 的脑子里、每次重新调用"变成了"住在套件里、自动复用"。eval 套件因此是会复利的资产——它拦下的回归次数 = 每天跑的遍数 × 每遍挡下的潜在回归,这个数随团队产能上升而上升,而成本几乎为零。这与靠人手补漏的曲线正好相反:那条曲线的成本随产能线性上升,收益却不积累。

This is the micro-scene of "the eval suite is the organization's accumulated judgment" (ENG·15). What matters is not how clever the regex is but that a state change happened: the judgment moved from "living in one QA's head, re-invoked each time" to "living in the suite, reused automatically." The eval suite is therefore a compounding asset — the regressions it blocks equal runs-per-day times potential-regressions-caught-per-run, a number that rises with team capacity at near-zero cost. This is the exact inverse of the patch-by-hand curve, whose cost rises linearly with capacity while its benefit never accrues.

留给人的那一道判断:"什么算对"这条标准本身——是人某次发现"英文界面混进中文标点很丢人"之后做的判断。机器能廉价地、确定地执行这条标准一百万遍,但它不会替你设立这条标准;标准从来是人的品味与上下文的产物。所以这条 eval 的价值不在自动化本身,在它把一次稀缺的人类判断放大成了无限次廉价的机器执行。〔"eval 即沉淀判断 / 验证基础设施是最大投资方向"源本系列验证篇综合,证据级 Ⅳ〕

The kept judgment: the standard of "what counts as correct" itself — a judgment a human made after once noticing "Chinese punctuation leaking into an English UI looks bad." A machine can cheaply and deterministically enforce that standard a million times, but it will not set the standard for you; the standard is always a product of human taste and context. So this eval's value is not in automation per se but in how it amplified one scarce human judgment into unbounded cheap machine enforcement. [The eval-as-accumulated-judgment / verification-infrastructure-as-the-largest-investment claim is from this series' Verification synthesis, grade Ⅳ.]

检验信号Test of this sheet
把你最近一次"agent 帮上大忙又差点闯祸"的真实经历套进这四个追问:执行在哪变充裕、错误从哪开始滚、哪道结构拦住它、人保留了哪道判断。如果你答不出"哪道结构拦住它",说明你还在用人当实时验证器——那是案例一里注定失败的位置。如果你答不出"人保留了哪道判断",说明你要么把构成性判断也交了出去(危险),要么还没意识到自己其实一直在持有它(没沉淀)。
Fit your most recent real "the agent helped enormously and nearly caused harm" episode into the four interrogations: where did execution become abundant, where did the error start to roll, which structure caught it, which judgment did the human keep. If you cannot answer "which structure caught it," you are still using a human as a live verifier — the doomed position in Case 1. If you cannot answer "which judgment did the human keep," you have either handed off a constitutive judgment too (dangerous) or not yet noticed you were holding it all along (un-accreted).
ENG
17
LEGACY STRUCTURES · 旧结构的失效
LEGACY STRUCTURES
机理 · 旧结构为何在充裕下断裂
Mechanism · Why Old Structures Break

六种被供奉的工程结构,为何在执行充裕时反而成了瓶颈

Six revered engineering structures, and why they become the bottleneck when execution is abundant

这些不是稻草人——它们是过去三十年里被当作"最佳实践"供奉起来的真实结构。每一种都为"执行稀缺"的世界做过正确的优化:当写一行代码很贵、改一处很慢、人是产出的瓶颈时,它们都言之成理。问题在于约束反转了——执行变充裕、判断变稀缺之后,同一个结构开始把它本想保护的东西碾碎。下面逐个点名,讲机制,不讲情绪。

These are not straw men — they are real structures enshrined as "best practice" over the past thirty years. Each was a correct optimization for a world of scarce execution: when writing a line cost a lot, changing one was slow, and humans were the output bottleneck, every one of them made sense. The problem is that the constraint inverted — once execution is abundant and judgment is scarce, the same structure begins to crush the very thing it meant to protect. Below, each is named, with mechanism, not mood.

① 瀑布与"大设计先行"(BDUF)——规划视野的赌注下错了地方

① Waterfall and big-design-up-front (BDUF) — the planning-horizon bet placed in the wrong era

先说一个常被忘记的事实:Royce 1970 那篇被奉为瀑布起源的论文,其实是在警告单趟顺序模型有内在风险,并主张至少迭代两遍——后世把一张"反面教材"图当成了圣经〔R9〕。把它放一边,看机制本身:BDUF 把大量判断前置到信息最少的时刻(项目开端),赌的是"早期规划的价值 > 计划过期的损失"。这个赌注在执行稀缺时常常成立——既然落地要几个月,早想清楚省得返工。但执行一旦变充裕,落地从几个月坍缩到几小时,计划的保鲜期比落地周期还短:你花三周写的详尽设计,在 agent 三天就能试错完五个方案的世界里,落地前就过期了(见 ENG·13 规划视野坍缩)。BDUF 不是"做得不够细",是赌错了时代——它把判断花在了信息最少、且最快过期的那一刻。

First, a fact often forgotten: Royce's 1970 paper, revered as waterfall's origin, actually warned that the single-pass sequential model carried inherent risk and argued for iterating at least twice — posterity took a "cautionary diagram" as scripture [R9]. Set that aside and look at the mechanism: BDUF front-loads a mass of judgment to the moment of least information (project start), betting that "the value of early planning > the loss from plans expiring." That bet often held under scarce execution — if shipping takes months, thinking it through early saves rework. But once execution is abundant, shipping collapses from months to hours, and the plan's shelf life is shorter than the build cycle: the detailed design you spent three weeks on expires before it lands, in a world where an agent can trial-and-error five approaches in three days (see ENG·13, the collapsing planning horizon). BDUF is not "insufficiently detailed" — it bet on the wrong era, spending judgment at the moment of least information and fastest expiry.

② 代码评审当守门——批量、排队,与逐行读 diff 的注意力破产

② Code review as gatekeeping — batches, queues, and the attention bankruptcy of reading diffs line by line

把"高级工程师逐行读完每个 PR"当成质量的唯一闸门,在产出量小的时候是有效的——人读得过来。机制在产出量放大后断裂:当 agent 队伍一天产出五十个 PR,那道"必须有人逐行读完"的门立刻变成排队系统里的单点瓶颈。排队论说得很清楚——到达率逼近处理率时,排队时间非线性爆炸(Reinertsen《产品开发流的原则》论批量与队列)〔R10〕。结果有二:要么 PR 堆积、交付被这道门拖死;要么评审者为了清队列开始"橡皮图章",逐行读退化成扫一眼点通过——门还在,但已不挡任何东西。更深的问题是:逐行读 diff 检验的是"代码长什么样",而 agent 时代真正该守的是"测试有没有锁住意图、边界对不对、接缝合不合理"。把稀缺的高级判断耗在读语法上,是把人放回了被放大的执行里。

Treating "a senior engineer reads every PR line by line" as the sole quality gate works when output is small — a human can keep up. The mechanism breaks once output is amplified: when an agent fleet ships fifty PRs a day, the "a human must read every line" gate instantly becomes a single-point bottleneck in a queue. Queueing theory is blunt about this — as arrival rate approaches service rate, queue time explodes non-linearly (Reinertsen, The Principles of Product Development Flow, on batch size and queues) [R10]. Two outcomes follow: either PRs pile up and delivery dies at this gate, or reviewers start rubber-stamping to clear the queue, and line-by-line reading degrades into a glance-and-approve — the gate stands but blocks nothing. The deeper problem: reading diffs line by line checks "what the code looks like," whereas the thing actually worth guarding in the agent era is "do the tests lock the intent, are the boundaries right, do the seams make sense." Spending scarce senior judgment on reading syntax puts the human back inside the amplified execution.

不是取消评审,是改评审的对象。把门从"逐行读产物"移到"审规格与接缝"——评审 diff 不评审产物、以测试为目标交办(见 ENG·06),让独立验证器和 eval 去守语法与回归,人只在被结构标红处接管。这不是降低标准,是把同一份注意力投到它唯一不可替代的地方。

The fix is not abolishing review but changing what it reviews. Move the gate from "read the artifact line by line" to "review the spec and the seams" — review the diff not the artifact, delegate toward a test (see ENG·06), let independent verifiers and evals guard syntax and regressions, and have the human take over only where the structure flags red. This is not lowering the bar; it is investing the same attention where it is irreplaceable.

③ QA 当事后工序——把验证从产线末端搬回每一个节点

③ QA as an afterthought — moving verification from the line's end back into every node

"先让开发把功能堆完,再交给 QA 在末端集中测"是流水线思维的产物:验证是一道独立的、靠后的工序。这在执行稀缺时尚可忍受——产出慢,末端积压的待测量也小。执行充裕后这条假设直接崩塌:产线前端(生成)的吞吐涨了一个数量级,末端那道人力 QA 工序的吞吐没涨,中间的在制品(未验证代码)爆炸式堆积。更致命的是案例一的雪球机制——一个早期未验证的错误会被后续每一步当作既定前提放大;把验证拖到末端,等于让错误在被发现前先滚二十步,而越往后纠错越贵。

"Let developers pile up features first, then hand off to QA to test in bulk at the end" is a product-line mindset: verification is a separate, late station. This was tolerable under scarce execution — output was slow and the backlog at the end was small. Under abundant execution the assumption collapses outright: the front of the line (generation) gains an order of magnitude of throughput, the human-QA station at the end does not, and the work-in-progress between them (unverified code) piles up explosively. Worse is Case 1's snowball: an early unverified error gets amplified as a settled premise by every later step; deferring verification to the end lets the error roll twenty steps before discovery, and the later you correct, the more it costs.

机制上的修复是把验证从"末端工序"改成"每个节点的内建检查点"——这正是失败学那张双曲线图的论点(ENG·12):有检查点的轨迹趁错小就拦回近零,没检查点的指数放大成雪球。QA 不该是产线末端的一个部门,该是织进每个循环的独立验证器。质量不是末端检出来的,是每一步锁住意图锁出来的。

The mechanical fix is to change verification from an "end station" to a "built-in checkpoint at every node" — exactly the argument of the failure sheet's twin-curve figure (ENG·12): the checkpointed trajectory resets the error to near zero while it is small, the un-checkpointed one amplifies exponentially into a snowball. QA should not be a department at the end of the line but an independent verifier woven into each loop. Quality is not inspected in at the end; it is locked in, intent by intent, at every step.

④ "10x 工程师"神话——它量错了对象,而它量的那个对象正在被廉价化

④ The "10x engineer" myth — it measured the wrong thing, and that thing is now being made cheap

"10x 工程师"的说法可以一直追到 Sackman、Erikson、Grant 1968 那项实验:他们发现程序员之间编码时间差到约 20:1、调试时间差到约 25:1——但同一份数据里还有一条常被略去的发现:个人产出与经验年限无关〔R8〕。这个神话有两个结构性毛病。其一,它在方法上一直可疑(把汇编与高级语言的被试混在一起统计),后世研究虽反复确认"量级差异存在",但把它人格化成"存在一种 10 倍的天才个体"是过度解读。其二、也更要命:它当年量的"快"主要是个体的实现/打字吞吐——而这恰恰是 agentic coding 今天正在廉价化、去稀缺化的那一项能力。

The "10x engineer" phrase traces all the way back to the 1968 experiment by Sackman, Erikson, and Grant: they found coding-time differences between programmers of about 20:1 and debugging-time differences of about 25:1 — but the same data carried a finding often dropped: individual output had no relationship to years of experience [R8]. The myth has two structural defects. First, it was always methodologically suspect (it pooled assembly and high-level-language subjects), and while later studies repeatedly confirm that "order-of-magnitude differences exist," personifying that into "there exists a 10x genius individual" is over-reading. Second, and more fatal: the "fast" it measured back then was mainly individual implementation / typing throughput — precisely the capability that agentic coding is now making cheap and un-scarce.

机制:当瓶颈从"打字快"移开,"打字快 10 倍的人"这个优势项就贬值了。放大效应今天落在另一处——能把判断、品味、上下文与边界拿捏到位的人,通过指挥一支 agent 队伍,产出可以是十倍、百倍。但这已经不是 1968 年那个"个体英雄"的故事,而是内核第②步的故事:稀缺的不再是谁打字快,是谁的判断对。继续招聘、考核、神化"10x 个体的实现速度",等于在为一个正在消失的瓶颈优化。

Mechanism: once the bottleneck moves off "types fast," the advantage of "the person who types 10x faster" depreciates. The amplification now lands elsewhere — a person who can get judgment, taste, context, and boundaries right can, by directing an agent fleet, produce ten or a hundred times the output. But this is no longer the 1968 "individual hero" story; it is the story of kernel step ②: what is scarce is no longer who types fast but whose judgment is right. To keep hiring, grading, and mythologizing "the 10x individual's implementation speed" is to optimize for a vanishing bottleneck.

⑤ 工单工厂——把工程师当吞吐单元,正好优化掉了唯一还稀缺的东西

⑤ The ticket factory — treating engineers as throughput units optimizes away the one thing still scarce

工单工厂式团队把工程组织成一条流水线:需求拆成颗粒度均匀的工单,工程师是可互换的吞吐单元,绩效=单位时间关掉的工单数。这套结构服务的是"执行稀缺、产出量是瓶颈"的世界——最大化人均代码吞吐确实合理。问题:它系统性地优化掉了在 AI 充裕下唯一还稀缺的东西。把人当吞吐单元,意味着不奖励、甚至惩罚那些"慢下来想清楚什么算对""花时间画对一道接缝""停下来质疑这个工单本身要不要做"的行为——而这些恰恰是 agent 替不了、且正在变成全部价值所在的判断节点。

A ticket-factory team organizes engineering as an assembly line: requirements split into uniformly grained tickets, engineers as interchangeable throughput units, performance measured as tickets closed per unit time. This structure serves a world of "scarce execution, output as the bottleneck" — maximizing per-head code throughput is genuinely reasonable there. The problem: it systematically optimizes away the one thing still scarce under AI abundance. Treating people as throughput units means not rewarding — even penalizing — the behaviors of "slowing down to get clear on what counts as correct," "spending time to draw a seam right," "stopping to question whether this ticket should be built at all" — which are exactly the judgment nodes an agent cannot replace and which are becoming where all the value sits.

机制:你考核什么,就得到什么。用关单速度考核,得到的是更快地生成更多"看起来完成了"的代码,以及一支没人对整体负责、没人持有构成性判断的队伍。当执行变得几乎免费,"更快关更多单"的边际价值趋零,而"这单到底该不该做、做对了没"的判断变成了全部。工单工厂不是被 agent 取代,是被自己的考核函数掏空了——它奖励的那一项,正是 agent 现在白送的。

Mechanism: you get what you measure. Grade on ticket-closing speed and you get faster generation of more "looks-done" code, and a team where nobody owns the whole and nobody holds the constitutive judgment. When execution becomes nearly free, the marginal value of "close more tickets faster" trends to zero, while the judgment of "should this ticket be built at all, and was it built right" becomes everything. The ticket factory is not replaced by the agent; it is hollowed out by its own scoring function — the very thing it rewards is what the agent now gives away for free.

⑥ 每一步都人工审批——把人海塞回放大了一万倍的执行洪流里

⑥ Manual approval at every step — pouring a crowd of humans back into a flood of execution amplified ten-thousandfold

"每个变更、每次部署、每条命令都要人点确认"是用审批密度换安全感的旧反射。它的隐含模型是"变更很少,所以每个都值得人看一眼"。执行充裕把这个前提炸了:当 agent 一天发起一万次操作,"每一步都人审"在算术上不可能——人不是不够勤奋,是吞吐量差了三四个数量级。强行维持的结果只有两条,都坏:要么审批成为吞吐瓶颈,把 AI 的全部速度优势抵消干净(你买了辆跑车却规定每过一个路口都要下车推);要么人为了跟上而进入"无脑点同意"模式——审批框还在弹,但已经没有任何判断发生,纯粹是仪式。这又一次撞上自动化的反讽:绝大多数操作都没问题,人的警觉性在第一千次点同意时早已归零,恰在第一万零一次那个真该拦下的操作上,人照样点了同意。

"Every change, every deploy, every command needs a human to click confirm" is the old reflex of trading approval density for a sense of safety. Its implicit model is "changes are rare, so each is worth a human glance." Abundant execution detonates that premise: when an agent initiates ten thousand operations a day, "a human reviews every step" is arithmetically impossible — not for lack of diligence but because throughput is off by three or four orders of magnitude. Forcing it yields only two outcomes, both bad: either approval becomes the throughput bottleneck and cancels out all of AI's speed (you bought a sports car but mandated getting out to push it through every intersection), or the human, to keep up, enters "mindlessly click approve" mode — the dialog still pops but no judgment happens, it is pure ritual. This again hits the irony of automation: the vast majority of operations are fine, the human's vigilance hit zero around the thousandth approval, and on the ten-thousand-and-first — the one that truly should be stopped — the human clicks approve all the same.

机制上的修复是从"密度"切到"分级":按可逆性 × 爆炸半径分档(正是 INSTRUMENT 07 做的):低半径、易回退的操作完全交给结构(类型/测试/lint)、不打扰人;只有不可逆 × 高半径那一档才设显式确认门。这把人的稀缺确认从"每一步都点、点到麻木"重新聚焦到"一年只点几次、但每次都真的在判断"。审批不是越多越安全——审批密度过高反而通过警觉性衰减制造了不安全。安全来自把人放在对的那几个节点,不是放在所有节点。

The mechanical fix is to switch from "density" to "tiering": grade by reversibility x blast radius (exactly what INSTRUMENT 07 does): low-radius, easily-reverted operations are handed fully to structure (types / tests / lint) and never bother a human; only the irreversible x high-radius tier gets an explicit confirmation gate. This refocuses the human's scarce confirmation from "click every step into numbness" to "click a few times a year, but actually judging each time." More approval is not more safe — excessive approval density manufactures unsafety through vigilance decay. Safety comes from putting the human at the right few nodes, not at all nodes.

核心图KEY FIGFIG. E17.0 / THE INVERSION · 同一结构,约束反转前后 看懂:横轴是约束(执行稀缺→充裕),每条线是同一个旧结构从"最佳实践"滑向"瓶颈"的轨迹 Read: the x-axis is the constraint (execution scarce → abundant); each line is one old structure sliding from "best practice" to "bottleneck"
价值高 high value 成瓶颈 bottleneck 执行稀缺(旧世界) execution scarce (old world) 执行充裕(现在) execution abundant (now) 约束反转点 inversion point 瀑布/BDUF · 每步审批 waterfall/BDUF · per-step approval 评审守门 · QA 事后 review gatekeeping · late QA 10x 神话 · 工单工厂 10x myth · ticket factory 同一结构在这里是对的 the same structure is right here 约束反转后,它开始碾碎它本想保护的东西 after inversion, it crushes what it meant to protect
这六种结构不是"愚蠢"——它们曾经是对的。它们的失效是同一个机制:都为"执行稀缺"做过正确优化,而约束反转后,为旧瓶颈做的优化在新瓶颈面前变成了阻碍。批判它们不是为了嘲笑过去,是为了认出"哪些做法的有效期已经过了"。These six structures are not "stupid" — they were once right. Their failure shares one mechanism: each was a correct optimization for scarce execution, and after the constraint inverted, an optimization for the old bottleneck became an obstacle in front of the new one. Critiquing them is not to mock the past but to recognize which practices have passed their expiry.
INSTRUMENT 12 · 错误复利模拟器 INSTRUMENT 12 · Error-Compounding Simulator ● LIVE
检验信号Test of this sheet
挑你团队里最神圣不可侵犯的一条工程流程,问三个问题:它当年优化的是不是"执行稀缺"?那个约束今天还成立吗?如果不成立,它现在是在保护质量,还是在你产能上来之后变成了那道堵住一切的门?如果三个问题里有两个让你不舒服,你大概率正在供奉一个过期结构。注意:答案几乎从不是"取消它",而是"把它守的对象,从执行换成判断"。
Take the most sacred, untouchable engineering process on your team and ask three questions: was it optimizing for "scarce execution"? Does that constraint still hold today? If not, is it still protecting quality, or has it become the gate that jams everything once your capacity rose? If two of the three make you uncomfortable, you are probably enshrining an expired structure. Note: the answer is almost never "abolish it" but "switch what it guards from execution to judgment."
ENG
12
SPECULATION · 推演幕
SPECULATION · The Speculation Act
推论 · 外推,非事实
Inference · Extrapolation, Not Fact

2026-2032:工程这门手艺的下一道折痕

2026 to 2032: The Next Fold in the Craft of Engineering

这一幕不预测哪条线会发生,而是张开一个可能性空间:哪些曲线在汇流、各自的先行指标与证伪条件、以及若它们走到底,工程师这份工作会被折成什么形状。

This act does not predict which line occurs; it opens a possibility space — which curves are converging, the leading indicators and falsification conditions of each, and what shape the engineer's job gets folded into if they run to completion.

本章性质 · 推论以下是基于 2024-2026 公开轨迹的外推,不是事实陈述。本卷愿意接受的证伪条件随每条曲线列出——当外推被现实推翻时,本章应当最先被改写。
Nature of this chapter · InferenceWhat follows is extrapolation from the public trajectory of 2024-2026, not a statement of fact. The falsification conditions this volume will accept are listed with each curve — when reality overturns the extrapolation, this chapter should be the first thing rewritten.

推演不是畅想,它接着本卷的内核往前推。前十一张已经确立:执行变充裕,判断沿可验证性梯度退守到意图、约束、验证三处,上下文成为可查询的基础设施,人回到意义。推演幕只问一句——若这条内核再走六年,杠杆点会从今天停的那一层(context / harness)继续上移到哪里,而工程师又会站在哪一层?这一节给出的不是答案,是一张标了证伪条件的地图。〔源 本卷内核四步(SHEET 01)的时间轴外推,证据级 Ⅴ 推论〕

Speculation is not daydreaming; it pushes the volume's kernel forward. The first eleven sheets established it: execution becomes abundant, judgment retreats along the verifiability gradient to intent, constraints, and verification, context becomes queryable infrastructure, and people return to meaning. The speculation act asks only one thing — if that kernel runs another six years, where does the leverage point move from the floor it stands on today (context / harness) up to next, and which floor does the engineer then stand on? What this sheet offers is not an answer but a map with falsification conditions written on it. [Source: a timeline extrapolation of this volume's four-step kernel (SHEET 01), grade Ⅴ inference.]

先钉一个量过的锚:能力涨,体感骗人

First, an anchor that was measured: capability rises, the felt sense lies

推演要诚实,得先承认一个被实测反驳过的事实。2025 年 METR 做了一项随机对照试验:让 16 名资深开源开发者在自己熟悉的成熟仓库上完成 246 个真实任务,随机分成"允许用 AI 工具"与"不许用"两组。结果与几乎所有人的预期相反——用 AI 那组平均慢了约 19%,而开发者自己事前预计会快 24%、事后仍以为快了约 20%。也就是说,在一类特定场景(资深者 × 高度熟悉的大型代码库)里,当下这代工具实际拖慢了人,却制造出"我更快了"的强烈错觉。这一条不削弱本卷的命题,反而是它最锋利的脚注:生成变快 ≠ 交付变快,省下的打字时间会被找回、读懂、纠正生成物的判断时间吃掉——除非把判断也搬到验证基础设施里。把这条放在推演幕开头,是为了让后面所有"会更快、会更强"的外推都带着这条体感校正阅读。〔源 METR 2025《Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity》,随机对照试验,证据级 Ⅱ 受控实测;单项研究、特定人群与仓库,尚未广泛复现,故不外推到全部场景[R7]

For speculation to stay honest, it must first own a fact that got slapped down by measurement. In 2025 METR ran a randomized controlled trial: 16 experienced open-source developers completed 246 real tasks on repositories they knew well, randomized into "allowed to use AI tools" versus "not allowed." The result ran against almost everyone's expectation — the AI group was on average about 19% slower, while the developers had forecast a 24% speedup beforehand and still believed they had been about 20% faster afterward. In one specific setting (experienced developers × large, highly familiar codebases), this generation of tools actually slowed people down while manufacturing a strong illusion of "I am faster." This does not weaken the volume's thesis; it is its sharpest footnote: faster generation ≠ faster delivery, and the typing time saved is eaten by the judgment time of finding, reading, and correcting the generated output — unless that judgment is also moved into verification infrastructure. Nailing this at the top of the speculation act forces every downstream "will be faster, will be stronger" extrapolation to be read with this felt-sense correction attached. [Source: METR 2025, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," randomized controlled trial, grade Ⅱ controlled measurement; a single study on a specific population and codebase type, not yet widely replicated, so not extrapolated to all settings. [R7]]

三条正在汇流的曲线Three Converging Curves

Three Converging Curves

AI-Native 工程的未来不止是"模型更聪明"。它背后是三条独立成熟、正在汇流的曲线——每一条松动一组今天的约束,三条叠加决定了推演空间的边界。给每条都附一个证伪条件:若该观测出现,这条曲线就停在原地,本章对应的外推作废。

The future of AI-Native engineering is more than "models getting smarter." Behind it are three curves maturing independently and now converging — each loosening one set of today's constraints; their superposition sets the boundary of the speculation space. Each carries a falsification condition: if that observation appears, the curve stalls and this chapter's corresponding extrapolation is void.

智能体编码经济学 · AGENTIC-CODING ECONOMICS
Agentic-coding economics · AGENTIC-CODING ECONOMICS
解锁Unlocks单位代码生成的边际成本趋近一次推理的电费;长时程自主任务(数小时连续运行、自带回滚)从演示走向常驻。当"试一个方案"几乎免费,工程的瓶颈彻底从"能不能写出来"移到"这个方向对不对、改错了赔多少"。The marginal cost of a unit of generated code approaches the electricity of one inference; long-horizon autonomous tasks (hours of continuous running with built-in rollback) move from demo to resident. When "try an approach" is nearly free, the engineering bottleneck shifts wholesale from "can it be written" to "is the direction right, and what does getting it wrong cost."
TRL早期商用 2025 已有按任务计价的编码 agent;多小时无人值守仍不可靠、回滚与归因未成熟。Early commercial Per-task-priced coding agents exist by 2025; multi-hour unattended runs are still unreliable, and rollback and attribution are immature.
证伪Falsified if若推理单价停止下降(算力/电力封顶),或长时程任务的错误率不随上下文工程改善而下降,则"生成近免费"反转,本曲线停在按需调用的助手态。If inference unit price stops falling (compute or power hits a ceiling), or long-horizon error rates do not fall as context engineering improves, then "near-free generation" reverses and this curve stalls at the on-demand-assistant stage.
脚手架标准化 · HARNESS STANDARDIZATION
Harness standardization · HARNESS STANDARDIZATION
解锁Unlocks围绕模型的脚手架(guides/sensors、上下文装配、工具协议)从各家自建走向有公共契约:MCP 类协议、可移植的 skills/commands 包、跨厂商的 agent 评测基准。harness 一旦像编译器工具链那样标准化,团队级的"会自我改进的循环"就能被打包、继承、交易,而不必每队从零搭。The scaffolding around the model (guides/sensors, context assembly, tool protocols) moves from each shop building its own toward public contracts: MCP-class protocols, portable skills/commands packs, cross-vendor agent benchmarks. Once the harness standardizes the way a compiler toolchain did, a team-level "self-improving loop" can be packaged, inherited, and traded rather than rebuilt from scratch by every team.
TRL协议萌芽 2025 MCP 等协议落地、采用快增;公共评测与可移植 harness 包仍稀缺。Protocol nascent Protocols like MCP landed in 2025 with fast adoption; public benchmarks and portable harness packs remain scarce.
证伪Falsified if若主要厂商各自圈地、协议碎片化且互不兼容,harness 永远绑死在单一平台,则"可移植、可交易的循环"不成立,团队仍困在重复造脚手架。If major vendors each fence off their own turf and protocols fragment incompatibly, the harness stays welded to a single platform, "portable, tradeable loops" fail to materialize, and teams stay trapped rebuilding scaffolding.
持续式 AI 流水线 · CONTINUOUS-AI PIPELINES
Continuous-AI pipelines · CONTINUOUS-AI PIPELINES
解锁UnlocksCI/CD 之后是 "Continuous AI":agent 常驻在仓库里,对每次提交自动分诊 issue、补测试、起 PR、做评审。工程组织的产出从"人写、机器跑测试"翻转为"机器写、机器先验、人定方向与守门"。验证基础设施(SHEET 11 的承重墙)从可选项变成这条流水线能不能合上的前提。After CI/CD comes "Continuous AI": agents reside in the repository, auto-triaging issues, backfilling tests, opening PRs, and reviewing on every commit. An engineering org's output flips from "humans write, machines run tests" to "machines write, machines pre-verify, humans set direction and hold the gate." Verification infrastructure (SHEET 11's load-bearing wall) turns from optional into the precondition for whether this pipeline can close at all.
TRL早期试点 2025 已有仓库挂常驻 agent 做 issue 分诊与 PR;规模化下的信噪比与责任归属未解。Early pilot By 2025 some repos run resident agents for issue triage and PRs; signal-to-noise and accountability at scale are unsolved.
证伪Falsified if若常驻 agent 产出的 PR/评审噪音持续淹没真信号、人审负担不降反升,或独立验证无法跟上生成速度,则流水线合不上,回到人工把关的批处理节奏。If resident-agent PR/review noise keeps drowning real signal and human-review burden rises rather than falls, or independent verification cannot keep pace with generation, the pipeline fails to close and reverts to a human-gated batch rhythm.

沿这三条曲线推的时间轴A Timeline Along Those Curves

A Timeline Along Those Curves

FIG. 12.0 / 2026-2032 · ENGINEERING SPECULATION ARC 看懂:四个时间桩,看判断瓶颈逐层上移到哪一层。 Read: four time-stakes; watch which floor the judgment bottleneck climbs to.
判断瓶颈所在层 ↑ judgment-bottleneck floor ↑ 2026 2028 2030 2032 上下文工程 context engineering 脚手架/循环 harness / loop 舰队编排 fleet orchestration 意图与守门 intent & gate-keeping agent 写函数,人审 diff agents write fns, humans review diffs 循环可打包,按任务计价 loops packageable, per-task priced 常驻 agent 群跑流水线 resident agent fleets run pipelines 人只持意图/约束/验证 humans hold only intent/constraints/verification 注:曲线是外推(证据级 Ⅴ);唯一实测锚点是 METR 2025 RCT(证据级 Ⅱ),它提醒体感快 ≠ 真的快。 Note: the curve is extrapolation (grade Ⅴ); the only measured anchor is the METR 2025 RCT (grade Ⅱ), warning that felt-fast ≠ actually fast.
这条上升的阶梯不是"工程师被一层层取代",而是同一个判断瓶颈逐层上移:2026 人还在审 diff;2028 循环本身成了可打包的资产,人审的是"该不该装这条循环";2030 你管的是一群常驻 agent,判断从单次产出移到舰队级的方向与守门;2032 若三条曲线都走到底,人手里只剩内核第②③④步那三件——意图、约束、验证——其余系统自己长。每一桩都带着 FIG 底部那条 METR 校正:能力上移不等于体感诚实,越往上越要靠实测而非感觉确认自己真的更快。
This rising staircase is not "engineers replaced floor by floor" but the same judgment bottleneck climbing floors: in 2026 you still review diffs; by 2028 the loop itself becomes a packageable asset and you judge "should this loop be installed"; by 2030 you manage a fleet of resident agents, with judgment moved from single outputs to fleet-level direction and gate-keeping; by 2032, if all three curves run to completion, the human holds only the three things from kernel steps ②③④ — intent, constraints, verification — and the rest the system grows. Every stake carries the METR correction at the foot of the figure: capability climbing does not make the felt sense honest; the higher you go, the more you must confirm you are actually faster by measurement, not by feel.

来自那些年份的两件文物Two Artifacts from Those Years

Two Artifacts from Those Years

推演若只有论断会显得抽象。下面两件是 design fiction——明确虚构的未来文物,用来让"判断密度的工程师"这个抽象命题可触。它们不是预测,是把内核投影到 2032,做成你能拿在手里翻看的东西。

Speculation made only of assertions would feel abstract. The two pieces below are design fiction — explicitly fictional future artifacts that make the abstract claim "an engineer measured by judgment density" touchable. They are not predictions; they project the kernel onto 2032 as something you can hold and leaf through.

SPECULATIVE · 虚构 · Fiction
ARTIFACT 01 · 2032 招聘启事 · A 2032 Job Posting
资深工程师(舰队判断岗)· 招聘启事节选
Staff Engineer (Fleet Judgment) — Job Posting (Excerpt)
岗位职责
为一支约 40 个常驻编码 agent 的舰队持有意图、约束与验证。你不再逐行实现;你写规格、设守门标准、在 agent 之间裁决方向冲突。
What you'll do
Hold intent, constraints, and verification for a fleet of roughly 40 resident coding agents. You no longer implement line by line; you write specs, set gate-keeping criteria, and adjudicate directional conflicts among agents.
硬性要求
能把一类质量判断翻译成一条会变红的 eval;能读懂自己看不懂的领域里"哪里该停下来问人";对"自信而错"有生理级的警觉。要求每天写多少行代码。
Must-haves
Able to translate a class of quality judgment into an eval that turns red; able to sense, in a domain you don't fully understand, "where to stop and ask a human"; a physiological alertness to confident wrongness. We do not ask how many lines of code you write per day.
考核口径
判断密度(你守住的判断节点数)与方向正确度,而非产出量——印证 SHEET 06"实现者 → 编排者"的 2032 岗位形态。
How you're measured
Judgment density (the number of judgment nodes you hold) and directional correctness, not output volume — the 2032 form of SHEET 06's "implementer → orchestrator."
不招的人
把"会用 agent 快速产出大量代码"当卖点的候选人。我们见过那条曲线尽头:未经独立验证的高速产出,是债,不是资产。
Who we won't hire
Candidates who pitch "I can use agents to produce a lot of code fast" as their selling point. We've seen the end of that curve: high-speed output without independent verification is debt, not an asset.
SPECULATIVE · 虚构 · Fiction
ARTIFACT 02 · 2032 事故复盘 · A 2032 Incident Postmortem
事故复盘:一支 agent 舰队的静默回归
Postmortem: A Silent Regression Across an Agent Fleet

2032-02,一条被多支 agent 共享的"快速修复"skill 里潜入一个错误假设:它在重试失败的网络调用时悄悄吞掉了一类超时异常。23 支常驻 agent 在四天里把这条 skill 复用进了 1,900 多个 PR,全部通过了既有 eval——因为没有任何一条 eval 覆盖"被吞掉的超时"。事故不是某个 agent"变笨",而是一次教科书式的承重墙缺口:生成高速复利,而验证墙在这一类错误上恰好是空的。

In February 2032, a shared "quick-fix" skill used by multiple agents carried a buried wrong assumption: when retrying failed network calls it silently swallowed a class of timeout exceptions. Over four days, 23 resident agents reused this skill into more than 1,900 PRs, all of which passed the existing evals — because no eval covered "a swallowed timeout." The incident was not any one agent "getting dumber" but a textbook gap in the load-bearing wall: generation compounded at high speed while the verification wall happened to be empty on exactly this class of error.

根因
共享 skill 把一类判断("超时不能静默")当成了实现细节,没有沉淀成独立的 eval。复利的不只是 skill,还有它携带的盲区。
Root cause
The shared skill treated a class of judgment ("timeouts must not be silent") as an implementation detail, never distilled into an independent eval. What compounded was not only the skill but the blind spot it carried.
修复
补一条会变红的 eval;按 steering loop,问"哪条 sensor 本该拦住它",把答案磨进共享 harness——这条 skill 此后在全舰队携带它自己的检验。
Fix
Add an eval that turns red; following the steering loop, ask "which sensor should have caught this" and grind the answer into the shared harness — this skill now carries its own check across the whole fleet.

本卷记下对自己的反向赌注The Counter-Bet, on the Record

The Counter-Bet, on the Record

诚实的推演要把反对自己的最强论点也记下来。本卷的命题是:执行变充裕、判断退守到验证,于是验证基础设施成为最大投入方向、工程师价值上移到判断。反向赌注是这样反驳的——这套"判断上移"的图景,可能只是一类特定工作(资深者、成熟代码库、判断本就稀缺)的局部真相,而非工程的普遍走向。METR 那条实测已经露了口子:在那个场景里,当下工具不仅没让人更快,还制造了"更快"的错觉;若这种"体感快、实则慢"在更多场景里成立,那么"判断退守、验证补位"可能根本来不及发生——团队会先被一波看起来高效、实则注入海量未验证债务的生成淹没,在验证墙长高之前就崩盘。更尖锐地说:也许瓶颈从来不是"执行稀缺",而是"判断稀缺",而 AI 恰好不放大判断、只放大执行——那本卷"把判断带宽重新投到验证"的处方,对一个判断本就不足的团队就是空头支票,因为它们缺的正是开出这张支票所需的判断。这个反向赌注怎样被证实:若三五年后,采用 agentic 编码最激进的团队,其线上事故率、返工率、技术债指标系统性地高于克制使用的团队,且差距不随"补验证"而收敛,那么是本卷错了,不是它们错了。把这条写在这里,是因为一个不敢记录自己证伪条件的方法论,不配被照着做。〔源 反向论点综合:METR 2025 RCT(证据级 Ⅱ)+ "判断稀缺而非执行稀缺"的对立假设(证据级 Ⅴ 推论)[R7]

Honest speculation records the strongest argument against itself too. This volume's thesis: execution becomes abundant, judgment retreats to verification, so verification infrastructure becomes the largest investment direction and the engineer's value climbs to judgment. The counter-bet rebuts it like this — this "judgment climbs" picture may be only a local truth of one specific kind of work (experienced people, mature codebases, judgment already scarce), not the general direction of engineering. The METR measurement already shows the crack: in that setting, current tools not only failed to make people faster but manufactured the illusion of "faster"; if this "felt-fast, actually-slow" holds across more settings, then "judgment retreats, verification fills in" may never get the chance to happen — teams will first be drowned by a wave of generation that looks efficient but injects mountains of unverified debt, collapsing before the verification wall has grown tall. More sharply: perhaps the bottleneck was never "execution is scarce" but "judgment is scarce," and AI happens to amplify execution, not judgment — in which case this volume's prescription to "reinvest judgment bandwidth into verification" is a bad check to a team that was short on judgment to begin with, because what they lack is exactly the judgment needed to write that check. How this counter-bet gets confirmed: if, in three to five years, the teams most aggressive in adopting agentic coding show systematically higher production-incident rates, rework rates, and tech-debt metrics than restrained teams, and the gap does not converge as they "add verification," then the volume is wrong, not them. It is written here because a methodology that dares not record its own falsification condition does not deserve to be followed. [Source: counter-argument synthesis — the METR 2025 RCT (grade Ⅱ) plus the rival hypothesis that judgment, not execution, is the scarce factor (grade Ⅴ inference). [R7]]

读这一幕的方式How to Read This Act

How to Read This Act

不要把这一幕当路线图照搬,把它当一张带刻度的赌桌:三条曲线是赌注,每条都标了"什么观测会让我认输"。你能做的,是盯住那几个先行指标——推理单价的斜率、harness 协议是收敛还是碎裂、你自己团队里"被自动复检的改动占比"是涨是停——用它们校准自己站在那道阶梯的哪一桩上,而不是听任体感替你判断。这一幕唯一确定的事,是它会被改写;本卷愿意第一个改它。

Do not copy this act as a roadmap; treat it as a betting table with a scale on it: the three curves are wagers, each labeled with "what observation makes me concede." What you can do is watch those few leading indicators — the slope of inference unit price, whether harness protocols converge or fracture, whether your own team's "share of changes that get auto-rechecked" rises or stalls — and use them to calibrate which stake of the staircase you stand on, rather than letting the felt sense judge for you. The only certain thing about this act is that it will be rewritten; this volume volunteers to be the first to rewrite it.

ENG
11
PLAYBOOK · 落地
PLAYBOOK
行动 · 可执行
Action

落地 · 工具是表层,原理是底层

Rollout · tools are surface, principles are the floor

前面所有"为什么这个工具放大"都能归到五条贯穿原理。记住原理,工具换了也不慌——下一个 Markdown、下一个 TypeScript 出现时,你用同一把尺就能认出它。

Every "why this tool gets amplified" above reduces to five through-lines. Hold the principles and you are unfazed when tools change — when the next Markdown or the next TypeScript appears, the same ruler recognizes it.

五条贯穿原理

Five through-lines

这五条不是并列的口号,它们之间有依赖顺序,连起来就是前面所有图纸的压缩。可读是地基——agent 读不懂的东西,后面四条都无从谈起(ENG·02)。在可读之上,可 diff / 可版本 / 可审把"改动"变成可评审的提交而非不可追溯的覆盖(ENG·02 / 06)。再上一层,机器可检验的规格给生成一个目标函数,让对错能自动收敛(ENG·03 / 07)。可组合的能力接口(skills / MCP / CLI)让 agent 能安全地扩展自己的触达范围,且每个接口都被单独授权(ENG·04 / 08)。最后,会自我改进的循环把前四条缝成一个随产出复利的系统(ENG·04 / 05)。贯穿五条的是同一句话:人和 agent 读写同一份源。认得这五条,下一个 Markdown、下一个 TypeScript 出现时,你不用追工具新闻,用同一把尺就能判断它会不会被放大。

These five are not parallel slogans; they have a dependency order, and chained together they are the compression of every sheet above. Legibility is the foundation — what an agent cannot read makes the other four moot (ENG·02). On top of it, diffable / versionable / reviewable turns a change into a reviewable commit rather than an untraceable overwrite (ENG·02 / 06). A layer up, machine-checkable specs give generation an objective function so correctness can converge on its own (ENG·03 / 07). Composable capability interfaces (skills / MCP / CLI) let the agent safely extend its reach, each interface authorized separately (ENG·04 / 08). Finally, self-improving loops stitch the first four into a system that compounds with output (ENG·04 / 05). Running through all five is one line: humans and agents read and write the one source. Recognize these five and you need not chase tool news — when the next Markdown or the next TypeScript appears, the same ruler tells you whether it will be amplified.

三原则 + 三指标

Three principles + three metrics

01 / ↓
死磕狗粮 · 上手爬坡↓Dogfood · ramp↓
人人用自己的产品;新人多快有效产出。Everyone uses the product; how fast a newcomer becomes effective.
02 / ↓
尽量扁平 · PR 周期↓Stay flat · PR cycle↓
经理先做 IC;PR 周期暴露管线短板。Managers start as ICs; PR cycle time surfaces pipeline strain.
03 / ↑
杀死死流程 · Claude 提交↑Kill dead process · Claude commits↑
不断追问"为何还这么做"。但别把吞吐当成功。Keep asking "why still this way." But don't mistake throughput for success.

三个指标都带一条反指标,因为指标会被玩坏。上手爬坡时间↓是真信号——上下文成了基础设施,新人第一周就能交付;但若靠降低交付标准来缩短爬坡,那是把指标做漂亮、把质量做没了。PR 周期↓暴露管线短板,但若靠绕过评审门来提速,等于拆了承重墙。"Claude 提交占比↑"最危险:它极易被当成成功的代名词,但吞吐不是成功——生成一万行没人需要、没人验证的代码,比写一百行对的代码更糟。所以每个指标真正的读法是"它在不在以正确的方式上升",而不是"它有没有上升"。

Each of the three metrics carries a counter-signal, because metrics get gamed. Ramp-time↓ is a true signal — when context is infrastructure, a newcomer ships in week one; but shortening the ramp by lowering the delivery bar makes the metric pretty and the quality gone. PR-cycle↓ surfaces pipeline strain, but speeding it up by bypassing the review gate is tearing out a load-bearing wall. "Claude-commit share↑" is the most dangerous: it is easily taken as a synonym for success, but throughput is not success — generating ten thousand lines no one needs and no one verified is worse than writing a hundred correct ones. So the real reading of each metric is "is it rising the right way," not "is it rising."

最后,把这一卷倒过来读。前面十二张图纸都在回答"怎么造"——可读、可验、可组合、自改进。但和组织卷一样,在"怎么造"之前有一个更先的问题:"为何造"。效率四百年来被默认成目标本身;AI 第一次让效率变充裕,于是不必再把效率本身当成终点。若把验证、规格、harness 优化到极致,却让工程师沦为给生成流水线点"通过"的人肉橡皮图章,那只是把工程带回了吞吐逻辑。这一卷的全部技术,是为了把人从打字和吞吐里解放出来,回到只有人能做、也值得人去做的判断与建造:决定该造什么、何为对、接缝放哪。把可验证性做到极致,正是为了让工程师回到工程师真正该做的判断——决定该造什么、何为对、接缝放哪。

Finally, read this volume in reverse. The twelve sheets above all answer "how to build" — legible, verifiable, composable, self-improving. But as in the organization volume, before "how to build" sits an earlier question: "what to build it for." For four centuries efficiency was assumed to be the goal itself; AI makes efficiency abundant for the first time, so it need no longer be treated as the goal at all. If verification, specs, and the harness are optimized to the limit while the engineer is reduced to a human rubber stamp clicking "approve" on a generation pipeline, engineering has merely been pulled back into throughput logic. The entire technique of this volume exists to free people from typing and throughput and return them to the judgment and building only people can do and that is worth doing: deciding what to build, what is correct, where the seams go. Pushing verifiability to the limit is in service of returning the engineer to the judgment that is actually theirs — deciding what to build, what is correct, where the seams go.

深潜页

Deep-dive chapters

架构篇ARCHITECTURE
实现充裕后,架构是不让生成坍缩成技术债的稀缺结构。When implementation is abundant, architecture is the scarce structure that keeps generation from tech debt.
谱系篇X-ENG STACK
prompt→context→spec→harness→loop 是一栋楼,杠杆逐层上移。prompt→context→spec→harness→loop is one building; leverage climbs.
验证篇VERIFICATION
生成近免费、验证依旧贵——验证是唯一瓶颈。Generation is near-free, verification is not — the one bottleneck.

五条原理与新增四张图纸的关系

How the five through-lines relate to the four new sheets

把这一卷新增的四张图纸(失败学、即时规划、信任边界、evals 承重墙)放回那五条贯穿原理里,会看到它们不是新增的并列条目,而是五条原理在更具体处的展开,这也是检验这卷是否自洽的一道内部一致性测验。失败学(ENG·12)是"机器可检验的规格"和"会自我改进的循环"的——它讲清了为何非验不可,给前面所有"怎么验"提供了失败学底座。即时规划(ENG·13)是"会自我改进的循环"在时间维度上的纪律——它把"凭信号迭代"从一种态度落成可证伪的规划方式。信任边界(ENG·14)是"可组合的能力接口"的安全面——它说明了为何每个接口要单独授权,否则可组合就成了可失控。evals 承重墙(ENG·15)则是"会自我改进的循环"的承重结构本身——它讲清了那道墙怎么垒、为何会复利。四张新图纸于是都挂在同一条主干上:可读→可 diff→可机检→可组合→自改进,而它们各自把这条主干上某一节点处的机制讲深了。这正是这一卷区别于工具清单的地方——工具会过期,但"为何这五条会被放大、它们之间如何依赖、每一条在哪里最容易失败"这套原理不会。可证伪信号:若你能找到本卷中任何一张图纸,它讲的机制无法被归到这五条原理之一、也不被它们解释,那说明要么那张图纸是游离的(该删或该接上),要么这五条原理本身不完备(该补)——这道自检本身就是这卷"承重命题可被证伪"的体现。

Place this volume's four new sheets (failure taxonomy, JIT planning, trust boundary, the eval wall) back into the five through-lines and you see they are not new parallel entries but those five principles unfolded at more concrete points — an internal-consistency test of whether the volume coheres. The failure taxonomy (ENG·12) is the cause behind "machine-checkable specs" and "self-improving loops" — it makes clear why verification is non-optional and gives every earlier "how to verify" its failure-mode floor. JIT planning (ENG·13) is the "self-improving loop" disciplined on the time dimension — it lands "iterate on signal" from an attitude into a falsifiable way to plan. The trust boundary (ENG·14) is the security face of "composable capability interfaces" — it shows why each interface must be authorized alone, or composable becomes uncontrolled. The eval wall (ENG·15) is the load-bearing structure of the "self-improving loop" itself — it makes clear how the wall is laid up and why it compounds. In other words, all four new sheets hang on one trunk: legible → diffable → machine-checkable → composable → self-improving, and each deepens the mechanism at one node on that trunk. This is exactly where the volume differs from a tool list — tools expire, but the principle of "why these five get amplified, how they depend on each other, and where each most easily fails" does not. Falsifiable signal: if you can find any sheet in this volume whose mechanism cannot be sorted to one of the five through-lines and is not explained by them, then either that sheet is unmoored (delete it or wire it in) or the five principles are themselves incomplete (add one) — this self-check is itself the volume's load-bearing claim being falsifiable.

可执行配套 · skill

The executable companion · skill

AI-Native 工程 · 可执行 skillAI-Native Engineering

AI-Native Engineering

前面所有图纸讲"为什么这样造、按什么顺序造";这一件替你把软件真的造出来——它不是"设计一个工程组织"(那是架构师那件的活),而是执行层本身:在内核腾出的空间里干活,让 agent 默认地生成 / 转换 / 重构 / 迁移代码与测试,把人的判断只留给少数几个不可外包的节点。给它一个功能、一项服务、或一段反复返工的流程,它先过 redraw-vs-graft 闸(删掉 agent 还塌回成"人逐行打字、每步人审"=赋能,不是原生),按范围分流(绿地 / 单环增量 / 出域只赋能直说不属于本群体 / 安全-生计边界仅辅助),再跑 Specify → Plan → Execute → Verify → Integrate → Learn 这一环。

Every blueprint above covers "why build it this way, in what order"; this piece actually ships the software with you — not "design an engineering org" (that is the architect's job) but the execution layer itself: it does the work the kernel frees, letting agents generate / transform / refactor / migrate code and tests by default, and reserving human judgment for the few nodes that cannot be offloaded. Give it a feature, a service, or a loop that keeps reworking; it first runs the redraw-vs-graft gate (delete the agents and it collapses back to "a human typing every line, reviewing every step" = enablement, not native), scopes honestly (greenfield / one-loop brownfield / out-of-scope enablement told so plainly / a safety-or-livelihood boundary where it only assists), then runs the Specify → Plan → Execute → Verify → Integrate → Learn loop.

三档分工,按可验证性梯度路由:Delegate · agent 自跑Review / Own · 人判断policy · 信任边界

Three tiers, routed by the verifiability gradient:Delegate · agent runsReview / Own · human judgmentpolicy · trust boundary

# 在 Claude Code 里调用invoke inside Claude Code
$ /skill ai-native-engineering
> "用 AI 把这个功能可靠地做出来:……""build this feature the AI-native way, reliably: ..."

  可运行代码 + SPEC.md(活规格)runnable code + SPEC.md (living spec)
  eval / 验证套件(承重墙)an eval / verification suite (the load-bearing wall)
  JUDGMENT.md(判断节点图)· PERMISSIONS.md(默认只读信任边界)JUDGMENT.md (judgment-node map) · PERMISSIONS.md (read-only-default trust boundary)

开源仓库:Open-source: github.com/watterfall/ai-native-architect/skills/ai-native-engineering ↗

安装:Install: /plugin marketplace add watterfall/ai-native-architect ↗

本件性质 · 工程面的可执行配套架构师那件设计组织;这一件把工程面的方法跑成真实产物——代码、规格、评测套件与权限边界。工程本来就是面向执行的表面,但这里强调的是七件系统里的配套关系:同一内核、彼此耦合、阅读无固定起点。判断节点 + 止步线:把每个动作按可逆性 × 爆炸半径定级,不可逆 / 有害动作落到人的确认门;止步线一句——绝不让 agent 的分类就是对不可逆或有害动作的签字,它的置信分只是人决策的输入,不是决策本身;而且那个"否"(误删 / 误拒 / 误封)同样要人签,反向路径与正向路径同等慎重。
What this is · the engineering executable companionThe architect piece designs the org; this piece makes the engineering surface runnable as real output: code, specs, eval suites, and permission boundaries. Engineering is naturally execution-facing, but the point here is its companion role in the seven-piece system: one kernel, mutually coupled, with no fixed reading entry. Judgment node + stop-line: grade every action on reversibility × blast radius, and route irreversible / adverse ones to a human confirmation gate. The stop-line, stated exactly: never let an agent's classification be the sign-off on an irreversible or adverse action — its confidence score is an input to the human's decision, never the decision; and the "no" (a wrongful delete / reject / lockout) is gated with equal care, the adverse path designed as carefully as the positive one.
SPEC.V / AI NATIVE METHODOLOGY / OWL METHODOLOGY SERIES
SCOPE / 一套方法论 · 完整组织光谱 N=1 → N=众多(一人公司至 agent 网络,同一套第一性原理)One methodology · the full organizational spectrum N=1 → N=many (from the one-person company to the agent network, on a single set of first principles)
SERIES / 六卷同一内核 · 本卷是其中一个面,完整接线见上方「方法论系列」。Six volumes, one kernel · this volume is one surface; the full wiring is above under "The Series."
APPENDIX · SOURCES / 证据与引用登记 —— 分级口径: 审计级实证(监管文件交叉验证)· 同行评审 · 理论模型/工作论文(引用须写"模型预测",不得写"已证明")· 从业者一手陈述 · 咨询预测(是预测,不是事实)。引用条目以本表为准;本轮 3 票对抗复核未发现被驳倒条目。Evidence and citation registry; grading key: audit-grade empirics (cross-checked against regulatory filings) · peer-reviewed · theoretical model / working paper (citations must read "the model predicts," never "proven") · practitioner first-hand account · advisory forecast (a forecast, not a fact). Citation rows are authoritative in this table; the current 3-vote adversarial review found no overturned source.
REFGRSOURCE承重论断Load-bearing claim
R1Alfonso Graziano《AI-Native Engineering》(Day 1–7 从业者课程 / 实践笔记,本卷主干理论源 · alfonsograziano.it/book)Alfonso Graziano, AI-Native Engineering (a Day 1–7 practitioner course / field notes; the spine theory source of this volume · alfonsograziano.it/book)执行充裕≠放任、implementer→orchestrator、失败模式学(幻觉/自信而错/雪球/上下文腐烂/隐藏假设)、SDD 三级成熟度、Delegate/Review/Own、即时规划与"知道何时叫停"、MCP 安全、Continuous AI——本卷绝大多数承重论断的一手出处"Abundant execution is not licence," implementer→orchestrator, the failure-mode taxonomy (hallucination / confident-wrongness / snowball / context-rot / hidden assumptions), SDD's three maturity rungs, Delegate/Review/Own, JIT planning and "knowing when to stop," MCP security, Continuous AI — the first-hand origin of most load-bearing claims in this volume
R2Hugging Face《Tiny Agents》(开源参考实现 + 博文,2025)· Hugging Face, Tiny Agents (open-source reference implementation + blog post, 2025) · huggingface.co/blog/tiny-agentsagent = 推理客户端 + 工具 + while 循环,最小内核约 50 行——"控制论复活"与底层楼层的可运行佐证An agent is an inference client + tools + a while loop, a minimal kernel of ~50 lines — runnable evidence for "cybernetics reborn" and the building's ground floor
R3Anthropic《Effective Context Engineering for AI Agents》(一手厂商工程文章;本卷经 Graziano Day 4 转引)· Anthropic, Effective Context Engineering for AI Agents (a first-hand vendor engineering article; cited in this volume via Graziano Day 4) · anthropic.com/engineering上下文工程问"此刻窗口里该有什么";准确率随上下文长度非单调,过峰后堆得越多越降准——"少即是多"的检索式上下文装配Context engineering asks "what should be in the window now"; accuracy is non-monotonic in context length, and past the peak more crammed in lowers accuracy — the retrieval-style "less is more" context assembly
R4Martin Fowler《Harness Engineering for Coding Agents》(一手从业者文章;本卷经 Graziano Day 4 转引,ENG·04 的直接理论源)· Martin Fowler, Harness Engineering for Coding Agents (a first-hand practitioner article; cited in this volume via Graziano Day 4, the direct theoretical source for ENG·04) · martinfowler.com/articlesharness = 每次运行都生产并校验上下文的系统;steering loop / harness 复利;computational×inferential / guides×sensors 二维分类——"脚手架即产品"The harness is the system that produces and verifies context on every run; the steering loop / harness compounding; the computational×inferential / guides×sensors two-axis taxonomy — "the harness is the product"
R5GitHub《Spec-kit》(开源规格驱动开发工具:PR 即评审门、constitution.md 硬规则层;本卷经 Graziano Day 5–6 引用)· GitHub, Spec-kit (open-source spec-driven-development tooling: PR as review gate, constitution.md as the hard-rule layer; cited in this volume via Graziano Day 5–6) · github.com/github/spec-kit把"规格不是开关而是阶梯"落到可运行工具:规格作权威参照、PR 作评审门、constitution.md 作不可越的硬规则层Lands "a spec is not a switch but a ladder" on runnable tooling: the spec as authoritative reference, the PR as review gate, constitution.md as an inviolable hard-rule layer
R6Anthropic《Model Context Protocol (MCP)》(开放协议规范 + 安全指引;本卷经 Graziano Day 4 转引)· Anthropic, Model Context Protocol (MCP) (open protocol specification + security guidance; cited in this volume via Graziano Day 4) · modelcontextprotocol.io工具接入面即攻击面:tool poisoning / prompt injection / 凭证泄露与最小权限清单——"默认只读、写操作显式声明"的协议级依据The tool-attachment surface is the attack surface: tool poisoning / prompt injection / credential leakage and the least-privilege manifest — the protocol-level basis for "read-only by default, write actions declared explicitly"
R7METR《Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity》随机对照试验 · 2025-07 · arXiv:2507.09089 · METR, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," randomized controlled trial · 2025-07 · arXiv:2507.09089 · arxiv.org/abs/2507.09089 · metr.org(16 名资深开源维护者;作者警告勿外推至 greenfield) (16 senior open-source maintainers; the authors warn against extrapolating to greenfield work)资深开发者用 AI 实测慢 19%、自感快 20%——"合成自信"的刻度,也是"判断而非执行才是稀缺项"反向论点的实证锚(单项研究、特定人群,不外推全部场景)Senior developers measured 19% slower with AI yet felt 20% faster — a gauge of "synthetic confidence," and the empirical anchor for the counter-argument that judgment, not execution, is the scarce factor (a single study on a specific population, not extrapolated to all settings)
R8Sackman, Erikson & Grant《Exploratory Experimental Studies Comparing Online and Offline Programming Performance》· Communications of the ACM, 1968 · "10x"说法的实证源头Sackman, Erikson & Grant, "Exploratory Experimental Studies Comparing Online and Offline Programming Performance," Communications of the ACM, 1968 · the empirical origin of the "10x" claim程序员间编码时间差约 20:1、调试约 25:1——但同一数据中产出与经验年限无关;方法有瑕(汇编与高级语言被试混计)。被本卷用于拆解"10x 个体"神话:它量的是实现/打字吞吐,而这正是 agentic coding 在廉价化的能力Coding-time differences of about 20:1, debugging about 25:1 — yet in the same data, output had no relationship to years of experience; the method was flawed (it pooled assembly and high-level-language subjects). Used in this volume to dismantle the "10x individual" myth: it measured implementation / typing throughput, the very capability agentic coding is making cheap
R9Winston W. Royce《Managing the Development of Large Software Systems》· Proceedings of IEEE WESCON, 1970 · 被误读为"瀑布"起源的论文Winston W. Royce, "Managing the Development of Large Software Systems," Proceedings of IEEE WESCON, 1970 · the paper misread as waterfall's originRoyce 实际警告单趟顺序模型有内在风险、主张至少迭代两遍——后世把一张反面教材图当成圣经。被本卷用于"BDUF 是赌错时代"的论证:判断被前置到信息最少、且最快过期的时刻Royce actually warned that the single-pass sequential model carried inherent risk and argued for at least two iterations — posterity took a cautionary diagram as scripture. Used in this volume for the "BDUF bet on the wrong era" argument: judgment front-loaded to the moment of least information and fastest expiry
R10Donald G. Reinertsen《The Principles of Product Development Flow》· Celeritas, 2009 · 论批量大小与队列Donald G. Reinertsen, The Principles of Product Development Flow · Celeritas, 2009 · on batch size and queues到达率逼近处理率时排队时间非线性爆炸;大批量放大延迟与方差。被本卷用于"代码评审当守门"的失效机制:agent 队伍的高到达率使"逐行人审"成为单点瓶颈,排队时间爆炸或评审退化为橡皮图章As arrival rate approaches service rate, queue time explodes non-linearly; large batches amplify latency and variance. Used in this volume for the failure mechanism of "review as gatekeeping": an agent fleet's high arrival rate makes "line-by-line human review" a single-point bottleneck, queue time explodes, or review degrades to rubber-stamping
登记口径:本卷为 AI-Native 工程方法论分卷,引用以 Graziano《AI-Native Engineering》为主干,旁及其直接援引的一手工程文献(Anthropic / Fowler / Hugging Face / GitHub Spec-kit)与一项受控实测(METR)。凡正文中"本卷/本系列推导"字样为内部推论,不另列外部来源;带"待溯源至原始出处再行终评"字样者,评级以原始出处为准。Registry scope: this is the AI-Native Engineering volume of the series; citations center on Graziano's AI-Native Engineering as the spine, alongside the first-hand engineering literature it directly draws on (Anthropic / Fowler / Hugging Face / GitHub Spec-kit) and one controlled measurement (METR). Phrases like "this volume's / this series' derivation" in the body are internal inferences and carry no external source row; where a marker reads "trace to the original for final grading," the grade follows that original.
REV. 2026-06 / END OF VOLUME · AI-NATIVE ENGINEERING