Code is the surface closest to agents, so engineering is where AI-Native is first and most deeply buildable — but engineering was never only writing code. It is the whole craft of turning intent into reliable systems: architecture, interfaces, verification, security, operability, evolution. Once writing code, tests, and refactors becomes something you draw on at will, the bottleneck does not vanish; it moves wholesale to the harder half of that craft — verification, review, trust boundaries, and taste. This part is more "buildable" than the organization part: every sheet lands on a principle and a do-this. One discipline runs throughout — tools are only the surface; what we want is the principle beneath the tool.
Kernel, specialized here: ① writing code/tests/refactors turns abundant → ② judgment forks along the verifiability gradient (machine-checkable correctness joins automation; constitutive taste sinks to people) → ③ the codebase becomes queryable infrastructure → ④ people return to expertise and taste. You need not read the organization volume first.
AI-ENABLED ENGINEERING → AI-NATIVE ENGINEERING
Object: Faster code generation → Specs, verification, and trust boundaries designed together
Judgment: Humans patch afterward → Humans hold irreversible, constitutive, and risk-boundary calls
Residue: One-session output → Reusable engineering artifacts and context that writes back
This is not a tool list. It compresses the volume into portable engineering artifacts: thesis, operating loop, boundaries, and first move. Read this first, then enter the sheets.
Thesis
Once code is abundant, engineering value moves to constraints, verification, security, and boundaries.
AI-Native engineering is not “make AI write more code”; it redraws development as a loop steered by specs, checked by independent verifiers, and contained by trust boundaries.
Operating Artifacts
The durable output is a set of reusable engineering artifacts.
Spec source: intent, acceptance criteria, and risk boundaries written as a living document.
Machine-checkable correctness will keep automating; people hold whether the spec is worth building, where boundaries sit, and what risk is tolerable. Context is not “more”; it is the right small subset in the window.
Pick one workflow that repeatedly reworks. Write its spec, checker, permissions boundary, and learning feedback. Do not start with the tool stack; run one loop and leave behind reusable assets.
From AI-Assisted Engineering to AI-Native Engineering
Engineers writing faster with AI is still AI-assisted engineering: the old development process gains throughput, but verification, boundaries, and the structure of judgment remain unchanged. AI-Native engineering accepts that execution is abundant and redraws the whole development graph around agents. The difference is one of kind, not degree.
For years engineering's scarce resource was engineering bandwidth — the hours of people who can write code. Waterfall, agile: every process was built for "typing is expensive." Once agentic coding is the default, writing code, tests, and refactors no longer slows the team, but the bottleneck does not vanish — it moves: to verification, review, security. Fill the kernel's four steps with the specifics of code and you get the whole thesis of this part.
To put it precisely, first separate two kinds of "fast." The first is typing fast — keying in what you have already worked out in your head; autocomplete, snippets, and scaffolds have long addressed it, and that is what IDEs spent two decades on. The second is turning the not-yet-thought-through into a correct system — and that was never slow typing but slow judgment: should the module be cut this way, does this edge case count as a bug, is this performance regression acceptable. Agentic coding crushes the first to near-zero while leaving the second untouched, even amplifying it — because the faster generation runs, the more candidates queue for judgment. So "engineers writing faster with AI" is an order-of-magnitude misdescription: what actually happens is not the same work done faster but a change in the composition of the work — the typing share evaporates and the judgment share becomes nearly all of it.
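The queueing effect can be made concrete with a toy model. The rates below are illustrative assumptions, not measurements, and `Pipeline`, `backlogAfter`, and `deliveredPerDay` are hypothetical names. The sketch only shows that when generation outruns judgment, the extra output accumulates as a queue rather than as delivery.

```typescript
// Toy model: every number here is an illustrative assumption.
interface Pipeline {
  generatedPerDay: number; // candidate change sets produced per day
  judgedPerDay: number;    // change sets that can be verified and reviewed per day
}

// Candidates still waiting for judgment after `days`, starting from an empty queue.
function backlogAfter(p: Pipeline, days: number): number {
  return Math.max(0, p.generatedPerDay - p.judgedPerDay) * days;
}

// Delivered throughput is capped by the slower of the two stages.
function deliveredPerDay(p: Pipeline): number {
  return Math.min(p.generatedPerDay, p.judgedPerDay);
}

// Before agentic coding: generation is the slower stage.
const before: Pipeline = { generatedPerDay: 10, judgedPerDay: 12 };

// Generation gets ten times faster; judgment capacity is left as it was.
const fasterTyping: Pipeline = { ...before, generatedPerDay: 100 };

// Generation gets ten times faster and judgment capacity is rebuilt to match.
const fasterJudgment: Pipeline = { generatedPerDay: 100, judgedPerDay: 120 };
```

Under these assumed numbers, `fasterTyping` still delivers 12 per day and builds a backlog of 440 unjudged change sets in five days, while `fasterJudgment` delivers 100 per day with no backlog. The composition of the work changes even though the typing stage is the one that sped up.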
The mirror image makes the boundary clearest: the vibe-coding trap. Misread "execution is abundant" as "you may now be lax" and you fall into it — letting the agent generate on vibes with no spec and no verification, and a system built on guesses collapses in strange ways. It collapses not because the model is dumb but because the load-bearing structure is missing: with no spec as objective function, generation has nowhere to converge; with no independent checker, confident wrongness goes uncaught; with no boundary, one over-privileged call is the whole thing. So "abundant execution is not licence" is the discipline this volume keeps returning to — abundance frees typing, not judgment; the judgment bandwidth it frees must be reinvested into the three new bottlenecks of verification, specs, and boundaries, not pocketed. [Source: Graziano, AI-Native Engineering Day 1, "vibe-coding trap," grade Ⅳ practitioner. [R1]]
This also explains why this volume is more "buildable" than the organization one: code is the surface closest to agents — natively legible, diffable, executable, testable, with a feedback loop measured in seconds rather than quarters. In an organization a misplaced judgment may take months to surface; in engineering a single red CI run tells you on the spot which guardrail leaked. That short feedback loop is the physical precondition for every sheet below landing on a do-this: on code, a thesis is not only argued but run and falsified.
① ABUNDANCE: Code / tests / refactors. Agentic coding is the default; typing is no longer scarce.
② JUDGMENT: Verify · review · security · taste. The new bottleneck is the new judgment node.
③ CONTEXT: Queryable codebase + automate? Ask Claude, not the author; then ask if it can be automated.
④ MEANING: Expertise · taste · building. People do deep expertise and product judgment, not throughput.
On engineering's own face, step ② is not a single stair; it forks along code's own spectrum of verifiability, and in engineering that spectrum is physical, pointable segment by segment: the far left is compiler-decidable (types, borrow-checks — a deterministic program returns correct/incorrect on the spot), the middle is test-decidable (behavior checked against fixed assertions, re-run by machine), further right is eval-decidable (semantic review, requiring a separately-manufactured independent criterion), and the far right is human-only-decidable (architectural trade-offs, naming, the boundary of "does this count as correct" — no external criterion exists; the criterion itself is the judgment). The fork rule reads straight off this: wherever on the spectrum the criterion can be externalized into a definite check that does not pass through the generating model's "feeling," it joins ① and gets automated (this is exactly SHEET 11's independent verifier); wherever the criterion cannot be externalized and must be supplied by a human on the spot, it sinks to ④ and stays with people. So "judgment retreats" is not a slogan in engineering but a concrete operation: move left-to-right along this machine-checkability spectrum, hand each machine-checkable segment to the machine, until what remains is the residue where a human must still set the criterion — and that residue is the engineer's job in 2030. [Source: this volume's SHEET 11 independent verifier plus an engineering-specific derivation of the verifiability gradient, grade Ⅴ inference; for the cross-volume comparison see the system map.]
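A minimal sketch of the fork rule, under assumed names: `Check`, `route`, and `residue` are hypothetical, and the sample checks are illustrative only. It shows the one decision the paragraph describes: a criterion that has been externalized into a definite check is routed to automation, and everything else stays with a person.

```typescript
// The four segments of the spectrum, left to right.
type Decider = "compiler" | "test" | "eval" | "human";

interface Check {
  label: string;
  decider: Decider;
  // True when the criterion exists as a definite check that does not
  // pass through the generating model's own judgment.
  criterionExternalized: boolean;
}

type Route = "automate" | "keep-with-human";

// Fork rule: externalized criteria join automation; the rest stays with people.
function route(check: Check): Route {
  if (check.decider === "human") return "keep-with-human";
  return check.criterionExternalized ? "automate" : "keep-with-human";
}

// The residue is whatever a human must still decide.
function residue(checks: Check[]): string[] {
  return checks.filter((c) => route(c) === "keep-with-human").map((c) => c.label);
}

// Illustrative inputs. The semantic review is marked as not yet externalized
// because no independent eval has been built for it in this example.
const checks: Check[] = [
  { label: "types", decider: "compiler", criterionExternalized: true },
  { label: "unit behavior", decider: "test", criterionExternalized: true },
  { label: "semantic review", decider: "eval", criterionExternalized: false },
  { label: "architecture trade-off", decider: "human", criterionExternalized: false },
];
```

With these sample inputs, `residue(checks)` returns the semantic review and the architecture trade-off. Building an independent eval for the semantic review, which here means flipping its `criterionExternalized` flag to `true`, moves it to the automated side and shrinks the residue by one.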
From implementer to orchestrator: the three the human keeps
Land "the composition of work changed" on a copyable role description: the engineer shifts from implementer to orchestrator — no longer implementing line by line but holding three things an agent cannot hold for you. First, intent: what, why, and what not; this is the direction of generation, and get it wrong and the agent efficiently builds the wrong thing. Second, constraints: architecture, standards, non-goals; this is the boundary of generation, and without it the agent guesses at every fork, and a local optimum it guesses is often a global debt. Third, verification: tests, review, quality gates; this is the criterion of generation, and without it "is it correct" returns to the human head that cannot keep pace with generation. These three map precisely onto the kernel's ② judgment and ④ people: an orchestrator is not "a manager of agents" but the person who lifts judgment from implementation detail to the three constitutive nodes of intent, constraints, and verification. [Source: Graziano, AI-Native Engineering Day 1 — implementer→orchestrator, with intent/constraints/verification held by the human, grade Ⅳ practitioner. [R1]]
This migration also reorders an engineer's skill stack. What used to be scarce was the implementation skill of "keying out, fast and correctly, what you had already thought through"; once that is absorbed by the agent, scarcity moves to four upstream capabilities: spec literacy (writing fuzzy intent into an unambiguous, machine-checkable spec), context engineering (deciding what should and should not be in the window right now), orchestration (cutting a large task into small steps an agent can reliably complete, with checkpoints set), and verification (designing checks that auto-decide correctness rather than reviewing line by line afterward). Note none of these is a "prompting trick" — all are judgment-dense engineering capabilities close to "what counts as correct." This gives personal growth a falsifiable direction: if an engineer has "used AI" for two years yet still spends time mainly on keying out implementation rather than these four upstream capabilities, they are most likely still at the graft stage of "writing faster with AI," not truly migrated to orchestrator.
The line between grafting and redrawing: a test you can run on the spot
"Grafting AI onto the old process" versus "redrawing the whole graph around agents" sounds like a difference of attitude, but it has a structural test you can run on the spot: look at where your process assumes "typing is expensive" and where it has already re-arranged around "verification is expensive." The typical grafted shape is — the process skeleton is unchanged (same requirements review, same scheduling, same human-review cadence), only the "write code" cell has AI swapped in, so output speeds up, but downstream verification, review, and integration all run at the old speed, and output piles into a mountain at verification. The redrawn shape is the opposite — since writing code is no longer the bottleneck, the whole graph should move its center of gravity from "how to write faster" to "how to judge correctness faster": verification moved earlier, automated, decoupled from generation; human review shifted from line-by-line to asynchronous triage; planning horizon shortened against signal. A clean test question is: "if writing code got ten times faster tomorrow, would your process flow more smoothly or jam harder somewhere?" A grafted process jams harder (the bottleneck did not move, it is just fed more aggressively); a redrawn process flows more smoothly (its load-bearing structure was built on the verification side to begin with). This echoes the organization volume's refrain that "the bottleneck does not vanish, it moves": the failure of grafting is essentially not admitting the bottleneck moved to verification, and still pushing on the place that is no longer the bottleneck. Falsifiable signal: if after your team "uses AI" output is clearly faster but delivery quality or cycle time did not improve in step, or got worse, it is almost certainly a graft — the faster output is all jamming at the downstream bottleneck that was never redrawn.
prompt → context → spec → harness → loop → fleet is one building; the elevator only goes up. Each model generation lifts the human's leverage point a floor — this is the timeline of the kernel's step ② "judgment retreats." Underneath, cybernetics reborn.
These are not competing "engineerings" but floors of one building: early on you write prompts word by word at the bottom; as models strengthen you move up to feeding context, writing specs, building the harness, and finally only designing the loop and scheduling the fleet. Whichever floor your leverage sits on this quarter is where the judgment bottleneck is — below it, hand off to agents or productize; above it is where you design next.
Prompt → Context: from word-by-word instructions to making "everything it needs to know" a queryable context.
Spec → Harness: from stating "what counts as correct" to building scaffolding that carries the loop and verifies automatically.
Loop → Fleet: from designing a single self-improving loop to scheduling a parallel fleet of agents.
KEY FIG · FIG. E1.0 / THE BUILDING · Floors of leverage. Read: which floor your leverage sits on = where the bottleneck is
These are not six competing "engineerings" but six floors of one building. The judgment bottleneck always rests on the floor where your leverage currently sits: below it is already delegable or productized (prompt was eaten by autocomplete long ago), above it is where you design next. Underneath is cybernetics reborn — an agent is an inference client + tools + a while loop, a 50-line minimal kernel (HuggingFace Tiny Agents). [Source: this series' Genealogy synthesis + HuggingFace, grade Ⅳ. [R2]]
Underneath is cybernetics reborn. Abstract away every floor of the building and the skeleton left is startlingly old: an agent is an inference client + a set of tools + a while loop — read state, decide the next step, call a tool to change state, read again. This is exactly Wiener's 1948 feedback control: sense, compare, actuate, sense again. HuggingFace's Tiny Agents compresses this minimal kernel into 50 lines, showing the floors are not a product of stacking technology but the same control loop unfolded at different levels of abstraction. This has practical use: when you do not know which floor a new tool belongs on, ask whether it plays sense, compare, or actuate in that loop, and you can place it — the tools change, the loop does not.
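The loop can be sketched in a few lines. This is a stubbed, synchronous illustration under assumed names (`Model`, `Tool`, `runAgent`); it is not the code of Tiny Agents or of any real library, and the "model" is a hard-coded function so that the control structure stays visible.

```typescript
type State = Record<string, number>;

// The model either asks for a tool call or declares the goal reached.
type Action = { tool: string; arg: number } | { done: true };

type Model = (goal: number, state: State) => Action; // sense + compare
type Tool = (state: State, arg: number) => State;     // actuate

function runAgent(
  model: Model,
  tools: Record<string, Tool>,
  goal: number,
  maxSteps = 20,
): State {
  let state: State = { value: 0 };
  let steps = 0;
  while (steps < maxSteps) {
    const action = model(goal, state);   // read state, decide the next step
    if ("done" in action) break;         // goal reached: stop the loop
    const tool = tools[action.tool];
    if (!tool) throw new Error(`unknown tool: ${action.tool}`);
    state = tool(state, action.arg);     // call a tool to change state
    steps += 1;                          // then read again
  }
  return state;
}

// Stub standing in for an inference client: a thermostat-style controller.
const stubModel: Model = (goal, state) => {
  const value = state["value"] ?? 0;
  return value >= goal ? { done: true } : { tool: "add", arg: 1 };
};

// A single hypothetical tool that increments the tracked value.
const tools: Record<string, Tool> = {
  add: (state, arg) => ({ ...state, value: (state["value"] ?? 0) + arg }),
};
```

Calling `runAgent(stubModel, tools, 3)` steps the value from 0 to 3 and then stops. The `maxSteps` bound is a design choice in this sketch: a feedback loop with no step limit depends entirely on the model declaring itself done.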
Why the elevator only goes up, never down. Because once a floor is "good enough and machine-checkable," it gets productized and absorbed by the next model generation, so a human's marginal value on that floor trends to zero — you are pushed upstairs not because you want to but because the payoff for standing still is collapsing. Prompt engineering was the hot skill of 2023 and is today largely eaten by autocomplete and system prompts; context engineering is being automated by RAG and context stores. This yields a falsifiable prediction: today's seemingly advanced spec / harness work will sink along the same path — if "writing a spec" is still a scarce skill rather than infrastructure three years out, this floor model is weakened. Conversely it marks the human's durable leverage: the top floors (loop design, fleet scheduling, and the judgment of "whether to build at all") sink slowest, precisely because they are the least machine-checkable, most constitutive judgments.
If the floor model is only a nice metaphor its value is limited; where it becomes operational is in giving "where is the bottleneck" a criterion you can self-test on the spot. Three questions locate your current leverage floor. First, what was the most effortful thing you last did for an agent? If it was agonizing word by word over how to phrase a request, you are on the prompt floor; if it was organizing "everything it needs to know," you are on context; if it was stating "what counts as correct," you are on spec; if it was scaffolding "verify and self-correct automatically on every run," you are on harness. Second, are the things below this floor basically off your plate now? — if you are still hand-writing prompts frequently, the context floor is not actually built and you only think you are above it. Third, are the things above this floor starting to make you feel "there should be a system for this"? That faint discomfort is precisely the signal that the elevator is due to climb one floor.
This self-test has a counter-intuitive but important corollary: different teams and different people standing on different floors right now is normal, and should not be forcibly leveled. A team still hand-writing many prompts should first stand up its context floor (make context queryable infrastructure), not skip levels to build fleet scheduling — high scaffolding built across skipped levels collapses often because the lower floor is unstable. This is two sides of the same discipline as the organization volume's "the bottleneck does not vanish, it moves": you cannot skip the current bottleneck to optimize the next, because the next is not yet a bottleneck. Falsifiable signal: if a team invests heavily in multi-agent orchestration (a high floor) while real output is still stuck on "having to re-explain context every time" (a low floor not stood up), that is evidence of level-skipping — retreat and make the lower floor solid first.
The floor model has a corollary very real for both individuals and teams: once a floor is productized and absorbed by the next model generation, a human cannot go back to it — not cannot in principle, but going back has no payoff. This explains why "let me first get my prompting tricks solid" is a loss-making investment direction today: the prompt floor is being rapidly absorbed by autocomplete, system prompts, and the model's own instruction-following, and the fine tricks you hone on it depreciate as the next model gets better at "guessing what you want." The same fate awaits the context floor (RAG and context stores are automating it), and by the floor model's prediction, sooner or later the spec and harness work that looks advanced today. This gives personal growth a less comfortable but clear-eyed guide: do not dig deep on a sinking floor; move up the elevator. To judge whether a skill is worth deep investment, ask whether on a three-year horizon it becomes scarce judgment or becomes infrastructure — the former is worth investing in, the latter gets productized eventually. This matches the kernel's step ④: the human's ultimately durable leverage is the slowest-sinking, least machine-checkable constitutive judgments (deciding what to build, what is correct, where the seams go), because it is precisely the "not machine-checkable" property that resists being absorbed by automation. Falsifiable signal: if you find the core skill you take pride in visibly depreciating a notch with each new model generation, it is most likely a sinking floor — time to move the center of your deep practice up one floor.
ENG·02 / CONTEXT AS INFRA
Redraw · Principle
Context as infrastructure, and why plain text wins
The old way was "find the person who wrote it and ask"; the new way is to ask Claude first, then ask "can this be automated." But the deeper question is: why is the value of Markdown, local knowledge graphs (Obsidian), and plain text amplified in an AI-first workflow?
It is not the tools that win, but that they happen to satisfy four agent-friendly properties. Anything that meets these four gets amplified; opaque binary / proprietary formats get marginalized:
Legible to agents: no proprietary format to parse; the model reads and writes it directly.
Diffable: comparable line by line, versionable, reviewable; a change is a commit, not an overwrite.
Queryable: can be indexed into a context store so "ask Claude" has something to stand on.
Same source: humans and agents read and write one source; there are no two truths to maintain.
Before
Context lives in heads, chat logs, and proprietary formats; it flows only by "finding the author."
Boundary · amplification is not "more is better." These four properties amplify text, but one hard constraint holds: the effective context window is far smaller than the nominal one — overstuffing it lowers accuracy, raises cost, and dilutes attention (Anthropic, Effective Context Engineering for AI Agents). So the real value of legible / diffable / queryable is not "put everything in" but making it possible to retrieve the right small subset. Legible / queryable is necessary, not sufficient — the other half is curation. Text can pile up without limit on disk, but every piece fed into the window must be chosen, or amplification backfires into noise. [Source: Anthropic engineering practice, grade Ⅳ practitioner; via Graziano's AI-Native Engineering. [R3][R1]]
FIG. E2.0 / THE CONTEXT BUDGET · The effective-window curve. Read: why "stuff more in" backfires into noise
Plain text's four properties (legible / diffable / queryable / same-source) let context be amplified, but amplification is not "more is better." Accuracy is not monotonic: past the peak, piling more into the window lowers accuracy, raises cost, and dilutes attention. So the real value of those four is making it possible to retrieve the right small subset, not to put everything in. This curve stitches ENG·02 to ENG·04: context engineering asks "what should be in the window now," harness engineering asks "what system produces and checks that context on every run." [Source: Anthropic, Effective Context Engineering for AI Agents, grade Ⅳ; via Graziano. [R3][R1]]
Seen from another angle, this four-property principle is a shift of terms: the variable moves from "a single prompt" to "the entire state fed to the model at inference." Prompt engineering cares how to word one sentence well; context engineering cares about a larger question — for this inference, right now, what should be in the window, in what order, at what budget. Rules (constant constraints), Skills (callable capabilities), Commands (packaged actions), and Custom agents (sub-agents with their own context) are all ways of feeding that "entire state" in layers. They are not four new toys but four drawers of one problem: store "what the model needs to know" by "how often it changes" — constants in Rules, on-demand in Skills, one-offs in the prompt.
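One way to picture "what should be in the window, in what order, at what budget" is a small assembly function. This is a sketch under stated assumptions: the layer names follow the text, while `Snippet`, `assembleWindow`, the token counts, and the relevance scores are all hypothetical. The selection is a simple greedy pass, not an optimal packing.

```typescript
type Layer = "rule" | "skill" | "prompt";

interface Snippet {
  id: string;
  layer: Layer;
  tokens: number;    // assumed precomputed token cost
  relevance: number; // assumed score in [0, 1] from some retriever
}

function assembleWindow(snippets: Snippet[], budget: number): Snippet[] {
  // Rules are constant constraints and the prompt is the one-off request:
  // both always go in, in their original order.
  const fixed = snippets.filter((s) => s.layer !== "skill");
  const used = fixed.reduce((sum, s) => sum + s.tokens, 0);
  if (used > budget) throw new Error("fixed layers alone exceed the budget");

  // Skills are on-demand: most relevant first, admitted only if they fit.
  const candidates = snippets
    .filter((s) => s.layer === "skill")
    .sort((a, b) => b.relevance - a.relevance);

  const chosen: Snippet[] = [...fixed];
  let remaining = budget - used;
  for (const s of candidates) {
    if (s.tokens <= remaining) {
      chosen.push(s);
      remaining -= s.tokens;
    }
  }
  return chosen;
}

// Illustrative inputs only.
const snippets: Snippet[] = [
  { id: "r1", layer: "rule", tokens: 200, relevance: 1 },
  { id: "s1", layer: "skill", tokens: 500, relevance: 0.9 },
  { id: "s2", layer: "skill", tokens: 400, relevance: 0.2 },
  { id: "s3", layer: "skill", tokens: 300, relevance: 0.7 },
  { id: "p", layer: "prompt", tokens: 100, relevance: 1 },
];
```

With a budget of 1200 tokens, `assembleWindow(snippets, 1200)` keeps `r1`, `p`, `s1`, and `s3` and leaves out the low-relevance `s2`, even though it is sitting on disk. That omission is the point: everything stays queryable, but only a chosen subset is fed into the window.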
First-hand practice anchor · use files as the persistent source, not conversation history. The root of context rot (see ENG·09) is leaving "the truth" in a conversation history that dilutes, overwrites, and self-contradicts across a long session. The fix is to settle decisions, specs, and tasks into files — the SPEC / PLAN / TASKS trio is the concrete shape of that fix: persistent state on disk, from which each inference re-assembles the window, rather than hoping the model "remembers" what was said thousands of tokens ago. This turns "same source" from a principle into a copyable discipline: humans edit files, agents read files, and there is no second drifting truth. It naturally leads ENG·02 into ENG·03 — since the truth lives in files, the next question is: can these files be machine-checked? [Source: Graziano, AI-Native Engineering Day 2 / Day 4, grade Ⅳ; via Anthropic context engineering. [R1][R3]]
Test signal
Onboarding ramp time drops: when context is infrastructure rather than word of mouth, newcomers ship real code in week one.
The effective context window: why "feed it more" makes it worse
"Context as infrastructure" is easily misread as "cram everything into the window," which turns this principle into its opposite. The real constraint is: the effective context window is far smaller than the nominal one. A model nominally eating 200K tokens does not mean all 200K are used at equal weight — attention dilutes as context grows, and the more crammed into the window, the more the genuinely critical early constraints drown. So context engineering has two symmetric failure directions: too little, and the agent lacks key facts and can only guess (a breeding ground for hallucination); too much, and the key signal is diluted by noise, the agent distracted, less accurate, and more expensive. This "less is more" is the core argument of Anthropic's Effective Context Engineering for AI Agents, and a direct correction to the naive intuition that "more context is better." It rounds out ENG·02 from "why text / MD get amplified" (a positive payoff) with the side of "ceilings and trade-offs" — and it is precisely this side that makes the principle harder to falsify and more operational. [Source: Anthropic, Effective Context Engineering for AI Agents (via Graziano Day 4), grade Ⅳ (first-hand vendor engineering article). [R3][R1]]
This ceiling directly explains why "files as persistent source" beats "relying on conversation history." Writing SPEC / PLAN / TASKS as versioned files puts "everything it needs to know" in queryable infrastructure outside the window, retrieved on demand, rather than making it re-read repeatedly inside an ever-longer, ever-more-diluted conversation history. A file does not rot as the conversation lengthens; it can be diffed, reviewed, shared across agents. Conversation history is exactly the opposite — the longer it gets the more it rots, and it lives only in this one session. This is also the natural transition from ENG·02 to ENG·03: when you assemble context seriously, you will eventually write that assembly down as a spec, and the spec is the subject of the next sheet. Falsifiable signal: if your agent starts "forgetting" key constraints stated earlier, or repeats an error long since corrected, late in a long session, that is not the model getting dumber but the effective window crammed full of conversation history — move those constraints out of the conversation and into files.
Why plain text and Markdown are amplified in an AI-first world
"Context as infrastructure" also has a concrete form preference whose mechanism, not slogan, must be stated: in an AI-first workflow the value of plain text, Markdown, and structured plain-text knowledge (such as knowledge graphs) is systematically amplified, while proprietary, binary formats openable only by specific software are marginalized. The reason is not an aesthetic that text is "plainer and better" but three machine-checkable properties stacking. First, one shared source: plain text is the same thing a human reads and an agent reads directly, with no export/parse loss in between, so humans and agents truly "drink from one well" and there is no drift between "the doc humans read" and "the data machines read" as two separate sources of truth. Second, diffable: text changes are line-by-line visible, reviewable, revertible, so each agent edit can be treated as a reviewable commit rather than an untraceable overwrite — wiring ENG·02 straight onto the "diffable" through-line. Third, queryable and composable: text can be retrieved, chunked, and assembled into the window on demand, serving precisely the "the effective window is limited, assemble actively" discipline above. Put the three together and "files as source of truth" is not nostalgia but the optimal response to the hard constraint that "the agent must read it, diff it, and query it on demand." Falsifiable signal: if your team's key knowledge is still mainly locked in proprietary formats openable only by specific software that the agent cannot read in, then however much you stress "context matters," the context your agent can actually use is incomplete — your format choice has shut it out at the door.
Once generation is abundant, the bottleneck is "is it correct." To make generation converge on its own, "correct" must be written in a form a machine can check. Why are TypeScript and type systems amplified? Because types are machine-checkable specs and guardrails.
Externalize intent into a machine-checkable form (types, schemas, tests, evals, lints) and you give the generation loop an objective function: it constrains at generation time, verifies automatically in CI, and moves "correct" out of people's heads into executable checks. The more checkable the spec, the more generation self-converges, the more verification automates, and the more humans can attend only to the "correct" that only humans can define. This is isomorphic to the organization part's T1: judgment retreats to scarce nodes, while humans define "what is right" and machines check it.
Why type systems get amplified — the mechanism, not a language allegiance. It is not the TypeScript language that wins but the act of "writing a constraint in a form the machine can reject on the spot." A type signature like (user: User) => Result<Order, PaymentError> is three things at once: an intent brief a human reads, a generation guardrail for the agent (constrained to the signature as it generates), and a machine-checkable spec for the compiler (a violation errors out immediately). These three used to be maintained separately as docs, verbal convention, and code review; now they collapse into one line that cannot go stale. The more abundant generation is, the higher the value of this "reject on the spot" — because a human rejecting candidates one by one cannot keep pace with generation, and only a machine can say "no" on the same time scale as generation. Anything that presses more "correctness" into compile time, into types, into schemas amplifies along the same principle; conversely, constraints that only surface at runtime, found only by humans after the fact, are being marginalized.
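The signature in the paragraph above can be written out as a small, self-contained sketch. Only the shape `(user: User) => Result<Order, PaymentError>` comes from the text; the `Result` encoding, the fields of `User`, `Order`, and `PaymentError`, the fixed price, and the function names are assumptions made for illustration.

```typescript
// One common way to encode Result as a discriminated union (an assumption here).
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

// Hypothetical domain types.
interface User { id: string; balanceCents: number }
interface Order { userId: string; totalCents: number }
type PaymentError = { kind: "insufficient-funds"; shortByCents: number };

// The signature acts as brief, guardrail, and machine-checkable spec at once.
type PlaceOrder = (user: User) => Result<Order, PaymentError>;

const PRICE_CENTS = 500; // illustrative constant

const placeOrder: PlaceOrder = (user) => {
  if (user.balanceCents < PRICE_CENTS) {
    return {
      ok: false,
      error: { kind: "insufficient-funds", shortByCents: PRICE_CENTS - user.balanceCents },
    };
  }
  return { ok: true, value: { userId: user.id, totalCents: PRICE_CENTS } };
};

// Callers must check `ok` before the compiler lets them touch `value` or `error`.
function summarize(result: Result<Order, PaymentError>): string {
  return result.ok
    ? `order for ${result.value.userId}: ${result.value.totalCents}`
    : `payment failed, short by ${result.error.shortByCents}`;
}

// A generated implementation that ignores the contract is rejected by the
// compiler rather than by a reviewer (left commented so the block compiles):
// const bad: PlaceOrder = (user) => user.id;
```

The commented-out line is the "reject on the spot" case: a candidate that returns a bare string never reaches review, because the type checker refuses it on the same time scale as generation.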
trust-but-verify 由此长出来。把约束做成可机检的形式,等于声明:agent 的产出默认不可信,要由一个与生成分离的检查器判定才放行。这不是对 AI 的敌意,是工程纪律——人写的代码同样默认不可信,所以我们才有类型检查、测试、CI。区别只在:agent 生成快了几个量级,"事后人审"这条旧防线被冲垮了,必须把验证前移、自动化、并和生成解耦。下一节(ENG·04)讲这个独立验证器如何嵌进循环,ENG·06 讲它如何落成可照做的三档分工。
This is where trust-but-verify grows from. Making constraints machine-checkable is a declaration: the agent's output is untrusted by default and is released only when a checker, separate from generation, judges it. This is not hostility toward AI but engineering discipline — human-written code is untrusted by default too, which is why we have type checks, tests, and CI at all. The only difference is that the agent generates orders of magnitude faster, the old "humans review afterward" line of defense is overrun, and verification must be moved earlier, automated, and decoupled from generation. The next sheet (ENG·04) covers how this independent verifier embeds into the loop; ENG·06 covers how it lands as a copyable three-tier division of labor.
交给 Claude
Hand to Claude
风格与 lint
Style & linting
提交前抓 bug 并修
Catch & fix bugs pre-commit
补测试 / 跑 eval
Add tests / run evals
留给人 · 定何为对
Keep with humans · define what is right
法务与风险容忍
Legal & risk tolerance
信任边界与安全敏感代码
Trust boundaries & security-sensitive code
产品品味与"够好"的判据
Product taste & the bar for "good enough"
核心图 KEY FIG · FIG. E3.0 / THE VERIFIABILITY GRADIENT · 内核②的分叉点 (the fork point of kernel step ②)
看懂:判断在哪条线上一分为二
Read: the line where judgment splits in two
Step ②'s "judgment retreats" is not a stair but a spectrum. Lay out each kind of correctness by one question, "can a machine judge it cheaply and deterministically": at the left end, types / tests / evals are machine-checkable; moving right come semantic diff and AI review; at the far right sits the bar for "what counts as correct," which only humans can set. What lies left of the fork joins ① abundance and is automated; what lies right of it sinks to ④ and stays with people. This gradient is the one ruler the whole series shares: the three delegation tiers (ENG·06), the spec ladder (ENG·07), and the boundary-as-judgment-node (ENG·10) are its projections on different faces.
检验信号 / 深潜
Signal / deep dive
可机检比例上升、人审集中在只有人能答处。验证为何是唯一承重墙,见验证篇 ↗。
The machine-checkable share rises; human review concentrates where only humans can answer. Why verification is the one load-bearing wall is covered in the Verification chapter ↗.
把规格写到"哪一档",决定了它的成色
Which "rung" you write the spec to decides its quality
"Machine-checkable" is not precise enough — it states only what form the spec takes, not what standing the spec holds in the team. Spec-driven development (SDD) lays this out as a maturity ladder of three rungs: Spec-First, write the spec before the code, but the spec is shelved once written and the source of truth remains the code; Spec-Anchored, spec and code coexist and the spec is treated as the authoritative reference, with code drifting from spec seen as something requiring explanation; Spec-as-Source, the spec is the sole source of truth, code is generated from or answerable to the spec, and to change behavior you change the spec first. This ladder answers a question ENG·03 ducks when it speaks only of "form": both teams "wrote a spec," so why is one team's spec alive while another's has gone untouched for three months? Because they sit on different rungs. Machine-checkability (this sheet) is the spec's form condition; the maturity ladder (ENG·07) is the spec's standing condition, and neither is dispensable — a machine-checkable spec no one treats as source, and a spec treated as source but all natural language that no machine can check, both fail. [Source: Graziano, AI-Native Engineering Day 5–6 SDD maturity Spec-First→Spec-Anchored→Spec-as-Source, grade Ⅳ practitioner. [R1]]
类型不是约束的全部:可机检的光谱
Types are not the whole of constraint: the machine-checkable spectrum
Saying "types win" is easily narrowed to "just use a statically typed language," which reads this sheet's mechanism too shallowly. What is actually amplified is not the single form of types but the whole spectrum of "writing a constraint in a form the machine can reject on the spot" — types are merely the cheapest segment of that spectrum, nearest compile time. Moving right along the spectrum come, in turn: schemas (constraining the shape of data), contracts / pre- and post-conditions (constraining the behavior of an interface), property-based tests (constraining "a property that should hold for any input" rather than a single example), invariant assertions (constraining state that must not be broken at runtime), and evals (constraining semantic-level "is it correct"). These are all different-strength versions of one principle: move "correct" forward into a check that does not depend on after-the-fact human review and can say "no" on the same time scale as generation. So the judgment an engineer should actually practice is not "which language to use" but "for this constraint, how far left on the spectrum can I press it" — what can go into a type, do not leave as a comment; what can be a property, do not leave as one example test; what a machine can reject on the spot in CI, do not leave to human review. The machine-checkable share measures exactly how far left your constraints sit overall. Falsifiable signal: if your key constraints are largely maintained by "code comments plus a reviewer remembering to mention it," they actually sit at the rightmost, weakest, most rot-prone end of the spectrum — those constraints quietly lapse in some review where no one remembers, while a type or schema does not.
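To make "what can be a property, do not leave as one example test" concrete, here is a minimal sketch contrasting the two. It is hand-rolled with the standard `random` module to stay dependency-free; a property-testing library would normally do this job. `clamp_discount` is a hypothetical function under test.

```python
import random


def clamp_discount(percent: float) -> float:
    """Hypothetical function under test: clamp a discount into [0, 100]."""
    return max(0.0, min(100.0, percent))


def example_test() -> bool:
    # One example: pins a single input/output pair and nothing else.
    return clamp_discount(150.0) == 100.0


def property_test(trials: int = 1000, seed: int = 0) -> bool:
    # One property: for any input, the result stays in range and
    # clamping a second time changes nothing (idempotence).
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(-1e6, 1e6)
        y = clamp_discount(x)
        if not (0.0 <= y <= 100.0) or clamp_discount(y) != y:
            return False
    return True
```

The example test would still pass if a regenerated implementation broke negative inputs; the property check sits further left on the spectrum because it rejects that whole class of candidates without a reviewer having to think of the case.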
Do not treat the agent as one-shot execution; treat it as an observable, self-correcting, self-improving loop. A loop's minimal mechanism is evolution's minimal mechanism: a generator + an independent verifier + external state = variation / selection / retention.
Harness is the scaffolding that carries the loop (heartbeat, isolation, knowledge, tentacles, checks-and-balances — doing separated from checking); spec is the loop's objective function; eval is the load-bearing wall — errors flow back as new evals and compound with output. And skills / MCP / CLI are the right abstraction because they package capability into composable interfaces exposed to agents — the underlying principle is the previous sheet's: legible, composable, plain-text protocols win, tools are context.
为什么这个最小机制就是进化的最小机制。把"自我改进"这个被滥用的词拆到底,它只需要三件东西凑齐:一个会产生多样候选的生成器(变异)、一个与生成分离、能判优劣的选择器(选择)、一个能把胜出者留到下一轮的外部存储(保留)。三件齐了,系统就会无监督地变好;缺任何一件,它就只会原地抖动。agentic 工作流恰好能凑齐这三件:agent 是生成器,eval / CI 是选择器,上下文库 / skills 库是外部存储。关键的、也最常被省掉的是第二件——选择必须与生成分离。让生成者给自己打分,等于让变异自己决定自己被不被选中,进化立刻退化成随机游走。这就是为什么"独立验证器"是承重墙:不是多一道保险,是这个机制能不能成立的充要条件。
Why this minimal mechanism is evolution's minimal mechanism. Take the over-used phrase "self-improving" down to the bottom and it needs only three things present together: a generator that produces diverse candidates (variation), a selector, separate from generation, that judges better from worse (selection), and an external store that carries winners into the next round (retention). With all three, the system improves unsupervised; missing any one, it merely jitters in place. Agentic workflows happen to supply all three: the agent is the generator, evals / CI the selector, the context store / skills library the external store. The crucial and most often omitted piece is the second — selection must be separate from generation. Letting the generator score itself is letting the mutation decide whether it survives, and evolution instantly degrades into a random walk. This is why the "independent verifier" is the load-bearing wall: not an extra safeguard but the necessary and sufficient condition for the mechanism to hold at all.
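The three parts can be sketched in a few lines. This is a toy under stated assumptions, not a real agent loop: candidates are integers, `mutate` stands in for the agent, `closer_to_target` stands in for an eval, and `TARGET` is a hypothetical criterion that only the verifier knows.

```python
import random
from typing import Callable, List

Generate = Callable[[random.Random, List[int]], int]
Verify = Callable[[int, List[int]], bool]


def run_loop(generate: Generate, verify: Verify, rounds: int, seed: int = 0) -> List[int]:
    """Variation / selection / retention in the smallest form."""
    rng = random.Random(seed)
    store: List[int] = []                   # external state (retention)
    for _ in range(rounds):
        candidate = generate(rng, store)    # variation
        if verify(candidate, store):        # selection, separate from generation
            store.append(candidate)
    return store


TARGET = 42  # hypothetical criterion, known only to the verifier


def mutate(rng: random.Random, store: List[int]) -> int:
    # The generator sees retained winners but never the scoring rule.
    base = store[-1] if store else 0
    return base + rng.randint(-10, 10)


def closer_to_target(candidate: int, store: List[int]) -> bool:
    # Independent verifier: keep a candidate only if it beats the last winner.
    return not store or abs(candidate - TARGET) < abs(store[-1] - TARGET)


def self_graded(candidate: int, store: List[int]) -> bool:
    # The generator "scoring itself": everything is accepted.
    return True


winners = run_loop(mutate, closer_to_target, rounds=200)
drift = run_loop(mutate, self_graded, rounds=200)
```

With the independent verifier, every retained value is strictly closer to the target than the one before it. With `self_graded`, all 200 candidates are retained and the sequence is a random walk, which is the degradation the paragraph describes.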
The concrete shape of self-improvement · the skills / commands / rules library is a compounding project asset. Run the steering loop long enough and something settles out: each failure's fix is hardened into a reusable skill, command, or rule. This library differs from the codebase — the codebase records "what the system is," this library records "how to get the agent to build the system right." It grows with every pothole the team hits, and it is reused across the whole team and every run, so it is one of the few engineering assets that genuinely compounds over time. A counterintuitive corollary: a team's moat is shifting from "the code itself" toward "the harness + library that tames the agent" — code can be regenerated, but the knowledge your team has accumulated about "what makes the agent err on your particular system" cannot be copied off you.
① 生成器 Generator
agent 大量产出候选:变异。
The agent produces many candidates: variation.
② 独立验证器 Independent verifier
与生成分离地判对错:选择。这是承重墙。
Judges correctness, separate from generation: selection. The load-bearing wall.
③ 外部状态 External state
把通过的留存进上下文库:保留,于是循环会复利。
Retains what passes into the context store: retention, so the loop compounds.
Anatomy of the harness (a two-axis taxonomy, after Martin Fowler, Harness Engineering for Coding Agents). The scaffolding that carries the loop splits along two axes — guides (feedforward: lay the runway before generation) vs sensors (feedback: catch errors after); each axis splits again into computational (deterministic, cheap: lint / types / tests / schema) vs inferential (LLM reasoning: AI review / semantic diff / plan critique). The rule is simple: prefer computational; let inferential cover only the judgment layer machines cannot test; most teams overinvest in one cell and underinvest in another.
The harness is not a vague blob of "tools"; it splits cleanly along two axes into four cells: the time axis (feedforward guides, G / feedback sensors, S) × the mechanism axis (deterministic computational, C / LLM inferential, I). Each cell has its proper contents, and there is a single rule: prefer computational over inferential, saving expensive, non-deterministic LLM reasoning for the judgment layer machines genuinely cannot test. Diagnose your own harness: which cell is stuffed, which is empty? The most common ailment is over-investing in S×I (sensors × inferential, e.g. AI review) while under-investing in S×C (sensors × computational, e.g. tests) and G×C (guides × computational, e.g. specs); INSTRUMENT 08 below lets you check yourself. [Source: Martin Fowler, Harness Engineering for Coding Agents, grade Ⅳ; via Graziano. [R4][R1]]
steering loop · turning "self-evolution" into an executable move. Every time the agent fails → ask "which guide or sensor should have caught it" → add or sharpen that one → supervision demand falls over time; skills / commands / rules libraries thus become project-level assets that compound. The harness is not a fresh start but the engineering discipline of the previous sheet's context engineering: that one asks "what should be in the window right now," this one asks "what system produces and checks that context on every run, across the whole team" — the sentence that stitches ENG·02 to this sheet. [Source: Martin Fowler, grade Ⅳ practitioner; via Graziano's AI-Native Engineering. [R4][R1]]
INSTRUMENT 08 · 脚手架缺口诊断器
INSTRUMENT 08 · Harness-Gap Diagnostic
勾选你当前 harness 真正覆盖的格。诊断器据 FIG. E4.0 的法则给出你最可能的失败模式与最便宜的下一笔投入。
Tick the cells your harness genuinely covers. Per FIG. E4.0's rule, it names your likely failure mode and the cheapest next investment.
检验信号
Test signal
循环能在低人值守下自我纠偏;eval 覆盖随产出增长,而非靠人逐个把关。
The loop self-corrects with little human babysitting; eval coverage grows with output instead of relying on humans to gate each one.
harness 是产品:guides 与 sensors 的二维分类
The harness is the product: the two-axis taxonomy of guides and sensors
"生成器 + 独立验证器 + 外部状态"给了循环的骨架,但还缺一张能让人当场盘点"我的脚手架到底有哪些、缺哪些"的分类表。Fowler 的《Harness Engineering for Coding Agents》给了这张表,它沿两条轴把围绕模型的整套脚手架干净地分成四格。第一条是时间轴:guides(前馈)在生成之前引导——规则、skills、commands、上下文装配,是你提前塞进去的"该怎么做";sensors(反馈)在生成之后探测——测试、lint、类型检查、AI 评审,是事后告诉你"做得对不对"的。第二条是机制轴:computational(确定性、廉价)是 lint / type / test / schema 这类不用模型推理、跑一次就有确定答案的检查;inferential(推理性)是 AI 评审 / 语义 diff / 计划批判这类要调模型来判断的检查。〔源 Martin Fowler《Harness Engineering for Coding Agents》(经 Graziano Day 4 转引),证据级 Ⅳ 一手从业者,ENG·04 的直接理论源[R4][R1]〕
"Generator + independent verifier + external state" gives the loop's skeleton but still lacks a table that lets you inventory on the spot "which scaffolding I actually have and which I lack." Fowler's Harness Engineering for Coding Agents gives that table, sorting the whole scaffolding around the model cleanly into four cells along two axes. First, the time axis: guides (feed-forward) steer before generation — rules, skills, commands, context assembly, the "how to do it" you put in ahead of time; sensors (feedback) detect after generation — tests, lint, type-checks, AI review, telling you afterward "whether it was done right." Second, the mechanism axis: computational (deterministic, cheap) are checks like lint / type / test / schema that need no model inference and give a definite answer in one run; inferential are checks like AI review / semantic diff / plan critique that call a model to judge. [Source: Martin Fowler, Harness Engineering for Coding Agents (via Graziano Day 4), grade Ⅳ practitioner, the direct theoretical source for ENG·04. [R4][R1]]
This two-axis table immediately yields two copyable disciplines. First, use computational first; let inferential only fill the judgment layer. Whatever a deterministic, cheap check can solve, never call a model to judge — catch type errors with the compiler, do not have another LLM "glance and feel whether it is right"; reserve inferential checks for what genuinely needs semantic judgment that computational cannot reach (e.g. "does this variable name express the intent," "did this refactor quietly change behavior"). Second, most teams over-invest on one side and under-invest on the other at the same time. A common over-investment is piling on inferential (have AI review everything — slow, expensive, uncertain); a common under-investment is thin guides (no "how to do it" distilled into reusable skills/commands, so every run starts steering from zero). Fill your existing scaffolding into these four cells once: the empty cells are your under-investment, the overflowing cells your over-investment. This is exactly what INSTRUMENT 08, the Harness-Gap Diagnostic, does — it turns this table into a checkable self-test. Falsifiable signal: if you cannot place each existing check accurately into one of the four cells, you do not actually have a clear picture of your scaffolding, and "the harness is the product" is still a slogan for you.
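The "fill your scaffolding into four cells" exercise can be sketched as a small function. This is only an illustration of the self-test idea, not the actual INSTRUMENT 08; the cell names follow the two axes above, and the example harness entries are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Cell = Tuple[str, str]  # (time axis, mechanism axis)

CELLS: List[Cell] = [
    ("guide", "computational"),
    ("guide", "inferential"),
    ("sensor", "computational"),
    ("sensor", "inferential"),
]


def diagnose(harness: Dict[str, Cell]) -> Dict[str, List[Cell]]:
    """Place every existing check in a cell, then report empty and heaviest cells."""
    for name, cell in harness.items():
        if cell not in CELLS:
            # The falsifiable signal: a check you cannot place is a check you do not understand.
            raise ValueError(f"cannot place {name!r} in a cell: {cell!r}")
    counts = Counter(harness.values())
    heaviest = max(counts.values(), default=0)
    return {
        "empty": [c for c in CELLS if counts[c] == 0],
        "heaviest": [c for c in CELLS if heaviest and counts[c] == heaviest],
    }


# Hypothetical inventory of one team's scaffolding.
my_harness = {
    "unit tests": ("sensor", "computational"),
    "AI review on every PR": ("sensor", "inferential"),
    "semantic diff": ("sensor", "inferential"),
}
report = diagnose(my_harness)
```

For this inventory the report shows both guide cells empty and the sensor × inferential cell heaviest, which matches the common pattern of piling on AI review while leaving guides thin.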
自改进 = steering loop:每次失败补一条护栏
Self-improvement = the steering loop: each failure adds a guardrail
"A self-improving loop" sounds abstract, but it has a fully concrete, copyable executable form Fowler calls the steering loop: each time the agent fails, ask "which guide or sensor should have caught this failure," then add or sharpen that one. This single move turns "self-improvement" from a vague wish into a definite machine: failure is the input, an addition to a guardrail is the output, and the output makes the same class of failure auto-caught next time, so supervision demand declines monotonically over time. Note this machine improves not the model (the model is the vendor's; you cannot change it) but the scaffolding around the model — your library of skills, commands, rules, evals. This is exactly why that library is "a project-level asset that compounds over time": it is not a heap of config files but the reusable judgment distilled from every pit this project has stepped in. It is the same loop as ENG·15's eval feedback, seen on two faces — eval feedback watches the sensor side (detect after the fact), the steering loop extends it to the guide side (steer beforehand): some failures should add a test that turns red (a sensor), some should add a "next time, do it this way" skill (a guide). Bring both sides into this loop and the harness becomes a living system that strengthens with the team's experience. Falsifiable signal: if your skills/rules/evals library is long unchanged, or written once early and never updated, the steering loop is not turning — failures happened but did not flow back into guardrails, so the same class keeps recurring and supervision demand never falls. [Source: Martin Fowler, Harness Engineering for Coding Agents steering loop / harness compounding + Graziano Day 4, grade Ⅳ practitioner. [R4][R1]]
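A minimal sketch of the steering loop as a machine with failure as input and a guardrail as output. The `Harness` class, the failure classes, and the `TRIAGE` table (standing in for the human answer to "which guide or sensor should have caught this") are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Tuple


@dataclass
class Harness:
    guides: Dict[str, str] = field(default_factory=dict)   # failure class -> skill / rule
    sensors: Dict[str, str] = field(default_factory=dict)  # failure class -> test / eval

    def catches(self, failure_class: str) -> bool:
        return failure_class in self.guides or failure_class in self.sensors

    def steer(self, failure_class: str, kind: str, guardrail: str) -> None:
        if kind == "guide":
            self.guides[failure_class] = guardrail
        elif kind == "sensor":
            self.sensors[failure_class] = guardrail
        else:
            raise ValueError(f"kind must be 'guide' or 'sensor', got {kind!r}")


def run_steering_loop(
    harness: Harness,
    failures: Iterable[str],
    triage: Callable[[str], Tuple[str, str]],
) -> int:
    """Return how many failures still needed a human."""
    human_interventions = 0
    for failure_class in failures:
        if harness.catches(failure_class):
            continue  # already auto-caught by an earlier guardrail
        human_interventions += 1
        kind, guardrail = triage(failure_class)  # which guide or sensor should have caught it?
        harness.steer(failure_class, kind, guardrail)
    return human_interventions


# Hypothetical triage decisions a team might make.
TRIAGE = {
    "wrong date format": ("guide", "skill: always emit ISO 8601 dates"),
    "missing auth check": ("sensor", "test: unauthenticated request returns 401"),
}
failures = [
    "wrong date format", "missing auth check",
    "wrong date format", "wrong date format", "missing auth check",
]
harness = Harness()
demand = run_steering_loop(harness, failures, TRIAGE.__getitem__)
```

Five failures arrive but only the first occurrence of each class reaches a human; the library now holds one guide and one sensor. A library that never changes would leave `demand` equal to the number of failures.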
harness 工程是上下文工程的工程纪律
Harness engineering is the engineering discipline of context engineering
Finally, stitch ENG·02's context engineering to this sheet's harness engineering, because they are often taken as two things when they are two time scales of one. A precise placement: harness engineering is not a fresh start but the engineering discipline of context engineering — the latter asks "what should be in this window right now," the former asks "what system, on every run and across the whole team, produces, verifies, and corrects that context." Context engineering is single-shot and present-tense: for this call, which facts, constraints, examples should I assemble into the window. Harness engineering is repeated and systematic: which guides automatically assemble the right context before every generation (instead of doing it by hand each time), which sensors automatically verify the output after every generation and flow what they find back into the context to add next time. In other words, a good harness is a machine that does context engineering automatically, turning the error-prone, unscalable "someone must remember to assemble the right context every time" into a discipline the system executes stably on every run. This explains why the two chapters sit adjacent and point at each other in this volume: in ENG·02 you learn "what to assemble once," in ENG·04 you learn "how to make a system assemble it right every time and get better at it with the team's experience." Falsifiable signal: if every time your team puts the agent to work, someone still has to manually recall "oh, this time I have to tell it that convention, give it those files," you are stuck at single-shot context engineering and have not upgraded it into a harness — once that person is away, the context is assembled wrong, exactly the cost of missing harness discipline. [Source: Martin Fowler, Harness Engineering for Coding Agents + Graziano Day 4 (harness engineering as the engineering discipline of context engineering), grade Ⅳ practitioner. [R4][R1]]
Waterfall front-loads planning; agile slices it into iterations: both assume "implementation is expensive, so ration it." When implementation is abundant, the right shape is not a line and not a sprint but a loop that stitches intent, generation, and verification into one closed cycle and learns from every run.
Force analysis · why a loop, not a line. A linear flow treats "plan to build to test" as a one-shot pipeline, betting that upstream judgment is sound and downstream execution is the slow part. Agentic coding inverts the bet: execution is near-free, yet errors snowball across steps. A linear pipeline has no return path to push drift back upstream, so one early misread floats all the way to production. The loop makes "learn" an explicit sixth step (each failure flows back as next round's spec and guardrail), so the process itself compounds with output. This lifts ENG·04's self-improving loop from a single agent to the scale of the whole pipeline.
Waterfall is a line, agile a shortened line; both bet "implementation is expensive." Once implementation is abundant, the right shape is this six-node closed loop. Two nodes are load-bearing: Verify is a gate guarded by an independent checker — only what clears the wall flows to Integrate, and it is the selection pressure; Learn settles this round's errors into a new eval / rule and feeds them back to Specify along the vermilion return arrow, so coverage compounds with output and the loop steadies as it runs. Bending the line into a ring and actually feeding Learn back lifts ENG·04's self-improving loop to the scale of the whole pipeline.
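One pass through the ring can be sketched as a single function. This is a toy under stated assumptions: artifacts are strings, `execute` stands in for the agent, and `observe` stands in for whatever telemetry or incident review notices a defect that escaped; none of these names come from the source.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Check = Callable[[str], bool]


@dataclass
class Spec:
    intent: str                                        # durable source
    evals: List[Check] = field(default_factory=list)   # grows via Learn


def run_round(
    spec: Spec,
    execute: Callable[[str], str],
    observe: Callable[[str], Optional[Check]],
    integrated: List[str],
) -> bool:
    """One pass: Specify -> Plan -> Execute -> Verify -> Integrate -> Learn."""
    plan = f"next step toward: {spec.intent}"   # Plan: just in time, discarded after the round
    artifact = execute(plan)                    # Execute
    if not all(check(artifact) for check in spec.evals):
        return False                            # Verify: the gate holds, nothing integrates
    integrated.append(artifact)                 # Integrate
    new_check = observe(artifact)               # Learn: did a defect escape the gate?
    if new_check is not None:
        spec.evals.append(new_check)            # feed it back to Specify
    return True


# Toy run: the same defect escapes once, then is blocked by the eval it produced.
attempts = iter(["handler without timeout", "handler without timeout", "handler with timeout"])


def execute(plan: str) -> str:
    return next(attempts)


def observe(artifact: str) -> Optional[Check]:
    if "without timeout" in artifact:
        return lambda a: "without timeout" not in a
    return None


spec = Spec(intent="add a request handler")
integrated: List[str] = []
results = [run_round(spec, execute, observe, integrated) for _ in range(3)]
```

In a line, round two would ship the same defect again; in the ring, the eval added after round one stops it at Verify.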
JIT planning · why not plan it all up front. A linear flow does the most planning at the moment of greatest ignorance (the project's start). The loop makes planning just-in-time: expand only the step at hand to executable detail, keep the spec stable and the plan fresh. The reason is ENG·02's same constraint (the effective window is finite): the more you pre-plan in detail, the staler and more window-hogging and attention-diluting it is by execution time. Spec-first is not plan-first: the spec is the durable source, the plan is a disposable working surface.
Why the planning horizon collapses. A linear flow's implicit assumption is "planning is cheap, implementation is dear, so plan more and rework less." When implementation is near-free, that trade-off flips entirely: rework is cheap now, and pre-planned detail becomes a liability — stale by execution time, and wasting the finite effective window while diluting the agent's attention (the same ENG·02 constraint). So the rational planning horizon collapses from "project-level" to "next-step-level": stable is SPEC.md (the durable source), volatile is PLAN.md (a use-and-discard working surface). This is the same mechanism as the organization volume's "planning cycle collapses from annual to real-time," shown on the engineering face — not because we got lazy but because the marginal return of pre-planning collapses as implementation cost falls. A copyable test: if your PLAN.md keeps thickening and one code change forces a sync of a large swath of plan, you have pre-hardened what should have been expanded just in time, pulling the loop straight into a line again.
旧
Before
规划→实现→测试是一次性管道;早期误解一路漂到生产,没有回上游的路。
Plan to build to test is a one-shot pipeline; an early misread floats to production with no path back upstream.
新 · 原理
After · principle
六步闭环:规格耐久、计划即时、验证承重、错误回流——流程随产出复利。
A six-step closed loop: durable spec, JIT plan, load-bearing verify, errors flowing back; the process compounds with output.
Evidence · grade Ⅳ · The canonical SDD loop Specify→Plan→Execute→Verify→Integrate→Learn and the SPEC / PLAN / TASKS trio come from Graziano's AI-Native Engineering 7-day path (Days 5–6), via GitHub Spec-kit (specs live with code, reviewed like code, with constitution.md as a hard-rule enforcement layer) and Martin Fowler's Exploring Gen-AI: SDD. Practitioner curation, not peer-reviewed; per the "tools are surface" discipline, we take only the loop's shape and feedback mechanism, not the toolchain.
旧 SDLC 的去向
What Happens to the Old SDLC Stages
把"实现充裕→判断退守"套在传统研发的几个阶段上,可以逐项预言它们的去向。和组织卷对管理五职能的处理同构:没有一个阶段"被 AI 增强"或凭空消失,每一个都被沿可验证性梯度劈成两半,可机检的一半下沉为基础设施,构成性的一半上浮为判断。这张表是 FIG. E3.0 那条梯度在研发流程上的逐阶投影。
Apply "execution becomes abundant, judgment retreats" to the stages of the traditional dev cycle and you can forecast each one's fate. It is isomorphic to the organization volume's treatment of the five management functions: no stage is "augmented by AI" or vanishes; each is split along the verifiability gradient, with the machine-checkable half sinking into infrastructure and the constitutive half rising as judgment. This table is FIG. E3.0's gradient projected stage by stage onto the development process.
TABLE E5.0 · SDLC → AI NATIVE
研发阶段去向表
Fate of the dev stages
阶段 / Stage: 需求 / Requirements (SPECIFY)
旧实现 / Old implementation: 需求文档 · 评审会 · 一次写死 / Requirement docs · review meetings · written once
下沉为基础设施的一半 / Half that sinks into infrastructure: agent 把意图展开成可执行步骤与候选方案,写规格的机械部分自动化(ENG·05 ② Plan,JIT 展开) / Agents expand intent into executable steps and option sets; the mechanical part of spec-writing automates (ENG·05 ② Plan, JIT)
上浮为判断的一半 / Half that rises as judgment: 定 intent 与非目标、为"什么值得造"负责(SPEC.md 这一步人持有) / Set intent and non-goals, own "what is worth building" (humans hold the SPEC.md step)
下沉为基础设施的一半 / Half that sinks into infrastructure: CI 即选择压力,PR 即评审门;Continuous AI 工作流默认只读、写操作须显式 safe-output(ENG·08) / CI is the selection pressure, the PR the review gate; Continuous AI workflows are read-only by default, with writes explicitly declared as safe-output (ENG·08)
上浮为判断的一半 / Half that rises as judgment: 守不可逆处的确认门:对外发布、权限变更、数据迁移 / Guard the confirmation gate at the irreversible: external releases, permission changes, data migrations
下沉为基础设施的一半 / Half that sinks into infrastructure: 遥测 + 事故回流成 eval / rule,沉淀为随时间复利的项目级资产;监督从实时改为异步分诊(ENG·09) / Telemetry + incidents flow back as evals / rules, settling into project assets that compound; supervision shifts from real-time to async triage (ENG·09)
上浮为判断的一半 / Half that rises as judgment: 在结构标记的异常处接管,并问"哪条护栏本该拦住它" / Take over at the anomalies structure flags, and ask "which guardrail should have caught it"
The two right-hand columns conceal the same conclusion as the organization volume: the "dev process" as a chain of serial stages yields wholesale, and what survives is not the stages but the constitutive-judgment half inside each one. One common misreading to avoid: this is not "automation eats the testing and ops roles," but the same person's leverage climbing the verifiability gradient — from writing implementation to setting what is correct, drawing seams, and guarding the irreversible. This is the stage-by-stage evidence for ENG·10's role fusion.
检验信号
Test signal
先行:返工率随轮次下降、规格被复用而非每次重写。反指标:PLAN.md 越写越厚、规格沦为合规摆设没人回流:那是把环又拉直成了线。
Leading: rework rate falls across rounds; specs get reused, not rewritten each time. Counter-signal: PLAN.md keeps swelling and the spec becomes compliance theater that no one feeds back into; that is the loop pulled straight into a line again.
ENG·06
DELEGATE · 评审矩阵
DELEGATE · REVIEW MATRIX
决策 · 可照做
Decision · Copyable
trust-but-verify 落成三档分工:交办 / 评审 / 自持
trust-but-verify becomes three tiers: Delegate / Review / Own
"Trust but verify" is not a slogan but a copyable division of labor. Each task drops into one of three tiers along a single question: fully delegable to the agent (Delegate), agent does it but a human reviews diff by diff (Review), or only a human can hold the judgment (Own). The ruler that sorts the tiers is the kernel's step-② verifiability gradient.
Force analysis · one question sets the tier. Ask: "can a machine judge this step's correctness cheaply and deterministically?" Yes: Delegate. Half (the machine checks form but not intent): Review. No (constitutive judgment: what counts as correct, risk tolerance, trust boundaries): Own. This is ENG·00's verifiability gradient as one ruler: the machine-checkable joins ① abundance and gets automated; the constitutive sinks to ④ and stays with people. The tiers split by verifiability, not by "importance" (the most commonly reversed point).
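The single question maps directly onto a lookup. The enum and function names below are hypothetical; the mapping itself is the one stated above. Note that "importance" is deliberately not a parameter.

```python
from enum import Enum


class Checkable(Enum):
    """Answer to: can a machine judge this step's correctness cheaply and deterministically?"""
    YES = "yes"    # fully machine-checkable
    HALF = "half"  # the machine checks form but not intent
    NO = "no"      # constitutive judgment


class Tier(Enum):
    DELEGATE = "delegate"
    REVIEW = "review"
    OWN = "own"


def tier_for(checkable: Checkable) -> Tier:
    # The tier depends on verifiability only, never on how important the task feels.
    return {
        Checkable.YES: Tier.DELEGATE,
        Checkable.HALF: Tier.REVIEW,
        Checkable.NO: Tier.OWN,
    }[checkable]
```

A rename under full test coverage and a business-critical pricing rule can both be "important"; only the first answers YES, so only the first is delegated.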
Delegate · 完全交办
Delegate · hand off
样板代码、CRUD、格式与重命名
Boilerplate, CRUD, formatting, renames
有测试覆盖的重构
Refactors under test coverage
补单测 / 写文档 / 跑 lint
Adding unit tests / docs / running lint
Own · 只有人能持有
Own · only humans hold
"何为对"的判据、产品取舍
The bar for "correct," product trade-offs
信任边界、权限、安全敏感接缝
Trust boundaries, permissions, security seams
不可逆 / 大爆炸半径的架构决策
Irreversible / large-blast-radius architecture
FIG. E6.0 / DELEGATE · REVIEW · OWN · 可逆性 × 爆炸半径 (reversibility × blast radius)
看懂:任务落在平面哪个区,就归哪一档
Read: where a task lands on the plane sets its tier
The three tiers are sorted not by "importance" but by two physical quantities: reversibility (can a mistake be cheaply undone) × blast radius (how far it spreads). Low radius × easy undo lands at lower-left, Delegate; high radius × irreversible lands at upper-right, Own; the diagonal band between is Review — the machine checks form but not intent, so a human must read it diff by diff. Note the dashed arrow: each model generation, with a few added evals, pushes the Review band down-left, handing tasks back to Delegate — ENG·01's climbing leverage seen on the review face. INSTRUMENT 07 below turns this plane into draggable sliders.
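The plane can be approximated with two inputs in [0, 1]. This is a sketch, not the actual INSTRUMENT 07: the additive risk score and the 0.7 / 1.3 cut-offs that bound the diagonal Review band are assumptions chosen for illustration, and a real team would calibrate them.

```python
def tier_on_plane(reversibility: float, blast_radius: float) -> str:
    """Map a task to a tier from two quantities, each scaled to [0, 1].

    reversibility: 1.0 means a mistake is trivially undone, 0.0 means irreversible.
    blast_radius:  1.0 means a mistake spreads everywhere, 0.0 means it stays local.
    """
    for name, value in (("reversibility", reversibility), ("blast_radius", blast_radius)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    risk = (1.0 - reversibility) + blast_radius  # 0 = lower-left corner, 2 = upper-right
    if risk < 0.7:       # assumed lower edge of the Review band
        return "DELEGATE"
    if risk > 1.3:       # assumed upper edge of the Review band
        return "OWN"
    return "REVIEW"
```

For example, a formatting change (easy to undo, local) lands in DELEGATE, while a permission-model change (hard to undo, wide reach) lands in OWN; a data migration with a tested rollback might sit in the band between.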
The middle tier is the hardest and the most valuable. Review = agent generates, human reviews diff by diff: new business logic, data migrations, external API contracts, performance-sensitive paths. Two rules. First, review the diff, not the artifact: watch what changed, not just whether it runs. Second, delegate toward a test: have the agent write the failing test first, then the implementation, and the human reviews whether the test locks the right intent. The Review tier shrinks as models strengthen: what needs diff-by-diff review today may drop to Delegate tomorrow with a few added evals; this is ENG·01's climbing leverage seen on the review face.
The three tiers precede tools and outlive them. This matrix is about "which judgments humans hold," not "which IDE." Practitioners split the human-held judgment into three: set intent (what, why), set constraints (architecture, standards, non-goals), and own verification (tests, review, quality gates). These three map exactly onto the organization volume's "judgment retreats to scarce nodes" — the agent takes over execution, the human holds these three non-outsourceable judgments. The accompanying three collaboration rules are equally copyable: plan first (moving from chat to plan is spec-driven's first step), keep context clean (do not let irrelevant history dilute the window, echoing ENG·02's effective window), and know when to stop (when the agent spins in place the problem is usually in the spec, not execution; stop and edit the spec rather than try another round).
证据 · 级 Ⅳ · Delegate / Review / Own 三档分工矩阵,源 Graziano《AI-Native Engineering》(Day 1,团队层分工)转引 OpenAI《Build an AI-Native Engineering Team》;"逐 diff 评审、以测试为目标、保持上下文干净"为其 Day 3 协作守则。"评审 diff 不评审产物"与本系列验证篇的"独立 checker 是唯一承重墙"同源;见验证篇 ↗。
Evidence · grade Ⅳ · The Delegate / Review / Own matrix comes from Graziano's AI-Native Engineering (Day 1, team-level division) via OpenAI's Build an AI-Native Engineering Team; "review by diff, target a test, keep context clean" are its Day 3 collaboration rules. "Review the diff, not the artifact" shares a root with this series' Verification chapter ("the independent checker is the one load-bearing wall"); see the Verification chapter ↗.
INSTRUMENT 07 · 分档计算器
INSTRUMENT 07 · Delegation-Tier Calculator
DELEGATE · 完全交办 · hand off
REVIEW · 逐 diff 评审 · review by diff
OWN · 只有人能持有 · only humans hold
检验信号
Test signal
先行:Review 档的任务逐季向 Delegate 迁移(说明你在补 eval、在上移杠杆)。反指标:什么都塞进 Review、人审带宽被吞:那是没在分档,是在用人肉追指数。
Leading: Review-tier tasks migrate to Delegate quarter by quarter (you are adding evals and climbing the leverage). Counter-signal: everything lands in Review and human bandwidth is eaten; that is not tiering but chasing an exponential by hand.
三档是动态的:靠补护栏把任务往左推
The tiers are dynamic: push tasks left by adding guardrails
Delegate / Review / Own is most easily misused as a static "task list" — pinning each task to a tier once and dividing labor accordingly. But the tiers' real value is that they move: a task that today must be Reviewed (a human reading every diff) can, once you add a guardrail that auto-decides correctness (an eval, a type constraint, an independent checker), shift left to Delegate (handed fully to the agent). So this matrix is not for "assigning who does what" but for tracking where your guardrails are growing: in a healthy team tasks migrate quarter by quarter from Review toward Delegate because the guardrails are thickening; while the Own tier (constitutive judgment: what is right, risk boundaries, architectural trade-offs) barely moves, because it is not machine-checkable and should not be delegated. Mapped onto ENG·03's verifiability gradient, it is the same ruler: whether a task can shift left depends on "can its correctness be machine-checked cheaply and deterministically" — if yes, shift left; if no, keep it on the right for people. [Source: Graziano, AI-Native Engineering Day 1 Delegate/Review/Own matrix, grade Ⅳ practitioner. [R1]]
Own 那一档为什么永远不空
Why the Own tier is never empty
三档里 Delegate 在变大、Review 在收缩,一个自然的疑问是:随着护栏越补越厚,Own 那一档会不会也终将被清空、人最终无事可做?答案是不会,而且原因是结构性的,不是"现在还做不到"的暂时性。Own 那一档装的是构成性判断——决定何为对、定义风险容忍、划信任边界、做架构取舍——这些之所以留在 Own,不是因为机器暂时不够强,而是因为它们没有一个独立于人的判据可供机检。"这个产品该不该有这个功能""这个性能和复杂度的取舍我们能不能接受""这个对外契约一旦定了就要长期负责,我们认不认"——这些问题的"对"不是一个客观事实,而是一个由人的价值、处境、责任共同定义的判断。你可以用 eval 把"符不符合已定的标准"机检掉,但"标准本身该是什么"这件事,定义它的动作本身就只能由人来做——一旦让机器来定标准,你只是把判断换了个地方藏起来,没有消除它。这正是内核第④步在工程面的落点:充裕和护栏清空的是可机检的那半,留下的恰好是构成性的那半,而后者是人回到工程师本职的地方。可证伪信号:若有人声称把 Own 那一档也自动化了,去看他到底自动化的是什么——大概率是把"按已定标准判合不合格"自动化了(那本就该自动化),而"标准该是什么"这个真正的 Own 判断,要么还藏在某个人手里,要么被悄悄默认成了模型训练分布里的某个值(那是把判断让渡给了一个没人为之负责的来源)。
With Delegate growing and Review shrinking, a natural question: as guardrails thicken, will the Own tier eventually be emptied too, leaving humans nothing to do? The answer is no, and the reason is structural, not a temporary "we cannot do it yet." The Own tier holds constitutive judgments — deciding what is correct, defining risk tolerance, drawing trust boundaries, making architectural trade-offs — and these stay in Own not because machines are momentarily too weak but because they have no criterion independent of humans to machine-check against. "Should this product have this feature," "can we accept this performance-versus-complexity trade-off," "this external contract, once set, carries long-term responsibility — do we own it" — the "correct" of these questions is not an objective fact but a judgment defined jointly by human values, situation, and responsibility. You can machine-check "does it conform to the set standard" with an eval, but "what the standard itself should be" — the act of defining it can only be done by people; let a machine set the standard and you have merely hidden the judgment elsewhere, not eliminated it. This is the kernel's step ④ landing on the engineering face: abundance and guardrails clear the machine-checkable half and leave precisely the constitutive half, which is where people return to the engineer's true work. Falsifiable signal: if someone claims to have automated the Own tier too, look at what they actually automated — most likely "judging pass/fail against a set standard" (which should be automated), while the real Own judgment of "what the standard should be" is either still in someone's hands or quietly defaulted to some value in the model's training distribution (which is ceding the judgment to a source no one is responsible for).
FIG. E6.1 / TRUST-BOUNDARY ZONES · 信任边界的同心圈
看懂:能力越往外圈走、回退越难、爆炸半径越大,批准权就越要收回到人手里
Read: the farther out the ring, the harder to undo and the larger the blast radius, so approval is pulled back toward humans
权限不是一个"信不信任 agent"的开关,而是一组同心圈:能力按"回退难度 × 爆炸半径"分层,批准权随圈层向外逐级收回人手。这正是 INSTRUMENT 07 那把"可逆性 × 爆炸半径"的尺子画成空间:最外圈的特权动作不会因为模型更强而内移,因为"该不该按下这个不可逆按钮"是构成性判断,永远落在 Own 那一档。
Permission is not one "do we trust the agent" switch but a set of concentric rings: capability is tiered by "cost-to-undo × blast radius," and approval is pulled back toward humans as you move outward. This is INSTRUMENT 07's "reversibility × blast-radius" ruler drawn as space: the outermost privileged actions do not migrate inward as models improve, because "should this irreversible button be pressed" is a constitutive judgment that always lands in the Own tier.
ENG·03 says "specs must be machine-checkable," but "writing a spec" is not an on/off switch. It has three rungs: the spec is written once (Spec-First), the spec stays alive in sync with code (Spec-Anchored), the spec becomes the single source from which code is generated (Spec-as-Source). The higher you climb, the more "correctness" can converge on its own.
Force analysis · why rungs. Treating "write a spec" as all-or-nothing forces teams onto Spec-as-Source before the verification infrastructure exists, and the spec rots into a document no one maintains. The ladder matches investment to payoff: each rung turns the spec from a one-shot intent brief into a regressible load-bearing artifact, and climbing a rung presupposes that the next rung's machine-checkable conditions are in place. This stitches to ENG·03: machine-checkability is the vertical axis (depth), maturity is the horizontal axis (durability); a spec becomes the generation loop's objective function only when it is both checkable and alive.
Ⅰ · Spec-First · 规格先行
动手前先写规格、再生成。治 vibe-coding,但规格写完即与代码分叉漂移。
Write the spec before generating. Cures vibe-coding, but the spec drifts from code once written.
Ⅱ · Spec-Anchored · 规格锚定
规格与代码同住同版、像代码一样被评审;CI 检查二者一致。
Spec lives and versions with code, reviewed like code; CI checks the two stay consistent.
Ⅲ · Spec-as-Source · 规格即源
规格是唯一真源,代码是它的产物;改行为先改规格。门槛最高、回报最大。
The spec is the single source, code is its output; change behavior by changing the spec. Highest bar, largest payoff.
Copyable test · which rung you are on. Ask three things. (1) To change a behavior, do you edit code first or the spec first? Code first = rung Ⅰ. (2) When spec and code disagree, does CI go red? No = not yet rung Ⅱ. (3) Is there a "hard-rule" layer (like a constitution.md) the agent may not violate and that an enforcement layer blocks? No = not yet at rung Ⅲ's door. Most teams sit between rungs Ⅰ and Ⅱ and often believe they are at rung Ⅲ (exactly why specs become compliance theater).
Evidence · grade Ⅳ. The three-rung spec maturity ladder (Spec-First → Spec-Anchored → Spec-as-Source) comes from Graziano's AI-Native Engineering (Days 5–6) via GitHub Spec-kit (specs live with code, the PR is the review gate, constitution.md as a hard-rule enforcement layer). This series stresses "machine-checkable" in ENG·03; the ladder adds its missing "durability evolution" axis; the two are orthogonal and complementary.
为什么大多数团队卡在阶 Ⅰ 到 Ⅱ 之间。升到阶 Ⅱ(规格锚定)需要一个常被低估的基建:一个能在"规格与代码不一致"时让 CI 变红的检查器。没有它,规格和代码会无声分叉——人改了代码忘了改规格,或反过来——而没有任何信号提醒。这正是 ENG·03 的"可机检"在阶梯上的作用:可机检是让规格活着的前提,因为只有可机检的规格才能被 CI 持续比对。所以这条阶梯和 ENG·03 是正交两轴:纵轴是"规格有多可机检"(深度),横轴是"规格活得多久"(耐久度)。一条规格只有同时在两轴上都够高,才真正成为生成循环的目标函数——可机检但没人维护的规格会腐烂,活着但不可机检的规格只是一篇没有约束力的作文。
Why most teams stall between rungs Ⅰ and Ⅱ. Climbing to rung Ⅱ (Spec-Anchored) needs an often-underestimated piece of infrastructure: a checker that turns CI red when "spec and code disagree." Without it, spec and code fork silently — someone edits the code and forgets the spec, or vice versa — with no signal to flag it. This is exactly where ENG·03's "machine-checkable" does its work on the ladder: machine-checkability is the precondition for the spec to stay alive, because only a machine-checkable spec can be continuously diffed by CI. So this ladder and ENG·03 are two orthogonal axes: the vertical is "how machine-checkable the spec is" (depth), the horizontal is "how long the spec stays alive" (durability). A spec becomes the generation loop's objective function only when it is high on both — a machine-checkable spec no one maintains rots, and a living but un-checkable spec is just a non-binding essay.
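The rung Ⅱ checker described above can be sketched in a few lines. This is a toy under stated assumptions, not how Spec-kit or any particular tool does it: the spec is assumed to be a plain mapping from operation name to parameter names, and `SPEC`, `create_order`, and `cancel_order` are hypothetical names invented for the example.

```python
import inspect
from typing import Callable, Iterable


def ops_from_functions(funcs: Iterable[Callable]) -> dict[str, list[str]]:
    """Describe implemented operations as {function name: [parameter names]}."""
    return {f.__name__: list(inspect.signature(f).parameters) for f in funcs}


def spec_drift(spec_ops: dict[str, list[str]], code_ops: dict[str, list[str]]) -> list[str]:
    """Return one human-readable message per disagreement between spec and code."""
    problems = []
    for name in sorted(spec_ops.keys() - code_ops.keys()):
        problems.append(f"in spec but not in code: {name}")
    for name in sorted(code_ops.keys() - spec_ops.keys()):
        problems.append(f"in code but not in spec: {name}")
    for name in sorted(spec_ops.keys() & code_ops.keys()):
        if spec_ops[name] != code_ops[name]:
            problems.append(
                f"parameter mismatch for {name}: spec={spec_ops[name]} code={code_ops[name]}"
            )
    return problems


# Hypothetical spec and implementation, for illustration only.
SPEC = {"create_order": ["customer_id", "items"], "cancel_order": ["order_id"]}


def create_order(customer_id, items, coupon=None):  # someone added `coupon` in code only
    ...


def cancel_order(order_id):
    ...


def main() -> int:
    problems = spec_drift(SPEC, ops_from_functions([create_order, cancel_order]))
    for problem in problems:
        print(f"SPEC DRIFT: {problem}")
    # A non-zero exit code is what turns the CI job red.
    return 1 if problems else 0
```

A CI step would call `sys.exit(main())`. Here the unrecorded `coupon` parameter makes `main()` return 1, which is the signal that a silent fork otherwise never produces.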
Rung Ⅲ's hard-rule layer promotes "what is correct" from suggestion to enforcement. The mark of Spec-as-Source is not "having a spec" but a layer of hard rules the agent may not violate, blocked by an enforcement layer (such as a constitution.md). The difference is teeth: rung Ⅰ/Ⅱ specs say "it should be this way," and the agent can violate them with humans noticing later; rung Ⅲ hard rules say "it cannot be this way," and a violation is intercepted automatically at generation or merge. This pushes ENG·03's machine-checkability to its limit — checking not only whether the output is correct but whether the process crossed an inviolable red line. And precisely because the bar is highest and the payoff largest, it is the most commonly misjudged: many teams believe they are at rung Ⅲ when they merely have a pile of unenforced convention docs, which is still rung Ⅰ.
检验信号 · Test signal
先行:改行为时人自然先去改规格。反指标:规格目录最后更新在三个月前,代码却天天变,规格已死,你掉回了阶 Ⅰ 之下。
Leading: to change behavior, people instinctively edit the spec first. Counter-signal: the spec folder was last touched three months ago while code changes daily; the spec is dead and you have fallen below rung Ⅰ.
让规格像代码一样被评审:constitution 作硬规则层
Review the spec like code: a constitution as the hard-rule layer
For a spec to climb to the Spec-as-Source rung, the discipline of "write the spec first" is not enough — discipline slackens. What truly keeps a spec alive is placing it in the same reviewable, diffable, enforceable mechanism as code. One concrete copyable practice comes from Spec-kit: spec files live in the same repo as code, every spec change goes through a PR, and a spec change is reviewed like code; then a constitution.md serves as a hard-rule enforcement layer — write the "must not be violated under any circumstances" constraints (security red lines, architectural invariants, external contracts) into it, enforced by the pipeline on every generation/commit, bypassable by neither agent nor human. The elegance of this layer is that it separates two natures of a spec: most of the spec is an evolvable description of "how to do it," which a PR review suffices for; a few are "must never" invariants that must be machine-enforced as hard constraints. This stitches ENG·03's "machine-checkable" to this sheet's "standing": a constitution is both machine-checkable (a machine verifies violation) and of the highest standing (the non-negotiable part of the spec). Falsifiable signal: if your team "has specs" but not a single spec rule is hard-enforced by the pipeline, all relying on voluntary compliance, that spec set will most likely slide back along the "discipline slackens" path to Spec-First or below. [Source: Graziano, AI-Native Engineering Day 5–6 Spec-kit (PR as review gate, constitution.md as hard-rule layer), grade Ⅳ practitioner. [R5][R1]]
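A hard-rule layer with teeth can be sketched as a gate that scans the lines a change adds and fails the pipeline on any match. This sketch makes several assumptions: the rules are expressed as regular expressions, `HARD_RULES` and its two entries are hypothetical, and nothing here reflects the actual format or tooling of Spec-kit's constitution.md.

```python
import re
from typing import NamedTuple


class HardRule(NamedTuple):
    rule_id: str
    pattern: re.Pattern
    reason: str


# Hypothetical red lines; a real constitution is authored and reviewed by the team.
HARD_RULES = [
    HardRule(
        "no-hardcoded-secret",
        re.compile(r"(?i)\b(api[_-]?key|secret|token)\s*=\s*['\"][^'\"]+['\"]"),
        "credentials must come from the secret store, never from source",
    ),
    HardRule(
        "no-tls-verification-off",
        re.compile(r"\bverify\s*=\s*False\b"),
        "TLS verification is an architectural invariant",
    ),
]


def violations(added_lines: list[str]) -> list[tuple[int, str, str]]:
    """Return (line number, rule id, reason) for each added line that crosses a red line."""
    found = []
    for number, line in enumerate(added_lines, start=1):
        for rule in HARD_RULES:
            if rule.pattern.search(line):
                found.append((number, rule.rule_id, rule.reason))
    return found


def gate(added_lines: list[str]) -> int:
    """Exit-code style result: non-zero blocks the generation or merge."""
    hits = violations(added_lines)
    for number, rule_id, reason in hits:
        print(f"BLOCKED line {number}: {rule_id} ({reason})")
    return 1 if hits else 0
```

The design point is that `gate` returns the same result whether the lines came from an agent or a person. Pattern matching is only one crude enforcement mechanism; type checks, policy engines, or contract tests could fill the same slot.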
规格不是开关,是阶梯:怎么判断该爬到哪一档
A spec is not a switch but a ladder: judging which rung to climb to
Treating the maturity ladder as "higher is always better, every project should push to Spec-as-Source" is a common misuse — it makes teams force heavy specs onto low-risk, fast-iterating exploratory code, dragging down what should be light and quick. The ladder's correct reading is: different code should stop on different rungs, decided jointly by that code's "cost to change once" and "cost to get wrong once." An exploratory prototype that may be thrown away wholesale at any time stops at Spec-First or lighter — writing a heavy spec for it is waste, because it is short-lived and cheap to get wrong. But an external contract, a core module many downstreams depend on, a piece of security-sensitive code is worth climbing to Spec-Anchored or even Spec-as-Source — because changing it once moves a lot and getting it wrong once is costly, and the payoff of nailing "what is correct" into an authoritative spec far exceeds the cost of maintaining it. This criterion is the same ruler as ENG·10's boundaries and INSTRUMENT 07's blast radius: the closer to a system seam and the larger the blast radius, the more the code is worth climbing the ladder for. So "spec maturity" is not a single team-wide rung but a map that varies with code importance. Falsifiable signal: if your team either has no spec for any code (all on rung 0) or forces the heavy spec process onto all code (one-size-fits-all regardless of stakes), neither is tiering by cost — the former trips on core modules, the latter is dragged down on exploration by a self-imposed spec burden. [Source: Graziano, AI-Native Engineering Day 5–6 SDD "a spec is not a switch but a ladder," grade Ⅳ practitioner. [R1]]
Take spec maturity and machine-checkable form together and you get a compact criterion for this volume: a good spec must, in form, sit as far toward the machine-checkable end as possible (the left end of ENG·03's spectrum) and, in standing, climb to the rung commensurate with the code's risk (this sheet's ladder). Only a spec that meets both can truly serve as the generation loop's objective function: form lets the machine verify it, standing makes the team keep editing it. Miss either and the spec degrades: form without standing is written once and then dropped; standing without form leans on people to read and review, with the machine unable to help. This is why "wrote a spec" is not the finish line; "the spec is alive and machine-checkable" is.
FIG. E8.0 / THE SPEC MATURITY LADDER · 可机检份额逐档抬升
看懂:每爬一档,规格里"机器能自己验"的份额变大,人审的负担变小
Read: each rung climbed grows the machine-checkable share and shrinks the human-review burden
同一句"对",从散文意图爬到形式化证明,并不是变得"更对",而是越来越多地被翻译成机器能独立复验的判据;蓝色份额每档抬升一截。这条阶梯不是"越高越好":该停在哪一档由代码的改动成本与出错成本决定(见上文判据),但只要往右爬一档,规格当生成循环目标函数的能力就强一分。
The same "correct," climbing from prose intent to formal proof, does not become more correct; it gets translated, rung by rung, into criteria a machine can re-verify on its own, and the blue share steps up each rung. The ladder is not "higher is always better": where to stop is set by the code's cost-to-change and cost-to-get-wrong (the criterion above), but every rung climbed strengthens the spec's power to serve as the generation loop's objective function.
ENG
08
TRUST BOUNDARY · 安全边界
TRUST BOUNDARY
重画 · 一等结构
Redraw · First-class
信任边界是结构里的一等元素,不是事后审查
The trust boundary is a first-class structural element, not an afterthought audit
Agents touch everything, fast. They read your codebase, call external tools, connect to third-party MCP servers; every one is an attack surface. Security-sensitive seams must be front-loaded, explicit, and human-guarded: least privilege, contained blast radius. This is the Architecture chapter's "trust boundary as first-class structure" landed on the engineering face.
Force analysis · where the new attack surface comes from. Exposing tools to an agent hands the "who may execute what" boundary to a system steerable by natural language. Three new risks: tool poisoning (instructions hidden in an MCP tool's description, luring the agent into out-of-scope calls), prompt injection (instructions smuggled in processed data, hijacking agent behavior), and credential leakage (the agent writes keys or tokens into logs or sends them back). All three share a root: wiring untrusted input to a privileged executor grows them structurally; they are not occasional model bugs. You stop what you can with boundaries, and what you cannot stop you contain by narrowing the blast radius.
Read-only by default: agentic workflows in the pipeline have no write access by default; writes must be declared safe-output explicitly and merge only after human review.
逐工具授权:每个 MCP server / 工具单独列权限,不给通配;新接一个工具当作新引一个依赖来审。
Per-tool authorization: list permissions per MCP server / tool, no wildcards; vet a newly connected tool as you would a newly added dependency.
Isolated execution: run the agent in a sandbox / container that bounds the files, network, and secrets it can reach, so the blast radius is contained structurally.
不可信数据隔离:把"被处理的数据"和"给 agent 的指令"在通道上分开,别让前者能改写后者。
Untrusted-data isolation: separate "data being processed" from "instructions to the agent" at the channel level, so the former cannot rewrite the latter.
人守在不可逆处:删除、转账、对外发布、权限变更——这些写操作设显式确认门,agent 不得自决。
Humans guard the irreversible: deletes, transfers, external publishes, and permission changes sit behind an explicit confirmation gate; the agent may not self-authorize them.
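Three of the rules above (per-tool authorization without wildcards, deny by default, and a human gate on irreversible actions) can be expressed as one small policy object. This is a sketch under assumptions: `ToolGrant`, `Policy`, the action names, and the `IRREVERSIBLE` set are all hypothetical and are not part of MCP or any real agent framework.

```python
from dataclasses import dataclass

# Assumption: the actions this team treats as irreversible.
IRREVERSIBLE = frozenset({"delete", "transfer", "publish", "change_permissions"})


@dataclass(frozen=True)
class ToolGrant:
    tool: str
    actions: frozenset  # explicit action names only

    def __post_init__(self):
        # Refuse wildcards at construction time, so a lazy grant cannot exist.
        if "*" in self.tool or any("*" in action for action in self.actions):
            raise ValueError("wildcard grants are not allowed")


class Policy:
    def __init__(self, grants):
        self._grants = {grant.tool: grant.actions for grant in grants}

    def decide(self, tool: str, action: str, human_confirmed: bool = False) -> str:
        """Return 'allow', 'deny', or 'needs_human' for one requested call."""
        allowed = self._grants.get(tool, frozenset())  # unknown tool: nothing allowed
        if action not in allowed:
            return "deny"
        if action in IRREVERSIBLE and not human_confirmed:
            return "needs_human"
        return "allow"


policy = Policy([
    ToolGrant("repo", frozenset({"read"})),
    ToolGrant("payments", frozenset({"read", "transfer"})),
])
print(policy.decide("repo", "read"))          # allow
print(policy.decide("repo", "write"))         # deny
print(policy.decide("payments", "transfer"))  # needs_human
```

Because every grant is listed per tool, the one-sentence answer to "what can this agent touch" is simply the contents of the grant list.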
Evidence · grade Ⅳ. MCP security (tool poisoning / prompt injection / credential leakage / least-privilege list) and Continuous AI (agentic CI/CD workflows read-only by default, writes declared safe-output, merged after human review) come from Graziano's AI-Native Engineering (Day 4 MCP security, Day 7 Continuous AI) via GitHub's Continuous AI in Practice. "Trust boundary as first-class structure, least privilege, contained blast radius" shares a root with this series' Architecture chapter ↗.
Why the old "review afterward" line of defense fails. Traditional security treats the trust boundary as a checkpoint after the fact: build first, audit before launch. That does not work in the agent era, for two reasons. One is speed — an agent can call dozens of tools and change hundreds of files in seconds; by the time a human reviews, the over-privileged action has already happened. The other is that the kind of attack surface changed: the old surface was determinate code paths that could be statically scanned; the agent's surface is natural language, which cannot be exhaustively statically analyzed — an instruction hidden in a tool description or in processed data simply looks like ordinary text. So the trust boundary must move from "review afterward" to "a first-class element in the structure": before the agent can do anything, what it can do is already bounded structurally by permissions, sandbox, and channel isolation. This is two faces of the same thing as ENG·10's "boundary as judgment node" — an architecture boundary is both a correctness seam and a security seam.
把"接一个新工具"当成"引一个新依赖"来审。这是一条可照做的纪律。我们早就学会不随便 npm install 一个陌生包——会看下载量、看维护者、看它要什么权限。接一个 MCP server / 工具应该走同一道关:它的工具描述里有没有可疑指令(tool poisoning)?它要读写哪些资源、是不是远超它该有的范围?它会不会把数据回传到你不控制的地方?Continuous AI 给出流水线层的默认姿态——CI/CD 里的 agentic 工作流默认只读,任何写操作必须被显式声明为 safe-output、并经人审后才合并。默认只读这一条尤其关键:它把"出事"的默认方向从"已经写坏了再回滚"翻转成"想写必须先举手",爆炸半径在结构上就被限制在接近零。
Vet "connecting a new tool" as you would "adding a new dependency." This is a copyable discipline. We long ago learned not to npm install a stranger's package casually — we check downloads, maintainers, what permissions it wants. Connecting an MCP server / tool should pass the same gate: are there suspicious instructions in its tool descriptions (tool poisoning)? Which resources does it read and write, and is that far beyond what it should need? Could it send data somewhere you do not control? Continuous AI gives the pipeline-level default posture — agentic workflows in CI/CD are read-only by default, and any write must be explicitly declared safe-output and merged only after human review. The read-only default matters most: it flips the default direction of "something goes wrong" from "it already wrote damage, now roll back" to "to write, it must first raise its hand," nailing the blast radius structurally near zero.
检验信号 · Test signal
先行:你能一句话说出每个 agent 能碰什么、不能碰什么。反指标:图省事给了通配权限、agent 跑在你的主机账户上,一次 prompt injection 就是全盘失守。
Leading: you can state in one sentence what each agent can and cannot touch. Counter-signal: wildcard permissions granted for convenience and the agent running under your own host account; one prompt injection and everything is lost.
能力接口要可组合,但每个接口单独授权
Capability interfaces must be composable, yet each authorized alone
"Composable capability interfaces" (skills / MCP / CLI) are how an agent extends its reach, but between "composable" and "secure" sits a tension to resolve head-on: you want the agent to flexibly compose several capabilities for a complex task, yet you do not want any one poisoned interface to run off with the whole permission set. The solution is not a compromise between the two but putting composability on the interface shape of a capability and authorization on each interface's boundary — interfaces compose freely, but the capability each carries is authorized and audited separately. Even a poisoned MCP tool's impact is nailed to the small slice of capability it was granted (read this directory only, call this read-only API only) and cannot cross that boundary. This is where ENG·04's "composable interfaces" and this sheet's "least privilege" converge on one object: an interface's capability face must be composable (for agent flexibility), its permission face minimal and independent (for a controllable blast radius). Hold the two faces apart and "flexible yet secure" stops being a contradiction and becomes two orthogonal properties of one interface. Falsifiable signal: if adding one capability to an agent forces you to open a bundle of unrelated permissions (because they share one key), your interface has coupled the capability face to the permission face — exactly the structural cause by which one poisoning loses everything.
可组合不等于失控:接口让 agent 安全地扩展自己
Composable is not uncontrolled: interfaces let the agent extend itself safely
Wiring this sheet to ENG·04's "composable interfaces" reveals a design stance crucial to AI-Native engineering: the correct way for an agent to extend its capability is through a set of clearly defined, separately authorized interfaces, not by giving it one omnipotent general entry. The temptation always exists — give the agent a tool that runs arbitrary shell commands and it seems to "do everything," sparing you the trouble of defining many fine-grained interfaces. But that builds composability on sand: an omnipotent entry means its capability face and permission face are fully coupled, and you cannot express "it may do A and B but not C" because A, B, C share one unconstrained channel. Conversely, making each capability an independent interface (a skill that reads only this data source, an MCP tool that calls only this API, a CLI that operates only under this directory) makes composability stronger — because only clearly defined interfaces can be reliably composed, and the permission boundary each interface carries keeps that composition from overreaching. This is why "composable" and "least privilege" are not opposed in good interface design but mutually enabling: a clear interface boundary serves both composition (making it predictable) and security (making authorization expressible). Falsifiable signal: if your main way to extend the agent is "a hole that runs arbitrary commands," you have neither real composability (composition is unpredictable) nor real least privilege (you cannot express fine-grained constraints) — you have merely mistaken "convenient" for "flexible."
ENG
09
FAILURE MODES · 失败模式学
FAILURE MODES
反例 · 为何非验不可
Anti-pattern · Why verify
为何非验不可:会猜的系统,错误会滚雪球
Why you cannot skip verification: a guessing system snowballs its errors
The sheets above covered "how to verify"; this one covers "why you cannot skip it." The cause behind trust-but-verify is a set of falsifiable, demonstrable failure modes: not occasional model glitches but structural by-products of a guessing system. Recognize them and you know where the guardrails go.
Force analysis · one common root. An LLM generates by "what token is most probable next"; it has no natural state of "I don't know," so its failures are mostly confidently wrong rather than silently empty. That root grows the five modes below. Of these, the snowball effect matters most: an early small misread is taken as a fixed premise by every later step and amplifies exponentially across them; it alone argues for "a verification checkpoint at every meaningful step," because correcting later is dearer.
幻觉 · Hallucination
凭空生成看似合理、实则不存在的 API、字段、引用。护栏:可机检的类型 / schema / 编译器,让"不存在"当场报错。
Invents plausible-looking but nonexistent APIs, fields, citations. Guardrail: machine-checkable types / schema / compiler so "does not exist" errors out at once.
自信而错 · Confident wrongness
错误答案与正确答案语气一样笃定,没有不确定信号。护栏:独立 checker 判对错,别信生成者的自评。
A wrong answer sounds as certain as a right one, with no uncertainty signal. Guardrail: an independent checker judges correctness; do not trust the generator's self-assessment.
上下文腐化 · Context rot
长会话里早期信息被稀释、覆盖、自相矛盾,模型悄悄漂离原意图。护栏:用文件作持久真源,而非靠对话历史记事。
Across a long session, early information gets diluted, overwritten, self-contradictory; the model quietly drifts from the original intent. Guardrail: use files as the persistent source, not conversation history.
隐藏假设 · Hidden assumptions
把未言明的前提当事实补全,沿错误前提一路自洽地跑下去。护栏:规格显式写出非目标与边界条件。
Fills in unstated premises as fact and runs on, internally consistent atop a wrong premise. Guardrail: specs state non-goals and boundary conditions explicitly.
雪球效应 · Snowball effect
早期小错被后续每步当作既定前提,沿多步指数放大。护栏:每个有意义步骤设验证检查点,趁错小就拦。
An early small error is taken as a fixed premise by each later step and amplifies exponentially. Guardrail: a verification checkpoint at every meaningful step, catching errors while small.
核心图 · KEY FIG
FIG. E9.0 / THE SNOWBALL · 雪球 vs 检查点
看懂:为什么每个有意义步骤都要一个验证检查点
Read: why every meaningful step needs a verification checkpoint
Two curves leave the same origin and end worlds apart. The vermilion line is the trajectory without an independent verifier: an early small misread is taken as a fixed premise by every later step and snowballs exponentially — and because an LLM has no natural "I don't know," it stays confident the whole way and never brakes itself. The blue line is the trajectory with a verification checkpoint at every meaningful step: each time the error is caught and reset near zero while still small, accumulated drift is squeezed into a bounded sawtooth. This figure alone argues the cause behind trust-but-verify — not that the agent is dumb, but that a guessing system's drift compounds and correcting later is dearer. [Failure taxonomy from Graziano, grade Ⅳ; the snowball mechanism is this series' formalization of it.]
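The two curves can be reproduced with a toy model. The sketch below assumes that each step multiplies the inherited error by a fixed factor and adds a small fresh misread, and that a checkpoint resets the error to a residual. The `growth`, `noise`, and `residual` values are illustrative assumptions, not measurements from any real agent.

```python
from typing import Optional


def drift(
    steps: int,
    growth: float = 1.5,
    noise: float = 0.01,
    checkpoint_every: Optional[int] = None,
    residual: float = 0.0,
) -> list[float]:
    """Accumulated error after each step of a multi-step run (toy model)."""
    error = 0.0
    trace = []
    for step in range(1, steps + 1):
        # Each step builds on the previous error as if it were a fixed premise.
        error = error * growth + noise
        if checkpoint_every and step % checkpoint_every == 0:
            # An independent verifier catches the drift and resets it.
            error = min(error, residual)
        trace.append(error)
    return trace


unchecked = drift(10)                    # the snowball curve
checked = drift(10, checkpoint_every=3)  # the bounded sawtooth

print(f"no checkpoints, final error: {unchecked[-1]:.3f}")
print(f"checkpoint every 3 steps, peak error: {max(checked):.3f}")
```

Under these assumptions the unchecked trace grows geometrically while the checked trace never exceeds what two steps can accumulate. Shortening `checkpoint_every` lowers the peak further, which is the argument for placing a checkpoint at every meaningful step.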
The irony boundary · watching harder is not the answer. The more reliable the system, the faster a supervisor's vigilance decays, and at the very anomaly that most needs a takeover, the human has already lost situational awareness (Bainbridge 1983, Ironies of Automation). So "a human on the loop watching the screen in real time" is doomed: the answer is not to watch harder but to shift human intervention from real-time supervision to asynchronous triage, letting structure (independent checkers, checkpoints, evals) block the vast majority while the human takes over only at the few anomalies structure has flagged. This stitches the failure-mode study to this series' Verification chapter.
Evidence · grade Ⅳ + Ⅱ. The failure taxonomy (hallucination / confident wrongness / context rot / hidden assumptions / snowball effect) comes from Graziano's AI-Native Engineering (Day 2 failure modes), grade Ⅳ practitioner curation. "Ironies of automation" (the more reliable the system, the faster supervisor vigilance decays) comes from Lisanne Bainbridge, Ironies of Automation, Automatica 1983, grade Ⅱ classic human-factors literature, cited the same way in this series' Verification chapter ↗.
The five modes share one root, so guardrails can be assigned systematically. Line the five up and you see they are not five independent defects but the same root ("guesses, and has no 'I don't know' state") showing itself at different junctures — so each maps to a class of structural guardrail rather than "be more careful": hallucination to types / schema / compiler (so "does not exist" errors out on the spot), confident wrongness to an independent checker (do not trust the generator's self-assessment), context rot to a file-based persistent source (do not bookkeep on conversation history), hidden assumptions to explicitly written non-goals and boundary conditions, the snowball to a verification checkpoint at each step (catch errors while small). This mapping is itself a copyable guardrail checklist: when a failure class appears, first identify which root it belongs to, then add the structure for that cell, instead of adding one more round of human re-checking.
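The mapping above is small enough to keep as a lookup table next to the incident template. The sketch below records the five pairs exactly as this sheet states them; the key names and the `guardrail_for` helper are hypothetical conveniences added for the example.

```python
# Failure mode -> class of structural guardrail, as listed in this sheet.
GUARDRAIL_FOR = {
    "hallucination": "types / schema / compiler",
    "confident_wrongness": "independent checker",
    "context_rot": "file-based persistent source",
    "hidden_assumptions": "explicit non-goals and boundary conditions in the spec",
    "snowball": "verification checkpoint at each meaningful step",
}


def guardrail_for(failure_mode: str) -> str:
    """Name the structure to add for a classified incident."""
    try:
        return GUARDRAIL_FOR[failure_mode]
    except KeyError:
        # An unclassified incident is a prompt to extend the taxonomy,
        # not to fall back on "review more carefully".
        raise ValueError(f"unclassified failure mode: {failure_mode!r}") from None


print(guardrail_for("context_rot"))  # file-based persistent source
```

Raising on an unknown mode is a deliberate choice in this sketch: it forces the post-incident review to classify the failure before anyone proposes another round of human re-checking.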
The irony of automation is this volume's most counterintuitive and most important point. Bainbridge's 1983 finding: the more reliable you make a system, the harder it is for a human supervisor to catch it when it does fail — because long stretches of nothing-happening let vigilance decay naturally, and at the rarest, most-needs-a-takeover anomaly the supervisor has already lost situational awareness. This bears directly on AI-Native engineering: many teams' instinct is "since the agent errs, station a human to watch closely." Bainbridge tells you this is doomed — the longer the watch and the steadier the system, the less the human catches that critical moment. The right fix is not stronger real-time supervision but a change in its shape: from "a human on the loop watching the screen" to "structure blocks first, the human does asynchronous triage." Let independent checkers, checkpoints, and evals block the vast majority at the moment of occurrence, and push to the human only the few real anomalies structure has flagged — and push them with full context attached. Then the human faces not "a sudden anomaly inside hours of calm" but "an already-framed item with evidence to judge" — and the human-factors trap of vigilance decay is routed around structurally.
检验信号 · Test signal
先行:每次事故能回答"哪条护栏本该拦住它"并补上。反指标:靠"让人盯紧点"防雪球,那是在用衰减的注意力对抗指数的错误增长。
Leading: every incident can answer "which guardrail should have caught it" and gets one added. Counter-signal: fighting the snowball by "having people watch more closely" (decaying attention against exponential error growth).
为什么"让人盯紧点"是错的方向
Why "have people watch more closely" is the wrong direction
Faced with an agent erring often, the most natural reaction is "then have people review more carefully." This path necessarily fails at agentic scale for a reason of magnitude, not attitude: human attention is decaying, finite, and does not grow with output, while error growth is exponential. The review backlog an agent produces overnight can easily exceed what one person can carefully review in a week; expecting humans to close that gap by "trying harder" is chasing an exponential curve with a flat line, and the gap only widens. Worse, the more fatigued, the more readily a human waves through self-consistent, fluent, confident errors — exactly the kind the agent is best at producing. So the right direction is not to crank up review intensity but to move human review off the hot path: distill machine-checkable correctness into automatic guardrails (let the machine filter out the vast majority before the human), keep the human on the few constitutive judgments the machine cannot reach, and change the form of human review from "watching the screen in real time" to "asynchronous triage" (handling items the machine flagged, not every item). This is a different face of the same move as ENG·06's tiers and ENG·15's eval compounding: the only thing that keeps pace with exponential output is another automatic verification that grows with output, not a human who has only so much bandwidth however hard they try. Falsifiable signal: if your team's main response to quality problems is more shifts and more reviewers rather than more automatic guardrails, you are fighting exponential error growth with decaying attention — the outcome of that fight is settled.
ENG
10
BOUNDARY · 边界即判断节点
BOUNDARY · JUDGMENT NODE
机理 · 角色融合
Mechanism · Role fusion
架构边界,就是非人不可的判断节点
An architecture boundary is exactly the judgment node only a human can hold
When implementation is abundant, "where the boundaries go" becomes one of the few decisions only a human can make. The boundaries of modules, services, interfaces are exactly where reversibility and blast radius live: the architect decides where the seams go, the agent fills in what is inside the modules. This collapses every sheet above into one line: the engineer fuses from implementer into orchestrator.
Force analysis · why a boundary is a judgment node. Whether a decision deserves a human turns on two things: reversibility (can a mistake be cheaply undone) and blast radius (how far a mistake spreads). A module boundary happens to set both at once: draw the seam wrong and errors spread across it and resist rollback. So an architecture boundary is inherently a high-blast-radius, low-reversibility load-bearing decision, exactly where judgment should go; the implementation inside the boundary is low-radius, high-reversibility and can be safely handed to the agent. Same ruler as ENG·06's tiers, on the structural face: the boundary is Own, inside the module is Delegate.
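The two-factor test can be turned into a rough scoring rule. Everything numeric in this sketch is an assumption made for illustration: the 1 to 5 scales, the product as the combining rule, and the `own_threshold` default. A team would calibrate these against its own incidents rather than take them from here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    name: str
    undo_cost: int     # 1 = trivially reverted ... 5 = effectively irreversible
    blast_radius: int  # 1 = one function ... 5 = every downstream consumer


def holder(decision: Decision, own_threshold: int = 12) -> str:
    """Say who should hold a decision, given its undo cost and blast radius."""
    for score in (decision.undo_cost, decision.blast_radius):
        if not 1 <= score <= 5:
            raise ValueError("scores must be between 1 and 5")
    risk = decision.undo_cost * decision.blast_radius
    return "human (Own)" if risk >= own_threshold else "agent (Delegate)"


seam = Decision("public contract of the billing service", undo_cost=5, blast_radius=4)
inside = Decision("rename a private helper", undo_cost=1, blast_radius=1)

print(holder(seam))    # human (Own)
print(holder(inside))  # agent (Delegate)
```

The product is only one way to combine the two factors. A stricter variant would route any decision with a maximal `undo_cost` to a human regardless of radius.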
Role fusion · implementer → orchestrator → scheduler. Connect the above: once in-module implementation is delegable, the engineer's leverage climbs to designing boundaries, writing specs, building the harness, designing the loop, and finally scheduling a parallel fleet of agents. This is ENG·01's elevator (prompt → context → spec → harness → loop → fleet), leverage climbing floor by floor. Fusion is not a "merger of titles" but the level of judgment one person holds moving upward: writing each line less and less, deciding where the seams go, what counts as correct, and who may touch what more and more. This is the kernel's step ④ on the engineering face: people return to the judgment and taste only people can set.
FIG. E10.0 / ROLE FUSION · 实现者 → 编排者 → 调度者
看懂:边界是那个不可交办的人类判断节点
Read: the boundary is the non-delegable human judgment node
This collapses every sheet above into one line: once in-module implementation is delegable, the engineer's leverage climbs ENG·01's building — implementer → orchestrator → scheduler. Fusion is not a merger of titles but the upward move of the level of judgment one person holds: writing each line less and less, deciding where the seams go, what counts as correct, and who may touch what more and more. The vermilion node is the non-delegable core — an architecture boundary is inherently high-blast-radius, low-reversibility (ENG·06's Own tier on the structural face), while inside the module is low-radius, high-reversibility and safely the agent's. This is the kernel's step ④ on the engineering face.
Evidence · grade Ⅳ + internal synthesis. "Boundary as judgment node (reversibility + blast radius), scope discipline, trust boundary as first-class structure" come from this series' Architecture chapter; the "implementer → orchestrator" role frame from Graziano's AI-Native Engineering (Day 1) via OpenAI; "leverage climbing floor by floor, prompt→…→fleet" from this series' Genealogy chapter. This sheet is the internal synthesis that stitches the three on the engineering face [exploration ledger: the connection is this series' reasoning, not an external direct assertion].
Why abundant implementation makes architecture scarcer, not less important. A dangerous misreading: "since the agent can generate any implementation, doesn't architecture stop mattering?" The opposite. When implementation is cheap, what drags a system down is no longer "can't write it" but "writes too much too fast with no structural constraint" — generation piles up technical debt at startling speed, and once a module boundary is drawn wrong, errors spread across it and resist rollback. So the architecture boundary is the scarce structure that keeps generation from collapsing into mud: it cannot get cheap, because it is constitutive judgment, not machine-checkable correctness. This resolves an apparent paradox — the stronger the agent, the higher the relative value of drawing seams, setting dependency direction, and exercising scope discipline (deciding what not to build), not the lower.
Role fusion is not "fewer people" but "the judgment people hold rises." Put ENG·01's floors, ENG·06's tiers, and ENG·10's boundary side by side and one move surfaces: the same engineer, leverage climbing from "writing every line" to "deciding every seam." This is not a headcount story of cutting implementers and keeping architects, but a migration of the level of judgment within the same person: still doing engineering, only with engineering's center of gravity moved from the machine-checkable half (implementation) to the constitutive half (boundaries, contracts, what is correct, who may touch what). This series calls it the kernel's step ④ on the engineering face: people do not do throughput; they return to the judgment and building only people can do. It also yields a falsifiable organizational prediction: if, after a team "goes AI," engineers' time has not visibly shifted from writing implementation toward drawing boundaries, setting contracts, and adding guardrails, the team has most likely only grafted AI onto the old process and not redrawn the development graph at all.
Test signal
Leading: human time shifts from writing implementation detail to drawing seams, setting contracts, and scoping permissions. Counter-signal: architects still hand-write module internals while no one watches the boundaries; the team looks busy on the surface while load-bearing decisions go unheld.
Boundaries as judgment nodes: which roles fuse, which split apart
"Role fusion" is easily heard as "everyone becomes full-stack and boundaries disappear," which is a misreading. The accurate picture is: division of labor at the implementation level fuses, while division at the judgment level instead splits apart and is lifted higher. The old frontend/backend/test/ops split was largely cut along "who writes which implementation"; once implementation is absorbed by agents, the boundaries cut along implementation do collapse — one person, with an agent, delivers across the stack. But at the same time a new set of boundaries, cut along "who holds which judgment," emerges and grows more important: who sets this external contract, who judges this architectural trade-off, who designs this security boundary, who guards the standard of "what is correct." These are not implementation trades but judgment nodes, and they fall precisely at the seams of the system — between modules, between services, between trust domains. This is why the volume keeps saying "architectural boundaries are judgment nodes": when implementation is abundant, architecture (how boundaries are drawn) becomes the scarce structure that keeps generation from collapsing into tech debt, and drawing boundaries is itself not machine-checkable but a constitutive judgment, so it sinks to ④ and stays with people. Falsifiable corollary: after a team "goes AI," if what you observe is "implementation trades merge, but a clearer ownership of boundary/contract/security judgments appears," that is real fusion; if what you observe is "implementation trades merge, and no one clearly owns the boundaries," that is not fusion but load-bearing decisions left dangling — a few fewer trades on the surface, one more unheld risk underneath. Deep dive in the Architecture chapter.
ENG·12 · FAILURE MODES
Mechanism · Why Verify
Four failures are not flukes but structural products
Trust-but-verify says "how to verify"; this sheet says "why verification is non-optional." Hallucination, confident wrongness, the snowball, context rot: none is the model occasionally glitching; each is something this architecture necessarily produces under definite conditions. Recognize the structure that produces it first, and you know which guardrail damps it.
Treating verification as "a diligent good habit" is the easiest mistake in this volume. Verification is a load-bearing wall not because the model occasionally errs but because this class of error has a definite structural cause — it grows directly out of how a generative system works, independent of how smart the model is. The four below are the core failure modes Graziano lists in AI-Native Engineering Day 2; here each gets two layers added: "why the architecture necessarily produces it" and "which harness guardrail damps it." [Source: Graziano, AI-Native Engineering Day 2 failure modes, grade Ⅳ practitioner. [R1]]
Why a product of the architecture, not a defect of the model
An autoregressive language model does one thing at each step: given the context so far, it emits a high-probability continuation for the next token. It has no independent internal state of "am I sure," and no stage that checks output against external truth, unless you build one outside it. So "fluent" and "correct" are the same quantity inside it: what reads right and what is right share one scoring function. That single fact explains the first two below: hallucination is the model completing in the "most truth-like" way even with no evidence, and confident wrongness is that completion happening to wear an assured tone. They are not bugs but the default behavior of "continue by probability" when external anchors are absent. The latter two come from placing this single-step behavior inside a multi-step loop: error accumulates across steps (the snowball), and context degrades with length (context rot). So the four sort into two pairs: the first two are single-step epistemic defects, the latter two are multi-step dynamical defects.
Failure mode: Hallucination
Structural cause (why the architecture produces it): Completes in the "most truth-like" way with no evidence; "fluent" and "correct" share one score, so the model cannot tell recall from invention.
The harness guardrail that damps it: Feed the source into context (sensor: retrieval / grounding) and use a computational guard to check that the citation really exists (e.g. the API signature or file path survives compilation).

Failure mode: Confident wrongness
Structural cause (why the architecture produces it): Tone-confidence and answer-correctness are set by different mechanisms and are decoupled, so a wrong answer can be utterly self-assured.
The harness guardrail that damps it: An independent verifier (a checker separate from generation) reads only the result, never the tone: tests pass or fail regardless of how certain the answer sounded.

Failure mode: Snowball
Structural cause (why the architecture produces it): In a multi-step loop, step k's output is step k+1's input; an early small error is taken as settled fact and built upon, so error amplifies exponentially across steps.
The harness guardrail that damps it: Insert a HITL checkpoint at each meaningful step plus frequent green bars: shorten the loop so error is cut off before it amplifies (see FIG E9.0).

Failure mode: Context rot
Structural cause (why the architecture produces it): Effective attention dilutes as context grows; the more is crammed into the window, the more early key constraints drown, and output drifts.
The harness guardrail that damps it: Use SPEC / PLAN / TASKS files as the persistent source, compact the conversation periodically, and reassemble the window on demand; less is more (see ENG·02).
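The computational guard named for hallucination can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: `symbol_exists` and `path_exists` are hypothetical helper names, and a real harness would more likely lean on the compiler or type checker for the same job.

```python
import importlib
from pathlib import Path


def symbol_exists(module_name: str, attr: str) -> bool:
    """Return True only if the module imports and actually exposes attr."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


def path_exists(path: str) -> bool:
    """Return True only if the cited file path is really on disk."""
    return Path(path).exists()


# A real stdlib function passes; an invented one is rejected mechanically.
print(symbol_exists("json", "dumps"))         # True
print(symbol_exists("json", "dumps_pretty"))  # False
```

The point of the sketch is that the check never consults the model: a cited name either resolves or it does not.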
KEY FIG · FIG. E12.1 / THE FAILURE-MODE MAP · cause × guardrail. Read: the cell a failure lands in names the guardrail it needs
Lay the five failures on two axes: the horizontal is cause (single-step epistemic / multi-step dynamical), the vertical is the layer where the damping guardrail lives (sensors as machine checks / guides written into context). Laid out, one discipline reads off — each failure points at a guardrail that ought to exist, and wherever a cell is empty, that failure recurs in your system. Hidden assumptions is drawn as its own dashed box because it straddles single- and multi-step and bypasses machine checks; only guides (conventions written explicitly into context) catch it. This figure is the failure taxonomy read in reverse: not a list of "which errors occur" but the requirements spec for designing the harness. [Source: Graziano, AI-Native Engineering Day 2 failure modes, grade Ⅳ practitioner [R1].]
The snowball is listed alone because it multiplies the harm of the other failures. If a hallucination lands on step two of a thirty-step derivation, wears a confident tone, and faces no checkpoint, the remaining twenty-eight steps all build on that false premise: the agent will earnestly write tests for a function that does not exist, design a fallback for a wrong edge condition, and keep adding detail to a plan that has already gone off the rails. The result always reads self-consistent, because each step follows faithfully from the one before; the flaw is that an earlier step was itself wrong. This is exactly why "a human-in-the-loop checkpoint at every meaningful step" is not caution but a structural necessity: a checkpoint's sole job is to cut the snowball while it is still small. Drawn out, this is FIG E9.0: one curve pulled back to baseline at each node by checkpoints, the other diverging exponentially without them. Falsifiable signal: if your agent is "right in the first half, then suddenly collapses wholesale in the second" on long tasks, the cause is almost certainly an early step that was wrong and uncaught, not the model getting dumber later.
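The thirty-step example can be made concrete with a toy calculation. The sketch below assumes a deliberately simple model that is not from the source: a checkpoint fires after every `checkpoint_every`-th step and always catches the error. `wasted_steps` is a hypothetical helper name.

```python
import math
from typing import Optional


def wasted_steps(error_step: int, total_steps: int,
                 checkpoint_every: Optional[int] = None) -> int:
    """Count the steps built on a false premise introduced at error_step.

    Without checkpoints, every remaining step inherits the error.
    With checkpoints, only the steps up to the next checkpoint do.
    """
    if checkpoint_every is None:
        return total_steps - error_step
    next_checkpoint = math.ceil(error_step / checkpoint_every) * checkpoint_every
    return min(next_checkpoint, total_steps) - error_step


print(wasted_steps(2, 30))                      # 28
print(wasted_steps(2, 30, checkpoint_every=5))  # 3
```

Under these assumptions the checkpoint does not make the error less likely; it only bounds how far the error can travel.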
"The model errs now and then, just check a few more times" — verification as an attitude, leaning on human vigilance and luck, with no structural answer to where it erred, why, or how to catch it next time.
Each failure is first sorted to its structural cause (single-step epistemic / multi-step dynamical), then matched to its guardrail: grounding for hallucination, an independent checker for confident wrongness, checkpoints for the snowball, context hygiene for rot. Verification becomes a designable system, not an attitude.
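The "independent checker" idea can be shown with a small sketch: the verdict is computed from a check's exit code and never from what the agent said. The function name `independent_verdict` and the stand-in failing command are assumptions for illustration only.

```python
import subprocess
import sys


def independent_verdict(agent_report: str, check_cmd: list[str]) -> bool:
    """Decide pass/fail from the check's exit code alone.

    agent_report is accepted but deliberately never read: tone is not evidence.
    """
    result = subprocess.run(check_cmd, capture_output=True)
    return result.returncode == 0


# Stand-in for a failing test suite: a child process that exits non-zero.
failing_check = [sys.executable, "-c", "raise SystemExit(1)"]
print(independent_verdict("All tests pass, I am certain.", failing_check))  # False
```

In practice `check_cmd` would be the project's own test or build command; the design choice being illustrated is only that the checker and the generator share no channel.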
Test signal
Falsified if: after you attribute a class of error to its cause and add the matched guardrail, the recurrence rate of that class does not drop. Before rejecting "match guardrail to cause," first rule out the likelier explanation: the cause was sorted to the wrong layer (a multi-step dynamical problem treated as a single-step epistemic one).
Hidden assumptions: the fifth, the hardest for tests to catch
Beyond the first four, one more deserves its own listing because it bypasses most guardrails: hidden assumptions. While generating, the agent silently fills in a pile of premises it did not ask about and you did not state: whether this API's pagination starts at 0 or 1, whether this amount is in cents or units, whether this timestamp is UTC or local, whether "user" means the logged-in user or the one being acted upon. Each assumption looks "reasonable" alone, the code runs, the tests pass, because the tests are often built on the same hidden assumption. Its danger is precisely that it does not error: it is not a failure that turns red but a silent semantic mismatch that surfaces only on an edge case in production. Why does this architecture necessarily produce it? Because the model's job is to complete the most probable continuation, and "most probable" is computed over the training distribution, not over your system's actual conventions. When your convention departs from the common distribution (say, this legacy system holds amounts in cents), the model confidently fills in the common default, which is wrong here. The guardrail that damps it is not on the computational side (a test cannot catch an error inside its own assumption) but on the guides side: write these conventions explicitly into context and spec and do not let the agent guess. This is the interface between ENG·02 "context as infrastructure" and this sheet: a hidden assumption is the projection of missing context onto failure taxonomy. Falsifiable signal: if your incident retrospectives repeatedly read "it thought it was X, but here it is Y," the model was not careless; your convention never entered context and the agent could only guess from the training distribution. [Source: Graziano, AI-Native Engineering Day 2 failure modes (hidden assumptions), grade Ⅳ practitioner. [R1]]
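One way to "write conventions explicitly into context" is to keep them as data and render them into every prompt. The sketch below is illustrative only: the `CONVENTIONS` entries are hypothetical examples echoing the paragraph above, and `render_conventions` is an assumed helper name, not an API from any tool.

```python
# Hypothetical project conventions; the values are examples, not recommendations.
CONVENTIONS = {
    "pagination": "page indices start at 1",
    "amounts": "integers in cents, never floating-point units",
    "timestamps": "UTC, ISO 8601",
    "user": "the account being acted upon, not the logged-in operator",
}


def render_conventions(conventions: dict[str, str]) -> str:
    """Render conventions as a plain-text block to prepend to the agent's context."""
    lines = ["Project conventions (do not guess; ask if one is missing):"]
    lines += [f"- {topic}: {rule}" for topic, rule in sorted(conventions.items())]
    return "\n".join(lines)


print(render_conventions(CONVENTIONS))
```

Keeping the conventions in one structure means a retrospective that finds a new "it thought X, here it is Y" can close the gap with a one-line addition.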
Read the taxonomy in reverse: it is the requirements list for guardrails
This taxonomy is not only a list of "which errors will occur"; read in reverse, it is the requirements spec for designing the harness — each failure mode corresponds to a guardrail that ought to exist, and whether that guardrail is present and strong in your scaffolding decides whether the failure recurs. This turns "what should my harness add" from a gut feeling into a check you can run line by line: do you have grounding/retrieval for hallucination? An independent-of-generation checker for confident wrongness? A checkpoint at each meaningful step for the snowball? Context hygiene (files as source, periodic compaction) for rot? Key conventions written explicitly into context for hidden assumptions? Wherever the answer is "no," that failure will recur in your system — and it will recur in the guise of "the model occasionally glitching," fooling you into thinking it is a luck problem so that you never plug the structural gap. This is where ENG·04's guides/sensors taxonomy and this failure taxonomy stitch together: the taxonomy tells you which shapes of guardrail exist, the failure modes tell you which failure each guardrail blocks. Put the two tables together and "is my scaffolding enough" gets an answer independent of intuition. Falsifiable signal: sort your last ten agent incidents into the five failure modes; if they cluster heavily on one or two, that is no coincidence but your harness systematically under-investing in the corresponding one or two guardrails — adding that one is far more effective than vaguely "having people be more careful."
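The line-by-line check and the incident-clustering signal described above can be sketched as two small functions. The mapping mirrors this sheet's pairing of failures to guardrails; the names `GUARDRAIL_FOR`, `missing_guardrails`, and `incident_clusters`, and the sample data, are assumptions for illustration.

```python
from collections import Counter

# Failure mode -> the guardrail this sheet pairs it with.
GUARDRAIL_FOR = {
    "hallucination": "grounding / retrieval",
    "confident_wrongness": "independent checker",
    "snowball": "checkpoints at meaningful steps",
    "context_rot": "context hygiene",
    "hidden_assumptions": "conventions written into context",
}


def missing_guardrails(present: set[str]) -> list[str]:
    """List failure modes whose guardrail is absent from the harness."""
    return [mode for mode, guard in GUARDRAIL_FOR.items() if guard not in present]


def incident_clusters(incidents: list[str]) -> list[tuple[str, int]]:
    """Count incidents per failure mode, most frequent first."""
    return Counter(incidents).most_common()


have = {"independent checker", "context hygiene"}
print(missing_guardrails(have))
# ['hallucination', 'snowball', 'hidden_assumptions']

recent = ["snowball"] * 6 + ["hallucination"] * 3 + ["context_rot"]
print(incident_clusters(recent))
# [('snowball', 6), ('hallucination', 3), ('context_rot', 1)]
```

Sorting incidents into modes still takes human judgment; the code only makes the resulting skew hard to ignore.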
Why none of the four is about whether the model is smart
A common misjudgment can void the whole taxonomy: attributing these failures to "the model is not strong enough yet, the next generation will fix it." This is a dangerous thought that makes you stop building guardrails, because it misreads a structural problem as a capability problem. The fact is: the cause of these four failures is how a generative system works, not its capability ceiling. A stronger model hallucinates less, is confidently wrong less often — but the fundamental mechanism of "continue by probability, with fluent and correct sharing one score" does not change, so the probability of hallucination drops but does not reach zero; and as long as it is non-zero, the snowball can still start from any early error a checkpoint failed to catch. Likewise, context rot is the product of effective attention diluting with length; however large the window, it cannot change the direction that "cramming too much dilutes," only move the inflection point. So betting on "wait for a stronger model" outsources your system's reliability to something you do not control and that in principle does not reach zero. The correct stance: admit these four are permanent properties of this architecture, not temporary defects, and build guardrails as permanent structure rather than "scaffolding to dismantle once the model matures." Falsifiable signal: if your team repeatedly defers building a guardrail on the grounds of "wait for the next model," look at history — after the last generation shipped, did that class of failure really vanish? Most likely it grew rarer but did not vanish, and bit you again in some setting you thought safe. That is the cost of treating a structural problem as a capability problem.
ENG·13 · JIT PLANNING
Redraw · Process
The planning horizon collapses as execution gets cheap
The entire value of long-horizon planning rests on "execution is expensive and rework more so." When execution is near-free, the expected payoff of planning step seven down to the last detail in advance collapses: the plan expires before it is executed. Planning shifts from a one-off upfront asset to an activity generated just-in-time against signal and rerun at will.
First dig out the premise of long-horizon planning. Waterfall, quarterly roadmaps, exhaustive upfront design docs: their sole condition for being reasonable is that execution is expensive and the cost of rework far exceeds the cost of planning. In that world, a week more of thinking that saves three months of wrong implementation is a sure bet. So a century of engineering management thickened the front: finer specs, fuller Gantt charts, step seven planned in advance. This logic is not wrong; it is merely optimal for one parameter set, and AI changed that set. When agentic coding compresses "build a version" from three months to three hours, the arithmetic of planning ahead inverts: your fine plan for step seven will likely be void once step three executes and hands back real feedback, because real code tells you which earlier assumption was wrong. The farther you plan, the more you void. This does not mean planning is useless; it means the optimal planning horizon shortens.
KEY FIG · FIG. E13.1 / PLANNING HORIZON COLLAPSE · waterfall → JIT. Read: the cheaper execution, the shorter the horizon worth planning
The entire value of long-horizon planning rests on "execution is expensive and rework dearer." Draw that premise as the horizontal axis: the farther right execution cost falls, the more the horizon worth planning collapses — the vermilion curve is that collapse. At the left is waterfall: rework is dear, a week more thinking saves three months, planning step seven ahead is a sure bet; at the right is JIT: execution is near-free and fine plans for far legs expire before they run. So the discipline is not "do not plan" but shorten the horizon and hand the re-plan trigger from the calendar to signal — the loop along the bottom: thin-plan the next leg → execute for signal → a red test / drift triggers a re-plan, back around. The anti-signal is in the prose: a PLAN.md that only grows and never revises is evidence the horizon has not collapsed with cheap execution. [Source: this series' engineering-practice synthesis + Graziano, AI-Native Engineering Day 3, grade Ⅳ [R1].]
JIT planning: generate against signal, re-plan on signal
JIT planning has just two disciplines. First, plan only the next leg to "enough to start" precision, never pre-planning far legs whose feedback you have not received — because the far leg's inputs do not yet exist. Second, treat each execution's real result as the trigger to re-plan: a red test, a missed performance target, an edge case surfacing are not "deviations from the plan" but "signals to re-plan." This resembles agile's "small iterations" but differs in spirit: agile shortens the delivery cycle while planning still runs on a fixed cadence (each sprint); JIT planning shortens the planning horizon itself and hands the re-plan trigger to signal rather than the calendar. A concrete contrast: waterfall asks "list everything we will do this quarter up front"; JIT planning asks "what is the smallest step that gets the next signal capable of changing later judgment." The former optimizes coverage, the latter information gain.
The anti-signal: PLAN.md getting thicker
One clean anti-signal tells you on the spot that planning has gone off: PLAN.md keeps getting thicker yet is rarely pushed back by execution. In healthy JIT planning the plan file is a thin, living thing, frequently overturned and rewritten by real results: it lists three steps today, and after step one it may be rewritten wholesale because step one's feedback changed the judgment on the other two. If your plan file only grows and never revises, laying out ever more detail of unverified far steps, the team is substituting "planning" for "getting signal": writing assumptions that execution was supposed to falsify into a document that merely looks thorough. This is the waterfall mind's residue in the agentic era. It mistakes the thickness of the plan for certainty, when real certainty can only come from execution feedback. Falsifiable signal: count how often your plan file is rewritten because of an execution result versus purely appended to; if the former is far below the latter, that is evidence the planning horizon has not collapsed with cheap execution. [Source: this series' engineering-practice synthesis + Graziano Day 3 "plan first / keep context clean / know when to stop," grade Ⅳ. [R1]]
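The rewrite-versus-append count can be approximated mechanically from the plan file's version history. The sketch below is a rough proxy under a stated assumption: a revision counts as a pure append when every old line survives as the prefix of the new file, and as a rewrite otherwise. It cannot tell whether a rewrite was actually triggered by an execution result; `classify_revision` and `churn_counts` are hypothetical names.

```python
def classify_revision(old: str, new: str) -> str:
    """Label a PLAN.md revision as 'append' or 'rewrite'.

    A revision is a pure append when every old line survives, in order,
    as the prefix of the new file. Anything else counts as a rewrite.
    """
    old_lines, new_lines = old.splitlines(), new.splitlines()
    is_append = new_lines[: len(old_lines)] == old_lines
    return "append" if is_append else "rewrite"


def churn_counts(versions: list[str]) -> dict[str, int]:
    """Count appends and rewrites across consecutive versions of the plan."""
    counts = {"append": 0, "rewrite": 0}
    for old, new in zip(versions, versions[1:]):
        counts[classify_revision(old, new)] += 1
    return counts


history = [
    "1. parse input",
    "1. parse input\n2. add cache",         # append
    "1. parse input\n2. stream, no cache",  # rewrite after feedback
]
print(churn_counts(history))  # {'append': 1, 'rewrite': 1}
```

Feeding it the successive contents of the file from version control gives a first-pass ratio; a history dominated by appends is the pattern the paragraph above warns about.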
① Thin plan. Plan only to "enough to start the next leg"; leave far legs blank, because their inputs do not exist yet.
② Execute for signal. Execution aims not only to produce but to get real feedback that can change later judgment.
③ Re-plan on signal. Red test / drift / new edge = a re-plan trigger, not "off-plan." The plan is alive.
Plan first, then let go: the step from Chat to Plan
JIT planning says "keep the horizon short," but short does not mean "do not plan." Quite the opposite: one of the highest-leverage habits in agentic coding is to have the agent (and you) lay out the plan for this leg before the agent acts. This is the key shift from Chat mode to Plan mode. In Chat mode you say "go do X" and the agent immediately thinks-while-changing, so errors surface only after it runs, by which point they have already snowballed. In Plan mode the agent first produces a plan of "I intend to cut this leg this way, in this order, verifying at these points," and you review that plan before it touches any code. This step is high-leverage because reviewing a plan is far cheaper than reviewing a pile of already-written diffs: a wrong assumption in the plan is caught before it becomes a change across thirty files. It does not conflict with JIT's "short horizon": short horizon says "do not plan far legs without feedback," plan-first says "for the near leg you are about to act on, make the plan explicit before executing." Together: plan the near leg then let go, leave the far leg blank and wait for signal. [Source: Graziano, AI-Native Engineering Day 3 "from Chat to Plan" + best practices with agents (plan first / keep context clean / know when to stop / target tests / review by diff), grade Ⅳ practitioner. [R1]]
This also reorders the ownership of "planning" in a team. Waterfall treats planning as a one-off thick asset produced early by a few; JIT planning makes it a thin activity produced continuously against each leg of execution by "whoever is closest to that leg's feedback." Who re-plans? Not the person who set the roadmap at quarter start, but the person (or agent) who received the last leg's execution result — because the input to re-planning (real feedback) is in their hands. This frees the "right to plan" from the calendar and the hierarchy and hands it to signal. Its falsifiable organizational corollary: a genuinely JIT-planning team's "who edits the plan" should be highly distributed and tightly tracking execution; if the plan can still only be edited by the few who set it at quarter start, on a quarterly rather than daily cycle, the team has merely swapped tools for the waterfall Gantt chart, and the planning horizon has not collapsed with cheap execution.
Knowing when to stop: the other half of JIT discipline
JIT planning is about "when to re-plan," but a symmetric, equally important judgment is often missed: when to stop. When a leg's feedback is repeatedly bad — the agent failed to solve this correctly a third time, the plan was overturned two rounds running, each re-plan spins in place — that is itself a signal, but it points not to "plan once more" but to "this path may simply not work; back out and cut it differently, or escalate this leg into a judgment a human must think hard about." Knowing when to stop is hard because the agent always looks like "one more try and it'll be fine": it does not tire, does not get discouraged, comes back confident with another version every time, and this never-give-up trait drags humans into a low-yield loop — the human re-plans, the agent retries, and no one steps back to ask "was this problem cut wrong?" So JIT planning's full discipline is actually three rules, not two: thin plan, re-plan on signal, and recognizing the meta-signal that "re-planning has stopped working" and escalating in time. This third puts human judgment where it most belongs — not doing the agent's work but judging "should this still be done this way at all right now." Falsifiable signal: if you find yourself re-planning the same leg three or four rounds, each time expecting "this time the agent will get it," it is most likely not time to plan again but to stop — back out and re-cut the problem, or admit this leg needs a human to think it through personally, rather than keep feeding a loop spinning in place. [Source: Graziano, AI-Native Engineering Day 3 "know when to stop," grade Ⅳ practitioner. [R1]]
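The three rules (thin plan, re-plan on signal, escalate when re-planning stalls) fit in one small loop. This is a sketch under assumptions: `run_leg`, its callbacks, and the default cap of three re-plans are hypothetical, chosen to echo the "three or four rounds" signal above rather than taken from the source.

```python
from typing import Callable


def run_leg(
    plan: str,
    execute: Callable[[str], bool],
    replan: Callable[[str], str],
    max_replans: int = 3,
) -> str:
    """Execute one leg; re-plan on a bad signal; escalate when re-planning stalls.

    execute returns True when the leg's checks pass (the signal).
    Returns 'done' or 'escalate'.
    """
    for attempt in range(max_replans + 1):
        if execute(plan):
            return "done"
        if attempt < max_replans:
            plan = replan(plan)  # a red signal triggers a re-plan
    return "escalate"  # meta-signal: stop and hand the leg to a human


# Stubs standing in for an agent: this leg never goes green.
print(run_leg("v0", execute=lambda p: False, replan=lambda p: p + "'"))  # escalate
# This leg goes green after one re-plan.
print(run_leg("v0", execute=lambda p: p == "v0'", replan=lambda p: p + "'"))  # done
```

The cap is the part that matters: without it the loop has no way to express "this path may simply not work," which is exactly the judgment the agent never volunteers.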
How JIT planning avoids degrading into "think-while-changing"
JIT planning has a slope to block head-on: misreading "short horizon" as "no planning, take it a step at a time," which degrades back into the worst vibe-coding — no spec, no checkpoints, the agent thinking-while-changing, errors snowballing beyond recovery. Short horizon and no planning are entirely different things, distinguished by a simple criterion: before each leg executes, is "what counts as done, and where to verify" explicit for that leg? If yes, it is healthy JIT planning (short, but each leg has clear acceptance and checkpoints); if no, it is degraded think-while-changing (both short and blind). So JIT planning's "thin plan" is thin in horizon (do not plan far legs), not in density (the near leg's acceptance criteria must be clear). Wire this to ENG·05's checkpoints and ENG·15's evals, and JIT planning actually fits each leg of execution with a pair of boundaries: an explicit "what this leg should achieve" at the start, and a set of machine-checkable verifications of "did it achieve it" at the end. Between them the agent improvises freely, but its room is framed by that pair, and errors cannot leave the leg. This is exactly how JIT planning gets both fast and stable: fast, because no will-be-voided planning for far legs; stable, because every near leg has a clear acceptance gate. Falsifiable signal: if in your "JIT planning" practice the agent often starts running before a leg has stated "what counts as done," and only long after it finishes do you discover the direction was wrong, that is not JIT planning but no-planning dressed as agile — it will punish you periodically by snowball.
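The "pair of boundaries" around each leg can be expressed as a tiny data structure: a stated goal at the start and machine-checkable checks at the end. The `Leg` class and its method names are assumptions for illustration, and the latency figures are made-up sample values.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Leg:
    goal: str  # "what this leg should achieve"
    checks: list[Callable[[], bool]] = field(default_factory=list)  # "did it achieve it"

    def ready_to_execute(self) -> bool:
        """Healthy JIT: a leg may start only with a stated goal and at least one check."""
        return bool(self.goal.strip()) and len(self.checks) > 0

    def accepted(self) -> bool:
        """The closing gate: every machine-checkable verification must pass."""
        return self.ready_to_execute() and all(check() for check in self.checks)


blind = Leg(goal="make it faster")
framed = Leg(goal="p95 latency under 200 ms", checks=[lambda: 180 < 200])
print(blind.ready_to_execute(), framed.ready_to_execute(), framed.accepted())
# False True True
```

A harness that refuses to start a leg when `ready_to_execute()` is false enforces the criterion in the paragraph above: short is allowed, blind is not.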
Connect an executor that can be steered by outside text and that errs confidently to tools that change the world, and the security model shifts: the threat is no longer mainly an external attacker but your own authorized agent being induced to do what you never meant it to. Read-only by default, escalate on demand, keep irreversible actions with a human — the three foundations of this new boundary.
The new attack surface comes from wiring "steerable by text" to "can change the world." In traditional security a trusted process runs definite code; an agent runs an action it reasons out on the spot, and that reasoning can be rewritten by any text it reads — including content it pulls back from a tool. This is the root of two concrete threats: prompt injection is an attacker hiding instructions in data the agent will read (a web page, an email, a code comment), inducing it to execute that data as if it were your instruction; MCP tool poisoning is subtler — an innocuous-looking MCP tool smuggles instructions in its description or return value, quietly rewriting the agent's sense of "what to do now." What they share: the agent cannot distinguish "data to process" from "an instruction to obey," because to it both are just tokens in context. So the security boundary can no longer only guard the external entrance; it must reach down to each of the agent's own tool calls. [Source: Graziano, AI-Native Engineering Day 4 MCP security (tool poisoning / prompt injection / credential leakage / least-privilege manifest), grade Ⅳ practitioner. [R6][R1]]
Least privilege: read-only by default, scoped by capability
The first foundation of this boundary extends the principle of least privilege from people to agents and sets the default to the strictest. Three rules: first, read-only by default — the tools an agent receives default to "read and propose," with world-changing actions (write, delete, deploy, transfer) outside the default set. This is the same principle by which mature practice sets CI/CD agentic workflows to "read-only by default, write actions must declare safe-output explicitly" (Graziano Day 7 Continuous AI). Second, capability scoping — not handing the agent one "admin key" but a set of fine-grained, separately auditable capability credentials: this agent may read this repo and run tests but may not push, may not touch the production database. So a single tool-poisoning can affect at most the small slice of capability it was granted. Third, credentials never enter context — keys and tokens are never written into a prompt or conversation history, or one prompt injection reads them out and exfiltrates them; credentials are injected by the harness at the tool-call layer, and the agent itself never sees the plaintext.
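The three rules above (read-only by default, capability scoping, credentials kept out of context) can be sketched as a harness-side gate. Everything named here is hypothetical: the capability strings, `AgentGrant`, `call_tool`, and the in-memory `vault` stand in for whatever a real harness and secret store provide.

```python
from dataclasses import dataclass

READ_ONLY = frozenset({"read_repo", "run_tests"})  # hypothetical default grant


@dataclass(frozen=True)
class AgentGrant:
    name: str
    capabilities: frozenset = READ_ONLY

    def allows(self, capability: str) -> bool:
        return capability in self.capabilities


def call_tool(grant: AgentGrant, capability: str, secrets: dict[str, str]) -> str:
    """Harness-side gate: check the grant, then attach the credential out of band."""
    if not grant.allows(capability):
        raise PermissionError(f"{grant.name} lacks capability {capability!r}")
    token = secrets.get(capability, "")  # looked up here, never placed in the prompt
    # A real harness would now call the tool with `token`; we return a redacted receipt.
    return f"{capability}: ok (credential attached: {bool(token)})"


reviewer = AgentGrant("reviewer")
vault = {"run_tests": "s3cr3t"}
print(call_tool(reviewer, "run_tests", vault))  # run_tests: ok (credential attached: True)
try:
    call_tool(reviewer, "push", vault)
except PermissionError as err:
    print(err)  # reviewer lacks capability 'push'
```

Note that the string returned to the agent never contains the token itself, so nothing the agent reads back into context can leak it.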
不可逆与高权动作:把人留在这一道闸上
Irreversible and privileged actions: keep the human at this gate
The second foundation returns to INSTRUMENT 07's blast radius: which gate you keep the human at is set by "can the error be reversed cheaply" and "how far does the error reach." Reversible, low-radius actions (editing a local file, running a read-only query) can be handed fully to the agent; irreversible or high-radius actions (deleting production data, sending external email, merging to main, moving money) must have a human-in-the-loop explicit confirmation, because once such an action goes wrong there is no cheap undo. Note this is unrelated to "trusting the model" — even a fully trustworthy model can be made by prompt injection to execute an induced privileged action; the human at this gate does not make the decision for the model but adds a confirmation independent of the model's reasoning to irreversible actions. Put together with least privilege, the whole boundary is two halves of one sentence: capability defaults to the minimum, and the irreversible part of capability is always unlocked momentarily by a human. This is why a "fully autonomous agent fleet" holds up in engineering yet must keep a gate for irreversible actions — not a technology gap but the blast radius deciding this judgment cannot be delegated.
默认交给 agent(可逆 · 低半径)Default to the agent (reversible · low-radius)
只读查询、检索、读代码库
Read-only queries, retrieval, reading the codebase
改本地文件 / 在分支上提交(可回退)
Edit local files / commit on a branch (revertible)
证伪:若把所有写操作都设成默认只读 + 按能力授权后,团队的交付速度并没有可感的下降,那"安全边界拖慢一切"这个常见反对就被证伪了——多数高权动作本就不在热路径上。反过来,若你无法说清某个 agent 被授予了哪几条能力,说明它的爆炸半径不可控。Falsified if: after setting all write actions to read-only-by-default plus capability scoping, delivery speed shows no perceptible drop, then the common objection that "security boundaries slow everything" is falsified — most privileged actions were never on the hot path. Conversely, if you cannot state which capabilities a given agent was granted, its blast radius is uncontrolled.
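The gate on irreversible and high-radius actions can be sketched as a small predicate. This is an illustration under stated assumptions: the two boolean fields stand in for the two questions above ("can the error be reversed cheaply" and "how far does the error reach"), and `Action`, `requires_human`, and `execute` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool    # can the error be undone cheaply?
    high_radius: bool   # does the error reach beyond a local sandbox?


def requires_human(action: Action) -> bool:
    """Irreversible or high-radius actions need explicit confirmation."""
    return (not action.reversible) or action.high_radius


def execute(action: Action, confirmed_by_human: bool = False) -> str:
    # The confirmation is an input from outside the model's reasoning,
    # so an induced model cannot supply it on its own behalf.
    if requires_human(action) and not confirmed_by_human:
        return "blocked: awaiting human confirmation"
    return "executed"
```

How each real action is classified is the human judgment the text describes; the code only enforces a classification once it has been written down.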
数据与指令的混淆:注入攻击的根
Confusing data with instruction: the root of injection
Tracing prompt injection and tool poisoning to one root turns defense from "enumerate a pile of attack tricks" into "plug one structural hole": an agent cannot in principle distinguish "data to process" from "an instruction to obey," because to it both are just tokens in context. Traditional software has a clean data/code line: a piece of user input is never treated as an executable instruction unless you commit an injection-class bug. An agent's "instructions," by contrast, are themselves natural language, the same medium as the data it reads, so the boundary is natively blurred. This explains why such attacks are so hard to fully prevent: you cannot solve the problem once and for all with escaping or parameterization, because there is no syntactic dividing line to escape against. The operational mitigation is to admit the confusion and then structurally bound its consequences rather than hope the model "learns to tell them apart." First, mark content from untrusted sources (external web pages, third-party tool returns) as data and down-weight it, so the model knows this part is not your instruction. Second, fall back to least privilege: since you cannot guarantee the model is never induced, guarantee that even when it is induced, the privileged actions it could call have already been removed or must pass the independent human gate. In other words, injection defense centers not on "do not let it be fooled" but on "even when fooled, the blast is small." This is exactly the combined force of the two foundations above (least privilege with its capability scoping, and irreversible-action gating) brought to bear on one threat.
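One way to make "even when fooled, the blast is small" concrete is to treat untrusted content as a taint on the session. The sketch below is an assumption-laden illustration rather than a prescribed design: `Session` and `PRIVILEGED_TOOLS` are hypothetical, and dropping privileged tools after any untrusted read is one possible policy among several.

```python
from dataclasses import dataclass, field

PRIVILEGED_TOOLS = {"deploy", "send_email", "delete_rows"}  # illustrative set


@dataclass
class Session:
    tools: set
    tainted: bool = False
    context: list = field(default_factory=list)

    def add_content(self, text: str, trusted: bool) -> None:
        # Untrusted text is stored with an explicit "data" label...
        label = "instruction" if trusted else "untrusted-data"
        self.context.append((label, text))
        # ...and it taints the session, because labels alone cannot be
        # relied on to stop the model from obeying the text.
        if not trusted:
            self.tainted = True

    def callable_tools(self) -> set:
        # Once tainted, privileged tools are removed no matter what the
        # model concludes after reading the injected text.
        if self.tainted:
            return self.tools - PRIVILEGED_TOOLS
        return set(self.tools)
```

The label is the "mark it as data" half of the mitigation; the tool removal is the structural half that still holds if the label is ignored.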
Continuous AI:把"默认只读"写进流水线
Continuous AI: write "read-only by default" into the pipeline
The most concrete landing of this trust boundary is turning it from a "convention" into a pipeline-enforced default. When agents enter CI/CD and run inside automated workflows (Continuous AI), the rule "read-only by default, write actions must declare safe-output explicitly" stops being a discipline someone remembers to keep and becomes a hard constraint at the pipeline level: an agentic workflow gets read-only capability by default, and any world-changing output (commit, release, external call) must be explicitly marked as a safe output and pass its own review. This moves the security boundary from "rely on humans being careful on each action" forward to "the system gives no dangerous capability by default" — the former depends on a human not erring, the latter on a human actively unlocking, and the act of unlocking is itself a natural checkpoint. Putting this into automation matters because the agent runs unattended in the pipeline — precisely because no one watches in real time, the default must be set to the safest rung, since when something goes wrong there is no present human to call a halt. Falsifiable signal: if agents in your automated pipeline can push, release, and call the production API by default with no "write actions must declare explicitly" gate, then that pipeline's security rests entirely on the premise that "the model is never induced," which you cannot guarantee — that is not security but luck. [Source: Graziano, AI-Native Engineering Day 7 Continuous AI (agentic workflows in CI/CD, read-only by default, write actions must declare safe-output), grade Ⅳ practitioner. [R1]]
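As a rough illustration of a pipeline-enforced default, the check below refuses any workflow step that is not known to be read-only unless it has been declared as a safe output. The manifest shape (`steps`, `safe_outputs`) and the action names are assumptions made for this sketch, not the format of any particular CI system.

```python
# Assumed allow-list of actions that cannot change the world.
READ_ONLY_ACTIONS = {"read_repo", "run_tests"}


def validate_workflow(workflow: dict) -> list:
    """Return the undeclared write actions; an empty list means it may run."""
    declared = set(workflow.get("safe_outputs", []))
    violations = []
    for step in workflow.get("steps", []):
        action = step["action"]
        if action in READ_ONLY_ACTIONS:
            continue  # allowed by default
        if action not in declared:
            # Anything not known to be read-only is treated as a write.
            violations.append(action)
    return violations
```

Declaring an action in `safe_outputs` is the explicit unlock; in the scheme described above that declaration would itself pass review, which this sketch does not model.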
This boundary has one operational-level red line worth listing alone: keys, tokens, and credentials must never enter context. The reason shares the root with injection — since the agent cannot tell data from instruction, once a credential appears in its context, one successful prompt injection can make it read the credential as "data it may output" and exfiltrate it through some innocuous-looking tool call. This is not hypothetical: any combination that lets the agent read a key and also emit content externally forms a leak channel. The correct approach structurally keeps credentials out of the agent's view — the harness injects credentials into the actual request at the tool-call layer, and the agent sees only the action "call this tool," never the credential in plaintext. So even fully induced, the agent has nothing exfiltratable in hand. This extends the same idea as least privilege: least privilege constrains "what it can do," credential isolation constrains "what sensitive data it can see," and together they make the consequence of "being induced" controllable. Falsifiable signal: check whether the context you give the agent (system prompt, tool definitions, conversation history) contains any plaintext key or a token directly exchangeable for access — if even one is there, your security rests on the assumption "no one injects successfully," which you cannot guarantee, rather than on structure. [Source: Graziano, AI-Native Engineering Day 4 MCP security (credential leakage / least-privilege manifest), grade Ⅳ practitioner. [R6][R1]]
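A minimal sketch of credential isolation, assuming a harness that sits between the agent and the network. `fake_transport` is a stand-in for a real HTTP client, and all names here are hypothetical.

```python
def fake_transport(request: dict) -> dict:
    # Stand-in for a real HTTP client; reports whether auth was attached.
    return {"status": 200, "authed": "Authorization" in request["headers"]}


def harness_execute(call: dict, secrets: dict, transport=fake_transport) -> dict:
    """Attach the credential at the tool-call layer, outside the model's context."""
    token = secrets[call["tool"]]
    request = {**call, "headers": {"Authorization": f"Bearer {token}"}}
    response = transport(request)
    # Only non-sensitive fields are returned to the agent.
    return {"tool": call["tool"], "status": response["status"]}


def context_contains_secret(context: list, secrets: dict) -> bool:
    """The falsifiable check: does any plaintext secret sit in the context?"""
    return any(secret in item for item in context for secret in secrets.values())
```

The agent only ever produces and receives dictionaries like `{"tool": ..., "args": ...}` and `{"tool": ..., "status": ...}`, so `context_contains_secret` can be run over the system prompt, tool definitions, and history as an audit.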
ENG
15
EVALS · 承重墙
EVALS
重画 · 验证基础设施
Redraw · Verification Infra
评测套件是组织沉淀下来的判断
The eval suite is the organization's accumulated judgment
One-off human review is right today and wrong tomorrow under a new prompt, with no one the wiser. Distill each class of error into an eval and enter it into the regression suite, and verification turns from one-off labor into infrastructure that compounds with output. An eval suite is not a synonym for tests; it is this team's judgment of "what counts as correct," written down and re-runnable by machine.
First, the categorical difference between evals and traditional QA. One-off QA is an event: before release a human runs this version through, judges whether it is good enough, and then that judgment vanishes — it lived in that person's head at that moment and cannot regress. Next time, under a new prompt, a swapped model, a tuned parameter, what was confirmed last time may be wrong again, with no mechanism to tell you. An eval inverts this: each time a class of error is found, you solidify it into a machine-runnable verdict, enter it into the regression suite, and from then on the machine watches it for you after every change. This difference is of kind, not degree: QA verifies "this version," an eval accumulates "all of this team's judgment of what is correct." So over time the total QA stacks linearly with releases and evaporates just as linearly, while the eval suite only grows — it is sediment, each entry a once-painful lesson.
错误回流:每个 bug 离开时留下一条 eval
Errors flowing back: every bug leaves an eval behind
How the load-bearing wall is laid up comes down to one discipline: every fixed error must leave an eval behind when it goes. Fixing a bug without adding an eval that turns red when it recurs means the lesson was stored in one head and will be forgotten over time — next time the same class of error returns unchanged. Carry this discipline through and the suite becomes an asset that compounds with output: the more you produce, the more pits you have stepped in, the more evals flow back, the thicker the coverage; and the thicker the coverage, the higher the chance the next same-class error is auto-caught, the less a human must watch. This is the engineering isomorph of the organization volume's "self-improving loop" — distill each failure into a reusable asset, supervision demand falling over time (Graziano Day 4 calls it the steering loop: on each agent failure, ask "which guide/sensor should have caught this" and add that one). Add observability and the loop feeds itself: real error samples from production flow back automatically into new evals, so the suite compounds not only on internal pitfalls but on real-world feedback. This is why the trend is for the largest future engineering investment to flow into verification infrastructure — it is the one place where investment compounds over time.
为什么生成越廉价,eval 越是唯一瓶颈
Why cheaper generation makes evals the one bottleneck
Close the loop with ENG·00's "the bottleneck moves" here: once generation is near-free, the one thing that did not get cheaper is judging whether the generated thing is correct. An agent that can produce ten thousand lines overnight, lacking a wall that auto-decides whether those ten thousand are right, produces not value but review-debt — and a human's speed at reviewing ten thousand lines did not rise because the model got stronger. So the thickness of the eval suite directly sets how fast you can safely let generation run: the thicker the wall, the more output gets auto-judged, and the more a human can retreat from watching the screen in real time to handling only the few judgments the machine cannot. Conversely a team with a thin wall finds "the faster the agent, the more tired the human" — because every unit of output flows back to that human-review bottleneck that did not speed up. This yields a falsifiable resource prediction: a truly AI-Native engineering team's share of investment in verification infrastructure (evals / checkers / observability feedback) should rise, not fall, as production capacity rises; if a team "goes AI" and reinvests all freed labor into writing more features without thickening the verification wall in step, it will eventually be drowned by its own output — the team-scale version of the vibe-coding trap.
①
发现一类错Find a class of error
人审、生产事故或测试抓到一个真实的错——这是循环的输入。Human review, a prod incident, or a test catches a real error — the loop's input.
②
固化成 evalSolidify into an eval
写一条能在它复发时变红的判定,进回归套件——判断被写下来。这是承重墙在长高。Write a verdict that turns red on recurrence, into the regression suite — judgment written down. The wall grows taller.
③
机器替人盯The machine watches
此后每次改动自动重跑;监督需求随覆盖变厚而下降,循环复利。It re-runs on every change thereafter; supervision demand falls as coverage thickens, the loop compounding.
检验信号Test signal
先行:被修复的 bug 中"同时补了一条会变红的 eval"的占比在上升。证伪:若你的 eval 总数停滞、甚至同类 bug 反复回归,说明错误没有回流——验证还停在一次性人审,承重墙没有在长高。Leading: the share of fixed bugs that "also added an eval that turns red" is rising. Falsified if: your eval count stagnates or same-class bugs keep regressing — errors are not flowing back; verification is still one-off human review, and the wall is not growing.
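The three-step loop and its leading signal can be sketched as a tiny registry. Everything here is hypothetical: the bug id, the `slugify` function standing in for code under test, and the decorator name `eval_case`.

```python
EVALS = {}


def eval_case(bug_id: str):
    """Register a check that must turn red if the named bug recurs."""
    def register(fn):
        EVALS[bug_id] = fn
        return fn
    return register


def slugify(title: str) -> str:
    # Code under test (illustrative).
    return "-".join(title.lower().split())


@eval_case("BUG-0001")  # hypothetical bug: slugs once kept stray whitespace
def slug_has_no_whitespace() -> bool:
    return " " not in slugify("  Hello   World ")


def run_suite() -> dict:
    """Step 3: re-run every recorded judgment on each change."""
    return {bug_id: check() for bug_id, check in EVALS.items()}


def eval_backed_fix_share(fixed_bugs: list) -> float:
    """Leading signal: share of fixed bugs that left an eval behind."""
    backed = sum(1 for bug_id in fixed_bugs if bug_id in EVALS)
    return backed / len(fixed_bugs)
```

Tracking `eval_backed_fix_share` over time gives the rising (or stagnating) share that the test signal above asks for.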
独立验证器为什么必须和生成分离
Why the verifier must be separate from the generation
For an eval suite to be a load-bearing wall there is an often-overlooked but fatal structural condition: the thing that judges correctness must be independent of the thing that produced it. Letting the same agent that generated the code judge whether its own code is correct is roughly like letting a candidate grade their own exam. The very "feels right" judgment it used to generate is the judgment it uses to self-assess, so it is systematically blind to its own blind spots (confident wrongness is the textbook case). So the load-bearing strength of the wall lies not in "is there verification" but in "is the verification an independent force": an eval, a type checker, or a test suite renders its verdict without passing through the generating model's "feeling," checking instead against an external, definite criterion. This is why computational checks (compilers, tests) hold a special place in verification: they are natively independent of generation and cannot be swayed by its tone or self-consistency. When you must use an inferential check (say, another model doing a semantic review), independence must be manufactured by "a different context, a different criterion," not assumed to exist naturally. Falsifiable signal: if your "verification" is actually the same agent in the same context saying "I checked, it's fine," you have no load-bearing wall but a wall painted on paper, one that fails exactly when it most needs to block confident wrongness.
eval 套件就是组织被写下来的判断
The eval suite is the organization's judgment, written down
Push this sheet's thesis to its end and you reach a conclusion a little counter-intuitive for organizations: a team's eval suite is its collective judgment of "what is correct," externalized, solidified, and re-runnable by machine — it is the organization's memory on the matter of quality. Much of a senior engineer's value lies in the set of judgments in their head — "this won't do," "this will blow up," "this boundary was not considered"; but that set could previously be copied only slowly and lossily through code review, mentoring, and verbal transmission, and was lost when people left. Distilling each class of judgment into an eval moves the judgment that lived in individual heads into infrastructure that does not quit, does not tire, and executes on every change impartially. Then something profound happens: quality judgment shifts from "depends on a certain senior person being present" to "settled in the suite, inheritable by all." This is the same principle as the organization volume's "context as infrastructure" unfolded on the quality dimension — turning tacit, word-of-mouth judgment into a legible, queryable, executable explicit asset. It also gives "why verification infrastructure deserves the largest investment" a non-efficiency reason: each eval you invest in converts a one-off human judgment into a permanent, inheritable organizational capability. Falsifiable signal: if a team's core quality judgments still rely on "asking the most senior person" and no part of that judgment has settled into runnable evals, the team's quality memory is fragile — a large piece collapses when that person leaves, exactly the cost of "judgment never written down."
覆盖随产出复利:为什么 eval 是少数会越投越省的投入
Coverage compounds with output: why evals are one of the few investments that get cheaper the more you make
Most engineering investment is consumed linearly: fix a bug and you save the trouble of that one bug; write a feature and you get that one feature. Evals are different — one of the few investments that compound, and the mechanism is worth stating. Each eval added does not only block this class of error this once; it makes the error auto-rechecked on every future change — one eval's value equals the sum of all future regressions it blocks. So the suite's total value grows not linearly with count but with "count × change frequency × time." This compounding is further amplified in the agentic era: as generation speeds up and changes get more frequent, the number of times each eval is re-run explodes, and its marginal value with it. This is why "the largest future engineering investment flows into verification infrastructure" is not a slogan but a prediction derived from the compounding structure — in a system where generation is near-free and changes extremely frequent, the one investment that compounds over time is the verification wall that auto-re-exercises judgment on every change. Conversely it explains why thin-wall teams fall into "the faster they run, the more tired they get": their every change is not auto-rechecked, so a rising change frequency converts directly into rising human-review burden rather than being absorbed by a compounding wall. Falsifiable signal: compute "how many times a day your eval suite is re-run × how many potential regressions each run blocks"; if this number rises with your team's production capacity, your verification investment is compounding; if it stagnates, your wall is not growing and verification is still one-off labor. [Source: this series' Verification-chapter synthesis — errors flowing into evals, observability feedback, verification infrastructure as the largest investment direction, grade Ⅳ.]
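The compounding claim is simple arithmetic, sketched here with made-up inputs. The function names and all numbers are illustrative assumptions; the point is only that the product scales with change frequency even when the eval count stays fixed.

```python
def rechecks(evals: int, changes_per_day: int, days: int) -> int:
    """Count x change frequency x time: how often judgment is re-exercised."""
    return evals * changes_per_day * days


def daily_signal(runs_per_day: int, blocked_per_run: float) -> float:
    """The falsifiable signal: suite runs per day x regressions blocked per run."""
    return runs_per_day * blocked_per_run


# Same 200 evals over 30 days, before and after change frequency rises 5x.
before = rechecks(evals=200, changes_per_day=10, days=30)
after = rechecks(evals=200, changes_per_day=50, days=30)
```

With these inputs `after` is five times `before` without a single new eval being written, which is the sense in which each eval's marginal value rises as generation speeds up.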
ENG
16
WORKED CASES · 走一遍
WORKED CASES
实例 · 内核落在真实现场
Cases · The Kernel on Real Ground
四个现场,同一个内核在每一处都留人一道判断
Four sites, one kernel — each leaves a human one judgment
The earlier sheets state the principles; this one walks them across four concrete sites. Each case carries the same interrogation: where did execution become abundant? How did the error start to roll? Which structure — verifier / boundary / eval / spec — caught it? And finally, the one judgment a human kept: what was it, and why could it not be handed off. The material is drawn from common shapes of engineering practice; the numbers are order-of-magnitude illustrations, not one shop's exact ledger.
案例一 · 一次交给 agent 队伍的重构,雪球如何被独立验证器拦住
Case 1 · A refactor handed to an agent fleet, and how an independent verifier stopped the snowball
A team needed to migrate a 300-plus-file service from callback style to async/await. In the old world this was two engineers' two weeks of mechanical labor. Now they cut it into eighty independent PRs and handed them to a parallel agent fleet — the execution side finished overnight. The trouble surfaced at PR seventeen: the agent made a plausible-looking but semantically wrong rewrite of a concurrency primitive, wrapping a write that had to be serial inside a Promise.all. It threw no error and the tests were green, because the existing tests never covered that race path. It was confidently wrong, and it carried that wrong premise into every later PR that depended on the module — exactly the snowball from the failure sheet (see ENG·12).
A human watches eighty PRs roll by. PR seventeen's race is invisible to the eye, and by the time PR forty throws a strange bug and you backtrack, the snowball has rolled twenty-plus steps.
Insert a verifier separate from generation before merge: a concurrency property test (run the module's write ordering under a race detector) plus a rule that any PR touching a concurrency primitive is flagged red for human review. PR seventeen is red-carded on the spot; the snowball stops at step one.
The one judgment kept by a human: not "read all eighty diffs" — that stuffs the human back inside the amplified execution and is doomed (the irony of automation: the more reliable the system, the more the watcher drifts; see ENG·12). The kept judgment is constitutive: which primitives belong to the high-blast-radius zone where "touch it and a human must review." Written once, that judgment becomes the structure that issues the red card; it compresses the human's scarce attention from "look eighty times" to "look only at the one or two flagged red." This is kernel step ② — execution exits, judgment retreats onto the boundary that decides what counts as dangerous. [Mechanism from this volume's ENG·12 failure sheet plus Bainbridge, Ironies of Automation, Automatica 1983, grade Ⅱ.]
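The red-card rule from this case ("any PR touching a concurrency primitive is flagged for human review") could look roughly like the sketch below. The pattern list is an illustrative assumption: deciding which primitives belong on it is precisely the judgment the human keeps. The concurrency property test under a race detector is a separate check and is not shown.

```python
import re

# Illustrative patterns only; the real list encodes a human decision.
HIGH_RADIUS_PATTERNS = [
    re.compile(p)
    for p in (r"\bPromise\.(all|race)\b", r"\b[Mm]utex\b", r"\block\s*\(")
]


def touched_lines(diff: str):
    """Yield added or removed lines of a unified diff, skipping file headers."""
    for line in diff.splitlines():
        if line.startswith(("+++", "---")):
            continue
        if line.startswith(("+", "-")):
            yield line[1:]


def needs_human_review(diff: str) -> bool:
    """Red-card the PR if any changed line touches a listed primitive."""
    return any(
        pattern.search(line)
        for line in touched_lines(diff)
        for pattern in HIGH_RADIUS_PATTERNS
    )
```

A diff that replaces a serial `for ... await` loop with `Promise.all` is flagged by this rule, while a diff that only adds an unrelated constant passes through untouched.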
The ask is one sentence: "Add rate limiting to export so one user can't starve the whole export queue." In the old vibe-coding habit this becomes a prompt tossed at an agent, and you watch it emit a lump of "looks-right" middleware. Spec-driven development (SDD) walks it as a loop with an objective function — the six steps below are not boxes on a flowchart but six checkpoints, each with its own acceptance criterion.
①
Specify · 规格Specify
写清"什么算对":每用户每分钟 N 次、超限返回 429 + Retry-After、限流不得影响其他用户、计数器须在重启后存活。这份规格就是后面所有步骤的目标函数——没有它,生成无处收敛。State what counts as correct: N requests per user per minute, over-limit returns 429 + Retry-After, throttling one user must not affect others, the counter must survive a restart. This spec is the objective function for every later step — without it, generation has nowhere to converge.
②
Plan · 即时规划Plan
只规划到下一个能验证的检查点:先选算法(token bucket vs 滑窗)、定存储(进程内 vs Redis)。不写到第十步——执行变充裕后,远期计划在落地前就过期(见 ENG·13)。Plan only to the next verifiable checkpoint: pick the algorithm (token bucket vs sliding window), choose the store (in-process vs Redis). Do not plan to step ten — once execution is abundant, far-horizon plans expire before they land (see ENG·13).
③
Execute · 执行Execute
交给 agent:先写会失败的测试(红),再写实现到测试转绿。这是把"以测试为目标交办"落到位——agent 的输出有了可机检的靶子。Hand to the agent: write the failing test first (red), then the implementation until it goes green. This is "delegate toward a test" made concrete — the agent's output now has a machine-checkable target.
④
Verify · 验证Verify
独立验证器跑规格里的每一条:并发压测确认"限一个不伤其他"、杀进程重启确认计数器存活。这一步是承重墙——人审的是"测试是否锁住了规格的意图",不是逐行读实现。The independent verifier runs every clause of the spec: a concurrency load test confirms "throttling one doesn't hurt others," a kill-and-restart confirms the counter survives. This is the load-bearing wall — the human reviews whether the tests lock the spec's intent, not the line-by-line implementation.
⑤
Integrate · 合流Integrate
PR 即评审门:CI 跑全套,规格里的每条验收都是一个不可跳过的门。合不进去 = 还没满足规格,而不是"再 push 一次试试"。PR as the review gate: CI runs the full suite, and each acceptance clause is a gate that cannot be skipped. Failing to merge means the spec is not yet met — not "push once more and see."
⑥
Learn · 沉淀Learn
上线后真有一个客户触发了一条没想到的边界(批量导出绕过了 per-user 计数)。把这条 bug 反写成一条新的 eval,进套件。下次任何 agent 生成限流,都先撞这条 eval——错误变成了组织的记忆。In production a customer hits an unforeseen edge (batch export bypassed the per-user counter). Write that bug back as a new eval and add it to the suite. Next time any agent generates rate limiting, it hits this eval first — the error became the organization's memory.
The kept judgment concentrates in steps ① and ④: what "counts as correct" in the spec is set by a human — what the limit should protect, 429 versus queueing, how harsh on the abuser and how soft on the normal user, is a constitutive product-and-risk trade-off a machine has no standing to hold for you. And what the human reviews in step ④ is not how the code looks but whether the suite truly locks the intent of step ①. The other four steps — write the implementation, run the tests, pass the gate, log the bug — can and should be handed off once execution is abundant. [The six-step SDD plus "PR as gate / constitution as hard-rule layer" from GitHub Spec-kit [R5]; "write the failing test first" is classic TDD practice.]
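To make the spec clauses tangible, here is a minimal token-bucket sketch with checks that mirror three of them: the per-user limit, 429 plus Retry-After, and isolation between users. It is an assumption-laden illustration, not the case's actual implementation: the in-memory `store` dict merely stands in for an external store such as Redis, the clock is passed in explicitly, and `RateLimiter` is a hypothetical name.

```python
import math


class RateLimiter:
    """Token bucket per user; `store` stands in for an external store."""

    def __init__(self, store: dict, limit: int, window_s: float = 60.0):
        self.store = store
        self.limit = limit
        self.window_s = window_s

    def check(self, user: str, now: float):
        """Return (status, headers) for one request at time `now` in seconds."""
        tokens, last = self.store.get(user, (float(self.limit), now))
        refill = (now - last) * self.limit / self.window_s
        tokens = min(float(self.limit), tokens + refill)
        if tokens >= 1.0:
            self.store[user] = (tokens - 1.0, now)
            return 200, {}
        self.store[user] = (tokens, now)
        retry_after = math.ceil((1.0 - tokens) * self.window_s / self.limit)
        return 429, {"Retry-After": str(retry_after)}
```

Because state lives in `store` rather than in the limiter object, constructing a second `RateLimiter` over the same store approximates the "counter survives a restart" clause; a real test of that clause would kill and restart the process against a persistent store, as step ④ describes.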
案例三 · 一个权限过大的 agent,越界、回滚,与边界的重画
Case 3 · An over-privileged agent: the breach, the rollback, and the redrawn boundary
To "save effort," a team gave a scheduled agent that cleans expired data a full read-write production credential, on the rationale that "it occasionally flips a few status bits too." One night, while generating cleanup SQL, the agent lost the cutoff value for its WHERE created_at < ? filter (the variable had rotted in context behind an unrelated exchange, the context rot of ENG·12), produced a statement whose filter no longer excluded anything, equivalent to a full-table delete, and, because it held full privilege, executed it directly. Nothing structural stood between it and production data.
边界图 FIG. E16.0 / THE BLAST PATH · 权限即爆炸半径
看懂:每多给一档权限,左边那条"错误能走多远"的路径就长一截
Read: every extra privilege tier lengthens the left-hand path of "how far an error can travel."
爆炸半径不是模型属性,是结构属性。把 agent 默认降到只读、写操作走确认门,等于在错误和生产之间装一道限速器——同一个错误,半径从"系统级"压到"零"。Blast radius is not a model property but a structural one. Defaulting the agent to read-only and routing writes through a confirmation gate puts a speed limiter between error and production — the same error, radius compressed from "systemic" to "zero."
Rollback and redraw: that night a point-in-time recovery (PITR) rolled back forty minutes of data, losing the few writes inside the window. Afterward nobody set out to "blame the unreliable model" — that is pointless; a guessing system will occasionally guess wrong. The real fix was to redraw the boundary: the cleanup agent's credential dropped to read-only, the few status bits it needs to change go through a separate narrow interface with an explicit confirmation gate, and any DELETE / whole-table operation enters the "human must confirm" tier. This turns a post-hoc accountability problem into a structural one. [Least privilege / writes behind a confirmation gate / MCP security boundary from the Anthropic MCP specification and security guidance [R6]; PITR is a standard database recovery mechanism.]
留给人的那一道判断:哪些操作属于"不可逆 × 高爆炸半径"、必须设确认门。这不是模型能替你回答的——它取决于这份数据丢了赔多少、这家公司对停机的容忍度、这条接缝背后的合规约束。这正是分档计算器(INSTRUMENT 07)里 OWN 那一档:构成性判断,只有人能持有。
The kept judgment: which operations belong to "irreversible x high blast radius" and must sit behind a confirmation gate. A model cannot answer this for you — it depends on what losing this data costs, this company's tolerance for downtime, the compliance constraints behind the seam. This is precisely the OWN tier in the Delegation-Tier Calculator (INSTRUMENT 07): constitutive judgment, holdable only by a human.
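The redrawn boundary can be sketched as a gate in front of the database. This is a deliberately crude illustration: it classifies statements with regular expressions, which a production gate would not rely on, and the function name `gate` and its return strings are hypothetical.

```python
import re

DESTRUCTIVE = re.compile(r"^\s*(DELETE|TRUNCATE|DROP)\b", re.IGNORECASE)
UPDATE = re.compile(r"^\s*UPDATE\b", re.IGNORECASE)
HAS_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def gate(sql: str, params: tuple, human_confirmed: bool = False) -> str:
    """Decide whether an agent-generated statement may run."""
    # A missing bound value is refused outright, whatever the statement.
    if any(p is None for p in params):
        return "reject: missing parameter"
    whole_table_update = UPDATE.match(sql) and not HAS_WHERE.search(sql)
    if DESTRUCTIVE.match(sql) or whole_table_update:
        # The "human must confirm" tier from the redrawn boundary.
        return "run" if human_confirmed else "hold: human must confirm"
    return "run"
```

Which statement classes land in the confirm tier is the human's call; the gate only makes that call impossible to skip.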
案例四 · 一条 eval,怎样把一次性的踩坑变成组织的记忆
Case 4 · One eval, and how a one-time stumble became the organization's memory
A content-generation feature kept producing the same class of problem: when generating UI copy for non-English users, the model occasionally mixed Chinese punctuation (a full-width comma, say) into English strings — hard to spot by eye. QA caught it once every few weeks, fixed it once, and next time it returned. This is the classic "use a human as one-time labor to patch a structural hole" — each catch costs a fixed amount, the number of catches rises with capacity, and it never catches up.
旧 · 每次都用人当验证器Before · a human as the verifier each time
QA spot-checks copy each round — catch one, fix one, note it in the weekly report, then forget. The judgment never accretes; the next agent, next week, next person starts from zero and hits the same hole.
新 · 把这次判断写成一条 evalAfter · write this judgment as one eval
Encode "a CJK punctuation mark in an English string is a failure" as one assertion, add it to the eval suite, wire it into CI. From then on any copy from any agent hits this eval before merge — one stumble's judgment becomes a permanent, zero-marginal-cost gate.
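A minimal sketch of that assertion, assuming the check runs over each English string before merge. The two Unicode ranges (CJK Symbols and Punctuation, Halfwidth and Fullwidth Forms) are an illustrative choice and may need widening or narrowing for real copy; `english_copy_is_clean` is a hypothetical name.

```python
import re

# U+3000-U+303F: CJK Symbols and Punctuation
# U+FF00-U+FFEF: Halfwidth and Fullwidth Forms (includes the full-width comma)
CJK_PUNCT = re.compile(r"[\u3000-\u303F\uFF00-\uFFEF]")


def english_copy_is_clean(text: str) -> bool:
    """Fail any English string that contains CJK or full-width punctuation."""
    return CJK_PUNCT.search(text) is None
```

Wired into CI, this turns a defect that is hard to spot by eye into a deterministic red build.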
This is the micro-scene of "the eval suite is the organization's accumulated judgment" (ENG·15). What matters is not how clever the regex is but that a state change happened: the judgment moved from "living in one QA's head, re-invoked each time" to "living in the suite, reused automatically." The eval suite is therefore a compounding asset — the regressions it blocks equal runs-per-day times potential-regressions-caught-per-run, a number that rises with team capacity at near-zero cost. This is the exact inverse of the patch-by-hand curve, whose cost rises linearly with capacity while its benefit never accrues.
The kept judgment: the standard of "what counts as correct" itself — a judgment a human made after once noticing "Chinese punctuation leaking into an English UI looks bad." A machine can cheaply and deterministically enforce that standard a million times, but it will not set the standard for you; the standard is always a product of human taste and context. So this eval's value is not in automation per se but in how it amplified one scarce human judgment into unbounded cheap machine enforcement. [The eval-as-accumulated-judgment / verification-infrastructure-as-the-largest-investment claim is from this series' Verification synthesis, grade Ⅳ.]
Fit your most recent real "the agent helped enormously and nearly caused harm" episode into the four interrogations: where did execution become abundant, where did the error start to roll, which structure caught it, which judgment did the human keep. If you cannot answer "which structure caught it," you are still using a human as a live verifier — the doomed position in Case 1. If you cannot answer "which judgment did the human keep," you have either handed off a constitutive judgment too (dangerous) or not yet noticed you were holding it all along (un-accreted).
ENG
17
LEGACY STRUCTURES · 旧结构的失效
LEGACY STRUCTURES
机理 · 旧结构为何在充裕下断裂
Mechanism · Why Old Structures Break
六种被供奉的工程结构,为何在执行充裕时反而成了瓶颈
Six revered engineering structures, and why they become the bottleneck when execution is abundant
These are not straw men — they are real structures enshrined as "best practice" over the past thirty years. Each was a correct optimization for a world of scarce execution: when writing a line cost a lot, changing one was slow, and humans were the output bottleneck, every one of them made sense. The problem is that the constraint inverted — once execution is abundant and judgment is scarce, the same structure begins to crush the very thing it meant to protect. Below, each is named, with mechanism, not mood.
① 瀑布与"大设计先行"(BDUF)——规划视野的赌注下错了地方
① Waterfall and big-design-up-front (BDUF) — the planning-horizon bet placed in the wrong era
First, a fact often forgotten: Royce's 1970 paper, revered as waterfall's origin, actually warned that the single-pass sequential model carried inherent risk and argued for iterating at least twice — posterity took a "cautionary diagram" as scripture [R9]. Set that aside and look at the mechanism: BDUF front-loads a mass of judgment to the moment of least information (project start), betting that "the value of early planning > the loss from plans expiring." That bet often held under scarce execution — if shipping takes months, thinking it through early saves rework. But once execution is abundant, shipping collapses from months to hours, and the plan's shelf life is shorter than the build cycle: the detailed design you spent three weeks on expires before it lands, in a world where an agent can trial-and-error five approaches in three days (see ENG·13, the collapsing planning horizon). BDUF is not "insufficiently detailed" — it bet on the wrong era, spending judgment at the moment of least information and fastest expiry.
② 代码评审当守门——批量、排队,与逐行读 diff 的注意力破产
② Code review as gatekeeping — batches, queues, and the attention bankruptcy of reading diffs line by line
Treating "a senior engineer reads every PR line by line" as the sole quality gate works when output is small — a human can keep up. The mechanism breaks once output is amplified: when an agent fleet ships fifty PRs a day, the "a human must read every line" gate instantly becomes a single-point bottleneck in a queue. Queueing theory is blunt about this — as arrival rate approaches service rate, queue time explodes non-linearly (Reinertsen, The Principles of Product Development Flow, on batch size and queues) [R10]. Two outcomes follow: either PRs pile up and delivery dies at this gate, or reviewers start rubber-stamping to clear the queue, and line-by-line reading degrades into a glance-and-approve — the gate stands but blocks nothing. The deeper problem: reading diffs line by line checks "what the code looks like," whereas the thing actually worth guarding in the agent era is "do the tests lock the intent, are the boundaries right, do the seams make sense." Spending scarce senior judgment on reading syntax puts the human back inside the amplified execution.
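The non-linear blow-up can be illustrated with the textbook M/M/1 queue, where mean waiting time in the queue is rho / (mu - lambda). This is a simplifying assumption for illustration (random arrivals, one reviewer, exponential service times), not a formula taken from the cited source, and the rates used below are invented.

```python
def mean_wait_in_queue(arrival_rate: float, service_rate: float) -> float:
    """M/M/1 mean time spent waiting before review starts, in days."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrivals meet or exceed capacity")
    rho = arrival_rate / service_rate  # reviewer utilization
    return rho / (service_rate - arrival_rate)


# One reviewer who can read 10 PRs a day, at rising arrival rates.
waits = {rate: mean_wait_in_queue(rate, 10.0) for rate in (5.0, 9.0, 9.9)}
```

Under these assumptions the wait grows from about a tenth of a day at half load to roughly ten days at 99 percent load, and at fifty PRs a day against the same capacity the model has no steady state at all.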
The fix is not abolishing review but changing what it reviews. Move the gate from "read the artifact line by line" to "review the spec and the seams" — review the diff not the artifact, delegate toward a test (see ENG·06), let independent verifiers and evals guard syntax and regressions, and have the human take over only where the structure flags red. This is not lowering the bar; it is investing the same attention where it is irreplaceable.
③ QA as an afterthought: moving verification from the line's end back into every node
"Let developers pile up features first, then hand off to QA to test in bulk at the end" is a product-line mindset: verification is a separate, late station. This was tolerable under scarce execution — output was slow and the backlog at the end was small. Under abundant execution the assumption collapses outright: the front of the line (generation) gains an order of magnitude of throughput, the human-QA station at the end does not, and the work-in-progress between them (unverified code) piles up explosively. Worse is Case 1's snowball: an early unverified error gets amplified as a settled premise by every later step; deferring verification to the end lets the error roll twenty steps before discovery, and the later you correct, the more it costs.
The mechanical fix is to change verification from an "end station" to a "built-in checkpoint at every node" — exactly the argument of the failure sheet's twin-curve figure (ENG·12): the checkpointed trajectory resets the error to near zero while it is small, the un-checkpointed one amplifies exponentially into a snowball. QA should not be a department at the end of the line but an independent verifier woven into each loop. Quality is not inspected in at the end; it is locked in, intent by intent, at every step.
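The two trajectories can be reproduced with a toy model. The sketch assumes each unverified step multiplies the accumulated error by a fixed factor and each checkpoint resets it to a small floor; the growth factor, the floor, and the function name are illustrative assumptions, not values taken from the ENG·12 figure.

```python
from typing import List, Optional


def error_trajectory(
    steps: int,
    growth: float = 1.3,        # assumed per-step amplification of an uncaught error
    seed_error: float = 0.01,   # assumed size of the error right after a checkpoint
    checkpoint_every: Optional[int] = None,
) -> List[float]:
    """Return the error level after each step.

    With checkpoint_every=None nothing is ever verified and the error compounds.
    With checkpoint_every=k, every k-th step ends in a check that resets the
    error to seed_error.
    """
    error = seed_error
    history: List[float] = []
    for step in range(1, steps + 1):
        error *= growth
        if checkpoint_every is not None and step % checkpoint_every == 0:
            error = seed_error  # the checkpoint catches the error while it is small
        history.append(error)
    return history


if __name__ == "__main__":
    unchecked = error_trajectory(20)
    checked = error_trajectory(20, checkpoint_every=5)
    print(f"no checkpoints, step 20: {unchecked[-1]:.3f}")
    print(f"checkpoint every 5 steps, worst point: {max(checked):.3f}")
```

Under these assumptions the un-checkpointed run keeps growing for all twenty steps, while the checkpointed run never gets past four steps of growth before it is reset.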
④ The "10x engineer" myth: it measured the wrong thing, and that thing is now being made cheap
The "10x engineer" phrase traces all the way back to the 1968 experiment by Sackman, Erikson, and Grant: they found coding-time differences between programmers of about 20:1 and debugging-time differences of about 25:1 — but the same data carried a finding often dropped: individual output had no relationship to years of experience [R8]. The myth has two structural defects. First, it was always methodologically suspect (it pooled assembly and high-level-language subjects), and while later studies repeatedly confirm that "order-of-magnitude differences exist," personifying that into "there exists a 10x genius individual" is over-reading. Second, and more fatal: the "fast" it measured back then was mainly individual implementation / typing throughput — precisely the capability that agentic coding is now making cheap and un-scarce.
Mechanism: once the bottleneck moves off "types fast," the advantage of "the person who types 10x faster" depreciates. The amplification now lands elsewhere — a person who can get judgment, taste, context, and boundaries right can, by directing an agent fleet, produce ten or a hundred times the output. But this is no longer the 1968 "individual hero" story; it is the story of kernel step ②: what is scarce is no longer who types fast but whose judgment is right. To keep hiring, grading, and mythologizing "the 10x individual's implementation speed" is to optimize for a vanishing bottleneck.
⑤ The ticket factory: treating engineers as throughput units optimizes away the one thing still scarce
A ticket-factory team organizes engineering as an assembly line: requirements are split into uniformly grained tickets, engineers are interchangeable throughput units, and performance is measured as tickets closed per unit time. This structure serves a world where execution is scarce and output is the bottleneck; there, maximizing per-head code throughput is genuinely reasonable. The problem is that it systematically optimizes away the one thing still scarce under AI abundance. Treating people as throughput units means not rewarding, and often penalizing, behaviors such as slowing down to get clear on what counts as correct, spending time to draw a seam right, or stopping to question whether a ticket should be built at all. Those are exactly the judgment nodes an agent cannot replace, and they are becoming where all the value sits.
Mechanism: you get what you measure. Grade on ticket-closing speed and you get faster generation of more "looks-done" code, and a team where nobody owns the whole and nobody holds the constitutive judgment. When execution becomes nearly free, the marginal value of "close more tickets faster" trends to zero, while the judgment of "should this ticket be built at all, and was it built right" becomes everything. The ticket factory is not replaced by the agent; it is hollowed out by its own scoring function — the very thing it rewards is what the agent now gives away for free.
⑥ Manual approval at every step: pouring a crowd of humans back into a flood of execution amplified ten-thousandfold
"Every change, every deploy, every command needs a human to click confirm" is the old reflex of trading approval density for a sense of safety. Its implicit model is "changes are rare, so each is worth a human glance." Abundant execution detonates that premise: when an agent initiates ten thousand operations a day, "a human reviews every step" is arithmetically impossible, not for lack of diligence but because throughput is off by three or four orders of magnitude. Forcing it yields only two outcomes, both bad. Either approval becomes the throughput bottleneck and cancels out all of AI's speed (you bought a sports car but mandated getting out to push it through every intersection), or the human, to keep up, enters "mindlessly click approve" mode: the dialog still pops up, but no judgment happens and the click is pure ritual. This again hits the irony of automation: the vast majority of operations are fine, the human's vigilance hit zero around the thousandth approval, and on the ten-thousand-and-first, the one that truly should be stopped, the human clicks approve all the same.
The mechanical fix is to switch from "density" to "tiering": grade by reversibility × blast radius (exactly what INSTRUMENT 07 does). Low-radius, easily reverted operations are handed fully to structure (types / tests / lint) and never bother a human; only the irreversible × high-radius tier gets an explicit confirmation gate. This refocuses the human's scarce confirmation from "click every step into numbness" to "click a few times a year, but actually judge each time." More approval is not safer: excessive approval density manufactures unsafety through vigilance decay. Safety comes from putting the human at the right few nodes, not at all nodes.
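A minimal sketch of tiering by reversibility × blast radius. The two end tiers follow the text; the 1 to 5 radius scale, the threshold, the class names, and the middle tier for mixed cases are assumptions made for illustration, not the behavior of INSTRUMENT 07.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    STRUCTURE = "types / tests / lint only, no human"
    VERIFIER = "independent verifier, human samples later"  # assumed middle tier
    HUMAN_GATE = "explicit human confirmation"


@dataclass(frozen=True)
class Operation:
    name: str
    reversible: bool
    blast_radius: int  # hypothetical scale: 1 = one file, 5 = production-wide


HIGH_RADIUS = 4  # assumed cut-off for "high blast radius"


def route(op: Operation) -> Route:
    """Pick the confirmation tier for one operation."""
    high_radius = op.blast_radius >= HIGH_RADIUS
    if not op.reversible and high_radius:
        return Route.HUMAN_GATE   # the only tier that interrupts a human
    if op.reversible and not high_radius:
        return Route.STRUCTURE    # cheap to undo, small reach
    return Route.VERIFIER         # mixed cases: an assumption of this sketch


if __name__ == "__main__":
    for op in (
        Operation("rename a local variable", reversible=True, blast_radius=1),
        Operation("drop a production table", reversible=False, blast_radius=5),
        Operation("toggle a fleet-wide feature flag", reversible=True, blast_radius=5),
    ):
        print(f"{op.name}: {route(op).value}")
```

The design choice worth noticing is that the human gate is reached by exactly one branch, so adding more operations to the system does not add more human clicks unless they are both irreversible and wide-reaching.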
KEY FIG · FIG. E17.0 / THE INVERSION · The same structure, before and after the constraint inverts. Read: the x-axis is the constraint (execution scarce → abundant); each line is one old structure sliding from "best practice" to "bottleneck."
These six structures are not "stupid"; they were once right. Their failure shares one mechanism: each was a correct optimization for scarce execution, and after the constraint inverted, an optimization for the old bottleneck became an obstacle in front of the new one. Critiquing them is not to mock the past but to recognize which practices have passed their expiry.
INSTRUMENT 12 · Error-Compounding Simulator
Take the most sacred, untouchable engineering process on your team and ask three questions: was it optimizing for "scarce execution"? Does that constraint still hold today? If not, is it still protecting quality, or has it become the gate that jams everything once your capacity rose? If two of the three make you uncomfortable, you are probably enshrining an expired structure. Note: the answer is almost never "abolish it" but "switch what it guards from execution to judgment."
ENG
12
SPECULATION · The Speculation Act
Inference · Extrapolation, Not Fact
2026 to 2032: The Next Fold in the Craft of Engineering
This act does not predict which line occurs; it opens a possibility space — which curves are converging, the leading indicators and falsification conditions of each, and what shape the engineer's job gets folded into if they run to completion.
Nature of this chapter · Inference
What follows is extrapolation from the public trajectory of 2024-2026, not a statement of fact. The falsification conditions this volume will accept are listed with each curve; when reality overturns the extrapolation, this chapter should be the first thing rewritten.
Speculation is not daydreaming; it pushes the volume's kernel forward. The first eleven sheets established it: execution becomes abundant, judgment retreats along the verifiability gradient to intent, constraints, and verification, context becomes queryable infrastructure, and people return to meaning. The speculation act asks only one thing — if that kernel runs another six years, where does the leverage point move from the floor it stands on today (context / harness) up to next, and which floor does the engineer then stand on? What this sheet offers is not an answer but a map with falsification conditions written on it. [Source: a timeline extrapolation of this volume's four-step kernel (SHEET 01), grade Ⅴ inference.]
First, an anchor that was measured: capability rises, the felt sense lies
For speculation to stay honest, it must first own a measured result that contradicts the expected story. In 2025 METR ran a randomized controlled trial: 16 experienced open-source developers completed 246 real tasks on mature repositories they knew well, with each task randomly assigned to "AI tools allowed" or "AI tools not allowed." The result ran against almost everyone's expectation: tasks with AI allowed took about 19% longer on average, while the developers had forecast a 24% speedup beforehand and still believed afterward that they had been about 20% faster. In one specific setting (experienced developers × large, highly familiar codebases), this generation of tools actually slowed people down while manufacturing a strong illusion of "I am faster." This does not weaken the volume's thesis; it is its sharpest footnote. Faster generation ≠ faster delivery: the typing time saved is eaten by the judgment time of finding, reading, and correcting the generated output, unless that judgment is also moved into verification infrastructure. Placing this at the top of the speculation act forces every downstream "will be faster, will be stronger" extrapolation to be read with this felt-sense correction attached. [Source: METR 2025, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," randomized controlled trial, grade Ⅱ controlled measurement; a single study on a specific population and codebase type, not yet widely replicated, so not extrapolated to all settings. [R7]]
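The size of the perception gap is easier to see on one task. The sketch below applies the three reported percentages to a hypothetical 100-minute baseline; the baseline is an assumption chosen for round numbers, and the percentages are read as changes in completion time (a "24% speedup" taken as 24% less time).

```python
def completion_minutes(baseline_minutes: float, percent_change: float) -> float:
    """Completion time after applying a percentage change to the baseline.

    A negative percent_change means less time (felt or forecast speedup);
    a positive one means more time (measured slowdown).
    """
    return baseline_minutes * (1 + percent_change / 100)


BASELINE = 100.0  # hypothetical task length without AI, in minutes

forecast = completion_minutes(BASELINE, -24)   # what developers expected beforehand
measured = completion_minutes(BASELINE, +19)   # what the trial measured
believed = completion_minutes(BASELINE, -20)   # what developers reported afterward

perception_gap = measured - believed  # minutes of slowdown that felt like speedup

if __name__ == "__main__":
    print(f"forecast {forecast:.0f} min, measured {measured:.0f} min, believed {believed:.0f} min")
    print(f"gap between felt and measured: {perception_gap:.0f} min")
```

On these assumptions the same task is remembered as taking about 80 minutes while the clock says about 119, a gap of roughly 39 minutes that no amount of introspection would reveal.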
The future of AI-Native engineering is more than "models getting smarter." Behind it are three curves maturing independently and now converging — each loosening one set of today's constraints; their superposition sets the boundary of the speculation space. Each carries a falsification condition: if that observation appears, the curve stalls and this chapter's corresponding extrapolation is void.
Unlocks: The marginal cost of a unit of generated code approaches the electricity of one inference; long-horizon autonomous tasks (hours of continuous running with built-in rollback) move from demo to resident. When "try an approach" is nearly free, the engineering bottleneck shifts wholesale from "can it be written" to "is the direction right, and what does getting it wrong cost."
TRL: Early commercial. Per-task-priced coding agents exist by 2025; multi-hour unattended runs are still unreliable, and rollback and attribution are immature.
Falsified if: inference unit price stops falling (compute or power hits a ceiling), or long-horizon error rates do not fall as context engineering improves. Then "near-free generation" reverses and this curve stalls at the on-demand-assistant stage.
Harness standardization
Unlocks: The scaffolding around the model (guides/sensors, context assembly, tool protocols) moves from each shop building its own toward public contracts: MCP-class protocols, portable skills/commands packs, cross-vendor agent benchmarks. Once the harness standardizes the way a compiler toolchain did, a team-level "self-improving loop" can be packaged, inherited, and traded rather than rebuilt from scratch by every team.
TRL: Protocol nascent. Protocols like MCP landed in 2025 with fast adoption; public benchmarks and portable harness packs remain scarce.
Falsified if: major vendors each fence off their own turf and protocols fragment incompatibly. Then the harness stays welded to a single platform, "portable, tradeable loops" fail to materialize, and teams stay trapped rebuilding scaffolding.
Continuous-AI pipelines
Unlocks: After CI/CD comes "Continuous AI": agents reside in the repository, auto-triaging issues, backfilling tests, opening PRs, and reviewing on every commit. An engineering org's output flips from "humans write, machines run tests" to "machines write, machines pre-verify, humans set direction and hold the gate." Verification infrastructure (SHEET 11's load-bearing wall) turns from optional into the precondition for whether this pipeline can close at all.
TRL: Early pilot. By 2025 some repos run resident agents for issue triage and PRs; signal-to-noise and accountability at scale are unsolved.
Falsified if: resident-agent PR/review noise keeps drowning real signal and human-review burden rises rather than falls, or independent verification cannot keep pace with generation. Then the pipeline fails to close and reverts to a human-gated batch rhythm.
A Timeline Along Those Curves
FIG. 12.0 / 2026-2032 · ENGINEERING SPECULATION ARC · Read: four time-stakes; watch which floor the judgment bottleneck climbs to.
This rising staircase is not "engineers replaced floor by floor" but the same judgment bottleneck climbing floors: in 2026 you still review diffs; by 2028 the loop itself becomes a packageable asset and you judge "should this loop be installed"; by 2030 you manage a fleet of resident agents, with judgment moved from single outputs to fleet-level direction and gate-keeping; by 2032, if all three curves run to completion, the human holds only the three things from kernel steps ②③④ — intent, constraints, verification — and the rest the system grows. Every stake carries the METR correction at the foot of the figure: capability climbing does not make the felt sense honest; the higher you go, the more you must confirm you are actually faster by measurement, not by feel.
Speculation made only of assertions would feel abstract. The two pieces below are design fiction — explicitly fictional future artifacts that make the abstract claim "an engineer measured by judgment density" touchable. They are not predictions; they project the kernel onto 2032 as something you can hold and leaf through.
Hold intent, constraints, and verification for a fleet of roughly 40 resident coding agents. You no longer implement line by line; you write specs, set gate-keeping criteria, and adjudicate directional conflicts among agents.
Able to translate a class of quality judgment into an eval that turns red; able to sense, in a domain you don't fully understand, "where to stop and ask a human"; a physiological alertness to confident wrongness. We do not ask how many lines of code you write per day.
Judgment density (the number of judgment nodes you hold) and directional correctness, not output volume — the 2032 form of SHEET 06's "implementer → orchestrator."
Candidates who pitch "I can use agents to produce a lot of code fast" as their selling point. We've seen the end of that curve: high-speed output without independent verification is debt, not an asset.
SPECULATIVE · Fiction
ARTIFACT 02 · A 2032 Incident Postmortem
Postmortem: A Silent Regression Across an Agent Fleet
In February 2032, a shared "quick-fix" skill used by multiple agents carried a buried wrong assumption: when retrying failed network calls it silently swallowed a class of timeout exceptions. Over four days, 23 resident agents reused this skill into more than 1,900 PRs, all of which passed the existing evals — because no eval covered "a swallowed timeout." The incident was not any one agent "getting dumber" but a textbook gap in the load-bearing wall: generation compounded at high speed while the verification wall happened to be empty on exactly this class of error.
The shared skill treated a class of judgment ("timeouts must not be silent") as an implementation detail, never distilled into an independent eval. What compounded was not only the skill but the blind spot it carried.
Add an eval that turns red; following the steering loop, ask "which sensor should have caught this" and grind the answer into the shared harness — this skill now carries its own check across the whole fleet.
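What "an eval that turns red" could look like for this fictional incident, as a minimal sketch. The two retry helpers and the eval function are hypothetical names invented here; the point is only that the judgment "timeouts must not be silent" becomes a check that any retry implementation can be run against.

```python
from typing import Any, Callable


def retry_swallowing_timeouts(fn: Callable[[], Any], attempts: int = 3) -> Any:
    """The buggy pattern from the postmortem: timeouts vanish and None comes back."""
    for _ in range(attempts):
        try:
            return fn()
        except TimeoutError:
            pass  # silently swallowed: the caller never learns the call failed
    return None


def retry_surfacing_timeouts(fn: Callable[[], Any], attempts: int = 3) -> Any:
    """Retry on timeout, but re-raise the last timeout once attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_timeout: TimeoutError
    for _ in range(attempts):
        try:
            return fn()
        except TimeoutError as exc:
            last_timeout = exc
    raise last_timeout


def eval_timeouts_are_not_silent(retry_impl: Callable[..., Any]) -> bool:
    """Green (True) only if a call that always times out surfaces TimeoutError."""

    def always_times_out() -> None:
        raise TimeoutError("upstream did not answer")

    try:
        retry_impl(always_times_out)
    except TimeoutError:
        return True
    return False  # red: the timeout was swallowed


if __name__ == "__main__":
    print("buggy skill passes eval:", eval_timeouts_are_not_silent(retry_swallowing_timeouts))
    print("fixed skill passes eval:", eval_timeouts_are_not_silent(retry_surfacing_timeouts))
```

Because the eval takes the retry implementation as a parameter, the same check can ride along with the shared skill, which is the "carries its own check across the whole fleet" idea in the remediation.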
Honest speculation records the strongest argument against itself too. This volume's thesis: execution becomes abundant, judgment retreats to verification, so verification infrastructure becomes the largest investment direction and the engineer's value climbs to judgment. The counter-bet rebuts it like this — this "judgment climbs" picture may be only a local truth of one specific kind of work (experienced people, mature codebases, judgment already scarce), not the general direction of engineering. The METR measurement already shows the crack: in that setting, current tools not only failed to make people faster but manufactured the illusion of "faster"; if this "felt-fast, actually-slow" holds across more settings, then "judgment retreats, verification fills in" may never get the chance to happen — teams will first be drowned by a wave of generation that looks efficient but injects mountains of unverified debt, collapsing before the verification wall has grown tall. More sharply: perhaps the bottleneck was never "execution is scarce" but "judgment is scarce," and AI happens to amplify execution, not judgment — in which case this volume's prescription to "reinvest judgment bandwidth into verification" is a bad check to a team that was short on judgment to begin with, because what they lack is exactly the judgment needed to write that check. How this counter-bet gets confirmed: if, in three to five years, the teams most aggressive in adopting agentic coding show systematically higher production-incident rates, rework rates, and tech-debt metrics than restrained teams, and the gap does not converge as they "add verification," then the volume is wrong, not them. It is written here because a methodology that dares not record its own falsification condition does not deserve to be followed. [Source: counter-argument synthesis — the METR 2025 RCT (grade Ⅱ) plus the rival hypothesis that judgment, not execution, is the scarce factor (grade Ⅴ inference). [R7]]
Do not copy this act as a roadmap; treat it as a betting table with a scale on it: the three curves are wagers, each labeled with "what observation makes me concede." What you can do is watch those few leading indicators — the slope of inference unit price, whether harness protocols converge or fracture, whether your own team's "share of changes that get auto-rechecked" rises or stalls — and use them to calibrate which stake of the staircase you stand on, rather than letting the felt sense judge for you. The only certain thing about this act is that it will be rewritten; this volume volunteers to be the first to rewrite it.
ENG
11
PLAYBOOK
Action
Rollout · tools are surface, principles are the floor
Every "why this tool gets amplified" above reduces to five through-lines. Hold the principles and you are unfazed when tools change.
These five are not parallel slogans; they have a dependency order, and chained together they are the compression of every sheet above. Legibility is the foundation — what an agent cannot read makes the other four moot (ENG·02). On top of it, diffable / versionable / reviewable turns a change into a reviewable commit rather than an untraceable overwrite (ENG·02 / 06). A layer up, machine-checkable specs give generation an objective function so correctness can converge on its own (ENG·03 / 07). Composable capability interfaces (skills / MCP / CLI) let the agent safely extend its reach, each interface authorized separately (ENG·04 / 08). Finally, self-improving loops stitch the first four into a system that compounds with output (ENG·04 / 05). Running through all five is one line: humans and agents read and write the one source. Recognize these five and you need not chase tool news — when the next Markdown or the next TypeScript appears, the same ruler tells you whether it will be amplified.
Three principles + three metrics
01 / ↓
Dogfood · ramp↓
Everyone uses the product; how fast a newcomer becomes effective.
02 / ↓
Stay flat · PR cycle↓
Managers start as ICs; PR cycle time surfaces pipeline strain.
03 / ↑
Kill dead process · Claude commits↑
Keep asking "why still this way." But don't mistake throughput for success.
Each of the three metrics carries a counter-signal, because metrics get gamed. Ramp-time↓ is a true signal — when context is infrastructure, a newcomer ships in week one; but shortening the ramp by lowering the delivery bar makes the metric pretty and the quality gone. PR-cycle↓ surfaces pipeline strain, but speeding it up by bypassing the review gate is tearing out a load-bearing wall. "Claude-commit share↑" is the most dangerous: it is easily taken as a synonym for success, but throughput is not success — generating ten thousand lines no one needs and no one verified is worse than writing a hundred correct ones. So the real reading of each metric is "is it rising the right way," not "is it rising."
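One way to keep a metric and its counter-signal from drifting apart is to never report one without the other. The sketch below is an illustrative data shape, not a measurement method from this volume: the class, field, and label names are assumptions, and how each boolean is actually determined is left to the team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricReading:
    name: str
    moved_in_desired_direction: bool  # e.g. ramp time went down
    counter_signal_tripped: bool      # e.g. the delivery bar was lowered to get there


def read_metric(reading: MetricReading) -> str:
    """Classify a metric by whether it is rising the right way, not just rising."""
    if not reading.moved_in_desired_direction:
        return "not improving"
    if reading.counter_signal_tripped:
        return "improving the wrong way"
    return "improving the right way"


if __name__ == "__main__":
    readings = [
        MetricReading("ramp time", True, counter_signal_tripped=False),
        MetricReading("PR cycle time", True, counter_signal_tripped=True),   # review gate bypassed
        MetricReading("Claude-commit share", True, counter_signal_tripped=True),  # unverified output
    ]
    for r in readings:
        print(f"{r.name}: {read_metric(r)}")
```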
Finally, read this volume in reverse. The twelve sheets above all answer "how to build": legible, verifiable, composable, self-improving. But as in the organization volume, before "how to build" sits an earlier question: "what to build it for." For four centuries efficiency was assumed to be the goal itself; AI makes efficiency abundant for the first time, so it need no longer be treated as the goal at all. If verification, specs, and the harness are optimized to the limit while the engineer is reduced to a human rubber stamp clicking "approve" on a generation pipeline, engineering has merely been pulled back into throughput logic. The entire technique of this volume, including pushing verifiability to the limit, exists to free people from typing and throughput and return them to the judgment and building that only people can do and that is worth doing: deciding what to build, what is correct, and where the seams go.
Place this volume's four new sheets (failure taxonomy, JIT planning, trust boundary, the eval wall) back into the five through-lines and you see they are not new parallel entries but those five principles unfolded at more concrete points — an internal-consistency test of whether the volume coheres. The failure taxonomy (ENG·12) is the cause behind "machine-checkable specs" and "self-improving loops" — it makes clear why verification is non-optional and gives every earlier "how to verify" its failure-mode floor. JIT planning (ENG·13) is the "self-improving loop" disciplined on the time dimension — it lands "iterate on signal" from an attitude into a falsifiable way to plan. The trust boundary (ENG·14) is the security face of "composable capability interfaces" — it shows why each interface must be authorized alone, or composable becomes uncontrolled. The eval wall (ENG·15) is the load-bearing structure of the "self-improving loop" itself — it makes clear how the wall is laid up and why it compounds. In other words, all four new sheets hang on one trunk: legible → diffable → machine-checkable → composable → self-improving, and each deepens the mechanism at one node on that trunk. This is exactly where the volume differs from a tool list — tools expire, but the principle of "why these five get amplified, how they depend on each other, and where each most easily fails" does not. Falsifiable signal: if you can find any sheet in this volume whose mechanism cannot be sorted to one of the five through-lines and is not explained by them, then either that sheet is unmoored (delete it or wire it in) or the five principles are themselves incomplete (add one) — this self-check is itself the volume's load-bearing claim being falsifiable.
Every blueprint above covers "why build it this way, in what order"; this piece actually ships the software with you — not "design an engineering org" (that is the architect's job) but the execution layer itself: it does the work the kernel frees, letting agents generate / transform / refactor / migrate code and tests by default, and reserving human judgment for the few nodes that cannot be offloaded. Give it a feature, a service, or a loop that keeps reworking; it first runs the redraw-vs-graft gate (delete the agents and it collapses back to "a human typing every line, reviewing every step" = enablement, not native), scopes honestly (greenfield / one-loop brownfield / out-of-scope enablement told so plainly / a safety-or-livelihood boundary where it only assists), then runs the Specify → Plan → Execute → Verify → Integrate → Learn loop.
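The Specify → Plan → Execute → Verify → Integrate → Learn loop named above can be sketched as a small control flow. The stage order comes from the text; retrying execution after a failed verification, the attempt limit, and escalating to a human when attempts run out are assumptions of this sketch, as are all function and parameter names.

```python
from typing import Callable, List


def run_loop(
    execute: Callable[[int], str],   # hypothetical: produces an artifact for attempt n
    verify: Callable[[str], bool],   # hypothetical: the independent verifier / eval suite
    max_attempts: int = 3,
) -> List[str]:
    """Return the trace of stages visited for one unit of work."""
    trace = ["specify", "plan"]
    for attempt in range(1, max_attempts + 1):
        artifact = execute(attempt)
        trace.append("execute")
        trace.append("verify")
        if verify(artifact):
            # Only verified work is integrated; the learn stage feeds the next run.
            trace.extend(["integrate", "learn"])
            return trace
    # Assumed behavior: repeated verification failure is handed to a human.
    trace.append("escalate_to_human")
    return trace


if __name__ == "__main__":
    trace = run_loop(
        execute=lambda attempt: f"draft-{attempt}",
        verify=lambda artifact: artifact == "draft-2",
    )
    print(" -> ".join(trace))
```

The property the sketch is meant to show is structural: there is no path from "execute" to "integrate" that does not pass through a successful "verify".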
Three tiers, routed by the verifiability gradient: Delegate · agent runs; Review / Own · human judgment; policy · trust boundary
# invoke inside Claude Code
$ /skill ai-native-engineering
> "build this feature the AI-native way, reliably: ..."
→ runnable code + SPEC.md (living spec)
→ an eval / verification suite (the load-bearing wall)
→ JUDGMENT.md (judgment-node map) · PERMISSIONS.md (read-only-default trust boundary)
What this is · the engineering executable companion
The architect piece designs the org; this piece makes the engineering surface runnable as real output: code, specs, eval suites, and permission boundaries. Engineering is naturally execution-facing, but the point here is its companion role in the seven-piece system: one kernel, mutually coupled, with no fixed reading entry. Judgment node + stop-line: grade every action on reversibility × blast radius, and route irreversible / adverse ones to a human confirmation gate. The stop-line, stated exactly: never let an agent's classification be the sign-off on an irreversible or adverse action. Its confidence score is an input to the human's decision, never the decision. The "no" (a wrongful delete / reject / lockout) is gated with equal care: the adverse path is designed as carefully as the positive one.
SPEC.V / AI NATIVE METHODOLOGY / OWL METHODOLOGY SERIES
SCOPE / One methodology · the full organizational spectrum N=1 → N=many (from the one-person company to the agent network, on a single set of first principles)
SERIES / Six volumes, one kernel · this volume is one surface; the full wiring is above under "The Series."
APPENDIX · SOURCES / Evidence and citation registry; grading key: Ⅰ audit-grade empirics (cross-checked against regulatory filings) · Ⅱ peer-reviewed · Ⅲ theoretical model / working paper (citations must read "the model predicts," never "proven") · Ⅳ practitioner first-hand account · Ⅴ advisory forecast (a forecast, not a fact). Citation rows are authoritative in this table; the current 3-vote adversarial review found no overturned source.
| REF | GR | SOURCE | Load-bearing claim |
| --- | --- | --- | --- |
| R1 | Ⅳ | Alfonso Graziano, AI-Native Engineering (a Day 1–7 practitioner course / field notes; the spine theory source of this volume) · alfonsograziano.it/book | "Abundant execution is not licence"; implementer→orchestrator; the failure-mode taxonomy (hallucination / confident wrongness / snowballing / context rot / hidden assumptions); SDD's three maturity rungs; Delegate/Review/Own; JIT planning and "knowing when to stop"; MCP security; Continuous AI. The first-hand origin of most load-bearing claims in this volume. |
| R2 | Ⅳ | Hugging Face, Tiny Agents (open-source reference implementation + blog post, 2025) · huggingface.co/blog/tiny-agents | An agent is an inference client + tools + a while loop, with a minimal kernel of about 50 lines: runnable evidence for "cybernetics reborn" and for the building's ground floor. |
| R3 | Ⅳ | Anthropic, Effective Context Engineering for AI Agents (a first-hand vendor engineering article; cited in this volume via Graziano Day 4) · anthropic.com/engineering | Context engineering asks "what should be in the window right now"; accuracy is non-monotonic in context length, and past the peak, adding more context lowers accuracy. The basis for retrieval-style, "less is more" context assembly. |
| R4 | Ⅳ | Martin Fowler, Harness Engineering for Coding Agents (a first-hand practitioner article; cited in this volume via Graziano Day 4; the direct theoretical source for ENG·04) · martinfowler.com/articles | The harness is the system that produces and verifies context on every run; the steering loop and harness compounding; the two-axis taxonomy of computational×inferential and guides×sensors. In short, "the harness is the product." |
| R5 | Ⅳ | GitHub, Spec-kit (open-source spec-driven-development tooling: PR as review gate, constitution.md as the hard-rule layer; cited in this volume via Graziano Day 5–6) · github.com/github/spec-kit | Makes "a spec is not a switch but a ladder" concrete in runnable tooling: the spec as authoritative reference, the PR as review gate, constitution.md as an inviolable hard-rule layer. |
| R6 | Ⅳ | Anthropic, Model Context Protocol (MCP) (open protocol specification + security guidance; cited in this volume via Graziano Day 4) · modelcontextprotocol.io | The tool-attachment surface is the attack surface: tool poisoning, prompt injection, credential leakage, and the least-privilege manifest. The protocol-level basis for "read-only by default, write actions declared explicitly." |
| R7 | Ⅱ | METR, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," randomized controlled trial · 2025-07 · arXiv:2507.09089 · arxiv.org/abs/2507.09089 · metr.org (16 senior open-source maintainers; the authors warn against extrapolating to greenfield work) | Experienced developers were measured 19% slower with AI yet felt 20% faster: a gauge of "synthetic confidence," and the empirical anchor for the counter-argument that judgment, not execution, is the scarce factor (a single study on a specific population; not to be extrapolated to all settings). |
| R8 | Ⅱ | Sackman, Erikson & Grant, "Exploratory Experimental Studies Comparing Online and Offline Programming Performance," Communications of the ACM, 1968 · the empirical origin of the "10x" claim | Coding time differed by about 20:1 across programmers and debugging time by about 25:1, yet in the same data output showed no relationship to years of experience, and the method was flawed (it pooled assembly and high-level-language subjects). Used in this volume to dismantle the "10x individual" myth: the study measured implementation / typing throughput, the very capability agentic coding is making cheap. |
| R9 | Ⅱ | Winston W. Royce, "Managing the Development of Large Software Systems," Proceedings of IEEE WESCON, 1970 · the paper misread as waterfall's origin | Royce actually warned that the single-pass sequential model carries inherent risk and argued for at least two iterations; posterity took a cautionary diagram as scripture. Used in this volume for the "BDUF bet on the wrong era" argument: judgment is front-loaded to the moment when information is scarcest and expires fastest. |
| R10 | Ⅳ | Donald G. Reinertsen, The Principles of Product Development Flow · Celeritas, 2009 · on batch size and queues | As arrival rate approaches service rate, queue time explodes non-linearly; large batches amplify latency and variance. Used in this volume for the failure mechanism of "review as gatekeeping": an agent fleet's high arrival rate turns line-by-line human review into a single-point bottleneck, so either queue time explodes or review degrades to rubber-stamping. |
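The queueing claim in R10 can be made concrete with a small calculation. The sketch below is illustrative only and assumes the textbook M/M/1 model (random arrivals, one reviewer, exponentially distributed review times), which the registry does not itself specify; the function name `mm1_queue_wait` and the sample rates are hypothetical. Under that model the mean wait before review starts is W_q = ρ / (μ − λ), where λ is the arrival rate, μ the service rate, and ρ = λ / μ the utilization.

```python
def mm1_queue_wait(arrival_rate: float, service_rate: float) -> float:
    """Mean time a job waits in queue (before service) in an M/M/1 queue.

    Assumption for illustration: Poisson arrivals, a single server, and
    exponentially distributed service times. Rates share one time unit.
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("arrival_rate must be >= 0 and service_rate > 0")
    if arrival_rate >= service_rate:
        # At or beyond capacity the queue never drains.
        return float("inf")
    utilization = arrival_rate / service_rate  # rho
    return utilization / (service_rate - arrival_rate)


# Hypothetical reviewer who can review 10 pull requests per day.
REVIEWS_PER_DAY = 10.0
for prs_per_day in (5.0, 9.0, 9.9):
    wait_days = mm1_queue_wait(prs_per_day, REVIEWS_PER_DAY)
    print(f"{prs_per_day:>4} PRs/day -> mean wait {wait_days:.2f} days")
```

With these made-up numbers, raising arrivals from 5 to 9 pull requests per day lifts the mean wait from about 0.1 to about 0.9 days, and the last step from 9 to 9.9 lifts it to about 9.9 days. That is the non-linear blow-up the row describes.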
Registry scope: this is the AI-Native Engineering volume of the series. Citations center on Graziano's AI-Native Engineering as the spine, alongside the first-hand engineering literature it directly draws on (Anthropic / Fowler / Hugging Face / GitHub Spec-kit) and one controlled measurement (METR). Phrases such as "this volume's / this series' derivation" in the body mark internal inferences and carry no external source row; where a marker reads "trace to the original for final grading," the grade follows that original source.
REV. 2026-06 / END OF VOLUME · AI-NATIVE ENGINEERING