PART IV / AI-NATIVE RESEARCH · REDRAWING DISCOVERY

AI-Native Research Methodology

Research inherited an old pipeline: pick a question, read the literature, run the experiment, crunch the numbers, write it up. Once literature search, experiments, and data analysis are all near-free, that pipeline stops being the scarce part. What is scarce retreats to deciding which question to ask and then, one layer deeper, to judging which answer deserves belief and which truth is worth chasing. This is the most tightly coupled volume in the series because it sits furthest upstream: engineering judges whether something is correct; research has to judge whether it is worth knowing at all, a step down from an epistemic question into a value question. The discipline is the same as everywhere else: tools are the surface, and we are after the principle underneath.

KERNEL ON THIS SURFACE
① execution abundant (search / experiments / analysis near-free) → ② judgment forks along the verifiability gradient (in-paradigm questions join abundance, “which truth is worth knowing” sinks) → ③ a queryable evidence base becomes infrastructure → ④ people retreat to guarantors of what is worth believing and knowing. You need not have read the Organization volume; this page stands on its own.


How to read: six surfaces of one kernel, with no fixed entry point; the surfaces remain logically coupled and feed back into one another.

AI-ENABLED RESEARCH → AI-NATIVE RESEARCH

Speed: from “search, summaries, and experiments get faster” to “questions, evidence ledgers, and exploration ledgers are managed separately.”

Credibility: from “package output as conclusion” to “vouch for evidence grade, replication path, and uncertainty.”

Worth: from “know more facts” to “judge which truth is worth knowing and pursuing.”

Research moves from knowledge production to vouching for credibility and worth.

AI-NATIVE DOCUMENT PACK · PART IV

Research Pack: from producing knowledge to vouching for belief

This pack separates three things (question, credibility, and worth-knowing) so that mass generation does not push science into faster conservatism.

Thesis

When research execution is abundant, scarcity retreats first to questions, then to which truth is worth knowing.

AI-Native research is not faster paper production; it turns discovery into a traceable, reproducible, integrable credibility system while explicitly keeping value-laden direction with people.

RES

CONCEPT

Definition

From AI-Assisted Research to AI-Native Research

Speeding up the old scientific pipeline with AI: why is that still not AI-Native? This chapter draws the line first.

In one line

Where search, analysis, or parts of experimentation become faster, research difficulty concentrates in question choice, evidence quality, and what is worth knowing. This is a hypothesis to test discipline by discipline.

AI can list forty checkable next steps overnight, but wet-lab work, longitudinal follow-up, instrument queues, ethics review, and replication do not speed up on the same clock. The question worth asking is: in this specific research chain, which segment got faster, and which is still held back by physics, institutions, or credibility? Only once those limits are laid out can you ask which of the forty candidates to test first, and who signs their name to “worth believing.”

① PARTIAL ABUNDANCE

Search / analysis / some experiments / batch hypotheses

Automated segments can parallelize; physical experiments, replication, and governance may not get cheaper with them.

② JUDGMENT

Right question → believable → worth knowing

Forks along the verifiability gradient: in-paradigm questions join ① abundance; paradigm-level reframing and “worth knowing” sink to ④.

③ CONTEXT

Knowledge graph / evidence base as guardrail

A queryable evidence base is the “spec” for research generation: it keeps mass generation traceable, falsifiable, and integrable.

④ ACCOUNTABILITY

Vouch for belief · integrate · define what is worth knowing

Humans or entrusted institutions currently bear responsibility for “worth believing / worth knowing”; that arrangement should itself be tested.

Research Boundary

This volume neither mystifies “good questions” as human property nor rules out a model’s ability to ask them. The issue is whether a traceable chain connects question, evidence, and conclusion, and whether independent replication, negative results, and opportunity cost enter the judgment. If a system repeatedly does better across that chain, the human position should be redrawn.

Position in the system

Research belongs to the cognition-facing family (research · learning · innovation); organization · engineering · design are execution-facing, and the two families couple and feed back into each other. Most readers enter through the Organization volume (the most concrete and buildable), but the entrance is not the logical apex.

“A difference of kind, not degree”: the map metaphor makes it concrete

This volume keeps saying “the difference is one of kind, not degree,” but that line is easily heard as a slogan. Asimov Press’s map metaphor drives it to the bone. Borges wrote a parable in which an empire’s cartographers made their map ever more accurate until they produced one the size of the empire itself, at a scale of one to one. Detail was pushed to the limit, yet the map remained the same kind of information and never became new understanding. Draw the London Underground ever more accurately, marking every rail’s true curvature and geographic coordinates, and it is still just an ever-more-accurate geographic map.

That changed in 1933, when Harry Beck did something of a different kind: he threw away geographic accuracy and redrew the whole network like a circuit diagram, with straight lines, 45-degree angles, and even station spacing. All of it was “inaccurate,” yet for the first time anyone could see at a glance how to change trains. That is what a paradigm is, the way a discipline currently agrees to pose its questions, and changing one is a re-schematization, not an accumulation of detail. AI excels at making the map more accurate (filling blanks, adding detail, raising precision), but deciding what kind of map to redraw it as is a different kind of act. It does not lie on the “more accurate” axis, so no amount of compute crosses over to it automatically. This is the epistemological foundation of the whole volume’s “bottleneck moving.”

Three misreadings that disguise “grafting” as “native”

Reading this volume too narrowly almost always begins from one dislocation: mistaking a change of tools for a migration of scarcity. The first misreading is “faster = native”: a researcher uses AI to compress a six-month literature review into six days and declares themselves AI-Native. But that only sped up the execution of the same pipeline; the bottleneck still sits at “after those six days of reading, who judges which thread to believe and which direction to chase?” The second is “more = better”: treating output as performance, going from 4 papers a year to 40. The Nature bibliometrics cited in RES 03 already walked this road to its end: individual output and impact do rise, while science’s overall topical coverage contracts. The third is the most insidious, “automatic = autonomous”: believing that wiring up an end-to-end system like AI-Scientist amounts to handing research over. Such a system genuinely does automate the whole inherited pipeline, from picking a question through survey and experiment to conclusion, and the verification and experimental value it delivers is real. But the pipeline itself was never redesigned: the only proxy it has for scoring its own ideas is still “distance from the established paradigm,” so the more automated it gets, the harder it pushes science into the in-paradigm safe zone. This is what a transitional state looks like, not evidence of an end state. All three share one lesion: they moved execution without recognizing that scarcity had already moved.

Science is a resource-allocation problem, not an intelligence problem

The reason this volume treats research as upstream rather than as one more alpha machine rests on an often-overlooked claim: “science is fundamentally a resource-allocation problem, not an intelligence problem” (chenhaot’s formalization). That is, what limits scientific progress was never an intelligence bottleneck such as “not computing fast enough, not reading enough,” but an allocation bottleneck: to which problems should finite attention, funding, and talent go? AI drives the cost of computing, reading, and running experiments to near zero, but that does not solve allocation; it merely exposes the allocation bottleneck more starkly. Producing more does not equal knowing more: if ten thousand papers all cluster in the same data-rich hot corner, the edge of knowledge has not moved an inch. This is why the volume keeps saying “output volume itself is never the metric”: in a world of abundant execution, the one resource still scarce is attention, and where attention should go is precisely RES 06’s value judgment of “which truth is worth knowing.” See science as an intelligence problem and you pile on more compute; see it as an allocation problem and you guard the judgment node that decides where to point it.

So this volume’s first cut is to redefine the word “research” from “producing something publishable” to “landing a believable, worth-knowing truth in a traceable structure.” The scarce input of the former is hours; of the latter, judgment. Walk the kernel’s four steps once and you have the whole thesis: ① execution becomes abundant first: search, running experiments, even batch-generating hypotheses go near-free, so “doing research” is no longer scarce; ② judgment then becomes the bottleneck, and it splits along “can a machine check it?” into two branches: the machine-checkable half folds back into abundance, leaving only “which truth is worth knowing” with people; ③ context sinks into infrastructure: a queryable evidence base catches the mass generation and serves as its spec, keeping every claim traceable and falsifiable; ④ people retreat to the last cell, no longer producers of knowledge but guarantors of what is “worth believing, worth knowing.” The figure below draws those four steps as a self-correcting loop.

FIG. 0.1 / THE RESEARCH LOOP: REPLICATION IS THE LOAD-BEARING VERIFIER
Read: question → hypothesis → experiment → eval → knowledge is a loop; the one thing that keeps it from spinning idle is the “independent replication” gate. Remove it and the loop degrades into a fast generator.

The same loop runs in both engineering and research. The only difference is that gate: engineering’s verifier asks “is it correct?” (machine-checkable, eventually automated); research’s verifier is independent replication plus a credibility judgment, asking “is it worth believing, worth knowing?” The loop’s own generation cannot vouch for that answer itself; a human outside the loop must catch it. Remove replication and the five arrows still turn, but what turns is a fast generator spinning idle.
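The replication gate can be sketched in a few lines. This is only an illustration under stated assumptions: the `Claim` and `Replication` records and the `admit` function are hypothetical names, not part of this volume, and “independent” is reduced here to “reproduced by someone other than the author.”

```python
from dataclasses import dataclass, field


@dataclass
class Replication:
    by: str          # who ran the replication (hypothetical field)
    supports: bool   # did it reproduce the original result?


@dataclass
class Claim:
    statement: str
    author: str
    replications: list[Replication] = field(default_factory=list)


def passes_replication_gate(claim: Claim) -> bool:
    """True only if someone other than the author reproduced the result."""
    return any(r.supports and r.by != claim.author for r in claim.replications)


def admit(claims: list[Claim], gate_enabled: bool = True) -> list[Claim]:
    """Return the claims allowed into the knowledge base."""
    if not gate_enabled:
        # The loop still turns, but nothing that enters has been vouched for.
        return list(claims)
    return [c for c in claims if passes_replication_gate(c)]


claims = [
    Claim("A raises B", "lab-1", [Replication("lab-2", True)]),
    Claim("C lowers D", "lab-1", [Replication("lab-1", True)]),  # self-replication only
    Claim("E causes F", "lab-3"),                                 # never replicated
]
admitted = admit(claims)                      # only the independently replicated claim
ungated = admit(claims, gate_enabled=False)   # everything generated gets through
```

With the gate on, one of the three claims is admitted; with it off, all three are, which is the “fast generator” case. The credibility judgment that accompanies replication is deliberately absent from the sketch: it is the part a human outside the loop has to supply.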

RES

KERNEL

Thesis · Isomorphic to Engineering

The same bottleneck moves, but research’s judgment forks deeper

This chapter splits research’s “judgment” into two branches and sets the volume’s criterion.

In one line

Research’s judgment splits in two: the branch for which you can write machine-checkable acceptance criteria is eventually automated; the remaining branch, “which truth is worth knowing,” is the one this volume bets will hold. The wager is on the structural fact that it has no machine-checkable right answer, not on humans judging better forever.

The fork in kernel step ② (load-bearing for the whole volume): judgment is not one block “kept for humans.” It splits along “can a machine check it?” into two branches:

Machine-checkable → joins ① abundance

In-paradigm questioning/search/synthesis: finding the next checkable gap inside an existing theoretical frame, toward data-rich regions. AI is good at this and can parallelize it at scale. It is no longer “kept for humans”; it becomes one more automated form of execution.

Human-only (constitutive) → sinks to ④ value bedrock

Paradigm-level reframing and “which truth is worth knowing”: in sparse, value-laden domains there is no existing frame to borrow and no machine-checkable proxy for right. This is the human’s last, un-outsourceable scarce contribution.

One concrete pair of questions nails the fork down. “Is this new drug better than placebo?” is an in-frame question: it can be written as machine-checkable acceptance criteria (endpoint, sample size, and significance threshold all fixed in advance), and AI can design the trial, run it, and re-check it. “Should we keep defining ‘effective’ by tumor shrinkage rather than by how much longer the patient lives?” is a frame-level question: it asks not “which number is higher” but “which number ought we to use,” and it is value-laden, with no machine-checkable right answer. The first folds into ① abundance sooner or later; the second stays on the human side no matter how strong the model gets. The left branch’s scarcity is only a temporary capability threshold; the right branch’s is the structural fact of “no machine-checkable right answer,” and that line is the watershed for every operational test that follows.

Conflate the two and you are exposed to two opposite errors: either you defend the machine-checkable branch as “the last line of human dignity” (failing to make execution abundant and staying stuck in hours for nothing), or you hand the human-judgment branch away early as “automatable sooner or later” (ceding value judgment to the generation layer’s default bias). Seeing this fork clearly means seeing which scarcities time dissolves and which it does not; that is the source of every operational test in the volume.
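To make “machine-checkable acceptance criteria” concrete, here is a minimal sketch of the drug-versus-placebo question. Everything in it is an assumption for illustration: the `Prereg` record, the `accept` function, the choice of a pooled two-proportion z-test, and the trial numbers are all hypothetical, not taken from this volume or from any real trial.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Prereg:
    endpoint: str      # what counts as a response; chosen by people, fixed in advance
    min_per_arm: int   # planned sample size per arm
    alpha: float       # one-sided significance threshold


def one_sided_p(resp_drug: int, n_drug: int, resp_pbo: int, n_pbo: int) -> float:
    """One-sided p-value of a pooled two-proportion z-test (H1: drug > placebo)."""
    p_drug, p_pbo = resp_drug / n_drug, resp_pbo / n_pbo
    pooled = (resp_drug + resp_pbo) / (n_drug + n_pbo)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_drug + 1 / n_pbo))
    if se == 0:
        return 1.0  # no variation at all: nothing to conclude
    z = (p_drug - p_pbo) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


def accept(spec: Prereg, resp_drug: int, n_drug: int, resp_pbo: int, n_pbo: int) -> bool:
    """Mechanical check against the pre-registered criteria."""
    if min(n_drug, n_pbo) < spec.min_per_arm:
        return False  # under the planned sample size: criteria not met
    return one_sided_p(resp_drug, n_drug, resp_pbo, n_pbo) < spec.alpha


spec = Prereg(endpoint="tumor shrinkage at 12 weeks", min_per_arm=100, alpha=0.025)
full_trial = accept(spec, 60, 100, 40, 100)     # meets sample size, clear difference
small_trial = accept(spec, 30, 50, 20, 50)      # same response rates, too few patients
```

The function can be run and re-run by a machine, which is the in-frame half. The frame-level half shows up only as the string in `endpoint`: nothing in the code can say whether tumor shrinkage or survival is the right thing to measure.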

So the volume’s falsifiable core thesis takes shape: execution becomes abundant → questioning becomes scarce → then a retreat to “which truth is worth knowing” (a value judgment). Its condition for being false is explicit: if one can show that “which truth is worth knowing” can be losslessly formalized, aggregated, or handed to a system to decide automatically, the thesis falls. Being able to write down the condition that would refute it is what makes it a claim.

Isomorphism / dive

This step-② fork is the same move as engineering’s “machine-checkable correctness joins abundance, taste and risk sink to humans”; only research’s sinking branch goes deeper, falling from “is it true” to “is it worth knowing” and landing in axiology rather than epistemology. See the Engineering chapter.

Engineering judges “correct,” research judges “worth”: the same move, only deeper

Engineering faces code; its scarce judgment is “is it correct?”, an epistemic judgment with objective right and wrong and machine-checkable acceptance criteria, so verification tooling will eventually automate most of it. Research faces truth; its scarce judgment is first “which question to ask,” retreating finally to “which answer deserves belief, which truth is worth knowing.” The first half is still epistemology (credibility, with an evidence gradient); the second half has fallen into axiology, where there is no right answer, only the question of worth to whom and under which value frame. The moment a judgment turns from “is this true” into “is this worth knowing,” the coordinate system changes, and the new one has no machine-checkable right answer. That is the exact meaning of “deeper.”

Research is the series’ coupling hub, not just one more alpha machine. When research is placed beside engineering and design, the easiest mistake is to read it as “one more machine for making execution cheap.” It does make execution cheap, but its place in the series is not at the output end; it is upstream: what it produces is not code or interfaces but the hardest-to-outsource judgment of “which truth is worth knowing.” Only with that judgment in hand do downstream engineering, design, and the organization know where to invest their execution. Without the upstream, the downstream is precision idling: a team that can ship anything in six days, with no one to answer “which truth to ship,” merely walks the wrong direction faster. This is what “most deeply coupled” means: research is not just cited, it defines the series’ input.

The pivot of the whole volume: the moment “worth” passes from epistemology to axiology

If the whole volume could keep one sentence, it is this: research’s scarce judgment retreats along a single path, from “execution” to “asking the right question,” then to “which answer deserves belief,” and finally to “which truth is worth knowing.” The first steps are still in the “right or wrong” coordinate system, machine-checkable and eventually partly automated; only the last step changes the coordinate system, where “worth” has no machine-checkable right answer. But landing on that step only answers who catches the judgment; it does not answer an earlier question: if you strip the inherited pipeline (pick a question, survey, run the experiment, conclude) down to nothing and redesign it, what does research look like? Our current lean is to make reproducibility, not publishability, the one unit of account. Almost nobody has actually tried building an institution that answers purely for the stock of reproducible causal knowledge it produces, not for paper count. The strongest counterargument: reproduced knowledge with no paper-shaped, peer-debated vehicle to travel in cannot be confirmed by a community at all; “reproducible” degenerates into a private note in someone’s lab book, unverified because it never traveled. What would tell the two apart is concrete: whether an organization that skips papers and only maintains a reproducibility ledger accumulates, after ten years, a stock of independently verified knowledge that reliably outpaces the paper-based system at equal spend. That is a real bet, and no institution has run long enough yet to show the answer. This last step is the volume’s pivot and also the interface where research hands off to Innovation (value discovery); why it is the least outsourceable, and how to actually judge it, is unfolded in Section 7.

This coupling is bidirectional, and each seam lands on a concrete section. Upward, research hands “which truth is worth knowing” to Innovation (value discovery) at Section 7; downward, it hands “who owns the direction” to the Organization (governance) at Section 8; laterally, it mirrors Engineering, Design, and Architecture at Sections 1 and 4. The figure below draws all six seams together.

FIG. 1.0 / THE COUPLING HUB: RESEARCH SITS UPSTREAM, PRODUCING “WHICH TRUTH IS WORTH KNOWING”; THE OTHER FIVE VOLUMES SPEND EXECUTION ON IT
Read: research is the center; two solid lines are load-bearing hand-offs (↑Innovation, ↓Org), and four dashed lines are isomorphic mirrors (Eng/Design/Arch/Learning). Research ships not code or interfaces but the direction-setting judgment the downstream consumes; without it, the downstream is precision idling.

The two solid lines are one-way, load-bearing hand-offs: research→innovation (the epistemic “gap” handed to the axiological “worth”) and research→org (value judgment landing as governance). The four dashed lines are bidirectional isomorphic mirrors, one kernel acting on different faces: correctness (engineering), goodness (design), guardrails (architecture), acquisition (learning). This figure is also a research-view zoom of the site’s full system map.

RES

MECHANISM · EXECUTION CHEAP, QUESTIONS SCARCE

Mechanism (with the fork)

Execution gets cheap, questions get scarce, but questions fork too

Is “asking the right question” a permanent human monopoly? This chapter cuts it in two.

In one line

Half of “ask the right question” is a mirage: finding a checkable gap in-paradigm is nearest-neighbor search, which AI already does better than most people; the truly scarce thing is switching the frame.

Without this cut, the core thesis half-falsifies itself

Why not jump straight from “execution abundant” to “questioning scarce”? Because treating “questioning” as one monolithic block and declaring it “forever human” would let an obvious fact half-falsify the thesis on the spot: systems like AI-Scientist can already pose plenty of good, checkable questions inside an existing frame; give one a literature graph and its list of “what to test next” is often more comprehensive than a junior researcher’s.

The way out is not stubbornness but an honest cut of questioning into two layers. In-paradigm questioning (finding the next checkable gap toward data-rich regions) is indeed becoming abundant, so draw it out and fold it into ① abundance. The truly scarce thing is paradigm-level reframing (changing the frame, asking what the old frame cannot), which lies outside AI’s training distribution. Make this cut and the thesis is sturdier: it no longer bets on the strong claim “questioning is forever human,” which is headed for refutation, but on a more durable one: the paradigm-level half of questioning is not made abundant by stronger models. The mark of a good claim is that it dares to name which half will collapse, then moves the load onto the half that will not.

In-paradigm questions · made abundant

“Inside an existing frame, find the next checkable gap where data is thickest”: this is nearest-neighbor search on a knowledge graph; AI is good at it and clusters toward data-rich regions (see the empirical mechanism behind the hard anchor in RES 03).

Paradigm-level reframing · truly scarce

“Switch the frame, ask a question that could not even be posed inside the old one”: there is no existing data to borrow and no neighbor to follow. A Kuhnian paradigm shift [R20] lies precisely outside AI’s training distribution. This is where the real break sits.

[exploratory ledger] This break (“in-paradigm made abundant, paradigm-level still scarce”) currently rests on derivation from the thesis plus side-evidence from one epistemology review (The epistemic revolution of AI argues that AI simultaneously perturbs empiricism, falsification, and Kuhnian paradigms, but offers no item-by-item empirics). No single first-hand empirical anchor yet nails down “paradigm-level reframing cannot be made abundant by AI.” Per the two-ledger discipline it is marked “to be grounded,” with a leading indicator: whether the share of “reframe” contributions in AI-led research stays durably low. The Nature bibliometrics in RES 03 give strong side-evidence that “AI clusters toward data-rich regions and contracts topical coverage.”

The mechanism behind questioning turning abundant: a good question = nearest-neighbor search on the knowledge edge. Why would “asking a question,” seemingly the most human and least mechanizable act, be half made abundant? The mechanism hides in a common definition of “good question”: “a good question = spotting the most valuable gap at the edge of knowledge.” Once “valuable” in that definition is narrowed to “the next step checkable against existing data, nearest to the known,” it becomes a nearest-neighbor search problem on a knowledge graph, which is exactly the strength of large-scale graph analysis. Give AI a full-enough literature graph and ask it to find “which adjacent fields no one has yet bridged,” “which variable is repeatedly mentioned but never directly measured,” or “which hypothesis chain is missing its last link,” and it will list such in-paradigm good questions faster and more comprehensively than most humans. This is the left half of the break: in-paradigm questioning is in essence retrieval, and it becomes abundant.

The right half is entirely different: “switch the frame, ask a question that could not even hold in the old one.” There is no graph to search, because that graph is the very thing to be replaced. The researcher’s last scarce contribution lies in this paradigm-level half of “asking.”
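The left half can be sketched as a graph query. The example below is a toy under stated assumptions: the topic graph, its five nodes, and the `candidate_bridges` function are invented for illustration, and “nearest” is approximated by counting shared neighbors between two topics that are not yet linked.

```python
from itertools import combinations

# Hypothetical topic graph: an edge means "work already connects these two topics".
graph = {
    "protein folding": {"deep learning", "molecular dynamics"},
    "deep learning": {"protein folding", "molecular dynamics",
                      "weather forecasting", "materials design"},
    "molecular dynamics": {"protein folding", "deep learning", "materials design"},
    "weather forecasting": {"deep learning"},
    "materials design": {"deep learning", "molecular dynamics"},
}


def candidate_bridges(graph: dict[str, set[str]]) -> list[tuple[str, str, int]]:
    """Rank unlinked topic pairs by how many neighbours they share."""
    scored = []
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            continue  # already bridged: not a gap
        shared = graph[a] & graph[b]
        if shared:
            scored.append((len(shared), a, b))
    # Most shared neighbours first; ties broken alphabetically for stable output.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(a, b, n) for n, a, b in scored]


bridges = candidate_bridges(graph)
top = bridges[0]  # the unlinked pair with the most shared neighbours
```

Every candidate the function returns is a pair of nodes that already exist in the graph. A reframing that replaces the graph’s vocabulary has no node to rank, so this kind of search cannot surface it however large the graph becomes.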

The peer-review crisis: submissions rise, net knowledge may fall

“Questions made abundant, judgment scarce” is not an abstract derivation; it has already surfaced in the concrete institution of peer review. As of this edition (2026-07), multiple journals and conferences report markedly rising submission volumes and a sizable share of reviews bearing AI traces (author estimate · not in registry · not independently verified). The generation end (writing papers, writing reviews) is accelerated, while the judgment end (deciding which paper is worth publishing) has not scaled proportionally. Sharper still is an ODE model (arXiv:2604.05714) of review-system dynamics: under parameters of accelerating production and constant review bandwidth, it predicts the system may lose about 40% of its net knowledge. Note that this is a model prediction, not a proven fact (grade Ⅲ; citations must read “the model predicts”). But it names the tension: when anyone can have AI mass-produce “seemingly qualified” papers, review as the sole judgment gate gets flooded, and a flooded review process can only fall back on the cheapest proxies (format compliance, similarity to existing literature, citation counts). That in turn feeds RES 03’s structural bias: rewarding in-paradigm work and penalizing paradigm-level work.
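The arithmetic behind “generation accelerates, judgment does not scale” can be shown with a deliberately crude toy. It is not the cited ODE model and does not reproduce its 40% figure; the function name `reviewed_share` and every parameter value below are made up for illustration.

```python
def reviewed_share(initial_submissions: float, growth: float,
                   review_capacity: float, years: int) -> list[float]:
    """Share of each year's submissions that fits inside a fixed review capacity."""
    shares = []
    submissions = initial_submissions
    for _ in range(years):
        # Capacity is constant; anything beyond it gets no full review.
        shares.append(min(1.0, review_capacity / submissions))
        submissions *= growth  # production accelerates year over year
    return shares


# Hypothetical numbers: 1000 submissions growing 50% a year, capacity fixed at 1000.
shares = reviewed_share(initial_submissions=1000, growth=1.5,
                        review_capacity=1000, years=4)
```

Under these invented numbers the fully reviewed share falls every year while capacity never changes. The submissions beyond capacity are the ones that, in the paragraph above, get judged by the cheapest proxies.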

它是一道梯度，而不是一条界线。"范式内 / 范式级"读起来像一刀两段，真相更像一道连续的可验证性梯度：从"有现成 benchmark、对错可机检"那一端，平滑滑到"无对错、只有归属、对谁在哪个价值框架下值得"那一端。中间是大片灰区——可机检但判据本身要价值权衡（"多少证据算够"因域而异）。把它画成梯度而不是界线很重要，因为充裕的前线在持续右移：今天还要人判的"提一个范式内好问题"，明年可能被一个够强的知识图谱 agent 吃掉。命题没有押注"某个具体任务永远属于人"，它押注的是梯度最右端那一段——"哪个真相值得知道"这类由人定义的判断——不会因为模型更强而左移，因为它的稀缺不来自能力门槛，来自"无对错可机检"这个结构性事实。下面这张谱系图把若干真实研究动作钉在这道梯度上。

It is a gradient, not a line. “In-paradigm / paradigm-level” reads like a clean cut, but the truth is more a continuous verifiability gradient: from the “has a ready benchmark, correctness machine-checkable” end, sliding smoothly to the “no right answer, only belonging, worth to whom under which value frame” end. In between is a wide grey band: machine-checkable yet the criterion itself needs a value trade-off (“how much evidence is enough” varies by field). Drawing it as a gradient rather than a line matters, because the frontier of abundance keeps moving right: “posing a good in-paradigm question,” which needs a human today, may be eaten next year by a strong-enough knowledge-graph agent. The thesis does not bet that “some specific task stays human forever”; it bets that the rightmost segment — constitutive value judgment — does not move left as models get stronger, because its scarcity comes not from a capability threshold but from the structural fact of “no machine-checkable right answer.” The spectrum below pins several real research actions onto that gradient.

FIG. 2.0 / 可验证性梯度：充裕前线在右移，最右端不动THE VERIFIABILITY GRADIENT: THE FRONTIER MOVES RIGHT, THE RIGHT END DOES NOT看懂：横轴左＝可机检（被充裕），右＝由人定义的价值判断（稀缺）。竖虚线是"充裕前线"，它在右移，但越不过最右那段。本页图例：实线/实心＝观察到的事实或无争议机制，虚线＝本卷当前押注（可被改判），点线/降低不透明度＝竞争解释或未验证路径。Read: x-axis left = machine-checkable (made abundant), right = constitutive value judgment (scarce). The dashed line is the “abundance frontier”; it moves right but cannot cross the rightmost band. Legend for this page: solid line/fill = an observed fact or an uncontested mechanism, dashed = this volume’s current bet (revisable), dotted/lowered opacity = a rival explanation or unverified path.

命题的赌注，不是"提假设永远属于人"——那一格的前线明天就可能被吃掉。赌注是最右那段：判断"哪个真相值得知道"没有可机检的对错代理，所以模型再强也越不过去。把这道梯度看成连续的，你就不会犯两个对称错误：把右端的价值判断硬塞进左格自动化（把"该往哪推研究"交给 AI），或把已被前线吃掉的左格还死守在右边（人还在手动追引文）。The thesis bets not that “hypothesizing stays human forever”: that cell’s frontier may be eaten tomorrow. The bet is on the rightmost band: judging “which truth is worth knowing” has no machine-checkable proxy for right, so no stronger model crosses it. See the gradient as continuous and you avoid two symmetric errors: forcing the right-end value judgment into the left cell to automate (“let AI decide where to push research”), or defending an already-eaten left cell on the right (humans still tracing citations by hand).

Q-RES-01 · AI 更会选题之后，研究议程该归谁？一个实验室装了个定议程的 agent：喂它全领域文献，它排出下一季度最该做的问题。跑满一年，它选的题在引用和复现上都压过资深 PI，而 PI 偏爱的那个不时髦方向次次垫底。经费续约到了——下一步做什么，谁说了算？

Q-RES-01 · Once AI picks problems better, whose is the research agenda? A lab installs an agenda-setting agent: feed it the field’s literature and it ranks next quarter’s problems. A year in, its picks beat the senior PI’s on citation and replication, while the PI’s unfashionable pet direction lands last every time. Grant renewal comes due: who decides what gets done next?

梯度回答的是"值得"能不能被机器算，不是它归谁。就算得由人定义"值得"，也还有一问：哪个人、哪个机构？AI 一旦稳定选出更高产的问题，冲突就从"人对 AI"变成"谁的价值框架拿到经费和算力"——不是打分能裁的，三方理由都强、且指向相反。

The gradient answered whether “worth” can be machine-computed, not whose it is. Even granting a human must define “worth,” which human, which institution? Once AI reliably selects the higher-yield problems, the conflict shifts from “human vs AI” to “whose value frame gets the funding and compute”: no score adjudicates it, and three strong reasons point opposite ways.

议程归出资方AGENDA TO THE FUNDER

谁担后果，谁定方向Who bears the outcome sets the direction

拿钱、担署名与失败责任的机构才有资格定议程；没有 skin-in-the-game 的品味只是爱好。代价：这恰是科学同质化的老路——出资方奖励安全、易结项的赌注，把议程交还给本卷正警告的保守偏置。The body that puts up the money and answers for the failure earns the agenda; taste with no skin in the game is a hobby. Cost: this is how science already homogenizes, funders rewarding safe, closeable bets and handing the agenda back to the conservative bias this volume warns against.

议程归异质判断AGENDA TO THE OFF-MEAN VANTAGE

稀缺的正是那个偏离The departure is the scarce good

按本节自己的逻辑，稀缺的是样本量为一、偏离均值的判断，就该护住研究者追"只有他看见的矿"的权利。代价：这也可能是不可证伪的特权——"信我的品味"替固执和浪费挡箭，没有检查就与嘴硬无从分辨。By this section’s own logic the scarce thing is the sample-size-one, off-mean judgment, so protect the researcher’s right to chase “ore only they can see.” Cost: it can also be an unfalsifiable privilege, where “trust my taste” shields stubbornness and, with no check, is indistinguishable from obstinacy.

议程归最强出题者AGENDA TO THE BEST QUESTIONER

谁的题真复利，归谁Whoever’s questions truly compound

若 agent 的选题确实让知识复利，议程就该随表现走、不随头衔走，哪怕出题者是 AI 或运营它的机构。代价："最强"按框架内代理（引用、复现率）算，而那正是惩罚换框架的那把尺；把议程交给被测出来的表现，就是把均值系统化（RES 06 的品味梯度）。If the agent’s picks genuinely compound knowledge, the agenda should follow performance, not title, even when the questioner is an AI or the institution running it. Cost: “best” is scored on in-paradigm proxies (citations, replication), the ruler that penalizes frame-changes; ceding the agenda to measured performance systematizes the mean (RES 06’s taste gradient).

暂定回答 · Q-RES-01Working answer · Q-RES-01

别把议程判给任何单一赢家。把可机检的那半队列交给 agent——那里它核得动、也确实赢过 PI；另留一份不被度量的经费护住人的偏离下注。理由不是人选得更准，而是能裁决的那把尺正是惩罚换框架的那把：谁拥有评分函数，谁就拥有议程。改判条件很具体：若被保护的人类议程长期并不比 agent 多长出被独立复现的换框架结果，这份配额就该缩。Don’t award the agenda to any single winner. Give the agent the machine-checkable half of the queue, where its picks are checkable and do beat the PI; keep a separate, unmeasured share of funding for the human’s off-mean bets. The reason is not that humans pick better but that the ruler which would adjudicate is the one that penalizes frame-changes: whoever owns the scoring function owns the agenda. The revision condition is concrete: if the protected human agenda does not, over years, grow more independently replicated frame-changing results than the agent, the quota should shrink.

更深的一问。这场听证默认了该问"议程归谁"。更利的一刀在下面：若同一笔资源不养一个被拥有的议程，而是同时养许多互相竞争的小议程，"归谁"之争就自行消解——可谁来护住那些还没兑现就被撤资的输家（创新卷守的"散木"）？再往里一层：当主张生成近乎无限、复现依旧昂贵，谁获得被验证的那次机会本身就是一次分配——而"这个真相值得验"到底是科学判断，还是已经是政治判断？

The sharper question. This hearing assumed the thing to ask is “whose agenda.” The sharper cut is underneath: if the same resources funded many small competing agendas instead of one owned agenda, the “whose” fight dissolves on its own, but who then protects the losers, defunded before they can pay off (the “useless tree” Innovation guards)? One layer deeper: when claim-generation is near-infinite and replication stays expensive, who gets the one chance to be validated is itself an allocation, and is “this truth is worth verifying” a scientific judgment, or already a political one?

RES

REDRAW · 从产知识到判可信

PRODUCING → VOUCHING

重画 · peer review 性质改变

Redraw · Peer review changes kind

科学社区的价值，从产生知识转向担保可信

The community’s value shifts from producing knowledge to vouching for it

AI 批量出论文后，同行评审到底在评什么？

Once AI mass-produces papers, what is peer review actually judging?

一句话In one line

AI 数小时产出人类数年的研究量后，同行评审从"评这一篇好不好"变成"评这个 AI 研究者整体多可信"，更像信用评级，而非论文打分。Once AI produces in hours what took humans years, peer review shifts from “how good is this one paper” to “how credible is this AI researcher overall”: more a credit rating than a paper grade.

一手信号（瓶颈正搬向"判断可信"）：Sakana 的 AI Scientist-v2 自动化了"提想法 → 设计实验 → 跑 → 写论文 → 评审"的全生命周期，其生成的论文已通过 ICLR 2025 workshop 同行评审（一篇均分越过人类录用阈值）。但独立评估指出：它的文献综述靠简单关键词检索、新颖性判断薄弱（把已确立概念误判为新）。执行被充裕化后，最弱、最稀缺的恰是"判断可信 / 判断新颖 / 判断值得"，这正是命题预言的瓶颈搬家。〔等级 Ⅳ 一手陈述 + Ⅲ 独立评估〕

First-hand signal (the bottleneck is moving toward “judging credibility”): Sakana’s AI Scientist-v2 automated the full lifecycle “ideate → design experiments → run → write the paper → review,” and a paper it generated passed peer review at an ICLR 2025 workshop (one paper’s mean score cleared the human acceptance threshold). But independent assessment notes that its literature review relies on simple keyword search and its novelty judgment is weak (it mistakes established concepts for new ones). Once execution is made abundant, the weakest and scarcest capacities are exactly “judging credibility / novelty / worth,” which is precisely the bottleneck move the thesis predicts. [grade Ⅳ first-hand account + Ⅲ independent assessment]

硬锚 · 文献计量 · 等级 Ⅱ（观测性，慎言因果）Hard anchor · bibliometrics · grade Ⅱ (observational, causal claims with care)

Hao, Xu, Li & Evans,《AI tools expand scientists' impact but contract science's focus》, Nature 649(8099), 2026, DOI 10.1038/s41586-025-09922-y。对约 4129.8 万篇论文的分析：用 AI 的科学家个人影响力上升，但科学整体主题覆盖收缩 4.63%、学者间互动下降 22%、引用集中度上升（Gini 0.754 vs 0.690）。机理＝AI 向数据丰富区聚集、自动化既有领域而非探索新领域。〔标选择效应〕同行评审 + 开放数据，但属观测性文献计量，因果须谨慎（用 AI 者本就可能集中于热门领域，是相关非因果）。它对 RES 02 的"提问分叉"是强侧证：生成层有保守偏置——加速 ≠ 进步。

Hao, Xu, Li & Evans, “AI tools expand scientists’ impact but contract science’s focus,” Nature 649(8099), 2026, DOI 10.1038/s41586-025-09922-y. An analysis of about 41.298 million papers: individual impact rises for scientists who use AI, yet across science as a whole topical coverage contracts by 4.63%, scholar-to-scholar interaction drops by 22%, and citation concentration rises (Gini 0.754 vs 0.690). Proposed mechanism: AI work clusters toward data-rich regions, automating existing fields rather than exploring new ones. [flag selection effect] The study is peer-reviewed with open data, but it is observational bibliometrics, so causal claims need care: AI users may already concentrate in hot fields, which would make this correlation rather than causation. It is strong side-evidence for RES 02’s “questioning fork”: the generation layer carries a conservative bias, and acceleration ≠ progress.
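For readers unfamiliar with the concentration measure quoted above, here is a minimal sketch of the standard sample Gini coefficient applied to citation counts. The text does not say which estimator the paper uses, so treat this as the textbook formula rather than the authors’ method; the citation counts are hypothetical.

```python
def gini(values):
    """Sample Gini coefficient: 0 means perfectly even, values near 1 mean concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


even_citations = [5, 5, 5, 5]           # hypothetical: attention spread evenly
concentrated_citations = [0, 0, 0, 10]  # hypothetical: one paper takes everything

print(gini(even_citations))
print(gini(concentrated_citations))
```

A rise from 0.690 to 0.754 on this scale means citations pooling in fewer papers, which is the direction the “contracting focus” reading expects.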

peer review 的结构性冲突：若让 AI 自评"新颖性"，它唯一能用的代理就是"与既有文献分布的距离"：而这恰恰把真正新颖的工作压低分（越偏离现行框架，越像"离群/不可信"）。于是 AI 评审天然奖励框架内、惩罚换框架，把 RES 02 的保守偏置又放大一层。这就是为什么"评 AI 研究者的可信度"是与"评质量"根本不同的工作：前者要的是人去抵抗这条结构性偏置。

The structural conflict in peer review: if AI self-assesses “novelty,” the only proxy available to it is “distance from the existing-literature distribution,” and that is exactly the proxy that scores genuinely novel work low (the further a paper departs from the established paradigm, the more it reads as “outlier / not credible”). So AI review intrinsically rewards in-paradigm work and penalizes paradigm-level work, amplifying RES 02’s conservative bias one more turn. This is why “judging the credibility of an AI researcher” is fundamentally different work from “judging quality”: the former needs a human to resist this structural bias.
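A toy sketch of that structural conflict, under loud assumptions: papers are reduced to hypothetical 2-D embedding vectors, and the automated reviewer equates credibility with closeness to the corpus centroid. Real systems are more elaborate; the sketch only shows why a distance-based proxy ranks frame-changing work last.

```python
import math


def centroid(points):
    """Component-wise mean of equal-length vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def credibility_by_proximity(corpus, submissions):
    """Score each submission by closeness to the corpus centroid, best first."""
    centre = centroid(corpus)
    scored = [
        (name, 1.0 / (1.0 + math.dist(vec, centre)))
        for name, vec in submissions.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical embeddings of the existing literature and three new papers.
corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
submissions = {
    "incremental": [0.6, 0.5],
    "adjacent": [2.0, 2.0],
    "frame-change": [8.0, 9.0],
}

ranking = credibility_by_proximity(corpus, submissions)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The ordering is a property of the proxy, not of the papers: nothing in the score distinguishes “far because wrong” from “far because new.”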

检验信号Test signal

判断/复现占研究者时间的比例升、撤回率降。反向证伪：AI 自评新颖性追平人类盲评，则"结构性冲突"松动。The share of researcher time on judging/replicating rises; retraction rates fall. Reverse-falsifier: if AI novelty self-assessment matches blinded human review, the “structural conflict” premise weakens.

"评质量"与"评 AI 研究者的可信度"是两份不同的工作。peer review 的性质改变，是工作种类的质变，不是"评审变多了"这种量变。旧的 peer review 评的是这一篇研究做得好不好：它默认背后有一个会为自己声誉负责、会被同行追责的人类作者。当论文由 AI 研究者批量生成时，评审的对象悄悄从"这一篇"变成了"这个 AI 研究者的产出整体上有多可信"。这是两份工作：前者是逐篇的质量判断，后者是对一个生成源的可信度担保。

为什么这个区别承重？因为它决定了人该把判断投在哪里：如果还按"评质量"的老办法逐篇精读，带宽会被瞬间淹没（RES 09 的天平正是为此而设）；只有认识到该评的是"生成源的可信度"，才会去建可信度评估的机制——抽样复核、追踪某个源的历史命中率、对它的系统性偏置打补丁。把新工作当旧工作做，是 peer review 在 AI 时代失效的第一步。

“Judging quality” and “judging an AI researcher’s credibility” are two different jobs. Peer review’s change of kind is not the quantitative “more reviews” but a qualitative change of job. Old peer review judged how well this one study was done: defaulting to a human author behind it who answers for their reputation and is held accountable by peers. When papers are batch-generated by an AI researcher, the object of review quietly shifts from “this one” to “how credible, overall, is this AI researcher’s output.” These are two jobs: the former is per-paper quality judgment, the latter is vouching for the credibility of a generation source.

Why does this distinction bear weight? Because it dictates where the human should spend judgment: keep reading each paper closely the old “judge quality” way and bandwidth is instantly flooded (RES 09’s ledger exists precisely for this); only by recognizing that what to judge is “the source’s credibility” do you build credibility-assessment mechanisms: sample re-checking, tracking a source’s historical hit-rate, patching its systematic bias. Doing the new job as the old job is peer review’s first step toward failing in the AI era.
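The two mechanisms named above, sample re-checking and tracking a source’s historical hit-rate, can be sketched as follows. The names `SourceRecord` and `sample_for_recheck` are hypothetical, and the smoothed hit-rate (add one success and one failure) is an illustrative choice so that a source with no history does not score 0 or 1.

```python
import random
from dataclasses import dataclass


@dataclass
class SourceRecord:
    """Running credibility record for one generation source (e.g. one AI researcher)."""
    checked: int = 0
    confirmed: int = 0

    def record(self, replicated):
        self.checked += 1
        if replicated:
            self.confirmed += 1

    def hit_rate(self):
        # Laplace-smoothed share of re-checked papers that held up.
        return (self.confirmed + 1) / (self.checked + 2)


def sample_for_recheck(paper_ids, fraction, seed=0):
    """Pick a reproducible random subset of a batch for close human re-checking."""
    rng = random.Random(seed)
    k = min(len(paper_ids), max(1, round(len(paper_ids) * fraction)))
    return sorted(rng.sample(paper_ids, k))


batch = list(range(100))                 # hypothetical batch of 100 generated papers
to_check = sample_for_recheck(batch, 0.05)

source = SourceRecord()
for outcome in [True, True, False, True, True]:  # hypothetical re-check results
    source.record(outcome)

print(to_check, round(source.hit_rate(), 3))
```

The human reads five papers closely instead of a hundred, and the judgment lands on the source rather than on each paper.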

AI 科学案例账：把"已发生 / 正在发生 / 推演"分开记

An AI-science case ledger: book “happened / happening / projected” separately

命题最容易被两种姿态毁掉：用一个炫目的成功案例当成"自主科研已成"的证据，或用一个失败案例当成"AI 做不了科学"的反证。两者都把证据等级压平了。诚实的做法是建一张案例账，每一条都标三件事：它是什么时态（已发生 / 正在发生 / 推演）、它的证据等级（Ⅰ–Ⅴ）、以及它支持还是挑战命题。下面这张表把研究卷用到的主要 AI 科学案例摆在一起——读它的方式是看瓶颈搬到哪去了，不是"看 AI 多强"：几乎每一条成功都在执行端，几乎每一条短板都在判断端（新颖性、可信度、方向选择）。这正是命题预言的形状。

The thesis is most easily wrecked by two postures: using one dazzling success as proof that “autonomous science has arrived,” or using one failure as a counter-proof that “AI cannot do science.” Both flatten the evidence grades. The honest move is to build a case ledger where every row is tagged with three things: its tense (happened / happening / projected), its evidence grade (Ⅰ–Ⅴ), and whether it supports or challenges the thesis. The table below sets the main AI-science cases this volume draws on side by side; the way to read it is not “how strong AI is” but where the bottleneck moved: almost every success sits on the execution side, almost every shortfall on the judgment side (novelty, credibility, direction selection). That is exactly the shape the thesis predicts.

案例	发生了什么	时态	证据级	对命题	Case	What happened	Tense	Grade	vs. thesis
Sakana AI Scientist-v2	全生命周期自动化的论文过 ICLR 2025 workshop 评审；但独立评估指出文献综述靠关键词、新颖性判断弱。	正在发生	Ⅳ 一手 + Ⅲ 独立评估	支持：执行被自动化，瓶颈现于"判新颖/判可信"。	Sakana AI Scientist-v2	A full-lifecycle-automated paper passed an ICLR 2025 workshop review; independent assessment notes keyword-based lit. review and weak novelty judgment.	happening	Ⅳ first-hand + Ⅲ indep.	supports: execution automated, bottleneck surfaces at “judge novelty / credibility.”
DeepMind GNoME	2023 发现约 220 万新晶体材料，但绝大多数是已知结构类型内的元素替换。	已发生	Ⅱ–Ⅲ（须回溯 Nature 原文）	支持+限定：框架内填空被充裕，不等于换框架重画。	DeepMind GNoME	2023, ~2.2M new crystal materials, the vast majority element-substitutions inside known structure types.	happened	Ⅱ–Ⅲ (trace to Nature)	supports+qualifies: in-paradigm gap-filling made abundant ≠ paradigm-level redraw.
AI Feynman（符号回归）	100 条费曼方程全数重发现（旧软件 71 条），但都是已知方程。	已发生	Ⅲ（须回溯原文）	支持+限定：重发现 ≠ 发现新描述层级。	AI Feynman (symbolic regression)	Recovered all 100 Feynman equations (an older tool got 71), but all known equations.	happened	Ⅲ (trace to source)	supports+qualifies: re-discovery ≠ finding a new level of description.
Hao 等 · 4129.8 万篇论文	用 AI 者个人影响力↑，但科学整体主题覆盖 −4.63%、学者互动 −22%、引用集中（Gini 0.754）。	已发生	Ⅱ（观测性，慎言因果）	强支持：生成层有"向已知收敛"的保守偏置。	Hao et al. · 41.3M papers	AI users’ individual impact ↑, but science-wide topical coverage −4.63%, interaction −22%, citation concentration (Gini 0.754).	happened	Ⅱ (observational, causal care)	strongly supports: the generation layer has a “converge to the known” conservative bias.
太阳系基础模型（无引力表示）	在 1000 万模拟系上训练能精确预测轨道，却没习得"引力"这一表示——只是统计拼凑。	已发生	Ⅳ（Asimov 转引，须回溯）	支持：预测准 ≠ 习得正确的描述层级。	Solar-system foundation model (no gravity)	Trained on 10M simulated systems, predicts orbits precisely yet never learns a “gravity” representation: a statistical patchwork.	happened	Ⅳ (Asimov-cited, trace)	supports: accurate prediction ≠ acquiring the right level of description.
元科学"模式生物"	让 AI agent 种群在不同研究条件下并行，首次能实验"什么条件孕育颠覆"。	推演	Ⅴ（观点/论证）	前沿命题：勿当数据，是研究计划方向。	Meta-science “model organism”	Run populations of AI agents in parallel under different research conditions: first chance to experiment “what conditions breed disruption.”	projected	Ⅴ (opinion/argument)	frontier claim: not data, a research-program direction.

读这张表的纪律：把"已发生"的硬锚（Hao 4129.8 万篇是唯一的 Ⅱ 级文献计量）[R9]与"推演"的前沿命题（模式生物是 Ⅴ）分开记账，是这卷不被一次性证伪的关键。Sakana 过评审是真信号，但它是侧证（Ⅳ 一手 + Ⅲ 独立）[R7]，不是因果实验；GNoME 的 220 万材料是一手成果，[R16]但"绝大多数是已知结构内替换"这条限定，才是它对命题的真正贡献。最锋利的一条是太阳系模型：它把"预测准"和"理解对"彻底分开了——AI 能把轨道预测到任意精度，却从未在内部表示里长出"引力"，因为它的目标函数只奖励预测误差，从不奖励"这个变量是不是对的描述层级"。

The discipline for reading this table: booking the “happened” hard anchor (Hao’s 41.3M is the only grade-Ⅱ bibliometrics) [R9] separately from the “projected” frontier claim (the model organism is Ⅴ) is what keeps this volume from being falsified in one shot. Sakana passing review is a real signal, but it is side-evidence (Ⅳ first-hand + Ⅲ independent) [R7], not a causal experiment; GNoME’s 2.2M materials are a first-hand result [R16], but the qualifier “the vast majority are substitutions inside known structures” is its true contribution to the thesis. The sharpest row is the solar-system model: it cleanly separates “predicting accurately” from “understanding correctly.” AI can predict orbits to arbitrary precision yet never grows a “gravity” representation internally, because its objective rewards only prediction error and never “is this variable even the right level of description.”
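The three tags per row translate directly into a small data structure. The sketch below encodes the six rows of the table above; the class and function names are hypothetical, and the grades are written in ASCII (“II” for Ⅱ) purely for convenience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    name: str
    tense: str   # "happened" | "happening" | "projected"
    grade: str   # evidence grade, I (strongest) to V (opinion/argument)
    stance: str  # relation to the thesis


LEDGER = [
    Case("Sakana AI Scientist-v2", "happening", "IV + III", "supports"),
    Case("DeepMind GNoME", "happened", "II-III", "supports+qualifies"),
    Case("AI Feynman", "happened", "III", "supports+qualifies"),
    Case("Hao et al. bibliometrics", "happened", "II", "strongly supports"),
    Case("Solar-system foundation model", "happened", "IV", "supports"),
    Case("Meta-science model organism", "projected", "V", "frontier claim"),
]


def by_tense(ledger, tense):
    """Names of all cases booked under one tense."""
    return [case.name for case in ledger if case.tense == tense]


def hard_anchors(ledger):
    """Cases that both happened and carry an unqualified grade II."""
    return [c.name for c in ledger if c.tense == "happened" and c.grade == "II"]


print(hard_anchors(LEDGER))
print(by_tense(LEDGER, "projected"))
```

Keeping tense and grade as separate fields is the point: a query for hard anchors can never silently return a projection.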

RES

CONTEXT · 知识图谱即护栏

THE GRAPH AS GUARDRAIL

重画 · 规格

Redraw · Spec

知识图谱即护栏——让海量生成留在可追溯的结构里

The knowledge graph as guardrail: keeping mass generation traceable

海量生成怎么不塌成噪声？给它一套先写好的规格。

How does mass generation avoid collapsing into noise? Give it a spec written in advance.

一句话In one line

可追溯证据库是先于生成、约束生成的"规格"，不做事后仓库：次序颠倒，得到的就是无法整合的垃圾山。A traceable evidence base is the “spec” that precedes and constrains generation, not an archive for finished results: reverse the order and you get an un-integratable garbage mountain.

赢的是底层四属性，不是某个图谱工具。凡满足这四条的证据载体都被放大，凡是锁在 PDF 截图、私有数据库、不可追溯综述里的都被边缘化——和工程那五条贯穿原理同源：

What wins is not a particular graph tool but four underlying properties. Any evidence carrier that meets all four gets amplified; anything locked in PDF screenshots, proprietary databases, or untraceable reviews gets marginalized. The properties share a source with engineering’s five through-lines:

证据库是规格，不是事后归档。最常见的把知识图谱用错的方式，是把它当成研究做完之后放结果的仓库，先生成、再归档。这恰恰颠倒了承重的次序。在 AI-Native 研究里，可追溯证据库是生成的规格，它必须先于生成存在并约束生成：你先在库里写下"什么算可信证据、什么算冲突、什么必须可追溯到原始数据"，生成层才有一个明确的靶子去对齐。这和工程"上下文即基设、先于实现"是同一句话——上下文不是给模型的补充材料，它定义了什么算"做对了"。

次序一旦颠倒，问题立刻显形：先让 agent 批量产出再想办法归档，你会得到一堆格式各异、来源残缺、彼此矛盾却无人察觉的主张——一座无法整合的垃圾山，清理它的成本远超当初省下的生成成本。先立库、后生成，库就成了一道实时护栏：无来源的当场被拦、与既有证据冲突的当场被标记、不可追溯的根本进不来。这就是为什么 RES 13 的研究环把"①框定（先立证据库）"放在"②生成"之前——这是命题本身，不是流程洁癖。

The evidence base is the spec, not after-the-fact archiving. The most common way to misuse a knowledge graph is to treat it as a warehouse for results after the research is done: generate first, archive later. That inverts the load-bearing order. In AI-Native research, the traceable evidence base is the spec for generation; it must exist before generation and constrain it: you write into the base first what counts as credible evidence, what counts as a conflict, what must be traceable to raw data, and only then does the generation layer have a clear target to align to. This is the same line as engineering’s “context is infrastructure, prior to implementation”: context is not supplementary material for the model; it defines what counts as “done right.”

Invert the order and trouble surfaces at once: batch-generate first and archive later, and you get a heap of claims in varied formats, with broken provenance, mutually contradictory yet unnoticed: an un-integratable garbage mountain whose cleanup cost far exceeds the generation cost you saved. Stand up the base first and the base becomes a real-time guardrail: the sourceless is blocked on the spot, conflicts with existing evidence are flagged on the spot, the untraceable never enters. This is why RES 13’s loop puts “① FRAME (stand up the base)” before “② GENERATE”: not process fastidiousness, but the thesis itself.
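A minimal sketch of “base first, generation second,” assuming the spec is just two admission rules written before any claim is generated: a claim needs at least one source and a pointer to raw data. The `Claim` and `EvidenceBase` names, and the identifiers in the example, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Claim:
    statement: str
    sources: List[str] = field(default_factory=list)
    raw_data_ref: Optional[str] = None


class EvidenceBase:
    """Admission rules exist before generation; claims are checked on entry."""

    def __init__(self, require_raw_data=True):
        self.require_raw_data = require_raw_data
        self.admitted = []
        self.blocked = []

    def submit(self, claim):
        reasons = []
        if not claim.sources:
            reasons.append("no source")
        if self.require_raw_data and not claim.raw_data_ref:
            reasons.append("not traceable to raw data")
        if reasons:
            self.blocked.append((claim, reasons))  # kept with reasons, not discarded
            return False
        self.admitted.append(claim)
        return True


base = EvidenceBase()
results = [
    base.submit(Claim("Effect A holds", ["paper-A"], "dataset-001")),
    base.submit(Claim("Effect B holds")),               # sourceless
    base.submit(Claim("Effect C holds", ["paper-C"])),  # sourced but untraceable
]
print(results, [reasons for _, reasons in base.blocked])
```

Because the rules run at submission time, the cost of a bad claim is paid once at the gate instead of later during cleanup.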

对 agent 可读：主张、证据、来源是结构化、可被模型直接读写的节点，不是只能人读的散文。
Legible to agents: claims, evidence, and sources are structured nodes a model reads and writes directly, not prose only a human can parse.
可追溯：每条主张挂着它的证据边，能回到原始数据/论文，能被独立复现追踪。
Traceable: every claim carries its evidence edges: back to raw data/papers, trackable for independent replication.
可证伪：相互矛盾的主张在图里显形为冲突边，而不是被淹没在两篇互不引用的论文里。
Falsifiable: contradictory claims surface as conflict edges in the graph rather than drowning in two papers that never cite each other.
可整合：跨领域的主张能被缝合、比对、综合，为下一张"整合而非检索"做基设。
Integrable: claims across fields can be stitched, compared, synthesized: the infrastructure for the next section, “integration, not retrieval.”

检验信号 / 同一招Test signal / isomorphism

新主张一次就落进可追溯链的比例升、无来源主张被自动拦下的比例升。知识图谱 ↔ 设计系统 ↔ 架构边界是同一招。More newly generated claims land in the traceable chain on the first pass; more “sourceless/untraceable” claims are auto-blocked. The knowledge graph ↔ the design system ↔ architecture boundaries are one move.

结构成因 · 旧 / 新：旧世界里证据散在 PDF 截图、私有数据库、互不引用的综述里，整合靠人脑里偶然的连接；新世界里满足"对 agent 可读 / 可追溯 / 可证伪 / 可整合"四属性的证据载体被放大，其余被边缘化——不是因为某个图谱工具赢了，是因为这四条属性本身就是"让海量生成不塌成噪声"的最小充分条件。它和工程那五条贯穿原理同源：能被模型直接读写、能回到原始出处、矛盾能显形、能跨域缝合。任何缺一条的载体，在生成充裕的世界里都会被自然选择淘汰。

Structural cause · old / new: in the old world evidence scatters across PDF screenshots, proprietary databases, and reviews that never cite each other, and integration relies on accidental connections in a human head; in the new world any evidence carrier meeting the four properties — “legible to agents / traceable / falsifiable / integrable” — gets amplified and the rest marginalized, not because some graph tool won, but because those four properties are themselves the minimal sufficient condition for “mass generation not collapsing into noise.” They share a source with engineering’s five through-lines: readable/writable by models directly, traceable to origin, conflicts made visible, stitchable across fields. Any carrier missing one is selected against in a world of abundant generation.

四属性里，"可证伪"是最容易被偷工的一条

Of the four properties, “falsifiable” is the one most easily skimped

四条属性里，"对 agent 可读"和"可追溯"是工具厂商最爱讲的，因为它们好演示、好卖；真正承重却最容易被偷工的，是可证伪。一个只追求"可读+可追溯"的证据库，会变成一个高效的赞同机器：它能把一万条彼此一致的主张整整齐齐地存好、查好，却从不让相互矛盾的两条主张正面相撞。可证伪的具体形态，是让冲突显形为图里的一条冲突边，而不是任由它淹没在两篇互不引用的论文里——一篇说 X、一篇说非 X，各自有引用、各自"可追溯"，系统却从不报告它们矛盾。当生成把主张推到每小时上千条，这种沉默的矛盾会指数级累积，最后你拥有的是一个自洽的幻觉，而不是知识库。所以建证据库时，"冲突检测"不是锦上添花的高级功能，它是把"库"和"堆"区分开的那条线。

Of the four properties, “legible to agents” and “traceable” are the ones tool vendors love to talk about, because they demo well and sell well; the truly load-bearing yet most easily skimped is falsifiable. An evidence base that chases only “legible + traceable” becomes an efficient agreement machine: it can neatly store and query ten thousand mutually consistent claims yet never let two contradictory claims collide head-on. The concrete form of falsifiable is making a conflict visible as a conflict edge in the graph, rather than letting it drown in two papers that never cite each other. One says X, one says not-X, each cited, each “traceable,” yet the system never reports that they contradict. When generation pushes claims to thousands an hour, such silent contradictions accrue exponentially, and what you end up with is not a knowledge base but a self-consistent illusion. So when building the base, “conflict detection” is not a nice-to-have advanced feature; it is the line that separates a “base” from a “pile.”
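What a conflict edge looks like can be sketched in a few lines. The hard part in practice is deciding that two claims are about the same proposition; the sketch assumes that normalisation is already done and each claim arrives with a shared `proposition` key. All identifiers are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Assertion:
    claim_id: str
    proposition: str  # normalised key; producing it is assumed, not shown
    holds: bool       # True = asserts X, False = asserts not-X
    source: str


def conflict_edges(assertions):
    """Return (claim, claim, proposition) for every pair asserting X and not-X."""
    edges = []
    for a, b in combinations(assertions, 2):
        if a.proposition == b.proposition and a.holds != b.holds:
            edges.append((a.claim_id, b.claim_id, a.proposition))
    return edges


assertions = [
    Assertion("c1", "treatment_reduces_marker", True, "paper-A"),
    Assertion("c2", "treatment_reduces_marker", False, "paper-B"),
    Assertion("c3", "marker_predicts_outcome", True, "paper-C"),
]

edges = conflict_edges(assertions)
print(edges)
```

Both c1 and c2 are individually sourced and traceable; only the pairwise check makes their disagreement a visible object in the base.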

RES

REDRAW · 整合而非检索

INTEGRATION, NOT RETRIEVAL

推至极限 · 整合鸿沟

Pushed to the limit · The integration gap

人不可外包的稀缺动作，是整合，不是检索

The human’s un-outsourceable scarce act is integration, not retrieval

AI 把论文产到近无限，读得完吗？瓶颈其实在别处。

AI produces papers toward the infinite: can you read them all? The bottleneck is elsewhere.

一句话In one line

检索是找到已存在的那一条，会被向量库充裕到近免费；整合是把从未并置的几条缝成新理解——带宽问题，不是存量问题。Retrieval is finding the one item that already exists, and a vector store can make it abundant to the point of near-free; integration is stitching never-juxtaposed claims into a new understanding, which is a bandwidth problem, not a stock one.

生产速度与可消化速度的剪刀差，是整合升值的根

The scissors-gap between production and digestion is the root of integration’s rising value

为什么整合会从"研究的一个环节"升值成"人最稀缺的贡献"？因为两条曲线的剪刀差正在张开。科学文献的生产侧本就在指数增长——约 250 万篇/年、每 9 年翻倍；AI 在其上又叠了一层质变加速，把"产出一篇"的成本压到近零。可消化侧呢？人类的认知带宽近乎恒定：一个研究者一天能真正读懂、判断、并入自己理解结构里的论文数，几十年没变多少。生产曲线陡升、消化曲线平躺，两者之间的剪刀差就是"已生成但无人整合"的知识堆积量，它在以生产曲线的速率累积。这道剪刀差不是某个工具能补的，因为补它需要的恰恰是带宽，而带宽是被锁住的那一侧。所以整合的升值是结构性的：在一个生产近无限、消化近恒定的世界里，唯一还在涨价的，就是那个能把碎片缝成理解的认知主体的注意力。〔此为命题推演＋数量级侧证，"堆积速率"缺一手实证，标待坐实〕[thesis-derivation + order-of-magnitude side-evidence; the “accrual rate” lacks first-hand empirics, flagged to be grounded]

Why does integration appreciate from “one step of research” into “the human’s scarcest contribution”? Because the scissors-gap between two curves is opening. The production side of the scientific literature was already growing exponentially: ~2.5 million papers a year, doubling every 9 years; AI has layered a qualitative acceleration on top, driving the cost of “producing one” to near-zero. And the digestion side? Human cognitive bandwidth is nearly constant: the number of papers a researcher can truly understand, judge, and fold into their structure of understanding in a day has barely grown in decades. With the production curve steep and the digestion curve flat, the scissors-gap between them is the stock of “generated but un-integrated” knowledge, accruing at the production curve’s rate. No tool fills this gap, because filling it requires precisely bandwidth, and bandwidth is the locked side. So integration’s appreciation is structural: in a world of near-infinite production and near-constant digestion, the one thing still rising in price is the attention of a cognitive subject that can stitch fragments into understanding. [thesis-derivation + order-of-magnitude side-evidence; the “accrual rate” lacks first-hand empirics, flagged to be grounded]
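The two figures quoted above (about 2.5 million papers a year, doubling every 9 years) are enough for a back-of-envelope version of the gap. The constant digestion rate D is an assumed symbol introduced here for illustration, and the derivation covers only the pre-AI baseline, so it is a sketch of the argument’s shape rather than a measurement.

```latex
% From the text: P_0 \approx 2.5 \times 10^{6} papers/yr, doubling time T = 9 yr.
% Assumed for illustration: D, a constant digestion rate; S(t), the un-integrated stock.
\begin{align*}
P(t) &= P_0 \, 2^{t/T}
  &&\Rightarrow\ \text{annual growth} = 2^{1/9} - 1 \approx 8.0\% \\
S(t) &= \int_0^{t} \bigl(P(\tau) - D\bigr)\, d\tau
      = \frac{P_0 T}{\ln 2}\bigl(2^{t/T} - 1\bigr) - D\,t \\
\frac{dS}{dt} &= P(t) - D \;\approx\; P(t)
  &&\text{once } P(t) \gg D \\
\int_0^{T} P(\tau)\, d\tau &= \frac{P_0 T}{\ln 2} \approx 12.98\, P_0
  \approx 3.2 \times 10^{7} \ \text{papers per doubling period}
\end{align*}
```

The third line is the formal version of “accruing at the production curve’s rate”: with D flat, the stock’s growth is set almost entirely by production.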

FIG. 7.0 / 剪刀差：产出曲线陡升，消化曲线平躺，张开的口就是"已生成但无人整合"的存量THE SCISSORS GAP: PRODUCTION CURVE STEEP, DIGESTION CURVE FLAT — THE WIDENING MOUTH IS THE UN-INTEGRATED STOCK看懂：两条曲线从同一点出发——产出曲线随 AI 指数上扬，消化曲线（人的认知带宽）几乎平。它们之间的阴影口在以产出速率累积，那就是"已生成却没人缝进理解"的知识山；整合升值，是因为这道口只能靠带宽补，而带宽是被锁住的那侧。Read: both curves start at one point: production rises exponentially with AI, digestion (human cognitive bandwidth) stays nearly flat. The shaded mouth between them accrues at the production rate; that is the mountain of “generated but never stitched into understanding.” Integration appreciates because only bandwidth fills this mouth, and bandwidth is the locked side.

这张图把"整合升值"从口号变成几何：两条曲线同源出发，产出随 AI 指数离开，消化几乎贴着横轴爬——这两条曲线的走势有文献计量支撑。它们之间张开的口，本卷读作结构性存量（虚线边框标出：具体堆积速率是命题推演，尚缺一手实证，见图内证据级）："已生成却没被任何人缝进理解"的知识，以产出曲线的速率堆积。补这道口需要的恰恰是认知带宽，而带宽是被锁死的那条平线。所以唯一还在涨价的，是能把碎片缝成理解的注意力。The figure turns “integration appreciates” from slogan into geometry: two curves leave one origin, production departing exponentially with AI, digestion crawling along the axis — the two curves’ trends are backed by bibliometrics. The mouth opening between them, this volume reads as a structural stock (marked with a dashed border: the specific accrual rate is thesis-derivation, still lacking first-hand empirics, see the in-figure grade): knowledge “generated yet stitched into no one’s understanding,” accruing at the production curve’s rate. Filling the mouth requires precisely cognitive bandwidth, and bandwidth is the locked flat line. So the one thing still rising in price is the attention that can stitch fragments into understanding.

推至极限：检索是"找到已存在的那条"——可被向量库 + agent 充裕化到近免费。整合是"把从未被并置的几条缝成一个新理解"，它要求一个能同时持有多个框架、判断哪些该缝、缝出的东西是否成立的认知主体。当 AI 每小时新增成千上万条可检索主张，"已生成但无人整合"的知识会堆积成山。研究者的杠杆，是站在这座山顶做综合，而不是在山脚多搬几块砖。

Pushed to the limit: retrieval is “find the one that already exists,” and a vector store plus an agent can make it abundant to the point of near-free. Integration is “stitch several never-juxtaposed claims into a new understanding”: it demands a cognitive subject that can hold several frames at once, judge which to stitch, and judge whether the stitch holds. As AI adds thousands of retrievable claims an hour, “generated but un-integrated” knowledge piles into a mountain. The researcher’s real leverage is to synthesize from the summit, not to haul a few more bricks at the base.

检索 · 会被充裕化Retrieval · made abundant

"在已存在的知识里找到相关的那条"——最近邻、向量搜索、RAG。AI 的强项，趋近免费。

“Find the relevant one inside existing knowledge”: nearest-neighbor, vector search, RAG. AI’s strength, trending to free.

整合 · 仍然稀缺Integration · still scarce

"把跨领域、从未并置的主张缝成新理解"，要求同持多框架、判断该缝什么、缝出的是否成立。这是带宽不是存量的问题。

“Stitch cross-field, never-juxtaposed claims into new understanding”: holding several frames, judging what to stitch and whether it holds. A bandwidth problem, not a stock one.

〔探索清单·待坐实〕"整合鸿沟急剧扩大"目前是命题推演＋一篇讲"怎么算知道"的综述侧证（《The epistemic revolution of AI》直指"知识生产速度超出单一人类认知"），尚无单篇一手实证锚量化"已生成未整合"的堆积速率；标先行指标：整合产物（综述/理论缝合）相对原始产出的比率，以及"已生成但无人整合"的知识堆积量。证伪条件：若自动综述系统能在专家盲评下产出被判"真正缝合而非拼贴"的整合，则整合的人类专属性松动。

[exploratory · to be grounded] “The integration gap explodes” is for now thesis-derivation plus epistemology-review side-evidence (The epistemic revolution of AI points directly at “the rate of knowledge production outpacing single-human cognition”). It has no single first-hand empirical anchor quantifying the accrual rate of “generated-but-un-integrated”; leading indicators: the ratio of integration artifacts (reviews / theory-stitching) to raw output, and the stock of “generated but un-integrated” knowledge. Falsifier: if automated review systems produce, under blinded expert judging, integrations rated “genuinely stitched, not collaged,” the human-exclusivity of integration weakens.

缝合是一种换框架的动作，不是更长的检索

Stitching is not longer retrieval; it is a paradigm-level act

最容易把整合贬低成"高级检索"的误解是：以为只要把检索窗口拉得足够长、把上下文塞得足够满，模型就能"读完一切"然后自动综合。这混淆了两个种类不同的动作。检索是在同一个框架内找到相关条目，它有标准答案、有最近邻、可被向量库充裕化。缝合是跨框架地判断"这几条本来不在一起的主张，并置之后是否生出一个新理解"，它没有最近邻可循，因为"该把哪几条放在一起"这个判断本身就在框架之外。这正是 RES 02 那道可验证性梯度的右段：缝合往往要求一次小型的换框架动作——决定用哪个新框架来组织这些碎片。所以整合不是检索的延长线，它和"提一个换框架的问题"是同一种稀缺判断的两种表现。

The easiest way to demote integration to “advanced retrieval” is to assume that a long-enough retrieval window and a full-enough context will let the model “read everything” and synthesize automatically. This conflates two different kinds of act. Retrieval finds relevant items within one frame: it has a right answer, a nearest neighbor, and can be made abundant by a vector store. Stitching judges across frames whether “these claims that were never together generate a new understanding once juxtaposed”: it has no neighbor to follow, because “which ones to put together” is itself a judgment outside any frame. This is precisely the right end of RES 02’s verifiability gradient: stitching often demands a small paradigm-level act, deciding which new frame organizes the fragments. So integration is not retrieval extended; it and “posing a paradigm-level question” are two expressions of the same scarce judgment.

带宽问题，不是存量问题，这一区分有操作后果。如果整合是存量问题（"读得不够多"），解法就是让 AI 多读、多产摘要，把存量补齐。但它是带宽问题（"同时持有多个框架并判断如何缝合的认知容量有限"），那么多产摘要只会让堆积更快，让带宽更紧。操作上的差别巨大：把工时投回"让 AI 多读多摘"，是在加速病因；把工时投回"让人专注做跨框架综合，AI 只负责把候选材料喂到手边"，才是对症。所以 RES 13 的研究环把④整合设为承重瓶颈阀门——不是因为整合"重要"这种空话，而是因为它是唯一不能靠多产来缓解、反而会被多产加重的环节。盯住一个具体指标：整合产物（真正缝出新理解的综述/理论）相对原始产出的比率，是升还是降。

A bandwidth problem, not a stock problem, and this distinction has operational consequences. If integration were a stock problem (“haven’t read enough”), the fix would be to have AI read more and summarize more, filling the stock. But it is a bandwidth problem (“limited cognitive capacity to hold several frames at once and judge how to stitch”), so producing more summaries only makes the pile grow faster and tightens bandwidth further. The operational difference is large: reinvesting hours into “let AI read and summarize more” accelerates the cause; reinvesting them into “let humans focus on cross-frame synthesis while AI only feeds candidate material to hand” treats it. This is why RES 13’s loop sets ④ integration as the load-bearing bottleneck valve: not because integration “matters,” which is an empty phrase, but because it is the one step that cannot be relieved by more output and is in fact worsened by it. Watch one concrete metric: whether the ratio of integration artifacts (reviews/theory that genuinely stitch a new understanding) to raw output is rising or falling.
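The metric in the last sentence is simple to compute once someone has done the hard part, which is labelling which outputs count as genuine integration artifacts. The sketch below assumes that labelling exists and uses hypothetical per-period counts; the least-squares slope is one illustrative way to read “rising or falling.”

```python
def integration_ratio(integration_artifacts, raw_outputs):
    """Integration artifacts per unit of raw output in one period."""
    if raw_outputs <= 0:
        raise ValueError("raw_outputs must be positive")
    return integration_artifacts / raw_outputs


def trend(ratios):
    """Least-squares slope of the ratio across equally spaced periods."""
    n = len(ratios)
    if n < 2:
        raise ValueError("need at least two periods")
    mean_x = (n - 1) / 2
    mean_y = sum(ratios) / n
    num = sum((i - mean_x) * (r - mean_y) for i, r in enumerate(ratios))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Hypothetical counts per period: (integration artifacts, raw outputs).
periods = [(40, 1000), (42, 2000), (45, 4000)]
ratios = [integration_ratio(a, r) for a, r in periods]
slope = trend(ratios)

print(ratios, "falling" if slope < 0 else "rising or flat")
```

In the toy numbers the count of integration artifacts goes up every period while the ratio falls, which is the pattern the bandwidth reading predicts when raw output is scaled without scaling synthesis.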

可复现是那道墙：错误不该被删，该被回流。整合的反面，是把无法整合的东西悄悄丢掉。当生成每小时新增成千上万条主张，最省事的处理是把"复现不出来的""与既有库冲突的""看起来离群的"统统标记为噪声删掉，这恰恰是最危险的动作。可复现性在这卷里不是一个质检环节，它是那道墙：它把"会自我纠偏的研究环"和"高速生成器"分开。一条主张复现不出来，有两种可能，它是错的（该证伪），或它揭示了现有评估方法的盲区（该建新 eval）。把它当噪声删掉，你两种信息都丢了；把它回流成证据库里的一条新规则、一个新节点类型、或一道新的复现检查，错误就变成了护栏，下一轮少犯。下面这张图把这条"错误回流"画成闭环。

Reproducibility is the wall: errors should not be deleted, they should feed back. The opposite of integration is quietly discarding what cannot be integrated. When generation adds thousands of claims an hour, the easiest handling is to tag everything that “fails to replicate,” “conflicts with the existing base,” or “looks like an outlier” as noise and delete it, which is precisely the most dangerous move. Reproducibility in this volume is not a QC step; it is the wall that separates “a self-correcting research loop” from “a fast generator.” A claim that fails to replicate admits two possibilities: it is wrong (falsify it), or it reveals a blind spot in current evaluation methods (build a new eval). Delete it as noise and you lose both kinds of information; feed it back as a new rule, a new node type, or a new replication check in the evidence base, and the error becomes a guardrail that makes the same mistake less likely next round. The figure below draws this “error feedback” as a closed loop.

FIG. 5.0 / 复现之墙：错误回流成新的评估与护栏THE WALL OF REPRODUCIBILITY: ERRORS FEED BACK AS NEW EVALS看懂：一条主张撞墙后不是被删，是分流——错的去证伪库，揭示盲区的去"造新 eval"，两条都回流成下一轮的护栏。Read: a claim hitting the wall is not deleted but routed: the wrong goes to the refutation base, the blind-spot-revealing goes to “build a new eval”; both feed back as next round’s guardrail.

复现把生成分成三流，而最有价值的不是"通过"那一流，是"揭示盲区"那一流，它催生新的评估方法，正是 RES 11 给护栏留的"换变量口子"在运行时的样子。一个只删噪声、不回流错误的系统，是开环的高速生成器；一个把每次撞墙都回流成新规则的系统，才是会自我纠偏的研究环。这与工程"错误回流成新测试/新护栏"完全是同一招。Replication splits generation into three streams, and the most valuable is not the “passes” stream but the “reveals a blind spot” stream: it spawns new evaluation methods, which is exactly what RES 11’s “change-the-variable door” looks like at runtime. A system that only deletes noise and never feeds errors back is an open-loop fast generator; a system that feeds every wall-hit back as a new rule is a self-correcting research loop. This is fully isomorphic to engineering’s “errors feed back as new tests / new guardrails.”
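The three-stream routing described above can be sketched as a small data structure. This is an illustrative sketch only: the `Outcome` and `EvidenceBase` names and their fields are hypothetical, and the decision between "wrong" and "reveals a blind spot" is assumed to be supplied from outside by a human or a separate review step. The one property the sketch tries to make visible is that no branch deletes a claim.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    REPLICATED = "replicated"
    REFUTED = "refuted"        # the claim is wrong
    BLIND_SPOT = "blind_spot"  # the claim exposes a gap in current evals


@dataclass
class EvidenceBase:
    """Hypothetical evidence base with one list per stream."""
    accepted: list[str] = field(default_factory=list)
    refutations: list[str] = field(default_factory=list)
    new_eval_backlog: list[str] = field(default_factory=list)

    def route(self, claim: str, outcome: Outcome) -> str:
        """File a claim by outcome and return where it went; nothing is discarded."""
        if outcome is Outcome.REPLICATED:
            self.accepted.append(claim)
            return "accepted"
        if outcome is Outcome.REFUTED:
            self.refutations.append(claim)
            return "refutation base"
        if outcome is Outcome.BLIND_SPOT:
            self.new_eval_backlog.append(claim)
            return "new-eval backlog"
        raise ValueError(f"unknown outcome: {outcome!r}")

    def total(self) -> int:
        """Number of claims retained across all three streams."""
        return len(self.accepted) + len(self.refutations) + len(self.new_eval_backlog)
```

An open-loop generator would correspond to a `route` that drops the last two outcomes; here the count returned by `total()` always equals the number of claims routed.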

RES

REDRAW · 从何为真退守到何为值得知

FROM TRUE TO WORTH KNOWING

断裂点 · 反转 → 接创新

The break · Reversal → to Innovation

研究的终极问题，从"怎么发现真相"翻转为"哪个真相值得知道"

Research’s ultimate question flips from “how to find truth” to “which truth is worth knowing”

提问也被充裕之后，人还剩什么稀缺？这一章走到底。

Once questioning too is made abundant, what scarcity remains for humans? This chapter goes all the way down.

一句话In one line

提问也被充裕后，终极问题翻进价值论的"哪个真相值得知道"，无可机检对错，只有"对谁、哪个价值框架"的归属。Once questioning too is made abundant, the ultimate question flips into axiology’s “which truth is worth knowing”: no machine-checkable right answer, only belonging to “whom, under which value frame.”

提问也被充裕之后，人退到最后一个问题：该追哪个真相。它看着还是研究，其实已经不在原来的坐标系里了。前面每一步问的都是“这答案对不对”——有证据可摆、可机检、机器迟早追上；而“该追哪个”问的是“这值不值得知道、对谁值得”，它没有一个对所有人都成立的正确答案。同一台再准的 AI 也答不了后者，不是因为它还不够强，而是因为后者根本不在“对不对”那根轴上。坐标系就换了。

Once questioning too is made abundant, the human retreats to one last question: which truth to chase. It still looks like research, yet it no longer lives in the old coordinate system. Every earlier step asked “is this answer correct”: evidence can be laid out, a machine can check it, and the machine catches up eventually. But “which one to chase” asks “is it worth knowing, and worth to whom”: a question with no answer correct for everyone. However accurate the AI, it cannot settle the latter, not because it is not yet strong enough but because the latter simply is not on the “correct-or-not” axis. The coordinate system has changed.

FIG. 6.1 / 坐标系翻转：从认识论的"对不对"跌进价值论的"值不值得知"THE PIVOT: FROM EPISTEMOLOGY’S “IS-IT-CORRECT” INTO AXIOLOGY’S “IS-IT-WORTH-KNOWING”看懂：左轴是认识论，有对错、可机检、终将自动化，前线一路右移。但研究的承重问题不在这根轴上：它垂直跌进另一根轴——价值论，问"哪个真相值得知道"。换轴不是换位置；新轴没有对错，只有"对谁、在哪个价值框架下值得"，这正是 AI 无法机检、人接住的那一格。Read: the left axis is epistemology: right/wrong, machine-checkable, eventually automated, frontier sliding right. But research’s load-bearing question is not on that axis: it drops perpendicular onto another, axiology, asking “which truth is worth knowing.” Pivoting axes is not moving position; the new axis has no right/wrong, only “worth to whom, under which value-frame”: exactly the cell AI cannot machine-check and a human holds.

瓶颈搬家（执行→验证）始终在同一根横轴上滑动，都在问"对不对"，所以终将被自动化追上。这一步不同：承重的问题垂直跌进另一根轴。新轴上没有对错，只有归属，一个真相"对谁、在哪个价值框架下值得知道"。机器能把第一根轴的前线一路推到右端，却到不了第二根轴，因为那里没有可机检的判据。研究最坚固的守地，就是这个换轴动作本身。Bottleneck-moves (execution→verification) slide along one horizontal axis, all asking “is it correct,” so automation eventually catches them. This step is different: the load-bearing question drops perpendicular onto another axis. On the new axis there is no right/wrong, only belonging: whether a truth is “worth knowing, to whom, under which value-frame.” A machine can push the first axis’s frontier all the way right yet never reach the second, because no machine-checkable criterion lives there. Research’s most durable ground is the axis-switch itself.

先看一个具体的：对一个把"延长健康寿命"当头等大事的群体，"衰老的分子机制"太值得知道了；对一个把"生态完整"当头等大事的群体，同一笔预算也许更该投去别处。两边都没"判错"——因为"值得"是相对某个价值框架被定出来的，不是被发现的。这一步的特殊就藏在这里：翻到"值不值得知道"这根新轴上，AI 再准，也给不出一个对所有价值框架都对的答案，因为那样的答案在逻辑上不存在。下面把它为什么是一道人接住的判断、而非又一种能力，一层层说清。

Start concrete: to a group that holds “extending healthy lifespan” as its first good, “the molecular mechanism of aging” is well worth knowing; to a group that holds “ecological integrity” as its first good, the same budget might better go elsewhere. Neither “judged wrong”: “worth” is defined relative to a value frame, not discovered. That is where this step is special: once you pivot onto the “worth-knowing” axis, no matter how accurate AI gets, there is no answer correct for all value frames, because logically no such answer exists. What follows unpacks, layer by layer, why this is not one more capability but a judgment a human holds.

异质的"值得"学不到，因为它的样本只有一个

The heterogeneous “worth” cannot be learned, because its sample size is one

为什么 AI 学得到平均的"值得"，学不到异质的"值得"？根子在学习的前提：任何被学的东西，都要有足够多的、可被归纳的样本。平均的"值得"（一个领域里被反复表达、被大量论文共同认可的价值取向）有海量样本，所以可被外化、可被 RLCF 当 reward 学走。但真正承重的那种"值得"，是只对某个个体、某个群体、在某个特定价值框架下才成立的，它的样本量是一。一个研究者基于自己独特的处境、经历、所属共同体的关切，判断"这个真相对我们值得追"，这个判断没有一个可被归纳的训练集，因为它死死绑定在那个独一无二的视角上。

这正好对应 RES 08 讲的同质化机制的镜像：AI 默认拉向均值（regression to a domain prototype），而异质的价值判断按定义就是偏离均值的那部分。偏离均值的"值得"能不能被学走？本卷押"不能"——赌的是它样本量为一、统计学习要的可归纳性在此落空，不是赌模型还不够强。这是押注不是定理：最强的反方就在下面那张图里——RLCF 已能学走"科学品味的社群均值"（arXiv:2603.14473），能否够到偏离均值的前沿价值，至今没有直接实验。若某天 AI 自选、偏离共识的议程能长期稳定长出被独立复现的新框架，这块守地就得让出来。

Why can AI learn the average “worth” but not the heterogeneous “worth”? The root is learning’s premise: anything learned needs enough inducible samples. The average “worth” — a value orientation repeatedly expressed in a field, jointly endorsed by many papers — has vast samples, so it can be externalized and learned by RLCF as reward. But the truly load-bearing kind of “worth” is the one that holds only for a particular individual, a particular group, under a particular value frame: its sample size is essentially one. When a researcher, drawing on their unique situation, history, and community’s concerns, judges “this truth is worth chasing for us,” that judgment has no inducible training set, because it is constitutively bound to that one-of-a-kind vantage point.

This mirrors the homogenization mechanism of RES 08: AI pulls toward the mean by default (regression to a domain prototype), and a heterogeneous value judgment is by definition the part that departs from the mean. Can the off-mean “worth” be learned? This volume bets it cannot. The wager is that, with a sample size of one, the inducibility that statistical learning requires has nothing to grip; it is not a wager that the model is merely too weak for now. This is a bet, not a theorem: its strongest rival sits in the figure below, where RLCF can already learn “the community mean of scientific taste” (arXiv:2603.14473), while whether it can reach off-mean frontier value has no direct experiment yet. If one day an AI’s self-chosen, off-consensus agenda reliably and durably yields new frames that are independently replicated, this ground has to be ceded.

换的是轴，不是轴上的位置——内核②那条"只能人判"的支，就落在这里。

The axis changes, not the position on it: the kernel’s “human-only” branch of fork ② lands right here.

"值得"由人定义，不是又一种能力

“Worth” is a constitutive stipulation, not one more capability

把"哪个真相值得知道"读成"人比 AI 更会判断价值的一种能力"，是最常见也最危险的误读：能力可以被超越，若"值得"只是能力，够强的模型迟早做得比人好，整卷的人本立论就只是暂时的。命题主张的不是这个："值得"是由人定义的，不是一种能力。区别在于，能力问"谁判得更准"，这里问的是"由谁来定义这件事算不算数"。"这个真相对我们值得知道"这句话里，没有一个外在的、可被更准的判断逼近的"正确答案"；它的真值由提出它的那个价值框架定出来。

Reading “which truth is worth knowing” as “a capability at which humans judge value better than AI” is the most common and most dangerous misreading: capabilities can be surpassed, so if “worth” were only a capability, a strong-enough model would eventually beat humans at it and the volume’s human argument would be only temporary. That is not what the thesis claims: “worth” is not a capability but a constitutive stipulation. The difference: a capability asks “who judges more accurately,” a constitutive matter asks “who gets to define whether this counts at all”. In the sentence “this truth is worth knowing to us” there is no external “right answer” that a more accurate judgment could approach; its truth value is constituted by the value frame that poses it.

反转可能（向上游汇入的承重）：研究的终极问题或许在于"为什么某些真相更值得知道"，而非"怎么发现真相"。这让研究卷天然向上汇入创新（价值发现）：研究在"知识边界上识别空白"，创新在"判断这空白指向的价值"。两者咬合处，正是"值得"这个词从"怎么算真"交到"什么值得"的那一刻。

The reversal (the load-bearing merge upward): research’s ultimate question may never have been “how to find truth” but “why some truths are more worth knowing”. This makes the research volume merge naturally upward into Innovation (value discovery): research spots gaps at the edge of knowledge, innovation judges the value those gaps point to. Their meshing point is exactly the moment the word “worth” passes from epistemology into axiology.

交棒锚 → 创新（价值发现）Hand-off anchor → Innovation (value discovery)

研究②的终点"哪个真相值得知道"就是创新的起点：谁有权说"这值得知道"、能否被无损系统化——创新的关键悬案。The endpoint of research’s step ② (“which truth is worth knowing”) is the starting point of Innovation: who has the standing to say “this is worth knowing,” and whether it can be losslessly systematized: innovation’s key open question.

问题选择的品味，是研究里最稀缺的判断

Taste in problem selection is research’s scarcest judgment

把"哪个真相值得知道"再往实操拉一格，它落地成一个具体动作：问题选择。Anthropic 2026 的"自主性阶梯"[R4]把研究 agent 的能力分级，结论很硬——最右端、最难自动化的一阶，恰是 research agenda selection（研究议程选择）。Claude 可以在"执行良定义的实验"上匹敌甚至超过熟练人类，但"选择该做哪些问题、哪些异常值得追、哪个诱人想法其实是死路"这件事，仍有明显差距。chenhaot 的"The Mirage of the AI Scientist"把这条形式化为：人类不可替代的角色是 Selector（选什么做）+ Evaluator（评质量/可信）：科学根本是一个资源分配问题，不是智能问题，产出更多不等于知识更多。Terence Tao 的那句话是同一件事的另一种说法："当想法生成的成本被压到近零，瓶颈就变成 verify / evaluate"，注意力成为知识经济里最稀缺的资源。

Pull “which truth is worth knowing” one notch toward the operational and it lands as a concrete act: problem selection. Anthropic’s 2026 “ladder of autonomy” [R4] grades a research agent’s capabilities, and its conclusion is blunt: the rightmost, hardest-to-automate rung is precisely research-agenda selection. Claude can match or exceed skilled humans at “executing a well-defined experiment,” but a clear gap remains in “choosing which problems to work on, which anomalies are worth chasing, which seductive idea is actually a dead end.” chenhaot’s “The Mirage of the AI Scientist” formalizes this: the irreplaceable human role is Selector (what to work on) + Evaluator (quality / credibility), because science is fundamentally a resource-allocation problem, not an intelligence problem, and producing more is not knowing more. Terence Tao’s line says the same thing in other words: “when the cost of idea generation is driven to near-zero, the bottleneck becomes verify / evaluate”; attention becomes the scarcest resource in the knowledge economy.

品味不是不可讨论的直觉，它是一道可分级的梯度。从"选一个有数据、有 benchmark、稳出论文的问题"（低品味，AI 已能做），到"选一个别人觉得无聊但你直觉有矿的问题"，到"选一个连提出来都需要换框架的问题"（高品味，AI 训练分布之外）。AI 能学到的是这道梯度的左半段：RLCF（用社群偏好当 reward）[R5]已经证明"科学品味的社群均值"可被外化、可被学。但它学到的恰恰是社群均值，而问题选择品味，在于偏离当前社群均值的那部分前沿价值，这正好是同质化研究系统化的东西。下面这张图把问题选择品味画成一道梯度，并标出 AI 能学到哪一段、学不到哪一段。

Taste is not mysticism; it is a gradable gradient. It runs from “pick a problem with data, a benchmark, and a steady paper yield” (low taste, AI already does it), through “pick a problem others find boring but your intuition says holds ore,” to “pick a problem you cannot even pose without changing the frame” (high taste, outside AI’s training distribution). What AI can learn is the left half of this gradient: RLCF (community preference as reward) [R5] has shown that “the community mean of scientific taste” can be externalized and learned. But what it learns is exactly the community mean, whereas real problem-selection taste lies in the frontier value that departs from the current community mean, which is precisely what homogenized research systematizes away. The figure below draws problem-selection taste as a gradient and marks which segment AI can learn and which it cannot.

FIG. 6.0 / 问题选择品味的梯度：AI 学得到均值，学不到偏离THE PROBLEM-SELECTION TASTE GRADIENT: AI LEARNS THE MEAN, NOT THE DEPARTURE看懂：横轴是品味从低到高；阴影区是 RLCF 可学的"社群均值"，最右端"偏离均值的前沿价值"在阴影之外——若把它也系统化，系统化的恰是同质化。Read: x-axis is taste low→high; the shaded band is the “community mean” RLCF can learn; the rightmost “frontier value departing from the mean” lies outside it: systematizing it would systematize homogenization.

这道梯度解释了一个看似矛盾的事实：科学品味可学（RLCF 已证），同时问题选择又是人最后的守地。两者不矛盾——可学的是社群均值（阴影区），守地的是偏离均值的前沿（最右那块）。危险在于：若一个组织把"可学的均值品味"误当成"全部品味"去系统化，它就在系统化同质化本身。这条分叉是研究卷向创新卷交棒时悬而未决的关键实验：RLCF 能不能学到"偏离当前社群均值"的前沿价值？〔证据：RLCF 能学社群偏好为 Ⅱ–Ⅲ；能否学反共识前沿尚缺直接实验，标为前沿〕This gradient explains an apparent paradox: scientific taste can be learned (RLCF showed it), yet problem selection is still the human’s last ground. No contradiction: what is learnable is the community mean (shaded band); what is held is the off-mean frontier (the rightmost block). The danger: if an organization mistakes “the learnable mean taste” for “all taste” and systematizes it, it systematizes homogenization itself. This fork is the key open experiment as research hands off to innovation: can RLCF learn frontier value that departs from the current community mean? [evidence: RLCF learning community preference is Ⅱ–Ⅲ; whether it learns anti-consensus frontier lacks a direct experiment, flagged as frontier]

FIG. 6.2 / 问题选择漏斗：充裕化逐层吃掉可机检的判断，沉到底的那一格才是稀缺判断THE PROBLEM-SELECTION FUNNEL: ABUNDANCE EATS EACH MACHINE-CHECKABLE LAYER; WHAT SETTLES AT THE BOTTOM IS THE SCARCE JUDGMENT看懂：一堆候选问题从上方倒进漏斗，每往下一层就有一个"为什么值得做"的判断被充裕化吃掉，有 benchmark 的、有数据的、社群已认可的，逐层被 AI 接管。漏到最底、谁都接不住的那一格，是"选一个连提出来都要换框架的问题"，这才是研究里最稀缺、最右端的判断。Read: candidate problems pour into the funnel; each layer down, one “why is this worth doing” judgment gets eaten by abundance — benchmarked, data-rich, community-sanctioned ones are taken over by AI layer by layer. The one cell that settles at the very bottom, that nothing automates, is “pick a problem that needs a new frame even to state” — research’s scarcest, rightmost judgment.

把"哪个真相值得知道"拉到实操，它落地成问题选择，而问题选择不是一个动作、是一个漏斗。充裕化从上往下逐层吃：先吃掉有 benchmark 的，再吃掉数据丰富、社群已认可的，连"别人无聊你有矿"那层也被 RLCF 勉强够到。真正漏到底、谁都接不住的，只剩"选一个连提出来都要换框架的问题"，它在训练分布之外，没有可归纳的样本。这一格不是因为难而稀缺，是因为从根上不可归纳而稀缺，所以它是人在研究面最坚固的守地。Pull “which truth is worth knowing” to the operational and it lands as problem selection, and problem selection is not one act but a funnel. Abundance eats top-down: first the benchmarked, then the data-rich and community-sanctioned, and even the “boring-to-others, ore-to-you” layer is barely reached by RLCF. What truly settles at the bottom, that nothing catches, is “pick a problem that needs a new frame even to state”: outside the training distribution, with no inducible sample. This cell is scarce not because it is hard but because it is constitutively non-inducible, which is why it is the human’s most durable ground in research.

RES

REDRAW · 谁为研究方向的价值负责

WHO OWNS THE DIRECTION

反转 → 接组织 · 人本主线

Reversal → to Org · the human through-line

价值判断一旦落地，就是"谁有权定方向"的治理问题

Once value judgment lands, it becomes a governance question of “who owns the direction”

"值得"定下来之后，谁签字负责？这一章落到治理。

Once “worth” is settled, who signs off for it? This chapter lands on governance.

一句话In one line

"哪个真相值得知道"落到现实就是治理：谁来判、谁签字担保——价值判断没有归属，就被生成层的默认偏置悄悄替换。“Which truth is worth knowing,” landed in reality, is governance: who judges, who signs to vouch; a value judgment with no owner gets quietly replaced by the generation layer’s default bias.

人回归意义，在研究面是把研究者还给值得追问的问题

On the research face, “humans return to meaning” means returning researchers to the questions worth asking

内核第④步"人回归意义"不是一句温情的收尾，它在研究面有一个非常具体、可检验的落点：把研究者从"多产论文"的跑步机上解放出来，还给那些值得追问、却不可度量的问题。这正是人本立论在研究卷的全部分量，更便宜的执行，从来不是目的本身。AI 把执行做便宜，目的不是让研究者在单位时间里产出更多论文（那只是把人更深地绑在 exploitation 的轮子上），而是让研究者腾出认知带宽，去做那个机器做不了、也最值得人做的动作：判断哪个真相值得知道、守住一个不被产量绑架的研究方向。一个把 AI 用对了的研究组织，它的研究者花在"追问、判断、整合"上的时间应该变多，而不是花在"赶产出"上的时间变多。

这条人本主线与组织卷"让人回归组织中心"是同一句话的两个面：组织面是让人回到判断节点，研究面是让人回到值得追问的问题。两者咬合处，是贯穿全系列的同一件事：把执行交出去省下的认知带宽，落回人手里去问、去判断、去负责，而不是反过来把人绑在产出指标上。

Kernel step ④, “humans return to meaning,” is not a sentimental closer; on the research face it has a very concrete, testable landing point: freeing researchers from the “more papers” treadmill and returning them to the questions worth asking yet unmeasurable. This is the full weight of the human argument in the research volume — cheaper execution is never the point in itself. AI makes execution cheap not so that researchers produce more papers per unit time (that only binds people deeper to the wheel of exploitation), but so that researchers free up cognitive bandwidth for the act the machine cannot do and that is most worth a human doing: judging which truth is worth knowing, holding a research direction not hijacked by output. In a research organization that uses AI rightly, researchers’ time on “asking, judging, integrating” should increase, not their time on “chasing output.”

This human through-line is two faces of one sentence with the Org volume’s “put people back at the organization’s center”: the org face returns people to the judgment node, the research face returns people to the questions worth asking. Their meshing point is the one thing that runs through the whole series: the cognitive bandwidth freed by handing execution away lands back in human hands — to ask, to judge, to take responsibility — rather than binding people to output metrics.

正交退守——补盲区：当一阶研究被充裕化，人退守的不止"值得知"，还有两个正交方向。其一，退守到设计科学本身：当跑实验近乎免费，"该跑哪个实验、用什么判据算证据、什么样的研究设计能真正区分假设"这套元层方法判断，反而升值。其二，退守到元科学：研究"研究本身怎么被 AI 改写"——RES 03 的 Nature 文献计量正是这一退守的产物。两条都是把判断节点抬高一层，不是"做得更快"。

Orthogonal retreats — filling the blind spot: when first-order research is made abundant, the human retreats not only to “worth knowing” but in two orthogonal directions. First, to the design of science itself: when running an experiment is near-free, the meta-level methodological judgment — “which experiment to run, what counts as evidence, what research design actually discriminates between hypotheses” — appreciates. Second, to meta-science: studying “how research itself is being rewritten by AI”; RES 03’s Nature bibliometrics is a product of exactly this retreat. Neither is “doing it faster”; both raise the judgment node one level.

生成层的保守偏置（再钉一遍：加速 ≠ 进步）：RES 02 的提问聚集、RES 03 的主题收缩与新颖被压分，合起来指向同一件事：生成层默认朝向"安全、数据丰富、与现行框架一致"的方向加速。它让科学跑得更快，却可能跑得更窄。守住研究方向的价值，正是为了抵抗这条偏置：让"值得"由人（在某个价值框架下）来定，而不是由"离已知最近"来定。这就是为什么第 8 节必须把价值责任交给组织的治理结构——价值判断没有归属，就会被生成层的默认偏置悄悄替换掉。

The generation layer’s conservative bias (nailed once more: acceleration ≠ progress): RES 02’s question-clustering and RES 03’s topical contraction and novelty-penalty point to one thing: the generation layer accelerates by default toward the “safe, data-rich, paradigm-consistent” direction. It makes science run faster while possibly running narrower. Owning the value of a research direction is exactly how to resist this bias: let “worth” be set by humans (within some value frame), not by “nearest to the known”. This is why Section 8 must hand value accountability to the organization’s governance structure; a value judgment with no owner gets quietly replaced by the generation layer’s default bias.

交棒锚 → 组织（谁有权定方向） / 人本主线Hand-off anchor → Org (who owns direction) / the human through-line

研究方向的价值责任落到组织里，就是治理：谁来判、谁担保、谁为方向负责。见组织篇（阅读入口）↗。把研究者还给"值得追问的问题"，与组织卷"让人回归组织中心"是同一条人本主线。

The value accountability of a research direction, once inside an organization, is governance: who judges, who vouches, who owns the direction. See the Organization reading entry ↗. Returning researchers to “questions worth asking” is the same human through-line as the Org volume’s “put people back at the center.”

散木的命运：效率会自动吃掉冗余探索

The fate of the useless tree: efficiency eats redundant exploration by default

"谁有权定方向"不是一个抽象的治理问题，它有一个非常具体的失败模式：散木（创新卷的刻度名：暂时无用、却值得留下的探索）被效率悄悄吃掉。多源一致地指向同一件事：AI 放大的是 exploitation（精炼、效率、执行），而不是 exploration。组织结构天然偏 exploitation，因为它可预测、可度量、反馈快；AI 每一次落地都发出一个干净的"进步"信号（CFO 友好），而 exploration 的故事模糊、需要想象力、回报不可度量。March 1991 的框架[R15]仍是底座：探索（搜索/变异/冒险）与利用（精炼/选择/效率）竞争同一份资源，利用倾向于赢。更隐蔽的机制是："省下来的产能不会自动变成 slack"：技术省下的工时通常被重新分配去做更多同样的事（more volume），而不是不同的事。"什么被度量，什么被管理；什么不可度量，什么先被砍"，slack 因不可度量而最先被砍。这就是为什么"守住研究方向的价值"必须是一个有归属人、被刻意保护的治理动作，而不能指望它自然存活。

“Who owns the direction” is not an abstract governance question; it has a very concrete failure mode: the useless tree (the Innovation volume’s gauge for exploration that is useless now but worth keeping) gets quietly eaten by efficiency. Multiple sources converge on one point: what AI amplifies is exploitation (refinement, efficiency, execution), not exploration. Organizational structure tilts toward exploitation by nature, because exploitation is predictable, measurable, and fast in feedback; every AI rollout emits a clean “progress” signal (CFO-friendly), while exploration’s story is fuzzy, demands imagination, and has a payoff that cannot be measured. March’s 1991 frame [R15] is still the base: exploration (search / variation / risk) and exploitation (refinement / selection / efficiency) compete for the same resources, and exploitation tends to win. The subtler mechanism: “freed capacity does not automatically become slack”; saved hours usually get reallocated to more of the same (more volume), not to something different. “What gets measured gets managed; what cannot be measured gets cut”: slack, being unmeasurable, is cut first. This is why “owning the value of a research direction” must be a deliberately protected governance act with a named owner; it cannot be expected to survive on its own.

省下的产能不会自动变成 slack。技术省下的工时，默认被重新分配去做更多同样的事，而不是变成可自由探索的 slack，因为 slack 不可度量，"多产 X% 论文"可度量，不可度量的先被砍。所以守值（守住"哪个方向值得"）与守散木（为不可度量的探索留预算）是同一个治理动作的两面：都得被显式预留、有人为长期回报负责，不能指望它作为省时的副产品自然涌现。

Freed capacity does not automatically become slack. Hours that technology frees get reallocated by default to more of the same, not turned into slack for free exploration — because slack is unmeasurable while “X% more papers” is measurable, and the unmeasurable gets cut first. So owning value (holding “which direction is worth it”) and protecting the useless tree (a budget for unmeasurable exploration) are two faces of one governance act: both must be explicitly reserved and owned by someone accountable for the long-term return, not expected to emerge as a by-product of time-saving.

但散木可被守护，这是条件论，不是宿命。"用 token 替换人＝exploitation；用 token 增强人＝exploration"——同一项 AI 能力，落进不同的激励结构，结局相反。守护冗余探索的具体动作是治理性的：设独立的探索单元、按"学习/新颖"而非"产量"考核、显式给"看上去无用"的方向留预算（Bell Labs、Xerox PARC、早期剑桥 LMB 的历史证据都指向"小团队 + 体制保护"）。把这条接回内核④：人回归意义，在研究面的具体落点，就是把研究者从"多产论文"的 exploitation 跑步机上解放出来，还给那些不可度量、却可能换地图的问题。更便宜的执行从来不是目的本身；真正要的，是让人去问那个值得问的问题。

价值判断没有归属，就被默认偏置悄悄替换。"谁有权定方向"之所以是治理问题而非技术问题，关键在一个容易被忽视的动力学：价值判断不会停留在真空里，它要么有归属人、被有意识地行使，要么被生成层的默认偏置悄悄填补。没有中间态。

当一个研究组织不明确"谁来判这个方向值不值得追"，这个判断不会消失，它会被默认地交给"哪个方向有数据、有 benchmark、能稳出结果"，也就是 RES 02/03/08 反复指认的那条向已知收敛的保守偏置。于是组织以为自己在"中立地跟随数据"，实际上是在无人负责的情况下，让生成层的结构性偏置替它做了方向选择。治理的全部意义，就是把这个判断从默认偏置手里夺回来，交给一个具名的、要为长期后果负责的人。这也是为什么第 8 节必须把研究方向的价值责任明确交给组织的治理结构——不是为了多设一个审批岗，是为了堵住"价值判断被偏置接管"这个无声的漏洞。

A value judgment with no owner gets quietly replaced by the default bias. “Who owns the direction” is a governance question, not a technical one, because of an easily overlooked dynamic: a value judgment does not sit in a vacuum. Either it has an owner who exercises it consciously, or the generation layer’s default bias quietly fills the gap. There is no middle state.

When a research organization leaves “who judges whether this direction is worth chasing” unspecified, the judgment does not vanish. It is handed by default to “which direction has data, a benchmark, a steady yield,” exactly the conservative bias toward the known that RES 02/03/08 keep naming. So the organization believes it is “neutrally following the data” while in fact, with no one accountable, the generation layer’s structural bias has made the direction choice for it. The whole point of governance is to take that judgment back from the default bias and give it to a named person accountable for the long-term consequences. This is why Section 8 must explicitly hand research-direction value accountability to the organization’s governance structure — not to add an approval seat, but to plug the silent leak of “value judgment captured by the bias.”

But the useless tree can be protected — this is conditional, not fated. “Replacing humans with tokens = exploitation; augmenting humans with tokens = exploration”: the same AI capability, dropped into different incentive structures, ends opposite ways. The concrete acts of protecting redundant exploration are governance acts: stand up independent exploration units, appraise on “learning / novelty” rather than “output,” explicitly budget for “seemingly useless” directions (the historical evidence of Bell Labs, Xerox PARC, the early Cambridge LMB all points to “small teams + institutional protection”). Wire this back to kernel ④: the human’s return to meaning, on the research face, lands concretely as freeing researchers from the exploitation treadmill of “more papers” and returning them to the unmeasurable questions that might redraw the map. Cheaper execution is never the point in itself; what we want is to let people ask the question worth asking.

FIG. 8.0 / 协作者与裁判的边界：AI 当协作者（生成那侧），人当裁判（担保那侧）THE COLLABORATOR–JUDGE BOUNDARY: AI AS COLLABORATOR (THE GENERATING SIDE), HUMAN AS JUDGE (THE VOUCHING SIDE)看懂：一条竖线把研究的动作分成两侧。左侧是 AI 当协作者——生成、执行、检索、起草，全在"对不对/有没有"的可机检面，向充裕一路坍缩。右侧是人当裁判——担保可信、选值得追的问题、为方向负责、判哪个真相值得知道，全在不可机检面。这条线不是工具分工，是责任归属：跨过它的，人才签字。Read: one vertical line splits research’s acts into two sides. Left is AI as collaborator: generate, execute, retrieve, draft, all on the machine-checkable face of “is-it-correct / does-it-exist,” collapsing toward abundance. Right is human as judge: vouch credibility, select the worth-chasing problem, own the direction, decide which truth is worth knowing, all on the un-checkable face. This line is not a division of tools but of accountability: only what crosses it does a human sign.

整卷的角色分工可以收进一条竖线。左侧，AI 是协作者：它生成、执行、检索、起草，全部落在"对不对、有没有"的可机检面，这一面向充裕一路坍缩。右侧，人是裁判：担保可信、选值得追的问题、判哪个真相值得知道、为方向具名负责，全部落在不可机检面。这条线是责任归属的分界，而非"谁干哪些活"的工具分工：跨过它的判断，必须有一个具名的人签字——否则它不会消失，只会被生成层的默认偏置无声接管。把这条线守住，研究环就还在自我纠偏；守不住，它就退化成一台高速空转的生成器。The whole volume’s role-split collapses into one vertical line. On the left, AI is the collaborator: it generates, executes, retrieves, drafts: all on the machine-checkable face of “is-it-correct, does-it-exist,” the face that collapses toward abundance. On the right, the human is the judge: vouching credibility, selecting the worth-chasing problem, deciding which truth is worth knowing, owning the direction by name: all on the un-checkable face. This line is not a “who does which chores” division of tools but a division of accountability: a judgment that crosses it must be signed by a named person; otherwise it does not vanish, it is quietly taken over by the generation layer’s default bias. Hold the line and the research loop still self-corrects; lose it and the loop degrades into a fast generator spinning free.

RES

FAILURE · 超常规科学

HYPERNORMAL SCIENCE

失败模式 · 执行充裕的暗面

Failure mode · The dark side of abundance

最危险的误用：加速把科学推得更窄，不是更深

The most dangerous way to go wrong: acceleration pushes science narrower, not deeper

暗面在判断交还给人之前就发生：摆到人面前的候选集，已被悄悄过滤。

One dark side strikes before judgment is handed back: the candidate set placed before the human is already quietly filtered.

一句话In one line

最危险的误用不在 AI 变弱，在它太擅长框架内的活：默认朝"安全、数据丰富"加速，指标全绿而探索在收窄。The most dangerous misuse is not AI getting weaker but how good it already is inside the paradigm: it accelerates by default toward “safe, data-rich,” every metric green while exploration contracts.

预测准 ≠ 理解对：太阳系模型没长出"引力"

Accurate prediction ≠ correct understanding: the solar-system model never grew “gravity”

hypernormal science 最容易被低估，是因为它常常伪装成"成功"。一个能精确预测的模型看起来就是好科学，但"预测准"和"理解对"是两回事，而 AI 的目标函数只奖励前者。最干净的例子：在 1000 万个模拟太阳系上训练的基础模型，能把行星轨道预测到极高精度，却从未在内部表示里长出"引力"这个概念：它学到的是一堆能复现观测的统计规律的拼凑，不是那条让所有轨道统一起来的力学定律。从"预测准"这个指标看，它完美；从"理解对"看，它什么都没理解。

这就是 hypernormal 的伪装：当评价标准只剩"在现有 benchmark 上更高分"，一个把框架内预测做到极致、却完全没触及描述层级的系统，会被一路绿灯放行，甚至被当成换框架的突破来庆祝。守住这道分别——预测力的提升不等于理解的加深——是不被 hypernormal science 骗过去的第一道认知防线。它也解释了为什么本卷坚持"产出量/预测精度本身不是指标"：真正该问的是覆盖广度有没有扩、描述层级有没有被挑战。

Hypernormal science is most easily underestimated because it often disguises itself as “success.” A model that predicts accurately looks like good science — but “predicting accurately” and “understanding correctly” are two things, and AI’s objective rewards only the former. The cleanest example: a foundation model trained on 10 million simulated solar systems predicts planetary orbits to very high precision, yet never grows the concept of “gravity” in its internal representation. What it learned is a patchwork of statistical regularities that reproduce the observations, not the one law of mechanics that unifies all orbits. By the metric “predicts accurately,” it is perfect; by “understands correctly,” it understood nothing.

This is hypernormal’s disguise: when the only evaluation criterion left is “a higher score on the existing benchmark,” a system that perfects in-paradigm prediction while never touching the level of description gets waved through, even celebrated as a paradigm-level breakthrough. Holding this distinction — a gain in predictive power is not a deepening of understanding — is the first cognitive defense against being fooled by hypernormal science. It also explains why the volume insists “output volume / prediction accuracy itself is not the metric”: what should be asked is whether topical breadth widened and whether the level of description was challenged.

结构成因 · 为何会这样：机器学习靠"对预先定义好的变量/标签最小化预测误差"，它擅长预测当前数据，却被锁进所学数据的概念词汇（这正是 RES 00 地图隐喻的要害：填满地图 ≠ 重画地图）。AI 能把地图上的空白填满（DeepMind GNoME 发现 220 万新材料，绝大多数是已知结构类型内的元素替换；ESM3 设计新荧光蛋白＝填空，不是画新地图），但它不会去问"现在这套描述层级是不是错的"。

William Farr 的霍乱地图把数据围绕"空气质量"组织，再聪明的 AI 也推不出"水传播微生物"这个没人记录过的变量：germ theory 要靠换显微镜、换仪器、换变量。

Structural cause · why this happens: machine learning works by “minimizing prediction error against pre-defined variables/labels” — good at predicting current data, but locked into the conceptual vocabulary of what it was trained on (the point of RES 00’s map metaphor: filling the map ≠ redrawing it). AI can fill the blanks (DeepMind’s GNoME found 2.2 million new materials, the vast majority element-substitutions inside known structure types; ESM3 designed novel fluorescent proteins: filling gaps, not drawing a new map), but it will not ask “is this whole level of description wrong.”

William Farr’s cholera map organized its data around “air quality,” so no AI, however clever, could have inferred “waterborne microbes,” a variable no one had recorded: germ theory required changing the microscope, the instrument, the variable.

硬锚 · 文献计量 · 等级 Ⅱ（观测性，慎言因果）Hard anchor · bibliometrics · grade Ⅱ (observational, causal claims with care)

Hao, Xu, Li & Evans,《AI tools expand scientists' impact but contract science's focus》, Nature 649(8099), 2026, DOI 10.1038/s41586-025-09922-y。对约 4129.8 万篇论文的分析印证了这条暗面的"已发生"形态：用 AI 的科学家个人发表 3.02×、被引 4.84×、当项目负责人早 1.37 年，但科学整体主题覆盖收缩 4.63%、学者间互动下降 22%、引用集中度上升（Gini 0.754 vs 0.690），知识广度在六大学科 70% 以上子领域一致收缩。机理＝AI 向数据丰富区聚集、自动化既有领域而非探索新领域。〔标选择效应〕它是观测性文献计量（用 LLM 分类器 F1=0.875 识别"AI 增强"论文，用 AI 者本就可能集中于热门域，相关非因果），但作为"加速 ≠ 进步"的先行信号已足够硬。

Hao, Xu, Li & Evans, “AI tools expand scientists’ impact but contract science’s focus,” Nature 649(8099), 2026, DOI 10.1038/s41586-025-09922-y. An analysis of about 41.298 million papers shows this dark side as something that has already happened: scientists who use AI publish 3.02× as many papers, are cited 4.84× as often, and become project leaders 1.37 years earlier. Yet across science as a whole, topical coverage contracts by 4.63%, scholar-to-scholar interaction falls by 22%, and citation concentration rises (Gini 0.754 vs 0.690), with knowledge breadth contracting consistently in more than 70% of subfields across six disciplines. The proposed mechanism: AI clusters toward data-rich regions and automates existing fields rather than exploring new ones. [Selection-effect flag] This is observational bibliometrics: “AI-augmented” papers are identified by an LLM classifier with F1=0.875, and AI users may already concentrate in hot fields, so the result is correlation, not causation. As a leading signal that “acceleration ≠ progress,” though, it is hard enough.

暗面发生在判断交还给人之前。前面七张的逻辑是"执行被充裕 → 判断退守给人"，听起来像一条干净的接力：机器跑完执行，人接过判断。但 hypernormal science 揭示了一个前置故障：暗面发生在判断被交还之前。原因在于，生成层不是中立地铺开所有候选再等人来选，它在生成的那一刻就已经带着偏置：它优先生成"安全、数据丰富、与现行框架一致"的候选，把"换框架、换变量"的候选压到尾部甚至根本不生成。于是当人来接判断时，摆在他面前的候选集本身已经被悄悄收窄了。人以为自己在"从所有可能里选最值得的"，实际上在"从一个已被保守偏置过滤过的子集里选"。

这就是为什么不能把研究环简单理解为"生成中立、判断承重"——生成本身就携带价值倾向，而这个倾向恰恰朝着框架内。守值的动作因此必须前移：不只在判断时守，还要在生成时主动要求候选集包含"换框架"的选项，否则人守的是一个已经被做了手脚的菜单。

The dark side happens before judgment is handed back to humans. The logic of the first seven sections is “execution made abundant → judgment retreats to humans,” which sounds like a clean relay: the machine finishes execution, the human takes over judgment. But hypernormal science reveals an upstream failure: the dark side happens before judgment is handed back. The generation layer does not neutrally lay out all candidates and wait for a human to pick. It is already biased at the moment of generation: it preferentially produces “safe, data-rich, paradigm-consistent” candidates and pushes “change-the-frame, change-the-variable” candidates into the tail, or never generates them at all. So when the human arrives to judge, the candidate set in front of them has already been quietly narrowed. The human believes they are “selecting the most worthy from all possibilities” while actually “selecting from a subset already filtered by a conservative bias.”

This is why the research loop cannot be read simply as “neutral generation, load-bearing judgment”: generation itself carries a value tilt, and that tilt runs toward the in-paradigm. The act of owning value must therefore move earlier: not only guarding at judgment time but, at generation time, actively demanding that the candidate set include paradigm-level options; otherwise the human guards a menu that has already been rigged.

同质化的机制：写在权重里，推理时救不回

The mechanism of homogenization: written into the weights, unrecoverable at inference

hypernormal science 不只是"AI 偏保守"的软倾向，它有一条写进权重的硬机制。最强的因果锚是 Doshi & Hauser（Science Advances 2024）[R12]：给写作者 LLM 点子，个体故事更"有创意"，但故事彼此更相似：他们明确称之为"社会困境"（个人更好、集体更窄）。同质化是群体层效应（Anderson 等 2024，36 人实验）[R13]：它不来自个体固着，而来自 LLM 向不同用户建议相似点子。更狠的是跨模型同质（"We're Different, We're the Same" 2025）[R14]：控制结构变量后，LLM 之间的相似度远高于人与人之间：换个模型也救不了。机制层的决定性证据：post-training 的 diversity collapse 写在权重里、推理时无法挽回（arXiv:2604.16027，Olmo 3 三条 lineage）；recursive 训练合成数据会导致 model collapse、分布尾部消失（Shumailov 等，Nature 2024）。这条机制把"人类真实交互数据"变成愈发珍贵的资源，它直接为内核④"人是异质性来源"背书。

Hypernormal science is not just a soft “AI leans conservative” tendency; it has a hard mechanism written into the weights. The strongest causal anchor is Doshi & Hauser (Science Advances 2024) [R12]: give writers ideas from an LLM and individual stories get more “creative,” yet the stories grow more similar to each other. The authors explicitly call this a “social dilemma” (better individually, narrower collectively). Homogenization is a group-level effect (Anderson et al. 2024, a 36-person experiment) [R13]: it comes not from individual fixation but from the LLM suggesting similar ideas to different users. Harsher still is cross-model homogeneity (“We’re Different, We’re the Same,” 2025) [R14]: controlling for structural variables, LLMs resemble each other far more than humans resemble each other, so switching models does not save you. The decisive mechanism-level evidence: post-training diversity collapse is written into the weights and unrecoverable at inference (arXiv:2604.16027, three Olmo 3 lineages), and recursively training on synthetic data causes model collapse, with the distribution tails vanishing (Shumailov et al., Nature 2024). This mechanism makes “real human interaction data” an ever more precious resource, which directly supports kernel ④’s “humans as the source of heterogeneity.”

真实交互数据，因此成了愈发珍贵的反同质化资源。这条机制有一个常被忽略的推论：既然 diversity collapse 写在权重里、且 recursive 训练合成数据会让分布尾部消失（model collapse），那么未被 AI 中介过的、真实的人类交互数据就成了愈发稀缺、愈发珍贵的资源，它是分布尾部、是异质性的最后蓄水池。这对研究组织有直接的操作含义：当所有人都在用同几个模型生成假设、写综述、做评审时，那些仍由人独立产生、未被模型均值拉平的判断与观察，恰恰是组织最该刻意保存、而非急于"用 AI 提效"掉的东西。它也回连内核④"人是异质性来源"：人之所以不可替代，不是因为人比 AI 聪明，而是因为人群携带着 AI 分布尾部已经丢失的多样性。守住这份多样性，需要刻意施力，在流程里留出"不经 AI 中介"的判断节点，在数据上珍惜真实人类信号，在激励上奖励偏离均值的探索。这三条合起来，就是抵抗 hypernormal 的组织级动作。

Real interaction data thus becomes an ever more precious anti-homogenization resource. This mechanism has an often-overlooked corollary: since diversity collapse is written into the weights, and recursively training on synthetic data makes the distribution tails vanish (model collapse), then real human interaction data not mediated by AI becomes an ever scarcer, ever more precious resource. It is the distribution tail, the last reservoir of heterogeneity. This has a direct operational implication for research organizations: when everyone generates hypotheses, writes reviews, and reviews with the same few models, the judgments and observations still produced independently by humans, not flattened to the model mean, are exactly what the organization should deliberately preserve rather than rush to “make efficient with AI.” It also wires back to kernel ④’s “humans as the source of heterogeneity”: humans are irreplaceable not because they are smarter than AI but because the human population carries the diversity AI’s distribution tail has already lost. Holding this diversity takes deliberate force: keeping “non-AI-mediated” judgment nodes in the workflow, treasuring genuine human signal in the data, rewarding off-mean exploration in incentives. Together, these three are the organization-level acts that resist hypernormal.

但它是默认引力，不是铁律。诚实地把反向证据摆上：同质化依任务/prompt/暴露方式而变，在高暴露的动态实验里集体多样性反而能升。所以正确的命题表述不是"AI 必然让科学同质"，而是"AI 默认把研究拉向均值，须刻意施力才能偏离"（regression to a domain prototype）。开放式/QD 算法（novelty-search、MAP-Elites、POET）证明：只要放弃单一目标函数，机器也能产异质。这把命题从"异质性只能来自人"收紧成更稳的版本：异质性的敌人是单一目标的过度优化，不是机器本身：人定义"什么值得不同"，机器在那个定义下产生多样。这条限定既守住人的角色，又抗住"AI 终将学会创意"这个证伪。

But it is a default gravity, not an iron law. Put the counter-evidence on the table honestly: homogenization varies by task, prompt, and exposure mode, and in high-exposure dynamic experiments collective diversity can even rise. So the correct statement is not “AI inevitably homogenizes science” but “AI by default pulls research toward the mean, and it takes deliberate force to depart from it” (regression to a domain prototype). Open-ended / QD algorithms (novelty-search, MAP-Elites, POET) show that machines too can produce heterogeneity, as long as the single objective function is abandoned. This tightens the thesis from “heterogeneity can only come from humans” into a sturdier version: the enemy of heterogeneity is the over-optimization of a single objective, not the machine itself. Humans define “what is worth differing on,” and the machine generates diversity under that definition. This qualifier both preserves the human’s role and withstands the falsifier “AI will eventually learn creativity.”
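To make "abandon the single objective" concrete, here is a toy sketch in the spirit of MAP-Elites, which is named above. It is not the published algorithm in full: the one-dimensional search space, the `fitness` function, the `descriptor` binning, and the mutation settings are all illustrative assumptions. The point it shows is structural: instead of keeping one global best, the archive keeps the best candidate per descriptor cell, so low-scoring regions stay populated.

```python
import random


def fitness(x: float) -> float:
    """Hypothetical single objective: peaks at x = 0.5."""
    return -abs(x - 0.5)


def descriptor(x: float, bins: int = 10) -> int:
    """Hypothetical behaviour descriptor: which tenth of [0, 1] x falls in."""
    return min(int(x * bins), bins - 1)


def map_elites(iterations: int = 2000, bins: int = 10, seed: int = 0) -> dict[int, float]:
    """Keep one elite per descriptor cell instead of one global optimum."""
    rng = random.Random(seed)
    archive: dict[int, float] = {}
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Mutate a randomly chosen elite, clamped to the search space.
            parent = rng.choice(list(archive.values()))
            child = min(1.0, max(0.0, parent + rng.gauss(0.0, 0.1)))
        else:
            # Occasionally sample a fresh candidate uniformly.
            child = rng.random()
        cell = descriptor(child, bins)
        # Competition is local to the cell, never global.
        if cell not in archive or fitness(child) > fitness(archive[cell]):
            archive[cell] = child
    return archive


archive = map_elites()
global_best = max(archive.values(), key=fitness)
```

A pure optimizer would report only `global_best`. The archive also keeps the elites that sit far from the optimum, and the choice of `descriptor` is where a human says what is worth differing on.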

反指标 · 怎么知道你正在滑进 hypernormal

Counter-indicators · how to tell you are sliding into hypernormal

反指标：主题覆盖在缩而产量在涨；"换框架"贡献长期低位；引用向少数热点集中（Gini 升）。三者一起出现即科学在变窄。

Counter-indicators: topical breadth shrinks while output rises; “reframe” contributions stay durably low; citations concentrate on a few hot nodes (rising Gini). When these co-occur, science is narrowing.
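The three counter-indicators can be checked mechanically once their inputs exist. The following is a minimal sketch, assuming you already have per-paper citation counts and some measure of output growth, breadth change, and the share of "reframe" contributions. The function names, the `reframe_floor` threshold, and the example numbers are hypothetical and not taken from the source.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula over ascending values, ranks 1..n.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def narrowing_alarm(output_growth: float, breadth_change: float,
                    reframe_share: float, gini_baseline: float,
                    gini_current: float, reframe_floor: float = 0.05) -> bool:
    """True only when all three counter-indicators co-occur."""
    breadth_shrinks_while_output_rises = output_growth > 0 and breadth_change < 0
    reframes_stay_low = reframe_share < reframe_floor  # hypothetical threshold
    citations_concentrate = gini_current > gini_baseline
    return (breadth_shrinks_while_output_rises
            and reframes_stay_low
            and citations_concentrate)


# Hypothetical inputs: output up 20%, breadth down 4%, 2% reframe share.
alarm = narrowing_alarm(0.20, -0.04, 0.02, gini_baseline=0.70, gini_current=0.75)
```

Requiring all three conditions mirrors the text: any one of them alone is ambiguous, and the co-occurrence is the signal.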

RES

JUDGMENT · 可信度天平

THE BELIEVABILITY LEDGER

决策矩阵 · 逐条判可信

Decision matrix · claim-by-claim

当生成无限，每条主张都要先过一道可信度天平

When generation is unbounded, every claim first crosses a believability ledger

AI 生成一堆主张，逐篇读不完，怎么分诊？

AI generates a pile of claims you can’t read one by one: how do you triage?

一句话

In one line

AI 生成的主张别逐篇读，按两条轴分诊：证据强度（弱就补）、框架距离（远就让人判）。合成"可信分"会误杀范式级重构。

Don’t read AI-generated claims one by one; triage on two axes: evidence strength (weak → supplement) and paradigm distance (far → a human judges). A single merged “credibility score” wrongly kills paradigm-level reframings.

先说清为什么需要一道天平。心理学那批写进教科书的经典结果，大规模复现时只有约三成能重来一遍[R1]——那还是人一篇篇亲手做出来的。截至本版（2026-07），AI 一天能生成上千条同样”看着可信”的主张：你连人写的都核不过来，凭什么信机器批量产的？所以每条主张进门前都得先分诊，而分诊的第一刀，是别把两件性质不同的事搅成一个总分。

First, why a ledger is needed at all. Of the textbook-classic results in psychology, only about a third replicated when tested at scale[R1] — and those were made by humans, one paper at a time. As of this edition (2026-07), AI can generate a thousand equally “credible-looking” claims a day: if you cannot even keep up with what humans wrote, why trust what a machine mass-produces? So every claim must be triaged on the way in, and the first cut of triage is not to stir two things of different nature into one aggregate score.

证据强度可补，框架距离要判——两条轴管两种动作

Evidence strength can be supplemented, paradigm distance must be judged: two axes for two acts

两条轴之所以不能合并，深层原因是它们对应两种性质完全不同的动作。"证据强度"这条轴是可补的、可机检的：一条主张证据弱，处置很清楚——去补证据（多跑复现、找原始数据、查证据链是否完整），这是个有标准答案、可外包给生成与图谱规则的动作。"框架距离"这条轴则要人来判，且没有可补一说：一条主张离现行框架远，这件事本身不是缺陷、也不是优点，它只是一个需要人来定夺的信号——人要判它是框架外的噪声，还是一次重构。

把两条轴合成一个"可信分"，等于把"可补的执行动作"和"不可外包的判断动作"搅成一锅，结果是两种动作都做不好：该补证据的没去补（因为分数已经替它下了结论），该人判的被自动判了（因为分数把"离框架远"直接折算成低可信）。分开记的全部意义，就是让每条轴触发它对应的那种正确动作——证据弱→补证据（左动作），离框架远→人来判（右动作）。这正好是 RES 10 那条"能写可机检验收标准的归左、写不出的归右"判据在单条主张层面的应用。

The deep reason the two axes cannot be merged is that they correspond to two acts of completely different nature. The “evidence strength” axis is supplementable and machine-checkable: if a claim’s evidence is weak, the disposition is clear: go get more evidence (run replications, find raw data, check the evidence chain’s completeness), an act with a right answer, outsourceable to generation and graph rules. The “paradigm distance” axis must be judged by a human, with no “supplementing”: a claim being far from the established paradigm is neither a defect nor a merit in itself, only a signal requiring constitutive judgment — a human must judge whether it is out-of-paradigm noise or a paradigm-level reframing.

Merging the two into one “credibility score” stirs “a supplementable execution act” and “an un-outsourceable judgment act” into one pot, and then both are done badly: the evidence that should be sought is not sought (the score already concluded for it), and what a human should judge is auto-judged (the score converts paradigm-distance straight into low credibility). The whole point of booking them separately is to let each axis trigger its corresponding correct act: weak evidence → seek evidence (the left act), far from paradigm → a human judges (the right act). This is exactly RES 10’s “machine-checkable criterion goes left, otherwise right” test applied at the single-claim level.
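A small worked example shows how the merged score goes wrong. This is a sketch under stated assumptions: the 0 to 1 scales, the multiplicative `merged_score`, and the 0.5 thresholds are hypothetical choices made for illustration, not a formula from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    name: str
    evidence: float  # 0 = weak, 1 = strong (hypothetical scale)
    distance: float  # 0 = in-paradigm, 1 = far from paradigm (hypothetical scale)


def merged_score(c: Claim) -> float:
    """The anti-pattern: paradigm distance is converted straight into a penalty."""
    return c.evidence * (1.0 - c.distance)


def actions(c: Claim, weak_below: float = 0.5, far_above: float = 0.5) -> list[str]:
    """Each axis triggers its own act; neither axis overrides the other."""
    acts = []
    if c.evidence < weak_below:
        acts.append("seek evidence")  # left act: supplementable, machine-checkable
    if c.distance > far_above:
        acts.append("human judges")   # right act: cannot be outsourced
    return acts


reframing = Claim("possible reframing", evidence=0.3, distance=0.9)
noise = Claim("in-paradigm noise", evidence=0.2, distance=0.1)
```

Under `merged_score`, the possible reframing ranks below the in-paradigm noise even though its evidence is stronger. Under `actions`, it triggers both acts and the noise triggers only one.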

为什么"证据弱×框架远"这一格决定整台仪器的价值

Why the “weak × far” cell decides the whole instrument’s value

四象限里有三格是直觉的：证据强×框架内→入库；证据弱×框架内→当噪声；证据强×框架远→人重点看。真正考验一个研究系统成色的，是第四格：证据弱×框架远。直觉和单一可信分都会把它判死："远离现行框架且证据不足，删除。"但科学史上几乎每一次范式转移，诞生时都正落在这一格：爱因斯坦 1905 年的狭义相对论与洛伦兹的以太收缩，起初都只是拟合同一批数据，爱因斯坦的版本既"离框架远"又在当时缺乏决定性实验证据。达尔文的自然选择，核心机制（泛生论 gemmules）后来被证明是错的，但想法本身因为有用而存活。如果当年有一台只算可信分的系统，它会把这些都判为"证据弱×框架远→删"。

所以这一格的正确处置是第三种动作，不是二元的"信/不信"：挂起，并定向去找那个能区分"它是噪声"还是"它是重构"的关键证据。一台仪器值不值得用，全看它对这一格的处置，把它做成"删"，它就是 hypernormal science 的自动扼杀器；把它做成"挂起+定向取证"，它才是真正能接住范式转移的天平。

Three of the four quadrants are intuitive: strong × in-paradigm → integrate; weak × in-paradigm → noise; strong × far → a human looks closely. What truly tests a research system’s mettle is the fourth: weak evidence × far from paradigm. Intuition and a single credibility score both sentence it to death: “far from the current paradigm and short on evidence, delete.” Yet nearly every paradigm shift in the history of science sat exactly in this cell at birth. Einstein’s 1905 special relativity and Lorentz’s ether contraction both merely fit the same data at first, and Einstein’s version was both “far from paradigm” and lacking decisive experimental evidence at the time. Darwin’s natural selection is another case: the mechanism of inheritance he proposed alongside it (pangenesis, gemmules) later proved wrong, yet the idea survived because it was useful. Had a credibility-score-only system existed then, it would have judged all of these “weak × far → delete.”

So the correct disposition for this cell is not the binary “believe / disbelieve” but a third act: suspend, and go find the decisive evidence that separates “it is noise” from “it is a reframing.” Whether an instrument is worth using turns entirely on its handling of this cell — make it “delete” and it is hypernormal science’s automatic strangler; make it “suspend + targeted evidence-seeking” and it is a ledger that can actually catch a paradigm shift.

INSTRUMENT 10 · 可信度天平 BELIEVABILITY LEDGER

先拨"生成速度"看整合赤字如何扩大；再把一条主张放进双轴，得到处置判词，把张力二（peer review 改变）+ 张力三（整合鸿沟）做成一个可拨动的张力台。

First drag “generation rate” to watch the integration deficit widen; then drop a claim onto the two axes for a disposition verdict. Tension two (peer review changes) and tension three (the integration gap) become one adjustable bench.

生成速度（相对人类整合带宽）Generation rate (relative to human integration bandwidth) · 10× · 生成 GEN / 可整合 DIGEST

X · 证据强度？Evidence strength? · Y · 离现行框架多远？Distance from paradigm?

证据强 × 框架远 strong × far → 人判 · 可能是重构 Human · maybe a reframing
证据弱 × 框架远 weak × far → 别急着杀 · 先补证据 Do not kill yet · seek evidence
证据强 × 框架内 strong × near → 可信 · 入库整合 Believe · integrate
证据弱 × 框架内 weak × near → 存疑 · 框架内噪声 Doubt · in-paradigm noise

关键反陷阱

The key anti-trap

"证据弱 × 框架远"格最该警惕：可信分会判它死，但重构诞生时证据必薄。正确处置是挂起取证，不是删。

The cell to watch most is “weak evidence × far from paradigm”: a credibility score sentences it to death, yet a paradigm-level reframing is thin on evidence at birth. The disposition is suspend and seek evidence, not delete.

天平不是给主张打分，是给"该投多少人类带宽"排序。天平最容易被误用成"给每条主张算一个可信分、按分排序"。这恰恰是它要避免的。它的产物不是分数，是处置（disposition）：对每条主张，回答"接下来该怎么处理它"，而不是"它有多可信"。四象限给的就是四种处置，而不是四档分数：证据强×框架内→直接入库整合（人不必看）；证据弱×框架内→当噪声存疑（人不必看）；证据强×框架远→人来判它是不是重构（值得看）；证据弱×框架远→挂起、定向去找区分证据（最值得看）。这套设计的目的，是把人的稀缺带宽从"逐篇精读"里解放出来，只投到真正吃紧的两格。当生成每小时上千条，逐篇读是物理上不可能的；天平的价值正在于它替你做了"哪些根本不必人看"的分流。

The ledger does not score claims; it ranks how much human bandwidth to spend. The ledger is most easily misused as “compute a credibility score per claim and sort by score.” That is exactly what it avoids. Its real product is not a score but a disposition: for each claim, answering “what to do with it next,” not “how credible it is.” The four quadrants give four dispositions, not four score tiers: strong × in-paradigm → integrate directly (no human needed); weak × in-paradigm → hold as noise (no human needed); strong × far-from-paradigm → a human judges whether it is a reframing (worth looking); weak × far-from-paradigm → suspend and go find discriminating evidence (most worth looking). The design’s purpose is to free the human’s scarce bandwidth from “reading each one closely” and spend it only on the two genuinely tight cells. When generation runs at thousands an hour, reading each is physically impossible; the ledger’s value is precisely that it does the “which ones need no human at all” triage for you.
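The four dispositions can be written as a lookup instead of a score. This is a minimal sketch: the boolean inputs assume that someone has already decided what counts as "strong" and "far", which is exactly the part the ledger does not automate. The names `Disposition`, `dispose`, and `needs_human` are illustrative.

```python
from enum import Enum


class Disposition(Enum):
    INTEGRATE = "believe, integrate"
    DOUBT = "doubt, in-paradigm noise"
    HUMAN_JUDGES = "human judges, maybe a reframing"
    SUSPEND = "suspend, seek discriminating evidence"


# (evidence_is_strong, is_far_from_paradigm) -> disposition
QUADRANTS = {
    (True, False): Disposition.INTEGRATE,
    (False, False): Disposition.DOUBT,
    (True, True): Disposition.HUMAN_JUDGES,
    (False, True): Disposition.SUSPEND,
}


def dispose(evidence_is_strong: bool, is_far: bool) -> Disposition:
    """Return what to do with the claim next, not how credible it is."""
    return QUADRANTS[(evidence_is_strong, is_far)]


def needs_human(d: Disposition) -> bool:
    """Human bandwidth goes only to the two far-from-paradigm cells."""
    return d in (Disposition.HUMAN_JUDGES, Disposition.SUSPEND)


# Hypothetical batch of (strong?, far?) pairs.
batch = [(True, False), (False, False), (True, True), (False, True), (True, False)]
human_queue = [pair for pair in batch if needs_human(dispose(*pair))]
```

Nothing in the lookup ever returns "delete". The weak × far cell maps to a suspend-and-seek act, which is the design choice the section argues for.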

AI 当协作者，还是当裁判？一道必须先划的界

AI as collaborator, or as judge? a line you must draw first

天平回答"这条主张可不可信"，但它背后压着一个更根本的问题：这一道判断，到底该不该让 AI 来当裁判？把 AI 当协作者（铺候选、查文献、跑实验、提反例）几乎总是安全的，它在执行端，输出还要过人的判断。把 AI 当裁判（让它定"哪个值得信、哪个值得发、哪个该资助"）则是另一回事，因为裁判位是价值与可信度的归属位，一旦交出去，RES 03 的结构性偏置会从"建议"升级成"判决"。这道界不能拍脑袋划，要看三件事：判据能不能机检（能 → AI 可裁）、判断是否价值负载（是 → 留人）、判错的代价可不可逆（不可逆 → 留人）。一个干净的判据：当这道判断的"对"只能诉诸"对谁、在哪个价值框架下"，AI 只能当协作者，不能当裁判。

The ledger answers “is this claim credible,” but underneath it presses a more fundamental question: should AI be the judge of this decision at all? Using AI as a collaborator (spreading candidates, searching literature, running experiments, raising counter-examples) is almost always safe: it sits on the execution side and its output still passes through human judgment. Using AI as a judge (letting it set “which to believe, which to publish, which to fund”) is another matter, because the judge’s seat is the seat of value and credibility ownership, and once handed over, RES 03’s structural bias is promoted from “suggestion” to “verdict.” This line cannot be drawn by gut; it depends on three things: whether the criterion is machine-checkable (yes → AI may judge), whether the judgment is value-laden (yes → keep human), whether the cost of a wrong call is reversible (irreversible → keep human). A clean test: when the “right” of this judgment can only appeal to “for whom, under which value frame,” AI can only be a collaborator, never a judge.

把界划错的两种症状：一种是把裁判位悄悄让出去：团队用"AI 评分高"代替"我读过、我担保"，于是没有人真正为某条主张的可信度负责，偏置在无人察觉中累积（这正是 RES 07"价值判断没有归属就被默认偏置替换"的运行态）。另一种是把协作位也死死攥住：出于不信任，连"铺候选、查文献"这种纯执行都不敢交给 AI，于是没把执行充裕化，团队还困在工时瓶颈里。两种都丢杠杆：前者交出了不该交的判断，后者守住了不必守的执行。正确的姿势是把这两个位子分开：执行位尽量交，裁判位审慎留；审慎与否，由上面那三条（可机检 / 价值负载 / 代价可逆）逐案判定。

Two symptoms of drawing the line wrong. One is quietly vacating the judge’s seat: the team substitutes “AI scored it high” for “I read it, I vouch,” so no one truly owns any claim’s credibility and bias accrues unnoticed (exactly RES 07’s “a value judgment with no owner gets replaced by the default bias,” at runtime). The other is clutching even the collaborator’s seat: out of distrust, the team will not even hand “spreading candidates, searching literature” (pure execution) to AI, so it never makes execution abundant and stays stuck at the hours bottleneck. Both forfeit leverage: the first hands away judgment that should not be handed away, the second holds execution that need not be held. The right posture is to separate the two seats: hand over the execution seat freely, keep the judge’s seat with care, and decide how much care case by case using the three criteria above (machine-checkable / value-laden / cost reversible).
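The three criteria can be applied case by case as a simple check. The sketch below assumes the conservative reading that AI may sit in the judge's seat only when all three criteria point that way; the source lists the criteria but does not say how to combine them, so the conjunction is an assumption. The example decisions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    name: str
    machine_checkable: bool  # can the criterion be checked by a machine?
    value_laden: bool        # does "right" depend on for whom, under which values?
    reversible: bool         # can a wrong call be undone?


def ai_may_judge(d: Decision) -> bool:
    """Assumed combination: every criterion must clear before AI judges."""
    return d.machine_checkable and not d.value_laden and d.reversible


def seat(d: Decision) -> str:
    return "AI may judge" if ai_may_judge(d) else "AI collaborates, human judges"


# Hypothetical cases for illustration.
citation_format = Decision("citation format is valid", True, False, True)
funding_call = Decision("which proposal to fund", False, True, False)
```

Either way AI stays in the collaborator's seat; the check only decides whether it may also take the judge's seat for this one decision.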

FIG. 9.0 / 可信度天平：一条主张如何沿证据级被晋升或卡住

THE BELIEVABILITY LEDGER: HOW ONE CLAIM IS PROMOTED, OR STALLED, ALONG THE EVIDENCE GRADES

看懂：横轴是证据级 Ⅴ→Ⅰ；主张从"假设"出发，每过一道闸晋一级；唯一让它"承重"的是 Ⅱ→Ⅰ 那道独立复现闸；过不了就停在"已发表未复现"，不许当地基。

Read: the x-axis is evidence grade Ⅴ→Ⅰ; a claim starts as a hypothesis and gains a grade at each gate; the one thing that makes it “load-bearing” is the Ⅱ→Ⅰ independent-replication gate; fail it and it parks at “published, unreplicated,” not to be built on.

这张图把"逐条判可信"展开成时间维度：一条主张沿证据级被逐道闸晋升，而非一锤定音地"可信/不可信"。AI 让左半段（Ⅴ–Ⅳ，假设与单点报告）的产出近乎免费，于是瓶颈整体右移到唯一那道承重闸：Ⅱ→Ⅰ 的独立复现。过不了这道闸的主张停在"已发表未复现"，而非被删，可以被引用、被讨论，但不许被当成地基往上盖。这正是 FIG 0.1 那道复现闸在单条主张尺度上的展开，也是 INSTRUMENT 10 双轴矩阵在"证据强度"那条轴上的纵深。

This figure unrolls “judging credibility claim-by-claim” along a time axis: a claim is not credible-or-not in one stroke; it is promoted through gates along the evidence grades. AI makes the left segment (Ⅴ–Ⅳ, hypotheses and single field reports) near-free to produce, so the real bottleneck shifts wholesale to the one load-bearing gate: the Ⅱ→Ⅰ independent replication. A claim that fails this gate is not deleted but parked at “published, unreplicated”: citable, discussable, but not to be built upon as a foundation. This is FIG 0.1’s replication gate unrolled at single-claim scale, and the depth of INSTRUMENT 10’s two-axis matrix along its “evidence strength” axis.
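The gate logic in the figure can be sketched as a small state machine. The grade names follow the figure (Ⅴ to Ⅰ). What each intermediate gate checks is not specified in the text, so this sketch lets every gate except the last pass unconditionally, which is an assumption made only to keep the example short.

```python
from enum import IntEnum


class Grade(IntEnum):
    V = 5    # hypothesis
    IV = 4   # single report
    III = 3
    II = 2   # published, unreplicated
    I = 1    # independently replicated


def promote(grade: Grade, independently_replicated: bool = False) -> Grade:
    """Advance one grade; the II -> I gate opens only on independent replication."""
    if grade is Grade.I:
        return grade
    if grade is Grade.II and not independently_replicated:
        return grade  # parked, not deleted
    return Grade(grade - 1)


def may_build_on(grade: Grade) -> bool:
    """Only grade I is load-bearing; everything else is citable but not a foundation."""
    return grade is Grade.I


claim = Grade.V
for _ in range(5):
    claim = promote(claim)  # no replication supplied, so it stalls at II
```

The loop ends with `claim` parked at `Grade.II`: it keeps its place in the record, and `may_build_on` stays false until a replication arrives.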

RES

MATRIX · 范式内 / 范式级分诊

IN-PARADIGM / PARADIGM-LEVEL

决策矩阵 · 哪步交 AI / 哪步留人

Decision matrix · to AI / to human

同一个研究动作，范式内交给生成、范式级留给人

In-paradigm hands to generation, paradigm-level stays with the human: the same action, split

把一个研究动作拆开，哪半该丢给生成、哪半必须自己接？这一章给一条判据。

Split a research action in half: which half do you hand to generation? This chapter gives one test.

一句话

In one line

判据只有一条：写得出机器能核验的验收标准，就丢给生成；写不出来、只能说"这对谁成立、在哪套价值框架下成立"，就留给人。

One test, and only one: if you can write an acceptance criterion a machine can check, hand it to generation; if all you can say is “for whom, under which value frame,” keep it with a human.

一条判据，同时回答三章的问题

One test answers three chapters’ questions at once

这张矩阵不只是一张分诊表，它给了整卷一条能落地的判据：写得出机器可核验的验收标准，归左——交给生成，或交给知识图谱的规则；写不出来、只能诉诸"这对谁成立、在哪套价值框架下成立"，归右——留给人。这条判据的用处在于，它同时回答了三章各自在问的问题，让你不必分别记三套规则。RES 04 问护栏该把关到哪：能机检的，交给证据库自动把关。RES 09 问天平该挂起什么：写不出验收标准、又离范式远的那些，正是该挂起去找证据的一格。RES 13 问研究环该把人放在哪一步：人只接住右格的节点。三章看起来各管一摊——护栏、天平、工作流——其实共用同一条判据。记住它，对每个动作你只需要问一句："这写得出可机检的验收标准吗？"答案会同时告诉你，这个动作该进护栏、进天平，还是进人的判断里。

This matrix is more than a triage table. It hands the whole volume one workable test: if you can write a machine-checkable acceptance criterion, the action goes left, to generation or to the graph’s own rules; if you cannot, and all you can say is “for whom, under which value frame,” it goes right, to a human. The point of this test is that it answers three chapters’ separate questions at once, so you don’t need three rule sets in your head. RES 04 asks where the guardrail should gate: whatever is machine-checkable, the evidence base gates automatically. RES 09 asks what the ledger should suspend: exactly the claims that have no acceptance criterion and sit far from the paradigm; suspend them and go find evidence. RES 13 asks where the research loop should place the human: only at the right-cell nodes. The three chapters look like they each manage something different (guardrail, ledger, workflow), but they share one test. Remember it, and for every action you need ask only one question: can I write a machine-checkable acceptance criterion for this? The answer tells you at once whether the action belongs in the guardrail, the ledger, or human judgment.

范式内 · 交给生成（可机检、向数据丰富区聚集）
In-paradigm · hand to generation (machine-checkable, clusters to data-rich)

检索：在已存在的知识里找最近邻——向量搜索、RAG、引文追踪。
Search: nearest-neighbor over existing knowledge: vector search, RAG, citation tracing.
提假设：在既有框架内、向数据最厚处找下一个可检验空白（知识图谱上的填空）。
Hypothesize: inside an existing frame, find the next checkable gap where data is thickest (filling blanks on the graph).
设计实验：在标准框架里排组合、扫参数、跑消融——执行可大规模并行。
Design experiments: enumerate combinations, sweep parameters, run ablations within a standard paradigm: massively parallel execution.
分析：对预定义变量做统计、拟合、可视化、符号回归（AI Feynman 重发现已知方程）。
Analyze: statistics, fitting, visualization, symbolic regression over predefined variables (AI Feynman re-discovering known equations).

范式级 · 留给人（要人定、无最近邻可循）
Paradigm-level · stays with the human (constitutive, no neighbor to follow)

换描述层级：问"现在这套变量是不是正确的描述层级"（Farr 空气质量 → 水传播微生物）。
Change the level of description: ask “is this set of variables even the right level” (Farr’s air quality → waterborne microbes).
跨感官类比：把想法接到具身直觉上（爱因斯坦 16 岁想象骑光、那"冻住的波"让他生理上觉得不对），人有跨模态的广度。
Cross-sensory analogy: wire an idea to embodied intuition (the 16-year-old Einstein imagining riding a light beam, the “frozen wave” that felt physically wrong): humans have cross-modal breadth.
判结论是否值得知：在稀疏、价值负载的域里定"哪个真相重要"，无对错、只有归属。
Judge whether a conclusion is worth knowing: in sparse, value-laden domains, set “which truth matters”: no right answer, only belonging.
判新颖是重构还是噪声：抵抗"离框架远 = 不可信"的结构性偏置（接 RES 09 天平的关键格）。
Judge whether novelty is reframing or noise: resist the “far from paradigm = not credible” structural bias (see RES 09’s key cell).

怎么接起来：生成那边——检索、提假设、设计实验、跑分析——产出的一切，都落进 RES 04 那个可追溯证据库，每条主张挂着证据边。人这边不用去看全部，只从库里挑那几类必须自己判的：天平挂起的、离范式远的、价值负载的。这不是"先让它生成、我再审一遍"那种流程，是让证据库把框架内的东西自动挡住，省下的带宽全部留给换框架的判断。

How the two sides connect: everything the generation side produces (search, hypotheses, experiment designs, analysis) drops into RES 04’s traceable evidence base, each claim carrying its evidence edges. The human side doesn’t review all of it; it only pulls the few kinds of node that must be judged by a person: the ones the ledger suspended, the ones far from the paradigm, the value-laden ones. This is not a “let it generate first, then I review everything again” flow. It lets the evidence base auto-gate everything in-paradigm, so the freed bandwidth goes entirely to paradigm-level judgment.
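The hand-off described above amounts to a filter over the evidence base. A minimal sketch, assuming each node already carries three flags set upstream; the `Node` fields and the example claims are hypothetical and do not describe any actual evidence-base schema.

```python
from dataclasses import dataclass


@dataclass
class Node:
    claim: str
    evidence_edges: list[str]       # references to supporting evidence
    suspended: bool = False         # set by the ledger
    far_from_paradigm: bool = False
    value_laden: bool = False


def human_queue(nodes: list[Node]) -> list[Node]:
    """Pull only the nodes a person must judge; the rest is gated automatically."""
    return [n for n in nodes
            if n.suspended or n.far_from_paradigm or n.value_laden]


# Hypothetical evidence base.
base = [
    Node("parameter sweep result", ["run-001"]),
    Node("anomalous cluster", ["obs-17"], suspended=True, far_from_paradigm=True),
    Node("which question matters next", [], value_laden=True),
    Node("citation trace", ["doc-a", "doc-b"]),
]
queue = human_queue(base)
```

Two of the four nodes reach a person here. The point of the filter is that the other two never need to.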

有效 / 失效的信号

Right vs wrong signals

对的信号：人均工时在框架内动作上往下走、在换框架判断上往上走；左格产物一次就落进可追溯链的比例往上走，这说明带宽是真从执行省给了判断，不是嘴上说说。

Right signals: per-person hours fall on in-paradigm actions and rise on paradigm-level judgment; more left-cell output lands in the traceable chain on the first pass, which shows bandwidth is actually moving from execution to judgment, not just on paper.

同一个词，左边右边说的是两件事

The same word means two different things on each side

这张矩阵最容易被读错成"有些动作归 AI，有些动作归人"。不是这样。真正承重的一点是：同一个动作词，落在左格和右格里，指的根本不是一回事。"提假设"在左格是"在已有框架里找下一个可检验的空白"——最近邻搜索，AI 干得好；在右格是"提一个旧框架里根本立不住的假设"——换描述层级，这在 AI 训练分布之外。"分析"在左格是"对预先定好的变量做拟合"——AI Feynman 重发现已知方程就是这个；在右格是"判断现在用的这套变量本身对不对"——Farr 把空气质量换成水传播微生物，就是这个。所以分诊切的不是"动作类型"，切的是"这一刀下去，落在能机检的一侧，还是只能人定的一侧"。一个动作词常常横跨两格，得先把它劈成两半，才知道哪半交给生成、哪半接住给人。

This matrix is most easily misread as “some actions go to AI, some stay with humans.” That’s not it. What’s actually load-bearing: the same action word means a different thing depending on which cell it lands in. “Hypothesize” in the left cell means finding the next checkable gap inside a frame you already have: nearest-neighbor search, AI is good at this. In the right cell it means posing a hypothesis the old frame couldn’t even hold: changing the level of description, which sits outside AI’s training distribution. “Analyze” in the left cell means fitting predefined variables: AI Feynman re-discovering known equations is exactly this. In the right cell it means judging whether the variables themselves are wrong: Farr swapping air quality for waterborne microbes is exactly this. So the cut triage makes isn’t by action type; it’s by where a given cut lands: the machine-checkable side, or the side only a human can call. A single action word usually straddles both cells; you have to split it before you know which half goes to generation and which half a human has to catch.
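The split can be made explicit by attaching an acceptance check to each half of an action word and routing on whether such a check exists. This is a sketch under assumptions: `ActionHalf`, `cell`, and the tolerance-based check are illustrative names, and the check stands in for any machine-checkable acceptance criterion.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ActionHalf:
    word: str
    meaning: str
    # A machine-checkable acceptance criterion, or None if none can be written.
    acceptance_check: Optional[Callable[[dict], bool]] = None


def cell(half: ActionHalf) -> str:
    """Left if an acceptance criterion exists, right otherwise."""
    return "left" if half.acceptance_check is not None else "right"


def fit_within_tolerance(result: dict) -> bool:
    """Hypothetical criterion for fitting predefined variables."""
    return result["residual"] <= result["tolerance"]


# The same word, split into its two halves.
analyze_fit = ActionHalf("analyze", "fit predefined variables", fit_within_tolerance)
analyze_frame = ActionHalf("analyze", "judge whether the variables are the right ones")

routing = {half.meaning: cell(half) for half in (analyze_fit, analyze_frame)}
```

Both halves share the word "analyze", but only one of them can be handed a function that returns true or false, and that is what decides the cell.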

这张矩阵真正防的，是两个方向相反、却一样致命的错。第一种：把该换框架的动作硬塞进左格。典型说法是"让 AI 决定研究该往哪个方向走"——把"选方向"这个最该由人定的判断，当成可以生成的执行交出去了。后果是 RES 06 的价值判断、RES 12 的方向选择被默认偏置接管，组织在所有指标都绿的情况下越走越窄。第二种：把该交出去的框架内动作死死攥在右格。典型说法是人还在手动追引文、手动扫参数、手动做文献综述——不是因为不该自动化，是出于不信任或者习惯，守着早就能充裕化的执行不放。后果是执行没被充裕，团队照样卡在工时瓶颈上，②省下的时间也没能投回③④。这两种错刚好对称：前一种交出了不该交的东西，后一种守住了不必守的东西；前一种丢的是方向，后一种丢的是杠杆。矩阵的用处，就是给每个动作一条清楚的归格判据，让这两种错都无处可藏。

What this matrix really guards against is two opposite yet equally fatal mistakes. First: forcing a paradigm-level action into the left cell. The classic version sounds like “let AI decide which direction the research should go”: handing away “choosing direction,” the judgment that should sit most firmly with a human, as if it were generatable execution. The result: RES 06’s value judgment and RES 12’s direction-setting get captured by default bias, and the organization narrows while every dashboard reads green. Second: clutching an in-paradigm action in the right cell that should have been handed over. The classic version is people still tracing citations by hand, sweeping parameters by hand, writing literature summaries by hand — not because it shouldn’t be automated, but out of distrust or habit, holding on to execution that’s long been ready to become abundant. The result: execution never gets made abundant, the team stays stuck at the same hours bottleneck, and the time ② would have freed never gets reinvested into ③④. The two errors are exact mirrors: the first hands away what it shouldn’t, the second holds on to what it needn’t; the first costs you direction, the second costs you leverage. What the matrix is for is giving every action one clear test for where it belongs, so neither mistake has anywhere to hide.

RES

BOUNDARY · 护栏的反面

THE GUARDRAIL’S BLIND SPOT

边界 · 知识图谱也会锁死

Boundary · the graph can lock you in

同一道护栏，用错就把描述层级锁死

The same guardrail, misused, locks in the level of description

图谱越完整，越容易把你钉死在错的一层，这一章讲怎么留一道能出去的门。

The more complete the graph, the more likely it pins you to the wrong layer: this chapter is about leaving a door that still opens.

一句话

In one line

护栏用错方向就变笼子：图谱只会组织"已经记下来的变量"，问不出 schema 之外的东西，而且它越自洽，你越看不出这一点。出路只有一条：人来换变量。

Used the wrong way, a guardrail becomes a cage: the graph only organizes variables already recorded, so no query can reach outside the schema, and the more self-consistent it is, the harder that is to notice. The only way out is a human who changes the variable.

schema 越自洽，盲区越难被看见——这是它最危险的地方

The more self-consistent the schema, the harder its blind spot is to see: its most dangerous trait

护栏变笼子这件事有一个反直觉的地方：schema 做得越完整、越自洽，它的盲区反而越难被人发现。一个漏洞百出的图谱经常报错、经常有东西塞不进去，而"塞不进去"恰好是在提醒你："框架可能错了。"可一个设计精良、覆盖全面、查得又快的图谱，会让所有观测都顺滑地落进已有的节点——它从不报错，因为它把每一次观测都成功解释进了现有的变量空间。问题是，"成功解释进现有空间"跟"解释对了"，是两件事。

Farr 的霍乱图就是这样一个"完美自洽"的笼子：每一例死亡都被归因到空气质量的某个梯度，模型拟合得很好，预测也不算差，整套系统运转得毫无破绽——正因为如此，没人会去怀疑"空气质量"这个描述层级本身就错了。这也是为什么 RES 08 那条覆盖广度的反指标必须当成例行体检来跑：一个系统的每个局部指标都健康，却系统性地看不见某一类东西，唯一能发现盲区的办法，是主动去问"我们的图谱在结构上不可能表示什么"——而不是等它报错，因为它永远不会报错。

There’s something counter-intuitive about how a guardrail turns into a cage: the more complete and self-consistent the schema, the harder its blind spot is to spot. A crude, leaky graph errors constantly, has things that just won’t fit, and “won’t fit” is exactly the signal telling you the frame might be wrong. But a well-designed, fully covering, fast-querying graph lets every observation slide smoothly into an existing node. It never errors, because it has successfully explained every observation into the variable space it already has. The catch: “successfully explained into the existing space” and “explained correctly” are two different things.

Farr’s cholera map is exactly this kind of “perfectly self-consistent” cage: every death got attributed to some gradient of air quality, the model fit well, the predictions weren’t bad, the whole system ran without a single visible flaw, and precisely because of that, no one thought to question whether “air quality” was the wrong level of description to begin with. This is why RES 08’s breadth counter-indicator has to run as a routine checkup: when every local metric on a system reads healthy yet it’s systematically blind to a whole class of things, the only way to find the blind spot is to actively ask what the graph is structurally incapable of representing — not wait for it to error, because it never will.

机理 · 护栏为何会变成笼子：说穿了很直白：知识图谱要把"主张—证据—来源"结构化，前提是这些节点和边的类型已经被定义好了。schema 一旦定下来，生成就只能在这个既有的变量空间里排列组合——这正是它高效的原因，也正是它的盲区所在。Farr 的数据是围着 miasma（瘴气）组织的，schema 里压根没有"病原体"这个节点，于是再聪明的查询也问不出一个 schema 外的变量。AI 在这上面加速，只会把"空气质量"这个错误层级钻得更深、更自洽、更难被推翻——护栏把框架内的可信度，筑成了一堵连换框架都翻不过去的墙。

Mechanism · why a guardrail becomes a cage: put plainly, a knowledge graph can structure claim–evidence–source only if its node and edge types are already defined. Once the schema is set, generation can only recombine within that existing variable space, which is exactly why it is efficient and exactly where its blind spot sits. Farr’s data was organized around miasma; the schema had no “pathogen” node at all, so no query, however clever, could ask for a variable the schema didn’t have. AI accelerating on top of this only drills the wrong “air quality” level deeper, making it more self-consistent and harder to overturn. The guardrail builds in-paradigm credibility into a wall that even a reframing can’t climb.

留"换变量"的口子：schema 不能只读不写，给"提出新节点类型/新关系类型"留一条人发起的通道，并定期复审 schema 本身是否还是对的描述层级。
Leave a “change the variable” door: the schema must be writable, not read-only; keep a human-initiated channel to propose new node/edge types, and periodically review whether the schema is still the right level of description.
把"未被任何节点解释的异常"显形：不要让无法入图的观测被默默丢弃；恰是这些落在 schema 外的残差，往往是换地图的入口（霍乱在 schema 外的死亡聚集）。
Surface “anomalies no node explains”: do not let observations that fail to enter the graph be silently dropped; these out-of-schema residuals are often the doorway to a new map (the cholera death-clusters that the schema could not place).
给护栏配一个"反护栏"复盘：定期问"我们的图谱在系统性地看不见什么"，把 RES 08 的覆盖广度反指标接进来当例行体检。
Pair the guardrail with an “anti-guardrail” review: regularly ask “what is our graph systematically blind to”, wiring RES 08’s breadth counter-indicator in as a routine checkup.

证据锚 / 失败模式

Evidence anchor / failure mode

Farr 的霍乱图围绕"空气质量"组织，完整可追溯却推不出水传播；它只测框架内整洁，测不到图谱看不见的那层。

Farr’s cholera map was organized around “air quality”: complete and traceable, yet unable to yield waterborne transmission. It measures in-paradigm tidiness only, never the layer the graph cannot see.

残差才是入口：把"图谱解释不了的"专门留住。一个健康研究系统跟一个完美笼子之间的全部区别，就在于怎么对待残差——那些任何现有节点都解释不了、进不了图的观测。笼子的本能是把残差当噪声扫掉，因为它们弄乱了图的整洁，拉低了"主张落进可追溯链的比例"这个看着很正的指标。但科学史一次次证明：换地图的入口，几乎总是从残差进的。霍乱在 miasma schema 之外的死亡聚集是残差；迈克尔逊-莫雷"测不到以太风"是残差；黑体辐射在经典理论下的"紫外灾难"也是残差。每一个后来引发范式转移的东西，最初都长得像"现有框架解释不了、看上去是误差"的样子。

所以护栏设计里最反直觉、也最承重的一条，是专门给残差建一个不会被自动清扫的收容区，定期让人去复审：这些残差是测量误差，还是在合起来指向一个 schema 外的新变量？这正是 RES 05 那道"揭示盲区 → 造新 eval"的复现之墙，落到知识图谱上的具体样子，也是 RES 11 给护栏留的"换变量口子"真正运行起来的样子。一个系统每清扫掉一批残差，可能就在清掉它下一次范式转移的种子。

The residual is the doorway: deliberately keep what the graph cannot explain. The entire difference between a healthy research system and a perfect cage comes down to how each treats the residual: observations that no existing node can explain, that fail to enter the graph at all. A cage’s instinct is to sweep residuals away as noise, because they mess up the graph’s tidiness and drag down that virtuous-looking metric, “share of claims landing in the traceable chain.” But the history of science says otherwise, again and again: the doorway to a new map is almost always the residual. The cholera death-clusters outside the miasma schema were a residual. Michelson–Morley’s undetected ether wind was a residual. Black-body radiation’s ultraviolet catastrophe under classical theory was a residual. Everything that later triggered a paradigm shift started out looking like error the current frame couldn’t explain.

So the most counter-intuitive, most load-bearing rule in guardrail design is this: build a holding area for residuals that the auto-sweep doesn’t touch, and have a human review it on a schedule — are these residuals measurement error, or are they collectively pointing at a variable outside the schema? This is exactly RES 05’s wall (reveal a blind spot, build a new eval) landing on the knowledge graph; it’s also RES 11’s change-the-variable door, actually running. Every batch of residuals a system sweeps away might be the seed of its next paradigm shift.
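The holding-area rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any real system: `SCHEMA_NODE_TYPES`, `Observation`, and `EvidenceGraph` are hypothetical names, and the node types are invented to mirror a miasma-era schema with no "pathogen" node. The point it shows is that an observation the schema cannot place is routed to a list that nothing sweeps automatically, and that the "traceable share" metric drops when residuals are kept.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical schema: the only node types this toy graph can represent.
# "pathogen" is deliberately absent.
SCHEMA_NODE_TYPES = {"air_quality", "elevation", "district"}


@dataclass
class Observation:
    description: str
    explained_by: Optional[str]  # node type that accounts for it, or None


@dataclass
class EvidenceGraph:
    nodes: List[Observation] = field(default_factory=list)
    # Holding area: nothing in this class ever deletes from it.
    residuals: List[Observation] = field(default_factory=list)

    def ingest(self, obs: Observation) -> str:
        """Place the observation in the graph, or keep it as a residual."""
        if obs.explained_by in SCHEMA_NODE_TYPES:
            self.nodes.append(obs)
            return "graph"
        self.residuals.append(obs)  # kept for human review, not dropped
        return "residual"

    def traceable_share(self) -> float:
        """Share of observations that landed inside the schema."""
        total = len(self.nodes) + len(self.residuals)
        return len(self.nodes) / total if total else 0.0
```

Keeping residuals makes `traceable_share()` look worse, which is exactly the pressure the text warns about: a system that optimizes that number alone has a reason to empty the list.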

护栏跟笼子，是同一件东西的两种用法

The guardrail and the cage are the same thing, used two ways

RES 04 说知识图谱是好护栏，这一节又说它会变笼子——听着矛盾，其实是同一套机制在两种使用次序下的两副面孔。schema 服务于生成、并且对人保持可写的时候，它是护栏：让海量生成落进可追溯的结构里，冲突能显形，工作能被复现。schema 反过来定义了什么算"研究"、并且只读不可写的时候，它是笼子：生成只能在既有变量空间里打转，schema 外的观测被悄悄丢掉。决定它是护栏还是笼子的，不是图谱做得精不精巧——恰恰相反，图谱越精巧、越自洽，它当笼子时就锁得越死。Farr 的霍乱图正是一个"完美的笼子"：结构完整、数据可追溯、查询又快，唯独没有"病原体"这个节点——于是它把整个研究共同体钉在"空气质量"这个错误的层级上，钉得严丝合缝。

RES 04 says the knowledge graph is a good guardrail; this section says it turns into a cage. That sounds like a contradiction, but it’s one mechanism wearing two faces depending on the order you use it in. When the schema serves generation and stays writable by humans, it’s a guardrail: mass generation lands in a traceable structure, conflicts surface, work can be replicated. When the schema instead defines what counts as “research”, and is read-only, it’s a cage: generation can only circle within the existing variable space, and out-of-schema observations quietly get dropped. What decides guardrail or cage isn’t how sophisticated the graph is; quite the opposite: the more sophisticated and self-consistent it is, the tighter it locks as a cage. Farr’s cholera map is exactly a “perfect cage”: structurally complete, data traceable, queries fast, missing only a “pathogen” node, and so it pinned the whole research community to the wrong “air quality” layer, pinned it flawlessly.

FIG. 11.0 / 图谱盲区：引文/自动化流水线在结构上看不见什么

THE GRAPH BLIND SPOT: WHAT CITATION / AUTOMATION PIPELINES STRUCTURALLY CANNOT SEE

看懂：内圈是 schema 能表示的变量空间（生成在这里顺滑、可追溯、永不报错）；圈外是 schema 无法表示的残差（被当噪声扫掉）。换地图的入口，几乎总在圈外。

Read: the inner ring is the variable space the schema can represent (generation here is smooth, traceable, never errors); outside the ring are the residuals the schema cannot represent (swept away as noise). The doorway to a new map is almost always outside the ring.

护栏和笼子是同一张图谱：圈内是 schema 能表示的变量空间，AI 在这里加速只会把现有层级钻得更自洽、更难推翻；圈外是它结构上无法表示的残差。Farr 把每一例霍乱死亡都成功归因到"空气质量"的某个梯度，系统平滑、永不报错，正因如此没人怀疑"空气质量"这层本身错了。盲区不是图谱不够好，恰恰是它太好：越完整自洽，越把人固定在错的描述层级上。唯一的出路不在圈内的任何查询，而在那道虚线门：人发起的"换变量"。这也是 RES 04 护栏命题的承重反面。

The guardrail and the cage are one graph: inside the ring is the variable space the schema can represent, where accelerating AI only makes the existing level more self-consistent and harder to overturn; outside is the residual it structurally cannot represent. Farr successfully attributed every cholera death to some gradient of “air quality”; the system ran smoothly and never errored, and precisely because of that, no one suspected the “air quality” level was itself wrong. The blind spot is not that the graph is too poor; it is that the graph is too good: the more complete and self-consistent it is, the harder it pins humans to the wrong level of description. The only way out lies not in any in-ring query but in that dashed door: a human-initiated “change the variable.” This is the load-bearing flip side of RES 04’s guardrail thesis.

RES

FRONTIER · 元科学的模式生物

META-SCIENCE’S MODEL ORGANISM

前沿 · 把判断节点抬高一层

Frontier · raise the judgment node a level

当一阶研究近免费，人退守到设计科学本身

When first-order research is near-free, the human retreats to designing science itself

跑实验也近乎免费之后，人还能往哪儿退？除了价值判断，还有一条退路。

Once running experiments is near-free too, where else can the human retreat? Beyond value, there is a second line.

一句话

In one line

人退守其实有两条线：一条是价值论（哪个真相值得追），一条是方法论（该跑哪个实验、什么才算证据）。至今没有判据能说清"什么框架比另一个更优"，所以跑得更快不等于走得更对。

The human’s retreat runs along two lines, not one: axiological (which truth is worth chasing) and methodological (which experiment to run, what counts as evidence). We still have no test for what makes one paradigm better than another, so running faster is not the same as getting closer to right.

证据锚 · 观点综述 · 等级 Ⅳ–Ⅴ（论证性，非实证）

Evidence anchor · opinion/review · grade Ⅳ–Ⅴ (argumentative, not empirical)

Djajadikerta,《Designing AI for Disruptive Science》, Asimov Press, 2026-03-23, DOI 10.62211/29ej-27et。提出"AI 科学家或许给元科学第一个模式生物"：现实中无法对科研机构做对照实验，但可让 AI agent 种群在不同研究条件下并行运行、细测哪种条件出更多概念重组。历史锚（可另行回溯）：Bell Labs、Xerox PARC、早期剑桥 LMB＝受体制保护、能追"看上去无用"想法的小团队，与 AlphaZero 独立自博弈下出原创棋（21.Bg5）同出一辙。〔本条为观点文论断，勿当数据；其转引实证各需回溯原始文献定级〕

Djajadikerta, “Designing AI for Disruptive Science,” Asimov Press, 2026-03-23, DOI 10.62211/29ej-27et. It proposes that “the AI scientist may give meta-science its first model organism”: one cannot run controlled experiments on real research institutions, but one can let populations of AI agents run in parallel under different research conditions and measure closely which conditions yield more conceptual recombination. Historical anchor (traceable separately): Bell Labs, Xerox PARC, and the early Cambridge LMB were small, institutionally protected teams free to chase “seemingly useless” ideas, the same pattern as AlphaZero finding an original move (21.Bg5) under independent self-play. [This is an essay’s argument, not data; each empirical claim it cites needs tracing to its original source for grading.]

第一次能对"科学制度"本身做对照实验

For the first time, a controlled experiment on scientific institutions themselves

元科学一直是一门没有实验台的学问：你没法把同一个科学共同体复制十份、每份换一套激励，再看哪份更容易出颠覆性的东西——现实里只有一个共同体，跑得又慢，变量又多。AI 科学家可能第一次给元科学一个模式生物：让一群 AI agent 在不同研究条件下并行跑——这一群按产出考核，那一群按新颖考核；这一群层级森严，那一群扁平自治；这一群只准待在框架内，那一群被明着鼓励换框架——然后细看哪种条件下涌现更多概念重组。这是历史上头一回，"什么样的组织结构孕育颠覆"从一个只能靠案例回答的问题（贝尔实验室、施乐 PARC、剑桥 LMB 这些"小团队 + 体制保护"的轶事），变成一个可以做对照实验的问题。

它和 AlphaZero 独立自博弈下走出人类从没下过的原创棋（21.Bg5）是同一个道理：把系统从现有框架的训练数据里放出来，让它在受保护的环境里自由探索。〔这是 Asimov 一篇观点文的推演，证据级 Ⅴ；模式生物这个说法尚未被大规模实证，标为前沿命题〕

Meta-science has long been a discipline with no bench: you can’t copy one scientific community ten times, hand each a different incentive structure, and see which breeds more disruption. In reality there is only one community, it runs slowly, and it has too many variables. The AI scientist may give meta-science its first model organism: populations of AI agents run in parallel under different research conditions (this group scored on output, that one on novelty; this group steeply hierarchical, that one flat and autonomous; this group confined to the paradigm, that one openly encouraged to switch frames), and one then measures closely which conditions yield more conceptual recombination. For the first time, “what organizational structure breeds disruption” turns from a question you can only answer with anecdotes (Bell Labs, Xerox PARC, the Cambridge LMB: small teams under institutional protection) into one you can run a controlled experiment on.

It is the same logic as AlphaZero, under independent self-play, playing an original move no human had ever played (21.Bg5): free the system from the existing paradigm’s training data and let it explore in a protected space. [A projection from an Asimov Press essay, grade Ⅴ; the model organism itself has not been demonstrated at scale and is flagged as a frontier claim.]
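The three contrasts named in the text (output vs novelty scoring, hierarchical vs flat, confined vs encouraged to reframe) form a full factorial design. The sketch below only enumerates the experimental arms; the factor labels are assumptions made for illustration, and how to run agent populations or measure "conceptual recombination" is left open, as it is in the essay.

```python
from itertools import product

# Hypothetical factor levels, taken from the three contrasts in the text.
INCENTIVES = ("output", "novelty")
STRUCTURES = ("hierarchical", "flat")
FRAME_POLICIES = ("in_paradigm_only", "reframing_encouraged")


def build_conditions():
    """Enumerate every combination of the three factors (2 x 2 x 2 arms)."""
    return [
        {"incentive": i, "structure": s, "frame_policy": f}
        for i, s, f in product(INCENTIVES, STRUCTURES, FRAME_POLICIES)
    ]


conditions = build_conditions()
for arm_id, condition in enumerate(conditions):
    print(arm_id, condition)
```

Crossing all factors, rather than varying one at a time, is what would let such an experiment separate the effect of incentives from the effect of structure; with a single real scientific community, those factors cannot be pulled apart.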

把两条退路并排放，不是让你二选一。这一卷特意画出两条退路而不是一条，因为只画"退到价值——哪个真相值得"会漏掉一半。第一条是 RES 06 的价值论退路：当提问被充裕，人退到"定哪个真相重要"。第二条是这一节的方法论退路：当跑实验也近乎免费，人退到"设计科学本身——该跑哪个实验、什么算证据、什么样的研究设计真能把假设区分开"。这两条不冲突，是正交的：一条定方向（往哪走值得），一条定制度（什么样的科学机器能让值得的方向被生出来）。一个只退到价值、没退到方法的研究者，知道该追什么真相，却没有一套能生出颠覆的科学制度；一个只退到方法、没退到价值的研究者，有一台精良的方法机器，却不知道该指向哪里。两条都退了，才补全内核④"人回归意义"在研究这一面的全貌：意义既是方向的意义，也是制度的意义。

Placing the two retreat lines side by side is not offering a choice between them. This volume deliberately draws two lines, not one, because drawing only “retreat to value: which truth is worth it” misses half the picture. The first is RES 06’s axiological retreat: as questioning turns abundant, the human retreats to setting which truth matters. The second is this section’s methodological retreat: as running experiments turns near-free too, the human retreats to designing science itself: which experiment to run, what counts as evidence, what research design actually discriminates between hypotheses. The two don’t conflict; they are orthogonal. One sets direction: where it is worth going. The other sets institution: what scientific machine lets worthy directions get born in the first place. A researcher who retreats only to value and not to method knows which truth to chase but has no institution that breeds disruption; one who retreats only to method and not to value has a fine method-machine but no idea where to point it. Retreating on both is what completes kernel ④’s “humans return to meaning” on the research side: meaning here is both the meaning of direction and the meaning of institution.

交棒锚 → 组织（什么结构孕育颠覆）

Hand-off anchor → Org (what structure breeds disruption)

"模式生物"让"什么结构/激励出颠覆"第一次可被实验，是研究向组织交棒的素材。

The “model organism” lets “which structure / incentives breed disruption” be experimented on for the first time: material for the research-to-org hand-off.

我们还不知道"什么规则让框架更优"——所以跑得快不等于走得对

We still don’t know what makes a paradigm better — so faster isn’t automatically righter

元科学之所以变得值钱，根子在一个不太舒服的事实：直到今天，我们都没有一条完备的判据，能说清"一个框架为什么比另一个更好"。历史上提出过的候选都不完备。"简单"是一个——符号回归的 AI Feynman 把 100 条费曼方程全找了出来（旧软件只找到 71 条），最小描述长度原理也把"更简单大概率更对"部分地形式化了；但简单从来不保证真——J.J. Thomson 的葡萄干布丁原子模型又简单又优雅，却整个错了。

"类比"是另一个候选——好的框架常常善于在不相干的领域之间打比方（爱因斯坦借光的图像、达尔文借 Lyell 的地质学和 Malthus 的经济学）；但类比同样会骗人：把表面像、实际不相干的两件事当成一回事。既然连"什么让框架更优"都还没有判据，那"把执行加速一万倍"就不会自动生出更好的框架——它只会在当前框架里跑得更快。这正是"加速不等于进步"最深的根据，也是元科学（研究"什么样的科学制度能生出更优的框架"）从一件奢侈品变成一件必需品的原因。互联网就是前车之鉴：它让知识变得可搜索，却因为职业激励这类结构性低效，没能在规模上带来更快的科学，反而让引用变窄了。

Meta-science gains value for an uncomfortable reason: to this day we don’t have a complete criterion for why one paradigm is better than another. Every historical candidate is incomplete. Simplicity is one: symbolic regression’s AI Feynman recovered all 100 Feynman equations (an older tool got only 71), and the minimum-description-length principle partly formalizes “simpler is more likely right.” But simplicity guarantees nothing: J.J. Thomson’s plum-pudding atom was simple and elegant and completely wrong.

Analogy is another candidate: good paradigms are often good at drawing analogies across unrelated domains (Einstein borrowing the image of light, Darwin borrowing Lyell’s geology and Malthus’s economics). But analogy deceives just as easily: it can pass off two things that look alike on the surface but are unrelated as the same thing. Since we don’t even have a criterion for what makes a paradigm better, “accelerating execution ten-thousand-fold” won’t automatically produce better paradigms; it will just run the current one faster. This is the deepest reason “acceleration isn’t progress” holds, and why meta-science (the study of which scientific institutions breed better paradigms) moves from a luxury to a necessity. The internet is the cautionary tale here: it made knowledge searchable, yet because of structural inefficiencies like career incentives, it didn’t bring faster science at scale; it narrowed citation instead.

自主性阶梯：选方向是最后一阶，也是最难的一阶

The autonomy ladder: agenda selection is the last rung, and the hardest

把"元科学在升值"这件事画得更准，要借 Anthropic 2026 年那道自主性阶梯。它把研究 agent 的能力从左到右排成一道梯子：执行良定义的实验（最左，已经能匹敌人类）→ 设计实验 → 综合发现 → 选研究议程（最右，最难）。这道梯子跟 RES 02 的可验证性梯度是同一个形状——越往右，可机检的对错代理越薄，剩下的越是"无最近邻可循"、只能由人来定的判断。下面这张图把它和"生成充裕 vs 判断稀缺"叠在一起看：横轴是自主性从执行走到方向，竖轴是这项能力当前的充裕度，曲线显示：生成那一侧（执行、设计实验）已经逼近饱和，判断那一侧（综合、选方向）还陡峭地稀缺着。这正是命题的形状：能力沿着梯子往上爬，但最右边那一阶——选方向——的稀缺，不会因为模型变强而自动消失。

To draw the rising value of meta-science more precisely, borrow Anthropic’s 2026 ladder of autonomy. It arranges a research agent’s capabilities left to right: execute a well-defined experiment (leftmost, already matching humans) → design experiments → synthesize findings → select the research agenda (rightmost, hardest). This ladder has the same shape as RES 02’s verifiability gradient: the further right you go, the thinner the machine-checkable proxy for correctness gets, and the more what’s left is judgment with no nearest neighbor to follow, which only a human can settle. The figure below overlays it with “generation abundant vs judgment scarce”: the x-axis runs from execution to direction, the y-axis is that capability’s current abundance, and the curve shows the generation side (execution, experiment design) nearing saturation while the judgment side (synthesis, direction) stays steeply scarce. That’s the shape of the thesis: capability climbs the ladder, but the rightmost rung, agenda selection, doesn’t lose its scarcity just because models get stronger.

FIG. 12.0 / 自主性阶梯 × 充裕度：生成侧饱和，判断侧仍稀缺

AUTONOMY LADDER × ABUNDANCE: GENERATION SATURATES, JUDGMENT STAYS SCARCE

看懂：横轴从"执行"到"选方向"，曲线是当前充裕度。左段已逼近天花板（AI 匹敌人类），右段陡降：方向选择是最后一阶。

Read: the x-axis runs from “execute” to “select direction”; the curve is current abundance. The left segment sits near the ceiling (AI matches humans); the right drops steeply: agenda selection is the last rung.

这条曲线把整卷的命题压成一张图：AI 的能力沿自主性阶梯上移（左段已饱和），但"充裕度"在右段陡降：方向选择是最后、也最难自动化的一阶，因为它要的是 taste（判断哪些问题重要、哪些异常值得追、哪些诱人想法是死路），而 taste 的稀缺是结构性的、不随算力消失。元科学之所以升值，正因为它研究的恰是"怎样的科学制度能把右段那条曲线抬起来"。〔Anthropic RSI 阶梯为公司自述，Ⅳ–Ⅴ；曲线为示意，非测量数据〕

This curve compresses the whole volume’s thesis into one figure: AI’s capability climbs the autonomy ladder (the left has saturated), but “abundance” drops steeply on the right: agenda selection is the last and hardest rung to automate, because what it needs is taste (judging which problems matter, which anomalies are worth chasing, which seductive ideas are dead ends), and taste’s scarcity is structural, not dissolved by compute. Meta-science gains value precisely because what it studies is “what scientific institution can lift that right-side curve.” [The Anthropic RSI ladder is a company’s own account, Ⅳ–Ⅴ; the curve is schematic, not measured data.]

RES

CRITIQUE · 旧学术机器

THE OLD ACADEMIC MACHINE

结构批判 · 点名的失效件

Structural critique · named failing parts

旧学术机器的每个承重件，都是为"产出稀缺"调校的——而现在产出过剩

Every load-bearing part of the old academic machine was tuned for scarce output, and output is now in surplus

学术建制那五个承重件，为什么会在 AI 面前一起反转？一件件点名。

Why do the five load-bearing parts of the academic establishment invert together under AI? Named one by one.

一句话

In one line

这五件东西共用一块地基："产出量等于诚实信号"。AI 把伪造的成本压到近乎零，地基一塌，原本的过滤器全都反转成了刷分的通道。

All five stand on one bedrock: output volume is an honest signal. AI drove the cost of faking to near-zero; once the bedrock gives, every filter inverts into a channel for gaming the score.

这五个装置为什么会一起坏——它们共用一个刚被打穿的前提

Why all five break together: they share one assumption that just got punctured

单看这五件事，像五个互不相关的毛病；放到一起看，才发现它们共用同一个前提——而这个前提刚刚被 AI 打穿了。这个前提是：产出量是诚实的信号。写出一篇论文、攒够一批引用、跑完一个项目，过去都贵到足以证明背后有真实的智力投入，所以拿"量"当"质"的代理，误差还能接受。整套建制的代理链都搭在这块地基上：发表数代理生产力，引用数代理影响力，h 指数把两者打包代理"学者价值"，影响因子代理"期刊质量"，经费规模代理"研究重要性"。每一环都是用一个便宜、可数的量，去代理一个昂贵、难判的质。执行贵的时候，这条代理链的误差有限，因为刷不动——你没法低成本地伪造一百篇看起来像样的论文。

AI 恰好把这件事变便宜了。代理链最怕的不是有人作弊，是作弊的边际成本趋于零：一旦"看起来像样的产出"能近乎免费地批量生成，所有靠"量"当代理的指标会同时失去鉴别力——这是 Goodhart 定律的极端形态（一个度量一旦成为目标，就不再是好度量），只不过 AI 把"成为目标后失效"这件事，从原本的数年压缩到了数周。所以下面这五件不是五个孤立的故障，是同一块地基塌了之后，盖在上面的五个房间一起裂。

Seen one at a time, these five look like five unrelated ailments. Put them side by side and you find they share one assumption, and that assumption has just been punctured by AI. The assumption: output volume is an honest signal. Writing a paper, accruing a batch of citations, finishing a project used to be expensive enough to prove real intellectual work behind it, so quantity standing in for quality carried tolerable error. The whole establishment’s proxy chain sits on this bedrock: publication count proxies productivity, citation count proxies influence, the h-index bundles both to proxy a scholar’s worth, the impact factor proxies a journal’s quality, grant size proxies a study’s importance. Every link is a cheap, countable quantity standing in for an expensive, hard-to-judge quality. When execution was expensive, this chain’s error was bounded, because you couldn’t game it. You couldn’t cheaply fake a hundred plausible-looking papers.

AI is precisely what made that cheap. What a proxy chain fears most isn’t a cheater; it’s the marginal cost of cheating falling to zero. Once “plausible-looking output” can be mass-produced near-free, every metric that proxies via quantity loses its power to discriminate, all at once. That’s Goodhart’s law in its most extreme form (a measure that becomes a target stops being a good measure), except AI compresses the timeline from years to weeks. So the five below aren’t five isolated failures; they’re five rooms cracking together after the one foundation under them gave way.
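A toy calculation makes the "marginal cost of cheating" argument concrete. It is a deliberately crude sketch: the function names, the budget, and every cost figure are invented for illustration, and the model assumes output count is simply budget divided by cost per paper. It shows only the direction of the effect: the same volume metric ranks the honest researcher first while faking is expensive and last once faking is nearly free.

```python
def papers_produced(budget: float, cost_per_paper: float) -> int:
    """Number of papers a fixed effort budget buys at a given unit cost."""
    return int(budget // cost_per_paper)


def volume_ranks_honest_first(
    fake_cost: float, real_cost: float = 100.0, budget: float = 1000.0
) -> bool:
    """True if a pure volume metric ranks the honest researcher above the gamer.

    All numbers are hypothetical units of effort, not measured data.
    """
    honest = papers_produced(budget, real_cost)  # real work, fixed cost
    gamer = papers_produced(budget, fake_cost)   # plausible-looking output
    return honest > gamer


# Faking is costly: the volume proxy still filters.
print(volume_ranks_honest_first(fake_cost=200.0))
# Faking is near-free: the same proxy now rewards the generator.
print(volume_ranks_honest_first(fake_cost=1.0))
```

Nothing about the metric changed between the two calls; only the cost of targeting it did, which is the sense in which the proxy "inverts" rather than merely degrades.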

FIG. 13.0 / 代理链：当伪造产出的边际成本趋零，每个代理同时反转

THE PROXY CHAIN: WHEN THE MARGINAL COST OF FAKING OUTPUT GOES TO ZERO, EVERY PROXY INVERTS

看懂：每一行是一个"便宜可数量 → 昂贵难判质"的代理；左列是装置，中列是它代理的东西，右列是 AI 把伪造成本压到零后它反转成的样子。整条链共用最底下那条"量=诚实信号"的地基。

Read: each row is one “cheap countable quantity → expensive hard-to-judge quality” proxy; the left column is the device, the middle what it proxies, the right what it inverts into once AI drives the cost of faking to zero. The whole chain rests on the bottom “quantity = honest signal” bedrock.

这张图的论点更精确，不止于"指标不好"：每个指标的失效时刻，都是它"被瞄准的边际成本"跌破某个阈值的时刻。执行昂贵时，这些代理是有效过滤器；执行近免费时，同一个代理变成高速生成器的刷分通道。注意右列全是橙色，它们不是新毛病，是旧装置在新成本结构下的镜像反转。下面五小节逐件展开机理。

This figure’s claim is not “metrics are bad” but something sharper: each metric’s moment of failure is the moment its “marginal cost of being targeted” drops below some threshold. When execution was expensive, these proxies were effective filters; when execution is near-free, the same proxy becomes a fast generator’s scoring channel. Note that the right column is all orange: these are not new ailments but the mirror-inversion of old devices under a new cost structure. The five subsections below unfold each mechanism.

① "不发表就出局" + h 指数：拿产量当价值，正中高速生成器的下怀

① “Publish or perish” + the h-index: counting output as worth — exactly what a fast generator wants

装置原意："publish or perish"和 h 指数〔Hirsch 2005，PNAS，R21，证据级 Ⅱ〕都拿"可数的产出"当价值的代理：评委读不完每个人的全部工作，就假定发得多、被引得多≈又高产又有影响力。执行昂贵的年代这个误差还能忍，因为产量本身就是努力的证据。失效机理：h 指数把"高被引论文的数量"压成一个数字（最大的 h，使得有 h 篇论文各被引至少 h 次），而这个数字有两条都能刷的路：多发（增加可计入的论文数）、多被引（抬高每篇的被引数）。Goodhart 定律〔Strathern 1997 对 Goodhart 的转述，R22，证据级 Ⅳ〕讲得很干脆：这个数字一旦成为目标，学者就会去优化数字本身，而不是它原本想代理的东西。切香肠式发表（把一个研究拆成好几篇最小可发表单元）、引用环（互相引用刷分）、自引，都是对着 h 指数做的理性优化。

AI 怎么把它推到极端：过去刷 h 指数受限于"写论文很贵"；现在 LLM 能近免费地批量产出格式齐全、看起来像样的稿件。当"看起来像研究的东西"能批量生成，任何拿产量当代理的指标都会瞬间失去鉴别力——它本来是用来挡"没干活的人"，现在反倒最方便"用机器刷量的人"。这正是 RES 03 那条"AI 拿与既有分布的距离当唯一代理"在激励层的镜像：指标越是奖励"可数的像样产出"，高速生成器就越是它的最优解。

What the device meant: “publish or perish” and the h-index 〔Hirsch 2005, PNAS, R21, grade Ⅱ〕 both use countable output as a stand-in for worth: a committee can’t read everyone’s complete body of work, so it assumes prolific and highly cited roughly equals productive and influential. When execution was expensive this error was tolerable, because volume itself was evidence of effort. The failure mechanism: the h-index compresses “how many highly cited papers” into one number (the largest h such that h of a scholar’s papers have at least h citations each), and that number has two gameable routes: publish more papers that can count toward it, and get each of them cited more. Goodhart’s law 〔Strathern 1997’s restatement of Goodhart, R22, grade Ⅳ〕 puts it bluntly: once a number becomes a target, scholars optimize the number, not whatever it was meant to stand in for. Salami-slicing (splitting one study into several least-publishable units), citation rings (mutual citation for score), and self-citation are all rational responses to the h-index.

How AI pushes it to the extreme: gaming the h-index used to be capped by “writing a paper is expensive”; now an LLM can mass-produce, near-free, manuscripts that are properly formatted and look plausible. Once “things that look like research” can be batch-generated, any metric that proxies via volume loses its discriminating power at once: built to keep out people who did no work, it is now most convenient for people who game volume with a machine. This is the incentive-side mirror of RES 03’s “AI uses distance from the existing distribution as its only proxy”: the more a metric rewards countable, plausible-looking output, the more a fast generator becomes its optimal solution.
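The h-index is simple enough to compute directly, which also makes the gaming routes easy to see. The function below follows the standard definition (the largest h such that h papers have at least h citations each). The citation counts in the example are invented, and the assumption that a sliced study’s citations divide evenly across its slices is a simplification made only for illustration.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical numbers: one study published whole vs. sliced into three.
whole_study = [9]        # one paper with 9 citations
sliced_study = [3, 3, 3]  # same work, three minimal papers, 3 citations each

print(h_index(whole_study))   # 1
print(h_index(sliced_study))  # 3
```

The total citation count is identical in both cases; only the packaging changed, yet the index triples. That is the sense in which salami-slicing is a rational response to the metric rather than a quirk of individual behavior.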

② 影响因子：拿期刊均值代理单篇质量，把"追热点"焊进了激励

② The impact factor: proxying a single paper’s quality by a journal mean, welding “hype-chasing” into incentives

装置原意：期刊影响因子（Garfield 1955 提出，后成 JCR 商业指标，R23，证据级 Ⅳ）本来是给图书馆挑订阅用的——它是一份期刊近两年文章的平均被引数，从来没打算拿来评单篇论文或单个学者，Garfield 本人也多次警告过这种误用。失效机理：把"期刊均值"当"单篇质量"，是个统计学上的范畴错误：期刊被引分布极度长尾（少数文章贡献了绝大多数引用），拿均值去代理任意一篇的质量，误差大到没什么意义。但因为影响因子可数、可排序、跨学科能比，它被招聘、评职称、经费评审广泛拿来用，学者的理性反应就是"往高影响因子的期刊投"，而高影响因子期刊系统性地偏好新颖、热点、阳性结果：可靠但不性感的工作——复现、阴性结果、方法学订正——被结构性地挤了出去。这正是 RES 08 hypernormalization 在激励侧的来源：不是有人存心想做窄，是指标在奖励"热"，不是"对"。

AI 怎么放大它：生成一旦近免费，"追当前热点、批量产出符合高影响因子期刊口味的稿件"就成了一套可以自动化的策略。AI 最擅长的恰是拟合已有分布（RES 06），影响因子奖励的恰恰是"贴近当前热点分布"——两者一拍即合，把科学进一步推向"所有人都在追同一批热问题"的同质化深渊。可靠性和新颖性本来是两件事，影响因子把它们混成一个"高被引=好"的单一信号，而 AI 让追这个信号的成本趋于零。

What the device meant: the journal impact factor (proposed by Garfield in 1955, later a JCR commercial metric, R23, grade Ⅳ) was built for librarians choosing subscriptions — it’s the mean citation count of a journal’s articles over the prior two years, never meant to judge a single paper or scholar, a misuse Garfield himself warned against repeatedly. The failure mechanism: treating a journal’s mean as one paper’s quality is a statistical category error: journal citation distributions are extremely long-tailed (a handful of articles account for most citations), so a mean is a meaningless proxy for any given paper. But because the impact factor is countable, sortable, and comparable across fields, it got widely adopted in hiring, tenure, and grant review, so a scholar’s rational response is to submit to high-IF journals, which systematically prefer novelty, hype, and positive results. Reliable-but-unsexy work (replication, null results, methodological corrections) gets structurally squeezed out. This is where RES 08’s hypernormalization comes from on the incentive side: no one is trying to go narrow; the metric is rewarding “hot,” not “right.”

How AI amplifies it: once generation is near-free, “chase the current hot topic, mass-produce manuscripts to high-IF taste” becomes an automatable strategy. What AI is best at is fitting the existing distribution (RES 06), and what the impact factor rewards is exactly hugging the current hot distribution; the two click together, pushing science further into the pit where everyone chases the same hot questions. Reliability and novelty are two different things; the impact factor blends them into one “highly cited equals good” signal, and AI drives the cost of chasing that signal toward zero.
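The category error behind using a journal mean for a single paper shows up in a few lines of arithmetic. The citation counts below are invented to mimic a long-tailed distribution; they are not data from any journal, and the real impact factor has its own counting rules that this sketch ignores.

```python
from statistics import mean, median

# Hypothetical citation counts for ten articles in one journal:
# one outlier carries most of the citations.
citations = [0, 0, 1, 1, 2, 2, 3, 4, 7, 180]

journal_mean = mean(citations)      # what an impact-factor-style mean reports
typical_paper = median(citations)   # what a typical article actually gets
below_mean = sum(1 for c in citations if c < journal_mean)

print(journal_mean)   # 20
print(typical_paper)  # 2.0
print(below_mean)     # 9
```

In this toy journal, nine of ten articles sit below the journal mean, so the mean says almost nothing about any one of them; it mostly reports that a single outlier exists.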

③ 同行评审：一道为"稿件稀缺"设计的串行闸，正被无限投稿淹没

③ Peer review: a serial gate designed for scarce manuscripts, now drowning in unbounded submissions

装置原意：同行评审是科学质量的承重墙，让领域同行在发表前把关，挡掉错误、夸大、不可靠的工作。它整套的吞吐假设是"稿件按人类写作的速度到达"、评审按人类阅读的速度处理，两者大致匹配（中位审稿周期常以月计〔Björk & Solomon，R24，证据级 Ⅱ〕，稿件稀缺时这还能接受）。失效机理：同行评审是一道串行闸：每篇稿子要占用 2–4 位领域专家各自几个小时，而合格评审人的总带宽是固定又稀缺的（还是无偿的）。这道闸的吞吐上限由人类专家的数量决定，不是由投稿量决定；投稿量一旦暴涨，闸不会变快，只会排出更长的队、评得更草率。

AI 怎么把它压垮：生成端能近免费地把投稿量翻十倍、百倍，评审端的人类专家带宽却一点没变——这正是 RES 05"剪刀差"在评审环节的具体爆发。更糟的是，AI 还能批量生成"看起来该认真审"的稿件，逼真到评审人必须真花时间才能辨真假，于是本就稀缺的评审带宽，大量被"识别 AI 垃圾"消耗掉了。指望同行评审能拦住一切，在投稿无限的世界里已经不现实；出路是把"判可信"从"逐篇串行精读"改成 RES 09 那种"按证据强度×框架距离分诊、人只投到吃紧的两格"的并行分流——这需要重构激励，不是让评审人加班。

What the device meant: peer review is science’s quality load-bearing wall: domain peers gatekeep before publication, blocking errors, exaggeration, unreliable work. Its whole throughput assumption is that manuscripts arrive at human writing speed and get processed at human reading speed, the two roughly matched (median review cycles often run in months 〔Björk & Solomon, R24, grade Ⅱ〕, tolerable when manuscripts were scarce). The failure mechanism: peer review is a serial gate: each manuscript ties up 2–4 domain experts for hours each, and the total bandwidth of qualified, unpaid reviewers is fixed and scarce. This gate’s throughput ceiling is set by the number of human experts, not by how many submissions come in; when submissions surge, the gate doesn’t speed up, it just produces longer queues and sloppier reviews.

How AI crushes it: the generation side can multiply submissions tenfold or a hundredfold at near-zero cost, while the review side’s human-expert bandwidth hasn’t moved at all. This is RES 05’s scissors gap erupting right at the review stage. Worse, AI can mass-produce manuscripts that look worth taking seriously, lifelike enough that a reviewer has to spend real time telling real from fake, so scarce review bandwidth gets eaten up identifying AI slop. Expecting peer review to catch everything is no longer realistic once submissions are unbounded; the way out is shifting “judging credibility” from serial close-reading of every manuscript to RES 09’s parallel triage: sort by evidence strength times paradigm distance, and spend human attention only on the two tight cells. That takes reworking incentives, not asking reviewers to work overtime.
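The serial-gate argument is a queueing argument, and a very small model captures it. The sketch below assumes a fixed weekly review capacity and a constant weekly arrival rate; both figures are invented, and real review systems are far messier. It shows only that when arrivals exceed a fixed capacity, the backlog grows without bound instead of the gate speeding up.

```python
def backlog_after(weeks: int, arrivals_per_week: int, reviews_per_week: int) -> int:
    """Manuscripts still waiting after `weeks`, given fixed reviewer capacity."""
    backlog = 0
    for _ in range(weeks):
        # New submissions join the queue; reviewers clear at most their capacity.
        backlog = max(0, backlog + arrivals_per_week - reviews_per_week)
    return backlog


# Hypothetical figures: capacity matched to human-speed writing...
print(backlog_after(weeks=52, arrivals_per_week=100, reviews_per_week=100))   # 0
# ...and the same capacity facing a tenfold surge in submissions.
print(backlog_after(weeks=52, arrivals_per_week=1000, reviews_per_week=100))  # 46800
```

In this model the only ways to stop the backlog growing are to raise capacity or to stop sending every manuscript through the same gate, which is why the text points to triage rather than overtime.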

④ 经费周期：为"实验很贵"调的保守偏好，正好杀死现在最廉价的探索

④ The grant cycle: a conservatism tuned for expensive experiments now kills the cheapest exploration

装置原意：经费评审的保守偏好不是恶意，是理性的风险管理：一个实验要花数百万、跑数年，评委有责任把钱投给"大概率能成"的项目，要求充分的前期数据、清晰的可行性、跟既有文献连得上。失效机理：但这套机制把"跟现行框架连得上"焊成了硬门槛，系统性地偏向框架内的渐进工作，把换框架的重构——按定义离既有文献远、前期数据必然薄，见 RES 09 那个"证据弱×框架远"格——挡在了门外。这正是 RES 06/07 反复说的：新颖诞生时一定"看上去不靠谱"，而经费机制把"看上去不靠谱"直接判了死刑。March 的探索/利用框架〔R15，证据级 Ⅱ〕落到制度层就是这条：利用（渐进、可预测）在争资源的时候，总是赢过探索（冒险、可能颗粒无收）。

AI 怎么改变了这笔账：充裕化恰恰把探索的成本压下来了——很多过去要花数月、数十万才能试一试的想法，现在近免费就能先跑一轮验证。经费机制那条核心假设——"探索很贵，所以要保守"——正在因此失效：探索变便宜了，理性的探索配比理应往上调（RES 07），保守偏好反倒成了把最便宜的新颖机会挡在门外的结构性浪费。经费周期这个旋钮是为"实验很贵的年代"调的，AI 把实验变便宜了，旋钮却没跟着拧——结果就是制度在最该放手探索的时候，反而收得更紧。贝尔实验室、施乐 PARC、剑桥 LMB〔R19〕之所以高产，正是因为它们用制度性保护把这种保守偏好对冲掉了。

What the device meant: a grant committee’s conservatism isn’t malice, it’s rational risk management: when an experiment costs millions and takes years, reviewers have a duty to fund what’s likely to succeed, demanding solid preliminary data, clear feasibility, continuity with the existing literature. The failure mechanism: but this welds “continuity with the current paradigm” into a hard bar, systematically favoring incremental in-paradigm work while barring paradigm-level reframings — which, by definition, sit far from the existing literature and are necessarily thin on preliminary data (RES 09’s “weak evidence, far paradigm” cell). This is exactly what RES 06/07 keep saying: real novelty looks unreliable at birth, and the grant mechanism sentences “looks unreliable” to death outright. March’s explore/exploit frame 〔R15, grade Ⅱ〕 at the institutional level is just this: exploitation (incremental, predictable) always beats exploration (risky, possibly nothing) when the two compete for resources.

How AI changes the arithmetic: abundance is precisely what drives exploration’s cost down: many ideas that used to take months and hundreds of thousands in funding to try can now get a first validation pass near-free. So the grant mechanism’s core assumption (exploration is expensive, so be conservative) is failing on its own terms: exploration got cheap, so the rational explore-share should move up (RES 07), and conservatism becomes structural waste that bars the cheapest novelty opportunities. Put differently, the grant cycle is a dial tuned for an era when experiments were expensive; AI made experiments cheap, and the dial never got retuned, so the institution tightens exactly when it should let exploration loose. Bell Labs, Xerox PARC, and the Cambridge LMB 〔R19〕 were prolific precisely because they used institutional protection to hedge against this conservatism.
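One way to read "the rational explore-share should move up" is through a break-even calculation. The sketch below is an assumption layered on the text, not the author’s model: it treats a speculative project as worth trying when its success probability times its payoff exceeds its cost, and all figures are invented.

```python
def break_even_probability(cost: float, payoff: float) -> float:
    """Minimum success probability at which trying an idea pays for itself.

    Assumes a simple expected-value rule: try if p * payoff >= cost.
    """
    if payoff <= 0:
        raise ValueError("payoff must be positive")
    return cost / payoff


PAYOFF = 10_000_000  # hypothetical value of a successful reframing

# Expensive-experiment era: an idea needs decent odds to be fundable.
print(break_even_probability(cost=300_000, payoff=PAYOFF))
# Near-free first pass: even long shots clear the bar.
print(break_even_probability(cost=1_000, payoff=PAYOFF))
```

Under this reading, a funding rule calibrated to the first threshold keeps rejecting ideas that the second threshold would admit, which is the "dial that never got retuned."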

⑤ PI/课题组金字塔 + "复现吃力不讨好"：把承重验证器留在没人愿意干的位置

⑤ The PI/lab pyramid + “replication is thankless”: leaving the load-bearing verifier where no one will do it

装置原意：PI（首席研究员）/课题组的金字塔结构，是为"训练+分工"设计的：资深 PI 定方向、拉经费、担署名责任，博士生博后做执行。这在执行昂贵的时候是高效的——执行是稀缺资源，集中在受训的年轻人手上，由经验把关方向，是合理的分工。同时这套激励把"原创新发现"放在金字塔顶端，把"复现别人的工作"放在没有奖励的位置：复现拿不到经费、发不了高影响因子、不算原创贡献——吃力不讨好，于是几乎没人做。失效机理：这就造成了一个致命错配：RES 00/13 反复论证，独立复现是把"研究环"和"高速生成器"分开的唯一承重验证器（Open Science Collaboration 2015：97 项显著结果仅 36% 复现，R1；Baker 2016：逾 70% 科学家复现他人失败，R2）。也就是说，整个科学最关键的那个质量动作，恰好被激励结构放在了没人愿意干的位置上。

AI 怎么把它从"能忍"变成"致命"：执行昂贵的时候这个错配还能忍——反正复现也贵。当生成端近免费地把待验证的主张翻上百倍，而验证端（复现）还困在"吃力不讨好、没人做"的激励洼地里，缺口就直接爆了，这正是 RES 05 剪刀差最尖锐的样子。讽刺的是，AI 本可以承担复现里的大部分执行——跑代码、重算、交叉核对数据——但只要激励结构还把复现放在没有奖励的位置，执行变便宜也没用：没人有动机去按那个按钮。这就是 RES 07"省下的产能不会自动变成 slack"落在复现上的样子：技术上能复现，不等于制度上有人真去复现。要修的不是技术，是激励——得把"担保可信"（复现、验证、整合）从金字塔底端那个没有奖励的位置，提到跟"原创发现"同等的奖励位上。这正是 RES 00"科学社区的价值从产生知识转向担保可信"落在制度层的意思。

What the device meant: the PI/lab pyramid is built for training plus division of labor: a senior PI sets direction, raises funding, and bears authorship responsibility, while PhD students and postdocs execute. This was efficient when execution was expensive: the scarce resource was concentrated in trainees and direction was gatekept by experience, a reasonable division of labor. At the same time, this incentive structure puts “an original new discovery” at the pyramid’s apex and “replicating someone else’s work” in a position with no reward at all: replication wins no grants, earns no high-IF publication, counts as no original contribution, and is thankless enough that almost no one does it. The failure mechanism: this creates a fatal mismatch. RES 00/13 argue repeatedly that independent replication is the one load-bearing verifier separating the research loop from a fast generator (Open Science Collaboration 2015: only 36% of 97 significant results replicated, R1; Baker 2016: over 70% of scientists had tried and failed to reproduce another scientist’s experiments, R2). That is, science’s single most critical quality act sits exactly where the incentive structure ensures no one wants to do it.

How AI turns it from bearable to fatal: when execution was expensive this mismatch was bearable, since replication was expensive too. Once the generation side near-free multiplies claims-to-verify a hundredfold while verification (replication) stays stuck in a thankless, no-one-does-it incentive sink, the gap simply explodes: RES 05’s scissors gap at its sharpest. The irony is that AI could shoulder most of replication’s actual work (running the code, recomputing, cross-checking data), but as long as the incentive structure leaves replication with no reward, cheaper execution doesn’t help: no one has a reason to press the button. This is RES 07’s “freed capacity doesn’t automatically become slack,” landing on replication specifically: being technically able to replicate isn’t the same as someone institutionally doing it. What needs fixing isn’t the technology, it’s the incentive: vouching for credibility (replication, verification, integration) has to move from the pyramid’s unrewarded base up to a reward position equal to original discovery. That’s exactly what RES 00’s “the scientific community’s value shifts from producing knowledge to vouching for credibility” means at the institutional level.

结构批判判语The structural verdict

五个装置不是过时了，是被反过来利用了：过滤器全变成了刷分放大器。要修，得换地基——把代理从"可数产出"换成"可担保的可信度"。The five devices aren’t outdated; they’ve been turned against their own purpose: every filter became a scoring amplifier. The fix is a new bedrock: swap the proxy from countable output to “vouchable credibility.”

RES

CASES · 四个走完一遍的真实情形

FOUR CASES WALKED THROUGH

工件 · 把内核压到具体情形上

Artifact · the kernel pressed onto specifics

把这卷的判据，按在四个具体到能照做的研究情形上

The volume’s tests, pressed onto four cases concrete enough to copy

机理讲完了，来看四个真走过一遍的情形。

The mechanisms are laid out: here are four real cases, walked through end to end.

一句话In one line

四个案例做的是同一件事：认出哪一格里"自动判断"会把该换框架的东西误杀，然后把稀缺的人类判断精准投进那一格——不是判得更快。All four cases do the same thing: spot the cell where auto-judging would wrongly kill something that calls for a change of frame, then spend scarce human judgment precisely there, rather than judging faster.

案例一 · 提问分诊：同一个材料发现项目，两支问题走向相反的两侧

Case 1 · Question-triage: in one materials-discovery project, two questions go to opposite sides

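情形：一个固态电解质材料组，手里有一个待解的问题包。AI 已能近免费地跑高通量筛选与性质预测（GNoME 类工作，R16），于是瓶颈不在算，而在"哪个问题值得问"。组里把问题包摊开，逐条过 RES 10 的判据，能写出可机检验收标准的归左（交 AI），只能诉诸"换什么框架"的归右（留人）。结果两支问题走向了相反的两侧。

The case: a solid-state-electrolyte materials group holds a bundle of open questions. AI can already run high-throughput screening and property prediction at near-zero cost (GNoME-style work, R16), so the bottleneck is not computation but “which question is worth asking.” The group lays the bundle out and runs each question through RES 10’s tests: those for which a machine-checkable acceptance criterion can be written go left (to AI); those that can only appeal to “which frame to switch to” go right (kept human). The two questions ended up on opposite sides.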

框架内支 → 交 AIIn-paradigm branch → to AI

问题："在已知的石榴石结构（garnet）框架内，哪种元素替换能把锂离子电导率再提一档？"

Question: “Within the known garnet structure frame, which element substitution lifts Li-ion conductivity another notch?”

为什么归左：验收标准可机检——电导率有明确测量口径，候选空间是"已知结构内的元素替换"，AI 可批量生成候选+DFT 初筛。这是 R8/R16 反复证实 AI 擅长的"在已知框架内"动作。人只需定阈值、抽验复现。

Why left: the acceptance criterion is machine-checkable — conductivity has a defined measurement, the candidate space is “element substitution within a known structure,” and AI can mass-generate candidates plus DFT pre-screening. This is the “within a known frame” act that R8/R16 repeatedly confirm AI excels at. The human only sets thresholds and spot-checks replication.

换框架支 → 留人Reframe branch → kept human

问题："我们是不是问错了变量？也许根本不该在'晶态固体'框架里找，而该问'非晶/玻璃态'里的离子输运是另一套机理？"

Question: “Are we asking about the wrong variable? Maybe we should not search within the ‘crystalline solid’ frame at all, but ask whether ion transport in the ‘amorphous/glassy’ state follows a different mechanism?”

为什么归右：这一支换变量、换描述层级，而非在已知框架内找最近邻：它质疑的是问题框架本身（RES 11）。可机检验收标准写不出来：你没法预先定义"换对了框架"长什么样。AI 在这里只能拟合已有分布（它会把你拉回晶态，因为文献都在那），所以方向与证伪条件必须由人先写。这一支后来正是组里的突破口，但只有先把它从"低电导率、不值得做"的范式内判断里救出来，它才有机会。

Why right: this branch is not nearest-neighbor search within a known frame but changing the variable, changing the level of description — it questions the problem frame itself (RES 11). No machine-checkable acceptance criterion can be written: you cannot pre-define what “reframed correctly” looks like. AI here can only fit the existing distribution (it will drag you back to the crystalline state, since that is where the literature lives), so the direction and falsification conditions must be written by a human first. This branch later became the group’s real breakthrough — but only because it was first rescued from the in-paradigm verdict of “low conductivity, not worth doing.”

留在人这一侧的判断：是"哪一支问题值得用稀缺的人类带宽去追"，而非"哪个答案对"。分诊的价值正在于：它没有把第二支当噪声删掉（单一可信分会这么做），而是认出它是"证据弱×框架远"格——挂起、定向取证，而非否决。如果当初把两支都丢给 AI，AI 会在第一支上高效地内卷，而第二支（换框架机会）会因为"离文献分布远"被自动降权。这就是 FIG 14.0 决策树要可视化的那条岔路。

The judgment that stayed human: not “which answer is right” but “which branch is worth chasing with scarce human bandwidth.” The triage’s value is precisely that it did not delete the second branch as noise (a single credibility score would) but recognized it as the “weak × far” cell: suspend and seek targeted evidence, not reject. Had both branches gone to AI, AI would have efficiently churned on the first while the second (the real paradigm-level opportunity) would have been auto-downweighted for being “far from the literature distribution.” This is the fork that FIG 14.0’s decision tree visualizes.

FIG. 14.0 / 提问分诊决策树：一个问题如何被分到"交 AI"或"留人" THE QUESTION-TRIAGE DECISION TREE: HOW A QUESTION IS ROUTED TO AI OR TO A HUMAN

看懂：从顶上一个问题进，过三道判：能写可机检验收标准吗？→在已知框架内吗？→判错代价可逆吗？三个"是"才落到左侧"交 AI"；任一道"否"就分到右侧"留人写方向"。案例一的两支正好走了这棵树的两条路。

Read: a question enters at the top and passes three tests: can you write a machine-checkable acceptance criterion? → is it within a known frame? → is a wrong call reversible? Three yeses route it left to “to AI”; any no routes it right to “human writes direction.” Case 1’s two branches take the two paths of this tree.

决策树把 RES 10 的双层判据展成可照走的三道闸：可机检 → 已知框架 → 代价可逆。三个"是"才交 AI；任一"否"就留人。它和 INSTRUMENT 12（下面那台可拨的分诊器）是同一逻辑的静态版与交互版。这棵树不是用来"自动判"的，右侧的每一支都明确写着"AI 当协作者不当裁判"——树本身只负责把问题路由到正确的判断者那里。

The decision tree unfolds RES 10’s two-layer test into three walkable gates: machine-checkable → known frame → reversible cost. Three yeses go to AI; any no stays human. It and INSTRUMENT 12 (the adjustable triage decider below) are the static and interactive versions of the same logic. The tree is not for “auto-judging”: every right-side branch explicitly says “AI as collaborator, not judge,” and the tree itself only routes the question to the correct judge.

把这棵树做成可拨的：下面这台分诊器让你对一个具体问题逐道答"是/否"，它实时给出路由判词——交 AI、还是留人，以及为什么。试着把案例一的两支分别拨进去，看它们怎么走向相反的终点。

The tree, made adjustable: the decider below lets you answer “yes/no” to each gate for a concrete question, and gives a live routing verdict: to AI, or to a human, and why. Try dialing in Case 1’s two branches and watch them reach opposite ends.

INSTRUMENT 12 · 提问分诊器 QUESTION-TRIAGE DECIDER

逐道答"是/否"。三道都"是"才把问题交给 AI 批量执行；任何一道"否"，问题就留给人，先写方向与证伪条件，AI 只当协作者不当裁判。判据来自 RES 10 的双层分诊。Answer “yes/no” to each gate. Three yeses route the question to AI for mass-execution; any one no keeps it human: write direction and falsification first, with AI as collaborator, never judge. The tests come from RES 10’s two-layer triage.

G1 · 能写出可机检的验收标准吗？（有明确测量口径、有标准答案）G1 · Can you write a machine-checkable acceptance criterion? (a defined measurement, a right answer)

G2 · 在已知框架内吗？（不是要换变量 / 换描述层级 / 换问题框架）G2 · Is it within a known frame? (not changing the variable / level / problem frame)

G3 · 判错的代价可逆吗？（不可逆 / 价值负载的高代价错判要留人）G3 · Is a wrong call reversible? (irreversible / value-laden high-cost errors stay human)
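The three gates can be sketched as a small routing function. This is a minimal sketch rather than the instrument's actual code: the names `Question` and `triage` are hypothetical, and the G3 answer for the glassy-state question is an assumption (the text does not state it; G1 and G2 already keep that question human).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    text: str
    machine_checkable: bool  # G1: a defined measurement and a right answer exist
    known_frame: bool        # G2: no change of variable, level, or problem frame
    reversible: bool         # G3: a wrong call can be undone at acceptable cost


def triage(q: Question) -> tuple[str, str]:
    """Return (route, reason). Any failed gate keeps the question human."""
    gates = [
        (q.machine_checkable, "G1 failed: no machine-checkable acceptance criterion"),
        (q.known_frame, "G2 failed: the question leaves the known frame"),
        (q.reversible, "G3 failed: a wrong call is irreversible or value-laden"),
    ]
    for passed, reason in gates:
        if not passed:
            # Human writes direction and falsification conditions first;
            # AI assists as collaborator, not judge.
            return "human", reason
    return "ai", "all three gates passed: batch-execute, human sets thresholds and spot-checks"


# Case 1's two branches (G3 for the second one is assumed, see above).
garnet = Question("Which substitution in garnet lifts Li-ion conductivity?", True, True, True)
glassy = Question("Is ion transport in the glassy state a different mechanism?", False, False, True)

for q in (garnet, glassy):
    route, reason = triage(q)
    print(f"{route:5s} | {q.text} | {reason}")
```

The function only routes; it never scores the question itself, which matches the tree's stated role of sending each question to the right judge.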

案例二 · 可信度天平：一条"AI 设计的新抗生素"主张，两条轴必须分开记

Case 2 · The believability ledger: an “AI-designed new antibiotic” claim, two axes booked separately

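情形：一个团队收到一条 AI 生成的主张——"模型在已知抗生素骨架外，设计出一类全新机理的候选分子，体外实验显示对耐药菌有效"。这条主张同时踩了天平的两条轴，而把它压成单一可信分会犯致命错误。逐轴来记：

The case: a team receives an AI-generated claim: “outside the known antibiotic scaffolds, the model designed a class of candidate molecules with a wholly new mechanism, and in-vitro experiments show activity against resistant bacteria.” The claim lands on both axes of the ledger at once, and compressing it into a single credibility score would be a fatal mistake. Book it axis by axis: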

X 轴 · 证据强度（可补、可机检）X-axis · evidence strength (supplementable, machine-checkable)

当前只有单一实验室的体外（in vitro）数据，无独立复现、无体内（in vivo）验证。按证据级阶梯（FIG 9.0）这停在 Ⅱ–Ⅲ 之间：已发表/已测量，但未复现。处置很清楚：去补证据：这是有标准答案、可外包的左动作：多个实验室独立复现体外结果，再推进体内。证据弱不是判它死的理由，是触发"补证据"动作的信号。

There is only single-lab in-vitro data, with no independent replication and no in-vivo validation. On the grade ladder (FIG 9.0) this parks between Ⅱ and Ⅲ: measured and published, but unreplicated. The disposition is clear: go get more evidence. This is a left-side act with a right answer, and it can be outsourced: independent labs replicate the in-vitro result, then the work moves on to in-vivo. Weak evidence is not a reason to kill the claim but a signal that triggers the “seek evidence” act.

Y 轴 · 框架距离（要人判、不可补）Y-axis · paradigm distance (human-judged, not supplementable)

"全新机理"＝离已知抗生素作用机制很远。这恰恰是不能交给"可信分"的那条轴：离现行框架远本身既不是缺陷也不是优点，它只是个需要人来定夺的信号。如果用单一可信分，模型会把"离已知机制分布远"直接折算成低可信（RES 03 的结构性偏置），从而把一个可能的重大突破误杀成离群噪声。这条轴的正确处置：人来判它是机理噪声还是真重构，并定向去找能区分两者的关键证据（如机理层面的结构生物学验证）。

“A wholly new mechanism” means far from the paradigm of known antibiotic action. This is exactly the axis that must not be handed to a “credibility score”: distance from the paradigm is in itself neither defect nor merit, only a signal that a human has to adjudicate. With a single score, the model converts “far from the known-mechanism distribution” straight into low credibility (RES 03’s structural bias), misclassifying a possible paradigm-level breakthrough as outlier noise. The correct disposition: a human judges whether it is mechanistic noise or a real reframing, and goes after the decisive evidence that separates the two (e.g. structural-biology validation at the mechanism level).

落格 → 处置Cell → disposition

证据弱 × 框架远＝ INSTRUMENT 10 那个最危险的第四格。单一可信分会判"删"；天平的处置是挂起 + 定向取证：不发新闻稿、不当地基，但也绝不删，而是把稀缺人类带宽精准投到"补独立复现 + 验证机理"这两件事上。这正是历史上爱因斯坦 1905、达尔文自然选择诞生时所在的格（RES 09）：证据薄、离框架远，但删掉就错失范式转移。

Weak × far = INSTRUMENT 10’s most dangerous fourth cell. A single score says “delete”; the ledger’s disposition is suspend + targeted evidence: no press release, no building on it as a foundation, but never delete. Instead, spend scarce human bandwidth precisely on “independent replication + mechanism validation.” This is exactly the cell that Einstein’s 1905 work and Darwin’s natural selection occupied at birth (RES 09): thin evidence, far from the paradigm, yet deleting it forfeits the paradigm shift.

两条轴若合并会怎样：假设把"证据弱（扣分）"和"离框架远（又扣分）"合成一个可信分，这条主张会拿到极低分，被自动归入"不可信、删"。于是两件本该分开做的事全做错了：该去补的独立复现没人去补（分数已替它下了结论），该由人判的机理重构被自动判死（分数把"离框架远"折算成低可信）。分开记的全部意义，就在于让证据弱触发"补证据"，让"离框架远"触发"人来判"——两个完全不同性质的动作，各归各位。

What happens if the axes merge: suppose “weak evidence (minus points)” and “far from paradigm (minus more)” are fused into one credibility score; this claim scores extremely low and is auto-binned as “not credible, delete.” Then both things that should have been done separately are done wrong: the independent replication that should be sought is not sought (the score already concluded for it), and the mechanistic reframing a human should judge is auto-killed (the score converted paradigm-distance into low credibility). The whole point of booking separately is to let weak evidence trigger “seek evidence” and paradigm distance trigger “a human judges”: two acts of completely different nature, each in its place.
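The contrast between separate booking and a merged score can be made concrete with a short sketch. Everything here is illustrative: the function names, the numeric inputs, and the subtraction used for the merged score are assumptions, and only the weak × far disposition is taken directly from the text; the other three cells are a plausible reading of the ledger, not a transcription of INSTRUMENT 10.

```python
from enum import Enum


class Disposition(Enum):
    INTEGRATE = "integrate into the evidence base"
    SEEK_EVIDENCE = "seek independent replication"
    HUMAN_JUDGES = "human judges: noise or real reframing?"
    SUSPEND_AND_TARGET = "suspend + targeted evidence, never delete"


def ledger(evidence_strong: bool, paradigm_far: bool) -> Disposition:
    """Two axes booked separately: each one triggers its own kind of act."""
    if paradigm_far:
        # Distance never lowers credibility; it only calls in a human.
        return Disposition.HUMAN_JUDGES if evidence_strong else Disposition.SUSPEND_AND_TARGET
    return Disposition.INTEGRATE if evidence_strong else Disposition.SEEK_EVIDENCE


def merged_verdict(evidence: float, distance: float, cutoff: float = 0.0) -> str:
    """The failure mode: paradigm distance is booked as a penalty on credibility."""
    score = evidence - distance  # hypothetical fusion rule, for illustration only
    return "keep" if score >= cutoff else "delete"


# The antibiotic claim: thin evidence (single lab, in vitro), far from known mechanisms.
print(merged_verdict(evidence=0.3, distance=0.9))               # the single score deletes it
print(ledger(evidence_strong=False, paradigm_far=True).value)   # the ledger suspends it
```

The design point is that `ledger` has no arithmetic path from distance to credibility, so the "far" axis can only ever change who acts next, never whether the claim survives.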

案例三 · 图谱护栏失误：一张本草知识图谱，把"化学成分"锁成了唯一描述层级

Case 3 · A guardrail failure: a materia-medica knowledge graph locked “chemical constituent” as the only level of description

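情形：一个团队为天然药物建了一张大规模知识图谱当护栏（RES 04 的思路，让海量 AI 生成留在可追溯结构里）。图谱的本体（ontology）是这样设计的：每味药材 → 其化学成分 → 成分的分子靶点 → 通路。AI 在这张图谱上做候选发现，效率极高，产出可追溯、可验证。一切看起来都对。直到他们撞上一类反复出现却怎么也解释不了的药效。

The case: a team built a large-scale knowledge graph for natural medicines as a guardrail (RES 04’s approach: keep mass AI generation inside a traceable structure). The graph’s ontology was designed as: each medicinal material → its chemical constituents → the constituents’ molecular targets → pathways. AI ran candidate discovery on this graph very efficiently, with traceable, verifiable output. Everything looked right, until the team hit a class of efficacy that kept recurring and could not be explained.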

护栏在守 · 也在锁The guardrail guards · and locks

图谱把所有主张都拽回"成分→靶点→通路"这条本体线上——任何离开这条线的解释，因为在图谱里"无处挂载"，会被自动判为不可追溯、低可信而过滤掉。护栏确实挡住了幻觉（好），但它也把描述层级锁死在"一切都要还原成分子成分"这一层：凡是不能还原成"哪个分子打哪个靶点"的药效，在这张图谱里根本无法被表达，于是系统性地从候选里消失。

The graph drags every claim back onto the “constituent → target → pathway” ontology line: any explanation off this line, having “nowhere to attach” in the graph, is auto-judged untraceable, low-credibility, and filtered out. The guardrail did block hallucinations (good), but it also locked the level of description at the “reductionist chemistry” layer: any efficacy that cannot be reduced to “which molecule hits which target” cannot even be expressed in this graph, so it systematically vanishes from the candidate set.

认出锁 · 加一层本体Spot the lock · add an ontology layer

那类解释不了的药效，机理在另一个描述层级：多成分协同 / 对菌群的群体调节 / 网络药理（整体扰动而非单靶点）。这不是"补更多成分数据"能解决的（那是 RES 11 那张一比一地图的陷阱：细节拉满仍是同一层信息）。修复动作是人做的"换框架"判断：给图谱本体加一个"系统/网络"描述层，让"整体扰动"成为可挂载、可追溯的一等公民。加层之后，原本被锁死过滤掉的那类机理重新进入候选，其中一条后来被独立复现证实。

The real mechanism of that unexplainable efficacy lives at a different level of description: multi-constituent synergy / microbiome population-level modulation / network pharmacology (whole-system perturbation, not single-target). This is not solved by “adding more constituent data” (that is RES 11’s one-to-one-map trap: maxing detail is still the same layer of information). The fix was a human paradigm-level judgment: add a “system/network” description layer to the graph ontology, making “whole-system perturbation” a traceable first-class citizen. After the layer was added, the locked-out class of mechanisms re-entered the candidate set, and one was later confirmed by independent replication.

这个失误的普遍形态：护栏（知识图谱）守住了"框架内的可追溯"，代价是把描述层级冻结在建图谱时的那一层。它的危险恰恰在于它看起来全对：产出可追溯、可验证、效率高，所有读数全绿，但它在悄悄地把"换描述层级"这种换框架动作排除在可能性之外（这正是 RES 11 的核心警告）。护栏的正确用法不是"建一次、永久信任"，而是定期问一句：这张图谱的本体，有没有把某个描述层级锁成唯一？谁来问这一句、谁有权给本体加一层，这又落回 RES 07 的治理问题：本体的边界就是能被表达的框架的边界。

The general form of this failure: the guardrail (knowledge graph) preserves “in-paradigm traceability” at the cost of freezing the level of description at the layer present when the graph was built. Its danger is precisely that it looks entirely correct: output is traceable, verifiable, and efficient, every reading is green, and yet it quietly excludes “switching the level of description,” a paradigm-level act, from the space of possibilities (exactly RES 11’s core warning). The correct use of a guardrail is not “build once, trust forever” but to ask periodically: has this graph’s ontology locked some level of description as the only one? Who asks this, and who has the authority to add an ontology layer, falls back to RES 07’s governance question, because the ontology’s boundary is the boundary of the paradigm that can be expressed.
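One way to make the periodic question askable is to log, rather than silently drop, every claim that has nowhere to attach. The sketch below is a toy under stated assumptions: `GuardrailGraph`, its method names, and the level labels are hypothetical, and a real knowledge graph would carry far more structure than a set of level names.

```python
from dataclasses import dataclass, field


@dataclass
class GuardrailGraph:
    levels: set[str]  # description levels the ontology can express
    attached: list[tuple[str, str]] = field(default_factory=list)
    unattachable: list[tuple[str, str]] = field(default_factory=list)

    def submit(self, claim: str, level: str) -> bool:
        """Attach a claim if its level exists; otherwise log it instead of dropping it."""
        if level in self.levels:
            self.attached.append((claim, level))
            return True
        self.unattachable.append((claim, level))
        return False

    def locked_out_levels(self) -> dict[str, int]:
        """The periodic audit: which levels keep recurring with nowhere to attach?"""
        counts: dict[str, int] = {}
        for _, level in self.unattachable:
            counts[level] = counts.get(level, 0) + 1
        return counts

    def add_level(self, level: str) -> None:
        """A human decision: extend the ontology, then re-submit what was locked out."""
        self.levels.add(level)
        pending, self.unattachable = self.unattachable, []
        for claim, lvl in pending:
            self.submit(claim, lvl)


graph = GuardrailGraph(levels={"constituent", "target", "pathway"})
graph.submit("compound A binds target T", "target")
graph.submit("multi-constituent synergy", "system/network")
graph.submit("microbiome population-level modulation", "system/network")
print(graph.locked_out_levels())   # the audit surfaces the locked-out level
graph.add_level("system/network")  # the reframing step stays a human call
```

The code deliberately has no rule that adds a level automatically: the audit only surfaces recurrence, and `add_level` is the point where someone with authority over the ontology decides.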

案例四 · 加速却变窄：一个把 AI 用满的领域，三年内更高产也更同质

Case 4 · Accelerated yet narrowed: a field that maxed out AI grew more prolific and more homogeneous in three years

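情形：一个计算驱动的子领域（可类比 Hao 等横跨约 4129.8 万篇论文的文献计量所刻画的形态，R9），从早期就把 AI 写作、文献综述、点子生成用满。三年后回看，所有"读数"都在变好：人均发表数上升、个人被引上升、项目周期缩短。按旧学术机器的每一个指标，这是个高歌猛进的领域。但把镜头拉到领域整体，出现了一组相反的信号。

The case: a computation-driven subfield (comparable to the pattern characterized by Hao et al.’s bibliometric study spanning about 41.3 million papers, R9) used AI to the full for writing, literature review, and idea generation from early on. Looking back after three years, every “reading” had improved: publications per researcher up, individual citations up, project cycles shorter. By every metric of the old academic machine, this was a field surging ahead. But pull the lens back to the field as a whole and an opposite set of signals appears.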

个体读数 · 全绿Individual readings · all green

用 AI 的研究者个人影响力上升（R9 的核心发现之一）：写得更快、综述更全、点子来得更密。从个人 KPI 看，AI 是纯增益。每个理性的个体都在做"对自己最优"的事——用 AI 把产出和影响力做上去。

AI-using researchers see individual impact rise (one of R9’s core findings): faster writing, fuller reviews, denser ideas. By individual KPIs, AI is pure gain. Every rational individual is doing the “self-optimal” thing — using AI to push output and impact up.

领域读数 · 在收缩Field readings · contracting

同一段时间，领域整体的主题覆盖收缩（R9 测到约 4.63% 的话题覆盖收缩）、学者间互动下降。Doshi & Hauser（R12）给出因果机理：给写作者 LLM 点子，个体更有创意，但彼此更相似——作者直接称之为"社会困境"（个人更好、集体更窄）。Anderson 等（R13）进一步定位：这不是个体固着，是 LLM 向不同用户建议相似点子，是群体层效应。换个模型也救不了（R14：控制结构变量后，模型间相似度远高于人际）。

Over the same period the field’s topic coverage contracts (R9 measured ~4.63% contraction) and scholar-to-scholar interaction falls. Doshi & Hauser (R12) give the causal mechanism: give writers LLM ideas and individuals get more creative, yet grow more similar — the authors call it a “social dilemma” (individually better, collectively narrower). Anderson et al. (R13) locate it further: not individual fixation but the LLM suggesting similar ideas to different users, a group-level effect. Switching models does not cure it (R14: controlling for structural variables, models resemble one another far more than humans do).

这就是 hypernormalization 的运行态（RES 08）：领域是变窄了，不是变差了：更高效、更稳定、所有指标更好看，但探索的方差在塌缩，大家都在用同一个模型、追同一批热点、收敛到同一个均值。最危险之处在于它没有任何报警：每个个体读数都是绿的，旧学术机器的每个指标都在说"一切向好"。它正是 RES 13 那条"指标奖励可数产出、AI 是其最优解"的领域级后果。什么判断本该留在人这一侧却没留住："这个领域是不是在变窄"这个问题，没有任何一个个人 KPI 会问它，它只能由人在领域层主动问，并主动用制度去对冲（保护偏离均值的探索、奖励复现与反共识工作、给不可度量的 slack 留生存空间，RES 07）。把这个判断也外包给"指标自动监控"，等于让正在制造同质化的那套机制来诊断同质化。要看见变窄，得先有人愿意去看那些不会让自己 KPI 变好看的信号。

This is hypernormalization at runtime (RES 08): the field did not get worse, it got narrower: more efficient, more stable, every metric prettier, but the variance of exploration is collapsing, everyone on the same model, chasing the same hot topics, converging to the same mean. The most dangerous part is that it sets off no alarm: every individual reading is green, every metric of the old academic machine says “all is improving.” It is the field-level consequence of RES 13’s “metrics reward countable output, and AI is their optimal solution.” What judgment should have stayed human but did not: the question “is this field narrowing” is asked by no individual KPI. It can only be asked actively by a human at the field level, and actively hedged by institutions (protect off-mean exploration, reward replication and anti-consensus work, leave survival space for unmeasurable slack, RES 07). Outsourcing this judgment too, to “automatic metric monitoring,” is letting the very mechanism that is producing the homogenization diagnose the homogenization. To see the narrowing, someone must first be willing to look at the signals that will not make their own KPI prettier.
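The split between individual and field readings can be illustrated with a toy calculation. This is a sketch on invented data, not a reconstruction of R9's method: Shannon entropy over topic labels is an assumed proxy for "topic coverage," and the author and topic names are placeholders. In keeping with the warning above, the output is a reading for a person to look at, not an automatic diagnosis.

```python
import math
from collections import Counter


def topic_entropy(topics: list[str]) -> float:
    """Shannon entropy (bits) of the field's topic distribution: a coverage proxy."""
    counts = Counter(topics)
    total = len(topics)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def field_reading(papers: list[tuple[str, str]]) -> dict[str, float]:
    """papers: (author, topic) pairs for one period."""
    authors = {a for a, _ in papers}
    return {
        "papers_per_author": len(papers) / len(authors),
        "topic_entropy_bits": topic_entropy([t for _, t in papers]),
    }


# Toy data, invented for illustration only.
before = [("ann", "t1"), ("bo", "t2"), ("cy", "t3"), ("di", "t4")]
after = [("ann", "t1"), ("ann", "t1"), ("bo", "t1"), ("bo", "t2"),
         ("cy", "t1"), ("cy", "t1"), ("di", "t1"), ("di", "t2")]

r0, r1 = field_reading(before), field_reading(after)
# Individual reading up, field reading down: the pattern Case 4 describes.
narrowing = (r1["papers_per_author"] > r0["papers_per_author"]
             and r1["topic_entropy_bits"] < r0["topic_entropy_bits"])
print(r0, r1, narrowing)
```

Every author in the toy data doubles their output, so no individual KPI would flag a problem; only the field-level number moves the wrong way.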

四个案例的同一根线The single thread through four cases

一根线穿过四个案例：执行可以被充裕，但"哪个真相值得知道"这个判断，得有一个具名的人扛着。One thread runs through all four cases: execution can be made abundant, but the judgment of “which truth is worth knowing” needs a named human bearing its weight.

两个正在发生的外部案例：拿判据回真实世界再对一遍

Two real-world cases in progress: checking the tests against the world again

上面四个是把内核压到可照做的细节上；下面这两个是正在发生的外部案例——一反一正，各自给命题划出一条能被独立核验的边界：执行可以被充裕，但"整合与担保"这一步，外包不出去。

The four cases above press the kernel onto copyable detail; the two below are external cases already in progress (one a counter-example, one a positive one), each drawing the thesis a boundary you can independently check: execution can be made abundant, but “integration and vouching” cannot be outsourced.

反例 · 担保不可外包Counter-case · vouching cannot be outsourced

FutureHouse Robin · 公司自述 + 独立复算 · 等级 ⅣFutureHouse Robin · company account + independent recompute · grade Ⅳ

端到端自主科研环 Robin 的一个真实读数：AI agent Finch 对一批流式数据自动量化出 ripasudil（干性 AMD 方向，临床前 / 体外）对 RPE 细胞吞噬作用的效应量约 7.5×（vs DMSO 对照）；人工以不同门控阈值复算同一批数据，得约 1.75×：逾 4× 的量化口径差。[R25]注意：这是同一份数据上"AI 量化 vs 人工量化"的差异，不是"研究被加速"本身（Robin 的加速另有其数——把原本以数月计的人工探索环压缩到数小时量级，见 Nature 2026 正式版）。但它作为反例依旧成立：差额不在"算得快不快"，而在最后一步，把分散结果整合成一个敢签字担保的结论。这一步仍要人复算、人担保，正是 RES 01"生成不稀缺、担保才稀缺"的真实世界读数。〔标边界〕限定临床前 / 体外，未及临床；效应量须独立复核。

A real reading from the end-to-end autonomous loop Robin: the AI agent Finch auto-quantified ripasudil’s effect (a dry-AMD direction, preclinical / in-vitro) on RPE-cell phagocytosis at about 7.5× (vs DMSO control); a manual recompute of the same flow-cytometry data with different gating thresholds gave about 1.75×: over a 4× gap in quantification. [R25] Note: this is an “AI-vs-manual quantification” gap on one dataset, not “research being accelerated” itself. Robin’s real acceleration is a different number: compressing a research loop that would take a human months into a matter of hours, per the Nature 2026 paper. But as a counter-case it still holds: the gap is not in “how fast it computes” but in the last step: integrating scattered results into a conclusion someone dares sign and vouch for. That step still needs a human to recompute and vouch, exactly RES 01’s “generation is not scarce, vouching is,” now as a real-world reading. [flag boundary] Confined to preclinical / in-vitro, not clinical; the effect size needs independent re-check.

正例 · 框架内充裕的真实顶配Positive case · the real ceiling of in-paradigm abundance

AlphaFold2 / 3（Nature 2021 / 2024；2024 诺贝尔化学奖）· 等级 Ⅱ · DOI AF2 10.1038/s41586-021-03819-2 · AF3 10.1038/s41586-024-07487-wAlphaFold2 / 3 (Nature 2021 / 2024; 2024 Nobel Prize in Chemistry) · grade Ⅱ · DOI AF2 10.1038/s41586-021-03819-2 · AF3 10.1038/s41586-024-07487-w

蛋白质结构预测是框架内充裕能到的真实顶配：在已知生物物理框架内，把"从序列到结构"这道一度要数月实验的工作压到近免费，规模与精度都是历史级。[R26]它正是 R8（AI Feynman）/ R16（GNoME）那条线的最强样本：把已知框架内的搜索做到极致。〔标边界〕但它预测的是结构、不是功能，更不是临床疗效：预测仍须湿实验确证，且它扩张的是现有框架而非更换框架（RES 11）。把它读成"AI 已能独立做科学"是越界；正确读法是"框架内执行可被充裕到顶配，换框架的判断仍在框架之外"。

Protein-structure prediction is the real ceiling that in-paradigm abundance can reach: within a known biophysical frame, it compresses “sequence to structure” — once months of experiment — to near-free, at historic scale and accuracy. [R26] It is the strongest sample of the R8 (AI Feynman) / R16 (GNoME) line: maxing out search within a known frame. [flag boundary] But it predicts structure, not function, still less clinical efficacy: predictions still need wet-lab confirmation, and it expands the existing paradigm rather than switching it (RES 11). Reading it as “AI can already do science on its own” overreaches; the right reading is “in-paradigm execution can be made abundant to its ceiling, while paradigm-level judgment stays outside the frame.”

RES

TEMPLATE · 研究工作流

THE RESEARCH WORKFLOW

可拷贝工件 · 照做的环

Copyable artifact · a loop you can run

把整卷收成一个可拷贝的环：生成多 · 验证严 · 整合先行

The whole volume as a copyable loop: generate much, verify hard, integrate first

整卷的机理，能不能收成一个照着做的循环？能。

Can the whole volume’s mechanics collapse into one loop you just run? Yes.

一句话In one line

整卷收成一个六步的环，照抄就能跑。两处拧紧了不能松：框定要在生成之前，整合要排在检索之前。最容易垮的地方：把省下来的工时又拿去多发论文。The whole volume collapses into a six-step loop you can copy and run. Two joints stay tight, no exceptions: frame before you generate, and put integration ahead of further retrieval. Where it collapses most often: spending the saved hours on more papers.

① 框定FRAME

先立证据库 · 写下判据Stand up the base · write the criteria

建可追溯证据库（RES 04 四属性）；显式写下"何为值得相信·值得知道"的判据。证据库即规格，这一步先于生成。Build the traceable evidence base (RES 04’s four properties); write down explicit criteria for “worth believing / worth knowing.” The base is the spec; this precedes generation.

② 生成GENERATE

框架内动作大规模并行Parallelize in-paradigm actions

检索/假设/实验/分析交给生成（RES 10 左格）；每条产物挂证据边落进库——不入库的不算数。Search/hypothesis/experiment/analysis go to generation (RES 10’s left cell); each output carries evidence edges into the base: what does not enter does not count.

③ 分诊TRIAGE

天平过滤 · 换框架挂起Ledger filter · suspend paradigm-level

用 RES 09 天平逐批判可信：框架内噪声删、可信入库、离框架远的挂起去找区分证据，别当噪声杀。Use RES 09’s ledger to triage each batch: drop in-paradigm noise, integrate the believable, suspend the paradigm-distant to seek discriminating evidence, not kill as noise.

④ 整合INTEGRATE

跨知识综合 · 非多检索Synthesize across · not more retrieval

人的稀缺动作（RES 05）：把从未并置的几条缝成新理解。盯整合产物相对原始产出的比率，别让堆积成山。The human’s scarce act (RES 05): stitch never-juxtaposed claims into new understanding. Watch the ratio of integration artifacts to raw output; do not let it pile into a mountain.

⑤ 守值OWN VALUE

定方向 · 留换变量口子Set direction · leave the variable door

让"值得"有归属（RES 07/12）；给"换 schema/换变量"留人发起的通道（RES 11），抵抗生成层保守偏置。Give “worth” an owner (RES 07/12); keep a human-initiated channel to “change the schema / change the variable” (RES 11), resisting the generation layer’s conservative bias.

⑥ 回流FEED BACK

把每次"被撤回/证伪/误杀的新颖"回流成证据库的新规则或新节点类型——错误回流成护栏，下一轮少犯。这一步把环闭合。Feed each “retracted / refuted / mistakenly-killed novelty” back as a new rule or node type in the base: errors become guardrails, fewer next round. This step closes the loop.

→ 真文件：→ real file: templates/research-loop.md

把省下的工时投回哪里，是这个环最容易垮的一步。这个环里藏着一个看不见的决定，它决定这个环到底带来进步，还是带来 hypernormal：②生成省下来的工时，到底投回了哪里。默认会发生什么，RES 07 早讲过——省下的产能不会自动变成 slack，它会被重新分配去做更多同样的事。落在研究环里，就是拿②省下的时间去多产论文、多跑实验、多生成假设，于是产出曲线更陡，可③判断、④整合、⑤守值的带宽一点没涨。这条失效路径特别隐蔽，因为它在每个局部指标上都像"进步"：产量涨了、影响力涨了、团队看着更高产了——这正是 Hao 等那 4129.8 万篇论文里，那批"个人影响力上升"的科学家的处境。

正确的做法反直觉：把②省下的工时显式地、刻意地投回③④⑤，让判断/复现/整合占研究者时间的比例上升，而不是让产出量上升。判据很简单：一个团队上了 AI 之后产量暴涨，但判断/整合占时间的比例没变，它就不是在跑这个环，它是在跑一台更快的 hypernormal 机器。

Where to reinvest the saved hours is this loop’s easiest step to get wrong. This loop hides one invisible decision that determines whether it delivers progress or hypernormal: where the hours ② saves actually go. RES 07 already told you what happens by default: freed capacity doesn’t become slack on its own, it gets reallocated to more of the same. Inside the research loop that means spending the time ② saved on more papers, more experiments, more hypotheses, so output climbs steeper while ③ judgment, ④ integration, ⑤ owning value gain no bandwidth at all. This way of failing is unusually hard to see, because on every local metric it looks like progress (output up, impact up, the team looking more productive), exactly the position of the “individual impact up” scientists in Hao et al.’s 41.3-million-paper study.

The right move is counter-intuitive: reinvest the hours ② saves, explicitly and deliberately, into ③④⑤: raise the share of researcher time on judgment, replication, and integration, not the volume of output. One simple test: if a team’s output spikes after adopting AI but its share of time on judgment and integration hasn’t moved, it isn’t running this loop. It’s running a faster hypernormal machine.
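The test in the last paragraph can be written down as a check on two periods of team data. This is a minimal sketch: `TeamPeriod`, `loop_check`, and the `min_share_gain` threshold of 0.05 are hypothetical choices, and the text itself gives no numeric cut-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamPeriod:
    outputs: int           # papers, experiments, hypotheses produced
    hours_total: float
    hours_judgment: float  # time on steps 3-5: triage, integration, owning value

    @property
    def judgment_share(self) -> float:
        return self.hours_judgment / self.hours_total


def loop_check(before: TeamPeriod, after: TeamPeriod, min_share_gain: float = 0.05) -> str:
    """Output up with a flat judgment share means the saved hours went to more output."""
    output_up = after.outputs > before.outputs
    share_gain = after.judgment_share - before.judgment_share
    if output_up and share_gain < min_share_gain:
        return "faster hypernormal machine"
    return "running the loop" if share_gain >= min_share_gain else "no change"


pre_ai = TeamPeriod(outputs=10, hours_total=1000.0, hours_judgment=200.0)
more_papers = TeamPeriod(outputs=30, hours_total=1000.0, hours_judgment=200.0)
reinvested = TeamPeriod(outputs=12, hours_total=1000.0, hours_judgment=400.0)

print(loop_check(pre_ai, more_papers))
print(loop_check(pre_ai, reinvested))
```

The check needs time-allocation data that most teams do not record, which is part of the point: the share of hours on judgment has to be made visible before it can be managed.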

框定先行、整合优先，是这个环的两处承重。这个环跟工程那套规格驱动环（框定→计划→执行→验证→整合→学习）是同一个道理，但研究版有两处特意拧紧的地方，照抄的时候不能松。第一处是①框定要排在②生成前面：先立好可追溯的证据库、写下"什么算值得相信"的判据，再开始生成。理由 RES 04 早说过——证据库是研究的规格，不是事后归档；次序一反，你得到的是一座整合不动的垃圾山。很多团队把研究环抄成"先让 agent 狂产、再想办法管"，漏掉的正是这一拧。第二处是④整合要排在检索前面：当 RES 02 的生成把产出推向近乎无限，这个环里最容易堵的不是生成，是消化。如果团队把②省下的工时拿去多产而不是投回④整合，生成和整合之间就会越积越高——产出曲线陡升，理解曲线趴着不动。所以这个环的瓶颈阀门在④，不在②。

“Frame first” and “integration first” are this loop’s two load-bearing joints. This loop is the same idea as engineering’s spec-driven loop (specify, plan, execute, verify, integrate, learn), but the research version tightens two joints on purpose, and you can’t let them loosen when you copy it. The first: ① FRAME comes before ② GENERATE. Stand up a traceable evidence base and write down the criteria for “worth believing” before you open generation. RES 04 already gave the reason: the base is research’s spec, not after-the-fact filing; reverse the order and you get a garbage mountain nothing can integrate. A lot of teams copy the loop as “let the agent run wild, figure out management later”: that’s exactly this joint, missed. The second: ④ INTEGRATION takes priority over further retrieval. Once RES 02’s generation pushes output toward the near-infinite, the easiest place for this loop to jam isn’t generation, it’s digestion. If a team spends the hours ② saves on more output instead of reinvesting in ④ integration, a backlog piles up between generate and integrate: output climbs steeply while understanding lies flat. So this loop’s real bottleneck valve sits at ④, not ②.

第⑥步——回流——是把"环"和"流水线"分开的那一步。这个工件叫"环"，不叫"流水线"，全靠这一步把它闭合。流水线是单向的：原料进，产品出，错误当废品扔掉。环带反馈：每一次被撤回的、被证伪的、被误杀的新颖，都不是废品，是下一轮的护栏材料。具体怎么流回去？一条复现不出来的主张，变成证据库里一条新的冲突检测规则；一个被现有 schema 判成异常、却反复出现的残差，变成一个新节点类型（RES 11 的换变量口子）；一次"把该重构的东西当噪声删掉"的事故，变成天平里"证据弱×框架远"那一格的处置改进（RES 09）。

没有这一步，前五步就退化成一条更快的流水线——生成、过滤、整合、产出，错误流走了就不回来，系统永远在重复同一个盲区。有了这一步，错误才变成系统的学习信号：每撞一次墙，护栏就长一点，下一轮少犯一点。这正是它跟工程"错误回流成新测试"完全同一招的地方，也是整卷反复盯着"撤回/证伪率应该往下走"这条指标的原因：它量的不是"少犯错"，是"这个环到底在不在学"。

Step ⑥ — feed-back — is what separates a loop from a pipeline. This artifact is called a loop, not a pipeline, entirely because this step closes it. A pipeline runs one way: raw material in, product out, errors thrown away as scrap. A loop has feedback: every retracted, refuted, mistakenly-killed piece of novelty isn’t scrap, it’s guardrail material for next time. Concretely: a claim that doesn’t replicate becomes a new conflict-detection rule in the base. A residual the current schema calls anomalous but that keeps recurring becomes a new node type: RES 11’s change-the-variable door. An incident where a reframing got deleted as noise becomes an improved disposition for the ledger’s “weak evidence, far paradigm” cell (RES 09).

Without this step, the first five degrade into a faster pipeline (generate, filter, integrate, output): errors flow away and never come back, and the system keeps repeating the same blind spot. With it, errors become the system’s learning signal: every wall it hits grows the guardrail a little, and it makes that mistake a little less next time. This is exactly the same move as engineering’s “errors feed back as new tests,” and it’s why the volume keeps watching one metric: the retraction/refutation rate should fall. That metric doesn’t measure fewer mistakes. It measures whether this loop is actually learning.
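The three feed-back routes named above (failed replication → conflict rule, recurring residual → new node type, killed novelty → ledger improvement) can be sketched as one dispatch method. The class name `EvidenceBase`, its fields, and the string labels for error kinds are all hypothetical; the sketch only shows that each error ends as a durable change in the base rather than as scrap.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceBase:
    conflict_rules: list[str] = field(default_factory=list)
    node_types: set[str] = field(default_factory=lambda: {"claim", "dataset"})
    ledger_notes: list[str] = field(default_factory=list)

    def feed_back(self, kind: str, detail: str) -> None:
        """Step 6: turn one error into a durable change in the base."""
        if kind == "failed_replication":
            # A claim that did not replicate becomes a conflict-detection rule.
            self.conflict_rules.append(f"flag claims matching: {detail}")
        elif kind == "recurring_residual":
            # A residual the schema keeps calling anomalous becomes a node type.
            self.node_types.add(detail)
        elif kind == "killed_novelty":
            # A reframing deleted as noise improves the weak x far disposition.
            self.ledger_notes.append(f"weak x far cell: {detail}")
        else:
            raise ValueError(f"unknown error kind: {kind}")


base = EvidenceBase()
base.feed_back("failed_replication", "single-lab in-vitro effect sizes")
base.feed_back("recurring_residual", "system_level_effect")
base.feed_back("killed_novelty", "require human review before deletion")
print(base.conflict_rules, sorted(base.node_types), base.ledger_notes)
```

Raising on an unknown kind is a deliberate choice in this sketch: an error that fits none of the routes is itself a signal that the base needs a new one.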

研究下注是一个组合，不是单点决定

A research bet is a portfolio, not a single call

环一跑起来，第一个反咬你的治理问题就来了：该把多少带宽投在利用（精炼已知、稳出结果），多少投在探索（追不确定的东西，可能颗粒无收）？RES 07 已经说了，效率默认会吃掉探索，所以这个配比不能放任默认，得当成一个可拨、可看的组合来管。这正是生成充裕带来的新自由度：执行近免费之后，跑一次冗余探索的边际成本骤降，理论上你能负担得起更高的探索比例；但激励要是还按产量考核，省下的产能又会被默认推回利用那一边。下面这台仪器把这层张力做成两根可拨的滑杆——拨"探索 vs 利用"的配比、拨"冗余探索"的允许度，看预期新颖跟成本怎么联动，看自己是不是正滑进 hypernormal 那种"指标全绿、覆盖在缩"。

The moment the loop starts running, the first governance question that bites back is: how much bandwidth goes to exploitation — refining the known, steady yields — and how much to exploration — chasing the uncertain, possibly nothing? RES 07 already showed efficiency eats exploration by default, so this ratio can’t be left on autopilot; it has to be managed as a portfolio you can adjust and watch. This is the new degree of freedom abundance creates: once execution is near-free, the marginal cost of a redundant exploration drops sharply, so in principle you can afford a higher explore share — but if incentives still score on output, the freed capacity gets pushed back to exploitation by default anyway. The instrument below turns this tension into two sliders: the explore-vs-exploit mix, and how much redundant exploration you allow; and shows how expected novelty and cost move together, and whether you’re sliding into hypernormal’s “every metric green, coverage shrinking.”

INSTRUMENT 11 · 研究下注组合 RESEARCH-BET PORTFOLIO

拨两根杆：探索↔利用的配比、冗余探索的允许度。读数给出预期新颖、预期成本、与"是否滑进 hypernormal"的判词，把 RES 07 的散木命运做成可拨的组合。

Two sliders: the explore↔exploit mix, and the redundant-exploration allowance. The readout gives expected novelty, expected cost, and a verdict on “sliding into hypernormal”: RES 07’s fate-of-the-useless-tree made an adjustable portfolio.

滑杆 Sliders: 探索 ↔ 利用配比 Explore ↔ exploit mix (shown at 30% to explore) · 冗余探索允许度 Redundant-exploration allowance (shown at 40%)

读数 Readouts: 预期新颖 EXP. NOVELTY · 预期成本 EXP. COST · 覆盖广度 COVERAGE

这台仪器不给你"最优解"，它逼你直面权衡。把探索配比拨到 0，预期成本最低、读数全绿，但覆盖广度也最低——这就是 hypernormal：高效、稳定、在缩。拨到 100，覆盖最广、预期新颖最高，但成本陡升，而且大量探索注定颗粒无收（这正是"无用之用"的字面意思）。生成充裕改变的是这条权衡曲线的形状：执行近免费让"高冗余探索"不再像过去那样贵得离谱，理性的探索配比应该往上移——但只有激励结构肯让"不可度量的 slack"活下来，这个上移才真的发生。仪器里"冗余探索允许度"这根杆，模拟的正是组织愿不愿意为看着没用的方向留预算。

This instrument gives you no optimum; it forces you to face the trade-off. Slide the explore share to 0 and expected cost is lowest, everything reads green, but coverage is also lowest. That’s hypernormal: efficient, stable, shrinking. Slide it to 100 and coverage is widest, expected novelty highest, but cost climbs steeply, and much of that exploration is destined to yield nothing (the literal meaning of “the use of the useless”). What abundance changes is the shape of this trade-off curve: near-free execution means high-redundancy exploration is no longer the prohibitive cost it used to be, so the rational explore share should move up, but that only actually happens when the incentive structure lets unmeasurable slack survive. The “redundant-exploration allowance” slider in the instrument models exactly whether an organization will budget for directions that look useless.
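The qualitative behavior described here can be reproduced with a toy model. The formulas below are invented for illustration and are not the instrument's actual model: the linear forms, the `run_cost` parameter, and the 0.2 coverage threshold for the "hypernormal" verdict are all assumptions chosen only to match the stated end-points (explore at 0 gives lowest cost and lowest coverage; explore at 100% gives widest coverage and steepest cost).

```python
def portfolio_readout(explore: float, redundancy: float, run_cost: float = 1.0) -> dict:
    """Toy readout for two sliders in [0, 1]; run_cost is the marginal cost of one run."""
    if not (0.0 <= explore <= 1.0 and 0.0 <= redundancy <= 1.0):
        raise ValueError("both sliders must lie in [0, 1]")
    coverage = explore * (1.0 + redundancy) / 2.0           # assumed linear form
    cost = 1.0 + run_cost * explore * (1.0 + redundancy)    # baseline 1.0 = pure exploitation
    return {
        "expected_novelty": explore,
        "expected_cost": cost,
        "coverage": coverage,
        "hypernormal": coverage < 0.2,                      # hypothetical threshold
    }


print(portfolio_readout(0.0, 0.4))                 # cheapest, narrowest
print(portfolio_readout(1.0, 1.0))                 # widest, most expensive
print(portfolio_readout(1.0, 1.0, run_cost=0.1))   # same bet once execution is near-free
```

Lowering `run_cost` flattens the cost curve without touching coverage, which is the sketch's stand-in for the claim that abundance changes the shape of the trade-off rather than removing it.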

起步别一次全建：先挑一个"执行已经充裕、判断还没被外化"的环节——比如文献综合或参数扫描——立一个最小的可追溯证据库，把这个环节的框架内动作交给生成。跑顺了再加③天平、④整合。把②省下的工时显式投回③④，而不是拿去多产论文，这一条最容易垮（见 RES 14 的边界）。三步起步：先立可追溯证据库 → 用可信度天平挑出该注入人类判断的地方 → 把省下的工时投回整合与守值。

Don’t build it all at once. First run a narrow ①+② workflow: pick one step where execution is already abundant but judgment isn’t yet externalized (literature synthesis and parameter sweeps are the usual entry points), stand up a minimal traceable evidence base, and hand that step’s in-paradigm actions to generation. Once it runs, add ③ the ledger and ④ integration. Reinvest the hours ② saves explicitly into ③④, not into producing more papers; this is the easiest place to fail (see RES 14’s boundary). Three steps to start: stand up a traceable evidence base → use the believability ledger to pick where human judgment gets injected → reinvest saved hours into integration and owning value.

RES

APPLICABILITY · 适用边界

APPLICABILITY

边界 · 谁适用 / 谁不适合

Boundary · Who / who not

这把尺，在哪里成立、在哪里不成立

Where this ruler holds, and where it does not

这不是一条放之四海皆准的律，先说清它在哪不成立。

This is no universal law: first, where it doesn’t hold.

一句话In one line

这卷不是放之四海皆准的律，它在"执行能充裕、判断还没被外化"的域里最站得住。最该守住的一条：别把"提问被充裕"这类还在探索、没坐实的判断，当成已证的事实去裁人。This volume is no universal law: it’s strongest where execution can be made abundant but judgment isn’t yet externalized. The one gate to hold above all: don’t treat an unproven, still-exploratory claim like “questioning is made abundant” as settled fact and use it to lay people off.

边界的判据：执行是不是真被充裕了，判断是不是真能被外化

The boundary test: is execution truly made abundant, can judgment truly be externalized

把"在哪适用"讲精确，得回到命题的两个前提，一个个去检验它们在具体的域里成不成立。前提一：执行能被大规模充裕。这在计算生物（AlphaFold, Nature 2021）、材料筛选（A-Lab, Nature 2023）、文献综合里成立——跑一次实验/筛选/综述的边际成本趋近于零，还能并行。但在受物理或伦理限速的域里不成立：一个三期临床试验、一次要等三年生长周期的田野采样、一个罕见样本的湿实验，执行本身仍是真瓶颈，"执行被充裕"这条前提整个塌了，这时候这卷的退守命题增益很小，因为稀缺根本没从执行端搬走。

前提二：判断能被外化。它要求"什么算值得相信、值得知道"这套判据能部分写下来、部分被机检。在判据高度共识的域里（某些结构化预测任务，"对"几乎没有争议），外化容易，但这卷的增益也小，因为压根没有价值分叉可言。这卷增益最大的甜区，恰恰是两个前提同时成立、判断又还没被外化的域：执行已经能充裕了，判断却还困在少数专家脑子里，没被结构化出来。把这两条当成门禁一个案例一个案例地过，比背一张"适用领域清单"可靠得多。

To state precisely where it applies, go back to the thesis’s two premises and test each against a given field. Premise one: execution can be made abundant at massive scale. This holds in computational biology (AlphaFold, Nature 2021), materials screening (A-Lab, Nature 2023), and literature synthesis: the marginal cost of one experiment, screen, or review trends to zero, and runs can be parallelized. It fails in fields rate-limited by physics or ethics: a phase-three clinical trial, a field sample that needs a three-year growth cycle, a wet-lab experiment on a rare sample. Execution is still the real bottleneck there, the premise “execution made abundant and cheap” collapses entirely, and the volume’s retreat thesis adds little, because scarcity never moved off the execution end.

Premise two: judgment can be externalized. It requires that the criteria for “worth believing, worth knowing” can be partly written down, partly machine-checked. In fields where those criteria are highly consensual (some structured-prediction tasks where “correct” is nearly uncontested), externalizing is easy, but the volume’s gain is small too, because there’s no value fork to speak of. The volume’s sweet spot is exactly the field where both premises hold and judgment isn’t yet externalized: execution can already be made abundant, yet judgment is still stuck, unstructured, in a handful of experts’ heads. Running these two as a gate, case by case, is far more reliable than memorizing a list of applicable fields.
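The two-premise gate can be sketched as a small function. This is a minimal illustration under stated assumptions, not part of the volume’s tooling: the `FieldProfile` fields, the `applicability` name, and the verdict strings are all hypothetical, and the yes/no answers still have to come from someone who knows the field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldProfile:
    """Hypothetical yes/no answers for one field or workflow step."""
    execution_abundant: bool       # premise one: runs are near-free and parallelizable
    judgment_externalizable: bool  # premise two: criteria can be partly written down or machine-checked
    criteria_consensual: bool      # "worth knowing" is already nearly uncontested


def applicability(profile: FieldProfile) -> str:
    """Run the two premises as a gate, one case at a time."""
    if not profile.execution_abundant:
        return "ill-fitting: execution is still the real bottleneck"
    if not profile.judgment_externalizable:
        return "ill-fitting: premise two fails"
    if profile.criteria_consensual:
        return "down-weight: no value fork to speak of"
    return "sweet spot: execution abundant, judgment not yet externalized"


# Illustrative inputs only; the classification of any real field is a judgment call.
print(applicability(FieldProfile(True, True, False)))
print(applicability(FieldProfile(False, True, False)))
```

The order of the checks mirrors the text: premise one first, premise two second, and only then the question of whether any value fork is left.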

最适用 · 命题最强Most applicable · thesis strongest

数据/计算密集、执行可并行的域（计算生物、材料筛选、文献综合）
Data/compute-intensive, parallel-execution fields (computational biology, materials screening, literature synthesis)
绿地研究项目，从零按"生成多·验证严"重画流程
Greenfield programs: redraw the workflow from zero around “generate much, verify hard”
已有可机检判据的域（结构化预测、可形式化证明）
Fields with machine-checkable criteria (structured prediction, formalizable proof)

不适合 / 须降权 · 别硬套Ill-fitting / down-weight · do not force

执行本身仍是真瓶颈的域（罕见样本田野、湿实验受物理/伦理限速、临床试验）
Fields where execution is still the real bottleneck (rare-sample fieldwork, wet labs rate-limited by physics/ethics, clinical trials)
价值判据高度共识的域——"值得知"无争议时，本卷的价值退守命题增益小
Fields with highly consensual value criteria: when “worth knowing” is uncontested, the value-retreat thesis adds little
把"提问被充裕"当已证现实去裁人，它是探索清单，不是已证（见 RES 02 待坐实）
Using “questioning is made abundant” as proven grounds to lay people off: it is exploratory, not proven (see RES 02, to be grounded)

可读性问题：突破也许要以部分不可读为代价。适用边界还有一条更深、更不舒服的前沿命题，得诚实摆出来：如果真要 AI 出突破，损失一部分可读性可能是躲不开的。类比 AlphaZero（Silver 等，Science 2018）——它下出的某些棋"概念上不透明"，强过任何人类却没人能完整解释为什么。当 AI 在科学上做出类似的事，风险是发现被"搁浅"在一堆没人能看懂的产出里：你拿到一个比现有理论预测更准的模型，却没法把它翻译成人能理解、能据以行动、能排优先级的知识。这是研究卷一个真实的张力：它整个立论建在"人接住判断"上，可要是突破本身部分不可读，人接住的就只是一个黑箱的输出，不是它的理由。

The legibility problem: breakthroughs may come at the cost of partial illegibility. The applicability boundary carries one deeper, less comfortable frontier claim that has to be put on the table honestly: if you really want AI to produce breakthroughs, some loss of legibility may be unavoidable. Consider AlphaZero (Silver et al., Science 2018): some of its moves are conceptually opaque; it plays stronger than any human, yet no one can fully explain why. When AI does something similar in science, the risk is that discoveries get stranded in a flood of output no one can parse: you’re holding a model that predicts more accurately than current theory, yet you can’t translate it into knowledge a human can understand, act on, or prioritize. This is a real tension for the research volume: its whole argument rests on humans catching the judgment, but if a breakthrough is itself partly unreadable, what a human catches is only a black box’s output, not its reasons.

一句话边界：别把探索清单当已证去裁人。这里最该当成硬门禁的，不是技术域的划分，是一条诚实纪律：这卷有不少命题标着"探索清单·待坐实"——提问被充裕（RES 02）、整合鸿沟急剧扩大（RES 05）、净知识 −40%（RES 02 的 ODE 预测），它们是有侧证支撑的推演，不是已证的事实。把这些还在探索的命题当成"已证现实"去做组织决策，尤其是拿去裁人，是这卷最危险的误用。一个组织如果以"提问已经被 AI 充裕了，所以不需要这么多研究员"为理由裁员，它其实是在把一条 Ⅴ 级的推演当 Ⅰ 级证据用——而 RES 06/07 反复强调的恰恰是：被充裕的是框架内提问，换框架的重构和价值判断不但没被充裕，反而在升值。把探索清单误当硬锚，结果是裁掉了正该守住的那批判断力。

The boundary in one line: do not lay people off on the strength of an exploratory ledger. What should be treated as a hard gate here isn’t the partition of technical fields; it’s an honesty discipline. This volume carries several claims tagged “exploratory, to be grounded”: questioning made abundant (RES 02), the integration gap widening sharply (RES 05), net knowledge −40% (RES 02’s ODE prediction). These are side-evidenced projections, not proven facts. Treating them as proven reality for organizational decisions, layoffs above all, is this volume’s most dangerous misuse. An organization that lays off staff on the grounds that “questioning is already made abundant by AI, so we need fewer researchers” is using a grade-Ⅴ projection as grade-Ⅰ evidence, and what RES 06/07 keep stressing is exactly the opposite: what turns abundant is in-paradigm questioning, while paradigm-level reframing and value judgment don’t turn abundant, they appreciate. Mistake an exploratory ledger for a hard anchor, and you cut away the very judgment you meant to protect.

绿地直接重画，存量先切一条工作流试

Greenfield: redraw directly; incumbent: carve out one workflow first

落到"怎么开始"，适用边界自然分成两条路径，对应组织的两种起点。绿地：一个从零起步的研究项目，可以直接按这卷重画——第一步不是招更多研究员，是立一个最小可追溯证据库，把"什么算值得相信、值得知道"的判据显式写下来；然后用价值分诊（RES 10 的矩阵 + RES 09 的天平）决定哪些动作交给生成、哪些注入人类判断。绿地的好处是没有存量流程的惯性，能一次把次序立对——规格先于生成，整合优先于检索。这也是这卷那个"从头设计"押注真正能被实测的地方：不是把选题-调研-实验-结论那条旧流水线跑得更快，是把机构的记账单位从"发表了多少"换成"沉淀了多少可复现的知识"。

存量改造：一个已经在跑的实验室，绝不能推倒重来，只能一点点改：从一条工作流里切出"执行已充裕、判断还没被外化"的那个环节（文献综合、参数扫描是最常见的入口），只在这一段重画，跑顺了再往外扩。存量改造最要命的陷阱，是把②省下的工时拿去多产论文——RES 13 反复警告过这一条。两条路径共用一句边界判词：这卷适用于"执行已充裕、判断还没被外化"的研究域；执行仍是真瓶颈、或者判断已经高度共识的地方，直说这不是它的目标群体，别硬套。

When it comes to how to start, the applicability boundary naturally splits into two paths, matching an organization’s two starting points. Greenfield: a program starting from zero can be redrawn by this volume directly. Step one is not hiring more researchers; it is standing up a minimal traceable evidence base and writing down explicit criteria for “worth believing, worth knowing.” Then use value triage (RES 10’s matrix plus RES 09’s ledger) to decide which actions go to generation and which get human judgment injected. Greenfield’s advantage is no legacy-process inertia, so you can get the order right in one go: spec before generation, integration before retrieval. This is also where this volume’s redesign bet actually gets tested: not running the old pick-question-survey-experiment-conclude pipeline faster, but swapping the institution’s unit of account from how much got published to how much reproducible knowledge got banked.

Incumbent transformation: a lab already running must never be torn down and rebuilt, only changed piece by piece. Carve out of one workflow the single step where execution is already abundant but judgment isn’t yet externalized (literature synthesis and parameter sweeps are the usual entry points), redraw only that segment, and expand once it runs. The deadliest trap here is spending the hours ② saves on more papers; RES 13 warns against this repeatedly. The two paths share one boundary verdict: this volume applies where execution is already abundant but judgment isn’t yet externalized; where execution is still the real bottleneck or judgment is already highly consensual, say plainly that this isn’t the target group, and don’t force it.

对策是建一层翻译，不是减速。老实说，这是 Asimov Press（The Legibility Problem）等的一个推演，缺工程实证，标为前沿命题。可能的出路：建一层"解释/翻译层"，让 AI 的发现对人可读、能排优先级。这不是要求 AI 只产人能立刻理解的东西（那等于把它又锁回框架内），而是在它产出之后，专门投入去把不可读的发现翻译成可读的知识。这本身就是内核④"人回归意义"的一个新落点：当一阶发现可能不可读，人的稀缺贡献之一，就是去搭那座把黑箱输出译成人类理解的桥。它也回连 RES 05 的整合：可读性翻译是最难的一种整合——把一个借不到任何框架的发现，缝进人类已有的理解结构里。

The remedy is building a translation layer, not slowing down. Honestly, this is a projection from Asimov Press (The Legibility Problem) and others, lacking engineering empirics, flagged as a frontier claim. One possible way out: build an explanation/translation layer that makes AI’s discoveries legible and prioritizable for humans. This doesn’t require AI to produce only what humans can immediately understand (that would just lock it back into the paradigm); it means, after it produces, deliberately investing in translating unreadable discoveries into readable knowledge. This is itself a new landing point for kernel ④’s “humans return to meaning”: when first-order discovery may be unreadable, one of the human’s scarce contributions is building the bridge that translates a black box’s output into human understanding. It also wires back to RES 05’s integration: legibility translation is the hardest kind of integration, stitching a discovery that has no frame to borrow into humanity’s existing structure of understanding.

RES

SPECULATION · 未来推演

SPECULATION

推论 · 外推，非事实

Inference · Extrapolation, Not Fact

往后推演：当研究开始设计科学自己

The Projection: When Research Starts to Design Science Itself

2026 到 2032 会怎样？摊开一个可能性空间，不画一条加速曲线。

What might 2026–2032 hold? A possibility space laid open, not one acceleration curve.

一句话In one line

这一幕摊开一个 2×2：这卷押"执行充裕、人守议程"，反方押"连价值判断也被 RLCF 学走"。谁对谁错，靠一条判据裁：到 2032，AI 自己选的议程，长期引用能不能追平人类基线。

This act opens a 2×2: this volume bets on execution abundant, humans holding the agenda; the counter-bet is that even value judgment gets learned away by RLCF. Who’s right gets settled by one test: by 2032, can AI-selected agendas match the human baseline on long-run citation.

本章性质 · 推论下面是根据 2024–2026 已公开的轨迹做的外推，不是事实陈述。它继承整卷的诚实纪律：被充裕的是框架内提问，由人定义的价值判断——"哪个真相值得知道"——会不会也被充裕，正是这一章押注、也最该被证伪的地方。推论要是站不住了，这一章应该第一个被改写。

Nature of this chapter · Inference. What follows extrapolates from the public trajectory of 2024–2026; it isn’t a statement of fact. It inherits the volume’s honesty discipline: what turns abundant is in-paradigm questioning. Whether constitutive value judgment (“which truth is worth knowing”) also turns abundant is exactly what this chapter bets on, and the first thing it should be falsified against. If the inference stops holding, this chapter should be the first one rewritten.

三股会聚的力，每股都带一条能推翻它的观测

Three converging forces, each with an observation that would overturn it

推演不是在预言"哪条线一定会发生"，是指出哪些力正在叠加、各自在什么观测下会被判错。要是下面三股力同时成立，研究的样貌会从"人提问、机器执行"，滑向"机器也提框架内的问题，人退到选议程、定何为真"。每股都配了一条先行指标和一条证伪条件——后者才是这股力的命门：看见它，就说明这股力被高估了。

Speculation here isn’t prophesying which line must happen; it’s pointing at which forces are stacking up, and under what observation each would be judged wrong. If the three forces below all hold at once, research’s face slides from “humans ask, machines execute” toward “machines also pose in-paradigm questions, humans retreat to selecting the agenda and defining what’s true.” Each carries a leading indicator and a falsification condition. The latter is the force’s real pressure point: see it, and the force was overrated.

力 1 · FORCE 1

自主实验闭环商品化Autonomous experiment loops commoditize

会聚：自驾实验室（self-driving lab）+ 编码 agent + 文献 agent 拼成"假设→实验→分析→下一假设"的整环，单位发现成本逐年掉。
先行指标：一个领域里"无人值守通过同行评审"的论文占比连续两年上升。
证伪：若到 2029 自主闭环仍只在窄域（材料筛选、超参搜索）有效，跨域复现率不升反降，则"整环商品化"被证为局部假象，而非通用力。
Converging: self-driving labs + coding agents + literature agents assemble a full “hypothesis → experiment → analysis → next hypothesis” loop; unit cost of discovery falls year on year.
Leading indicator: in a field, the share of “unattended, peer-review-passing” papers rises for two consecutive years.
Falsified if: by 2029 autonomous loops still work only in narrow domains (materials screening, hyperparameter search) and cross-domain reproducibility falls rather than rises, then “whole-loop commoditization” was a local illusion, not a general force.

力 2 · FORCE 2

提问被部分充裕Question-asking partly made abundant

会聚：知识图谱 agent 在"知识边界上做最近邻搜索"——找空白、补缺环、提框架内好问题——逼近熟练博士生。
先行指标：顶刊里"问题由 AI 首先提出、人类筛选执行"的致谢条目出现并增多。
证伪：若 AI 提的问题在盲评里系统性偏"安全、框架内、引用密集"，且这种偏置三年不收敛，则提问里由人定的那一半未被充裕——力 2 只吃到了边角。
Converging: knowledge-graph agents do “nearest-neighbor search on the knowledge frontier” (finding gaps, filling missing links, posing good in-paradigm questions), approaching a skilled PhD student.
Leading indicator: acknowledgments of the form “question first posed by AI, humans selected and executed” appear and multiply in top journals.
Falsified if: in blind review, AI-posed questions skew systematically toward “safe, in-paradigm, citation-dense” and that skew does not converge over three years, then the constitutive half of questioning was not made abundant; force 2 only ate the margins.

力 3 · FORCE 3

元科学成显学Meta-science goes mainstream

会聚：既然"什么规则让框架更优"还没有判据，加速执行不自动等于进步，于是"怎样的制度生得出更优框架"本身成为被资助、被实验的对象。
先行指标：出现把评审机制、资助规则、复现激励当变量做对照实验的注册研究（科学成了自己的模式生物）。
证伪：若加速十年后，突破性框架（非渐进）的产出率不升反平，且无人能把它归因到制度变量，则"元科学能撬动框架质量"这一假设缺乏可操作抓手。
Converging: since there is still no criterion for “what makes one paradigm better,” accelerating execution does not automatically equal progress, so “which institutions generate better paradigms” itself becomes a funded, experimented-upon object.
Leading indicator: registered studies appear that treat review mechanisms, funding rules, and replication incentives as variables in controlled experiments (science becomes its own model organism).
Falsified if: a decade of acceleration later, the rate of breakthrough (non-incremental) paradigms plateaus rather than rises and no one can attribute it to institutional variables, then “meta-science can move paradigm quality” lacks an operable handle.

FIG. 14.1 / 未来推演：研究的可能性空间（不是一条线，是一个分支场）
THE SPECULATION ACT: RESEARCH’S POSSIBILITY SPACE (a branch field, not a line)
看懂：横轴＝自主闭环的可信度（弱→强），纵轴＝价值判断谁掌（人保留→交给系统）。四格是四种 2032 图景；本卷押注右上"人守议程"格，反方押注右下"判断也被学走"格。图号是稳定标识、锚定单张图件而非出现顺序，故全卷不连续——按标签读，勿据序号推先后；本图原与 FIG. 14.0 决策树撞号，现以小数位 .1 区分。
Read: x-axis = credibility of the autonomous loop (weak→strong); y-axis = who holds value judgment (kept by humans→handed to the system). The four cells are four 2032 pictures; this volume bets on the top-right “humans hold the agenda” cell, the counter-bet on the bottom-right “judgment learned away” cell. Figure numbers are stable labels keyed to a single chart, not running order, so they are not globally sequential: read them as labels; this chart previously collided with the FIG. 14.0 decision tree and is now distinguished by the .1 decimal.

两轴是研究最不确定的两件事：自主闭环到底可不可信（横），以及"值得"的判断权最终在人还是在系统（纵）。本卷押注右上格——执行充裕、人守议程；反方押注右下格——连价值判断都被学走。注意两个左格：闭环一旦不可信，加速只会放大错误，把研究环变成 hypernormal science 的高速生成器。这张图的意义不在选定一格，而在给出每格的先行指标，让你能根据真实观测，判断世界正滑向哪一格。

The two axes are research’s two least-certain things: whether the autonomous loop is credible at all (x), and whether the right to judge “worth” ends up with humans or the system (y). This volume bets on the top-right cell: execution abundant, humans hold the agenda; the counter-bet is the bottom-right: even value judgment is learned away. Note the two left cells: once the loop is not credible, acceleration only amplifies error, turning the research loop into a high-speed generator of hypernormal science. The figure’s value is not in picking a cell but in giving each cell’s leading indicator, so you can judge, from real observation, which cell the world is sliding toward.

2026→2028→2030→2032：研究的样貌一步步在变

2026→2028→2030→2032: research changes shape, step by step

NOW · 2026–2027

AI 当强力副驾，人仍握每一个判断闸

AI as a powerful copilot; humans still hold every judgment gate

文献综述、代码、初步分析大面积交给 agent；提问、实验设计的把关、"值不值得发"仍是人的活。可观测信号：顶刊投稿量已经在涨、评审带宽没跟上（作者估算·未入册·未独立核实）——张力开始显形，但判断闸仍在人手里。

Literature review, code, and first-pass analysis are handed wholesale to agents; questioning, experiment-design gatekeeping, and “is it worth publishing” remain human work. Observable signal: top-journal submissions are already rising while review bandwidth has not kept up (author estimate · not in registry · not independently verified): the tension surfaces, but the judgment gate is still in human hands.

NEAR · 2028–2029

框架内提问被部分充裕，评审制度先撑不住

In-paradigm questioning partly made abundant; review institutions buckle first

知识图谱 agent 能稳定提出"框架内的好问题"，自主闭环在窄域里无人值守跑通。最先变形的不是实验室，是评审与发表制度：生成端被加速、判断端没扩容，系统被迫退回最廉价代理（格式、相似度、引用数），净知识可能下行（arXiv:2604.05714 的 ODE 模型预测约 −40%，模型预测，非已证）。元科学的第一批对照实验在此期登场。

Knowledge-graph agents reliably pose “good in-paradigm questions,” and autonomous loops run unattended in narrow domains. The first thing to deform is not the lab but the review-and-publication institution: the generation end is accelerated, the judgment end is not scaled, and the system falls back on the cheapest proxies (format, similarity, citation counts); net knowledge may decline (the ODE model (arXiv:2604.05714) predicts about −40%, a model prediction, not proven). Meta-science’s first controlled experiments arrive in this window.

MID · 2030

自主实验室常态化，人退守到"选议程 + 定何为真"

Autonomous labs become normal; humans retreat to “select agenda + define what is true”

"执行→设计→选议程"阶梯上，前两阶大面积被吃掉，最后一阶（方向选择）仍最稀缺（FIG 12.0）。研究组织的人机比从个位数跳到两位数；考核口径从"产出量"转向"判断质量 + 上下文连贯"。这一年的关键分歧：AI 选的题在长期引用上能否追平人类基线，这正是横轴右移会不会带动纵轴下移的判据。

On the “execute → design → select-agenda” ladder, the first two rungs are largely eaten; the last (direction selection) stays scarcest (FIG 12.0). The human-to-machine ratio in research orgs jumps from single to double digits; the evaluation lens shifts from “output volume” to “judgment quality + context coherence.” The pivotal divergence of this year is whether AI-chosen agendas can match the human baseline on long-run citation, which is exactly the test of whether the rightward x-shift drags the y-axis down.

FAR · 2031–2032+

两条线分岔：人守议程，或"值得"也被系统化

Two lines fork: humans hold the agenda, or “worth” is systematized too

到这里本卷与反方正式分岔。本卷线：由人定义的价值判断没被充裕，人成了少数高密度的"议程守门人 + 真相裁判"，研究组织像一支判断密度极高的小团队。反方线：RLCF 之类终于学到偏离社群均值的前沿价值，"哪个真相值得"被系统化，人的最后守地塌掉。哪条成真，取决于 2030 那个引用判据，而不是谁的口才更好。

Here the volume and the counter-bet formally fork. Volume line: constitutive value judgment does not turn abundant; humans become a few high-density “agenda gatekeepers + truth referees,” and the research org looks like a tiny team of extreme judgment density. Counter-bet line: something like RLCF finally learns frontier value departing from the community mean, “which truth is worth it” is systematized, and the human’s last ground collapses. Which comes true turns on that 2030 citation test, not on who argues more eloquently.

一件虚构的 2031 自主实验室季报

A fictional 2031 autonomous-lab quarterly

只有断言的推演读着空。下面这件是设计虚构：一个明确标注为虚构的 2031 未来物件，把"研究退守到议程与裁判"做成可触摸的样子。它是把命题投影到 2031 的一种方式，不是预测。

Speculation made only of assertions reads thin. The piece below is design fiction: an explicitly fictional 2031 future artifact that makes “research retreating to agenda and refereeing” tangible. It is not a prediction; it is a way of projecting the thesis onto 2031.

SPECULATIVE · 虚构 · Fiction

ARTIFACT 01 · 自主实验室季报节选 · Excerpt from an Autonomous-Lab Quarterly

Meridian Autonomous Lab · 2031 Q3 研究季报（节选）

Meridian Autonomous Lab · 2031 Q3 Research Quarterly (Excerpt)

本季产出: 自主闭环生成候选假设 41,200 条 · 通过内部复现闸 2,140 条 · 投稿 96 篇 · 人类署名为通讯/责任作者 96 篇（100%）
This quarter’s output: 41,200 candidate hypotheses generated by autonomous loops · 2,140 passed the internal replication gate · 96 submitted · humans listed as corresponding/accountable author on all 96 (100%)
人机比: 7 名研究员 · 约 900 个常驻 agent（1 : 129；2028 为 1 : 14）
Human-to-machine ratio: 7 researchers · about 900 resident agents (1 : 129; was 1 : 14 in 2028)
人类时间去向: 选议程 38% · 独立复现与裁可信 41% · 给不可读发现造"解释层" 21%（写代码占比已 < 2%）
Where human time goes: Selecting the agenda 38% · independent replication & adjudicating credibility 41% · building a “legibility layer” for unreadable findings 21% (coding is now < 2%)
弃用指标: "论文产出量"已从季报删除，它由闭环近乎免费地产生，不再是稀缺信号
Retired metric: “Paper output volume” has been removed from the quarterly: the loop produces it near-free; it is no longer a scarce signal
新设指标: 议程命中率：本季所选方向中，三年后被独立团队接续/复现的比例（替代了"高引论文数"）
New metric: Agenda hit-rate: the share of this quarter’s chosen directions later picked up/replicated by independent teams within three years (it replaced “count of high-citation papers”)

「我们不再为产了多少论文骄傲——那是闭环的副产品。我们只对两件事负责：选对了哪些值得追的问题，以及哪些'结果'我们敢签字担保是真的。其余的，系统自己长出来。」——致理事会备忘

“We no longer take pride in how many papers we produced; that is a byproduct of the loop. We are accountable for only two things: which worth-chasing questions we chose correctly, and which ‘results’ we dare to sign off as true. The rest, the system grows on its own.” — memo to the board
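The ratios implied by the fictional excerpt can be checked with a few lines of arithmetic. This is only a sketch: the figures are copied from the design-fiction artifact above, and the variable names and the `agenda_hit_rate` helper are hypothetical illustrations of how such a metric might be computed, not a definition the volume gives.

```python
# Figures copied from the fictional Meridian 2031 Q3 excerpt.
candidates, replicated, submitted = 41_200, 2_140, 96
researchers, agents = 7, 900

replication_pass_rate = replicated / candidates  # share surviving the internal replication gate
submission_rate = submitted / replicated         # share of replicated results that get submitted
agents_per_researcher = agents / researchers


def agenda_hit_rate(picked_up_within_3y: list[bool]) -> float:
    """Share of chosen directions later picked up or replicated by independent teams."""
    if not picked_up_within_3y:
        raise ValueError("no directions recorded")
    return sum(picked_up_within_3y) / len(picked_up_within_3y)


print(f"{replication_pass_rate:.1%} of candidates pass the replication gate")
print(f"{submission_rate:.1%} of replicated results are submitted")
print(f"1 : {round(agents_per_researcher)} researchers to agents")
```

Read as a funnel, the excerpt’s point is that almost everything the loop generates is discarded before a human signs anything.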

必须写下的反方：判断也许只是又一种在等充裕的能力

The counter-bet, on record: judgment may just be one more capability waiting for abundance

诚实要求把最强的反方也记下来，不是只留对自己有利的那条线。这卷的主命题是：由人定义的价值判断——"哪个真相值得知道"——不会因为模型变强就被充裕，它的稀缺是结构性的，不是能力门槛。反方最锋利的一刀是：这条"结构性稀缺"，也许只是当前模型还不够强留下的临时假象。

Honesty requires writing down the strongest counter-argument too, not just the line that flatters this volume. Its central claim is: constitutive value judgment (“which truth is worth knowing”) won’t turn abundant just because models get stronger; its scarcity is structural, not a capability ceiling. The counter-bet’s sharpest cut: that “structural scarcity” may itself just be a temporary illusion left by models not yet being strong enough.

反方 · 与本卷对赌Counter-bet · against this volume

反方（RLCF，arXiv:2603.14473）：足够强的模型终将学走偏离均值的前沿价值。证伪它、本卷即成立：到 2032，AI 自选议程长期引用仍低于人类基线、不随规模收敛。

Counter-bet (RLCF, arXiv:2603.14473, showed AI can learn the mean of scientific taste): a strong-enough model will eventually learn away the frontier value that departs from the mean. It is falsified, and the volume holds, if by 2032 AI-selected agendas stay below the human baseline on long-run citation and the gap does not converge with scale.
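The falsification condition can be written down as a check, which makes its two parts explicit. This is a minimal sketch under loud assumptions: the function name, the input shape, and above all the reading of “does not converge with scale” as “the gap never shrinks from one model scale to the next” are illustrative choices, not a test procedure the volume specifies.

```python
def counter_bet_falsified(human_baseline: float, ai_by_scale: list[float]) -> bool:
    """Return True if the counter-bet fails under this sketch's reading of the test.

    ai_by_scale: long-run citation scores of AI-selected agendas,
    ordered by increasing model scale (a hypothetical input format).
    """
    gaps = [human_baseline - score for score in ai_by_scale]
    stays_below = all(gap > 0 for gap in gaps)
    # Assumption: "not converging" means the gap never narrows as scale grows.
    not_converging = all(later >= earlier for earlier, later in zip(gaps, gaps[1:]))
    return stays_below and not_converging


# Made-up scores for illustration only.
print(counter_bet_falsified(10.0, [6.0, 6.0, 5.5]))  # gap holds or widens
print(counter_bet_falsified(10.0, [6.0, 8.0, 9.5]))  # gap narrows with scale
```

A real evaluation would need a defensible definition of “long-run citation” and of convergence; the sketch only shows that both halves of the condition must hold at once.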

把反方写进正文，不是谦虚的修辞，是这卷方法本身的一部分：一个命题要是连自己都不肯接受一条证伪条件，它就不是知识，只是一个态度。这一章随时准备被 2032 那条引用判据改写——这恰恰是它配得上"研究方法论"这个名字的原因。

Putting the counter-bet in the body isn’t rhetorical modesty; it’s part of this volume’s method itself: a claim with no falsification condition its own author would accept isn’t knowledge, just an attitude. This chapter stands ready to be rewritten by that 2032 citation test, which is exactly why it earns the name “research methodology.”

RES

PLAYBOOK · 落地 + 最后一层

PLAYBOOK + THE LAST LAYER

行动 · 可执行

Action

落地 · 先立证据库，再守价值责任

Rollout · stand up the evidence base, then hold value accountability

把整卷收成一组能照做的原则，再加一层随充裕前线移动的动态判断。

The whole volume as one set of runnable principles, plus a layer that re-decides as the frontier moves.

一句话In one line

整卷收成三条原则——生成多、验证严；整合排在检索前面；守住价值责任——共用一条纪律：产出量从来不是指标。

The volume reduces to three principles (generate much and verify hard, integration before retrieval, hold value accountability), sharing one discipline: output volume is never the metric.

01 / ↑

生成多·验证严 · 判断占比↑
Generate much, verify hard · judging share↑

先立可追溯证据库；判断/复现占研究者时间的比例上升，产出量本身不是指标。
Stand up a traceable evidence base; the share of time spent judging/replicating rises; output volume itself is not the metric.

02 / ↑

整合优先于检索 · 整合比率↑
Integration over retrieval · integration ratio↑

写下"何为值得相信·值得知道"的判据；盯整合产物相对原始产出的比率。
Write down the criteria for “worth believing / worth knowing”; watch the ratio of integration artifacts to raw output.

03 / ↓

守住价值责任 · 撤回/证伪率↓
Hold value accountability · retraction/refutation↓

让"值得"有归属、不被生成层默认偏置替换；可信度命中率可测。
Give “worth” an owner so the generation layer’s default bias can’t replace it; credibility hit-rate becomes measurable.

最后这一层不给你一张一次性的清单，给的是一道动态的三分。三分不是把一个任务归一次类就完事，是随着充裕前线往右移，得不断重新判——同一个动作今天还在"在变"，明年可能已经滑进"不变"之外，或者滑进"不变"之内。

This last layer hands you no one-time checklist; it’s a dynamic three-way split. The split isn’t a single classification you make and forget; as the abundance frontier keeps moving right, you have to keep re-deciding. The same action sitting in “shifting” today may slide out of, or into, “invariant” next year.

不变 · INVARIANT

哪个真相值得知道
Which truth is worth knowing

无对错、只有归属、由人定义的价值判断——AI 学得到平均，学不到异质。这是基岩。
A constitutive value judgment with no right answer, only belonging: AI learns the average, not the heterogeneous. The bedrock.

在变 · SHIFTING

提问/验证被自动化
Questioning/verifying automated

〔探索清单·Ⅲ〕peer review 净知识 −40%（ODE 模型预测，非已证）；框架内提问并入①充裕。
[exploratory · Ⅲ] peer review’s net knowledge −40% (an ODE-model prediction, not proven); in-paradigm questioning joins ① abundance.

前沿 · FRONTIER

谁有权定研究方向
Who owns the direction

〔探索清单〕当价值判断成稀缺资源，定方向的权力即治理问题——交棒组织卷，悬而未决。
[exploratory] when value judgment is the scarce resource, the power to set direction is a governance question: handed to the Org volume, unresolved.

INSTRUMENT 09 · 研究价值分诊器 RESEARCH-VALUE TRIAGE

把一项研究动作放进双轴，看它该交给生成、由证据库规则定、还是必须人判——直接把内核②的双重退守（从"算不算真"到"值不值得知道"）做成可玩的分诊。

Drop a research action onto two axes and see whether it goes to generation, is decided by evidence-base rules, or must be judged by a human: the kernel’s double retreat (epistemic → axiological) made playable.

X · 可被 AI 执行 / 生成？AI-executable / generatable?

Y · 需人类价值判断？Needs human value judgment?

必人判 · 价值Human · value

可生成 × 需价值判断Generatable × value-laden

必人判 · 可信度Human · credibility

难自动 × 需价值判断Hard × value-laden

交给生成Hand to generation

可生成 × 可机检Generatable × checkable

知识图谱规则定Graph rules decide

难自动 × 可机检Hard × checkable
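The four cells of the triage instrument reduce to a two-question lookup. This is a minimal sketch, assuming each axis is collapsed to a yes/no answer; the `triage` function name and the returned strings are illustrative, and the real instrument presumably places actions on continuous axes.

```python
def triage(ai_executable: bool, needs_value_judgment: bool) -> str:
    """Map one research action onto the four cells of the two-axis triage."""
    if needs_value_judgment:
        # Value-laden actions stay with a human in both columns.
        return "human judges: value" if ai_executable else "human judges: credibility"
    # Machine-checkable actions never need a human signature.
    return "hand to generation" if ai_executable else "graph rules decide"


# Walk all four cells.
for executable in (True, False):
    for value_laden in (True, False):
        print(executable, value_laden, "->", triage(executable, value_laden))
```

Because the abundance frontier keeps moving, the same action may need to be re-run through this lookup each year rather than classified once.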

生产速度跟能消化的速度之间那道鸿沟，才是这卷真正押的注

The gap between production rate and digestion rate is this volume’s real bet

整卷收成一句话：研究卷押的不是"AI 做不了科学"，是"知识生产的速度会远远甩开人类能消化的速度，稀缺因此从生产端永久搬到了消化端——判断、整合、定值"。这个赌注有数量级撑着：科学文献已经差不多每年 250 万篇、每 9 年翻一倍，AI 又在上面叠了一层质变加速。生产曲线往上指数走，人类的认知带宽却近乎恒定，两条曲线之间那道剪刀差，就是这卷所有命题栖身的地方——RES 05 的整合鸿沟、RES 03 的判可信、RES 06 的定值，都是这道剪刀差的不同切面。有一天这个赌注要是被证伪了（人类的消化速度也能跟着 AI 等比例往上提，或者机器能无损接管消化端的判断），整卷就该退役。写得出退役的条件，它才是一个命题，不是一份信仰。

Collapse the whole volume into one line: the research volume isn’t betting that AI can’t do science. It’s betting that the rate of knowledge production will far outstrip the rate humans can digest it, so scarcity permanently migrates from the production end to the digestion end: judging, integrating, valuing. This bet has orders of magnitude behind it: the scientific literature already runs at roughly 2.5 million papers a year, doubling every 9 years, with AI layering a qualitative acceleration on top. The production curve climbs exponentially while human cognitive bandwidth stays nearly constant, and the scissors-gap between the two curves is where every claim in this volume lives: RES 05’s integration gap, RES 03’s credibility judgment, RES 06’s valuing are all facets of the same gap. If this bet is ever falsified (human digestion scaling proportionally with AI, or machines losslessly taking over digestion-end judgment), the whole volume should retire. Being able to write the retirement condition is what makes it a claim, not a faith.
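The scissors-gap can be made concrete with the two figures quoted above. This is a back-of-the-envelope sketch, assuming clean exponential growth at the stated doubling time and a perfectly constant human bandwidth; it deliberately leaves out the extra AI acceleration the text mentions, and all names are illustrative.

```python
DOUBLING_YEARS = 9        # doubling time quoted in the text
PAPERS_NOW = 2_500_000    # papers per year quoted in the text

# Annual growth rate implied by a 9-year doubling time: roughly 8% a year.
annual_growth = 2 ** (1 / DOUBLING_YEARS) - 1


def papers_per_year(years_from_now: float) -> float:
    """Projected yearly output under pure exponential growth."""
    return PAPERS_NOW * 2 ** (years_from_now / DOUBLING_YEARS)


def scissors_gap(years_from_now: float) -> float:
    """Production relative to a constant human-bandwidth baseline (today = 1.0)."""
    return papers_per_year(years_from_now) / PAPERS_NOW


print(f"implied annual growth: {annual_growth:.1%}")
for years in (9, 18):
    print(years, "years:", f"{papers_per_year(years):,.0f} papers/year,",
          f"gap x{scissors_gap(years):.0f}")
```

Even without any AI effect, the gap doubles every nine years under these assumptions; the volume’s bet is that AI steepens the production curve while the baseline stays flat.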

三条原则内在是同一回事："生成多·验证严""整合优先于检索""守住价值责任"看着是三件事，其实是同一句话投在环的三个位置上。生成多——是承认①执行已经充裕了；验证严——是②判断退守之后那个承重的动作；整合优先——是④人接住带宽瓶颈的稀缺贡献；守值——是④把"值得"留给人，不交给生成层的默认偏置。三者共用一条度量纪律：产出量本身永远不是指标。该往上走的是判断/复现占时间的比例、整合产物相对原始产出的比率、研究方向价值决策的归属清不清楚；该往下走的是被撤回/证伪率、向已知解收敛的比率。把这套指标钉在墙上，你就有了一面镜子，随时照得出自己是不是正滑进 hypernormal。

The three principles are one thing underneath: “generate much, verify hard,” “integration over retrieval,” “hold value accountability” look like three things, but they’re one line projected onto three positions of the loop. Generate much admits ① execution is already abundant; verify hard is the load-bearing act after ② judgment retreats; integration first is the scarce contribution humans make in ④ by absorbing the bandwidth bottleneck; hold value is ④ keeping “worth” with humans, not the generation layer’s default bias. All three share one measurement discipline: output volume itself is never the metric. What should rise: the share of time spent judging and replicating, the ratio of integration artifacts to raw output, how clearly research-direction value decisions are owned. What should fall: the retraction/refutation rate, the rate of convergence toward known solutions. Pin this set of metrics on the wall and you have a mirror that, at any moment, shows whether you’re sliding into hypernormal science.
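The “mirror on the wall” can be sketched as a comparison between two reporting periods. This is an illustration only: the `Snapshot` fields are hypothetical names for the five metrics listed above, and how each one is actually measured is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical per-period values for the five metrics named in the text."""
    judging_share: float               # share of time spent judging and replicating
    integration_ratio: float           # integration artifacts relative to raw output
    owned_decisions_share: float       # direction decisions with a clear owner
    retraction_rate: float             # retracted or refuted results
    known_solution_convergence: float  # rate of convergence toward known solutions


SHOULD_RISE = ("judging_share", "integration_ratio", "owned_decisions_share")
SHOULD_FALL = ("retraction_rate", "known_solution_convergence")


def drift_warnings(prev: Snapshot, curr: Snapshot) -> list[str]:
    """List every metric that moved in the wrong direction between two periods."""
    warnings = []
    for name in SHOULD_RISE:
        if getattr(curr, name) < getattr(prev, name):
            warnings.append(f"{name} fell")
    for name in SHOULD_FALL:
        if getattr(curr, name) > getattr(prev, name):
            warnings.append(f"{name} rose")
    return warnings


# Made-up numbers for illustration only.
last_quarter = Snapshot(0.30, 0.10, 0.60, 0.05, 0.40)
this_quarter = Snapshot(0.25, 0.12, 0.60, 0.07, 0.40)
print(drift_warnings(last_quarter, this_quarter))
```

Output volume is absent from the snapshot on purpose, in line with the shared discipline that it is never the metric.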

收束 · 全命题Closing · the whole thesis

可机检的判断会被充裕；由人定义的价值判断才是人最后的守地。敌人不是 AI 变强，是人自愿把"定义值得知道"交出去。

Machine-checkable judgments turn abundant; the constitutive value judgment is the human’s last ground. The real enemy is not AI getting stronger, but people voluntarily handing over “defining what is worth knowing.”

耦合枢纽 · 接驳全系列The coupling hub · seams to the whole series

研究是系列里耦合最深的一卷：向上把"值得知道"交给创新，向下把"谁定方向"交给组织，与工程、设计同出一辙，与架构共享护栏。完整接线见体系总图。

Research is the most deeply coupled volume: upward it hands “which truth is worth knowing” to Innovation, downward “who owns the direction” to the Organization, mirrors Engineering and Design, and shares a guardrail with Architecture. Full wiring in the system chart.

研究面 · 可执行 skill：ai-native-research

The research surface, as an executable skill: ai-native-research

这是可执行配套：它真的去做研究，按本卷的方式：大规模遍历文献、生成假设、跑标准分析、起草综述，然后担保哪条可信、定夺哪个真相值得知道。它不是"设计一个研究组织"（那是架构师 ai-native-architect），也不是把旧流水线加速的文献检索器；删掉 AI 它不会塌回"研究者读得更快"，因为环是围绕"充裕生成 × 承重验证（复现 + 可信度账）"重画的，可追溯证据库即规格。

This is the executable companion: it actually does the research the way this volume describes: traverse the literature at scale, generate hypotheses, run standard analyses, draft synthesis, then vouch for what is credible and decide which truth is worth knowing. It is not “design a research org” (that is the architect, ai-native-architect), nor a literature-search tool that merely speeds the old pipeline; delete the AI and it does not collapse to “a researcher reading faster,” because the loop is redrawn around abundant generation gated by a load-bearing verifier (replication + a credibility ledger), with a traceable evidence base as the spec.

# 在 Claude Code 里调用 · invoke inside Claude Code
$ /skill ai-native-research
> "把这 40 篇相互冲突的论文整合成一个判断：这个结论可信吗、值得我们押注吗？"
> "Synthesize these 40 conflicting papers into one judgment: is this claim credible, and is it worth our bet?"

  → 一份研究发现档案 = 发现 + 可信度账（主张分 Ⅰ–Ⅴ、主张与证据不混）+ 知识图谱贡献 + 盲点登记
  → a Research Finding Dossier = the finding + a credibility ledger (claims graded Ⅰ–Ⅴ, claims kept unmixed from evidence) + a knowledge-graph contribution + a blind-spot register

开源仓库 · Open-source: github.com/watterfall/ai-native-architect/…/skills/ai-native-research ↗
安装 · install: /plugin marketplace add watterfall/ai-native-architect

这是什么同一内核的七件系统里的一件研究面可执行配套：架构层（ai-native-architect）设计组织；六个配套件是六个面各一件、同一内核、彼此耦合、阅读无固定起点——本件是研究方法论的可执行版本。判断节点 + 止步线：遍历、综合、起草尽数交给 agent；但"哪个真相值得知道"与最终可信度判决须由人签字：等级可由工具起草，"我在此价值框架下愿为之担保"的判决不可外包。一个高"分数"不能代理物种/代理变量的跳跃；当一个发现喂给高风险不可逆决策（临床／法律／安全），可信度判决更要留给具名的人，而非更少。

What this is: the executable companion for the research surface, one piece in a seven-piece system on one shared kernel. The architecture layer (ai-native-architect) designs the organization; the six companion pieces are one per surface, sharing one kernel, mutually coupled, with no fixed reading entry. This one is the executable form of the research methodology. Judgment node + stop-line: hand traversal, synthesis, and drafting fully to agents, but “which truth is worth knowing” and the final credibility verdict must be signed by a human. A grade can be drafted by a tool; the verdict “this is what I, in this value frame, am willing to vouch for” cannot be outsourced. A high “score” cannot stand in for a jump across species or across proxy variables; and where a finding feeds a high-stakes, irreversible decision (clinical, legal, safety), the credibility verdict must be left to a named human all the more, not less.

SPEC.V / AI NATIVE METHODOLOGY / OWL METHODOLOGY SERIES

SCOPE / 一套方法论 · 完整组织光谱 N=1 → N=众多（一人公司至 agent 网络，同一套第一性原理）One methodology · the full organizational spectrum N=1 → N=many (from the one-person company to the agent network, on a single set of first principles)

SERIES / 六卷同一内核 · 本卷是其中一个面，完整接线见上方「方法论系列」。Six volumes, one kernel · this volume is one surface; the full wiring is above under “The Series.”

CONTACT / 案例投稿与合作洽谈：Case submissions and collaboration: contact@ai-native.build


APPENDIX · SOURCES / 证据与引用登记 —— 分级口径：Ⅰ 审计级实证（监管文件交叉验证）· Ⅱ 同行评审 · Ⅲ 理论模型／工作论文（引用须写"模型预测"，不得写"已证明"）· Ⅳ 从业者一手陈述 · Ⅴ 咨询预测（是预测，不是事实）。引用条目以本表为准；本轮 3 票对抗复核未发现被驳倒条目。

APPENDIX · SOURCES / Evidence and citation registry; grading key: Ⅰ audit-grade empirics (cross-checked against regulatory filings) · Ⅱ peer-reviewed · Ⅲ theoretical model / working paper (citations must read “the model predicts,” never “proven”) · Ⅳ practitioner first-hand account · Ⅴ advisory forecast (a forecast, not a fact). Citation rows are authoritative in this table; the current 3-vote adversarial review found no overturned source.

REF	GR	SOURCE	Load-bearing claim
R1	Ⅰ	Open Science Collaboration, “Estimating the reproducibility of psychological science,” Science 349(6251) · 2015 · DOI 10.1126/science.aac4716	Of 97 studies with significant results only 36% replicated: replication is the load-bearing verifier separating the research loop from a fast generator (RES 00 / 13).
R2	Ⅱ	Baker, “1,500 scientists lift the lid on reproducibility,” Nature 533(7604) · 2016 · DOI 10.1038/533452a	Over 70% of scientists had failed to reproduce others’ experiments and over 50% had failed to reproduce their own: the wall of reproducibility is empirical, not rhetorical (RES 05).
R3	Ⅳ	Karpathy, “Software Is Changing (Again),” YC AI Startup School · 2025	The frontier moves right as agentic execution becomes abundant: the left of the verifiability gradient is eaten while the right end holds (RES 02). Practitioner first-hand account.
R4	Ⅳ–Ⅴ	Anthropic ladder of research-agent autonomy · 2026 (company self-account; the curve is illustrative, not measured)	The rightmost, hardest-to-automate rung is research-agenda selection (problem selection): the scarce judgment lands in this cell (RES 06).
R5	Ⅱ–Ⅲ	RLCF (RL from community feedback, arXiv:2603.14473 · verified · arXiv/alphaXiv 2026-06)	“The community mean of scientific taste” can be externalized and learned: what is learnable is the mean (the gradient’s left), what is held is the off-mean frontier (RES 06); whether the anti-consensus frontier is learnable still lacks a direct experiment.
R6	Ⅲ	An ODE model of homogenization dynamics (arXiv:2604.05714 · verified · arXiv/alphaXiv 2026-06)	Formalizes “the generation layer converging to the mean” as dynamics: sharper than the qualitative argument, but still a model prediction, not proven (RES 08).
R7	Ⅳ + Ⅲ	Sakana AI, “AI Scientist-v2,” company account + third-party independent review · 2025	First-hand signal: the bottleneck is moving from “generating research” to “judging credibility.” Generation is no longer scarce; vouching for credibility is (RES 01 / 02).
R8	Ⅱ	AI Feynman (symbolic regression) · Udrescu & Tegmark · Science Advances 6(16) · 2020 · DOI 10.1126/sciadv.aay2631	Recovered all 100 Feynman equations (older software got 71), but all are known equations: abundance excels “within a known frame,” which is not the same as cross-frame new understanding (RES 06 / 11).
R9	Ⅱ	Hao, Xu, Li & Evans, “AI tools expand scientists' impact but contract science's focus,” Nature 649(8099) · 2026 · DOI 10.1038/s41586-025-09922-y	Bibliometrics over ~41.298 million papers: AI-using scientists’ individual impact rises, yet science’s overall topic coverage contracts 4.63% and scholar-to-scholar interaction falls. This is the hard anchor for “acceleration ≠ progress” (RES 03 / 08).
R10	Ⅱ	Bornmann & Mutz, “Growth rates of modern science,” JASIST 66(11) · 2015 · DOI 10.1002/asi.23329	The scientific literature base is ~2.5 million papers/year, doubling every 9 years: the empirical baseline for the production side of the scissors gap (RES 05 · FIG 7.0).
R11	Ⅳ–Ⅴ	“The epistemic revolution of AI” (epistemology review / opinion piece)	Argues that AI simultaneously perturbs empiricism, falsification, and Kuhnian paradigms, and that “the rate of knowledge production outpaces single-human cognition”; review-grade side-evidence for the integration gap, with no item-by-item empirics (RES 05 / 06).
R12	Ⅱ	Doshi & Hauser, “Generative AI enhances individual creativity but reduces the collective diversity of novel content,” Science Advances 10(28) · 2024 · DOI 10.1126/sciadv.adn5290	Give writers LLM ideas and individual stories get more “creative,” yet the stories grow more similar to one another. The authors explicitly call it a “social dilemma” (individually better, collectively narrower): the strongest causal anchor for homogenization (RES 08).
R13	Ⅱ–Ⅲ	Anderson et al. (36-person experiment) · 2024	Homogenization is a group-level effect: it comes not from individual fixation but from the LLM suggesting similar ideas to different users, which locates the mechanism at the group rather than the individual (RES 08).
R14	Ⅲ	“We’re Different, We’re the Same” · 2025	After controlling for structural variables, LLMs resemble one another far more than humans resemble one another: cross-model homogeneity that switching models does not cure (RES 08).
R15	Ⅱ	March, “Exploration and Exploitation in Organizational Learning,” Organization Science 2(1) · 1991 · DOI 10.1287/orsc.2.1.71	Exploration (search / variation / risk) and exploitation (refinement / selection / efficiency) compete for the same resources, and exploitation tends to win: the base for “freed capacity does not automatically become slack” (RES 07).
R16	Ⅱ	DeepMind, GNoME, “Scaling deep learning for materials discovery,” Nature 624 · 2023 · DOI 10.1038/s41586-023-06735-9	Discovered ~2.2 million new crystalline materials, but the vast majority are element substitutions within known structure types: abundance expands the known; it does not switch the level of description (RES 11).
R17	Ⅳ	Historical anchors: Harry Beck’s 1933 London Tube map (re-schematization) · William Farr’s cholera map (data organized around “air quality”)	Beck’s paradigm act was discarding geographic accuracy and redrawing the map as a circuit diagram; Farr’s variable frame could not yield “waterborne microbes.” Reframing comes from changing variables, not piling on detail (RES 11).
R18	Ⅳ–Ⅴ	Asimov Press, “the map metaphor” (opinion essay, citing Borges’s one-to-one map parable)	A one-to-one map maxed out on detail is still the same information, not new understanding: integration is not longer retrieval (RES 05). Opinion piece; the empirics it cites each need tracing back and grading on their own.
R19	Ⅳ	Historical anchors: Bell Labs · Xerox PARC · the Cambridge LMB (“small teams + institutional protection of redundant exploration”)	The concrete governance acts that protect “seemingly useless” exploration have historical support: the fate of the useless tree is conditional, not fated (RES 07).
R20	Ⅱ	Kuhn, “The Structure of Scientific Revolutions,” University of Chicago Press · 1962 (monograph)	A paradigm shift lies by definition outside AI’s training distribution: the axiological “reframe” judgment cannot be induced by statistical learning (RES 06 · FIG 6.1 / 6.2).
R21	Ⅱ	Hirsch, “An index to quantify an individual's scientific research output,” PNAS 102(46) · 2005 · DOI 10.1073/pnas.0507655102	The h-index compresses “count of highly cited papers” into one countable proxy with two gameable paths (publish more / get cited more): the original device behind the “counting output as worth” mechanism in RES 13①.
R22	Ⅳ	Goodhart’s law (Strathern 1997’s widely cited restatement: “when a measure becomes a target, it ceases to be a good measure”)	The mechanism base for all of RES 13: once a metric is a target it is optimized rather than reflective of its original intent, and AI compresses the “fails-once-targeted” timescale from years to weeks. An aphorism-grade citation, not item-level empirics.
R23	Ⅳ	The journal impact factor (Garfield 1955 origin; later the Clarivate JCR commercial metric) · its originator repeatedly warned against judging single papers or people by it	Using “a journal mean” to proxy “a single paper’s quality” is a category error (citation distributions are long-tailed); it systematically favors hot topics and positive results, making it RES 13②’s device for welding “hype-chasing” into incentives (links to RES 08 homogenization).
R24	Ⅱ	Björk & Solomon, “The publishing delay in scholarly peer-reviewed journals,” Journal of Informetrics 7(4) · 2013 · DOI 10.1016/j.joi.2013.09.001	Median submit-to-accept delay runs in months: peer review is a serial gate whose throughput is set by the number of human experts (not submission volume), so it is structurally drowned when submissions are unbounded (RES 13③ · links to RES 05’s scissors gap).
R25	Ⅳ	FutureHouse Robin (end-to-end autonomous research loop; ripasudil / RPE dry-AMD direction) · Nature 2026 (preprint arXiv:2505.13400, 2025-05) · DOI 10.1038/s41586-026-10652-y · verified	The AI agent Finch auto-quantified ripasudil’s effect on RPE-cell phagocytosis at ~7.5× (vs DMSO control); a manual recompute of the same flow-cytometry data with different gating thresholds gave ~1.75×. This is an “AI-vs-manual quantification” gap on one dataset, not research acceleration itself. It remains a quantifiable cautionary case for “integration and vouching cannot be outsourced”: integrating scattered results into a conclusion someone dares to sign still needs a human to recompute and vouch (RES 01 · RES 14). Robin’s actual “acceleration” is a different number: compressing manual exploration that would otherwise take months into a matter of hours (Nature 2026 version of record). The effect size needs independent re-checking, and the candidate drug is limited to preclinical / in-vitro evidence.
R26	Ⅱ	AlphaFold2 / 3 (Jumper et al. · Nature · 2021; Abramson et al. · Nature · 2024; 2024 Nobel Prize in Chemistry) · DOI AF2 10.1038/s41586-021-03819-2 · AF3 10.1038/s41586-024-07487-w	Protein-structure prediction compresses “sequence → structure” to near-free: the strongest positive case of in-paradigm abundance. But it predicts structure, not function or efficacy; it needs wet-lab confirmation; and it expands the existing paradigm rather than replacing it (RES 06 / 11).
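As a worked reading of row R10, the “doubling every 9 years” figure can be converted into an annual growth rate. The block below is a back-of-envelope sketch that assumes steady exponential growth, which the table itself does not claim; the symbols N_0, T, and r are introduced here for illustration only.

```latex
% Assumption: constant exponential growth with doubling time T = 9 years (R10).
\begin{align}
  N(t) &= N_0 \, 2^{t/T}, \qquad T = 9\ \text{years}, \quad N_0 \approx 2.5\ \text{million papers/year} \\
  r    &= 2^{1/T} - 1 = 2^{1/9} - 1 \approx 0.080 \quad (\text{about } 8\%\ \text{per year}) \\
  N(9) &= 2\,N_0 \approx 5.0\ \text{million papers/year}
\end{align}
```

Under that assumption, the production side of the scissors gap compounds at roughly 8% a year, while the reviewing side in R24 is bounded by the number of human experts.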

Full research dossier (27 claims · full qualifiers · open items): references/2026-06-deep-research-evidence-and-citations.md

REV. 2026-06 R20 / END OF DOCUMENT

AI Native Research Methodology

Research Pack: from producing knowledge to vouching for belief

When research execution is abundant, scarcity retreats first to questions, then to which truth is worth knowing.

After mass generation, what is needed is evidence and judgment artifacts.

Do not compress “believable” into one score.

First split one research question into two layers.

From AI-Assisted Research to AI-Native Research

“A difference of kind, not degree”: the map metaphor makes it concrete

Three misreadings that disguise “grafting” as “native”

Science is a resource-allocation problem, not an intelligence problem

The same bottleneck moves, but research’s judgment forks deeper

Engineering judges “correct,” research judges “worth”: the same move, only deeper

The pivot of the whole volume: the moment “worth” passes from epistemology to axiology

Execution gets cheap, questions get scarce, but questions fork too

Without this cut, the core thesis half-falsifies itself

The peer-review crisis: submissions rise, net knowledge may fall

The community’s value shifts from producing knowledge to vouching for it

An AI-science case ledger: book “happened / happening / projected” separately

The knowledge graph as guardrail: keeping mass generation traceable

Of the four properties, “falsifiable” is the one most easily skimped

The human’s un-outsourceable scarce act is integration, not retrieval

The scissors gap between production and digestion is the root of integration’s rising value

Stitching is not longer retrieval; it is a paradigm-level act

Research’s ultimate question flips from “how to find truth” to “which truth is worth knowing”

The heterogeneous “worth” cannot be learned, because its sample size is one

“Worth” is a constitutive stipulation, not one more capability

Taste in problem selection is research’s scarcest judgment

Once value judgment lands, it becomes a governance question of “who owns the direction”

On the research face, “humans return to meaning” means returning researchers to the questions worth asking

The fate of the useless tree: efficiency eats redundant exploration by default

The most dangerous way to go wrong: acceleration pushes science narrower, not deeper

Accurate prediction ≠ correct understanding: the solar-system model never grew “gravity”

The mechanism of homogenization: written into the weights, unrecoverable at inference

When generation is unbounded, every claim first crosses a believability ledger

Evidence strength can be supplemented, paradigm distance must be judged: two axes for two acts

Why the “weak × far” cell decides the whole instrument’s value

AI as collaborator, or as judge? A line you must draw first

In-paradigm hands to generation, paradigm-level stays with the human: the same action, split

One test answers three chapters’ questions at once

The same word means two different things on each side