When literature search, running experiments, and data analysis are all near-free, the scarce thing is no longer "doing the research" but "which question to ask" — and one layer deeper, "which answer deserves belief, which truth is worth knowing." This is the most deeply coupled volume in the series: it sits upstream. Research obeys the same kernel as engineering, yet runs deeper — engineering judges "is it correct," research retreats finally to "is it worth knowing," dropping from an epistemic judgment into a value one. Same discipline: tools are surface; what we want is the principle beneath.
KERNEL ON THIS SURFACE ① execution abundant (search / experiments / analysis near-free) → ② judgment forks along the verifiability gradient (in-paradigm questions join abundance, "which truth is worth knowing" sinks) → ③ a queryable evidence base becomes infrastructure → ④ people retreat to guarantors of what is worth believing and knowing. You need not have read the Organization volume — this page stands on its own.
AI-ENABLED RESEARCH → AI-NATIVE RESEARCH
Speed: Search, summaries, and experiments get faster → Questions, evidence ledgers, and exploration ledgers are managed separately
Credibility: Package output as conclusion → Vouch for evidence grade, replication path, and uncertainty
Worth: Know more facts → Judge which truth is worth knowing and pursuing
AI-Native research is not faster paper production; it turns discovery into a traceable, reproducible, integrable credibility system while explicitly keeping value-laden direction with people.
Research Artifacts
After mass generation, the needed artifacts are evidence and judgment artifacts.
Question triage: in-paradigm questions can be automated; paradigm-level reframing stays human.
Believability ledger: evidence strength and paradigm distance are booked separately.
Strong evidence is not the same as paradigm novelty; paradigm distance is not the same as wrongness. Keep the evidence ledger separate from the exploration ledger: one carries reliability, the other leading indicators, scope, and falsifiers.
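A minimal sketch of the two-ledger discipline described above, assuming a plain in-memory structure. The class names, field names, grade labels, and the 0..1 distance score are illustrative assumptions, not a schema this volume defines.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EvidenceEntry:
    """Evidence ledger row: carries reliability only."""
    claim: str
    evidence_grade: str      # hypothetical label, e.g. "III"
    replication_path: str    # where an independent rerun would start


@dataclass
class ExplorationEntry:
    """Exploration ledger row: leading indicators, scope, falsifiers."""
    claim: str
    paradigm_distance: float  # hypothetical 0..1 score; not a measure of wrongness
    leading_indicators: List[str] = field(default_factory=list)
    scope: str = ""
    falsifiers: List[str] = field(default_factory=list)


@dataclass
class Ledgers:
    evidence: Dict[str, EvidenceEntry] = field(default_factory=dict)
    exploration: Dict[str, ExplorationEntry] = field(default_factory=dict)

    def status(self, claim: str) -> str:
        # A claim is grounded only once it has an evidence row;
        # paradigm distance never substitutes for evidence strength.
        if claim in self.evidence:
            return "grounded"
        if claim in self.exploration:
            return "to be grounded"
        return "unregistered"
```

Keeping the two dictionaries separate means a far-from-paradigm claim can sit in the exploration ledger with its falsifiers without being scored as either strong or wrong.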
First Move
First split one research question into two layers.
Ask whether it is merely the nearest gap inside an existing paradigm, or whether it changes variables, level of description, or frame. Let AI batch-execute the first; for the second, have people write the direction and falsifiers first.
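The triage rule above can be sketched as a small decision function. This is an illustration under stated assumptions: the `Question` fields and the two route labels are hypothetical names, and in practice the three flags would themselves be set by a person.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    text: str
    changes_variables: bool = False   # does it swap the variables studied?
    changes_level: bool = False       # does it change the level of description?
    changes_frame: bool = False       # does it change the framing of the problem?


def triage(question: Question) -> str:
    """Route a question: paradigm-level reframing goes to people first."""
    if (question.changes_variables
            or question.changes_level
            or question.changes_frame):
        # People write direction and falsifiers before any execution.
        return "human-first"
    # Nearest gap inside an existing paradigm: safe to batch-execute.
    return "ai-batch"
```

A question with no flag set is treated as in-paradigm by default, so the burden is on the asker to mark a reframing explicitly.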
Researchers searching literature and running experiments faster with AI is still AI-assisted research: the old scientific process speeds up, but the structure of questioning, replication, and value judgment remains unchanged. AI-Native research accepts that research execution is abundant and redraws discovery around "the artifact is near-free": the scarce thing is not the paper but the question and the value judgment. The difference is not degree, but kind.
For years research's scarce resource was researcher hours — the hours of people who can read, compute, run experiments, and write it up. The whole scientific process was built for "doing research is expensive." Once search, execution, analysis, and even the batch generation of hypotheses are near-free and massively parallel, "producing a paper" no longer slows anyone, but the bottleneck does not vanish — it moves: first to asking the right question, then to judging which answer deserves belief, which truth is worth knowing. Fill the kernel's four steps with the specifics of truth and you get the whole thesis of this part.
① ABUNDANCE
Search / experiments / analysis / batch hypotheses
Execution is near-free and massively parallel; "doing research" is no longer scarce.
② JUDGMENT
Right question → believable → worth knowing
Forks along the verifiability gradient: in-paradigm questions join ① abundance; paradigm-level reframing and "worth knowing" sink to ④.
③ CONTEXT
Knowledge graph / evidence base as guardrail
A queryable evidence base is the "spec" for research generation: it keeps mass generation traceable, falsifiable, and integrable.
④ MEANING
Vouch for belief · integrate · define worth
People are no longer producers of knowledge but the guarantors of what is worth believing and knowing.
Position in the system
Research is cognition-facing (research · learning · innovation); org · engineering · design are execution-facing, and the two families couple and feed back into each other. Most readers enter through the organization (most concrete, most buildable), but the entrance is not the logical apex.
"A difference of kind, not degree" — the map metaphor makes it concrete
This volume keeps saying "the difference is not degree but kind," but that line is easily heard as a slogan. Asimov Press's map metaphor drives it to the bone: Borges wrote a parable in which an empire's cartographers made the map ever more accurate until they produced one the size of the empire, one-to-one — detail maxed to the limit, yet it remained the same kind of information, never becoming new understanding. Draw the London Underground ever more accurately, marking every rail's true curvature and geographic coordinates, and it is still an ever-more-accurate geographic map. Until 1933, when Harry Beck did something of a different kind: he threw away geographic accuracy and redrew the whole network as a circuit diagram — straight lines, 45-degree angles, even station spacing, all "inaccurate," yet for the first time letting anyone see at a glance how to change trains. That is a paradigm: a re-schematization, not more detail. AI excels at making the map more accurate (filling blanks, adding detail, raising precision), but "what kind of map to redraw it as" is a different kind of act — it is not on the "more accurate" axis, so no amount of compute crosses to it automatically. This is the epistemological foundation of the whole volume's "bottleneck moving."
Three misreadings that disguise "grafting" as "native"
Reading this volume too narrowly almost always begins from one dislocation: mistaking a change of tools for a migration of scarcity. The first misreading is "faster = native" — a researcher compresses a six-month literature review into six days with AI and declares themselves AI-Native. But that only sped the "execution" of the same pipeline; the bottleneck still sits at "after those six days of reading, who judges which thread to believe and which direction to chase." The second is "more = better" — treating output as performance, going from 4 papers a year to 40. RES 03's Nature bibliometrics already walked this road to its end: individual output and impact do rise, while science's topical coverage contracts. The third is the most insidious, "automatic = autonomous" — believing that wiring up an end-to-end autonomous scientist equals handing research away. Yet the only proxy that pipeline has for scoring its own ideas is "distance from the established paradigm," so the more autonomous it gets, the harder it pushes science into the in-paradigm safe zone. All three share one lesion: they moved execution without recognizing that scarcity had already moved.
Science is a resource-allocation problem, not an intelligence problem
The reason this volume treats research as upstream rather than one more alpha machine rests on an often-overlooked claim: "science is fundamentally a resource-allocation problem, not an intelligence problem" (chenhaot's formalization). That is, what limits scientific progress was never an intelligence bottleneck such as "not computing fast enough, not reading enough," but an allocation bottleneck: "which problems should finite attention, funding, and talent go to." AI drives the cost of "compute, read, run" to near-zero, but that does not solve allocation; it merely exposes the allocation bottleneck more starkly. Producing more does not equal knowing more: if ten thousand papers all cluster in the same data-rich hot corner, the edge of knowledge has not moved an inch. This is why the volume keeps saying "output volume itself is never the metric": in a world of abundant execution, the one still-scarce resource is attention, and where attention should go is precisely RES 06's value judgment of "which truth is worth knowing." See science as an intelligence problem and you pile on more compute; see it as an allocation problem and you guard the judgment node that decides "where to point it."
So this volume's first cut is to redefine the word "research" from "producing something publishable" to "landing a believable, worth-knowing truth into a traceable structure." The former's scarcity is hours; the latter's is judgment. Once you fill the kernel's four steps (① execution abundant → ② judgment retreats → ③ context becomes infrastructure → ④ people return to meaning) with the specifics of truth, the whole thesis takes shape — and the figure below draws those four steps as a loop that self-corrects.
FIG. 0.1 / THE RESEARCH LOOP: REPLICATION IS THE LOAD-BEARING VERIFIER
Read: question → hypothesis → experiment → eval → knowledge is a loop; the one thing that keeps it from spinning free is the "independent replication" gate. Remove it and the loop degrades into a fast generator.
The same loop runs in both engineering and research. The difference is only that gate: engineering's verifier asks "is it correct" (machine-checkable, eventually automated); research's verifier is independent replication plus a credibility judgment, asking "is it worth believing, worth knowing," which the loop's own generation cannot vouch for itself and which a human outside the loop must catch. Remove replication and the five arrows still turn, but what turns is a fast generator spinning in a vacuum.
RES 01
KERNEL
Thesis · Isomorphic to Engineering
The same bottleneck moves, but research's judgment forks deeper
Engineering faces code; its scarce judgment is "is it correct" (verification, an epistemic judgment). Research faces truth; its scarce judgment is first "which question to ask," retreating finally to "which answer deserves belief, which truth is worth knowing" (a value judgment). Step ② is not a single retreat down one stair but a fork along the "verifiability gradient" — exactly where research runs deeper than engineering.
The fork in kernel step ② (load-bearing for the whole volume): judgment is not one block "kept for humans." It splits along "can a machine check it" into two branches —
Machine-checkable → joins ① abundance
In-paradigm questioning/search/synthesis: finding the next checkable gap inside an existing theoretical frame, toward data-rich regions. AI is good at this and parallelizes it. It is no longer "kept for humans"; it becomes one more automated form of execution.
Constitutive → sinks to ④ value bedrock
Paradigm-level reframing and "which truth is worth knowing": in sparse, value-laden domains there is no existing frame to borrow and no machine-checkable proxy for right. This is the human's last, un-outsourceable scarce contribution.
The two branches have utterly different fates, and conflating them errs in both directions. The machine-checkable branch — in-paradigm questioning, search, standard experiment design — is continuously and irreversibly folded into ① abundance: what needs a human today may be eaten by a stronger agent next year, its scarcity merely a temporary capability threshold. The constitutive branch — paradigm-level reframing, value judgment — does not move left as models get stronger, because its scarcity is not a capability threshold but the structural fact of "no machine-checkable right answer." Conflate the two and you commit two opposite errors at once: either defending the machine-checkable branch as "the last line of human dignity" (failing to abundify execution, stuck in hours for nothing), or handing the constitutive branch away early as "automatable sooner or later" (ceding value judgment to the generation layer's default bias). Seeing this fork clearly is, in essence, seeing "which scarcities time dissolves and which it does not" — the source of every operational test in the volume.
So the volume's falsifiable core thesis takes shape: execution becomes abundant → questioning becomes scarce → then a retreat to "which truth is worth knowing" (a value judgment). Its condition for being false is explicit: if one can show that "which truth is worth knowing" can be losslessly formalized, aggregated, or handed to a system to decide automatically — the thesis falls. Being able to write the condition that would refute it is what makes it a claim.
Isomorphism / dive
This step-② fork is the same move as engineering's "machine-checkable correctness joins abundance, taste and risk sink to humans"; the only difference is that research's sinking branch lands in axiology, not epistemology. See the Engineering chapter.
Engineering judges "correct," research judges "worth" — isomorphic but deeper
Placing research beside engineering shows most clearly the weight of "isomorphic but deeper." Both run the same kernel: execution abundified, bottleneck moves to judgment, context becomes infrastructure, humans return to meaning. But their judgment lands on two different layers. Engineering faces code; its scarce judgment is "is it correct" — an epistemic judgment with objective right and wrong, tests, machine-checkable acceptance criteria, so it will eventually be largely automated by verification tooling. Research faces truth; its scarce judgment is first "which question to ask," retreating finally to "which answer deserves belief, which truth is worth knowing" — the first half still in epistemology (credibility, with an evidence gradient), the second half already fallen into axiology (worth, no right answer, only belonging). This is the exact meaning of "deeper": engineering's judgment gradient slides within the single coordinate system of "correct," while research's gradient slides out of epistemology and falls into axiology. The moment a judgment turns from "is this true" into "is this worth knowing," the coordinate system changes — and the new one has no machine-checkable right answer. This is the extra layer of depth the research volume has over the engineering one.
Research is the series' coupling hub, not just one more alpha machine. When research is placed beside engineering and design, the easiest mistake is to read it as "one more machine for making execution cheap." It does make execution cheap, but its place in the series is not at the output end; it is upstream. What it produces is not code or interfaces but the hardest-to-outsource judgment of "which truth is worth knowing." Only with that judgment in hand do downstream engineering, design, and the organization know where to spend their now-cheap execution. Without the upstream, the downstream is a precise idle: a team that can ship anything in six days, with no one to answer "which truth to ship," merely walks the wrong direction faster. This is what "most deeply coupled" means: research is not just cited, it defines the series' input.
The pivot of the whole volume: the moment "worth" passes from epistemology to axiology
If the whole volume could keep one sentence, it is this: research's scarce judgment retreats along a path — first from "execution" to "asking the right question" (a shift within epistemology), then from "asking" to "which answer deserves belief" (still epistemology, but nearing the edge), and finally to "which truth is worth knowing" (falling out of epistemology, into axiology). The first two steps are still in the "right or wrong" coordinate system — machine-checkable, with an evidence gradient, eventually partly automated; the last step changes the coordinate system — "worth" has no machine-checkable right answer, only belonging to "whom, under which value frame." This handover from epistemology to axiology is the volume's pivot: it is both the endpoint of kernel ②'s "judgment retreats" forking on the research face, and the interface where the research volume hands off to the Innovation volume (value discovery). See this pivot and you understand why research cannot be reduced to "faster science" — faster acts only on the epistemic half, while the scarcest, least-outsourceable judgment lives precisely after that one step into axiology.
This coupling is bidirectional, and each seam lands on a concrete SHEET. Upward, research hands "which truth is worth knowing" to Innovation (value discovery) at SHEET 07 — research spots gaps at the edge of knowledge, innovation judges the value those gaps point to. Downward, research hands "who owns the direction" to the Organization (governance) at SHEET 08 — a value judgment, once it has an owner, is a power question. Laterally, research is isomorphic to Engineering at SHEET 01 (both have "the bottleneck moving," but engineering judges correctness while research judges credibility and worth), shares the "un-outsourceable spec" with Design (design judges what is good, research what is true and worth knowing), and shares one guardrail with Architecture/Lineage at SHEET 04 (knowledge graph ↔ design system ↔ architecture boundaries — all specs that keep mass generation coherent, legible, verifiable). The figure below draws all six seams together.
FIG. 1.0 / THE COUPLING HUB: RESEARCH SITS UPSTREAM, PRODUCING "WHICH TRUTH IS WORTH KNOWING"; THE OTHER FIVE VOLUMES SPEND EXECUTION ON IT
Read: research is the center; two solid lines are load-bearing hand-offs (↑Innovation, ↓Org), three dashed lines are isomorphic mirrors (Eng/Design/Arch). Research ships not code or interfaces but the direction-setting judgment the downstream consumes; without it, the downstream is a precise idle.
The two solid lines are one-way, load-bearing hand-offs: research→innovation (the epistemic "gap" handed to the axiological "worth"), research→org (value judgment landing as governance). The four dashed lines are bidirectional isomorphic mirrors, one kernel acting on different faces: correctness (engineering), goodness (design), guardrails (architecture), acquisition (learning). This figure is also a research-view zoom of the site's system chart.
RES 02
MECHANISM · EXECUTION CHEAP, QUESTIONS SCARCE
Mechanism · Force analysis (including the fork)
Execution gets cheap, questions get scarce, but questions fork too
Search, experiments, analysis are near-free; "asking the right question" has not gotten one bit cheaper. But be honest: questioning itself may also be abundified. This sheet cuts "questioning" into two layers — in-paradigm questions get abundified by AI (it is good at it and clusters toward data-rich regions); paradigm-level reframing is the truly scarce thing. Without this cut, the core thesis is half-falsified.
Without this cut, the core thesis half-falsifies itself
Why must this sheet exist rather than jumping straight from "execution abundant" to "questioning scarce"? Because treating "questioning" as one monolithic block declared "forever human" would let an obvious fact half-falsify the thesis on the spot: AI can already pose plenty of decent, checkable good questions inside an existing frame — give it a literature graph and its list of "what to test next" is often more comprehensive than a junior researcher's. If the thesis predicts "questioning is scarce" while reality is "questioning is also being abundified," half the thesis fails to stand. The way out is not stubbornness but to honestly cut questioning into two layers: admit that in-paradigm questioning (finding the next checkable gap toward data-rich regions) is indeed being abundified by AI, draw it out of "kept for humans" and fold it into ① abundance; while naming that the truly scarce thing is paradigm-level reframing (changing the frame, asking what the old frame cannot), which lies outside AI's training distribution. Make this cut and the thesis is sturdier: it no longer bets on the falsifiable strong claim "questioning is forever human" but on a more precise, more durable one — the paradigm-level half of questioning is not abundified by stronger models. The mark of a good claim is exactly that it dares to point out which of its halves will be falsified, then moves the load-bearing weight onto the half that will not collapse.
Mechanism: "asking a good question = spotting the most valuable gap at the edge of knowledge" is exactly the strength of large-scale knowledge-graph analysis — if "valuable" is narrowed to "the next step checkable against existing data, nearest to the known," AI may be better than humans at it. Why it half-stalls: if even step ② "questioning" is abundified, the researcher's last scarce contribution is no longer the epistemic act of "asking." The way out is to split questioning into two layers and see where the break lands.
In-paradigm questions · abundified
"Inside an existing frame, find the next checkable gap where data is thickest": this is nearest-neighbor search on a knowledge graph; AI is good at it and clusters toward data-rich regions (see the empirical mechanism behind the hard anchor in RES 03).
"Switch the frame, ask a question that could not even be posed inside the old one": there is no existing data to borrow and no neighbor to follow. A Kuhnian paradigm shift [R20] lies precisely outside AI's training distribution. This is where the real break sits.
[exploratory ledger] This break — "in-paradigm abundified, paradigm-level still scarce" — is for now thesis-derivation plus epistemology-review side-evidence (The epistemic revolution of AI argues AI simultaneously perturbs empiricism / falsification / Kuhnian paradigms, but offers no item-by-item empirics), and has no single first-hand empirical anchor nailing down "paradigm-level reframing cannot be abundified by AI"; per the two-ledger discipline it is marked "to be grounded," with a leading indicator: whether the share of "reframe" contributions in AI-led research stays durably low. The Nature bibliometrics in RES 03 give strong side-evidence that "AI clusters toward data-rich regions and contracts topical coverage."
The mechanism of abundified questioning: a good question = nearest-neighbor search on the knowledge edge. Why would "asking a question," seemingly the most human and least mechanizable act, be half abundified? The mechanism hides in a common definition of "good question": "a good question = spotting the most valuable gap at the edge of knowledge." Once "valuable" in that definition is narrowed to "the next step checkable against existing data, nearest to the known," it becomes a nearest-neighbor search problem on a knowledge graph — exactly the strength of large-scale graph analysis. Give AI a full-enough literature graph and ask it to find "which adjacent fields no one has yet bridged," "which variable is repeatedly mentioned but never directly measured," "which hypothesis chain is missing its last link," and it will list such in-paradigm good questions faster and more comprehensively than most humans. This is the left half of the break: in-paradigm questioning is in essence a retrieval, and gets abundified. The right half is entirely different — "switch the frame, ask a question that could not even hold in the old one" — there is no graph to search, because that graph is the very thing to be replaced. Cut this clean and you see that the researcher's last scarce contribution lies not in the act of "asking" but in the paradigm-level half of "asking."
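To make "in-paradigm questioning is in essence a retrieval" concrete, here is a minimal sketch of one such retrieval: ranking pairs of topics that no one has bridged directly by how many neighbors they already share. The edge-list representation and the function name `unbridged_pairs` are assumptions for illustration; the common-neighbors heuristic is a standard link-prediction baseline swapped in here, not a method the source names.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def unbridged_pairs(edges: Iterable[Tuple[str, str]]) -> List[Tuple[str, str, int]]:
    """Rank non-adjacent topic pairs by number of shared neighbors."""
    adjacency: Dict[str, Set[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    scored = []
    for a, b in combinations(sorted(adjacency), 2):
        if b in adjacency[a]:
            continue  # already bridged: not a gap
        shared = len(adjacency[a] & adjacency[b])
        if shared:
            scored.append((a, b, shared))

    # Most shared neighbors first; ties broken alphabetically.
    scored.sort(key=lambda item: (-item[2], item[0], item[1]))
    return scored


# Hypothetical toy literature graph (letters stand in for topics).
toy_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
gaps = unbridged_pairs(toy_edges)
```

Every candidate this returns is defined by the existing graph, which is the point of the paragraph above: the search can only propose gaps the current map already implies, never a replacement map.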
The peer-review crisis: submissions rise, net knowledge may fall
"Questions abundified, judgment scarce" is not abstract derivation; it has already surfaced in the concrete institution of peer review. The Organization Science AI Task Force (2026-04) reports submissions +42% and about 17% of review sentences bearing AI traces — the generation end (writing papers, writing reviews) is accelerated, while the judgment end (deciding which paper is worth publishing) has not scaled proportionally. Sharper still is an ODE model (arXiv 2604.05714) predicting review-system dynamics: under parameters of accelerating production and constant review bandwidth, the system's net knowledge may lose about 40% — note, this is a model prediction, not a proven fact (grade Ⅲ; citations must read "the model predicts"). But it names the tension: when anyone can have AI mass-produce "seemingly qualified" papers, review as the sole judgment gate gets flooded, and a flooded review can only fall back on the cheapest proxies — format compliance, similarity to existing literature, citation counts — which in turn feeds RES 03's structural bias: rewarding in-paradigm, penalizing paradigm-level.
It is a gradient, not a line. "In-paradigm / paradigm-level" reads like a clean cut, but the truth is more a continuous verifiability gradient: from the "has a ready benchmark, correctness machine-checkable" end, sliding smoothly to the "no right answer, only belonging, worth to whom under which value frame" end. In between is a wide grey band — machine-checkable yet the criterion itself needs a value trade-off ("how much evidence is enough" varies by field). Drawing it as a gradient rather than a line matters, because the frontier of abundance keeps moving right: "posing a good in-paradigm question," which needs a human today, may be eaten next year by a strong-enough knowledge-graph agent. The thesis does not bet that "some specific task stays human forever"; it bets that the rightmost segment — constitutive value judgment — does not move left as models get stronger, because its scarcity comes not from a capability threshold but from the structural fact of "no machine-checkable right answer." The spectrum below pins several real research actions onto that gradient.
FIG. 2.0 / THE VERIFIABILITY GRADIENT: THE FRONTIER MOVES RIGHT, THE RIGHT END DOES NOT
Read: x-axis left = machine-checkable (abundified), right = constitutive value judgment (scarce). The dashed vertical line is the "abundance frontier": it moves right but cannot cross the rightmost band.
The thesis does not bet that "hypothesizing stays human forever"; that cell's frontier may be eaten tomorrow. The bet is on the rightmost band: judging "which truth is worth knowing" has no machine-checkable proxy for right, so no stronger model crosses it. See the gradient as continuous and you avoid two symmetric errors: forcing the right-end value judgment into the left cell to automate it ("let AI decide where to push research"), or defending an already-eaten left cell as if it sat on the right (humans still tracing citations by hand).
RES 03
REDRAW · PRODUCING → VOUCHING
Redraw · Peer review changes kind
The community's value shifts from producing knowledge to vouching for it
When AI produces in hours what took humans years, the community's value is no longer "producing knowledge" but "verifying it is worth believing." Peer review shifts from "judging research quality" to "judging the credibility of an AI researcher" — a fundamentally different job. The researcher moves from "producing one" to "generating many, then verifying which deserves belief."
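A back-of-envelope sketch of why verification, not generation, becomes the binding constraint. It is plain arithmetic, not the ODE model cited in RES 02, and it does not reproduce that model's figures. The assumptions are loud ones: review capacity is held constant, capacity exactly matched submissions before the rise, and the 1.42 growth factor is simply the +42% submissions figure reported in RES 02 applied as a hypothetical per-period rate.

```python
from typing import List


def full_review_share(submissions: float, capacity: float) -> float:
    """Fraction of submissions that can receive a full review."""
    if submissions <= 0:
        return 1.0
    return min(1.0, capacity / submissions)


def share_trajectory(initial: float, growth: float,
                     capacity: float, periods: int) -> List[float]:
    """Share fully reviewed per period, with fixed capacity (assumption)."""
    shares = []
    submissions = initial
    for _ in range(periods):
        shares.append(full_review_share(submissions, capacity))
        submissions *= growth  # generation scales; judgment does not
    return shares


# Hypothetical numbers: 100 submissions matched by 100 review slots,
# then submissions grow by a factor of 1.42 each period.
shares = share_trajectory(initial=100.0, growth=1.42, capacity=100.0, periods=3)
```

Under these assumptions a single +42% step leaves roughly 70% of submissions with a full review, and the remainder is where the cheap proxies described in this volume would take over.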
First-hand signal (the bottleneck is moving toward "judging credibility"): Sakana's AI Scientist-v2 (provenance: Sakana AI's own account plus third-party independent review; exploratory side-evidence, not a hard anchor) automated the full lifecycle "ideate → design experiments → run → write the paper → review," and a paper it generated passed peer review at an ICLR 2025 workshop, with a mean reviewer score above the human acceptance threshold. But independent assessment notes that its literature review relies on simple keyword search and that its novelty judgment is weak (mistaking established concepts for new). Once execution is abundified, the weakest and scarcest things are exactly "judging credibility / novelty / worth," precisely the bottleneck move the thesis predicts. [grade Ⅳ first-hand account + Ⅲ independent assessment]
Hao, Xu, Li & Evans,《AI tools expand scientists' impact but contract science's focus》, Nature 649(8099), 2026, DOI 10.1038/s41586-025-09922-y。对约 4129.8 万篇论文的分析:用 AI 的科学家个人影响力上升,但科学整体主题覆盖收缩 4.63%、学者间互动下降 22%、引用集中度上升(Gini 0.754 vs 0.690)。机理=AI 向数据丰富区聚集、自动化既有领域而非探索新领域。〔标选择效应〕同行评审 + 开放数据,但属观测性文献计量,因果须谨慎(用 AI 者本就可能集中于热门领域,是相关非因果)。它对 RES 02 的"提问分叉"是强侧证:生成层有保守偏置——加速 ≠ 进步。
Hao, Xu, Li & Evans, "AI tools expand scientists' impact but contract science's focus," Nature 649(8099), 2026, DOI 10.1038/s41586-025-09922-y. An analysis of about 41.298 million papers: individual impact rises for scientists who use AI, yet across science as a whole topical coverage contracts 4.63%, scholar-to-scholar interaction falls 22%, and citation concentration rises (Gini 0.754 vs 0.690). The proposed mechanism: AI use clusters toward data-rich regions, automating existing fields rather than exploring new ones. [flag selection effect] The study is peer-reviewed with open data, but it is observational bibliometrics, so causal claims need care (AI users may already concentrate in hot fields: correlation, not cause). It gives strong side-support to RES 02's "questioning fork": the generation layer carries a conservative bias, and acceleration ≠ progress.
peer review 的结构性冲突:若让 AI 自评"新颖性",它唯一能用的代理就是"与既有文献分布的距离"——而这恰恰把真正新颖的工作压低分(越偏离既有范式,越像"离群/不可信")。于是 AI 评审天然奖励范式内、惩罚范式级,把 RES 02 的保守偏置又放大一层。这就是为什么"评 AI 研究者的可信度"是与"评质量"根本不同的工作:前者要的是人去抵抗这条结构性偏置。
The structural conflict in peer review: if AI self-assesses "novelty," the only proxy available to it is "distance from the existing-literature distribution" — which is exactly what scores genuinely novel work low (the more it departs from the established paradigm, the more it reads as "outlier / not credible"). So AI review intrinsically rewards in-paradigm and penalizes paradigm-level, amplifying RES 02's conservative bias one more turn. This is why "judging the credibility of an AI researcher" is fundamentally different work from "judging quality": the former needs a human to resist this structural bias.
检验信号
Test signal
判断/复现占研究者时间的比例上升;可信度评估命中率可测;被撤回/证伪率下降。反向证伪:若 AI 自评新颖性的命中率追平人类专家盲评,则"结构性冲突"前提松动。
The share of researcher time spent judging/replicating rises; credibility-assessment hit-rate becomes measurable; retraction/refutation rates fall. Reverse-falsifier: if AI novelty self-assessment matches blinded human-expert review, the "structural conflict" premise weakens.
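The structural conflict described above can be made concrete with a toy sketch. It assumes, purely for illustration, that papers are represented as 2-D vectors and that an automated reviewer uses "distance from the centroid of the existing literature" both as its novelty proxy and as its outlier alarm. The vectors, the function names, and the 2-sigma cutoff are all hypothetical; the sketch is not a description of any real review system.

```python
import math
from statistics import mean, pstdev

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(component) for component in zip(*vectors)]

def distance_z_scores(literature, candidates):
    """Z-score of each candidate's distance from the literature centroid.

    The baseline is the spread of distances among the existing papers,
    so a high z-score means "far from what the field already looks like".
    """
    center = centroid(literature)
    baseline = [math.dist(v, center) for v in literature]
    mu, sigma = mean(baseline), pstdev(baseline)
    return {name: (math.dist(v, center) - mu) / sigma
            for name, v in candidates.items()}

# Toy 2-D "embeddings" (hypothetical values, not real data).
literature = [(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)]
candidates = {
    "in_paradigm": (1.5, 1.0),      # sits inside the existing cloud
    "paradigm_level": (6.0, 6.0),   # departs from the existing cloud
}

scores = distance_z_scores(literature, candidates)

# An assumed reviewer rule: distrust anything beyond 2 sigma.
OUTLIER_CUTOFF = 2.0
flagged = {name: z > OUTLIER_CUTOFF for name, z in scores.items()}
```

The same number that would mark the second candidate as most novel is the one that trips the outlier rule, which is the conflict in miniature: a single distance signal cannot tell "genuinely new" from "not credible," so something outside the proxy has to make that call.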
"评质量"与"评 AI 研究者的可信度"是两份不同的工作。peer review 的性质改变,不是"评审变多了"这种量变,是工作种类的质变。旧的 peer review 评的是这一篇研究做得好不好——它默认背后有一个会为自己声誉负责、会被同行追责的人类作者。当论文由一个能数小时产出人类数年量的 AI 研究者批量生成时,评审的对象悄悄从"这一篇"变成了"这个 AI 研究者的产出整体上有多可信"。这是两份工作:前者是逐篇的质量判断,后者是对一个生成源的可信度担保——更像信用评级,而不是论文打分。为什么这个区别承重?因为它决定了人该把判断投在哪里:如果还按"评质量"的老办法逐篇精读,带宽会被瞬间淹没(RES 09 的天平正是为此而设);只有认识到该评的是"生成源的可信度",才会去建可信度评估的机制——抽样复核、追踪某个源的历史命中率、对它的系统性偏置打补丁。把新工作当旧工作做,是 peer review 在 AI 时代失效的第一步。
"Judging quality" and "judging an AI researcher's credibility" are two different jobs. Peer review's change of kind is not the quantitative "more reviews" but a qualitative change of job. Old peer review judged how well this one study was done — defaulting to a human author behind it who answers for their reputation and is held accountable by peers. When papers are batch-generated by an AI researcher that produces in hours what took humans years, the object of review quietly shifts from "this one" to "how credible, overall, is this AI researcher's output." These are two jobs: the former is per-paper quality judgment, the latter is vouching for the credibility of a generation source — more like a credit rating than a paper grade. Why does this distinction bear weight? Because it dictates where the human should spend judgment: keep reading each paper closely the old "judge quality" way and bandwidth is instantly flooded (RES 09's ledger exists precisely for this); only by recognizing that what to judge is "the source's credibility" do you build credibility-assessment mechanisms — sample re-checking, tracking a source's historical hit-rate, patching its systematic bias. Doing the new job as the old job is peer review's first step toward failing in the AI era.
AI 科学案例账:把"已发生 / 正在发生 / 推演"分开记
An AI-science case ledger: book "happened / happening / projected" separately
命题最容易被两种姿态毁掉:用一个炫目的成功案例当成"自主科研已成"的证据,或用一个失败案例当成"AI 做不了科学"的反证。两者都把证据等级压平了。诚实的做法是建一张案例账,每一条都标三件事:它是什么时态(已发生 / 正在发生 / 推演)、它的证据等级(Ⅰ–Ⅴ)、以及它支持还是挑战命题。下面这张表把研究卷用到的主要 AI 科学案例摆在一起——读它的方式不是"看 AI 多强",是看瓶颈搬到哪去了:几乎每一条成功都在执行端,几乎每一条短板都在判断端(新颖性、可信度、方向选择)。这正是命题预言的形状。
The thesis is most easily wrecked by two postures: using one dazzling success as proof that "autonomous science has arrived," or using one failure as a counter-proof that "AI cannot do science." Both flatten the evidence grades. The honest move is to build a case ledger where every row is tagged with three things: its tense (happened / happening / projected), its evidence grade (Ⅰ–Ⅴ), and whether it supports or challenges the thesis. The table below sets the main AI-science cases this volume draws on side by side — the way to read it is not "how strong AI is" but where the bottleneck moved: almost every success sits on the execution side, almost every shortfall on the judgment side (novelty, credibility, direction selection). That is exactly the shape the thesis predicts.
Case | What happened | Tense | Evidence grade | Relation to the thesis
Sakana AI Scientist-v2 | A full-lifecycle-automated paper passed an ICLR 2025 workshop review; independent assessment notes keyword-based literature review and weak novelty judgment. | happening | Ⅳ first-hand + Ⅲ independent | Supports: execution is automated, and the bottleneck surfaces at "judge novelty / credibility."
DeepMind GNoME | 2023: about 2.2M new crystal materials, the vast majority element substitutions inside known structure types. | happened | Ⅱ–Ⅲ (to be traced back to the original Nature paper) | Supports, with a qualifier: in-paradigm gap-filling is abundified, which is not the same as a paradigm-level redraw.
The discipline for reading this table: booking the "happened" hard anchor (Hao's 41.3M-paper study is the only grade-Ⅱ bibliometric anchor) [R9] separately from the "projected" frontier claim (the model organism is Ⅴ) is what keeps this volume from being falsified in one shot. Sakana passing review is a real signal, but it is side-evidence (Ⅳ first-hand + Ⅲ independent) [R7], not a causal experiment; GNoME's 2.2M materials are a first-hand result [R16], but the qualifier "the vast majority are substitutions inside known structures" is its true contribution to the thesis. The sharpest row is the solar-system model: it cleanly separates "predicting accurately" from "understanding correctly." AI can predict orbits to arbitrary precision yet never grows a "gravity" representation internally, because its objective scores only prediction error and never asks "is this variable even the right level of description."
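The three tags a ledger row carries (tense, evidence grade, relation to the thesis) map naturally onto a small data structure. The sketch below is one possible shape, not the volume's own tooling: `Tense`, `CaseRow`, and `book_by_tense` are hypothetical names, and the two rows simply restate the Sakana and GNoME entries from the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum

class Tense(Enum):
    HAPPENED = "happened"
    HAPPENING = "happening"
    PROJECTED = "projected"

@dataclass(frozen=True)
class CaseRow:
    case: str
    tense: Tense
    grade: str   # evidence grade on the I-V scale, kept as text
    stance: str  # "supports", "challenges", or a qualified variant

def book_by_tense(rows):
    """Group rows by tense so hard anchors and projections never mix."""
    ledger = defaultdict(list)
    for row in rows:
        ledger[row.tense].append(row)
    return dict(ledger)

rows = [
    CaseRow("Sakana AI Scientist-v2", Tense.HAPPENING,
            "Ⅳ first-hand + Ⅲ independent", "supports"),
    CaseRow("DeepMind GNoME", Tense.HAPPENED,
            "Ⅱ–Ⅲ", "supports + qualifies"),
]

ledger = book_by_tense(rows)
```

Keeping tense as a closed enumeration means a row cannot be filed without declaring whether it already happened, is happening, or is only projected, which is the separation the reading discipline depends on.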
RES
04
CONTEXT · 知识图谱即护栏
THE GRAPH AS GUARDRAIL
重画 · 规格
Redraw · Spec
知识图谱即护栏——让海量生成留在可追溯的结构里
The knowledge graph as guardrail — keeping mass generation traceable
A queryable knowledge graph / evidence base is the context infrastructure that keeps research generation traceable, falsifiable, integrable. The knowledge graph is the "spec" for generation: every new claim must land back into the evidence base's traceable chain. This is the same guardrail as engineering's "context as infrastructure," design's "system as guardrail," architecture's "legible to agents."
不是某个图谱工具赢,是底层四属性赢。凡满足这四条的证据载体都被放大,凡是锁在 PDF 截图、私有数据库、不可追溯综述里的都被边缘化——和工程那五条贯穿原理同源:
It is not a particular graph tool that wins, but four underlying properties. Any evidence carrier that meets these four gets amplified; anything locked in PDF screenshots, proprietary databases, or untraceable reviews gets marginalized — the same source as engineering's five through-lines:
证据库是规格,不是事后归档。最常见的把知识图谱用错的方式,是把它当成研究做完之后放结果的仓库——先生成、再归档。这恰恰颠倒了承重的次序。在 AI-Native 研究里,可追溯证据库是生成的规格,它必须先于生成存在并约束生成:你先在库里写下"什么算可信证据、什么算冲突、什么必须可追溯到原始数据",生成层才有一个明确的靶子去对齐。这和工程"上下文即基设、先于实现"是同一句话——上下文不是给模型的补充材料,它定义了什么算"做对了"。次序一旦颠倒,问题立刻显形:先让 agent 批量产出再想办法归档,你会得到一堆格式各异、来源残缺、彼此矛盾却无人察觉的主张——一座无法整合的垃圾山,清理它的成本远超当初省下的生成成本。先立库、后生成,库就成了一道实时护栏:无来源的当场被拦、与既有证据冲突的当场被标记、不可追溯的根本进不来。这就是为什么 RES 13 的研究环把"①框定(先立证据库)"放在"②生成"之前——不是流程洁癖,是命题本身。
The evidence base is the spec, not after-the-fact archiving. The most common way to misuse a knowledge graph is to treat it as a warehouse for results after the research is done — generate first, archive later. That inverts the load-bearing order. In AI-Native research, the traceable evidence base is the spec for generation; it must exist before generation and constrain it: you write into the base first what counts as credible evidence, what counts as a conflict, what must be traceable to raw data, and only then does the generation layer have a clear target to align to. This is the same line as engineering's "context is infrastructure, prior to implementation" — context is not supplementary material for the model; it defines what counts as "done right." Invert the order and trouble surfaces at once: batch-generate first and archive later, and you get a heap of claims in varied formats, with broken provenance, mutually contradictory yet unnoticed — an un-integratable garbage mountain whose cleanup cost far exceeds the generation cost you saved. Stand up the base first and the base becomes a real-time guardrail: the sourceless is blocked on the spot, conflicts with existing evidence are flagged on the spot, the untraceable never enters. This is why RES 13's loop puts "① FRAME (stand up the base)" before "② GENERATE" — not process fastidiousness, but the thesis itself.
对 agent 可读:主张、证据、来源是结构化、可被模型直接读写的节点,不是只能人读的散文。
Legible to agents: claims, evidence, and sources are structured nodes a model reads and writes directly, not prose only a human can parse.
可追溯:每条主张挂着它的证据边——能回到原始数据/论文,能被独立复现追踪。
Traceable: every claim carries its evidence edges — back to raw data/papers, trackable for independent replication.
可证伪:相互矛盾的主张在图里显形为冲突边,而不是被淹没在两篇互不引用的论文里。
Falsifiable: contradictory claims surface as conflict edges in the graph rather than drowning in two papers that never cite each other.
可整合:跨领域的主张能被缝合、比对、综合——为下一张"整合而非检索"做基设。
Integrable: claims across fields can be stitched, compared, synthesized — the infrastructure for the next sheet, "integration, not retrieval."
检验信号 / 同构
Test signal / isomorphism
新生成的主张一次就落在证据库可追溯链内的比例上升;"无来源/不可追溯"主张被自动拦下的比例上升。知识图谱 ↔ 设计系统 ↔ 架构边界是同一招——都是让海量生成连贯、可读、可验证的规格。
The share of newly generated claims that land inside the traceable chain on the first pass rises; the share of "sourceless/untraceable" claims auto-blocked rises. The knowledge graph ↔ the design system ↔ architecture boundaries are one move: specs that keep mass generation coherent, legible, verifiable.
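A minimal sketch of "base first, generation second" follows. It assumes the evidence base is just a set of registered source identifiers and that a claim is traceable when it cites at least one source and every cited source is registered. `Claim`, `EvidenceBase`, and the identifiers are hypothetical; a real base would need far richer provenance than this.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    claim_id: str
    statement: str
    evidence: list = field(default_factory=list)  # ids of cited sources

class EvidenceBase:
    """Admission gate that exists before any claim is generated."""

    def __init__(self, known_sources):
        self.known_sources = set(known_sources)
        self.claims = {}    # admitted claims, keyed by id
        self.blocked = []   # ids of claims stopped at the gate

    def admit(self, claim):
        """Admit a claim only if every evidence edge resolves to a known source."""
        traceable = bool(claim.evidence) and all(
            source in self.known_sources for source in claim.evidence
        )
        if not traceable:
            self.blocked.append(claim.claim_id)
            return False
        self.claims[claim.claim_id] = claim
        return True

    def first_pass_share(self):
        """Share of submitted claims that landed in the traceable chain."""
        total = len(self.claims) + len(self.blocked)
        return len(self.claims) / total if total else None

base = EvidenceBase({"paper:A", "dataset:B"})  # illustrative ids
results = [
    base.admit(Claim("c1", "X holds", ["paper:A"])),
    base.admit(Claim("c2", "Y holds", [])),                 # sourceless
    base.admit(Claim("c3", "Z holds", ["paper:unknown"])),  # untraceable
]
```

`first_pass_share` is one way to read the test signal above as a number; the gate only does its job because `known_sources` is populated before `admit` is ever called.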
规格在前,生成在后——次序是承重的
Spec first, generation second — the order is load-bearing
工程卷有一句反直觉的话:上下文不是给生成"补充材料",它是生成的规格,所以必须先于生成存在。研究卷把同一句话翻译成它自己的术语:可追溯证据库不是事后归档生成结果的地方,它是研究的规格——你先写下"何为值得相信、何为冲突、何为可追溯",生成才有东西可对齐。次序颠倒会立刻出问题:先让 agent 批量产一万条主张、再想办法归档,你得到的是一座无法整合的垃圾山;先立证据库、再让每条主张挂着证据边落进来,你得到的是一个能自动把关、能显形冲突、能被独立复现追踪的结构。这就是为什么 RES 13 的研究环把"框定(先立证据库)"放在"生成"之前——这一步不是行政流程,是命题。
Engineering has a counter-intuitive line: context is not "supplementary material" for generation, it is the spec, so it must exist before generation. Research translates the same line into its own terms: the traceable evidence base is not where you archive results after the fact, it is the spec for research — you write down "what is worth believing, what counts as a conflict, what counts as traceable" first, and only then does generation have something to align to. Reverse the order and trouble is immediate: let an agent batch out ten thousand claims and then figure out how to file them, and you get an un-integratable garbage mountain; stand up the base first and let each claim land carrying its evidence edges, and you get a structure that auto-gates, surfaces conflicts, and is trackable for independent replication. This is why RES 13's research loop puts "FRAME (stand up the base)" before "GENERATE" — that step is not administrative process, it is the thesis.
Force analysis · old / new: in the old world evidence scatters across PDF screenshots, proprietary databases, and reviews that never cite each other, and integration relies on accidental connections in a human head; in the new world any evidence carrier meeting the four properties — "legible to agents / traceable / falsifiable / integrable" — gets amplified and the rest marginalized — not because some graph tool won, but because those four properties are themselves the minimal sufficient condition for "mass generation not collapsing into noise." They share a source with engineering's five through-lines: readable/writable by models directly, traceable to origin, conflicts made visible, stitchable across fields. Any carrier missing one is selected against in a world of abundant generation.
四属性里,"可证伪"是最容易被偷工的一条
Of the four properties, "falsifiable" is the one most easily skimped
Of the four properties, "legible to agents" and "traceable" are the ones tool vendors love to talk about, because they demo well and sell well; the truly load-bearing yet most easily skimped is falsifiable. An evidence base that chases only "legible + traceable" becomes an efficient agreement machine: it can neatly store and query ten thousand mutually consistent claims yet never let two contradictory claims collide head-on. The concrete form of falsifiable is making a conflict visible as a conflict edge in the graph, rather than letting it drown in two papers that never cite each other: one says X, one says not-X, each cited, each "traceable," yet the system never reports that they contradict. When generation pushes claims to thousands an hour, such silent contradictions pile up fast (the number of claim pairs that could conflict grows roughly with the square of the claim count), and what you end up with is not a knowledge base but a self-consistent illusion. So when building the base, "conflict detection" is not a nice-to-have advanced feature; it is the line that separates a "base" from a "pile."
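What a conflict edge amounts to can be shown with a deliberately simple sketch. It assumes each claim has already been normalized to a proposition key plus a polarity (asserts it or denies it); that normalization is the hard part in practice and is taken as given here. `conflict_edges` and the sample claims are hypothetical.

```python
from collections import defaultdict

def conflict_edges(claims):
    """Return (claim_a, claim_b, proposition) for every X / not-X pair.

    `claims` is an iterable of (claim_id, proposition, asserts_true).
    Two claims conflict when they share a proposition key but disagree
    on polarity, whether or not their sources cite each other.
    """
    by_proposition = defaultdict(lambda: {True: [], False: []})
    for claim_id, proposition, asserts_true in claims:
        by_proposition[proposition][asserts_true].append(claim_id)

    edges = []
    for proposition, sides in by_proposition.items():
        for affirming in sides[True]:
            for denying in sides[False]:
                edges.append((affirming, denying, proposition))
    return edges

claims = [
    ("c1", "X", True),    # one paper says X
    ("c2", "X", False),   # another says not-X, never citing the first
    ("c3", "Y", True),    # unrelated and uncontested
]

edges = conflict_edges(claims)
```

Both `c1` and `c2` would pass a traceability check on their own; only the pairwise comparison surfaces the contradiction, which is why conflict detection has to be a property of the base and not of the individual claim.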
Isomorphism, restated: this guardrail is the same move projected onto different faces — engineering's "context as infrastructure," design's "design system as guardrail," architecture's "boundaries legible to agents" — all answering one question: "when generation becomes near-free, massive, parallel, what structure keeps it from collapsing into noise?" The answer is isomorphic across four faces: engineering uses context and specs, design uses the design system and component library, architecture uses clear boundaries, research uses the knowledge graph and evidence base. See this and you stop treating "build a knowledge graph" as an isolated technical task within research, and recognize it as the series-wide shared infrastructure principle landing on the research face.
RES
05
REDRAW · 整合而非检索
INTEGRATION, NOT RETRIEVAL
推至极限 · 整合鸿沟
Pushed to the limit · The integration gap
人不可外包的稀缺动作,是整合,不是检索
The human's un-outsourceable scarce act is integration, not retrieval
The bottleneck of human understanding was never the quantity of knowledge but its integration. When generation pushes output toward the near-infinite, the gap between the rate of knowledge production and the rate it can be digested explodes — not something "read more" can fix; it is a bandwidth problem, not a stock problem. The researcher's scarce act shifts from "read more" to "synthesis across knowledge."
生产速度与可消化速度的剪刀差,是整合升值的根
The scissors-gap between production and digestion is the root of integration's rising value
为什么整合会从"研究的一个环节"升值成"人最稀缺的贡献"?因为两条曲线的剪刀差正在张开。科学文献的生产侧本就在指数增长——约 250 万篇/年、每 9 年翻倍;AI 在其上又叠了一层质变加速,把"产出一篇"的成本压到近零。可消化侧呢?人类的认知带宽近乎恒定:一个研究者一天能真正读懂、判断、并入自己理解结构里的论文数,几十年没变多少。生产曲线陡升、消化曲线平躺,两者之间的剪刀差就是"已生成但无人整合"的知识堆积量——它在以生产曲线的速率累积。这道剪刀差不是某个工具能补的,因为补它需要的恰恰是带宽,而带宽是被锁住的那一侧。所以整合的升值是结构性的:在一个生产近无限、消化近恒定的世界里,唯一还在涨价的,就是那个能把碎片缝成理解的认知主体的注意力。〔此为命题推演+数量级侧证,"堆积速率"缺一手实证,标待坐实〕
Why does integration appreciate from "one step of research" into "the human's scarcest contribution"? Because the scissors-gap between two curves is opening. The production side of the scientific literature was already growing exponentially (~2.5 million papers a year, doubling every 9 years); AI layered on a qualitative acceleration, driving the cost of "producing one" to near-zero. And the digestion side? Human cognitive bandwidth is nearly constant: the number of papers a researcher can truly understand, judge, and fold into their structure of understanding in a day has barely grown in decades. With the production curve steep and the digestion curve flat, the scissors-gap between them is the stock of "generated but un-integrated" knowledge, accruing at the production curve's rate. No tool fills this gap, because filling it requires precisely bandwidth, and bandwidth is the locked side. So integration's appreciation is structural: in a world of near-infinite production and near-constant digestion, the one thing still rising in price is the attention of a cognitive subject that can stitch fragments into understanding. [thesis-derivation + order-of-magnitude side-evidence; the "accrual rate" lacks first-hand empirics, flagged to be grounded]
FIG. 7.0 / 剪刀差:产出曲线陡升,消化曲线平躺,张开的口就是"已生成但无人整合"的存量
THE SCISSORS GAP: PRODUCTION CURVE STEEP, DIGESTION CURVE FLAT; THE WIDENING MOUTH IS THE UN-INTEGRATED STOCK
看懂:两条曲线从同一点出发——产出曲线随 AI 指数上扬,消化曲线(人的认知带宽)几乎平。它们之间的阴影口在以产出速率累积,那就是"已生成却没人缝进理解"的知识山;整合升值,是因为这道口只能靠带宽补,而带宽是被锁住的那侧。
Read: both curves start at one point; production rises exponentially with AI, digestion (human cognitive bandwidth) stays nearly flat. The shaded mouth between them accrues at the production rate; that is the mountain of "generated but never stitched into understanding." Integration appreciates because only bandwidth fills this mouth, and bandwidth is the locked side.
这张图把"整合升值"从口号变成几何:两条曲线同源出发,产出随 AI 指数离开,消化几乎贴着横轴爬。它们之间张开的口不是误差,是结构性存量——"已生成却没被任何人缝进理解"的知识,以产出曲线的速率堆积。补这道口需要的恰恰是认知带宽,而带宽是被锁死的那条平线。所以唯一还在涨价的,是能把碎片缝成理解的注意力。
The figure turns "integration appreciates" from slogan into geometry: two curves leave one origin, production departing exponentially with AI, digestion crawling along the axis. The mouth opening between them is not error but a structural stock: knowledge "generated yet stitched into no one's understanding," accruing at the production curve's rate. Filling the mouth requires precisely cognitive bandwidth, and bandwidth is the locked flat line. So the one thing still rising in price is the attention that can stitch fragments into understanding.
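The geometry of the scissors gap can be put into rough arithmetic. The sketch below uses only the two figures quoted in the text (about 2.5 million papers a year, doubling every 9 years) and adds one loud assumption of its own: digestion is held flat at the year-zero production rate, so the two curves start at one point as in the figure. It leaves out the extra AI acceleration entirely, so it illustrates the shape of the argument and measures nothing.

```python
import math

P0 = 2.5e6           # papers per year at t = 0 (figure quoted in the text)
DOUBLING_YEARS = 9   # doubling time in years (figure quoted in the text)

def production_rate(t):
    """Papers per year after t years of steady doubling."""
    return P0 * 2 ** (t / DOUBLING_YEARS)

def unintegrated_stock(t, digestion_rate=P0):
    """Cumulative papers produced minus papers digested over t years.

    Assumption: digestion stays flat at `digestion_rate` (by default the
    year-zero production rate), standing in for fixed human bandwidth.
    Cumulative production is the integral of the exponential rate.
    """
    k = math.log(2) / DOUBLING_YEARS
    produced = P0 * (math.exp(k * t) - 1) / k
    return produced - digestion_rate * t

gap_after_one_doubling = unintegrated_stock(9)
gap_after_two_doublings = unintegrated_stock(18)
```

Under these assumptions the stock after two doubling periods is more than twice the stock after one, which is the sense in which the mouth widens: the backlog grows with the production curve, not linearly with time.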
推至极限:检索是"找到已存在的那条"——可被向量库 + agent 充裕化到近免费。整合是"把从未被并置的几条缝成一个新理解"——它要求一个能同时持有多个框架、判断哪些该缝、缝出的东西是否成立的认知主体。当 AI 每小时新增成千上万条可检索主张,"已生成但无人整合"的知识会堆积成山。研究者真正的杠杆,是站在这座山顶做综合,而不是在山脚多搬几块砖。
Pushed to the limit: retrieval is "find the one that already exists" — abundifiable to near-free by a vector store plus an agent. Integration is "stitch several never-juxtaposed claims into a new understanding" — it demands a cognitive subject that can hold several frames at once, judge which to stitch, and judge whether the stitch holds. As AI adds thousands of retrievable claims an hour, "generated but un-integrated" knowledge piles into a mountain. The researcher's real leverage is to synthesize from the summit, not to haul a few more bricks at the base.
检索 · 会被充裕化
Retrieval · abundified
"在已存在的知识里找到相关的那条"——最近邻、向量搜索、RAG。AI 的强项,趋近免费。
"Find the relevant one inside existing knowledge" — nearest-neighbor, vector search, RAG. AI's strength, trending to free.
"Stitch cross-field, never-juxtaposed claims into new understanding" — holding several frames, judging what to stitch and whether it holds. A bandwidth problem, not a stock one.
〔探索账·待坐实〕"整合鸿沟急剧扩大"目前是命题推演+认识论综述侧证(《The epistemic revolution of AI》直指"知识生产速度超出单一人类认知"),尚无单篇一手实证锚量化"已生成未整合"的堆积速率;标先行指标:整合产物(综述/理论缝合)相对原始产出的比率,以及"已生成但无人整合"的知识堆积量。证伪条件:若自动综述系统能在专家盲评下产出被判"真正缝合而非拼贴"的整合,则整合的人类专属性松动。
[exploratory · to be grounded] "The integration gap explodes" is for now thesis-derivation plus epistemology-review side-evidence (the review "The epistemic revolution of AI" points directly at "the rate of knowledge production outpacing single-human cognition"), with no single first-hand empirical anchor quantifying the accrual rate of "generated-but-un-integrated" knowledge. Leading indicators: the ratio of integration artifacts (reviews / theory-stitching) to raw output, and the stock of "generated but un-integrated" knowledge. Falsifier: if automated review systems produce, under blinded expert judging, integrations rated "genuinely stitched, not collaged," the human-exclusivity of integration weakens.
缝合不是更长的检索,是一种范式级动作
Stitching is not longer retrieval; it is a paradigm-level act
最容易把整合贬低成"高级检索"的误解是:以为只要把检索窗口拉得足够长、把上下文塞得足够满,模型就能"读完一切"然后自动综合。这混淆了两个种类不同的动作。检索是在同一个框架内找到相关条目——它有标准答案、有最近邻、可被向量库充裕化。缝合是跨框架地判断"这几条本来不在一起的主张,并置之后是否生出一个新理解"——它没有最近邻可循,因为"该把哪几条放在一起"这个判断本身就在框架之外。这正是 RES 02 那道可验证性梯度的右段:缝合往往要求一次小型的范式级动作——决定用哪个新框架来组织这些碎片。所以整合不是检索的延长线,它和"提一个范式级问题"是同一种稀缺判断的两种表现。
The easiest way to demote integration to "advanced retrieval" is to assume that a long-enough retrieval window and a full-enough context will let the model "read everything" and synthesize automatically. This conflates two different kinds of act. Retrieval finds relevant items within one frame — it has a right answer, a nearest neighbor, and is abundifiable by a vector store. Stitching judges across frames whether "these claims that were never together generate a new understanding once juxtaposed" — it has no neighbor to follow, because "which ones to put together" is itself a judgment outside any frame. This is precisely the right end of RES 02's verifiability gradient: stitching often demands a small paradigm-level act — deciding which new frame organizes the fragments. So integration is not retrieval extended; it and "posing a paradigm-level question" are two expressions of the same scarce judgment.
带宽问题,不是存量问题——这一区分有操作后果。如果整合是存量问题("读得不够多"),解法就是让 AI 多读、多产摘要,把存量补齐。但它是带宽问题("同时持有多个框架并判断如何缝合的认知容量有限"),那么多产摘要只会让堆积更快,让带宽更紧。操作上的差别巨大:把工时投回"让 AI 多读多摘",是在加速病因;把工时投回"让人专注做跨框架综合,AI 只负责把候选材料喂到手边",才是对症。所以 RES 13 的研究环把④整合设为承重瓶颈阀门——不是因为整合"重要"这种空话,而是因为它是唯一不能靠多产来缓解、反而会被多产加重的环节。盯住一个具体指标:整合产物(真正缝出新理解的综述/理论)相对原始产出的比率,是升还是降。
A bandwidth problem, not a stock problem — and this distinction has operational consequences. If integration were a stock problem ("haven't read enough"), the fix would be to have AI read more and summarize more, filling the stock. But it is a bandwidth problem ("limited cognitive capacity to hold several frames at once and judge how to stitch"), so producing more summaries only piles faster and tightens bandwidth further. The operational difference is large: reinvesting hours into "let AI read and summarize more" accelerates the cause; reinvesting into "let humans focus on cross-frame synthesis while AI only feeds candidate material to hand" treats it. This is why RES 13's loop sets ④ integration as the load-bearing bottleneck valve — not from the empty phrase that integration "matters," but because it is the one step that cannot be relieved by more output and is in fact worsened by it. Watch one concrete metric: whether the ratio of integration artifacts (reviews/theory that genuinely stitch a new understanding) to raw output is rising or falling.
Reproducibility is the wall: errors should not be deleted, they should feed back. The opposite of integration is quietly discarding what cannot be integrated. When generation adds thousands of claims an hour, the laziest handling is to tag everything "non-replicable," "conflicting with the base," or "looking like an outlier" as noise and delete it, which is precisely the most dangerous move. Reproducibility in this volume is not a QC step; it is the wall that separates "a self-correcting research loop" from "a fast generator." A non-replicable claim has two possibilities: it is wrong (falsify it), or it reveals a blind spot in current evaluation methods (build a new eval). Delete it as noise and you lose both kinds of information; feed it back as a new rule, a new node type, or a new replication check in the evidence base, and the error becomes a guardrail, so fewer such errors recur next round. The figure below draws this "error feedback" as a closed loop.
FIG. 5.0 / 复现之墙:错误回流成新的评估与护栏
THE WALL OF REPRODUCIBILITY: ERRORS FEED BACK AS NEW EVALS
看懂:一条主张撞墙后不是被删,是分流——错的去证伪库,揭示盲区的去"造新 eval",两条都回流成下一轮的护栏。
Read: a claim hitting the wall is not deleted but routed: the wrong goes to the refutation base, the blind-spot-revealing goes to "build a new eval"; both feed back as next round's guardrail.
复现把生成分成三流,而最有价值的不是"通过"那一流,是"揭示盲区"那一流——它催生新的评估方法,正是 RES 11 给护栏留的"换变量口子"在运行时的样子。一个只删噪声、不回流错误的系统,是开环的高速生成器;一个把每次撞墙都回流成新规则的系统,才是会自我纠偏的研究环。这与工程"错误回流成新测试/新护栏"完全同构。
Replication splits generation into three streams, and the most valuable is not the "passes" stream but the "reveals a blind spot" stream: it spawns new evaluation methods, which is exactly what RES 11's "change-the-variable door" looks like at runtime. A system that only deletes noise and never feeds errors back is an open-loop fast generator; a system that feeds every wall-hit back as a new rule is a self-correcting research loop. This is fully isomorphic to engineering's "errors feed back as new tests / new guardrails."
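The three streams can be sketched as a small router. The sketch assumes each replication attempt comes back with two facts: whether the claim replicated, and whether the evaluation method itself is trusted for this kind of claim. The second flag, `eval_trusted`, is a hypothetical stand-in for a judgment that a person would make; `ReplicationResult`, `ResearchLoop`, and the guardrail labels are likewise illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ReplicationResult:
    claim_id: str
    replicated: bool
    eval_trusted: bool  # hypothetical flag: is the eval itself adequate here?

@dataclass
class ResearchLoop:
    accepted: list = field(default_factory=list)
    refutation_base: list = field(default_factory=list)
    new_eval_queue: list = field(default_factory=list)
    guardrails: list = field(default_factory=list)

    def route(self, result):
        """Route a claim after it hits the replication wall; never delete it."""
        if result.replicated:
            self.accepted.append(result.claim_id)
            return "pass"
        if result.eval_trusted:
            # The claim is wrong: keep it as a refutation and as a guardrail.
            self.refutation_base.append(result.claim_id)
            self.guardrails.append(f"refuted:{result.claim_id}")
            return "refuted"
        # The eval may be the blind spot: queue work on a new evaluation.
        self.new_eval_queue.append(result.claim_id)
        self.guardrails.append(f"new-eval:{result.claim_id}")
        return "blind_spot"

loop = ResearchLoop()
routes = [
    loop.route(ReplicationResult("c1", replicated=True, eval_trusted=True)),
    loop.route(ReplicationResult("c2", replicated=False, eval_trusted=True)),
    loop.route(ReplicationResult("c3", replicated=False, eval_trusted=False)),
]
```

Every failed replication leaves an entry in `guardrails`, and there is no branch that drops a claim, which is the closed-loop property: an open-loop generator would be the same code with the two non-passing branches replaced by a delete.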
RES
06
REDRAW · 从何为真退守到何为值得知
FROM TRUE TO WORTH KNOWING
断裂点 · 反转 → 接创新
The break · Reversal → to Innovation
研究的终极问题,从"怎么发现真相"翻转为"哪个真相值得知道"
Research's ultimate question flips from "how to find truth" to "which truth is worth knowing"
The break, head-on: if questioning is also abundified (RES 02 already cut it in two), the researcher's scarce contribution is no longer the epistemic act of "asking" but the axiological act of "judging which answer truly matters." Research's ultimate question flips from "how to find truth" to "why some truths are more worth knowing than others." Here it hands off upward to the Innovation volume (value discovery).
FIG. 6.1 / 坐标系翻转:从认识论的"对不对"跌进价值论的"值不值得知"
THE PIVOT: FROM EPISTEMOLOGY'S "IS-IT-CORRECT" INTO AXIOLOGY'S "IS-IT-WORTH-KNOWING"
看懂:左轴是认识论——有对错、可机检、终将自动化,前线一路右移。但研究的承重问题不在这根轴上:它垂直跌进另一根轴——价值论,问"哪个真相值得知道"。换轴不是换位置;新轴没有对错,只有"对谁、在哪个价值框架下值得"——这正是 AI 无法机检、人接住的那一格。
Read: the left axis is epistemology (right/wrong, machine-checkable, eventually automated, frontier sliding right). But research's load-bearing question is not on that axis: it drops perpendicular onto another, axiology, asking "which truth is worth knowing." Pivoting axes is not moving position; the new axis has no right/wrong, only "worth to whom, under which value-frame," which is exactly the cell AI cannot machine-check and a human holds.
瓶颈搬家(执行→验证)始终在同一根横轴上滑动——都在问"对不对",所以终将被自动化追上。这一步不同:承重的问题垂直跌进另一根轴。新轴上没有对错,只有归属——一个真相"对谁、在哪个价值框架下值得知道"。机器能把第一根轴的前线一路推到右端,却到不了第二根轴,因为那里没有可机检的判据。研究最坚固的守地,就是这个换轴动作本身。
Bottleneck-moves (execution→verification) slide along one horizontal axis, all asking "is it correct," so automation eventually catches them. This step is different: the load-bearing question drops perpendicular onto another axis. On the new axis there is no right/wrong, only belonging: whether a truth is "worth knowing, to whom, under which value-frame." A machine can push the first axis's frontier all the way right yet never reach the second, because no machine-checkable criterion lives there. Research's most durable ground is the axis-switch itself.
异质的"值得"学不到,因为它的样本只有一个
The heterogeneous "worth" cannot be learned, because its sample size is one
为什么 AI 学得到平均的"值得",学不到异质的"值得"?根子在学习的前提:任何被学的东西,都要有足够多的、可被归纳的样本。平均的"值得"——一个领域里被反复表达、被大量论文共同认可的价值取向——有海量样本,所以可被外化、可被 RLCF 当 reward 学走。但真正承重的那种"值得",是只对某个个体、某个群体、在某个特定价值框架下才成立的——它的样本量本质上是一。一个研究者基于自己独特的处境、经历、所属共同体的关切,判断"这个真相对我们值得追",这个判断没有一个可被归纳的训练集,因为它构成性地绑定在那个独一无二的视角上。这正好对应 RES 08 讲的同质化机制的镜像:AI 默认拉向均值(regression to a domain prototype),而异质的价值判断按定义就是偏离均值的那部分。所以这不是"AI 现在还学不到、以后会"的能力问题——异质的"值得"在统计学习的框架里根本没有可学的对象,因为可学意味着可归纳,而它恰恰是不可归纳的。这是人在研究面最后、也最坚固的守地。
Why can AI learn the average "worth" but not the heterogeneous "worth"? The root is learning's premise: anything learned needs enough inducible samples. The average "worth" — a value orientation repeatedly expressed in a field, jointly endorsed by many papers — has vast samples, so it can be externalized and learned by RLCF as reward. But the truly load-bearing kind of "worth" is the one that holds only for a particular individual, a particular group, under a particular value frame — its sample size is essentially one. When a researcher, drawing on their unique situation, history, and community's concerns, judges "this truth is worth chasing for us," that judgment has no inducible training set, because it is constitutively bound to that one-of-a-kind vantage point. This mirrors the homogenization mechanism of RES 08: AI pulls toward the mean by default (regression to a domain prototype), and a heterogeneous value judgment is by definition the part that departs from the mean. So this is not a capability problem of "AI cannot yet, but will" — in the framework of statistical learning the heterogeneous "worth" simply has no learnable object, because learnable means inducible, and it is precisely non-inducible. This is the human's last and sturdiest ground on the research face.
Why this is a break, not another bottleneck-move: a bottleneck-move shifts position within one coordinate system — execution → verification, both still inside the epistemics of "right or wrong." Here the coordinate system itself changes: from "which answer is true" (has a right answer, machine-checkable, eventually automated) into "which truth is worth knowing" (no right answer, only belonging — worth to whom, to which group, under which value frame). This is where the kernel's constitutive branch of fork ② lands: it is not a capability (capabilities can be out-done by "more accurate") but a constitutive stipulation. AI can learn the average, oft-repeated "worth"; it cannot learn the heterogeneous "worth" that holds only for a particular individual or group.
"值得"是构成性规定,不是又一种能力
"Worth" is a constitutive stipulation, not one more capability
把"哪个真相值得知道"读成"人比 AI 更会判断价值的一种能力",是最常见也最危险的误读——因为能力可以被超越。如果"判断值得"只是一种能力,那么一个足够强的模型迟早会做得比人更好,整卷的人本立论就只是暂时的。命题真正主张的不是这个:"值得"不是一种能力,是一个构成性规定。区别在于——能力问"谁判得更准",构成性问"由谁来定义这件事算不算数"。"这个真相对我们值得知道"这句话里,没有一个外在的、可被更准的判断逼近的"正确答案";它的真值由提出它的那个价值框架构成。对一个把延长健康寿命当作首要善的群体,"衰老的分子机制"值得知道;对一个把生态完整当作首要善的群体,同一笔研究预算或许该投向别处。两者没有谁"判错了",因为"值得"是相对于价值框架被构成的,不是被发现的。AI 能学到被反复表达的、平均的"值得",但它无法越过这个构成性事实去给出一个对所有框架都正确的"值得"——因为那样的东西在逻辑上就不存在。
Reading "which truth is worth knowing" as "a capability at which humans judge value better than AI" is the most common and most dangerous misreading — because capabilities can be surpassed. If "judging worth" were merely a capability, a strong-enough model would eventually do it better than humans, and the volume's human argument would be only temporary. That is not what the thesis claims: "worth" is not a capability but a constitutive stipulation. The difference: a capability asks "who judges more accurately," a constitutive matter asks "who gets to define whether this counts at all." In the sentence "this truth is worth knowing to us" there is no external "right answer" that a more accurate judgment could approach; its truth value is constituted by the value frame that poses it. To a group that holds extending healthy lifespan as the first good, "the molecular mechanism of aging" is worth knowing; to a group that holds ecological integrity as the first good, the same research budget might better go elsewhere. Neither "judged wrong," because "worth" is constituted relative to a value frame, not discovered. AI can learn the oft-repeated, average "worth," but it cannot step past this constitutive fact to give a "worth" correct for all frames — because no such thing logically exists.
The reversal (the load-bearing merge upward): research's ultimate question may never have been "how to find truth" but "why some truths are more worth knowing." This makes the research volume merge naturally upward into Innovation (value discovery) — research spots gaps at the edge of knowledge, innovation judges the value those gaps point to. Their meshing point is exactly the moment the word "worth" passes from epistemology into axiology.
研究卷②的终点("哪个真相值得知道")就是创新卷的起点(价值判断):谁有权说"这值得知道"——以及这判断能否被无损系统化,是创新分叉的关键悬案。
The endpoint of research's step ② ("which truth is worth knowing") is the starting point of the Innovation volume (value judgment): who has the standing to say "this is worth knowing," and whether that judgment can be losslessly systematized, is the open question at the heart of innovation's fork.
Taste in problem selection is research's scarcest judgment
Pull "which truth is worth knowing" one notch toward the operational and it lands as a concrete act: problem selection. Anthropic's 2026 "ladder of autonomy" [R4] grades a research agent's capabilities, and its conclusion is blunt: the rightmost, hardest-to-automate rung is precisely research-agenda selection. Claude can match or exceed skilled humans at "executing a well-defined experiment," but "choosing which problems to work on, which anomalies are worth chasing, which seductive idea is actually a dead end" still shows a clear gap. chenhaot's "The Mirage of the AI Scientist" formalizes the irreplaceable human role as Selector (what to do) + Evaluator (quality / credibility): science is fundamentally a resource-allocation problem, not an intelligence problem, and producing more is not knowing more. Terence Tao's line says the same thing in other words: "when the cost of idea generation is driven to near-zero, the bottleneck becomes verify / evaluate." Attention becomes the scarcest resource in the knowledge economy.
Taste is not an intuition beyond discussion; it is a gradable gradient. It runs from "pick a problem with data, a benchmark, and a steady paper yield" (low taste, AI already does it), to "pick a problem others find boring but your intuition says holds ore," to "pick a problem you cannot even pose without changing the frame" (high taste, outside AI's training distribution). What AI can learn is the left half of this gradient: RLCF (community preference as reward) [R5] has shown that "the community mean of scientific taste" can be externalized and learned. But what it learns is exactly the community mean, whereas real problem-selection taste lies in the frontier value that departs from the current community mean, which is precisely what homogenized research systematizes away. The figure below draws problem-selection taste as a gradient and marks which segment AI can learn and which it cannot.
FIG. 6.0 / THE PROBLEM-SELECTION TASTE GRADIENT: AI LEARNS THE MEAN, NOT THE DEPARTURE
Read: the x-axis is taste, low→high; the shaded band is the "community mean" RLCF can learn; the rightmost "frontier value departing from the mean" lies outside it, and systematizing that too would systematize homogenization.
This gradient explains an apparent paradox: scientific taste can be learned (RLCF showed it), yet problem selection is still the human's last ground. There is no contradiction: what is learnable is the community mean (shaded band); what is held is the off-mean frontier (the rightmost block). The danger: if an organization mistakes "the learnable mean taste" for "all taste" and systematizes it, it systematizes homogenization itself. This fork is the key open experiment as research hands off to innovation: can RLCF learn frontier value that departs from the current community mean? [evidence: RLCF learning community preference is Ⅱ–Ⅲ; whether it learns anti-consensus frontier lacks a direct experiment, flagged as frontier]
FIG. 6.2 / THE PROBLEM-SELECTION FUNNEL: ABUNDANCE EATS EACH MACHINE-CHECKABLE LAYER; WHAT SETTLES AT THE BOTTOM IS THE SCARCE JUDGMENT
Read: candidate problems pour into the funnel from the top; at each layer down, one "why is this worth doing" judgment gets eaten by abundance, and the benchmarked, data-rich, community-sanctioned problems are taken over by AI layer by layer. The one cell that settles at the very bottom, that nothing automates, is "pick a problem that needs a new frame even to state": research's scarcest, rightmost judgment.
Pull "which truth is worth knowing" to the operational and it lands as problem selection, and problem selection is not one act but a funnel. Abundance eats top-down: first the benchmarked, then the data-rich and community-sanctioned, and even the "boring-to-others, ore-to-you" layer is barely reached by RLCF. What truly settles at the bottom, that nothing catches, is "pick a problem that needs a new frame even to state": it lies outside the training distribution, with no sample to induce from. This cell is scarce not because it is hard but because it is constitutively non-inducible, which is why it is the human's most durable ground in research.
RES
07
REDRAW · WHO OWNS THE DIRECTION
Reversal → to Org · the human through-line
Once value judgment lands, it becomes a governance question of "who owns the direction"
The moment "which truth is worth knowing" lands in reality, it becomes "who judges, who is accountable, who vouches." The human's return = owning the value and credibility of the research direction. Here it hands off downward to the Organization volume (who has the authority to set direction). It is also where the human through-line lands on the research surface: AI-Native research is not about piling up papers faster, but about returning the researcher to questions worth asking.
On the research face, "humans return to meaning" means returning researchers to the questions worth asking
Kernel step ④, "humans return to meaning," is not a sentimental closer; on the research face it has a very concrete, testable landing point: freeing researchers from the "more papers" treadmill and returning them to the questions that are worth asking yet cannot be measured. This is the full weight of the human argument in the research volume: cheaper execution is never the point in itself. AI makes execution cheap not so that researchers produce more papers per unit time (that only binds people deeper to the wheel of exploitation), but so that researchers free up cognitive bandwidth for the act the machine cannot do and that is most worth a human doing: judging which truth is worth knowing, and holding a research direction that is not hijacked by output. In a research organization that uses AI rightly, researchers' time on "asking, judging, integrating" should increase, not their time on "chasing output." This human through-line and the Org volume's "put people back at the organization's center" are two faces of one sentence: the org face returns people to the judgment node, the research face returns people to the questions worth asking. Their meshing point is the one thing that runs through the whole series: the cognitive bandwidth freed by handing execution away lands back in human hands (to ask, to judge, to take responsibility) rather than binding people to output metrics.
Orthogonal retreats (filling the blind spot): when first-order research becomes abundant, the human retreats not only to "worth knowing" but also in two orthogonal directions. First, to the design of science itself: when running an experiment is near-free, the meta-level methodological judgment ("which experiment to run, what counts as evidence, what research design actually discriminates between hypotheses") appreciates. Second, to meta-science: studying "how research itself is being rewritten by AI." RES 03's Nature bibliometrics is a product of exactly this retreat. Neither is "doing it faster"; both raise the judgment node one level.
The generation layer's conservative bias (nailed once more: acceleration ≠ progress): RES 02's question-clustering and RES 03's topical contraction and novelty-penalty point to one thing — the generation layer accelerates by default toward the "safe, data-rich, paradigm-consistent" direction. It makes science run faster while possibly running narrower. Owning the value of a research direction is exactly how to resist this bias: let "worth" be set by humans (within some value frame), not by "nearest to the known." This is why SHEET 08 must hand value accountability to the organization's governance structure — a value judgment with no owner gets quietly replaced by the generation layer's default bias.
Hand-off anchor → Org (who owns direction) / the human through-line
The value accountability of a research direction, once inside an organization, is governance: who judges, who vouches, who owns the direction. See the Organization reading entry. Returning researchers to "questions worth asking" is the same human through-line as the Org volume's "put people back at the center."
The fate of useless-wood: efficiency eats redundant exploration by default
"Who owns the direction" is not an abstract governance question; it has a very concrete failure mode: redundant exploration space quietly eaten by efficiency. Multiple sources converge on one thing — what AI amplifies is exploitation (refinement, efficiency, execution), not exploration. Organizational structure tilts toward exploitation by nature, because it is predictable, measurable, fast-feedback; every AI rollout emits a clean "progress" signal (CFO-friendly), while exploration's story is fuzzy, demands imagination, and pays off unmeasurably. March's 1991 frame [R15]is still the base: exploration (search / variation / risk) and exploitation (refinement / selection / efficiency) compete for one budget, and exploitation tends to win. The subtler mechanism: "freed capacity does not automatically become slack" — saved hours usually get reallocated to more of the same (more volume), not to something different. "What gets measured gets managed; what cannot be measured gets cut" — slack, being unmeasurable, is cut first. This is why "owning the value of a research direction" must be an owned, deliberately protected governance act and cannot be expected to survive on its own.
Freed capacity does not automatically become slack — and this mechanism is governance's true enemy. Many organizations assume "AI saved us time, so naturally we now have spare capacity to explore," which is the most common misjudgment. Capacity that technology frees gets reallocated by default to more of the same, not turned into slack available for free exploration — because slack is unmeasurable while "X% more papers" is measurable. "What gets measured gets managed; what cannot be measured gets cut" — under this law, the moment freed hours appear they are absorbed by measurable output targets, and an exploration budget never even forms. So "protecting exploration" cannot be expected to emerge naturally as a by-product of time-saving; it must be a governance decision that is explicitly reserved, deliberately protected, and owned by someone accountable for its long-term return. This closes RES 07's two halves: owning value (holding "which direction is worth it") and protecting useless-wood (holding "a budget for unmeasurable exploration") are two faces of one governance act — both resisting the gravity by which "value / slack, being unmeasurable, is eaten by the default bias."
A value judgment with no owner gets quietly replaced by the default bias. "Who owns the direction" is a governance question, not a technical one, because of an easily overlooked dynamic: a value judgment does not stay in a vacuum. Either it has an owner who exercises it consciously, or the generation layer's default bias quietly fills it. There is no middle state. When a research organization leaves "who judges whether this direction is worth chasing" unspecified, the judgment does not vanish; it is handed by default to "which direction has data, a benchmark, a steady yield," exactly the conservative bias toward the known that RES 02/03/08 keep naming. So the organization believes it is "neutrally following the data" while in fact, with no one accountable, the generation layer's structural bias has made the direction choice for it. The whole point of governance is to take that judgment back from the default bias and give it to a named person accountable for the long-term consequences. This is why SHEET 08 must explicitly hand research-direction value accountability to the organization's governance structure: not to add an approval seat, but to plug the silent leak of "value judgment captured by the bias."
But useless-wood can be protected — this is conditional, not fated. "Replacing humans with tokens = exploitation; augmenting humans with tokens = exploration" — the same AI capability, dropped into different incentive structures, ends opposite ways. The concrete acts of protecting redundant exploration are governance acts: stand up independent exploration units, appraise on "learning / novelty" rather than "output," explicitly budget for "seemingly useless" directions (the historical evidence of Bell Labs, Xerox PARC, the early Cambridge LMB all points to "small teams + institutional protection"). Wire this back to kernel ④: the human's return to meaning, on the research face, lands concretely as freeing researchers from the exploitation treadmill of "more papers" and returning them to the unmeasurable questions that might redraw the map. Cheaper execution is never the point in itself; what we want is to let people ask the question worth asking.
FIG. 8.0 / THE COLLABORATOR–JUDGE BOUNDARY: AI AS COLLABORATOR (THE GENERATING SIDE), HUMAN AS JUDGE (THE VOUCHING SIDE)
Read: one vertical line splits research's acts into two sides. Left is AI as collaborator (generate, execute, retrieve, draft), all on the machine-checkable face of "is-it-correct / does-it-exist," collapsing toward abundance. Right is human as judge (vouch for credibility, select the worth-chasing problem, own the direction, decide which truth is worth knowing), all on the un-checkable face. This line is not a division of tools but of accountability: a human signs only for what crosses it.
The whole volume's role-split collapses into one vertical line. On the left, AI is the collaborator: it generates, executes, retrieves, drafts, all on the machine-checkable face of "is-it-correct, does-it-exist," the face that collapses toward abundance. On the right, the human is the judge: vouching for credibility, selecting the worth-chasing problem, deciding which truth is worth knowing, owning the direction by name, all on the un-checkable face. This line is not a "who does which chores" division of tools but a division of accountability: a judgment that crosses it must be signed by a named person; otherwise it does not vanish, it is quietly taken over by the generation layer's default bias. Hold the line and the research loop still self-corrects; lose it and the loop degrades into a fast generator spinning free.
RES
08
FAILURE · HYPERNORMAL SCIENCE
Failure mode · The dark side of abundance
The most dangerous way to go wrong: acceleration pushes science narrower, not deeper
The first seven sheets all argue "the bottleneck moves to judgment." But one failure precedes the judgment node being handed back to humans: the generation layer itself carries a conservative bias. Once AI makes execution near-free, it accelerates by default toward the "safe, data-rich, paradigm-consistent" direction: predictive power rises while the ability to "pose new kinds of question" falls. This is hypernormal science: output looks explosive while the exploration space contracts. What this volume must guard against is not AI getting weaker, but how good it already is inside the paradigm.
Accurate prediction ≠ correct understanding: the solar-system model never grew "gravity"
Hypernormal science is most easily underestimated because it often disguises itself as "success." A model that predicts accurately looks like good science — but "predicting accurately" and "understanding correctly" are two things, and AI's objective rewards only the former. The cleanest example: a foundation model trained on 10 million simulated solar systems predicts planetary orbits to very high precision, yet never grows the concept of "gravity" in its internal representation — what it learned is a patchwork of statistical regularities that reproduce the observations, not the one law of mechanics that unifies all orbits. By the metric "predicts accurately," it is perfect; by "understands correctly," it understood nothing. This is hypernormal's disguise: when the only evaluation criterion left is "a higher score on the existing benchmark," a system that perfects in-paradigm prediction while never touching the level of description gets waved through, even celebrated as a paradigm-level breakthrough. Holding this distinction — a gain in predictive power is not a deepening of understanding — is the first cognitive defense against being fooled by hypernormal science. It also explains why the volume insists "output volume / prediction accuracy itself is not the metric": what should be asked is whether topical breadth widened and whether the level of description was challenged.
Force analysis · why this happens: machine learning works by "minimizing prediction error against pre-defined variables/labels." It is good at predicting current data but locked into the conceptual vocabulary of what it was trained on. The map metaphor is sharpest: drawing the London Underground at the city's full size with maximal detail is still the same kind of information; only when Beck (1933) threw away geographic accuracy and redrew the network as a circuit diagram did a paradigm appear, a re-schematization rather than more detail. AI can fill the blanks on the map (DeepMind's GNoME found 2.2 million new materials, the vast majority element-substitutions inside known structure types; ESM3 designed novel fluorescent proteins, filling gaps on the map, not drawing a new one), but it will not ask "is this whole level of description wrong?" William Farr's cholera map organized data around "air quality," and no AI, however clever, could infer "waterborne microbes," a variable no one had recorded; germ theory required changing the microscope, the instrument, the variable.
Hao, Xu, Li & Evans, "AI tools expand scientists' impact but contract science's focus," Nature 649(8099), 2026, DOI 10.1038/s41586-025-09922-y. An analysis of about 41.298 million papers gives this dark side its "already happened" form: AI-using scientists publish 3.02× as many papers, are cited 4.84× as often, and become project leaders 1.37 years earlier, yet science as a whole shows topical coverage contracting by 4.63%, scholar-to-scholar interaction down 22%, and rising citation concentration (Gini 0.754 vs 0.690), with knowledge breadth contracting consistently across more than 70% of subfields in six disciplines. The mechanism: AI clusters toward data-rich regions, automating existing fields rather than exploring new ones. [flag: selection effect] This is observational bibliometrics (an LLM classifier at F1=0.875 labels "AI-augmented" papers; AI users may already concentrate in hot fields, so this is correlation, not causation), but as a leading signal for "acceleration ≠ progress" it is hard enough.
The dark side happens before judgment is handed back to humans. The logic of the first seven sheets is "execution abundified → judgment retreats to humans," which sounds like a clean relay: the machine finishes execution, the human takes over judgment. But hypernormal science reveals a pre-failure — the dark side happens before judgment is handed back. The reason: the generation layer does not neutrally spread all candidates and wait for a human to pick; it already carries bias at the moment of generation — it preferentially produces "safe, data-rich, paradigm-consistent" candidates and pushes "change-the-frame, change-the-variable" candidates to the tail or never generates them. So when the human arrives to judge, the candidate set in front of them has already been quietly narrowed. The human believes they are "selecting the most worthy from all possibilities" while actually "selecting from a subset already filtered by the conservative bias." This is why the research loop cannot be read simply as "neutral generation, load-bearing judgment" — generation itself carries a value tilt, and that tilt runs toward the in-paradigm. The act of owning value must therefore move earlier: not only guarding at judgment time but, at generation time, actively demanding that the candidate set include paradigm-level options — otherwise the human guards a menu that has already been rigged.
The mechanism of homogenization: written into the weights, unrecoverable at inference
Hypernormal science is not just a soft "AI leans conservative" tendency; it has a hard mechanism written into the weights. The strongest causal anchor is Doshi & Hauser (Science Advances 2024) [R12]: give a writer LLM ideas and individual stories get more "creative," yet the stories grow more similar to each other — they explicitly call it a "social dilemma" (better individually, narrower collectively). Homogenization is a group-level effect (Anderson et al. 2024, a 36-person experiment) [R13]: it comes not from individual fixation but from the LLM suggesting similar ideas to different users. Harsher still is cross-model homogeneity ("We're Different, We're the Same," 2025) [R14]: controlling for structural variables, LLMs resemble each other far more than humans resemble each other — switching models does not save you. The decisive mechanism-level evidence: post-training diversity collapse is written into the weights and unrecoverable at inference (arXiv 2604.16027, three Olmo 3 lineages); recursively training on synthetic data causes model collapse with the distribution tails vanishing (Shumailov et al., Nature 2024). This mechanism makes "real human interaction data" an ever more precious resource — directly endorsing kernel ④'s "humans as the source of heterogeneity."
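One way to make "better individually, narrower collectively" observable inside a team is to track how similar a batch of generated drafts is to one another. The sketch below is only a rough illustration: it uses token-set Jaccard overlap as a stand-in similarity measure, which is an assumption of this example and not the measure used in the studies cited above. The function names are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two texts (1.0 = same vocabulary, 0.0 = disjoint)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty texts are trivially identical
    return len(sa & sb) / len(sa | sb)


def mean_pairwise_similarity(texts: list[str]) -> float:
    """Average similarity over all unordered pairs; higher means a more homogeneous batch."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0  # fewer than two texts: nothing to compare
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A group-level effect shows up as this number rising across a cohort even when each individual draft is rated higher. A real measurement would likely need a semantic similarity measure rather than raw token overlap.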
Real interaction data thus becomes an ever more precious anti-homogenization resource. This mechanism has an often-overlooked corollary: since diversity collapse is written into the weights, and recursively training on synthetic data makes the distribution tails vanish (model collapse), real human interaction data not mediated by AI becomes an ever scarcer, ever more precious resource. It is the distribution tail, the last reservoir of heterogeneity. This has a direct operational implication for research organizations: when everyone generates hypotheses, writes literature reviews, and conducts peer review with the same few models, the judgments and observations still produced independently by humans, not flattened to the model mean, are exactly what the organization should deliberately preserve rather than rush to "make efficient with AI." It also wires back to kernel ④'s "humans as the source of heterogeneity": humans are irreplaceable not because they are smarter than AI but because the human population carries the diversity AI's distribution tail has already lost. Holding this diversity takes deliberate force: keeping "non-AI-mediated" judgment nodes in the workflow, treasuring genuine human signal in the data, rewarding off-mean exploration in incentives. Together, these three are the organization-level acts that resist hypernormal science.
But it is a default gravity, not an iron law. Put the counter-evidence on the table honestly: homogenization varies by task / prompt / exposure mode, and in high-exposure dynamic experiments collective diversity can even rise. So the correct statement is not "AI inevitably homogenizes science" but "AI by default pulls research toward the mean and needs deliberate force to depart" (regression to a domain prototype). Open-ended / quality-diversity (QD) algorithms (novelty search, MAP-Elites, POET) show that machines too can produce heterogeneity, as long as the single objective function is abandoned. This tightens the thesis from "heterogeneity can only come from humans" into a sturdier version: the enemy of heterogeneity is the over-optimization of a single objective, not the machine itself. Humans define "what is worth differing on," and the machine generates diversity under that definition. This qualifier both holds the human's role and withstands the falsifier "AI will eventually learn creativity."
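To make "humans define what is worth differing on, the machine generates diversity under that definition" concrete, here is a minimal MAP-Elites sketch. It is a toy under stated assumptions: solutions are points in the unit square, the `descriptor` function (the human-chosen axis of difference) and the `fitness` function are hypothetical placeholders, and the mutation scale and bin count are arbitrary.

```python
import random


def map_elites(fitness, descriptor, n_bins=10, iterations=2000, seed=0):
    """Keep the best solution found in each descriptor bin, not one global best."""
    rng = random.Random(seed)
    archive = {}  # bin index -> (fitness, solution)

    def try_insert(x):
        # descriptor(x) is assumed to lie in [0, 1]; clamp 1.0 into the last bin.
        b = min(int(descriptor(x) * n_bins), n_bins - 1)
        f = fitness(x)
        if b not in archive or f > archive[b][0]:
            archive[b] = (f, x)

    # Seed the archive with random solutions in the unit square.
    for _ in range(20):
        try_insert((rng.random(), rng.random()))

    # Main loop: pick a random elite, mutate it, and let it compete in its own bin.
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))
        child = tuple(min(1.0, max(0.0, v + rng.gauss(0, 0.1))) for v in parent)
        try_insert(child)
    return archive


# Hypothetical setup: "differ along x[0]", "be good at keeping x[1] near 0.5".
archive = map_elites(
    fitness=lambda x: -(x[1] - 0.5) ** 2,
    descriptor=lambda x: x[0],
)
```

The design choice that matters is the archive: a single-objective optimizer would return one point, while this returns one elite per descriptor bin, so diversity is a product of the definition the human supplied.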
Counter-indicators · how to tell you are sliding into hypernormal
Leading counter-indicators: the portfolio's topical breadth shrinks while output rises; the share of "reframe / new-variable" contributions stays durably low; citations concentrate on a few hot nodes (rising Gini); teams chase only "a higher score on the existing benchmark." When these co-occur, the generation layer's conservative bias has already narrowed the options to in-paradigm ones: every metric is green while science gets narrower. [source: the Asimov essay is opinion/review, Ⅳ–Ⅴ; its cited empirics are traced and graded separately]
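Two of these indicators are directly computable, so a small sketch may help. The Gini coefficient below uses the standard sorted-rank formula; the `hypernormal_warning` helper and its dictionary keys (`topic_breadth`, `output`, `citations`) are hypothetical names introduced for illustration, and the check covers only the measurable indicators, not the share of reframing contributions.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(rank_i * x_i) / (n * total) - (n + 1) / n, ranks 1..n over sorted xs
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def hypernormal_warning(before, after):
    """True when breadth shrinks while output and citation concentration both rise."""
    return (
        after["topic_breadth"] < before["topic_breadth"]
        and after["output"] > before["output"]
        and gini(after["citations"]) > gini(before["citations"])
    )
```

A warning here is a prompt for a human look at the portfolio, not a verdict: the source is explicit that the co-occurrence is a leading signal, and how to measure topical breadth is left open in this sketch.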
RES
09
JUDGMENT · THE BELIEVABILITY LEDGER
Decision matrix · claim-by-claim
When generation is unbounded, every claim first crosses a believability ledger
RES 03 says the bottleneck moves to "judging credibility." This sheet turns that into a doable action: facing a batch of AI-generated claims, do not read each closely (bandwidth will not allow it) but triage by "evidence strength × distance from the established paradigm," spending human judgment only where the scale is genuinely tight. The two axes must not be merged: far from the paradigm does not mean not credible; on the contrary, it may be a paradigm-level reframing.
Evidence strength can be supplemented, paradigm distance must be judged: two axes for two acts
The deep reason the two axes cannot be merged is that they correspond to two acts of completely different nature. The "evidence strength" axis is supplementable and machine-checkable: if a claim's evidence is weak, the disposition is clear. Go get more evidence (run replications, find raw data, check the evidence chain's completeness); this is an act with a right answer, outsourceable to generation and graph rules. The "paradigm distance" axis must be judged by a human, and there is nothing to "supplement": a claim being far from the established paradigm is neither a defect nor a merit in itself, only a signal requiring constitutive judgment. A human must judge whether it is out-of-paradigm noise or a paradigm-level reframing. Merging the two into one "credibility score" stirs "a supplementable execution act" and "an un-outsourceable judgment act" into one pot, and then both are done badly: the evidence that should be sought is not sought (the score has already concluded for it), and what a human should judge is auto-judged (the score converts paradigm distance straight into low credibility). The whole point of booking them separately is to let each axis trigger its corresponding correct act: weak evidence → seek evidence (the left act), far from paradigm → a human judges (the right act). This is exactly RES 10's test, "what can be given a machine-checkable acceptance criterion goes left, what cannot goes right," applied at the single-claim level.
为什么"证据弱×范式远"这一格决定整台仪器的价值
Why the "weak × far" cell decides the whole instrument's value
Three of the four quadrants are intuitive: strong × in-paradigm → integrate; weak × in-paradigm → noise; strong × far → a human judges. What truly tests a research system's mettle is the fourth: weak evidence × far from paradigm. Intuition and a single credibility score both sentence it to death: "absurd and unsupported, delete." Yet nearly every paradigm shift in the history of science sat exactly in this cell at birth. Einstein's 1905 special relativity and Lorentz's ether contraction both merely fit the same data at first, and Einstein's version was both "far from paradigm" and lacking decisive experimental evidence at the time. Darwin's natural selection arrived without a working account of heredity, and the mechanism he later proposed for it (pangenesis, with its gemmules) proved wrong, yet the theory survived. Had a credibility-score-only system existed then, it would have judged all of these "weak × far → delete." So the correct disposition for this cell is not the binary "believe / disbelieve" but a third act: suspend, and go find the decisive evidence that separates "it is noise" from "it is a reframing." Whether an instrument is worth using turns entirely on its handling of this cell: make it "delete" and it is hypernormal science's automatic strangler; make it "suspend + targeted evidence-seeking" and it is a ledger that can actually catch a paradigm shift.
Why two axes, not a single "credibility score": collapsing credibility into one number walks straight into RES 03's structural trap — AI (and the reviewers it has shaped) use "distance from the existing-literature distribution" as the only proxy, so the further from the paradigm, the lower the score, miscoding the genuinely novel as "outlier / not credible." The ledger's fix is to book "evidence strength" and "paradigm distance" separately: weak evidence needs more evidence (machine-checkable), paradigm distance needs a human to judge whether it is noise or a reframing (constitutive). The four quadrants give four dispositions; follow them and you will not kill "new" as "wrong."
First drag "generation rate" to watch the integration deficit explode; then drop a claim onto two axes for a disposition verdict — tension two (peer review changes) plus tension three (the integration gap) made into one adjustable bench.
[Interactive instrument. Slider: generation rate relative to human integration bandwidth (shown at 10×), with bars for generated (GEN) versus integrable (DIGEST) claims. Matrix: X axis = evidence strength, Y axis = distance from the established paradigm. Quadrants: strong × far → human judges, maybe a reframing; weak × far → do not kill yet, seek evidence; strong × near → believe, integrate; weak × near → doubt, in-paradigm noise.]
关键反陷阱
The key anti-trap
最该警惕的格是"证据弱 × 范式远"——单一可信分会直接判它死,但范式级重构在诞生时证据必然薄(爱因斯坦 1905 与洛伦兹起初都只是拟合数据)。处置不是"信"或"不信",是"挂起,定向去找能区分它与噪声的关键证据"。把它当噪声删掉,正是 hypernormal science 的扼杀动作。
The cell to watch most is "weak evidence × far from paradigm": a single credibility score sentences it to death, yet a paradigm-level reframing is necessarily thin on evidence at birth (Einstein in 1905 and Lorentz both merely fit the data at first). The disposition is not "believe" or "disbelieve" but "suspend, and go find the decisive evidence that separates it from noise." Deleting it as noise is exactly hypernormal science's killing move.
The ledger does not score claims; it ranks how much human bandwidth to spend. The ledger is most easily misused as "compute a credibility score per claim and sort by score." That is exactly what it avoids. Its real product is not a score but a disposition — for each claim, answering "what to do with it next," not "how credible it is." The four quadrants give four dispositions, not four score tiers: strong × in-paradigm → integrate directly (no human needed); weak × in-paradigm → hold as noise (no human needed); strong × far-from-paradigm → a human judges whether it is a reframing (worth looking); weak × far-from-paradigm → suspend and go find discriminating evidence (most worth looking). The design's purpose is to free the human's scarce bandwidth from "reading each one closely" and spend it only on the two genuinely tight cells. When generation runs at thousands an hour, reading each is physically impossible; the ledger's value is precisely that it does the "which ones need no human at all" triage for you.
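The four dispositions above can be sketched as a small lookup. This is only an illustration under stated assumptions: the two boolean inputs and the names `Disposition`, `disposition`, and `needs_human` are hypothetical, and how each axis is actually measured is not specified here. What the sketch shows is that the function returns a next action and never computes a combined score.

```python
from enum import Enum


class Disposition(Enum):
    INTEGRATE = "integrate directly (no human needed)"
    HOLD_AS_NOISE = "hold as in-paradigm noise (no human needed)"
    HUMAN_JUDGES = "a human judges whether it is a reframing"
    SUSPEND_AND_SEEK = "suspend and seek discriminating evidence"


def disposition(evidence_strong: bool, far_from_paradigm: bool) -> Disposition:
    """Map the two axes to a next action; the axes are never merged into one number."""
    if far_from_paradigm:
        # Paradigm distance is not converted into low credibility:
        # both far cells are routed to a human, never to deletion.
        if evidence_strong:
            return Disposition.HUMAN_JUDGES
        return Disposition.SUSPEND_AND_SEEK
    if evidence_strong:
        return Disposition.INTEGRATE
    return Disposition.HOLD_AS_NOISE


def needs_human(d: Disposition) -> bool:
    """Only the two far-from-paradigm cells spend human bandwidth."""
    return d in (Disposition.HUMAN_JUDGES, Disposition.SUSPEND_AND_SEEK)
```

For instance, `disposition(evidence_strong=False, far_from_paradigm=True)` yields `Disposition.SUSPEND_AND_SEEK`, the cell a single score would have deleted.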
AI 当协作者,还是当裁判?一道必须先划的界
AI as collaborator, or as judge? A line you must draw first
天平回答"这条主张可不可信",但它背后压着一个更根本的问题:这一道判断,到底该不该让 AI 来当裁判?把 AI 当协作者(铺候选、查文献、跑实验、提反例)几乎总是安全的——它在执行端,输出还要过人的判断。把 AI 当裁判(让它定"哪个值得信、哪个值得发、哪个该资助")则是另一回事,因为裁判位是价值与可信度的归属位,一旦交出去,RES 03 的结构性偏置会从"建议"升级成"判决"。这道界不能拍脑袋划,要看三件事:判据能不能机检(能 → AI 可裁)、判断是否价值负载(是 → 留人)、判错的代价可不可逆(不可逆 → 留人)。一个干净的判据:当这道判断的"对"只能诉诸"对谁、在哪个价值框架下",AI 只能当协作者,不能当裁判。
The ledger answers "is this claim credible," but underneath it presses a more fundamental question: should AI be the judge of this decision at all? Using AI as a collaborator (spreading candidates, searching literature, running experiments, raising counter-examples) is almost always safe — it sits on the execution side and its output still passes through human judgment. Using AI as a judge (letting it set "which to believe, which to publish, which to fund") is another matter, because the judge's seat is the seat of value and credibility ownership, and once handed over, RES 03's structural bias is promoted from "suggestion" to "verdict." This line cannot be drawn by gut; it depends on three things: whether the criterion is machine-checkable (yes → AI may judge), whether the judgment is value-laden (yes → keep human), whether the cost of a wrong call is reversible (irreversible → keep human). A clean test: when the "right" of this judgment can only appeal to "for whom, under which value frame," AI can only be a collaborator, never a judge.
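The three checks above can be written down as a case-by-case test. The sketch below is a hedged illustration, not a prescribed procedure: the `Judgment` fields and the function names are hypothetical, and combining the three checks with a strict AND (any single failure keeps the judge's seat with a human) is an assumption about how they interact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    machine_checkable: bool  # can an acceptance criterion be written and checked by machine?
    value_laden: bool        # does "right" depend on for whom, under which value frame?
    reversible: bool         # can a wrong call be undone?


def ai_may_judge(j: Judgment) -> bool:
    """Assumed conservative rule: AI sits in the judge's seat only if all three checks pass."""
    return j.machine_checkable and not j.value_laden and j.reversible


def ai_role(j: Judgment) -> str:
    """AI stays a collaborator whenever it may not judge."""
    return "judge" if ai_may_judge(j) else "collaborator"
```

Under this reading, a machine-checkable but irreversible decision still returns `"collaborator"`.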
把界划错的两种症状:一种是把裁判位悄悄让出去——团队用"AI 评分高"代替"我读过、我担保",于是没有人真正为某条主张的可信度负责,偏置在无人察觉中累积(这正是 RES 07"价值判断没有归属就被默认偏置替换"的运行态)。另一种是把协作位也死死攥住——出于不信任,连"铺候选、查文献"这种纯执行都不敢交给 AI,于是没把执行充裕化,团队还困在工时瓶颈里。两种都丢杠杆:前者交出了不该交的判断,后者守住了不必守的执行。正确的姿势是把这两个位子分开:执行位尽量交,裁判位审慎留——而审慎与否,由上面那三条(可机检 / 价值负载 / 代价可逆)逐案判定。
Two symptoms of drawing the line wrong: one is quietly vacating the judge's seat — the team substitutes "AI scored it high" for "I read it, I vouch," so no one truly owns any claim's credibility and bias accrues unnoticed (exactly RES 07's "a value judgment with no owner gets replaced by the default bias," at runtime). The other is clutching even the collaborator's seat — out of distrust, the team will not even hand "spreading candidates, searching literature" (pure execution) to AI, so it never abundifies execution and stays stuck at the hours bottleneck. Both forfeit leverage: the first hands away judgment that should not be handed away, the second holds execution that need not be held. The right posture is to separate the two seats: hand the execution seat freely, keep the judge's seat with care — and the care is adjudicated case by case by the three above (machine-checkable / value-laden / cost reversible).
FIG. 9.0 / 可信度账本:一条主张如何沿证据级被晋升或卡住
THE BELIEVABILITY LEDGER: HOW ONE CLAIM IS PROMOTED, OR STALLED, ALONG THE EVIDENCE GRADES
看懂:横轴是证据级 Ⅴ→Ⅰ;主张从"假设"出发,每过一道闸晋一级;唯一让它"承重"的是 Ⅱ→Ⅰ 那道独立复现闸——过不了就停在"已发表未复现",不许当地基。
Read: the x-axis is evidence grade Ⅴ→Ⅰ; a claim starts as a hypothesis and gains a grade at each gate; the one thing that makes it "load-bearing" is the Ⅱ→Ⅰ independent-replication gate. Fail it and the claim parks at "published, unreplicated," not to be built on.
这张图把"逐条判可信"展开成时间维度:一条主张不是一锤定音地"可信/不可信",而是沿证据级被逐道闸晋升。AI 让左半段(Ⅴ–Ⅳ,假设与单点报告)的产出近乎免费,于是真正的瓶颈整体右移到唯一那道承重闸——Ⅱ→Ⅰ 的独立复现。过不了这道闸的主张不是被删,而是停在"已发表未复现",可以被引用、被讨论,但不许被当成地基往上盖。这正是 FIG 0.1 那道复现闸在单条主张尺度上的展开,也是 INSTRUMENT 10 双轴矩阵在"证据强度"那条轴上的纵深。
This figure unrolls "judging credibility claim-by-claim" along a time axis: a claim is not credible-or-not in one stroke; it is promoted through gates along the evidence grades. AI makes the left segment (Ⅴ–Ⅳ, hypotheses and single-point reports) near-free to produce, so the real bottleneck shifts wholesale to the right, onto the one load-bearing gate: the Ⅱ→Ⅰ independent replication. A claim that fails this gate is not deleted but parked at "published, unreplicated": citable, discussable, but not to be built upon as a foundation. This is FIG 0.1's replication gate unrolled at single-claim scale, and the depth of INSTRUMENT 10's two-axis matrix along its "evidence strength" axis.
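The gate-by-gate promotion in this figure can be sketched as a tiny state machine. This is an illustrative sketch only: the text names just the Ⅱ→Ⅰ gate (independent replication), so the other gate names in `GATES` are hypothetical placeholders, as are `Grade`, `promote`, and `load_bearing`.

```python
from enum import IntEnum


class Grade(IntEnum):
    V = 5    # hypothesis
    IV = 4   # single-point report
    III = 3  # intermediate grade (its label is not given in the text)
    II = 2   # published, unreplicated
    I = 1    # independently replicated: the only load-bearing grade


# Gate that must be passed to leave each grade.
# Only "independent_replication" comes from the text; the rest are assumed names.
GATES = {
    Grade.V: "reported",
    Grade.IV: "analysed",
    Grade.III: "published",
    Grade.II: "independent_replication",
}


def promote(grade: Grade, passed: set) -> Grade:
    """Advance one grade per passed gate, in order; stop (park) at the first gate not passed."""
    while grade in GATES and GATES[grade] in passed:
        grade = Grade(grade - 1)
    return grade


def load_bearing(grade: Grade) -> bool:
    """A claim may be built upon only after the II -> I replication gate."""
    return grade is Grade.I
```

A claim that has passed every gate except replication parks at `Grade.II`: it stays in the ledger and can be cited, but `load_bearing` returns `False`.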
RES 10
MATRIX · 范式内 / 范式级分诊
IN-PARADIGM / PARADIGM-LEVEL
决策矩阵 · 哪步交 AI / 哪步留人
Decision matrix · to AI / to human
同一个研究动作,范式内交给生成、范式级留给人
In-paradigm hands to generation, paradigm-level stays with the human — the same action, split
RES 02 把提问切成两层。这张把那一刀落到每个具体研究动作上:检索、提假设、设计实验、分析、判结论——每个都有"范式内"的一半(可充裕)和"范式级"的一半(稀缺)。看清这条分界线,就知道把人放在哪一格、上下文怎么从生成流回判断——一张照做的分诊表。
RES 02 cut questioning into two layers. This sheet lands that cut on each concrete research action: search, hypothesize, design experiments, analyze, judge conclusions — each has an "in-paradigm" half (abundifiable) and a "paradigm-level" half (scarce). See the dividing line and you know which cell to place the human in and how context flows from generation back to judgment — a triage table you can run.
一条可操作判据,同时回答三张 SHEET 的问题
One operational test answers the questions of three sheets at once
这张矩阵给出的不只是一张分诊表,更是一条贯穿全卷的可操作判据:能写出可机检验收标准的动作,归左格(交给生成或交给图谱规则);写不出、只能诉诸"对谁、在哪个价值框架下成立"的动作,归右格(留给人)。这条判据的价值在于它同时回答了三张不同 SHEET 各自的问题,让整卷的操作逻辑收敛成一句话。它回答 RES 04 的护栏该把关到哪——能机检的就交给证据库自动把关。它回答 RES 09 的天平该挂起什么——写不出验收标准、又离范式远的,正是该挂起去找证据的那一格。它回答 RES 13 的研究环该把人放在哪一步——人只接住右格的节点。三张 SHEET 看似各管一摊(护栏、天平、工作流),其实共用这同一条判据;记住它,你就不必背三套规则,只需对每个动作问一句:"这能写出可机检的验收标准吗?"答案就同时告诉你该把它放进护栏、天平、还是人的判断里。
This matrix gives not just a triage table but an operational test running through the whole volume: an action for which you can write a machine-checkable acceptance criterion goes to the left cell (to generation or to graph rules); an action for which you cannot, and can only appeal to "for whom, under which value frame," goes to the right cell (to humans). The test's value is that it answers the distinct questions of three different sheets at once, converging the volume's operational logic into one line. It answers where RES 04's guardrail should gate — what is machine-checkable goes to the base's auto-gate. It answers what RES 09's ledger should suspend — what has no writable acceptance criterion and is far from paradigm is exactly the cell to suspend and seek evidence for. It answers which step RES 13's loop should place the human in — the human catches only the right-cell nodes. The three sheets seem to manage separate things (guardrail, ledger, workflow), yet share this one test; remember it and you need not memorize three rule sets, only ask of each action: "can I write a machine-checkable acceptance criterion for this?" The answer tells you at once whether it belongs in the guardrail, the ledger, or human judgment.
范式内 · 交给生成(可机检、向数据丰富区聚集)
In-paradigm · hand to generation (machine-checkable, clusters in data-rich regions)
Cross-sensory analogy: wire an idea to embodied intuition (the 16-year-old Einstein imagining riding a light beam, the "frozen wave" that felt physically wrong) — humans have cross-modal breadth.
判结论是否值得知:在稀疏、价值负载的域里定"哪个真相重要"——无对错、只有归属。
Judge whether a conclusion is worth knowing: in sparse, value-laden domains, set "which truth matters" — no right answer, only belonging.
判新颖是重构还是噪声:抵抗"离范式远 = 不可信"的结构性偏置(接 RES 09 天平的关键格)。
Judge whether novelty is reframing or noise: resist the "far from paradigm = not credible" structural bias (see RES 09's key cell).
上下文怎么流(照做的接法):生成侧把检索/假设/实验/分析的产物,全部落进 RES 04 的可追溯证据库(每条主张挂证据边);人侧只从证据库取那些"范式级"必须人判的节点——天平挂起的、范式距离远的、价值负载的。关键不是"先生成后审",是让证据库做范式内的自动把关,把人的带宽省给范式级。一句操作判据:能写出可机检验收标准的,归左格;写不出、只能诉诸"对谁、在哪个价值框架下成立"的,归右格。
How context flows (a wiring you can follow): the generation side drops the outputs of search, hypothesis, experiment, and analysis into RES 04's traceable evidence base, each claim carrying its evidence edges; the human side draws from that base only the "paradigm-level" nodes that require human judgment: those the ledger suspended, those far from the paradigm, those that are value-laden. The point is not "generate then review" but letting the evidence base auto-gate in-paradigm work so that human bandwidth is saved for the paradigm level. One operational test: if you can write a machine-checkable acceptance criterion, it goes left; if you cannot, and can only appeal to "for whom, under which value frame," it goes right.
有效 / 失效的信号
Right vs wrong signals
有效:人均工时在范式内动作上持续下降、在范式级判断上持续上升;左格产物一次落进可追溯链的比例升。失效:把范式级动作硬塞进左格("让 AI 决定该往哪推进研究"),或把范式内动作留在右格(人还在手动追引文)——前者是把价值判断交出去,后者是没把执行充裕化,两头都丢了杠杆。
Right: human-hours on in-paradigm actions keep falling while hours on paradigm-level judgment keep rising; the share of left-cell outputs landing in the traceable chain on first pass rises. Wrong: forcing paradigm-level actions into the left cell ("let AI decide where to push the research"), or leaving in-paradigm actions in the right cell (humans still tracing citations by hand). The first hands value judgment away, the second fails to abundify execution; both forfeit the leverage.
同一个动作词,左右两半的含义完全不同
The same action verb means entirely different things on its two halves
This matrix is most easily misread as "some actions go to AI, some stay with humans." It is not. The load-bearing insight is: the same action verb refers to two different things in the left cell and the right. "Hypothesize" in the left cell means "find the next checkable gap inside an existing frame" (nearest-neighbor search, AI's strength); in the right cell it means "pose a hypothesis that could not even hold in the old frame" (change the level of description, outside AI's training distribution). "Analyze" in the left means "fit predefined variables" (AI Feynman re-discovering known equations); in the right it means "judge whether this very set of variables is wrong" (Farr's air quality → waterborne microbes). So triage cuts not by "action type" but by "where this cut lands — the machine-checkable side or the constitutive side." An action verb usually straddles both cells — only by splitting its two halves do you know which half goes to generation and which half a human must catch.
上下文怎么从生成流回判断(接法的细节):关键不是"先生成、后人审"这种朴素流水线,而是让 RES 04 的证据库做范式内的自动把关——左格产物挂证据边自动入库、自动查冲突、自动拦截无来源,人完全不必经手;人只从证据库取那些被自动逻辑挂起的节点:天平判为"证据弱 × 范式远"的、与既有库强冲突的、价值负载无可机检判据的。这样人的带宽不是被均匀摊在所有产物上,而是集中投在矩阵右格。一句可操作判据反复用:能写出可机检验收标准的,归左;写不出、只能诉诸"对谁、在哪个价值框架下成立"的,归右。这条判据同时回答了 RES 04 的护栏该把关到哪、RES 09 的天平该挂起什么、RES 13 的研究环该把人放在哪一步。
两个对称的误用:交出不该交的、攥住不必攥的。这张分诊矩阵真正的用处,是同时防住两个方向相反、却同样致命的错误。错误一:把范式级动作硬塞进左格。典型症状是"让 AI 决定该往哪个方向推进研究"——把"选方向"这个最右端的构成性判断当成可生成的执行交出去。结果 RES 06 的价值判断、RES 12 的方向选择被默认偏置接管,组织在指标全绿中越走越窄。错误二:把范式内动作死死攥在右格。典型症状是人还在手动追引文、手动跑参数扫描、手动做本可自动的文献综合——出于不信任或惯性,把早已可充裕化的执行死守在人手里。结果是没把执行充裕化,团队仍困在工时瓶颈,②省不出工时投回③④。两个错误是对称的:前者交出了不该交的判断,后者守住了不必守的执行;前者丢的是方向,后者丢的是杠杆。矩阵的价值,就在于它给每个动作一个明确的归格判据,让这两类错误都无处藏身。
Two symmetric ways to go wrong: hand away what you should not, clutch what you need not. The real use of this triage matrix is to guard simultaneously against two opposite yet equally fatal errors. Error one: forcing paradigm-level actions into the left cell. The classic symptom is "let AI decide which direction to push the research" — handing the rightmost constitutive judgment "set direction" away as generatable execution. The result: RES 06's value judgment and RES 12's direction selection get captured by the default bias, and the organization walks ever narrower with every metric green. Error two: clutching in-paradigm actions in the right cell. The classic symptom is humans still tracing citations by hand, running parameter sweeps by hand, doing literature synthesis by hand that could be automated — out of distrust or inertia, holding long-abundifiable execution in human hands. The result: execution is never abundified, the team stays stuck at the hours bottleneck, and ② frees no hours to reinvest into ③④. The two errors are symmetric: the first hands away judgment it should not, the second holds execution it need not; the first loses direction, the second loses leverage. The matrix's value is that it gives each action a clear placement test, leaving neither error a place to hide.
How context flows from generation back to judgment (the connecting detail): the point is not the naive pipeline "generate first, human-review after," but letting RES 04's evidence base auto-gate the in-paradigm — left-cell outputs land carrying evidence edges, auto-check conflicts, auto-block the sourceless, with no human touch at all; the human draws from the base only the nodes the automatic logic suspended: those the ledger judged "weak evidence × far from paradigm," those strongly conflicting with the base, those value-laden with no machine-checkable criterion. This way human bandwidth is not spread evenly over all outputs but concentrated on the matrix's right cells. One operational test, used repeatedly: if you can write a machine-checkable acceptance criterion, it goes left; if you cannot, and can only appeal to "for whom, under which value frame," it goes right. This single test simultaneously answers where RES 04's guardrail should gate, what RES 09's ledger should suspend, and which step RES 13's loop should place the human in.
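The flow just described (land with evidence edges, auto-check conflicts, auto-block the sourceless, surface only suspended nodes to a human) can be sketched as follows. Everything here is a hypothetical illustration: the `Claim` fields, the three return labels, and the ordering of the checks are assumptions, and RES 04's actual evidence base is not specified at this level of detail.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    evidence_edges: list = field(default_factory=list)  # identifiers of supporting sources
    conflicts_strongly: bool = False  # strongly conflicts with the existing base
    weak_and_far: bool = False        # the ledger's "weak evidence x far from paradigm" cell
    value_laden: bool = False         # no machine-checkable criterion can be written


def gate(claim: Claim) -> str:
    """Auto-gate one claim; only 'suspended' claims ever reach a human."""
    if not claim.evidence_edges:
        return "blocked"     # sourceless claims never enter the base (assumed to be checked first)
    if claim.conflicts_strongly or claim.weak_and_far or claim.value_laden:
        return "suspended"   # held in the base for human judgment
    return "integrated"      # in-paradigm and checkable: no human touch


def human_queue(claims: list) -> list:
    """The nodes a human draws from the base: the suspended ones only."""
    return [c for c in claims if gate(c) == "suspended"]
```

The design choice the sketch makes visible is that human bandwidth is proportional to the length of `human_queue`, not to the number of generated claims.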
RES 11
BOUNDARY · 护栏的反面
THE GUARDRAIL'S BLIND SPOT
边界 · 知识图谱也会锁死
Boundary · the graph can lock you in
同一道护栏,用错就把描述层级锁死
The same guardrail, misused, locks in the level of description
RES 04 说知识图谱是研究生成的规格、是好护栏。但护栏有反面:若它只组织"已记录的变量",就会复制 Farr 霍乱图的盲区——把整套生成固定在"空气质量"这一层,再多的可追溯、可证伪都推不出"水传播微生物"。护栏让范式内更可信,却可能让范式级更不可能。这一张专讲怎么留口子。
RES 04 says the knowledge graph is the spec for research generation, a good guardrail. But a guardrail has a flip side: if it only organizes "already-recorded variables," it replicates Farr's cholera blind spot — pinning all generation to the "air quality" level, where no amount of traceability or falsifiability could yield "waterborne microbes." The guardrail makes the in-paradigm more credible while possibly making the paradigm-level more impossible. This sheet is about leaving the door open.
schema 越自洽,盲区越隐形——这是它最危险的地方
The more self-consistent the schema, the more invisible the blind spot — its most dangerous trait
护栏变笼子有一个反直觉的加剧因素:schema 越完整、越自洽,它的盲区就越难被察觉。一个粗糙、漏洞百出的图谱反而经常报错、经常有东西塞不进去,这些"塞不进去"恰好提醒人"框架可能不对"。但一个精心设计、覆盖完整、查询高效的图谱,会让一切都顺滑地落进现有节点——它从不报错,因为它把所有观测都成功地解释进了现有变量空间。问题在于:"成功解释进现有空间"和"解释对了"是两回事。Farr 的霍乱图就是这种"完美自洽"的笼子:每一例死亡都被成功归因到空气质量的某个梯度,模型拟合得很好、预测也不差,整个系统平滑运转、毫无报错——正因如此,没人会怀疑"空气质量"这个描述层级本身错了。这就是为什么 RES 08 的覆盖广度反指标必须当例行体检:当一个系统所有局部指标都健康、却在系统性地看不见某一类东西时,唯一能发现盲区的办法,是主动去问"我们的图谱在结构上不可能表示什么",而不是等它报错——因为它永远不会报错。
Guardrail-becoming-cage has a counter-intuitive amplifier: the more complete and self-consistent the schema, the harder its blind spot is to notice. A crude, leaky graph errors often and frequently has things that will not fit, and those "will-not-fit" cases happen to warn humans that "the frame may be wrong." But a carefully designed, fully covering, efficiently querying graph lets everything slide smoothly into existing nodes — it never errors, because it successfully explains every observation into the existing variable space. The catch: "successfully explained into the existing space" and "explained correctly" are two different things. Farr's cholera map is exactly this "perfectly self-consistent" cage: every death was successfully attributed to some gradient of air quality, the model fit well, predictions were not bad, the whole system ran smoothly with no errors — and precisely because of that, no one would suspect that the "air quality" level of description was itself wrong. This is why RES 08's breadth counter-indicator must be a routine checkup: when a system's local metrics are all healthy yet it is systematically blind to a class of things, the only way to find the blind spot is to actively ask "what is our graph structurally incapable of representing," not to wait for it to error — because it never will.
Mechanism · why a guardrail becomes a cage: the knowledge graph structures "claim — evidence — source" on the premise that the types of those nodes and edges are already defined. Once the schema is fixed, generation can only recombine within the existing variable space — exactly what makes it efficient, and exactly its blind spot. Farr's data was organized around miasma; the schema had no "pathogen" node at all, so no query, however clever, could ask for a variable outside the schema. AI accelerating on this layer only drills the wrong "air quality" level deeper, more self-consistent, harder to overturn — the guardrail builds in-paradigm credibility into a paradigm-level wall.
Leave a "change the variable" door: the schema must be writable, not read-only — keep a human-initiated channel to propose new node/edge types, and periodically review whether the schema is still the right level of description.
Surface "anomalies no node explains": do not let observations that fail to enter the graph be silently dropped; these out-of-schema residuals are often the doorway to a new map (the cholera death-clusters that the schema could not place).
给护栏配一个"反护栏"复盘:定期问"我们的图谱在系统性地看不见什么"——把 RES 08 的覆盖广度反指标接进来当例行体检。
Pair the guardrail with an "anti-guardrail" review: regularly ask "what is our graph systematically blind to" — wiring RES 08's breadth counter-indicator in as a routine checkup.
证据锚 / 失败模式
Evidence anchor / failure mode
案例锚:William Farr 的霍乱地图把数据围绕"空气质量"组织,结构完整、可追溯,却推不出水传播——germ theory 要换显微镜、换变量(Asimov 案例,观点综述 Ⅳ–Ⅴ;历史史实部分另可回溯)。失败模式:把"主张全部落进可追溯链的比例"当唯一成功指标——它只度量范式内的整洁,完全测不到"图谱看不见的那一层"。护栏建得越漂亮,越要警惕它在帮你把错的层级修得无懈可击。
Evidence anchor: Farr's cholera map organized data around "air quality"; it was structurally complete and traceable, yet unable to yield waterborne transmission. Germ theory needed a new microscope and new variables (the Asimov case, opinion/review Ⅳ–Ⅴ; the historical facts can be traced separately). Failure mode: treating "share of claims landing in the traceable chain" as the only success metric. It measures only in-paradigm tidiness and cannot detect "the layer the graph cannot see." The more beautiful the guardrail, the more you must watch it perfecting the wrong level for you.
残差才是入口:把"图谱解释不了的"专门留住。一个健康研究系统与一个完美笼子的全部差别,落在它怎么对待残差——那些无法被任何现有节点解释、入不了图的观测。笼子的本能是把残差当噪声清扫掉,因为它们破坏了图的整洁、拉低了"主张落进可追溯链的比例"这个看起来很正的指标。但科学史一再表明:换地图的入口,几乎总是从残差进的。霍乱在 miasma schema 外的死亡聚集是残差;迈克尔逊-莫雷实验"测不到以太风"是残差;黑体辐射在经典理论下的"紫外灾难"是残差。每一个后来引发范式转移的,最初都是"现有框架解释不了、看上去像误差"的东西。所以护栏设计里最反直觉、也最承重的一条,是给残差专门建一个不被自动清扫的收容区,并定期人工复审:这些残差是测量误差,还是在集体指向一个 schema 外的新变量?这正是 RES 05 复现之墙"揭示盲区 → 造新 eval"在知识图谱上的具体落地,也是 RES 11 给护栏留的"换变量口子"的运行态。一个系统每清扫掉一批残差,就可能正在删掉它下一次范式转移的种子。
The residual is the doorway: deliberately keep "what the graph cannot explain." The entire difference between a healthy research system and a perfect cage lands on how it treats the residual — observations that no existing node can explain and that fail to enter the graph. The cage's instinct is to sweep residuals away as noise, because they spoil the graph's tidiness and lower the seemingly virtuous metric "share of claims landing in the traceable chain." But the history of science shows repeatedly: the doorway to a new map is almost always entered through the residual. The cholera death-clusters outside the miasma schema were a residual; Michelson–Morley's "no detectable ether wind" was a residual; black-body radiation's "ultraviolet catastrophe" under classical theory was a residual. Each thing that later triggered a paradigm shift began as something "the current frame could not explain, looking like error." So the most counter-intuitive and most load-bearing rule of guardrail design is to build a holding area for residuals that the auto-sweep does not clear, and review it manually on a schedule: are these residuals measurement error, or are they collectively pointing at a new out-of-schema variable? This is exactly how RES 05's wall "reveal a blind spot → build a new eval" lands on the knowledge graph, and the runtime form of RES 11's "change-the-variable door." Every batch of residuals a system sweeps away may be the seed of its next paradigm shift.
护栏与笼子是同一个东西在两种次序下的样子
The guardrail and the cage are one thing under two orderings
RES 04 说知识图谱是好护栏,这一张说它会变笼子——看起来矛盾,其实是同一个机制在两种使用次序下的两副面孔。当 schema 服务于生成、且对人保持可写时,它是护栏:让海量生成落进可追溯结构、显形冲突、可被复现。当 schema 反过来定义了什么算"研究"、且只读不可写时,它是笼子:生成只能在既有变量空间里组合,schema 外的观测被默默丢弃。决定它是护栏还是笼子的,不是图谱本身的精巧程度,恰恰相反——图谱越精巧、越自洽,它当笼子时就越严密。Farr 的霍乱图正是一个"完美的笼子":结构完整、数据可追溯、查询高效,唯独没有"病原体"这个节点,于是它把整个研究共同体钉在"空气质量"这一错误层级上,钉得无懈可击。
RES 04 says the knowledge graph is a good guardrail; this sheet says it becomes a cage — seemingly contradictory, but actually one mechanism wearing two faces under two orderings. When the schema serves generation and stays writable by humans, it is a guardrail: it lands mass generation in a traceable structure, surfaces conflicts, supports replication. When the schema instead defines what counts as "research" and is read-only, it is a cage: generation can only recombine within the existing variable space, and out-of-schema observations are silently discarded. What decides guardrail-or-cage is not the graph's sophistication — quite the opposite: the more sophisticated and self-consistent the graph, the tighter it is as a cage. Farr's cholera map is exactly a "perfect cage": structurally complete, data traceable, queries efficient, lacking only a "pathogen" node — and so it pinned the whole research community to the wrong "air quality" level, pinned it flawlessly.
留口子的具体做法(接 RES 05 的错误回流):给 schema 配一条"换变量通道"不是一句口号,它有可执行的形态——把那些无法被任何现有节点解释的观测专门收集起来(而不是当噪声丢弃),定期复审它们是否在暗示一个 schema 外的新变量。这与 RES 05 复现之墙的"揭示盲区 → 造新 eval"是同一条回流:撞墙的、入不了图的、被现有框架判为异常的残差,恰恰是换地图的入口。霍乱在 miasma schema 外的死亡聚集,就是那个被框架判为"无法解释"、却指向新变量的残差。一个健康的研究系统,必须给这类残差留一条不被自动清扫的通道,并定期问:"我们的图谱在系统性地看不见什么?"——把 RES 08 的覆盖广度反指标接进来当例行体检。
The concrete way to leave the door open (continuing RES 05's error feedback): giving the schema a "change-the-variable channel" is not a slogan; it has an executable form — specially collecting the observations that no existing node can explain (rather than discarding them as noise), and periodically reviewing whether they hint at a new out-of-schema variable. This is the same feedback as RES 05's wall "reveal a blind spot → build a new eval": the residuals that hit the wall, fail to enter the graph, are judged anomalous by the current frame, are precisely the doorway to a new map. The cholera death-clusters outside the miasma schema are exactly that residual — judged "unexplainable" by the frame yet pointing to a new variable. A healthy research system must keep a channel for such residuals that the auto-sweep does not clear, and periodically ask: "what is our graph systematically blind to?" — wiring RES 08's breadth counter-indicator in as a routine checkup.
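The "executable form" described above can be sketched as a graph that keeps a holding area for out-of-schema observations and a human-initiated channel for new node types. This is a minimal sketch under loud assumptions: `Graph`, `ingest`, and `propose_node_type` are hypothetical names, and an observation's `"type"` key is a simplified stand-in for "which node type could place this observation".

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    node_types: set                                 # the schema: node types the graph can represent
    nodes: list = field(default_factory=list)       # observations that entered the graph
    residuals: list = field(default_factory=list)   # holding area, never auto-swept

    def ingest(self, observation: dict) -> bool:
        """Place the observation if the schema can represent it; otherwise hold it."""
        kind = observation.get("type")
        if kind in self.node_types:
            self.nodes.append((kind, observation))
            return True
        self.residuals.append(observation)  # kept for review, not dropped as noise
        return False

    def propose_node_type(self, new_type: str, approved_by_human: bool) -> int:
        """Human-initiated schema change; returns how many held residuals it now explains."""
        if not approved_by_human:
            return 0
        self.node_types.add(new_type)
        waiting, self.residuals = self.residuals, []
        # Re-ingest: residuals the new type explains enter the graph, the rest are held again.
        return sum(self.ingest(obs) for obs in waiting)
```

With a schema of only `{"air_quality"}`, an observation typed `"pathogen"` lands in `residuals`; once a human approves `"pathogen"` as a node type, the same observation enters the graph instead of having been silently lost.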
FIG. 11.0 / 图谱盲区:引文/自动化流水线在结构上看不见什么
THE GRAPH BLIND SPOT: WHAT CITATION / AUTOMATION PIPELINES STRUCTURALLY CANNOT SEE
看懂:内圈是 schema 能表示的变量空间(生成在这里顺滑、可追溯、永不报错);圈外是 schema 无法表示的残差(被当噪声扫掉)。换地图的入口,几乎总在圈外。
Read: the inner ring is the variable space the schema can represent (generation here is smooth, traceable, never errors); outside the ring are the residuals the schema cannot represent (swept away as noise). The doorway to a new map is almost always outside the ring.
护栏和笼子是同一张图谱:箱内是 schema 能表示的变量空间,AI 在这里加速只会把现有层级钻得更自洽、更难推翻;箱外是它结构上无法表示的残差。Farr 把每一例霍乱死亡都成功归因到"空气质量"的某个梯度,系统平滑、永不报错——正因如此没人怀疑"空气质量"这层本身错了。盲区不是图谱不够好,恰恰是它太好:越完整自洽,越把人固定在错的描述层级上。唯一的出路不在圈内的任何查询,而在那道虚线门——人发起的"换变量"。这也是 RES 04 护栏命题的承重反面。
The guardrail and the cage are one graph: inside the ring is the variable space the schema can represent, where AI acceleration only drills the existing level deeper, making it more self-consistent and harder to overturn; outside is the residual it structurally cannot represent. Farr successfully attributed every cholera death to some gradient of "air quality," the system ran smoothly and never errored, and precisely because of that no one suspected the "air quality" level was itself wrong. The blind spot does not come from the graph being too poor; it comes from the graph being too good: the more complete and self-consistent, the harder it pins humans to the wrong level of description. The only way out lies in no in-ring query but in that dashed door: a human-initiated "change the variable." This is the load-bearing flip side of RES 04's guardrail thesis.
RES 12
FRONTIER · 元科学的模式生物
META-SCIENCE'S MODEL ORGANISM
前沿 · 把判断节点抬高一层
Frontier · raise the judgment node a level
当一阶研究近免费,人退守到设计科学本身
When first-order research is near-free, the human retreats to designing science itself
RES 06 把人退守到"何为值得知"(价值论)。但还有一条正交的退守线,本卷必须并置:退守到元科学——研究"什么规则让一个范式优于另一个、什么条件孕育范式转移"。这不是价值论,是方法论/制度论。当跑实验近乎免费,"该跑哪个实验、用什么算证据、什么研究设计能真正区分假设"反而升值。判断节点被抬高了一层。
RES 06 retreats the human to "what is worth knowing" (axiology). But there is an orthogonal retreat this volume must place alongside it: a retreat to meta-science — studying "what rules make one paradigm better than another, what conditions breed a paradigm shift." This is not axiology but methodology / the design of institutions. When running an experiment is near-free, "which experiment to run, what counts as evidence, what design actually discriminates between hypotheses" appreciates. The judgment node is raised a level.
正交退守,不是替代而是并置:RES 06 问"哪个真相值得"(价值),这里问"用什么机制才生得出颠覆性真相"(设计)。两者不矛盾——一个定方向,一个定怎样的科学制度能让方向被生出来。候选的形式判据:简单性(符号回归 AI Feynman 找全 100 条费曼方程 vs 旧软件 71 条;最小描述长度原理)、类比(好范式善作跨域有效类比)——但都不完备(J.J. Thomson 葡萄干布丁模型既简单又类比却全错)。诚实地说:我们还不知道"什么规则让范式更优",所以加速不会默认带来颠覆。这正是为什么元科学升值。
An orthogonal retreat — placed alongside, not replacing: RES 06 asks "which truth is worth it" (value); here we ask "by what mechanism do disruptive truths even get generated" (design). The two do not conflict — one sets direction, the other sets what scientific institution lets directions be born. Candidate formal criteria: simplicity (symbolic regression's AI Feynman recovered all 100 Feynman equations vs an older tool's 71; the minimum-description-length principle), analogy (good paradigms make effective cross-domain analogies) — but none is complete (J.J. Thomson's plum-pudding model was simple and analogical yet wholly wrong). Honestly: we do not yet know "what rule makes one paradigm better," so acceleration does not bring disruption by default. That is exactly why meta-science appreciates.
Djajadikerta,《Designing AI for Disruptive Science》, Asimov Press, 2026-03-23, DOI 10.62211/29ej-27et。提出"AI 科学家或许给元科学第一个模式生物":现实中无法对科研机构做对照实验,但可让 AI agent 种群在不同研究条件下并行运行、细测哪种条件出更多概念重组。历史锚(可另行回溯):Bell Labs、Xerox PARC、早期剑桥 LMB=受体制保护、能追"看上去无用"想法的小团队,与 AlphaZero 独立自博弈下出原创棋(21.Bg5)同构。〔本条为观点文论断,勿当数据;其转引实证各需回溯原始文献定级〕
Djajadikerta, "Designing AI for Disruptive Science," Asimov Press, 2026-03-23, DOI 10.62211/29ej-27et. It proposes that "the AI scientist may give meta-science its first model organism": one cannot run controlled experiments on research institutions in reality, but one can let populations of AI agents run in parallel under different research conditions and finely measure which conditions yield more conceptual recombination. Historical anchor (traceable separately): Bell Labs, Xerox PARC, the early Cambridge LMB — small teams, institutionally protected, free to chase "seemingly useless" ideas, isomorphic to AlphaZero's original move (21.Bg5) under independent self-play. [this is an essay's argument, not data; its cited empirics each need tracing to original sources for grading]
模式生物:第一次能对"科学制度"做对照实验
A model organism: for the first time, controlled experiments on "scientific institutions"
元科学长期是一门没有实验台的学问:你无法把同一个科学共同体复制十份、给每份换一套激励结构、再看哪份出更多颠覆——现实里只有一份,且跑得太慢、变量太多。AI 科学家可能第一次给元科学一个模式生物:让 AI agent 的种群在不同研究条件下并行运行——这群按产量考核、那群按新颖考核;这群层级森严、那群扁平自治;这群只准在范式内、那群被显式鼓励换框架——然后细测哪种条件下涌现更多概念重组。这是历史上第一次,"什么组织结构孕育颠覆"从一个只能靠案例(Bell Labs、Xerox PARC、剑桥 LMB 这些"小团队 + 体制保护"的轶事)回答的问题,变成一个可以做对照实验的问题。它和 AlphaZero 在独立自博弈下走出人类从未下过的原创棋(21.Bg5)是同构的:把系统从既有范式的训练数据里解放出来,让它在受保护的环境里自由探索。〔此为 Asimov 观点文的推演,Ⅴ 级;模式生物尚未被大规模实证,标为前沿命题〕
Meta-science has long been a discipline without a bench: you cannot copy one scientific community ten times, give each a different incentive structure, and see which breeds more disruption. In reality there is only one, running too slowly with too many variables. The AI scientist may give meta-science its first model organism: run populations of AI agents in parallel under different research conditions (this group appraised on output, that group on novelty; this group steeply hierarchical, that group flat and autonomous; this group confined to the paradigm, that group explicitly encouraged to switch frames), then finely measure which conditions yield more conceptual recombination. For the first time in history, "which org structure breeds disruption" turns from a question answerable only by cases (the "small teams + institutional protection" anecdotes of Bell Labs, Xerox PARC, the Cambridge LMB) into one open to controlled experiment. It is isomorphic to AlphaZero playing an original move no human had played (21.Bg5) under independent self-play: free the system from the training data of the existing paradigm and let it explore freely in a protected environment. [A projection from the Asimov essay, grade Ⅴ; the model organism is not yet broadly demonstrated, flagged as a frontier claim.]
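The contrasts named above can be laid out as an experimental design grid. The sketch below only enumerates conditions; it simulates nothing and implies no results. Crossing the three contrasts into a full factorial design is an assumption (the text lists the contrasts without saying how they are combined), and all identifiers are hypothetical. How "conceptual recombination" would be measured is left open.

```python
from itertools import product

# The three contrasts named in the text, as assumed two-level factors.
INCENTIVE = ("output", "novelty")
HIERARCHY = ("steep", "flat")
FRAME_POLICY = ("in_paradigm_only", "frame_switching_encouraged")


def conditions() -> list:
    """Enumerate every combination of the three factors (a full factorial design)."""
    return [
        {"incentive": i, "hierarchy": h, "frame_policy": f}
        for i, h, f in product(INCENTIVE, HIERARCHY, FRAME_POLICY)
    ]
```

Each dictionary would describe one agent population to run in parallel; with three two-level factors the grid has eight cells.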
两条退守线并置,不是二选一。这一卷有意把人退守的方向画成两条而非一条,因为只画"退守到价值(哪个真相值得)"会漏掉一半。第一条是 RES 06 的价值论退守:当提问被充裕,人退守到"定哪个真相重要"。第二条是这一张的方法论退守:当跑实验近乎免费,人退守到"设计科学本身——该跑哪个实验、用什么算证据、什么研究设计能真正区分假设"。两条退守不矛盾,是正交的:一条定方向(往哪走值得),一条定制度(怎样的科学机器能让值得的方向被生出来)。一个只退守价值、不退守方法的研究者,会知道该追什么真相,却没有能生出颠覆的科学制度;一个只退守方法、不退守价值的研究者,会有精良的方法机器,却不知道该把它指向何处。两条都退守,才补全了内核④"人回归意义"在研究面的全貌:意义既是方向的意义,也是制度的意义。
Two retreat lines, side by side, not either-or. This volume deliberately draws the human's retreat as two lines rather than one, because drawing only the "retreat to value (which truth is worth knowing)" misses half the picture. The first is RES 06's axiological retreat: as question-asking becomes abundant, the human retreats to "deciding which truth matters." The second is this sheet's methodological retreat: as running experiments goes near-free, the human retreats to "designing science itself: which experiment to run, what counts as evidence, what research design actually discriminates between hypotheses." The two retreats do not conflict; they are orthogonal. One sets direction (where it is worth going); the other sets the institution (what kind of scientific machine lets worthy directions be born). A researcher who retreats only to value and not to method knows which truth to chase but has no scientific institution that can breed disruption; one who retreats only to method and not to value has a finely built method-machine but does not know where to point it. Only retreating on both completes the picture of kernel ④'s "humans return to meaning" on the research surface: meaning is both the meaning of direction and the meaning of institution.
"模式生物"第一次让"什么组织结构/激励/层级出颠覆"可被实验——这是研究卷向组织卷交棒的一手素材(接 RES 13)。互联网先例的警示:它让知识可搜索,却因职业激励等更深的结构性低效,没在规模上带来更快科学,反而(在线期刊)收窄了引用。AI 可能在更大尺度重演此模式,除非把"为颠覆设计科学"当作刻意的研究计划。The "model organism" lets "which org structure / incentives / hierarchy breed disruption" be experimented on for the first time — first-hand material for the research-to-org hand-off (see RES 13). The internet's cautionary precedent: it made knowledge searchable yet, owing to deeper structural inefficiencies like career incentives, did not bring faster science at scale, and (online journals) narrowed citation instead. AI may re-run this at larger scale unless "designing science for disruption" is taken up as a deliberate research program.
我们还不知道"什么规则让范式更优"——所以加速不自动等于进步
We do not yet know "what rule makes a paradigm better" — so acceleration does not equal progress by default
Meta-science appreciates for a reason rooted in an uncomfortable admission: we still have no complete formal criterion for "what rule makes one paradigm better than another." The historical candidates are all incomplete. "Simplicity" is one: in symbolic regression, AI Feynman recovered all 100 Feynman equations (an older tool got only 71), and the minimum-description-length principle partly formalizes "simpler = more likely right"; but simplicity is no guarantee of truth. J.J. Thomson's plum-pudding atom was simple and elegant and wholly wrong. "Analogy" is another: good paradigms often draw effective analogies across unrelated domains (Einstein borrowing the image of light, Darwin borrowing from Lyell's geology and Malthus's economics); but analogy just as easily leads to a false isomorphism. Because we lack even a criterion for "what makes a paradigm better," accelerating execution ten-thousand-fold will not automatically yield better paradigms; it will only run faster within the current one. This is the deepest grounding of the thesis "acceleration ≠ progress," and the reason meta-science (the study of "what scientific institution breeds better paradigms") shifts from luxury to necessity.
自主性阶梯:方向选择是最后、也最难的一阶
The autonomy ladder: agenda selection is the last and hardest rung
把"元科学升值"画得更精确,要借 Anthropic 2026 的自主性阶梯。它把研究 agent 的能力从左到右排成一道梯子:执行良定义的实验(最左,已可匹敌人类)→ 设计实验 → 综合发现 → 选择研究议程(最右,最难)。这道梯子和 RES 02 的可验证性梯度同形——越往右,可机检的对错代理越稀薄,剩下的越是"无最近邻可循"的构成性判断。下面这张图把它和"生成充裕 vs 判断稀缺"叠在一起看:横轴是自主性从执行到方向,竖轴是该能力当前的"充裕度",曲线显示——生成侧(执行、设计实验)已逼近充裕饱和,判断侧(综合、选向)仍陡峭地稀缺。这正是命题的形状:能力沿梯子上移,但最右那阶——方向选择——的稀缺不随模型变强而消失。
To draw "meta-science appreciates" more precisely, borrow Anthropic's 2026 ladder of autonomy. It arrays a research agent's capabilities from left to right: execute a well-defined experiment (leftmost, already on par with humans) → design experiments → synthesize findings → select the research agenda (rightmost, hardest). This ladder is isomorphic to RES 02's verifiability gradient: the further right you go, the thinner the machine-checkable proxy for correctness, and the more what remains is constitutive judgment with "no nearest neighbor to follow." The figure below overlays it with "generation-abundant vs judgment-scarce": the x-axis is autonomy from execution to direction, the y-axis is that capability's current "abundance," and the curve shows that the generation side (execution, experiment design) nears abundance saturation while the judgment side (synthesis, direction) stays steeply scarce. That is the shape of the thesis: capability climbs the ladder, but the rightmost rung, agenda selection, does not lose its scarcity as models get stronger.
FIG. 12.0 / 自主性阶梯 × 充裕度:生成侧饱和,判断侧仍稀缺
AUTONOMY LADDER × ABUNDANCE: GENERATION SATURATES, JUDGMENT STAYS SCARCE
看懂:横轴从"执行"到"选方向",曲线是当前充裕度。左段已逼近天花板(AI 匹敌人类),右段陡降——方向选择是最后一阶。
Read: x-axis from "execute" to "select direction," the curve is current abundance. The left climbs near the ceiling (AI matches humans), the right drops steeply: agenda selection is the last rung.
这条曲线把整卷的命题压成一张图:AI 的能力沿自主性阶梯上移(左段已饱和),但"充裕度"在右段陡降——方向选择是最后、也最难自动化的一阶,因为它要的是 taste(判断哪些问题重要、哪些异常值得追、哪些诱人想法是死路),而 taste 的稀缺是结构性的、不随算力消失。元科学之所以升值,正因为它研究的恰是"怎样的科学制度能把右段那条曲线抬起来"。〔Anthropic RSI 阶梯为公司自述,Ⅳ–Ⅴ;曲线为示意,非测量数据〕
This curve compresses the whole volume's thesis into one figure: AI's capability climbs the autonomy ladder (the left has saturated), but "abundance" drops steeply on the right. Agenda selection is the last and hardest rung to automate, because what it needs is taste (judging which problems matter, which anomalies are worth chasing, which seductive ideas are dead ends), and taste's scarcity is structural, not dissolved by compute. Meta-science appreciates precisely because what it studies is "what scientific institution can lift that right-side curve." [the Anthropic RSI ladder is a company's own account, Ⅳ–Ⅴ; the curve is schematic, not measured data]
RES 13
CRITIQUE · 旧学术机器
THE OLD ACADEMIC MACHINE
结构批判 · 点名的失效件
Structural critique · named failing parts
旧学术机器的每个承重件,都是为"产出稀缺"调校的——而现在产出过剩
Every load-bearing part of the old academic machine was tuned for scarce output — and output is now in surplus
现代学术建制的五个核心装置——"不发表就出局"、用 h 指数与影响因子当代理、串行同行评审、经费周期的保守偏好、PI/课题组的金字塔——都不是天经地义,而是一套在"做研究很贵、发表很慢、产出天然稀缺"的世界里调出来的稳态。它们当年解决了真问题。但每一个都把"稀缺"焊进了自己的假设里:当执行近免费、生成无上限,这些装置不是失灵那么简单,而是被反向利用——同一个机制,原本拦低质、现在被高速生成器拿来批量制造指标表现。这一节逐个点名,给出失效机理(不是抱怨情绪),并标出它在 AI 充裕下具体怎么从"过滤器"退化成"放大器"。
The five core devices of the modern academic establishment — "publish or perish," using the h-index and the impact factor as proxies, serial peer review, the grant cycle's conservative bias, and the PI/lab pyramid — are not laws of nature but a steady state tuned for a world where research was expensive, publishing was slow, and output was naturally scarce. Each solved a real problem in its day. But each also welded "scarcity" into its own assumptions: when execution is near-free and generation is unbounded, these devices do not merely stop working — they get turned against their purpose, the very mechanism that once filtered low quality now harnessed by a fast generator to mass-game the metric. This section names each one, gives the failure mechanism (not a complaint), and marks exactly how it degrades from a filter into an amplifier under AI abundance.
为什么这五个装置会一起坏——它们共用一个被打穿的前提
Why all five break together — they share one assumption that has been punctured
把这五件单独看,像五个无关的毛病;放到一起看,它们其实共用同一个前提,而那个前提刚刚被 AI 打穿。这个前提是:"产出量"是诚实信号——写出一篇论文、攒够一批引用、跑完一个项目,都贵到足以证明背后有真实的智力投入,所以拿"量"当"质"的代理,误差可接受。整套建制的代理链都建在这块地基上:发表数代理生产力,引用数代理影响力,h 指数把两者打包代理"学者价值",影响因子代理"期刊质量",经费规模代理"研究重要性"。每一环都是"用一个便宜可数的量,代理一个昂贵难判的质"。当执行很贵,这条代理链误差有限——因为刷不动:你没法低成本地伪造一百篇看起来像样的论文。AI 恰恰把这件事变便宜了。代理链最怕的不是作弊者,是作弊的边际成本趋零:一旦"看起来像样的产出"可以近免费批量生成,所有以"量"为代理的指标同时失去鉴别力——这就是 Goodhart 定律的极端形态(当一个度量成为目标,它就不再是好度量),只不过 AI 把"成为目标后失效"的速度从数年压缩到数周。所以下面五件不是五个孤立故障,是同一块地基塌了之后,盖在上面的五个房间一起裂。
Seen one by one, the five look like five unrelated ailments; seen together, they share one assumption — and that assumption has just been punctured by AI. The assumption is: "volume of output" is an honest signal — writing a paper, accruing a batch of citations, finishing a project were each expensive enough to prove real intellectual investment behind them, so using "quantity" as a proxy for "quality" carried tolerable error. The whole establishment's proxy chain is built on this bedrock: publication count proxies productivity, citation count proxies influence, the h-index bundles both to proxy "a scholar's worth," the impact factor proxies "a journal's quality," grant size proxies "a study's importance." Every link is "a cheap countable quantity standing in for an expensive hard-to-judge quality." When execution was expensive, this proxy chain's error was bounded — because you could not game it: you could not cheaply fake a hundred plausible-looking papers. AI is precisely what made that cheap. A proxy chain's worst enemy is not the cheater but the marginal cost of cheating going to zero: once "plausible-looking output" can be mass-generated near-free, every metric that proxies via "quantity" loses its discriminating power at once — this is Goodhart's law in its extreme form (when a measure becomes a target it ceases to be a good measure), except AI compresses the "fails-once-targeted" timescale from years to weeks. So the five below are not five isolated faults but five rooms cracking together after the one foundation beneath them gave way.
FIG. 13.0 / 代理链:当伪造产出的边际成本趋零,每个代理同时反转
THE PROXY CHAIN: WHEN THE MARGINAL COST OF FAKING OUTPUT GOES TO ZERO, EVERY PROXY INVERTS
看懂:每一行是一个"便宜可数量 → 昂贵难判质"的代理;左列是装置,中列是它代理的东西,右列是 AI 把伪造成本压到零后它反转成的样子。整条链共用最底下那条"量=诚实信号"的地基。
Read: each row is one "cheap countable quantity → expensive hard-to-judge quality" proxy; the left column is the device, the middle what it proxies, the right what it inverts into once AI drives the cost of faking to zero. The whole chain rests on the bottom "quantity = honest signal" bedrock.
这张图的论点不是"指标不好",而是更精确的一句:每个指标的失效时刻,都是它"被瞄准的边际成本"跌破某个阈值的时刻。执行昂贵时,这些代理是有效过滤器;执行近免费时,同一个代理变成高速生成器的刷分通道。注意右列全是橙色——它们不是新毛病,是旧装置在新成本结构下的镜像反转。下面五小节逐件展开机理。
This figure's claim is not "metrics are bad" but something sharper: each metric's moment of failure is the moment its "marginal cost of being targeted" drops below some threshold. When execution was expensive, these proxies were effective filters; when execution is near-free, the same proxy becomes a fast generator's scoring channel. Note the right column is all orange: these are not new ailments but the mirror-inversion of old devices under a new cost structure. The five subsections below unfold each mechanism.
① "不发表就出局" + h 指数:把产量当价值,正中高速生成器的下怀
① "Publish or perish" + the h-index: counting output as worth — exactly what a fast generator wants
装置原意:"publish or perish" 与 h 指数〔Hirsch 2005,PNAS,R21,证据级 Ⅱ〕都想解决同一个治理难题——评委不可能读完每个候选人的全部工作,于是用"可数的产出"当价值代理:发得多、被引得多,大概率是个高产且有影响力的学者。在执行昂贵的世界里,这个代理误差有限,因为产量本身就是努力的证据。失效机理:h 指数把"高被引论文的数量"压成一个数,而这个数有两条都能刷的路径——多发(分母)、多被引(分子)。Goodhart 定律〔Strathern 1997 对 Goodhart 的转述,R22,证据级 Ⅳ〕说得很干:一旦这个数成为目标,学者就会优化这个数而非它本想代理的东西。切香肠式发表(salami-slicing,把一个研究拆成多篇最小可发表单元)、引用环(citation ring,互引刷分)、自引,都是对 h 指数的合理优化。AI 怎么把它推到极端:过去刷 h 指数受限于"写论文很贵";现在 LLM 能近免费地批量产出语法正确、格式齐全、看起来像样的稿件。当"看起来像研究的东西"可以批量生成,任何以产量为代理的指标都瞬间失去鉴别力——它本是用来挡"没干活的人",现在反而最便利"用机器刷量的人"。这就是 RES 03 那条"AI 用与既有分布的距离当唯一代理"在激励层的镜像:指标越是奖励"可数的像样产出",高速生成器越是它的最优解。
What the device meant: "publish or perish" and the h-index 〔Hirsch 2005, PNAS, R21, grade Ⅱ〕 both try to solve one governance problem: a committee cannot read every candidate's complete work, so it uses "countable output" as a proxy for worth. Prolific and highly cited probably means a productive, influential scholar. In an expensive-execution world this proxy's error was bounded, because volume was itself evidence of effort. The failure mechanism: the h-index compresses a publication record into one number (the largest h such that h papers have at least h citations each), and that number has two gameable paths: publish more papers, and get each paper cited more. Goodhart's law 〔Strathern 1997's restatement of Goodhart, R22, grade Ⅳ〕 puts it plainly: once the number becomes a target, scholars optimize the number, not what it was meant to proxy. Salami-slicing (splitting one study into several least-publishable units), citation rings (reciprocal citation), and self-citation are all rational optimizations of the h-index. How AI pushes it to the extreme: gaming the h-index used to be limited by the fact that papers are expensive to write; now an LLM can mass-produce, near-free, grammatically correct, fully formatted, plausible-looking manuscripts. When "things that look like research" can be batch-generated, any metric that proxies via volume instantly loses its discriminating power. It was built to keep out "people who did no work," yet now it most conveniences "people gaming volume with a machine." This is the incentive-layer mirror of RES 03's "AI uses distance-from-the-existing-distribution as the only proxy": the more a metric rewards "countable plausible output," the more a fast generator is its optimal solution.
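To make the salami-slicing mechanism concrete, here is a small worked example of the standard h-index definition (the largest h such that h papers have at least h citations each). The citation counts are invented for illustration; they are not data from any source cited here.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical record: three substantial studies.
whole_studies = [30, 24, 18]

# The same work sliced into nine least-publishable units.
# Assumption: the slices together draw the same total citations (72).
sliced = [10, 10, 10, 8, 8, 8, 6, 6, 6]

print(h_index(whole_studies))  # 3
print(h_index(sliced))         # 6
```

Under these assumed numbers the total citation count is unchanged, yet the h-index doubles, which is the sense in which slicing is a rational optimization of the number and not of the work behind it.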
② 影响因子:用期刊均值代理单篇质量,把"追热点"焊进激励
② The impact factor: proxying a single paper's quality by a journal mean, welding "hype-chasing" into incentives
装置原意:期刊影响因子(Garfield 1955 提出,后成 JCR 商业指标,R23,证据级 Ⅳ)本是给图书馆选订阅期刊用的——一个期刊近两年文章的平均被引数。它从未被设计来评单篇论文或单个学者,Garfield 本人多次警告过这种误用。失效机理:把"期刊均值"当"单篇质量"是统计学上的范畴错误:期刊被引分布极度长尾(少数文章贡献绝大多数引用),用均值代理任一篇的质量,误差大到没有意义。但因为影响因子可数、可排序、跨学科可比,它被招聘、评职、经费评审广泛采用,于是学者的理性反应是"往高影响因子期刊投",而高影响因子期刊系统性偏好新颖、热点、阳性结果——可靠但不性感的工作(复现、阴性结果、方法学订正)被结构性地挤出。这正是 RES 08 hypernormalization 的激励侧来源:不是有人想做窄,是指标在奖励"热"而非"对"。AI 怎么放大:当生成近免费,"追当前热点、批量产出符合高影响因子期刊口味的稿件"成为可自动化的策略。AI 最擅长的恰是"拟合已有分布"(RES 06),而影响因子奖励的正是"贴近当前热点分布"——两者一拍即合,把科学进一步推向"全都在追同一批热问题"的同质化深渊。可靠性与新颖性本是两件事,影响因子把它们混成一个"高被引=好"的单一信号,而 AI 让追逐这个信号的成本趋零。
What the device meant: the journal impact factor (proposed by Garfield in 1955, later a JCR commercial metric, R23, grade Ⅳ) was built for librarians choosing subscriptions: the mean number of citations to a journal's articles from the prior two years. It was never designed to judge a single paper or a single scholar, and Garfield himself repeatedly warned against this misuse. The failure mechanism: treating "a journal mean" as "a single paper's quality" is a statistical category error. Journal citation distributions are extremely long-tailed (a few articles supply most of the citations), so the mean says almost nothing about the quality of any one paper. But because the impact factor is countable, sortable, and comparable across disciplines, it was widely adopted in hiring, tenure, and grant review. A scholar's rational response is to "submit to high-IF journals," and high-IF journals systematically prefer novelty, hype, and positive results; reliable-but-unsexy work (replications, null results, methodological corrections) is structurally squeezed out. This is the incentive-side origin of RES 08's hypernormalization: it is not that anyone wants to go narrow; the metric rewards "hot," not "right." How AI amplifies it: when generation is near-free, "chase the current hot topic, mass-produce manuscripts to high-IF taste" becomes an automatable strategy. What AI does best is precisely "fit the existing distribution" (RES 06), and what the impact factor rewards is precisely "hug the current hot distribution." The two click together, pushing science deeper into the homogenized pit where "everyone chases the same hot questions." Reliability and novelty are two different things; the impact factor blends them into a single "highly-cited = good" signal, and AI drives the cost of chasing that signal toward zero.
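The category error can be seen in a few lines of arithmetic. The sketch below uses invented citation counts for ten articles in one hypothetical journal, chosen only to be long-tailed; it is an illustration, not real journal data.

```python
from statistics import mean, median

# Hypothetical two-year citation counts for ten articles in one journal.
# One article supplies most of the citations (long-tailed by construction).
citations = [0, 0, 1, 1, 2, 2, 3, 4, 7, 180]

journal_mean = mean(citations)      # what an impact-factor-style average reports
typical_paper = median(citations)   # what a typical article actually receives
below_mean = sum(c < journal_mean for c in citations)
top_share = max(citations) / sum(citations)

print(f"mean={journal_mean}, median={typical_paper}")
print(f"{below_mean} of {len(citations)} articles sit below the mean")
print(f"top article supplies {top_share:.0%} of all citations")
```

With these assumed numbers the mean is ten times the median and nine of the ten articles fall below it, so the journal-level average describes almost none of the individual papers.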
③ 同行评审:一道为"稿件稀缺"设计的串行闸,正被无限投稿淹没
③ Peer review: a serial gate designed for scarce manuscripts, now drowning in unbounded submissions
装置原意:同行评审是科学的质量承重墙——让领域同行在发表前把关,挡掉错误、夸大、不可靠的工作。它的整个吞吐量假设是"稿件以人类写作速度到达",评审以人类阅读速度处理,二者大致匹配。中位审稿周期常以月计〔Björk & Solomon 对审稿时长的实证,R24,证据级 Ⅱ〕,但在稿件稀缺时这个延迟可接受。失效机理:同行评审是一道串行闸:每篇稿件要占用 2–4 位领域专家各数小时,而合格评审人的总带宽是固定且稀缺的(还无偿)。这条闸的吞吐量天花板由人类专家数量决定,不由投稿量决定。当投稿量暴涨,闸不会变快,只会变出更长的队和更草率的评审。AI 怎么压垮它:生成端可以近免费地把投稿量乘以十倍、百倍——而评审端的人类专家带宽一点没变。这是 RES 05"剪刀差"在评审环节的具体爆发:生成无上限,审读带宽近恒定,缺口只会张大。更糟的是,AI 还能批量生成"看起来该认真审"的稿件,逼真到评审人必须投入真实时间才能判断真伪——于是稀缺的评审带宽被"鉴别 AI 垃圾"大量消耗。把同行评审当成能拦住一切的万能闸,在投稿无限时是物理上不可能的;唯一的出路是把"判可信"从"逐篇串行精读"改成 RES 09 那种"按证据强度×范式距离分诊、人只投到吃紧两格"的并行分流——但这要求重构激励,而不是让评审人加班。
What the device meant: peer review is science's load-bearing wall for quality. Domain peers gatekeep before publication, blocking errors, exaggeration, and unreliable work. Its whole throughput assumption is that manuscripts arrive at human writing speed and are processed at human reading speed, with the two roughly matched. Median review cycles are often counted in months 〔Björk & Solomon's empirics on review times, R24, grade Ⅱ〕, but when manuscripts were scarce that delay was tolerable. The failure mechanism: peer review is a serial gate. Each manuscript consumes 2–4 domain experts for several hours each, and the total bandwidth of qualified (and unpaid) reviewers is fixed and scarce. This gate's throughput ceiling is set by the number of human experts, not by submission volume. When submissions surge, the gate does not speed up; it produces only longer queues and sloppier reviews. How AI crushes it: the generation side can multiply submissions tenfold or a hundredfold at near-zero cost, while the review side's human-expert bandwidth has not budged. This is RES 05's "scissors gap" erupting at the review stage: generation is unbounded, reading bandwidth is near-constant, and the gap only widens. Worse, AI can mass-produce manuscripts that "look worth reviewing seriously," lifelike enough that a reviewer must spend real time to tell real from fake, so scarce review bandwidth is consumed identifying AI slop. Treating peer review as a universal gate that catches everything is physically impossible when submissions are unbounded. The only way out is to shift "judging credibility" from "serial close-reading of each manuscript" to RES 09's parallel triage ("sort by evidence strength × paradigm distance, spend the human only on the two tight cells"), and that requires reworking incentives, not making reviewers work overtime.
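The serial-gate argument is a simple queueing statement: if arrivals exceed a fixed service capacity, the backlog grows without bound. The sketch below is a deliberately crude model, assuming constant monthly arrivals and a constant monthly review capacity; all numbers are hypothetical.

```python
def backlog_over_time(arrivals_per_month, reviews_per_month, months):
    """Track the unreviewed backlog when review capacity is fixed."""
    backlog = 0
    history = []
    for _ in range(months):
        # Capacity does not respond to submission volume; the queue absorbs the gap.
        backlog = max(0, backlog + arrivals_per_month - reviews_per_month)
        history.append(backlog)
    return history


CAPACITY = 100  # hypothetical manuscripts the reviewer pool can handle per month

matched = backlog_over_time(100, CAPACITY, months=6)    # human-speed arrivals
flooded = backlog_over_time(1000, CAPACITY, months=6)   # tenfold arrivals

print(matched)  # [0, 0, 0, 0, 0, 0]
print(flooded)  # [900, 1800, 2700, 3600, 4500, 5400]
```

The model ignores review quality, desk rejection, and reviewer fatigue; it only shows that a fixed-capacity gate cannot catch up once arrivals outrun it.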
④ 经费周期:为"昂贵实验"调的保守偏好,正好杀死现在最廉价的探索
④ The grant cycle: a conservatism tuned for expensive experiments now kills the cheapest exploration
装置原意:经费评审的保守偏好不是恶意,而是理性的风险管理:当一个实验要花数百万、数年,评委有责任把钱投给"大概率成"的项目——要求充分的前期数据、清晰的可行性、与既有文献的连续性。这套机制在"执行昂贵"时是负责任的:你不能拿纳税人的钱去赌一个十有八九失败的疯点子。失效机理:但这套机制把"与既有范式的连续性"焊成了硬门槛——它系统性地偏好范式内的渐进工作,而把范式级的重构(按定义离既有文献远、前期数据必然薄,见 RES 09 那个"证据弱×范式远"格)挡在门外。这正是 RES 06/07 反复说的:真正的新颖在诞生时一定"看起来不靠谱",而经费机制把"看起来不靠谱"直接判死。March 的探索/利用框架〔R15,证据级 Ⅱ〕在制度层就是这条:利用(渐进、可预测)总在争资源时赢过探索(冒险、可能颗粒无收)。AI 怎么改变了账:充裕化恰恰把"探索"的成本压下来了——很多过去要数月数十万才能试的想法,现在能近免费地先跑一轮验证。这意味着经费机制的核心假设("探索很贵,所以要保守")正在失效:当探索变廉价,理性的探索配比应该上移(RES 07),保守偏好反而成了把最廉价的新颖机会拒之门外的结构性浪费。换句话说,经费周期是为"昂贵实验时代"调的旋钮,而 AI 把实验变便宜了,旋钮却没跟着拧——结果是制度在该放手探索时仍在收紧。贝尔实验室、PARC、剑桥 LMB〔R19〕之所以高产,正因为它们用制度性保护对冲了这种保守偏好。
What the device meant: the grant cycle's conservatism is not malice but rational risk management. When an experiment costs millions and takes years, reviewers have a duty to fund projects that are "likely to succeed," demanding ample preliminary data, clear feasibility, and continuity with the existing literature. This was responsible when execution was expensive: you cannot bet taxpayers' money on a wild idea that will fail nine times out of ten. The failure mechanism: this mechanism welds "continuity with the existing paradigm" into a hard threshold. It systematically prefers in-paradigm incremental work and bars paradigm-level reframings (by definition far from the existing literature and necessarily thin on preliminary data; see RES 09's "weak evidence × far paradigm" cell). This is exactly what RES 06/07 keep saying: genuine novelty necessarily "looks unreliable" at birth, and the grant mechanism sentences "looks unreliable" to death. March's explore/exploit frame 〔R15, grade Ⅱ〕 says the same at the institutional layer: exploitation (incremental, predictable) always beats exploration (risky, possibly yielding nothing) when they compete for resources. How AI changes the arithmetic: abundance is precisely what drives the cost of "exploration" down. Many ideas that once took months and hundreds of thousands in funding to try can now get a near-free first validation pass. This means the grant mechanism's core assumption ("exploration is expensive, so be conservative") is failing: when exploration becomes cheap, the rational explore-share should move up (RES 07), and the conservative bias becomes structural waste that bars the cheapest novelty opportunities. Put differently, the grant cycle is a dial tuned for "the era of expensive experiments"; AI made experiments cheap, but the dial was not turned with it, so the institution keeps tightening exactly when it should let exploration loose. Bell Labs, PARC, and the Cambridge LMB 〔R19〕 were prolific precisely because they used institutional protection to hedge this conservative bias.
⑤ PI/课题组金字塔 + "复现吃力不讨好":把承重验证器留在没人愿意干的位置
⑤ The PI/lab pyramid + "replication is thankless": leaving the load-bearing verifier where no one will do it
装置原意:PI(首席研究员)/课题组的金字塔结构,是为"训练 + 分工"设计的:资深 PI 定方向、拉经费、担署名责任,博士生博后做执行。这在执行昂贵时高效——执行是稀缺资源,把它集中在受训的年轻人手里、由经验把关方向,是合理分工。同时,这套激励把"原创新发现"放在金字塔顶端的奖励位,把"复现别人的工作"放在没有奖励的位置:复现拿不到经费、发不了高影响因子、不算原创贡献——它吃力不讨好(thankless),于是几乎没人做。失效机理:这制造了一个致命错配:RES 00/13 反复论证,独立复现是把"研究环"和"高速生成器"分开的唯一承重验证器(Open Science Collaboration 2015:97 项显著结果仅 36% 复现,R1;Baker 2016:逾 70% 科学家复现他人失败,R2)。也就是说,整个科学最关键的质量动作,恰好被激励结构放在了没人愿意干的位置。在执行昂贵时这个错配尚可忍受(反正复现也贵);AI 怎么把它从"可忍"变成"致命":当生成端近免费地把待验证主张乘以百倍,而验证端(复现)仍困在"吃力不讨好、没人做"的激励洼地,缺口直接爆炸——这正是 RES 05 剪刀差最尖锐的形态。讽刺的是,AI 本可以承担复现执行(跑代码、重算、交叉核对数据)的大部分,把复现从"吃力"里解放出来——但只要激励结构仍把复现放在没有奖励的位置,执行变便宜也没用:没人有动机去按那个按钮。这就是 RES 07 那条"省下的产能不会自动变成 slack"在复现上的具体形态:技术上能复现,不等于制度上有人去复现。修复点不在技术,在激励——必须把"担保可信"(复现、验证、整合)从金字塔底端的无奖励位,提到与"原创发现"同等的奖励位。这正是 RES 00 那句"科学社区的价值从产生知识转向担保可信"的制度含义。
What the device meant: the PI (principal investigator)/lab pyramid is built for "training + division of labor": a senior PI sets direction, raises funding, and bears authorship responsibility, while PhD students and postdocs do the execution. This was efficient when execution was expensive. Execution was the scarce resource, so concentrating it in the hands of trainees, with experience gatekeeping direction, was a sensible division of labor. At the same time, this incentive structure places "original new discovery" at the apex reward position and "replicating others' work" in a position with no reward: replication wins no grants, gets into no high-impact-factor journal, and counts as no original contribution. It is thankless, so almost no one does it. The failure mechanism: this creates a fatal mismatch. RES 00/13 argue repeatedly that independent replication is the one load-bearing verifier that separates the research loop from a fast generator (Open Science Collaboration 2015: only 36% of 97 significant results replicated, R1; Baker 2016: over 70% of surveyed scientists had tried and failed to reproduce another scientist's experiments, R2). That is, science's single most critical quality act is placed by the incentive structure exactly where no one will do it. When execution was expensive this mismatch was bearable (replication was expensive anyway). How AI turns it from "bearable" to "fatal": when the generation side multiplies claims-to-be-verified a hundredfold at near-zero cost while the verification side (replication) stays trapped in the thankless, no-one-does-it incentive sink, the gap simply explodes. This is RES 05's scissors gap in its sharpest form. The irony is that AI could shoulder most of replication's execution (run the code, recompute, cross-check data), freeing replication from its drudgery; but as long as the incentive structure keeps replication in a no-reward position, cheaper execution does not help: no one is motivated to press the button. This is RES 07's "freed capacity does not automatically become slack" in its concrete replication form: being technically able to replicate is not the same as someone institutionally doing the replication. The fix is not technical but incentive-side: "vouching for credibility" (replication, verification, integration) must be raised from the no-reward base of the pyramid to a reward position equal to "original discovery." This is exactly the institutional meaning of RES 00's "the value of the scientific community shifts from producing knowledge to vouching for credibility."
结构批判判语
The structural verdict
五个装置不是"过时",是被反向利用:每一个原本拦低质的过滤器,在伪造成本趋零后,都成了高速生成器的放大器。修复不在加更多指标(那只是给生成器更多刷分维度),而在换地基——把代理从"可数的产出量"换成"可担保的可信度":谁读过、谁复现过、谁为这条主张承重。这就是为什么本卷的承重动作是"担保可信"而非"产生知识"。
The five devices are not "outdated" but turned against their purpose: each filter that once blocked low quality became, once faking went near-free, an amplifier for a fast generator. The fix is not more metrics (that just gives the generator more dimensions to game) but a new bedrock: swapping the proxy from "countable output volume" to "vouchable credibility": who read it, who replicated it, who bears weight for this claim. This is why this volume's load-bearing act is "vouching for credibility," not "producing knowledge."
RES 14
CASES · 四个走完一遍的真实情形
FOUR CASES WALKED THROUGH
工件 · 把内核压到具体情形上
Artifact · the kernel pressed onto specifics
把这卷的判据,按在四个具体到能照做的研究情形上
The volume's tests, pressed onto four cases concrete enough to copy
The mechanisms and matrices above need grounding. This section walks four concrete cases: a question-triage (one branch stays in-paradigm and goes to AI, one is a reframe and stays human); a believability-ledger applied to one real claim (booking evidence strength and paradigm distance separately); a knowledge-graph-guardrail failure that locked in a level of description; and an "AI accelerated the field but narrowed it" homogenization. Each gives before/after, where the test lands, and what judgment stayed on the human side.
案例一 · 提问分诊:同一个材料发现项目,两支问题走向相反的两侧
Case 1 · Question-triage: in one materials-discovery project, two questions go to opposite sides
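情形:一个固态电解质材料组,手里有一个待解的问题包。AI 已能近免费地跑高通量筛选与性质预测(GNoME 类工作,R16),于是真正的瓶颈不在算,而在"哪个问题值得问"。组里把问题包摊开,逐条过 RES 10 的判据——能写出可机检验收标准的归左(交 AI),只能诉诸"换什么框架"的归右(留人)。结果两支问题走向了相反的两侧。
Situation: a solid-state electrolyte materials group holds a bundle of open questions. AI can already run high-throughput screening and property prediction near-free (GNoME-type work, R16), so the real bottleneck is not computation but "which question is worth asking." The group lays the bundle out and runs each item through RES 10's test: anything for which a machine-checkable acceptance criterion can be written goes left (to AI); anything that can only appeal to "which frame to switch to" goes right (stays human). The two branches ended up on opposite sides.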
范式内支 → 交 AI
In-paradigm branch → to AI
问题:"在已知的石榴石结构(garnet)框架内,哪种元素替换能把锂离子电导率再提一档?"
Question: "Within the known garnet structure frame, which element substitution lifts Li-ion conductivity another notch?"
为什么归左:验收标准可机检——电导率有明确测量口径,候选空间是"已知结构内的元素替换",AI 可批量生成候选+DFT 初筛。这是 R8/R16 反复证实 AI 擅长的"在已知框架内"动作。人只需定阈值、抽验复现。
Why left: the acceptance criterion is machine-checkable — conductivity has a defined measurement, the candidate space is "element substitution within a known structure," and AI can mass-generate candidates plus DFT pre-screening. This is the "within a known frame" act that R8/R16 repeatedly confirm AI excels at. The human only sets thresholds and spot-checks replication.
Question: "Are we asking the wrong variable? Maybe we should not search within the 'crystalline solid' frame at all, but ask whether ion transport in the 'amorphous/glassy' state is a different mechanism?"
Why right: this branch is not nearest-neighbor search within a known frame but changing the variable, changing the level of description — it questions the problem frame itself (RES 11). No machine-checkable acceptance criterion can be written: you cannot pre-define what "reframed correctly" looks like. AI here can only fit the existing distribution (it will drag you back to the crystalline state, since that is where the literature lives), so the direction and falsification conditions must be written by a human first. This branch later became the group's real breakthrough — but only because it was first rescued from the in-paradigm verdict of "low conductivity, not worth doing."
The judgment that stayed human: not "which answer is right" but "which branch is worth chasing with scarce human bandwidth." The triage's value is precisely that it did not delete the second branch as noise (a single credibility score would) but recognized it as the "weak × far" cell — suspend and seek targeted evidence, not reject. Had both branches gone to AI, AI would have efficiently churned on the first while the second (the real paradigm-level opportunity) would have been auto-downweighted for being "far from the literature distribution." This is the fork that FIG 14.0's decision tree visualizes.
FIG. 14.0 / 提问分诊决策树:一个问题如何被分到"交 AI"或"留人"
THE QUESTION-TRIAGE DECISION TREE: HOW A QUESTION IS ROUTED TO AI OR TO A HUMAN
看懂:从顶上一个问题进,过三道判:能写可机检验收标准吗?→在已知框架内吗?→判错代价可逆吗?三个"是"才落到左侧"交 AI";任一道"否"就分到右侧"留人写方向"。案例一的两支正好走了这棵树的两条路。
Read: a question enters at the top and passes three tests: can you write a machine-checkable acceptance criterion? → is it within a known frame? → is a wrong call reversible? Three yeses route it left to "to AI"; any no routes it right to "human writes direction." Case 1's two branches take the two paths of this tree.
决策树把 RES 10 的双层判据展成可照走的三道闸:可机检 → 已知框架 → 代价可逆。三个"是"才交 AI;任一"否"就留人。它和 INSTRUMENT 12(下面那台可拨的分诊器)是同一逻辑的静态版与交互版。关键在于:这棵树不是用来"自动判"的,右侧的每一支都明确写着"AI 当协作者不当裁判"——树本身只负责把问题路由到正确的判断者那里。
The decision tree unfolds RES 10's two-layer test into three walkable gates: machine-checkable → known frame → reversible cost. Three yeses go to AI; any no stays human. It and INSTRUMENT 12 (the adjustable triage decider below) are the static and interactive versions of the same logic. The key: this tree is not for "auto-judging." Every right-side branch explicitly says "AI as collaborator, not judge"; the tree only routes the question to the correct judge.
The tree, made adjustable: the decider below lets you answer "yes/no" to each gate for a concrete question, and gives a live routing verdict — to AI, or to a human, and why. Try dialing in Case 1's two branches and watch them reach opposite ends.
INSTRUMENT 12 · 提问分诊器 QUESTION-TRIAGE DECIDER
逐道答"是/否"。三道都"是"才把问题交给 AI 批量执行;任何一道"否",问题就留给人——先写方向与证伪条件,AI 只当协作者不当裁判。判据来自 RES 10 的双层分诊。
Answer "yes/no" to each gate. Three yeses route the question to AI for mass-execution; any one no keeps it human: write direction and falsification conditions first, with AI as collaborator, never judge. The tests come from RES 10's two-layer triage.
G1 · 能写出可机检的验收标准吗?(有明确测量口径、有标准答案)G1 · Can you write a machine-checkable acceptance criterion? (a defined measurement, a right answer)
G2 · 在已知框架内吗?(不是要换变量 / 换描述层级 / 换问题框架)G2 · Is it within a known frame? (not changing the variable / level / problem frame)
G3 · 判错的代价可逆吗?(不可逆 / 价值负载的高代价错判要留人)G3 · Is a wrong call reversible? (irreversible / value-laden high-cost errors stay human)
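The three gates above can be sketched as a small routing function. This is a minimal illustration of the logic as stated (three yeses go to AI, any no stays human), not the interactive instrument itself; the class and function names are hypothetical, and the G3 answers given for Case 1's two branches are assumptions, since the text does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    text: str
    machine_checkable: bool    # G1: a defined measurement, a right answer
    within_known_frame: bool   # G2: no change of variable / level / problem frame
    reversible_if_wrong: bool  # G3: a wrong call can be undone


def triage(q: Question) -> tuple[str, list[str]]:
    """Route a question: all three gates must pass before it goes to AI."""
    gates = [
        ("G1", q.machine_checkable),
        ("G2", q.within_known_frame),
        ("G3", q.reversible_if_wrong),
    ]
    failed = [name for name, passed in gates if not passed]
    if failed:
        # Any single "no" keeps the question with a human; report which gates failed.
        return "stays human", failed
    return "to AI", []


# Case 1's two branches. The G3 values (True) are assumed for illustration.
garnet = Question("Which substitution within garnet lifts Li-ion conductivity?", True, True, True)
glassy = Question("Is ion transport in the amorphous state a different mechanism?", False, False, True)

print(triage(garnet))  # ('to AI', [])
print(triage(glassy))  # ('stays human', ['G1', 'G2'])
```

Note that the function only routes; in line with the text, it returns which gates failed so that a human can write direction and falsification conditions, and it makes no attempt to judge the question itself.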
案例二 · 可信度天平:一条"AI 设计的新抗生素"主张,两条轴必须分开记账
Case 2 · The believability ledger: an "AI-designed new antibiotic" claim, two axes booked separately
情形:一个团队收到一条 AI 生成的主张——"模型在已知抗生素骨架外,设计出一类全新机理的候选分子,体外实验显示对耐药菌有效"。这条主张同时踩了天平的两条轴,而把它压成单一可信分会犯致命错误。逐轴记账:
X 轴 · 证据强度(可补、可机检)
X-axis · evidence strength (supplementable, machine-checkable)
There is only single-lab in-vitro data, no independent replication, no in-vivo validation. On the grade ladder (FIG 9.0) this sits between Ⅱ and Ⅲ: measured and published, but unreplicated. The disposition is clear: go get more evidence. This is the left-side act, one with a right answer, and it can be outsourced: independent labs replicate the in-vitro result, then push to in-vivo. Weak evidence is not a reason to kill the claim but a signal that triggers the "seek evidence" act.
Y 轴 · 范式距离(要人判、不可补)
Y-axis · paradigm distance (human-judged, not supplementable)
"A wholly new mechanism" = far from the paradigm of known antibiotic action. This is exactly the axis that must not be handed to a "credibility score": distance from the paradigm is itself neither defect nor merit, only a signal needing constitutive judgment. With a single score, the model converts "far from the known-mechanism distribution" straight into low credibility (RES 03's structural bias), miscoding a possible paradigm-level breakthrough as outlier noise. The correct disposition: a human judges whether it is mechanistic noise or a real reframing, and goes after the decisive evidence that separates them (e.g. structural-biology validation at the mechanism level).
Weak × far = INSTRUMENT 10's most dangerous fourth cell. A single score says "delete"; the ledger's disposition is suspend + targeted evidence — no press release, no foundation-laying, but never delete; instead spend scarce human bandwidth precisely on "independent replication + mechanism validation." This is exactly the cell Einstein-1905 and Darwin's natural selection occupied at birth (RES 09): thin evidence, far from the paradigm, yet deleting it forfeits the paradigm shift.
What happens if the axes merge: suppose "weak evidence (minus points)" and "far from paradigm (minus more)" are fused into one credibility score; this claim scores extremely low and is auto-binned as "not credible, delete." Then both things that should have been done separately are done wrong: the independent replication that should be sought is not sought (the score already concluded for it), and the mechanistic reframing a human should judge is auto-killed (the score converted paradigm-distance into low credibility). The whole point of booking separately is to let weak evidence trigger "seek evidence" and paradigm distance trigger "a human judges" — two acts of completely different nature, each in its place.
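The contrast between booking separately and merging can be sketched as two small functions. This is an illustration under stated assumptions: only the weak × far disposition follows the case text; the wording of the other three cells, the penalty values, and the threshold are invented placeholders, and all names (`Evidence`, `Distance`, `book_separately`, `merged_verdict`) are hypothetical.

```python
from enum import Enum


class Evidence(Enum):
    WEAK = "weak"
    STRONG = "strong"


class Distance(Enum):
    NEAR = "near"
    FAR = "far"


# One disposition per cell. Only the weak x far entry follows the case text;
# the other three wordings are illustrative assumptions.
DISPOSITIONS = {
    (Evidence.STRONG, Distance.NEAR): "integrate",
    (Evidence.WEAK, Distance.NEAR): "seek evidence",
    (Evidence.STRONG, Distance.FAR): "human judges the reframing",
    (Evidence.WEAK, Distance.FAR): "suspend; seek targeted evidence; human judges; never delete",
}


def book_separately(evidence: Evidence, distance: Distance) -> str:
    """Look up the disposition with the two axes kept apart."""
    return DISPOSITIONS[(evidence, distance)]


def merged_verdict(evidence: Evidence, distance: Distance, threshold: float = 0.3) -> str:
    """Anti-pattern: fuse both axes into one score (penalties are invented)."""
    score = 1.0
    if evidence is Evidence.WEAK:
        score -= 0.4  # weak evidence: minus points
    if distance is Distance.FAR:
        score -= 0.5  # far from the paradigm: minus more
    return "delete" if score < threshold else "keep"
```

For the antibiotic claim, `merged_verdict(Evidence.WEAK, Distance.FAR)` bins the claim for deletion, while `book_separately` returns two distinct acts: seek evidence, and hand the paradigm question to a human.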
Case 3 · A guardrail failure: a materia-medica knowledge graph locked "chemical constituent" as the only level of description
The graph drags every claim back onto the "constituent → target → pathway" ontology line — any explanation off this line, having "nowhere to attach" in the graph, is auto-judged untraceable, low-credibility, and filtered out. The guardrail did block hallucinations (good), but it also locked the level of description at the "reductionist chemistry" layer: any efficacy that cannot be reduced to "which molecule hits which target" cannot even be expressed in this graph, so it systematically vanishes from the candidate set.
Spot the lock · add an ontology layer
The real mechanism of that unexplainable efficacy lives at a different level of description: multi-constituent synergy / microbiome population-level modulation / network pharmacology (whole-system perturbation, not single-target). This is not solved by "adding more constituent data" (that is RES 11's one-to-one-map trap: maxing detail is still the same layer of information). The fix was a human paradigm-level judgment: add a "system/network" description layer to the graph ontology, making "whole-system perturbation" a traceable first-class citizen. After the layer was added, the locked-out class of mechanisms re-entered the candidate set, and one was later confirmed by independent replication.
The general form of this failure: the guardrail (knowledge graph) preserves "in-paradigm traceability" at the cost of freezing the level of description at the layer present when the graph was built. Its danger is precisely that it looks entirely correct — traceable, verifiable, efficient output, every reading green — while quietly excluding "switching the level of description," a paradigm-level act, from the space of possibilities (exactly RES 11's core warning). The correct use of a guardrail is not "build once, trust forever" but to periodically ask: has this graph's ontology locked some level of description as the only one? Who asks this, and who has the authority to add an ontology layer — this falls back to RES 07's governance question: the ontology's boundary is the boundary of the paradigm that can be expressed.
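Case 4 · Accelerated yet narrowed: a field that maxed out AI grew more prolific and more homogeneous in three years
Scenario: a computation-driven subfield (comparable to the pattern Hao et al. characterize in a bibliometric study spanning about 41.3 million papers, R9) has used AI to the full for writing, literature review, and idea generation from early on. Looking back three years later, every "reading" has improved: publications per researcher are up, individual citations are up, project cycles are shorter. By every metric of the old academic machine, this is a field advancing triumphantly. But pull the lens back to the field as a whole, and a set of opposite signals appears.
Individual readings · all green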
AI-using researchers see individual impact rise (one of R9's core findings): faster writing, fuller reviews, denser ideas. By individual KPIs, AI is pure gain. Every rational individual is doing the "self-optimal" thing — using AI to push output and impact up.
Over the same period the field's topic coverage contracts (R9 measured ~4.63% contraction) and scholar-to-scholar interaction falls. Doshi & Hauser (R12) give the causal mechanism: give writers LLM ideas and individuals get more creative, yet grow more similar — the authors call it a "social dilemma" (individually better, collectively narrower). Anderson et al. (R13) locate it further: not individual fixation but the LLM suggesting similar ideas to different users, a group-level effect. Switching models does not cure it (R14: controlling for structural variables, models resemble one another far more than humans do).
This is hypernormalization at runtime (RES 08): the field did not get worse, it got narrower — more efficient, more stable, every metric prettier, but the variance of exploration is collapsing, everyone on the same model, chasing the same hot topics, converging to the same mean. The most dangerous part is that it sets off no alarm: every individual reading is green, every metric of the old academic machine says "all is improving." It is the field-level consequence of RES 13's "metrics reward countable output, and AI is their optimal solution." What judgment should have stayed human but did not: the question "is this field narrowing" is asked by no individual KPI — it can only be asked actively by a human at the field level, and actively hedged by institutions (protect off-mean exploration, reward replication and anti-consensus work, leave survival space for unmeasurable slack, RES 07). Outsourcing this judgment too, to "automatic metric monitoring," is letting the very mechanism that is producing the homogenization diagnose the homogenization. To see the narrowing, someone must first be willing to look at the signals that will not make their own KPI prettier.
The single thread through four cases
All four cases perform the same act: spotting the cell where auto-judging would miscode something paradigm-level. Those cells are the reframe branch in the triage, the "weak × far" cell in the ledger, the locked description layer in the graph, and the unasked "is it narrowing" in the field. In each, the right act is not to judge faster but first to recognize that "this cannot be handed to a credibility score, metric, or guardrail to auto-judge," and then to spend scarce human judgment precisely there. This is what the volume's kernel looks like in specifics: execution may be abundant, but the judgment of "which truth is worth knowing" must have a named human bearing its weight.
RES
15
TEMPLATE
THE RESEARCH WORKFLOW
Copyable artifact · a loop you can run
The whole volume as a copyable loop: generate much, verify hard, integrate first
The mechanisms, matrices, and signals above all reduce to one runnable research loop. It is isomorphic to engineering's spec-driven loop (Specify → Plan → Execute → Verify → Integrate → Learn), but the research version's load-bearing parts are two: the knowledge graph comes first (the evidence base is the spec), and integration takes priority over retrieval. Copy it as a template and fill in for your field.
① FRAME
Stand up the base · write the criteria
Build the traceable evidence base (RES 04's four properties); write down explicit criteria for "worth believing / worth knowing." The base is the spec, so this step precedes generation.
② GENERATE
Parallelize in-paradigm actions
Search, hypothesis, experiment, and analysis go to generation (RES 10's left cell); each output carries evidence edges into the base. What does not enter the base does not count.
③ JUDGE
Use RES 09's ledger to triage each batch for believability: drop in-paradigm noise, integrate the believable, and suspend the paradigm-distant while seeking discriminating evidence; do not kill it as noise.
④ INTEGRATE
Synthesize across · not more retrieval
The human's scarce act (RES 05): stitch never-juxtaposed claims into new understanding. Watch the ratio of integration artifacts to raw output; do not let raw output pile into a mountain.
⑤ OWN VALUE
Set direction · leave the variable door open
Give "worth" an owner (RES 07/12); keep a human-initiated channel to "change the schema / change the variable" (RES 11), resisting the generation layer's conservative bias.
⑥ FEED BACK
Feed each "retracted / refuted / mistakenly-killed novelty" back as a new rule or node type in the base: errors become guardrails, so fewer recur next round. This step closes the loop.
→ real file: templates/research-loop.md
Where to reinvest the saved hours is this loop's easiest step to get wrong. This loop hides one decision that determines whether it yields progress or hypernormal: where the hours ② saves get reinvested. What happens by default RES 07 already covered — freed capacity does not become slack on its own; it gets reallocated to more of the same. Inside the research loop, that means spending the time ② saved on more papers, more experiments, more hypotheses, so the output curve steepens while the bandwidth for ③ judgment, ④ integration, ⑤ owning value gains nothing. This way of going wrong is deeply insidious, because on every local metric it looks like "progress": output up, impact up, the team looks more productive — exactly the situation of the "individual impact up" scientists in Hao et al.'s 41.3M study. The correct move is counter-intuitive: reinvest the hours ② saves explicitly and deliberately into ③④⑤, raising the share of researcher time on judgment/replication/integration rather than the volume of output. One operational test: if a team's output spikes after adopting AI but its share of time on judgment/integration is unchanged, it is not running this loop — it is running a faster hypernormal machine.
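The operational test can be written down as a check on two periods of team data. This is a minimal sketch, assuming hours are logged per activity; the `Period` record, the function name, and both thresholds (`output_spike`, `min_share_gain`) are hypothetical choices made for illustration, not values given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    """Team activity over one period, before or after adopting AI."""
    outputs: int                       # e.g. papers or experiment reports
    hours_total: float                 # all researcher hours in the period
    hours_judgment_integration: float  # hours on steps 3, 4, and 5

    @property
    def judgment_share(self) -> float:
        """Share of researcher time spent on judgment and integration."""
        if self.hours_total <= 0:
            raise ValueError("hours_total must be positive")
        return self.hours_judgment_integration / self.hours_total


def runs_faster_hypernormal_machine(
    before: Period,
    after: Period,
    output_spike: float = 1.5,     # assumed: "spike" means at least 1.5x output
    min_share_gain: float = 0.05,  # assumed: share must rise by 5 points to count
) -> bool:
    """True when output spiked but the judgment/integration share did not rise."""
    spiked = after.outputs >= output_spike * before.outputs
    share_rose = (after.judgment_share - before.judgment_share) >= min_share_gain
    return spiked and not share_rose
```

The check deliberately ignores impact and citation counts, since those are the local metrics the paragraph above describes as looking like progress either way.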
"Graph first" and "integration first" are this loop's two load-bearing joints. This loop is isomorphic to engineering's spec-driven loop (Specify → Plan → Execute → Verify → Integrate → Learn), but the research version has two deliberately tightened load-bearing joints that must not loosen when you copy it. The first is ① FRAME before ② GENERATE: stand up the traceable evidence base and write the "what is worth believing" criteria before opening generation. RES 04 argued the reason — the base is research's spec, not after-the-fact archiving; reverse the order and you get an un-integratable garbage mountain. Many teams copy the loop as "let the agent run wild first, manage it later," which is exactly missing this tightening. The second is ④ INTEGRATION before retrieval: when RES 02's generation pushes output toward the near-infinite, the loop's easiest congestion point is not generation but digestion. If a team spends the hours ② saved on more output rather than reinvesting into ④ integration, a rising barrier-lake forms between "generate" and "integrate" — output curve climbing steeply, understanding curve flat on its back. So the loop's true bottleneck valve is at ④, not ②.
Step ⑥ feed-back is what separates a "loop" from a "pipeline." This artifact is called a "loop," not a "pipeline," entirely because step ⑥ — feed-back — closes it. A pipeline is one-way: raw material in, product out, errors discarded as scrap. A loop has feedback: every "retracted, refuted, mistakenly-killed novelty" is not scrap but guardrail material for the next round. How does feed-back work concretely? A non-replicable claim feeds back as a new conflict-detection rule in the base; a residual the current schema judges anomalous yet that recurs feeds back as a new node type (RES 11's change-the-variable door); an incident of "deleting a paradigm-level reframing as noise" feeds back as an improved disposition for the ledger's "weak × far" cell (RES 09). Without this step, the first five degrade into a faster pipeline — generate, filter, integrate, output, errors flowing away never to return, the system forever repeating the same blind spots. With it, errors become the system's learning signal: each wall-hit grows the guardrail a little, fewer next round. This is exactly where it is fully isomorphic to engineering's "errors feed back as new tests" — and the reason the volume keeps stressing the metric "retraction/refutation rate should fall": it measures not "fewer mistakes" but "whether this loop is learning."
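The three feed-back routes named above can be sketched as a lookup from incident type to the part of the evidence base it updates. This is an illustration only: the evidence base is modelled as a plain dictionary of lists, and every identifier here (`Incident`, `FEEDBACK_TARGET`, `feed_back`, `pipeline_discard`) is a hypothetical name, not an existing tool.

```python
from enum import Enum
from typing import Dict, List


class Incident(Enum):
    NON_REPLICABLE_CLAIM = "non-replicable claim"
    RECURRING_RESIDUAL = "recurring residual the schema calls anomalous"
    REFRAMING_DELETED_AS_NOISE = "paradigm-level reframing deleted as noise"


# Which part of the evidence base each incident type updates.
FEEDBACK_TARGET = {
    Incident.NON_REPLICABLE_CLAIM: "conflict_rules",
    Incident.RECURRING_RESIDUAL: "node_types",
    Incident.REFRAMING_DELETED_AS_NOISE: "weak_far_dispositions",
}


def new_base() -> Dict[str, List[str]]:
    """An empty toy evidence base with one list per feedback target."""
    return {target: [] for target in FEEDBACK_TARGET.values()}


def feed_back(base: Dict[str, List[str]], incident: Incident, note: str) -> None:
    """Loop behaviour: the error becomes guardrail material for the next round."""
    base[FEEDBACK_TARGET[incident]].append(note)


def pipeline_discard(base: Dict[str, List[str]], incident: Incident, note: str) -> None:
    """Pipeline behaviour, for contrast: the error is scrap and the base is unchanged."""
    return None
```

Counting how many entries each list gains per round gives a rough view of whether the loop is learning, in the spirit of the retraction and refutation metric discussed above.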
A research bet is not a single decision but a portfolio
Once the research loop runs, the first governance question that bites back is: how much bandwidth goes to exploitation (refining the known, steady yields), and how much to exploration (chasing the uncertain, possibly nothing)? RES 07 already showed efficiency eats exploration by default — so the "bet ratio" cannot be left to default; it must be managed as an adjustable, observable portfolio. This is the new degree of freedom abundance creates: when execution is near-free, the marginal cost of running a redundant exploration drops sharply, so in theory you can afford a higher exploration share; but if incentives still appraise on output, the freed capacity gets pushed back to exploitation by default. The instrument below turns this tension into adjustable sliders — set the "explore vs exploit" mix and the "redundant exploration" allowance, watch how expected novelty and cost move together, and watch whether you are sliding into hypernormal's "every metric green while coverage shrinks."
INSTRUMENT 11 · RESEARCH-BET PORTFOLIO
Two sliders: the explore↔exploit mix and the redundant-exploration allowance. The readout gives expected novelty, expected cost, and a verdict on "sliding into hypernormal": RES 07's fate of the useless wood, turned into an adjustable portfolio.
This instrument gives no "optimum"; it forces you to face the trade-off. Slide the explore share to 0 and expected cost is lowest, readout all green — but coverage is also lowest: that is hypernormal, efficient, stable, shrinking. Slide it to 100 and coverage is widest, expected novelty highest, but cost climbs steeply and much of the exploration is destined to yield nothing (the literal meaning of "useless-wood / the use of the useless"). Abundance changes the shape of this trade-off curve: near-free execution makes "high-redundancy exploration" no longer the prohibitive cost it once was, so the rational explore share should move up — but that move only actually happens when the incentive structure lets "unmeasurable slack" survive. The "redundant-exploration allowance" slider models exactly whether an organization will budget for seemingly useless directions.
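The qualitative shape of the readout can be mimicked with a toy model. The text does not give the instrument's formulas, so everything below is an assumption: the functional forms, the constants, the `hypernormal_below` threshold, and the names `readout`, `coverage`, and `verdict` are placeholders chosen only to reproduce the behaviour described (lowest cost and lowest coverage at zero exploration, steeply rising cost at full exploration).

```python
from typing import Dict, Union


def readout(
    explore_share: float,
    redundancy_allowance: float,
    hypernormal_below: float = 0.2,  # assumed coverage threshold for the verdict
) -> Dict[str, Union[float, str]]:
    """Toy readout for the two sliders; both inputs are fractions in [0, 1]."""
    if not (0.0 <= explore_share <= 1.0 and 0.0 <= redundancy_allowance <= 1.0):
        raise ValueError("both sliders must be between 0 and 1")

    # Assumed form: coverage grows with exploration, more so when redundant
    # (seemingly useless) directions are allowed a budget.
    coverage = explore_share * (0.5 + 0.5 * redundancy_allowance)
    expected_novelty = coverage  # assumed: novelty tracks coverage one to one
    # Assumed form: cost is lowest at zero exploration and climbs steeply.
    expected_cost = 1.0 + 4.0 * explore_share ** 2

    verdict = (
        "sliding into hypernormal"
        if coverage < hypernormal_below
        else "exploration alive"
    )
    return {
        "coverage": coverage,
        "expected_novelty": expected_novelty,
        "expected_cost": expected_cost,
        "verdict": verdict,
    }
```

The toy returns no optimum, matching the instrument's intent: it only makes the trade-off visible so that the explore share is set deliberately rather than left to default.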
A starting path (do not build it all at once): first run a narrow ①+② workflow — pick one step where "execution is already abundant but judgment is not yet externalized" (literature synthesis, parameter sweeps), stand up a minimal traceable evidence base, hand that step's in-paradigm actions to generation. Once it runs, add ③ the ledger and ④ integration. Reinvest the hours ② saves explicitly into ③④, not into producing more papers — the easiest place to go wrong (see RES 14's boundary). Three starts: stand up a traceable evidence base → use the believability ledger to choose where to inject human judgment → reinvest saved hours into integration and owning value.
The research volume's thesis is not a universal law. It is strongest where "execution can be massively abundified and judgment can be externalized," and degrades where execution itself is still the real bottleneck, or where value criteria are highly consensual. Draw the boundary first, then talk rollout — a hard gate, and a matter of honesty.
The boundary test: is execution truly abundified, can judgment truly be externalized
To state "where it applies" precisely, return to the thesis's two premises and test each against a given field. Premise one: execution can be massively abundified. It holds in computational biology, materials screening, and literature synthesis, where the marginal cost of one experiment, screen, or review trends to zero and the work parallelizes. It fails in fields rate-limited by physics or ethics: a phase-three clinical trial, a field sample needing a three-year growth cycle, a rare-sample wet lab. There execution is still the real bottleneck, the premise "execution abundant, judgment scarce" collapses entirely, and the volume's retreat thesis adds little, because scarcity never moved off the execution end. Premise two: judgment can be externalized. It requires that the criteria for "worth believing, worth knowing" can be partly written down and partly machine-checked. In fields where criteria are highly consensual (some structured-prediction tasks where "correct" is near-uncontested), externalization is easy but the volume's gain is also small, because there is no value fork to speak of. The volume's sweet spot is precisely the field where both premises hold and judgment is not yet externalized: execution can already be abundified, yet judgment is still trapped, unstructured, in a few experts' heads. Running these two as a gate, case by case, is far more reliable than memorizing a "list of applicable fields."
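Running the two premises as a gate can be sketched as one function per field. This is a hedged illustration: the four boolean inputs are judgments a person supplies for a given field, and the function name and verdict strings are hypothetical labels, not terms defined by the volume.

```python
def applicability(
    execution_abundant: bool,
    judgment_externalizable: bool,
    value_criteria_consensual: bool,
    judgment_already_externalized: bool = False,
) -> str:
    """Apply the two-premise boundary test to one field and return a verdict."""
    # Premise one: execution can be massively abundified.
    if not execution_abundant:
        return "ill-fitting: execution is still the real bottleneck"
    # Premise two: criteria can be partly written down and machine-checked.
    if not judgment_externalizable:
        return "ill-fitting: judgment cannot be externalized"
    # Both premises hold, but an uncontested "worth knowing" leaves no value fork.
    if value_criteria_consensual:
        return "down-weight: value criteria are consensual, gain is small"
    # The sweet spot additionally requires that judgment is not yet externalized.
    if judgment_already_externalized:
        return "outside the sweet spot: judgment is already externalized"
    return "sweet spot: execution abundant, judgment not yet externalized"
```

Checking execution first mirrors the order of the argument: if scarcity never moved off the execution end, the later questions do not need answering for that field.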
Most applicable · thesis strongest
Data/compute-intensive, parallel-execution fields (computational biology, materials screening, literature synthesis)
Greenfield programs: redraw the workflow from zero around "generate much, verify hard"
Fields with machine-checkable criteria (structured prediction, formalizable proof)
Ill-fitting / down-weight · do not force
Fields where execution is still the real bottleneck (rare-sample fieldwork, wet labs rate-limited by physics/ethics, clinical trials)
Fields with highly consensual value criteria: when "worth knowing" is uncontested, the value-retreat thesis adds little
Using "questioning is abundified" as proven grounds to cut people: it is exploratory, not proven (see RES 02, to be grounded)
The master switch (greenfield vs transformation): a greenfield program can be redrawn by this volume directly — stand up the traceable evidence base first, then use value-triage to choose where to inject human judgment. An incumbent lab is a gradual transformation: carve out the "execution can be massively abundified" steps of one workflow and redraw them first, then explicitly reinvest the saved hours into integration and credibility judgment — not into producing more papers. The boundary in one line: this volume applies where "execution is already abundant but judgment is not yet externalized"; where execution is still scarce, or judgment is already consensual, say plainly this is not its target group.
The legibility problem: breakthroughs may cost some unreadability. The applicability boundary has one deeper, less comfortable frontier claim that must be put honestly on the table: if you really want AI to produce breakthroughs, some loss of legibility may be unavoidable. By analogy to AlphaZero — some of its moves are "conceptually opaque," stronger than any human yet no one can fully explain why. When AI does something similar in science, the risk is that discoveries get "stranded" in a flood of output no one can parse: you hold a model that predicts more accurately than current theory, yet you cannot translate it into knowledge a human can understand, act on, or prioritize. This is a real tension for the research volume — its whole argument rests on "humans catching the judgment," but if a breakthrough is itself partly unreadable, what a human catches is only a black box's output, not its reasons.
The boundary in one line: do not cut people on an exploratory ledger. The thing that should be treated as a hard gate in the applicability boundary is not the partition of technical fields but an honesty discipline: this volume carries many claims tagged "exploratory · to be grounded" — questioning abundified (RES 02), the integration gap exploding (RES 05), net knowledge −40% (RES 02's ODE prediction) — which are side-evidenced projections, not proven facts. Treating these exploratory ledgers as "proven reality" for organizational decisions, especially to cut people, is this volume's most dangerous misuse. An organization that lays off staff on the grounds that "questioning is already abundified by AI, so we need fewer researchers" is in fact using a grade-Ⅴ projection as grade-Ⅰ evidence — and precisely what RES 06/07 stress repeatedly is: what gets abundified is in-paradigm questioning, while paradigm-level reframing and value judgment are not abundified but appreciate. Mistaking an exploratory ledger for a hard anchor cuts away the very judgment you meant to keep.
Greenfield: redraw directly; incumbent: carve out one workflow first
Down to "how to start," the applicability boundary naturally splits into two paths matching an organization's two starting points. Greenfield — a program starting from zero can be redrawn by this volume directly: step one is not hiring more researchers but standing up a minimal traceable evidence base and writing down explicit criteria for "worth believing / worth knowing"; then using value-triage (RES 10's matrix + RES 09's ledger) to decide which actions go to generation and which inject human judgment. Greenfield's advantage is no legacy-process inertia, so you can set the order right in one go (spec before generation, integration before retrieval). Transformation — a lab already running must never be torn down and rebuilt, only changed gradually: carve out of one workflow the single step where "execution is already abundant but judgment is not yet externalized" (literature synthesis and parameter sweeps are the most common entry points), redraw only that segment, and expand once it runs. Transformation's deadliest trap is spending the hours ② saved on more papers — which RES 13 warns against repeatedly. The two paths share one boundary verdict: this volume applies where "execution is already abundant but judgment is not yet externalized"; where execution is still the real bottleneck or judgment is already highly consensual, say plainly this is not its target group — do not force it.
The remedy is not to slow down but to build a translation layer. Honestly, this is a projection from Asimov Press and others, lacking engineering empirics, flagged as a frontier claim. A possible way out: build an "explanation layer / translation layer" that makes AI's discoveries legible and prioritizable for humans — not demanding that AI produce only what humans can immediately understand (that would lock it back into the paradigm), but, after it produces, deliberately investing in the work of translating unreadable discoveries into readable knowledge. This is itself a new landing point for kernel ④'s "humans return to meaning": when first-order discovery may be unreadable, one of the human's scarce contributions is to build the bridge that translates a black box's output into human understanding. It also wires back to RES 05's integration: legibility-translation is, in essence, the hardest kind of integration — stitching a discovery with no frame to borrow into humanity's existing structure of understanding.
RES
17
SPECULATION
Inference · Extrapolation, Not Fact
2026 to 2032: When Research Starts to Design Science Itself
This act draws no single acceleration curve; it opens a possibility space. Autonomous-research-agent autonomy will climb the "execute → design → select-agenda" ladder, and meta-science (the study of which scientific institutions generate better paradigms) turns from a luxury into a necessity. Below are three observable converging forces, each with a falsification condition; one explicitly fictional 2031 autonomous-lab quarterly; and an on-record counter-bet against this volume's central thesis.
Nature of this chapter · Inference. What follows is extrapolation from the public trajectory of 2024–2026, not a statement of fact. It inherits the volume's honesty discipline: what gets abundified is in-paradigm questioning; whether constitutive value judgment ("which truth is worth knowing") also gets abundified is what this chapter bets on, and the first thing it should be falsified against. When the inference fails, this chapter should be the first to be rewritten.
Three converging forces, each with a falsification condition
Speculation is not prophecy about "which line must happen"; it names which forces are stacking and under what observation each would be judged wrong. If the three forces below hold simultaneously, research's face slides from "humans ask, machines execute" toward "machines also pose in-paradigm questions, humans retreat to selecting the agenda and defining what counts as true." Each carries a "leading indicator" and a "falsification condition" — the latter is the force's pressure point: see it, and the force was overrated.
FORCE 1
Autonomous experiment loops commoditize
Converging: self-driving labs + coding agents + literature agents assemble a full "hypothesis → experiment → analysis → next hypothesis" loop; unit cost of discovery falls year on year.
Leading indicator: in a field, the share of "unattended, peer-review-passing" papers rises for two consecutive years.
Falsified if: by 2029 autonomous loops still work only in narrow domains (materials screening, hyperparameter search) and cross-domain reproducibility falls rather than rises. Then "whole-loop commoditization" was a local illusion, not a general force.
FORCE 2
Question-asking partly abundified
Converging: knowledge-graph agents do "nearest-neighbor search on the knowledge frontier" (finding gaps, filling missing links, posing good in-paradigm questions), approaching the level of a skilled PhD student.
Leading indicator: acknowledgments of the form "question first posed by AI, humans selected and executed" appear and multiply in top journals.
Falsified if: in blind review, AI-posed questions skew systematically toward "safe, in-paradigm, citation-dense" and that skew does not converge over three years. Then the constitutive half of questioning was not abundified; force 2 only ate the margins.
FORCE 3
Meta-science goes mainstream
Converging: since there is still no criterion for "what makes one paradigm better," accelerating execution does not automatically equal progress, so "which institutions generate better paradigms" itself becomes a funded, experimented-upon object.
Leading indicator: registered studies appear that treat review mechanisms, funding rules, and replication incentives as variables in controlled experiments (science becomes its own model organism).
Falsified if: a decade of acceleration later, the rate of breakthrough (non-incremental) paradigms plateaus rather than rises and no one can attribute it to institutional variables. Then "meta-science can move paradigm quality" lacks an operable handle.
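The pairing of each force with a falsification condition can be kept as a small checkable record. The sketch below is illustrative: the observation keys (for example `loops_only_narrow_by_2029`) are hypothetical flags that an observer would set by hand, and `Force`, `FORCES`, and `overrated` are names introduced here, not part of the chapter.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Observations = Dict[str, bool]


@dataclass(frozen=True)
class Force:
    name: str
    leading_indicator: str
    falsified: Callable[[Observations], bool]  # True means the force was overrated


FORCES = [
    Force(
        "Autonomous experiment loops commoditize",
        "share of unattended, peer-review-passing papers rises two years running",
        lambda o: o["loops_only_narrow_by_2029"] and o["cross_domain_reproducibility_falling"],
    ),
    Force(
        "Question-asking partly abundified",
        "acknowledgments of AI-posed questions multiply in top journals",
        lambda o: o["ai_questions_skew_safe"] and o["skew_persists_three_years"],
    ),
    Force(
        "Meta-science goes mainstream",
        "registered studies treat review, funding, and replication rules as variables",
        lambda o: o["breakthrough_rate_flat_after_decade"] and o["no_institutional_attribution"],
    ),
]


def overrated(observations: Observations) -> List[str]:
    """Names of the forces whose falsification condition has been observed."""
    return [force.name for force in FORCES if force.falsified(observations)]
```

Each condition is a conjunction, as in the text: a force counts as overrated only when both halves of its falsification condition are observed.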
FIG. 14.0 / THE SPECULATION ACT: RESEARCH'S 2026→2032 POSSIBILITY SPACE (a branch field, not a line). Read: x-axis = credibility of the autonomous loop (weak→strong); y-axis = who holds value judgment (kept by humans→handed to the system). The four cells are four 2032 pictures; this volume bets on the top-right "humans hold the agenda" cell, the counter-bet on the bottom-right "judgment learned away" cell.
The two axes are research's two least-certain things: whether the autonomous loop is credible at all (x), and whether the right to judge "worth" ends up with humans or the system (y). This volume bets on the top-right cell: execution abundant, humans hold the agenda. The counter-bet is the bottom-right cell, where even value judgment is learned away. Note the two left cells: once the loop is not credible, acceleration only amplifies error, turning the research loop into a high-speed generator of hypernormal science. The figure's value is not in picking a cell but in giving each cell's leading indicator, so that you can judge from real observation which cell the world is sliding toward.
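The four cells can be located with two yes/no observations. A minimal sketch, assuming each axis is reduced to a boolean; the function names and the description strings are hypothetical paraphrases of the paragraph above, and the two left cells share one description because the text characterizes them together.

```python
def cell(loop_credible: bool, humans_hold_value_judgment: bool) -> str:
    """Map the two axes onto a cell position such as 'top-right'."""
    column = "right" if loop_credible else "left"          # x: loop credibility
    row = "top" if humans_hold_value_judgment else "bottom"  # y: who judges worth
    return f"{row}-{column}"


def describe(loop_credible: bool, humans_hold_value_judgment: bool) -> str:
    """Attach the paragraph's reading to the cell position."""
    position = cell(loop_credible, humans_hold_value_judgment)
    if not loop_credible:
        # Both left cells: the loop is not credible.
        return f"{position}: acceleration amplifies error"
    if humans_hold_value_judgment:
        return f"{position}: volume's bet, humans hold the agenda"
    return f"{position}: counter-bet, value judgment learned away"
```

In practice the inputs would come from the leading indicators for each cell rather than from a single judgment call, which is why the function takes observations instead of returning a forecast.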
2026→2028→2030→2032: research's face deforms step by step
NOW · 2026–2027
AI as a powerful copilot; humans still hold every judgment gate
Literature review, code, and first-pass analysis are handed wholesale to agents; questioning, experiment-design gatekeeping, and "is it worth publishing" remain human work. Observable signal: top-journal submissions are already rising (Organization Science AI Task Force, Apr 2026: +42%) while review bandwidth has not kept up. The tension surfaces, but the judgment gate is still in human hands.
NEAR · 2028–2029
In-paradigm questioning partly abundified; review institutions buckle first
Knowledge-graph agents reliably pose "good in-paradigm questions," and autonomous loops run unattended in narrow domains. The first thing to deform is not the lab but the review-and-publication institution: the generation end is accelerated, the judgment end is not scaled, and the system falls back on the cheapest proxies (format, similarity, citation counts); net knowledge may decline (the ODE model in arXiv 2604.05714 predicts about −40%, a model prediction, not proven). Meta-science's first controlled experiments arrive in this window.
MID · 2030
Autonomous labs become normal; humans retreat to "select agenda + define what is true"
On the "execute → design → select-agenda" ladder, the first two rungs are largely eaten; the last (direction selection) stays scarcest (FIG 12.0). The human-to-machine ratio in research orgs jumps from single to double digits; the evaluation lens shifts from "output volume" to "judgment quality + context coherence." The pivotal divergence of this year: whether AI-chosen agendas can match the human baseline on long-run citation — exactly the test of whether the rightward x-shift drags the y-axis down.
FAR · 2031–2032+
Two lines fork: humans hold the agenda, or "worth" is systematized too
Here the volume and the counter-bet formally fork. Volume line: constitutive value judgment is not abundified; humans become a few high-density "agenda gatekeepers + truth referees," and the research org looks like a tiny team of extreme judgment density. Counter-bet line: something like RLCF finally learns frontier value departing from the community mean, "which truth is worth it" is systematized, and the human's last ground collapses. Which comes true turns on that 2030 citation test — not on who argues more eloquently.
Speculation made only of assertions reads thin. The piece below is design fiction: an explicitly fictional 2031 future artifact that makes "research retreating to agenda and refereeing" tangible. It is not a prediction; it is a way of projecting the thesis onto 2031.
SPECULATIVE · Fiction
ARTIFACT 01 · Excerpt from an Autonomous-Lab Quarterly
Meridian Autonomous Lab · 2031 Q3 Research Quarterly (Excerpt)
Selecting the agenda 38% · independent replication & adjudicating credibility 41% · building a "legibility layer" for unreadable findings 21% (coding is now < 2%)
Retired metric
"Paper output volume" has been removed from the quarterly: the loop produces it near-free, so it is no longer a scarce signal
New metric
Agenda hit-rate: the share of this quarter's chosen directions picked up or replicated by independent teams within three years (it replaced "count of high-citation papers")
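The agenda hit-rate described in this fictional quarterly can be made concrete. The following is a minimal sketch, not the lab's method: the `Direction` record, its field names, and the example data are all hypothetical, and "within three years" is approximated as 3 × 365.25 days.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Direction:
    """One research direction chosen in a given quarter (hypothetical record)."""
    name: str
    chosen_on: date
    # Date an independent team first continued or replicated it; None if never.
    picked_up_on: Optional[date] = None


def agenda_hit_rate(directions: List[Direction], window_years: float = 3.0) -> float:
    """Share of chosen directions picked up by an independent team within the window."""
    if not directions:
        return 0.0
    window_days = window_years * 365.25  # assumption: calendar window as average days
    hits = 0
    for d in directions:
        if d.picked_up_on is None:
            continue
        elapsed = (d.picked_up_on - d.chosen_on).days
        if 0 <= elapsed <= window_days:
            hits += 1
    return hits / len(directions)


# Invented example data, for illustration only.
quarter = [
    Direction("A", date(2031, 7, 1), date(2032, 3, 1)),   # picked up in time
    Direction("B", date(2031, 7, 1), date(2035, 1, 1)),   # picked up too late
    Direction("C", date(2031, 7, 1)),                      # never picked up
    Direction("D", date(2031, 8, 15), date(2034, 8, 1)),  # picked up in time
]
print(agenda_hit_rate(quarter))  # 2 of the 4 directions count as hits
```

Note that the metric can only be scored three years after the quarter closes, which is part of why it resists the fast gaming that output counts invite.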
"We no longer take pride in how many papers we produced — that is a byproduct of the loop. We are accountable for only two things: which worth-chasing questions we chose correctly, and which 'results' we dare to sign off as true. The rest, the system grows on its own." — memo to the board
The counter-bet, on record: judgment may be just one more capability awaiting abundance
Honesty requires recording the strongest counter-argument, not only the line that flatters one's own thesis. This volume's central claim is: constitutive value judgment ("which truth is worth knowing") will not be abundified by stronger models — its scarcity is structural, not a capability threshold. The counter-bet's sharpest cut is: that "structural scarcity" may itself be a temporary artifact of models not yet being strong enough.
Counter-bet · against this volume
RLCF ("AI can learn scientific taste," arXiv 2603.14473) reports that AI can learn the community mean of scientific taste. The counter-bet predicts: a large-enough model plus long-enough citation feedback will eventually learn frontier value that departs from the mean, i.e. "which truth is worth knowing" gets systematized. If so, "humans return to meaning" is not the endgame but a transitional state of weak models, and this volume's entire step ④ gets rewritten. This fork is precisely the open question research hands to the innovation methodology: the same experiment, "can the anti-consensus frontier be learned away," is a make-or-break adjudication in that volume. The condition that falsifies this counter-bet (i.e. that the volume holds): by 2032, on both long-run citation and "picked up by independent teams," AI-selected agendas remain systematically below the human-expert baseline, and the gap does not converge with model scale. On this one, the volume is willing to wager its central thesis.
Putting the counter-bet in the body is not rhetorical modesty; it is this volume's method itself: a claim with no falsification condition its own author would accept is not knowledge, only attitude. This chapter stands ready to be rewritten by that 2032 citation test — which is precisely why it earns the name "research methodology."
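The falsification condition stated for the counter-bet has two parts (AI-selected agendas stay below the human baseline on both measures, and the gap does not converge with model scale), so it can be written as a check. This is a sketch under stated assumptions: the tuple layout, the function name, and especially the `min_ratio` threshold used to operationalize "does not converge" are illustrative choices, not criteria given in the text.

```python
from typing import List, Tuple

# Each observation: (model_scale, citation_gap, pickup_gap), where a gap is the
# human-expert baseline minus the AI-selected agenda score (positive = humans ahead).
Observation = Tuple[float, float, float]


def volume_thesis_holds(observations: List[Observation], min_ratio: float = 0.9) -> bool:
    """True if humans stay ahead on both measures and the gap is not closing.

    Assumption: "not converging" means the gap at the largest model scale is
    at least `min_ratio` times the gap at the smallest scale.
    """
    if len(observations) < 2:
        raise ValueError("need at least two model scales to judge convergence")
    ordered = sorted(observations)  # ascending model scale
    for metric in (1, 2):  # 1 = long-run citation, 2 = pickup by independent teams
        gaps = [obs[metric] for obs in ordered]
        if any(g <= 0 for g in gaps):
            return False  # AI matched or beat the human baseline somewhere
        if gaps[-1] < min_ratio * gaps[0]:
            return False  # the gap is shrinking as models scale
    return True


# Invented numbers, for illustration only.
stable_gap = [(1.0, 0.30, 0.20), (10.0, 0.31, 0.19), (100.0, 0.29, 0.20)]
closing_gap = [(1.0, 0.30, 0.20), (100.0, 0.05, 0.04)]
print(volume_thesis_holds(stable_gap), volume_thesis_holds(closing_gap))
```

Whatever threshold is chosen, it should be fixed before the 2032 data arrive; otherwise the test loses the falsifiability this section stakes its claim on.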
RES 18
PLAYBOOK + THE LAST LAYER
Action
Rollout · stand up the evidence base, then hold value accountability
Every redraw above reduces to one set of principles and metrics. Hold the principle — when the next literature tool or autonomous-scientist appears, the same ruler tells you whether it belongs in ① or ④. The last layer gives no static answer: use the dynamic three-way split for what is invariant, what is shifting, what is at the frontier.
01 / ↑
Generate much, verify hard · judging share↑
Stand up a traceable evidence base; the share of researcher time spent judging/replicating rises; output volume itself is not the metric.
02 / ↑
Integration over retrieval · integration ratio↑
Write down the criteria for "worth believing / worth knowing"; watch the ratio of integration artifacts to raw output.
03 / ↓
Hold value accountability · retraction/refutation↓
Give "worth" an owner so the generation layer's default bias cannot replace it; credibility hit-rate becomes measurable.
The last layer hands you no static checklist but a dynamic three-way split. The split is not a one-time classification; it is re-decided as the abundance frontier moves right — the same action sitting in "shifting" today may slide into or out of "invariant" next year.
INVARIANT
Which truth is worth knowing
A constitutive value judgment with no right or wrong answer, only ownership: AI learns the average, not the heterogeneous. This is the bedrock.
SHIFTING
Questioning/verifying automated
[exploratory · Ⅲ] peer review's net knowledge −40% (an ODE-model prediction, not proven); in-paradigm questioning joins ① abundance.
FRONTIER
Who owns the direction
[exploratory] when value judgment is the scarce resource, the power to set direction is a governance question; handed to the Organization volume, unresolved.
Drop a research action onto two axes and see whether it goes to generation, is decided by evidence-base rules, or must be judged by a human — the kernel's double retreat (epistemic → axiological) made playable.
X · AI-executable / generatable?
Y · Needs human value judgment?
Human · value: Generatable × value-laden
Human · credibility: Hard to automate × value-laden
Hand to generation: Generatable × machine-checkable
Knowledge-graph rules decide: Hard to automate × machine-checkable
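The two-axis triage above reduces to a four-way lookup. A minimal sketch, assuming each research action has already been answered yes/no on both axes (the function name and return strings are illustrative):

```python
def triage(ai_executable: bool, needs_value_judgment: bool) -> str:
    """Place a research action in one of the four cells of the two-axis grid."""
    if needs_value_judgment:
        # Value-laden actions stay with a human on either side of the x-axis.
        return "human: value" if ai_executable else "human: credibility"
    # Machine-checkable actions never need a human verdict.
    return "hand to generation" if ai_executable else "knowledge-graph rules decide"


for x in (True, False):
    for y in (True, False):
        print(f"executable={x!s:<5} value-laden={y!s:<5} -> {triage(x, y)}")
```

The hard part in practice is answering the two questions, not the lookup; the sketch only makes explicit that the y-axis answer alone determines whether a human must sign.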
The gap between production rate and digestion rate is this volume's real bet
Collapse all the sheets into one line: the research volume bets not that "AI cannot do science" but that "the rate of knowledge production will far outstrip the rate humans can digest, so scarcity migrates permanently from the production end to the digestion end (judging, integrating, valuing)." This bet has order-of-magnitude support: the scientific literature already runs at ~2.5 million papers a year, doubling every 9 years, with AI layering a qualitative acceleration on top. When the production curve climbs exponentially while human cognitive bandwidth stays nearly constant, the scissors-gap between the two curves is where this volume's entire thesis lives — RES 05's integration gap, RES 03's credibility judgment, RES 06's valuing are all facets of that gap. If one day this bet is falsified (human digestion can scale proportionally with AI, or machines can losslessly take over digestion-end judgment), the whole volume should retire. Being able to write the retirement condition is what makes it a claim, not a faith.
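The "doubling every 9 years" figure can be turned into an annual rate and a gap ratio. A worked sketch, assuming smooth exponential growth and a constant human digestion capacity $C$ (both are simplifying assumptions; the symbols $N_0$, $r$, and $C$ are introduced here for illustration):

```latex
% Production curve: about 2.5 million papers per year, doubling every 9 years
N(t) = N_0 \cdot 2^{t/9}, \qquad N_0 \approx 2.5 \times 10^{6}\ \text{papers/year}

% Implied annual growth rate
r = 2^{1/9} - 1 \approx 0.080 \quad (\text{about } 8\%\ \text{per year})

% Scissors gap against a constant digestion capacity C
\frac{N(t)}{C} = \frac{N_0}{C} \cdot 2^{t/9}
\quad\Longrightarrow\quad
\frac{N(18)}{C} = 4 \cdot \frac{N_0}{C}
```

Under these assumptions the production-to-digestion ratio quadruples in 18 years before any AI acceleration is layered on top.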
The inner unity of the three principles: "generate much, verify hard," "integration over retrieval," "hold value accountability" look like three things but are one line projected onto three positions of the loop. Generate much — admits ① execution is abundant; verify hard — the load-bearing act after ② judgment retreats; integration first — the human's scarce contribution catching the ④ bandwidth bottleneck; hold value — ④ keeping "worth" with humans, not the generation layer's default bias. The three share one measurement discipline: output volume itself is never the metric. What should rise is the share of time spent judging/replicating, the ratio of integration artifacts to raw output, the clarity of ownership over research-direction value decisions; what should fall is the retraction/refutation rate and the convergence rate toward known solutions. Pin this metric set on the wall and you have a mirror that, at any moment, shows whether you are sliding into hypernormal.
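The metric set in the paragraph above has a direction attached to each entry, so a period-over-period check is easy to sketch. The key names below are hypothetical labels for the five quantities the text lists, and the snapshot values are invented.

```python
from typing import Dict, List

# Quantities the text says should rise, and those it says should fall.
SHOULD_RISE = ("judging_time_share", "integration_ratio", "direction_ownership_clarity")
SHOULD_FALL = ("retraction_rate", "convergence_to_known_rate")


def drift_flags(previous: Dict[str, float], current: Dict[str, float]) -> List[str]:
    """Return the metrics that moved in the wrong direction between two periods."""
    flags = []
    for key in SHOULD_RISE:
        if current[key] < previous[key]:
            flags.append(key)
    for key in SHOULD_FALL:
        if current[key] > previous[key]:
            flags.append(key)
    return flags


last_quarter = {
    "judging_time_share": 0.40, "integration_ratio": 0.10,
    "direction_ownership_clarity": 0.70,
    "retraction_rate": 0.02, "convergence_to_known_rate": 0.30,
}
this_quarter = {
    "judging_time_share": 0.35, "integration_ratio": 0.12,
    "direction_ownership_clarity": 0.70,
    "retraction_rate": 0.02, "convergence_to_known_rate": 0.36,
}
print(drift_flags(last_quarter, this_quarter))
```

Output volume is deliberately absent from both tuples, matching the shared discipline that it is never the metric.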
Closing · the whole thesis
Machine-checkable / in-paradigm judgments get abundified; the constitutive / paradigm-level / heterogeneous value judgment is the human's last ground. And the real enemy is not AI getting stronger, but people voluntarily handing over even "defining what is worth knowing."
The coupling hub · seams to the whole series
Research is the most deeply coupled volume in the series, and each of its seams lands on a clickable link: upward it hands "which truth is worth knowing" to Innovation (value discovery); downward it hands "who owns the direction" to the Organization (governance · reading entry); it mirrors Engineering (judges correctness) isomorphically (the bottleneck moves, but research judges credibility and worth); it shares the "un-outsourceable spec" with Design (judges goodness); it shares one guardrail with Architecture/Lineage at SHEET 04 (knowledge graph ↔ design system ↔ architecture boundaries); and it sits beside Learning in the upstream cognitive layer. Without the "truth" the upstream supplies, the downstream's efficiency is only precise idling. The full wiring is in the system chart.
The research surface, as an executable skill: ai-native-research
This is the executable companion: it actually does the research the way this volume describes (traverse the literature at scale, generate hypotheses, run standard analyses, draft synthesis), then vouches for what is credible and decides which truth is worth knowing. It is not "design a research org" (that is the architect, ai-native-architect), nor a literature-search tool that merely speeds up the old pipeline; delete the AI and it does not collapse to "a researcher reading faster," because the loop is redrawn around abundant generation gated by a load-bearing verifier (replication + a credibility ledger), with a traceable evidence base as the spec.
# invoke inside Claude Code
$ /skill ai-native-research
> "Synthesize these 40 conflicting papers into one judgment: is this claim credible, and is it worth our bet?"
→ a Research Finding Dossier = the finding + a credibility ledger (claims graded Ⅰ–Ⅴ, claims kept unmixed from evidence) + a knowledge-graph contribution + a blind-spot register
What this is: the research executable companion in a seven-piece system on one shared kernel. The architecture layer (ai-native-architect) designs the organization; the six companion pieces are one per surface, one kernel, mutually coupled, with no fixed reading entry; this is the executable form of the research methodology. Judgment node + stop-line: hand traversal, synthesis, and drafting fully to agents, but "which truth is worth knowing" and the final credibility verdict must be signed by a human. A grade can be drafted by a tool; the verdict "this is what I, in this value frame, am willing to vouch for" cannot be offloaded. A high "score" cannot proxy a species/surrogate jump; where a finding feeds a high-stakes irreversible decision (clinical, legal, safety), the case for reserving the credibility verdict for a named human is stronger, not weaker.
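The dossier's four parts and the stop-line (a tool may draft grades, but only a named human signs the verdict) suggest a small data structure. This is a hypothetical sketch, not the skill's actual schema: every class, field, and string below is an assumption made for illustration, and ASCII "I" to "V" stand in for the grades Ⅰ to Ⅴ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

GRADES = ("I", "II", "III", "IV", "V")


@dataclass
class Claim:
    """A claim and its grade, with evidence held separately as references."""
    statement: str
    grade: str
    evidence_refs: List[str] = field(default_factory=list)


@dataclass
class FindingDossier:
    finding: str
    ledger: List[Claim]          # the credibility ledger
    graph_contribution: str      # what this adds to the knowledge graph
    blind_spots: List[str]       # the blind-spot register
    signed_by: Optional[str] = None  # a named human; tools leave this empty

    def verdict(self) -> str:
        """Refuse to issue a vouched verdict without a human signature."""
        for claim in self.ledger:
            if claim.grade not in GRADES:
                raise ValueError(f"unknown grade: {claim.grade}")
        if not (self.signed_by and self.signed_by.strip()):
            return "DRAFT: awaiting human sign-off"
        return f"VOUCHED by {self.signed_by}"


dossier = FindingDossier(
    finding="Example finding (invented)",
    ledger=[Claim("Example claim (invented)", "III", ["R6"])],
    graph_contribution="links the example claim to the homogenization node",
    blind_spots=["no direct experiment yet"],
)
print(dossier.verdict())
dossier.signed_by = "A. Researcher"
print(dossier.verdict())
```

Keeping `evidence_refs` as a separate field from `statement` is one way to honor "claims kept unmixed from evidence": a claim can be regraded without rewriting what it cites.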
SPEC.V / AI NATIVE METHODOLOGY / OWL METHODOLOGY SERIES
SCOPE / One methodology · the full organizational spectrum N=1 → N=many (from the one-person company to the agent network, on a single set of first principles)
SERIES / Six volumes, one kernel · this volume is one surface; the full wiring is above under "The Series."
APPENDIX · SOURCES / Evidence and citation registry; grading key: Ⅰ audit-grade empirics (cross-checked against regulatory filings) · Ⅱ peer-reviewed · Ⅲ theoretical model / working paper (citations must read "the model predicts," never "proven") · Ⅳ practitioner first-hand account · Ⅴ advisory forecast (a forecast, not a fact). Citation rows are authoritative in this table; the current 3-vote adversarial review found no overturned source.
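The wording rule in the grading key (a grade Ⅲ source must be cited as a model prediction, never as proven) is mechanical enough to lint. A minimal sketch: the function name, the ASCII grade labels, and the specific string checks are assumptions, and a real checker would need more careful language handling.

```python
import re
from typing import List

GRADE_KEY = {
    "I": "audit-grade empirics",
    "II": "peer-reviewed",
    "III": "theoretical model / working paper",
    "IV": "practitioner first-hand account",
    "V": "advisory forecast",
}

# Matches prove/proven/proves/proved unless directly preceded by "not ".
_PROVEN = re.compile(r"(?<!not )\bprov(?:e|en|es|ed)\b")


def lint_citation(grade: str, sentence: str) -> List[str]:
    """Return wording problems for one cited sentence; an empty list means it passes."""
    problems = []
    text = sentence.lower()
    if grade not in GRADE_KEY:
        problems.append(f"unknown grade {grade!r}")
    elif grade == "III":
        if "model predict" not in text:
            problems.append("grade III must be worded as a model prediction")
        if _PROVEN.search(text):
            problems.append("grade III must not be described as proven")
    return problems


ok = "peer review's net knowledge -40% (an ODE-model prediction, not proven)"
bad = "The ODE model has proven a 40% decline in net knowledge"
print(lint_citation("III", ok))
print(lint_citation("III", bad))
```

The first sentence passes because "not proven" is the registry's own hedge; the second fails on both checks.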
REF
GR
SOURCE
Load-bearing claim
R1
Ⅰ
Open Science Collaboration, "Estimating the reproducibility of psychological science," Science 349(6251) · 2015 · DOI 10.1126/science.aac4716
Of 97 studies with significant results, only 36% replicated. Replication is the load-bearing verifier separating the research loop from a fast generator (RES 00 / 13).
R2
Ⅱ
Baker, "1,500 scientists lift the lid on reproducibility," Nature 533(7604) · 2016 · DOI 10.1038/533452a
Over 70% of scientists failed to reproduce others' experiments and over 50% failed to reproduce their own: the wall of reproducibility is empirical, not rhetorical (RES 05).
R3
Ⅳ
Karpathy, "Software Is Changing (Again)," YC AI Startup School · 2025
The frontier moves right as agentic execution becomes abundant: the left of the verifiability gradient is eaten while the right end holds (RES 02). Practitioner first-hand account.
R4
Ⅳ–Ⅴ
Anthropic ladder of research-agent autonomy · 2026 (company self-account; the curve is illustrative, not measured data)
The rightmost, hardest-to-automate rung is research-agenda selection (problem selection); the scarce judgment lands in this cell (RES 06).
R5
Ⅱ–Ⅲ
RLCF (RL from community feedback, arXiv 2603 series; final ID to be traced)
"The community mean of scientific taste" can be externalized and learned: what is learnable is the mean (the gradient's left), what is held is the off-mean frontier (RES 06); whether the anti-consensus frontier is learnable still lacks a direct experiment.
R6
Ⅲ
An ODE model of homogenization dynamics (arXiv 2604 series; final ID to be traced)
Formalizes "the generation layer converging to the mean" as dynamics; sharper than the qualitative argument, but still a model prediction, not proven (RES 08).
First-hand signal: the bottleneck is moving from "generating research" to "judging credibility"; generation is no longer scarce, vouching for credibility is (RES 01 / 02).
R8
Ⅱ
AI Feynman (symbolic regression) · Udrescu & Tegmark · Science Advances 6(16) · 2020 · DOI 10.1126/sciadv.aay2631
Recovered all 100 Feynman equations (older software got 71), but all are known equations: abundance excels "within a known frame," which is not the same as cross-frame new understanding (RES 06 / 11).
R9
Ⅱ
Hao, Xu, Li & Evans, "AI tools expand scientists' impact but contract science's focus," Nature 649(8099) · 2026 · DOI 10.1038/s41586-025-09922-y
Bibliometrics over ~41.298 million papers: AI-using scientists' individual impact rises, yet topic coverage contracts 4.63% and scholar-to-scholar interaction falls. The hard anchor for "acceleration ≠ progress" (RES 03 / 08).
R10
Ⅱ
Bornmann & Mutz, "Growth rates of modern science," JASIST 66(11) · 2015 · DOI 10.1002/asi.23329
The scientific literature base is ~2.5 million papers/year, doubling every 9 years: the empirical baseline for the production side of the scissors gap (RES 05 · FIG 7.0).
R11
Ⅳ–Ⅴ
"The epistemic revolution of AI" (epistemology review / opinion piece)
Argues AI simultaneously perturbs empiricism / falsification / Kuhnian paradigms and that "the rate of knowledge production outpaces single-human cognition"; review-grade side-evidence for the integration gap, with no item-by-item empirics (RES 05 / 06).
R12
Ⅱ
Doshi & Hauser, "Generative AI enhances individual creativity but reduces the collective diversity of novel content," Science Advances 10(28) · 2024 · DOI 10.1126/sciadv.adn5290
Give writers LLM ideas and individual stories get more "creative," yet the stories grow more similar to one another; the authors call it a "social dilemma" (individually better, collectively narrower). The strongest causal anchor for homogenization (RES 08).
R13
Ⅱ–Ⅲ
Anderson et al. (36-person experiment) · 2024
Homogenization is a group-level effect: it comes not from individual fixation but from the LLM suggesting similar ideas to different users, which locates the mechanism at the group, not the individual (RES 08).
R14
Ⅲ
"We're Different, We're the Same" · 2025
Controlling for structural variables, LLMs resemble one another far more than humans do: cross-model homogeneity that switching models does not cure (RES 08).
R15
Ⅱ
March, "Exploration and Exploitation in Organizational Learning," Organization Science 2(1) · 1991 · DOI 10.1287/orsc.2.1.71
Exploration (search / variation / risk) and exploitation (refinement / selection / efficiency) compete for one resource, and exploitation tends to win: the base for "freed capacity does not automatically become slack" (RES 07).
R16
Ⅱ
DeepMind (GNoME), "Scaling deep learning for materials discovery," Nature 624 · 2023 · DOI 10.1038/s41586-023-06735-9
Discovered ~2.2 million new crystalline materials, but the vast majority are element substitutions within known structure types: abundance expands the known; it does not switch the level of description (RES 11).
R17
Ⅳ
Historical anchors: Harry Beck's 1933 London Tube map (re-schematization) · William Farr's cholera map (data organized around "air quality")
Beck's paradigm act was discarding geographic accuracy and redrawing the map as a circuit diagram; Farr's variable frame could not infer "waterborne microbes." Reframing comes from changing variables, not piling on detail (RES 11).
A one-to-one map maxed out on detail is still the same information, not new understanding: integration is not longer retrieval (RES 05). Opinion piece; its cited empirics each need tracing and grading.
R19
Ⅳ
Historical anchors: Bell Labs · Xerox PARC · the Cambridge LMB ("small teams + institutional protection of redundant exploration")
The governance acts that protect "seemingly useless" exploration have historical support: the fate of useless-wood is conditional, not fated (RES 07).
R20
Ⅱ
Kuhn, "The Structure of Scientific Revolutions," University of Chicago Press · 1962 (monograph)
A paradigm shift lies by definition outside AI's training distribution: the axiological "reframe" judgment cannot be induced by statistical learning (RES 06 · FIG 6.1 / 6.2).
R21
Ⅱ
Hirsch, "An index to quantify an individual's scientific research output," PNAS 102(46) · 2005 · DOI 10.1073/pnas.0507655102
The h-index compresses "count of highly cited papers" into one countable proxy with two gameable paths (publish more / get cited more); it is the original device of RES 13①'s "counting output as worth."
R22
Ⅳ
Goodhart's law (Strathern 1997's widely cited restatement: "when a measure becomes a target, it ceases to be a good measure")
The mechanism base for all of RES 13: once a metric is a target it is optimized rather than reflective of its intent, and AI compresses the "fails-once-targeted" timescale from years to weeks. An aphorism-grade citation, not item-level empirics.
R23
Ⅳ
The journal impact factor (Garfield 1955 origin; later the Clarivate JCR commercial metric) · its originator repeatedly warned against judging single papers/people by it
Using "a journal mean" to proxy "a single paper's quality" is a category error (long-tailed citation distribution); it systematically prefers hype/positive results, which makes it RES 13②'s device for welding "hype-chasing" into incentives (links to RES 08 homogenization).
R24
Ⅱ
Björk & Solomon, "The publishing delay in scholarly peer-reviewed journals," Journal of Informetrics 7(4) · 2013 · DOI 10.1016/j.joi.2013.09.001
Median submit-to-accept delay runs in months: peer review is a serial gate whose throughput is set by the number of human experts (not submission volume), so it is structurally drowned when submissions are unbounded (RES 13③ · links to RES 05's scissors gap).
Full research dossier (27 claims · full qualifiers · open items): references/2026-06-deep-research-evidence-and-citations.md