Study · Negative result · 2026-06-01 研究紀錄 · 負結果 · 2026-06-01

When does automatic abstraction beat an event-driven switch-level simulator?自動抽象化在什麼時候能贏過事件驅動的開關級模擬器?

A case study on the NES 2A03 / 2C02 — an honest, fully-measured account.

以 NES 2A03 / 2C02 為案例 —— 一份誠實、全程實測的紀錄。

AprVisual project notes. All numbers below are measured on the project's own C# engine, not estimated. The external-model viewpoints are from Gemini 3.1-pro, explicitly instructed to be brutally honest.

AprVisual 專案紀錄。以下所有數字皆為本專案 C# 引擎上的實測,非估計。外部模型觀點來自 Gemini 3.1-pro,並明確要求其「殘酷誠實」。

Abstract摘要

We ask whether a flat transistor netlist can be transformed, fully automatically, into a faster single-CPU-core simulator than a carefully tuned event-driven switch-level engine. Using that golden engine as an oracle (static analysis of charge-sharing / clocking from a netlist with no capacitances is undecidable), we build an automatic pipeline that extracts 98.9% of the chip's per-cycle activity into boolean logic + registers — leaving only ~1.1% irreducible analog — and that reproduces the golden engine bit-for-bit on the covered nodes over hundreds of millions of checks. Despite this strong reducibility, every evaluation strategy we implement is slower than the golden engine on a single core: oblivious evaluation 45× slower, compiled-oblivious 84× (instruction-cache thrashing), and a macro-event scheme yields only 1.1× structural compression. We trace the cause to a sparsity-vs-density crossover: the golden engine already operates at the netlist's natural minimum granularity (the ~1.4-node conducting group), and the bidirectional, two-phase, pass-transistor structure forbids auto-deriving the static dataflow graph that dense evaluation would need. We generalize the negative result into a chip-characteristic → strategy map: the strategies are not bad, they are mismatched to this netlist; on chips with different properties the same strategies are live paths. Contributions: (i) a fully-automatic silicon→logic extraction + verification pipeline, (ii) a falsifiable reducibility measurement, and (iii) the characteristic→strategy map.

我們問:能否全自動地把一份扁平的電晶體網表,轉換成比「精心調校的事件驅動開關級引擎」更快的單核 CPU 模擬器?由於從「沒有電容值的網表」靜態分析 charge-sharing / 時脈是不可判定的,我們把該 golden 引擎當成oracle,建立一條自動管線:把晶片每週期活動的 98.9% 抽取成「布林邏輯 + 暫存器」—— 只剩 ~1.1% 不可約的類比 —— 並在數億次比對中,於涵蓋節點上逐位元重現 golden 引擎。儘管可約性如此之高,我們實作的每一種評估策略在單核上都比 golden 更慢:oblivious 評估慢 45×、編譯版 oblivious 慢 84×(指令快取 thrash)、macro-event 方案只得到 1.1× 的結構壓縮。根因是稀疏 vs 稠密的交叉點:golden 引擎已運作在網表的天然最小粒度(平均 ~1.4 節點的導通群),而「雙向、兩相、pass-transistor」的結構,使我們無法自動導出稠密評估所需的靜態資料流圖。我們把這個負結果一般化成一張晶片特性 → 策略對照表:這些策略並不爛,只是跟這份網表不匹配;在特性不同的晶片上,同樣的策略會是活路。貢獻:(i) 一條全自動的 silicon→logic 抽取 + 驗證管線、(ii) 一個可證偽的可約性量測、(iii) 特性→策略對照表。

1.Background & problem背景與問題

The subject is the NES 2A03 (a 6502-family CPU + APU) and 2C02 (PPU), simulated at the switch (transistor) level from a Visual6502/MetalNES-style netlist: ~14,700 live nodes, ~30,000 NMOS transistors, pass-transistor logic. The only inputs are connectivity (gate, c1, c2) and which nodes are VCC / GND / pull-up. There is no RTL, no source, and no annotated capacitances.

研究對象是 NES 2A03(6502 家族 CPU + APU)與 2C02(PPU),以 Visual6502/MetalNES 風格的網表做開關(電晶體)級模擬:~14,700 個 live 節點、~30,000 顆 NMOS 電晶體、pass-transistor 邏輯。唯一輸入是連線關係(gate, c1, c2)與哪些節點是 VCC / GND / pull-up。沒有 RTL、沒有原始碼、沒有電容標註。

The golden engine (the baseline to beat)Golden 引擎(要打敗的基準)

An event-driven switch-level engine (Bryant's model), ported to C# and heavily optimized (struct-of-arrays, unmanaged pointers, an O(1) singleton fast-path, a 256-entry flags→state LUT). On the test machine it runs at ~83,000 half-cycles/s = 12.0 µs/hc (≈48,000 CPU cycles/hc at 4 GHz). Per half-cycle it resolves connected transistor groups via flags-OR → LUT, with a floating-capacitance tie-break. Crucially, only ~90–600 nodes change per half-cycle and the average conducting group walked is ~1.4 nodes — extreme dynamic sparsity. This engine is bit-exact and is the offline reference.

一個事件驅動的開關級引擎(Bryant 模型),移植到 C# 並大幅最佳化(struct-of-arrays、非託管指標、O(1) singleton 快路徑、256 項 flags→state LUT)。在測試機上達 ~83,000 半週期/秒 = 12.0 µs/hc(4 GHz 下約 48,000 cycle/hc)。每半週期透過 flags-OR → LUT 解析連通電晶體群,並以 floating 電容 tie-break。關鍵:每半週期只有 ~90–600 個節點改變,平均走訪的導通群僅 ~1.4 個節點 —— 極端的動態稀疏。此引擎逐位元正確,是離線參考基準。

The question. The textbook way to make a logic simulator faster is to lift it to a higher abstraction and compile/levelize it. We require the lift to be fully automatic (program-derived from the netlist + the oracle; no hand-written per-block behavioral code, no "recognize it's a 6502"), CPU-only, single-instance, and only behaviorally correct (the rendered frame and test ROMs must pass; per-node bit-exactness may be sacrificed). Can such a lift beat 12.0 µs/hc?

研究問題。讓邏輯模擬器更快的教科書做法,是把它抬升到更高的抽象、再編譯/levelize。我們要求這個抬升必須全自動(由網表 + oracle 程式化導出;不准手寫各區塊行為碼、不准「認出它是 6502」)、限 CPU、單實例、且只需行為正確(畫面與測試 ROM 要過;逐節點 bit-exact 可犧牲)。這樣的抬升能贏過 12.0 µs/hc 嗎?

2.Method: oracle-driven empirical extraction ("Escape-1")方法:oracle 驅動的實證抽取(「Escape-1」)

Pure static analysis of a netlist with no capacitances cannot decide which nodes are dynamic latches, where clock phases are, or which behavior is charge-share analog. The asymmetric advantage is that we have a golden simulator. We replace undecidable static analysis with empirical-dynamic analysis: observe the oracle, then keep only what provably matches it. The pipeline, each stage fully automatic:

對「沒有電容值的網表」做純靜態分析,無法判定哪些節點是動態 latch、時脈相位在哪、哪些行為是 charge-share 類比。我們的不對稱優勢是有一個 golden 模擬器。於是用實證動態分析取代不可判定的靜態分析:觀察 oracle,只保留可證明與它相符的部分。管線各階段皆全自動:

  1. Coverability probe. For each node, is its value a consistent boolean function of its radius-1 inputs (its transistors' gate + far-end states) over a long oracle trace? A single contradiction flags it non-boolean (hidden state / analog).
  2. 可覆蓋性探針。對每個節點,在長 oracle trace 上,它的值是否是其 radius-1 輸入(電晶體的 gate + far-end 狀態)的一致布林函數?出現一次矛盾就標記為非布林(隱藏狀態 / 類比)。
  3. Extractor. Learn each clean node's truth table online — a dense table for ≤16 inputs, a sparse hash map for 17–60.
  4. 抽取器。線上學習每個乾淨節點的真值表 —— ≤16 輸入用 dense 表,17–60 用 sparse hash map。
  5. Structural bus model. Wide (>60-input) data/address buses are resolved by replicating the oracle's group resolution (a transitive walk over conducting channels → flags LUT + largest-capacitance float tie-break = dynamic-latch hold) but reading the model's own state. No truth table.
  6. 結構式 bus 模型。寬(>60 輸入)的 data/address bus,以「複製 oracle 的群解析」(沿導通通道遞移走訪 → flags LUT + 最大電容浮接 tie-break = 動態 latch 的 hold)來解,但讀模型自己的狀態。不需真值表。
  7. Evaluation by relaxation. The pass network is one giant bidirectional SCC (every pass transistor is a 2-cycle), so the extracted logic cannot be levelized; we iterate a sweep to a fixed point (~6.5 iterations).
  8. 以 relaxation 評估。pass 網路是一個巨大的雙向 SCC(每顆 pass transistor 都是一個 2-cycle),抽出的邏輯無法 levelize;我們把 sweep 迭代到 fixed point(~6.5 次)。
  9. Dynamic Miter + verify-then-enable. The model runs beside the oracle every half-cycle; any node that ever diverges is automatically demoted (a state element or true analog). The set converges, by construction, to one that reproduces the oracle exactly on the covered nodes.
  10. Dynamic Miter + verify-then-enable。模型每半週期與 oracle 並排跑;任何曾發散的節點自動降級(視為狀態元件或真類比)。此集合「建構即正確」地收斂到「在涵蓋節點上完全重現 oracle」。

Methodological note. "verify-then-enable" makes correctness a property of the construction, not an afterthought: a node is in the fast model only if it has been observed to never disagree with the oracle. This is what lets us claim behavioral exactness on the covered set despite using empirical (not proven) extraction.

方法論註記。「verify-then-enable」讓正確性成為建構的內建性質,而非事後檢查:一個節點只有在「被觀察到從不與 oracle 不符」時,才會進入快速模型。這正是我們能在「使用實證(而非證明)抽取」的情況下,仍宣稱涵蓋集合行為精確的原因。

3.Result A — the chip is ~99% reducible to logic + registers結果 A —— 整顆晶片約 99% 可約成邏輯 + 暫存器

Coverability is high and broad. Across three workloads (a static palette image, an all-opcodes CPU test, and the game Super Mario Bros driving the whole NES), 95.8–97.5% of observed nodes are clean radius-1 boolean; with the game exercising 78% of the chip, the PPU is 99.3% clean and the CPU 91.7% (the ~8% non-clean CPU nodes are exactly the famous 6502 analog points: ALU carry chain, decode PLA, dynamic nodes).

可覆蓋性又高又廣。在三種 workload(靜態調色盤畫面、全指令 CPU 測試、以及驅動整機的遊戲《超級瑪利歐》)下,被觀測節點中 95.8–97.5% 是乾淨的 radius-1 布林;在遊戲活化 78% 晶片時,PPU 達 99.3% 乾淨、CPU 91.7%(那 ~8% 不乾淨的 CPU 節點,正是著名的 6502 類比點:ALU carry chain、decode PLA、動態節點)。

Adding extraction techniques one at a time, each shrinks the residual switch-level activity and raises the idealized speed ceiling, while verify-then-enable holds behavioral match at 100.000% throughout (measured as the fraction of golden state-changes landing on covered nodes, Super Mario Bros, validation windows):

逐一加入抽取技術,每一種都縮小殘留的開關級活動、抬高理想加速上限,而 verify-then-enable 全程把行為相符維持在 100.000%(以「golden 狀態變化落在涵蓋節點上的比例」量測,《超級瑪利歐》,驗證窗):

Extraction抽取技術activity covered活動覆蓋residual switch-level殘留開關級Amdahl ceiling*Amdahl 上限*match相符
Dense truth tables (K≤16)Dense 真值表(K≤16)56.8%32.7%3.1×100%
+ Sparse maps (K≤60)+ Sparse map(K≤60)66.0%23.2%4.3×100%
+ Structural bus model+ 結構式 bus 模型84.3%5.0%20.1×100%
+ Stateful nodes via bus-resolve+ stateful 節點走 bus-resolve76.5% + 22.4% reg1.1%88.1×100%

*Amdahl ceiling assumes the covered (logic + register) work is free — an upper bound on reducibility, not an achieved speed (see §4–5). The final residual 1.1% is 100% no-channel / supply / analog — the irreducible core.

*Amdahl 上限假設「涵蓋的(邏輯 + 暫存器)工作免費」—— 那是可約性的上界,不是實際達成的速度(見 §4–5)。最終殘留的 1.1%100% 的 no-channel / supply / 類比 —— 不可約的核心。

Finding A. The NES 2A03/2C02 is, to within ~1.1% of its dynamic activity, a digital machine: boolean logic plus registers. This is a clean, automatic, falsifiable reducibility result. It is the strongest answer the project produced to "can this silicon be abstracted?".

發現 A。NES 2A03/2C02 在動態活動的 ~1.1% 誤差內,就是一台數位機器:布林邏輯加上暫存器。這是一個乾淨、自動、可證偽的可約性結果,也是本專案對「這塊矽能不能被抽象化」給出的最強答案。

4.Result B — reducibility does not become speed: every strategy is slower結果 B —— 可約性不會變成速度:每種策略都更慢

The reducibility ceiling assumes the extracted logic is cheap to evaluate. It is not. We implemented and measured each evaluation strategy against the 12.0 µs/hc golden engine:

可約性上限假設「抽出的邏輯評估起來很便宜」。並不便宜。我們把每種評估策略實作出來,對 12.0 µs/hc 的 golden 引擎量測:

Strategy策略µs / half-cycleµs / 半週期vs goldenwhy原因
golden (event-driven switch-level)12.0the baseline基準
Oblivious, interpretedOblivious,直譯53945× slower~39,000 node-evals/hc (6,000 nodes × 6.5 iters) vs golden's ~90–600每 hc ~39,000 次節點評估(6,000 節點 × 6.5 迭代)vs golden ~90–600
Oblivious, compiled (Roslyn)Oblivious,編譯(Roslyn)1,00484× slower~700 KB straight-line code × 6.5 sweeps ≫ 32 KB L1i ⇒ i-cache thrash~700 KB 直線碼 × 6.5 次掃 ≫ 32 KB L1i ⇒ i-cache thrash
AOT batch backends (prior S4)AOT batch 後端(先前 S4)3–6× slowerre-evaluates ~14.7k nodes/hc when hundreds change幾百個變動時卻重算 ~14.7k 節點/hc
Per-node event-driven booleanPer-node 事件驅動布林break-even / wrongbreaks conducting-group atomicity; transient order is load-bearing破壞導通群原子性;暫態順序是承重的
Macro-event (cone granularity)Macro-event(cone 粒度)1.1× compressioncones mean 2.0 nodes ≈ golden's 1.4-node group; activity ~1:1 with cones (§5)cone 平均 2.0 節點 ≈ golden 的 1.4 節點群;活動與 cone ~1:1(§5)
GPU, single instance (prior S4)GPU,單實例(先前 S4)10.7× slowerone workgroup ≈ 1/76 of the GPU; the rest idle一個 workgroup ≈ GPU 的 1/76,其餘閒置

The instruction-cache inversion. The most diagnostic data point: the compiled oblivious sweep is 2× slower than the interpreter. The interpreter is a tiny code loop over data (L1i fine; the cost is scattered data gathers). The compiled version unrolls ~6,000 nodes into ~700 KB of machine code streamed 6.5× per half-cycle — far exceeding the 32–48 KB L1 instruction cache, so the CPU front-end stalls re-fetching code while the ALUs idle. "Compile the oblivious sweep" is self-defeating precisely because oblivious means touching the whole circuit every cycle, so the code footprint scales with the whole circuit.

指令快取的反轉。最具診斷性的一點:編譯版的 oblivious sweep 比直譯器慢 2×。直譯器是「小程式碼迴圈跑資料」(L1i 沒事;成本在散落的資料 gather)。編譯版把 ~6,000 節點展開成 ~700 KB 機器碼、每半週期掃 6.5 次 —— 遠超 32–48 KB L1 指令快取,CPU 前端不斷重抓指令、ALU 空轉。「把 oblivious sweep 編譯掉」會自我擊敗,正是因為 oblivious 意味著每週期碰整顆電路,程式碼足跡因此隨整顆電路增長。

5.Analysis — the sparsity–density crossover, and why no static DAG exists分析 —— 稀疏–稠密交叉點,以及為何沒有靜態 DAG

Every circuit simulator sits on one side of a crossover. Per half-cycle the cost is roughly:

每個電路模擬器都坐落在一個交叉點的某一側。每半週期的成本大致是:

Cost = MIN( Events × PerEvent , Nodes × Iterations × PerNode )

Sparse (golden)稀疏(golden)Dense (oblivious)稠密(oblivious)
work units工作單位~90–600 events事件~6,000 nodes節點
iterations迭代1~6.5
cost / unit每單位成本~80 cyccyc~2.5 cyc (ideal)cyc(理想)
CPU cycles / hc~24,000–48,000~97,500

Even an idealized dense evaluation (2.5 cyc/node, perfect L1d) costs ~97,500 cycles — already ~2× golden — because of the 6.5 iterations. To win, dense must drop to 1 iteration, which requires a topologically-ordered, acyclic dataflow graph. And here is the wall:

即使理想化的稠密評估(2.5 cyc/節點、完美 L1d)也要 ~97,500 cycle —— 已是 golden 的 ~2× —— 就因為那 6.5 次迭代。要贏,稠密必須降到 1 次迭代,而那需要一個拓樸有序、無環的資料流圖。牆就在這:

You cannot auto-derive a static DAG from a flat, two-phase, bidirectional pass-transistor netlist. NMOS pass gates are bidirectional (a gate's "input" and "output" sides are decided dynamically by which side drives), and two-phase transparent latches make the transparent paths cyclic. Without RTL or clock-phase metadata — which we are forbidden to assume — static extraction must treat these as SCCs and cannot sever the cycles. The model is structurally condemned to ~6.5 iterations, so dense execution always loses.

你無法從「扁平、兩相、雙向 pass-transistor」的網表自動導出靜態 DAG。NMOS pass gate 是雙向的(誰是「輸入」誰是「輸出」由哪一側在驅動動態決定),兩相透明 latch 又讓透明路徑成環。在沒有 RTL 或時脈相位 metadata(我們又被禁止假設)的情況下,靜態抽取只能把它們當成 SCC、切不開環。模型在結構上被判定卡在 ~6.5 次迭代,所以稠密執行永遠輸。

Why "fewer, coarser events" also fails (the macro-event measurement)為何「更少、更粗的事件」也失敗(macro-event 量測)

If dense loses, the other route is to keep event-driven sparsity but make each event a combinational cone (latch-bounded), recovering ~90% fewer, atomic events. We measured the compression directly: condense the combinational nodes into cones (cut at bus/latch boundaries) and count distinct cones touched per half-cycle.

若稠密會輸,另一條路是「保留事件驅動的稀疏,但把每個事件變成一個(latch 邊界界定的)組合 cone」,以拿回少 ~90%、且原子的事件。我們直接量了壓縮比:把組合節點凝縮成 cone(在 bus/latch 邊界切開),數每半週期觸動的相異 cone 數。

cone definitioncone 定義result結果
channel (far-end) connectivity通道(far-end)連通2,737 cones, mean 2.0 nodes; 90 node-events → 81 macro-units = 1.1×個 cone,平均 2.0 節點;90 node-events → 81 macro-units = 1.1×
gate + channel connectivitygate + 通道連通collapses into one giant 5,349-node cone ⇒ firing it = full oblivious again塌成一個 5,349 節點的巨型 cone ⇒ 觸發它 = 又變回完整 oblivious

The deepest finding. A combinational cone averages 2.0 nodes — essentially identical to golden's mean conducting group of 1.4 nodes. The golden engine already operates at the netlist's natural minimum granularity. There is no coarser, reusable structural unit to extract; the activity is diffuse (~1:1 node-to-cone). The macro-event scheme would re-do golden's work plus abstraction overhead — break-even at best.

最深的發現。一個組合 cone 平均 2.0 節點 —— 幾乎等同於 golden 平均 1.4 節點的導通群。golden 引擎早已運作在網表的天然最小粒度上。沒有更粗、可重用的結構單位可抽;活動是擴散的(節點對 cone ~1:1)。macro-event 方案只會把 golden 已做的事重做一遍、再加抽象開銷 —— 最好的情況也只是打平。

The per-event floor (~80 cycles) and why temporal tricks failper-event 地板(~80 cycle),以及時間性技巧為何失敗

Golden spends ~80 cycles/event: dequeue, load node state, branch on change, chase 1–3 adjacency edges (dependent loads), index a LUT, write back, enqueue downstream. With diffuse hot nodes there is no small hot-set to keep in L1 and the branch predictor keeps guessing on tiny loops; the bottleneck is memory-dependency latency, not arithmetic. ~80 cycles is the pointer-chasing floor on a 14.7k-node irregular graph. Finally, temporal memoization (cache & replay recurring state transitions) is a mirage here: the APU has free-running timers and a 15-bit LFSR with mutually-coprime periods, so the chip's global micro-state virtually never repeats — and isolating the "CPU part" to hash would itself require auto-detecting a functional boundary we cannot derive.

golden 每事件花 ~80 cycle:dequeue、載入節點狀態、依變動分支、追 1–3 條鄰接邊(依賴載入)、查 LUT、寫回、enqueue 下游。熱節點擴散 ⇒ 沒有小熱集能留在 L1,分支器又一直在猜小迴圈;瓶頸是記憶體依賴延遲,不是算術。對一張 14.7k 節點的不規則圖,~80 cycle 就是指標追逐的地板。最後,時間記憶化(快取並重播重複的狀態轉移)在這裡是海市蜃樓:APU 有自由跑的計時器與一個週期互質的 15-bit LFSR,使晶片全域微狀態幾乎永不重複 —— 而要「只 hash CPU 部分」又需要自動偵測一個我們導不出的功能邊界。

6.The general lesson — a chip-characteristic → strategy map通則 —— 一張晶片特性 → 策略對照表

The crucial point: the strategies are not bad — they are mismatched to this netlist. Each one is the right tool for a circuit with the opposite property. The negative result is therefore reusable: it tells you, for a new chip, which strategy to reach for and which to avoid.

關鍵在於:這些策略並不爛 —— 它們只是跟這份網表不匹配。每一種都是「具相反性質的電路」的正確工具。因此這個負結果是可重用的:對一顆新晶片,它告訴你該拿哪種策略、該避開哪種。

Strategy策略Wins when the chip is…在以下晶片上會贏…Why it dies on the NES 2A03為何死於 NES 2A03
Event-driven switch-level事件驅動開關級sparse activity (few nodes change/cycle) — the winner here稀疏活動(每週期少數節點變)—— 此處的贏家— (it wins)—(它贏)
Dense / oblivious / static-DAG稠密 / oblivious / 靜態 DAGsynchronous CMOS from RTL, unidirectional logic, clean clock domains; or dense-active chips where most nodes toggle every cycle (systolic arrays, dataflow accelerators) — event-driven has no sparsity to exploit thereRTL 合成的同步 CMOS、單向邏輯、乾淨時脈域; 稠密活躍的晶片(大多數節點每週期都翻轉,如 systolic array、dataflow 加速器)—— 那裡事件驅動沒有稀疏可吃bidirectional pass-transistor + two-phase ⇒ no auto static DAG ⇒ stuck at 6.5 iters; and only ~1% active雙向 pass-transistor + 兩相 ⇒ 無法自動靜態定序 ⇒ 卡 6.5 迭代;且只 ~1% 活躍
Macro-event / coneMacro-event / conecoarse structure: deep combinational cones between sparse registers (big ALUs/datapaths, few state elements)粗粒度結構:稀疏暫存器之間有深組合 cone(大 ALU/datapath、少量狀態元件)cones ≈ 2 nodes ≈ conducting group; no coarse structure to exploit (1.1×)cone ≈ 2 節點 ≈ 導通群;沒有粗結構可吃(1.1×)
Temporal / trace memoization時間 / trace 記憶化bounded recurring state: controllers/state machines that loop, no free-running counters有界且重複的狀態:會迴圈的控制器/狀態機,沒有自由跑的計數器APU coprime timers + LFSR ⇒ global state never repeatsAPU 互質計時器 + LFSR ⇒ 全域狀態永不重複
GPU bit-slicingGPU bit-slicingmany independent instances (throughput, not latency) — oblivious is the right model there (no divergence)大量獨立實例(吞吐,非延遲)—— oblivious 在那裡是對的模型(不發散)single instance uses 1/76 of the GPU; this study is latency-only單實例只用 1/76 的 GPU;本研究只看延遲

The 2A03 is close to the worst case for single-core CPU acceleration of any abstraction: bidirectional pass-transistor (no DAG) + two-phase dynamic (transient-order-sensitive) + extreme sparsity (~1% active) + small (cache-resident, so the event-driven overhead is the whole cost, nothing to amortize) + free-running coprime counters (no temporal repetition). Each property independently disables one strategy; the common root is the first two.

對「任何抽象的單核 CPU 加速」而言,2A03 接近最壞情況:雙向 pass-transistor(無 DAG)+ 兩相動態(暫態順序敏感)+ 極稀疏(~1% 活躍)+ 小(整個進快取,所以事件驅動的開銷就是全部成本、沒東西可攤)+ 自由跑的互質計數器(無時間重複)。每個性質各自關掉一種策略;共同的根源是前兩者。

7.External-model consultations (Gemini 3.1-pro)外部模型諮詢(Gemini 3.1-pro)

We stress-tested the analysis against an external model, instructed each time to be brutally honest and to prefer a correct "no" over a hopeful "maybe." Its position evolved as we fed back measurements — a useful record of how an optimistic prior collapses under data.

我們把分析丟給一個外部模型壓力測試,每次都明確要求它「殘酷誠實」、寧可給正確的「不行」也不要給樂觀的「也許」。隨著我們回饋實測,它的立場逐步演變 —— 這是一份「樂觀先驗如何在數據下崩解」的有用紀錄。

Consult諮詢Its proposal / verdict其提案 / 判決Our empirical outcome我們的實證結果
1 — the bottleneck1 — 瓶頸Macro-event via SCC condensation; cone = atomic unit; est. 5–15×用 SCC 凝縮做 macro-event;cone = 原子單位;估 5–15×measured cone compression = 1.1× (§5) ⇒ falsified實測 cone 壓縮 = 1.1×(§5)⇒ 被證偽
2 — the i-cache angle2 — i-cache 角度confirmed compiled-slower = i-cache thrash; codegen/LLVM is self-defeating; use a data-driven interpreter; revised to 4–6×確認編譯較慢 = i-cache thrash;codegen/LLVM 會自我擊敗;改用 data-driven 直譯器;下修為 4–6×consistent with the 2× compiled-vs-interp inversion we measured與我們量到的「編譯比直譯慢 2×」反轉一致
3 — last resort3 — 最後手段"a genuine, structural dead end." Sparsity-vs-density crossover; no auto static DAG; per-event 80-cyc floor; temporal memoization a mirage (APU)「這是真正的、結構性的死路。」稀疏 vs 稠密交叉點;無法自動靜態定序;per-event 80-cyc 地板;時間記憶化是海市蜃樓(APU)matches every independent measurement in this study與本研究每一項獨立量測相符

"You are not failing to see a hidden trick; you have fully mapped the boundaries of the problem. … This is likely the fastest single-instance switch-level simulator possible for the 2A03 on a von Neumann architecture." — Gemini 3.1-pro, third consult

「你不是沒看到隱藏的技巧,你已經完整測繪了這個問題的邊界。……這大概是 von Neumann 架構上、單實例 2A03 switch-level 模擬器的最快可能形態。」 — Gemini 3.1-pro,第三次諮詢

We treat this as corroboration, not authority: the model twice proposed an approach (macro-event) that our own measurement then falsified, which is exactly why the measurement, not the model, is the arbiter. The value of the consults was sharper framing (the crossover equation; the APU-LFSR argument), not the conclusion.

我們把這當成佐證,而非權威:該模型兩度提出一個方法(macro-event),隨後被我們自己的量測證偽 —— 這正是為何裁決者是量測、不是模型。諮詢的價值在於更銳利的框架(交叉點方程式、APU-LFSR 論證),而非結論本身。

8.Conclusions & contributions結論與貢獻

Main conclusion. For a flat, bidirectional, two-phase pass-transistor netlist of this scale, no fully-automatic abstraction beats a well-tuned event-driven switch-level engine on a single CPU core — because that engine already runs at the netlist's natural minimum granularity (the ~1.4-node conducting group), the per-event cost is at the memory-latency floor, and the static dataflow graph that dense evaluation needs cannot be auto-derived without clock-phase/RTL hints. Real-time (~516× faster) is unreachable by this route.

主結論。對這種規模、扁平、雙向、兩相的 pass-transistor 網表,沒有任何全自動抽象能在單核 CPU 上贏過調校良好的事件驅動開關級引擎 —— 因為該引擎已運作在網表的天然最小粒度(~1.4 節點的導通群)、per-event 成本在記憶體延遲地板、而稠密評估所需的靜態資料流圖在沒有時脈/RTL 提示下無法自動導出。即時(快 ~516×)在這條路上不可達。

What the project nonetheless contributes:

儘管如此,本專案的貢獻:

A note on scope. This study is about single-instance latency on a CPU — that was the goal. A separate, orthogonal question (never a goal here, and not pursued) is whether oblivious evaluation pays off for many-instance bit-sliced throughput on a GPU — a different metric (aggregate throughput, not latency) where oblivious is actually the natural model. We mention it only to delimit what this negative result does and does not cover: it says nothing about that throughput regime.

關於範圍的說明。本研究談的是 CPU 上的單實例延遲 —— 那才是目標。另一個正交、不同的問題(在此從來不是目標、也未進行)是:oblivious 評估在 GPU 的多實例 bit-sliced 吞吐上是否划算 —— 那是不同的指標(總吞吐,非延遲),在那裡 oblivious 反而是自然的模型。提它只是為了界定這個負結果涵蓋與不涵蓋的範圍:它對那個吞吐情境不作任何宣稱。

9.Reproducibility & threats to validity重現性與效度威脅

Reproducibility. The engine is C# (.NET 10, x64). The reducibility/speed pipeline runs as --miter / --compile / --cones on the headless build; the golden engine is the same code with the model disabled. Workloads: full_palette.nes (static), cpu.nes (all opcodes), Super Mario Bros (whole-NES, mapper 0). The golden engine is validated bit-exact against an independent reference (whole-NES checksum 0x794A43ABDF169ADA). All figures are single-machine, single-thread, Release build.

重現性。引擎為 C#(.NET 10、x64)。可約性/速度管線以 --miter / --compile / --cones 在 headless build 上執行;golden 引擎是同一份程式碼關掉模型。Workload:full_palette.nes(靜態)、cpu.nes(全指令)、超級瑪利歐(整機,mapper 0)。golden 引擎對獨立參考驗過 bit-exact(整機 checksum 0x794A43ABDF169ADA)。所有數字皆為單機、單執行緒、Release build。

Threats to validity. (i) Workloads are limited to three ROMs; activity patterns of other games may differ, though the radius-1 cleanliness was stable across all three (95.8–97.5%). (ii) Reducibility is empirical (oracle-validated over finite windows), not proven; ~0.5% of validate-window evaluations hit unlearned input combinations (held the prior value, which happened to be correct in our runs). (iii) The Amdahl ceilings are upper bounds assuming free covered work; §4 shows the realized speed is the opposite. (iv) The cone analysis depends on the cone definition; we report two definitions, both pointing the same way. (v) Per-event cycle figures are derived from measured throughput and a 4 GHz assumption; the relative conclusions do not depend on the exact clock.

效度威脅。(i) Workload 僅限三顆 ROM;其他遊戲的活動模式可能不同,不過 radius-1 乾淨度在三者間穩定(95.8–97.5%)。(ii) 可約性是實證的(在有限窗上經 oracle 驗證),非證明;驗證窗中約 0.5% 的評估遇到未學過的輸入組合(沿用前值,在我們的執行中恰好正確)。(iii) Amdahl 上限是「假設涵蓋工作免費」的上界;§4 顯示實際速度恰相反。(iv) cone 分析取決於 cone 定義;我們報告兩種定義,皆指向同一結論。(v) per-event cycle 數由實測吞吐與 4 GHz 假設推導;相對結論不依賴確切時脈。

Companion pages: why IR / codegen hit the wall (the bit-exact ceiling) · why the "obvious" speedups don't work · the research landscape.

延伸頁面:為什麼 IR / codegen 撞牆(bit-exact 天花板) · 為什麼「直覺上該更快」的點子沒用 · 研究現況