AprVisual — why IR / codegen still hit the wall (with data)

Attempt 1 — the IR interpreter: correct, but −2.5%嘗試一 —— IR 直譯器:正確,但 −2.5%

We built a CPU-first, event-driven, hybrid IR: extract the cleanest pure-logic nodes into per-node truth-table LUTs, dispatch them through the IR, leave everything else on the switch-level path. It was bit-exact (same whole-NES checksum 0x794A43ABDF169ADA) and covered 3,390 nodes (23% of live). Then we measured it, interleaved-paired:

我們做了一個 CPU-first、事件驅動的混合 IR:把最乾淨的純邏輯節點抽成「每節點一張真值表 LUT」、走 IR 解析,其餘留在 switch-level 路徑。它位元完全相同(整機 checksum 0x794A43ABDF169ADA),涵蓋 3,390 個節點(23% live)。然後 interleaved-paired 量它:

	hc/s (median)(中位)	paired result配對結果
S1 (no IR)S1(無 IR)	81,738	—
S2 with IRS2 開 IR	80,026	−2.54%, 0 of 9 rounds won9 輪 0 勝

Why slower? The nodes we IR-ized are exactly the ones the switch-level engine already resolves in O(1) (its singleton fast-path — a few instructions + one LUT read). The IR's truth-table eval (read the gate states, pack an index, chase a gate list, read a table) is heavier than that. You wrapped the already-minimal in a layer. And the nodes the engine does slowly can't be IR-ized (next section). Net: a clean, correct IR — that's pure overhead here.

為何更慢?被 IR 化的節點,正是 switch-level 引擎本來就用 O(1) 秒殺的(它的 singleton fast-path —— 幾條指令 + 一次 LUT)。IR 的真值表評估(讀 gate 狀態、打包 index、追 gate list、查表)還更重。你把「已經最精簡的」再包一層。而引擎做得慢的那些,又抽不成 IR(見下節)。淨值:一個乾淨、正確的 IR —— 在這裡純粹是開銷。

Attempt 2 — codegen: killed by the profile, before a line was written嘗試二 —— codegen:還沒寫就被 profile 數據判死

The interpreter is capped, so the textbook escalation is macro-block codegen: collapse multi-node chains into single compiled methods, queue at the block boundary, evaluate the block obliviously inside (straight-line, L1-i-cache-resident). For that to win, the work must concentrate — a few hot blocks that, once compiled, cover most of the time. So before writing a codegen backend, we profiled the whole NES (CPU 2A03 + PPU 2C02) to see if that concentration exists.

直譯器封頂,於是教科書的升級是 macro-block codegen:把多節點鏈 collapse 成單一編譯方法、在 block 邊界排隊、block 內部 oblivious 評估(直線碼、駐 L1 i-cache)。但它要贏,工作必須集中 —— 少數熱 block 編譯後就能涵蓋大部分時間。所以在寫 codegen 後端之前,我們先 profile 整顆 NES(CPU 2A03 + PPU 2C02),看那個集中度存不存在。

Whole-NES profile整機 profile — full_palette, 300,000 hc, 181.4M recalcs

Metric指標	Value數值	Meaning意義
total recalcs總 recalc	181,429,313	~604.6 per half-cycle每半週期 ~604.6 次
fast-path (singleton, O(1))fast-path(singleton,O(1))	126,072,032 (69.5%)	S1 already optimal — untouchableS1 已最優 —— 碰不得
group-BFS (multi-node)group-BFS(多節點)	55,357,281 (30.5%)	the only possible codegen target唯一可能的 codegen 目標
avg BFS group sizeBFS 平均群大小	2.25 nodes節點	tiny — nothing to unroll極小 —— 沒東西可展開

The killer — concentration of the 30.5% BFS work (across 4,752 distinct BFS-touched nodes):

致命的 —— 那 30.5% BFS 工作的集中度(分散在 4,752 個被 BFS 碰到的節點):

hottest…最熱的…	share of all BFS work佔全部 BFS 工作
top 10 nodestop 10 節點	1.2%
top 50 nodestop 50 節點	6.0%
top 200 nodestop 200 節點	21.6%

By subsystem: PPU ~36%, CPU ~12%, the rest unnamed internal/board nodes. The per-node distribution is nearly flat (the hottest node is 0.14%; the next thirty are all ~0.12%).

依子系統:PPU ~36%、CPU ~12%、其餘是無名內部/板級節點。每節點分布幾乎是平的(最熱的 0.14%,後面三十個都 ~0.12%)。

That table is the verdict. There is no hot concentration to exploit. To cover even half the BFS work, codegen would have to compile thousands of nodes → the compiled-code working set blows the 32 KB L1 instruction cache (the exact "giant function" death, just spread out). And the groups average 2.25 nodes — unrolling a 2-node BFS into straight-line code saves almost nothing. Both prerequisites for a codegen win are absent. The macro-block path was dead before the first line.

這張表就是判決。沒有熱點集中可以打。要覆蓋哪怕一半的 BFS 工作,codegen 得編譯上千個節點 → 編譯碼工作集撐爆 32 KB L1 指令快取(就是「巨型函數」的死法,只是攤開)。而群平均才 2.25 節點 —— 把一個 2 節點的 BFS 展開成直線碼,幾乎省不到東西。codegen 要贏的兩個前提都不存在。macro-block 這條路,第一行還沒寫就死了。

The per-event floor: ~82 CPU cycles, no fat leftper-event 地板:每次 ~82 個 CPU cycle,沒肥肉了

Put the throughput in cycles. At 80,000 hc/s and ~604.6 recalcs/hc, the engine processes ~48.4 million recalcs/second. On a ~4 GHz core that is ~82 CPU cycles per recalc. In 82 cycles the code dequeues an event, reads the node, checks the fast-path, and 30.5% of the time walks a 2.25-node pointer chase, builds an 8-bit flag, indexes a LUT, writes back, and conditionally enqueues downstream with dedup bits.

把吞吐換算成 cycle。80,000 hc/s × ~604.6 recalc/hc = 每秒 ~4,840 萬次 recalc。在 ~4 GHz 核心上 = 每次 recalc ~82 個 CPU cycle。82 cycle 內要:dequeue、讀節點、查 fast-path、(30.5% 時)走 2.25 節點的指標追逐、建 8-bit flag、查 LUT、寫回、帶 dedup 位元條件 enqueue 下游。

An L1 hit is ~4 cycles; an L2 hit ~12–20; a branch misprediction ~15. Because the hot nodes are uniformly diffuse, there is no small hot-set to keep warm in L1, and the branch predictor keeps guessing on tiny 2-node loops. The bottleneck is memory-dependency latency and pipeline hazards, not arithmetic — and you cannot trim cycles that are spent waiting on a dependent load. There is no fat left to cut.

L1 命中 ~4 cycle;L2 ~12–20;分支誤判 ~15。因為熱節點均勻擴散,沒有小熱集合能留在 L1 暖著,分支器又一直在猜 2 節點的小迴圈。瓶頸是記憶體依賴延遲 + pipeline hazard,不是算術 —— 而你削不掉「在等一個依賴載入」的 cycle。沒有肥肉可削了。

Can't make events cheaper — can't make fewer of them either每個 event 不能更便宜 —— 數量也減不了

If per-event cost is at the floor, the other axis is fewer events — skip recalcs that "won't change anything". For most discrete-event simulators you do this with a levelized (ranked) event queue: topologically sort the DAG and process in rank order so transient glitches never propagate and re-trigger evaluations. Two things make that impossible here, both fundamental:

如果每個 event 已到地板,另一條軸是減少 event 數 —— 跳過「不會改變什麼」的 recalc。多數離散事件模擬器用levelized(分級)事件佇列做這件事:把 DAG 拓樸排序、按 rank 處理,讓瞬態 glitch 不會傳播再觸發重算。兩個原因讓它在這裡不可能,而且都是根本性的:

You can't rank the graph. ~94% of nodes are in one bidirectional pass-transistor SCC — everything clumps into the same rank. There is no topological order.
圖排不了序。~94% 的節點在同一個雙向 pass-transistor SCC —— 全擠進同一個 rank。沒有拓樸序。
The transients are load-bearing. In Bryant's switch-level model, the floating-capacitance tie-break depends on the exact sequence of charge-sharing events. The intermediate transient states aren't redundant — skipping a transient that momentarily pulls a floating node alters the tie-break math for the rest of that half-cycle. "Skipping the redundant" isn't redundant; it's wrong.
瞬態是承重的。在 Bryant 的 switch-level 模型裡,floating 電容 tie-break 取決於 charge-sharing 事件的精確順序。中間瞬態不是冗餘 —— 跳過某個「瞬間拉動 floating 節點」的瞬態,就改了該半週期後續的 tie-break 數學。「跳過冗餘」不是冗餘,是錯的。

And the activity rate is already ~4% (604 of 14,723 nodes per half-cycle) — the bare dynamic wavefront. Any static prune to shave it costs more analysis than the ~82 cycles it would save.

而且活動率已經 ~4%(每半週期 604 / 14,723 個節點)—— 就是最小動態波前。任何想再剪它的靜態分析,成本都超過它能省的那 ~82 cycle。

The verdict, and the only escapes判決,以及唯二的逃生口

~136K hc/s (at boost, cool machine) is the practical ceiling for a correct, dynamically-interpreted Bryant switch-level simulation of this netlist on one CPU core — up from the ~80K, ~100K, ~120K and ~127K we successively called the ceiling; memory-layout, dispatch, the same-state turn-on prune (+11.85%, bit-exact), the 2026-06 range-prune + self-captured key (a dead end re-aimed) and the B1 pair path lifted it, but the order of magnitude is fixed. Per-event cost is at the memory-latency floor; event count can be trimmed at the margins — the same-state turn-on prune cut ~12% of redundant re-evaluations on the statically-driven (pull-up) subset — but not by an order of magnitude, since the dynamic/storage majority can't be pruned bit-exactly; and codegen has no concentration to exploit. There is no remaining order-of-magnitude single-core lever — not without changing the abstraction.

Hardware-counter confirmation (2026-06-12, ETW PMC across a 9-stage same-day ablation of the engine's own history): L1 instruction-cache misses sit at 0.17–0.30 per 1000 instructions at every stage — the fully-inlined settle loop fits the 32 KB L1i, which is exactly where every compiled/AOT variant paid its 45–84× i-cache penalty. The class-major relayout cut L1 data-cache misses from 27.3 to 13.2 MPKI (−67% absolute) yet moved wall-clock only single digits, and the self-captured key added its +5–6% with no further miss reduction — counter-level proof that the bound is dependent-load latency, not miss count. Meanwhile the prune ladder deleted 33% of all retired instructions (155.6B → 104.4B per 1M half-cycles) at IPC ≈ 2.1–2.4. Raw data: MD/note/bench-data/ in the repo.

硬體計數器佐證(2026-06-12,以 ETW PMC 對引擎自身歷史做同日九階段消融):L1 指令快取失誤全程只有每千指令 0.17~0.30 次 —— 全內聯的 settle 迴圈塞得進 32 KB L1i,而這正是所有編譯/AOT 變體付出 45~84 倍 i-cache 代價的地方。類別為主鍵重排把 L1 資料快取失誤從 27.3 砍到 13.2 MPKI(絕對量 −67%),牆鐘卻只動個位數;自我捕捉鍵又加 +5~6% 而 miss 完全沒再降 —— 「瓶頸是相依載入延遲、不是 miss 數量」的計數器級證明。同時剪枝階梯刪掉了全部已執行指令的 33%(每百萬半週期 155.6B → 104.4B),IPC 維持 ≈2.1~2.4。原始數據:repo 的 MD/note/bench-data/。

對「單核 CPU、正確、動態直譯的 Bryant switch-level」模擬這顆網表,~136K hc/s(涼機 boost)是實務天花板 —— 我們先後認定過 ~80K、~100K、~120K、~127K,又一再被推高;記憶體佈局、dispatch、同態 turn-on 剪枝(+11.85%、bit-exact)、2026-06 的範圍剪枝 + 自我捕捉鍵(一條被重新瞄準的死路)與 B1 成對路徑把它抬上來,但數量級不變。每個 event 的成本在記憶體延遲地板;event 數量可以在邊際上削減 —— 同態 turn-on 剪枝在「被靜態驅動(有上拉)」的子集上砍掉約 12% 的多餘重算 —— 但無法砍到數量級,因為動態/儲存的大宗無法 bit-exact 地剪;codegen 又沒有集中熱點可吃。沒有剩下的數量級單核槓桿了 —— 除非換掉抽象。

There are exactly two ways past the wall, and both change the problem:

越過這道牆只有兩條路,兩條都改變了問題本身:

Escape逃生口	What it gives up放棄什麼	Status here在這裡
Logic extraction邏輯抽取 extract the boolean DAG for the ~90% that is really just gates & flip-flops; keep switch-level only for the ~10% genuinely-analog把真正只是閘與正反器的 ~90% 抽成 boolean DAG;只對 ~10% 真類比的保留 switch-level	per-node bit-exactness (internal switch-level nodes no longer reproduced) — but observable behavior (the rendered frame, CPU/PPU registers) can survive逐節點 bit-exact(內部 switch-level 節點不再重現)—— 但可觀測行為(畫面、CPU/PPU 暫存器)可以保住	a different, interesting project (auto silicon→logic→fast sim); the extracted logic is feed-forward, so it can be levelized — the actual way past the ceiling, at the cost of precision另一個有趣的專案(自動 silicon→logic→快速模擬);抽出的邏輯是前饋的、能 levelize —— 這才是越過天花板的真路,代價是精度
FPGA / spatialFPGA / 空間平行	nothing (stays 100% bit-exact)什麼都不放棄(維持 100% bit-exact)	off the table — the goal is one CPU core, not silicon離題 —— 目標是單核 CPU,不是矽

Does abstracting to logic gates necessarily break bit-exact? The per-node checksum: yes, necessarily — the internal switch-level mids, transients and tie-break order aren't reproduced. Observable behavior: not necessarily — for the ~90% that is genuine digital logic, gates (plus latches for the state elements) produce the same outputs; the screenshot can stay correct. The risk is the ~10% genuinely-analog (dynamic storage, charge-sharing, bus contention), which a naive boolean abstraction gets wrong unless detected and kept switch-level (or modeled as state). Detecting that automatically is the whole game of the next direction.

往邏輯閘抽象化,勢必破壞 bit-exact 嗎?逐節點 checksum:是,勢必 —— 內部 switch-level 的 mid、瞬態、tie-break 順序不會被重現。可觀測行為:不一定 —— 對 ~90% 真正是數位邏輯的部分,閘(加上 state element 的 latch)會算出同樣的輸出,截圖可以維持正確。風險在 ~10% 真類比的(動態儲存、charge-sharing、bus 競爭),純 boolean 抽象會算錯,除非自動偵測出來、保留 switch-level(或建模成有狀態)。把那件事自動做對,正是下一個方向的全部難點。

Where the numbers come from數字的出處

Every figure here is measured on this engine (full_palette, 300k half-cycles), not estimated. The IR interpreter was validated bit-exact (whole-NES checksum) before timing; the profile is from an instrumented run that counts every recalc and every group-BFS visit. The companion notes: why the "obvious" speedups don't work and the research landscape (why this regime has no 20-year breakthrough). The cycle-budget framing (~82 cycles/recalc) and the "transients are load-bearing" point were stress-tested against an external model asked, explicitly, to be brutally honest — its verdict was the same: no remaining lever.

這裡每個數字都是這個引擎上的實測(full_palette、30 萬半週期),不是估計。IR 直譯器在計時前已驗 bit-exact(整機 checksum);profile 來自一個會數每次 recalc、每次 group-BFS visit 的儀器化執行。延伸閱讀:為什麼「直覺上該更快」的點子都沒用、研究現況(為何這個 regime 近 20 年沒突破)。cycle 預算(每 recalc ~82)與「瞬態承重」的論點,也丟給一個被明確要求「殘酷誠實」的外部模型壓力測試 —— 它的結論一樣:沒有剩下的槓桿。