Attempt 1 — the IR interpreter: correct, but −2.5%嘗試一 —— IR 直譯器:正確,但 −2.5%
We built a CPU-first, event-driven, hybrid IR: extract the cleanest pure-logic nodes into per-node truth-table LUTs, dispatch them through the IR, leave everything else on the switch-level path. It was bit-exact (same whole-NES checksum 0x794A43ABDF169ADA) and covered 3,390 nodes (23% of live). Then we measured it, interleaved-paired:
我們做了一個 CPU-first、事件驅動的混合 IR:把最乾淨的純邏輯節點抽成「每節點一張真值表 LUT」、走 IR 解析,其餘留在 switch-level 路徑。它位元完全相同(整機 checksum 0x794A43ABDF169ADA),涵蓋 3,390 個節點(23% live)。然後 interleaved-paired 量它:
| hc/s (median)(中位) | paired result配對結果 | |
|---|---|---|
| S1 (no IR)S1(無 IR) | 81,738 | — |
| S2 with IRS2 開 IR | 80,026 | −2.54%, 0 of 9 rounds won9 輪 0 勝 |
Why slower? The nodes we IR-ized are exactly the ones the switch-level engine already resolves in O(1) (its singleton fast-path — a few instructions + one LUT read). The IR's truth-table eval (read the gate states, pack an index, chase a gate list, read a table) is heavier than that. You wrapped the already-minimal in a layer. And the nodes the engine does slowly can't be IR-ized (next section). Net: a clean, correct IR — that's pure overhead here.
為何更慢?被 IR 化的節點,正是 switch-level 引擎本來就用 O(1) 秒殺的(它的 singleton fast-path —— 幾條指令 + 一次 LUT)。IR 的真值表評估(讀 gate 狀態、打包 index、追 gate list、查表)還更重。你把「已經最精簡的」再包一層。而引擎做得慢的那些,又抽不成 IR(見下節)。淨值:一個乾淨、正確的 IR —— 在這裡純粹是開銷。
Attempt 2 — codegen: killed by the profile, before a line was written嘗試二 —— codegen:還沒寫就被 profile 數據判死
The interpreter is capped, so the textbook escalation is macro-block codegen: collapse multi-node chains into single compiled methods, queue at the block boundary, evaluate the block obliviously inside (straight-line, L1-i-cache-resident). For that to win, the work must concentrate — a few hot blocks that, once compiled, cover most of the time. So before writing a codegen backend, we profiled the whole NES (CPU 2A03 + PPU 2C02) to see if that concentration exists.
直譯器封頂,於是教科書的升級是 macro-block codegen:把多節點鏈 collapse 成單一編譯方法、在 block 邊界排隊、block 內部 oblivious 評估(直線碼、駐 L1 i-cache)。但它要贏,工作必須集中 —— 少數熱 block 編譯後就能涵蓋大部分時間。所以在寫 codegen 後端之前,我們先 profile 整顆 NES(CPU 2A03 + PPU 2C02),看那個集中度存不存在。
Whole-NES profile整機 profile — full_palette, 300,000 hc, 181.4M recalcs
| Metric指標 | Value數值 | Meaning意義 |
|---|---|---|
| total recalcs總 recalc | 181,429,313 | ~604.6 per half-cycle每半週期 ~604.6 次 |
| fast-path (singleton, O(1))fast-path(singleton,O(1)) | 126,072,032 (69.5%) | S1 already optimal — untouchableS1 已最優 —— 碰不得 |
| group-BFS (multi-node)group-BFS(多節點) | 55,357,281 (30.5%) | the only possible codegen target唯一可能的 codegen 目標 |
| avg BFS group sizeBFS 平均群大小 | 2.25 nodes節點 | tiny — nothing to unroll極小 —— 沒東西可展開 |
The killer — concentration of the 30.5% BFS work (across 4,752 distinct BFS-touched nodes):
致命的 —— 那 30.5% BFS 工作的集中度(分散在 4,752 個被 BFS 碰到的節點):
| hottest…最熱的… | share of all BFS work佔全部 BFS 工作 |
|---|---|
| top 10 nodestop 10 節點 | 1.2% |
| top 50 nodestop 50 節點 | 6.0% |
| top 200 nodestop 200 節點 | 21.6% |
By subsystem: PPU ~36%, CPU ~12%, the rest unnamed internal/board nodes. The per-node distribution is nearly flat (the hottest node is 0.14%; the next thirty are all ~0.12%).
依子系統:PPU ~36%、CPU ~12%、其餘是無名內部/板級節點。每節點分布幾乎是平的(最熱的 0.14%,後面三十個都 ~0.12%)。
That table is the verdict. There is no hot concentration to exploit. To cover even half the BFS work, codegen would have to compile thousands of nodes → the compiled-code working set blows the 32 KB L1 instruction cache (the exact "giant function" death, just spread out). And the groups average 2.25 nodes — unrolling a 2-node BFS into straight-line code saves almost nothing. Both prerequisites for a codegen win are absent. The macro-block path was dead before the first line.
這張表就是判決。沒有熱點集中可以打。要覆蓋哪怕一半的 BFS 工作,codegen 得編譯上千個節點 → 編譯碼工作集撐爆 32 KB L1 指令快取(就是「巨型函數」的死法,只是攤開)。而群平均才 2.25 節點 —— 把一個 2 節點的 BFS 展開成直線碼,幾乎省不到東西。codegen 要贏的兩個前提都不存在。macro-block 這條路,第一行還沒寫就死了。
The per-event floor: ~82 CPU cycles, no fat leftper-event 地板:每次 ~82 個 CPU cycle,沒肥肉了
Put the throughput in cycles. At 80,000 hc/s and ~604.6 recalcs/hc, the engine processes ~48.4 million recalcs/second. On a ~4 GHz core that is ~82 CPU cycles per recalc. In 82 cycles the code dequeues an event, reads the node, checks the fast-path, and 30.5% of the time walks a 2.25-node pointer chase, builds an 8-bit flag, indexes a LUT, writes back, and conditionally enqueues downstream with dedup bits.
把吞吐換算成 cycle。80,000 hc/s × ~604.6 recalc/hc = 每秒 ~4,840 萬次 recalc。在 ~4 GHz 核心上 = 每次 recalc ~82 個 CPU cycle。82 cycle 內要:dequeue、讀節點、查 fast-path、(30.5% 時)走 2.25 節點的指標追逐、建 8-bit flag、查 LUT、寫回、帶 dedup 位元條件 enqueue 下游。
An L1 hit is ~4 cycles; an L2 hit ~12–20; a branch misprediction ~15. Because the hot nodes are uniformly diffuse, there is no small hot-set to keep warm in L1, and the branch predictor keeps guessing on tiny 2-node loops. The bottleneck is memory-dependency latency and pipeline hazards, not arithmetic — and you cannot trim cycles that are spent waiting on a dependent load. There is no fat left to cut.
L1 命中 ~4 cycle;L2 ~12–20;分支誤判 ~15。因為熱節點均勻擴散,沒有小熱集合能留在 L1 暖著,分支器又一直在猜 2 節點的小迴圈。瓶頸是記憶體依賴延遲 + pipeline hazard,不是算術 —— 而你削不掉「在等一個依賴載入」的 cycle。沒有肥肉可削了。
Can't make events cheaper — can't make fewer of them either每個 event 不能更便宜 —— 數量也減不了
If per-event cost is at the floor, the other axis is fewer events — skip recalcs that "won't change anything". For most discrete-event simulators you do this with a levelized (ranked) event queue: topologically sort the DAG and process in rank order so transient glitches never propagate and re-trigger evaluations. Two things make that impossible here, both fundamental:
如果每個 event 已到地板,另一條軸是減少 event 數 —— 跳過「不會改變什麼」的 recalc。多數離散事件模擬器用levelized(分級)事件佇列做這件事:把 DAG 拓樸排序、按 rank 處理,讓瞬態 glitch 不會傳播再觸發重算。兩個原因讓它在這裡不可能,而且都是根本性的:
- You can't rank the graph. ~94% of nodes are in one bidirectional pass-transistor SCC — everything clumps into the same rank. There is no topological order.
- 圖排不了序。~94% 的節點在同一個雙向 pass-transistor SCC —— 全擠進同一個 rank。沒有拓樸序。
- The transients are load-bearing. In Bryant's switch-level model, the floating-capacitance tie-break depends on the exact sequence of charge-sharing events. The intermediate transient states aren't redundant — skipping a transient that momentarily pulls a floating node alters the tie-break math for the rest of that half-cycle. "Skipping the redundant" isn't redundant; it's wrong.
- 瞬態是承重的。在 Bryant 的 switch-level 模型裡,floating 電容 tie-break 取決於 charge-sharing 事件的精確順序。中間瞬態不是冗餘 —— 跳過某個「瞬間拉動 floating 節點」的瞬態,就改了該半週期後續的 tie-break 數學。「跳過冗餘」不是冗餘,是錯的。
And the activity rate is already ~4% (604 of 14,723 nodes per half-cycle) — the bare dynamic wavefront. Any static prune to shave it costs more analysis than the ~82 cycles it would save.
而且活動率已經 ~4%(每半週期 604 / 14,723 個節點)—— 就是最小動態波前。任何想再剪它的靜態分析,成本都超過它能省的那 ~82 cycle。
The verdict, and the only escapes判決,以及唯二的逃生口
~80K hc/s is the architectural ceiling for a correct, dynamically-interpreted Bryant switch-level simulation of this netlist on one CPU core. Per-event cost is at the memory-latency floor; event count can't be reduced bit-exactly; and codegen has no concentration to exploit. There is no remaining single-core lever — not without changing the abstraction.
對「單核 CPU、正確、動態直譯的 Bryant switch-level」模擬這顆網表,~80K hc/s 就是架構天花板。每個 event 的成本在記憶體延遲地板;event 數量無法 bit-exact 地減少;codegen 又沒有集中熱點可吃。沒有剩下的單核槓桿了 —— 除非換掉抽象。
There are exactly two ways past the wall, and both change the problem:
越過這道牆只有兩條路,兩條都改變了問題本身:
| Escape逃生口 | What it gives up放棄什麼 | Status here在這裡 |
|---|---|---|
| Logic extraction邏輯抽取 extract the boolean DAG for the ~90% that is really just gates & flip-flops; keep switch-level only for the ~10% genuinely-analog把真正只是閘與正反器的 ~90% 抽成 boolean DAG;只對 ~10% 真類比的保留 switch-level |
per-node bit-exactness (internal switch-level nodes no longer reproduced) — but observable behavior (the rendered frame, CPU/PPU registers) can survive逐節點 bit-exact(內部 switch-level 節點不再重現)—— 但可觀測行為(畫面、CPU/PPU 暫存器)可以保住 | a different, interesting project (auto silicon→logic→fast sim); the extracted logic is feed-forward, so it can be levelized — the actual way past the ceiling, at the cost of precision另一個有趣的專案(自動 silicon→logic→快速模擬);抽出的邏輯是前饋的、能 levelize —— 這才是越過天花板的真路,代價是精度 |
| FPGA / spatialFPGA / 空間平行 | nothing (stays 100% bit-exact)什麼都不放棄(維持 100% bit-exact) | off the table — the goal is one CPU core, not silicon離題 —— 目標是單核 CPU,不是矽 |
Does abstracting to logic gates necessarily break bit-exact? The per-node checksum: yes, necessarily — the internal switch-level mids, transients and tie-break order aren't reproduced. Observable behavior: not necessarily — for the ~90% that is genuine digital logic, gates (plus latches for the state elements) produce the same outputs; the screenshot can stay correct. The risk is the ~10% genuinely-analog (dynamic storage, charge-sharing, bus contention), which a naive boolean abstraction gets wrong unless detected and kept switch-level (or modeled as state). Detecting that automatically is the whole game of the next direction.
往邏輯閘抽象化,勢必破壞 bit-exact 嗎?逐節點 checksum:是,勢必 —— 內部 switch-level 的 mid、瞬態、tie-break 順序不會被重現。可觀測行為:不一定 —— 對 ~90% 真正是數位邏輯的部分,閘(加上 state element 的 latch)會算出同樣的輸出,截圖可以維持正確。風險在 ~10% 真類比的(動態儲存、charge-sharing、bus 競爭),純 boolean 抽象會算錯,除非自動偵測出來、保留 switch-level(或建模成有狀態)。把那件事自動做對,正是下一個方向的全部難點。
Where the numbers come from數字的出處
Every figure here is measured on this engine (full_palette, 300k half-cycles), not estimated. The IR interpreter was validated bit-exact (whole-NES checksum) before timing; the profile is from an instrumented run that counts every recalc and every group-BFS visit. The companion notes: why the "obvious" speedups don't work and the research landscape (why this regime has no 20-year breakthrough). The cycle-budget framing (~82 cycles/recalc) and the "transients are load-bearing" point were stress-tested against an external model asked, explicitly, to be brutally honest — its verdict was the same: no remaining lever.
這裡每個數字都是這個引擎上的實測(full_palette、30 萬半週期),不是估計。IR 直譯器在計時前已驗 bit-exact(整機 checksum);profile 來自一個會數每次 recalc、每次 group-BFS visit 的儀器化執行。延伸閱讀:為什麼「直覺上該更快」的點子都沒用、研究現況(為何這個 regime 近 20 年沒突破)。cycle 預算(每 recalc ~82)與「瞬態承重」的論點,也丟給一個被明確要求「殘酷誠實」的外部模型壓力測試 —— 它的結論一樣:沒有剩下的槓桿。