Doing less, exactly — the event-count prunes精準地少做事 —— 事件數剪枝

The hot loop is memory-latency bound: every node re-evaluation is a random gather that stalls on cache. You can't make a stall faster — so the only lever left is to do fewer of them, without changing a single output bit. A profiler showed that 84.4% of all node re-evaluations recompute to the same value — pure waste. This is the story of four prunes (P-1 → P-4) that delete the provably-redundant ones at the source, the trick that makes them safe (telling a logic node from a bus / memory / register cell by its physics), and the whole-NES numbers that prove it stayed bit-exact.

熱迴圈受記憶體延遲限制:每次節點重算都是一個隨機 gather、卡在 cache miss 上。你沒辦法讓一次 stall 變快 —— 所以唯一剩的槓桿,是少做幾次,而且一個輸出位元都不能變。一支 profiler 顯示:全部節點重算裡有 84.4% 重算出來跟原值一模一樣 —— 純粹浪費。這篇講四個剪枝(P-1 → P-4)如何從源頭刪掉「可被證明是多餘」的那些、讓它們安全的關鍵(用物理特性分辨邏輯節點與匯流排/記憶體/暫存器 cell),以及證明全程 bit-exact 的整機數據。

The finding: 84% of the work changes nothing發現:84% 的工是白做的

The engine settles each half-cycle by popping "dirty" nodes off a queue and re-resolving each one's conducting group. We instrumented that loop to count, per pop, whether any state actually changed. Over full_palette at 300k half-cycles:

引擎每半週期透過「從佇列彈出 dirty 節點、重新解析它的導通群組」來收斂。我們在那個迴圈裡量測每次 pop 是否真的改變了任何狀態。在 full_palette、30 萬半週期下:

159.6M
node re-evaluations節點重算次數
84.4%
recompute to the SAME value (wasted)重算成同值(浪費)
~88%
of that waste = dynamic-singleton nodes浪費中是動態 singleton

A node gets re-queued whenever a neighbouring transistor's gate flips — but its resolved value usually doesn't move (a node held low by three parallel paths doesn't change when one of them opens). That redundancy is inherent to level-triggered event propagation. The win is to recognise the redundant cases at the moment we would enqueue them, with a check cheap enough to beat the stall it avoids — and a static safety mask so we never prune a case that could actually change.

每當某個鄰接電晶體的閘極翻轉,節點就被重新入隊 —— 但它解析出的值通常沒動(一個被三條並聯路徑拉低的節點,其中一條斷開時不會變)。這種冗餘是 level-triggered 事件傳播的本質。要贏,就要在準備入隊的那一刻認出冗餘的情況,而且檢查要便宜到划得過它省下的 stall —— 再加一個靜態安全 mask,確保絕不剪掉「真的可能會變」的情況。

The four prunes四個剪枝

Each is a transformation in SetNodeState's enqueue walk, guarded by a per-node static mask built once at load. All are bit-exact — verified against the golden whole-NES checksum at 300k and 1M half-cycles — and validated visually (a pixel-perfect Super Mario Bros. title frame). Gains are interleaved-paired medians on an AMD Ryzen 7 3700X.

每一個都是 SetNodeState 入隊走訪裡的一個變換,由載入時建好的「每節點靜態 mask」把關。全部 bit-exact —— 對照 golden 整機 checksum,在 30 萬 100 萬半週期都驗過 —— 並做了視覺驗證(像素完全正確的 Super Mario Bros. 標題畫面)。增益為 AMD Ryzen 7 3700X 上的 interleaved-paired 中位數。

Mechanism機制C#Rust
P-1
same-state turn-on同態 turn-on
A transistor turns ON and merges two groups. If both ends already hold the same value, the merge can't change anything — skip the re-evaluation. (Earlier work.)電晶體導通、合併兩個群組。若兩端已是同值,合併不會改變任何東西 —— 跳過重算。(先前的成果。) +11.85%+11.36%
P-2
turn-off isolateturn-off 孤立
A transistor turns OFF. If the endpoint's only channel was that transistor (and it has no driver), it becomes an isolated floating node — it holds its previous value, so re-evaluating it is provably a no-op. Skip it.電晶體斷開。若該端點唯一的通道就是這顆電晶體(且無驅動源),它就變成孤立浮接節點 —— 維持原值,所以重算可證明是 no-op。跳過。 +1.4%+10.0%
(P-2+3+4
combined合計)
P-3
turn-on un-taintturn-on 放行
Re-admit a node to P-1's prune even though it has no pull-up — provided its capacitance is strictly less than its single neighbour's, so it can never win a floating tie-break. Then a same-state turn-on merge is still a no-op for it.即使節點沒有 pull-up,也讓它重新適用 P-1 剪枝 —— 條件是它的電容嚴格小於唯一鄰居,使它永遠不可能贏得浮接 tie-break。如此同態 turn-on 合併對它仍是 no-op。 +5.96%
P-4
multi-channel多通道推廣
The same un-taint generalised to multi-channel nodes whose capacitance is less than every neighbour's (P-1's endpoint same-state check already blocks the cross-state bridging case).把上面的放行推廣到「電容小於每一個鄰居」的多通道節點(P-1 的端點同態檢查已擋掉跨態 bridging 的情況)。 +1.71%

P-2/3/4 land entirely in the load-time classifier (they only set static mask bits) — zero added work in the hot loop. The two masks were later packed into one byte (bit 0 = turn-on-unsafe, bit 1 = turn-off-skip) so future prunes add a bit, not an array.

P-2/3/4 全部落在載入期的分類器(只設靜態 mask 位元)—— 熱迴圈裡新增工作。兩個 mask 後來合併成一個 byte(bit 0 = turn-on 不安全、bit 1 = turn-off 可跳),日後新剪枝只要加一個 bit、不必再加一個陣列。

The hard part: a logic node vs a bus / memory / register cell難的地方:邏輯節點 vs 匯流排/記憶體/暫存器 cell

A prune is only bit-exact if it never fires on a node whose value could actually change. The dangerous nodes are exactly the chip's stateful ones — and the trick is that we identify them statically, by their structure and physics, never by a hand-maintained list:

剪枝要 bit-exact,前提是它絕不在「值真的可能會變」的節點上觸發。危險的節點正是晶片裡有狀態的那些 —— 而關鍵在於我們是靜態地、用結構與物理特性辨識它們,從不靠手維護的名單:

Node kind節點種類How we detect it我們怎麼偵測Why it's dangerous為何危險
Data bus pins資料匯流排腳
u1._d*, ROM/RAM data
They appear in a memory handler's DataOut set — the pins the behavioural RAM/ROM drives via SetHigh/SetLow. We build that set at load and exclude every node in it.它們出現在某個記憶體 handler 的 DataOut 集合 —— 行為層 RAM/ROM 用 SetHigh/SetLow 驅動的腳。我們在載入時建好這個集合,把裡面每個節點排除。 When driven, an isolated pin resolves to the driven value, not a float-hold — pruning it would freeze a stale value.被驅動時,孤立的腳會解析成被驅動的值,而非浮接保持 —— 剪掉它會凍結成過時的值。
Precharged bus預充匯流排
PPU data bus
Carries the ForceCompute flag (the "Gnd + Pwr cancel" nodes that model NMOS precharge-vs-pulldown). Excluded by flag.ForceCompute 旗標(用「Gnd + Pwr 互相抵消」模擬 NMOS 預充 vs 下拉的節點)。用旗標排除。 Its resolution is non-monotone (cancelling drivers), so an equal-state merge can still flip it.它的解析非單調(會抵消驅動源),所以同態合併仍可能翻轉它。
Memory / register cells記憶體/暫存器 cell
OAM, palette RAM, dynamic CPU/ALU
No pull-up (they store charge), and a high capacitance — specifically, capacitance ≥ a neighbour's. The capacitance test automatically separates a heavy storage cell (which wins the floating tie-break and is kept) from a passive pass-transistor leaf (which can't, and is safe to prune).沒有 pull-up(它們靠電荷儲存),而且電容大 —— 具體說,電容 ≥ 某個鄰居。這個電容測試自動把「重儲存 cell(會贏浮接 tie-break,保留)」和「被動的 pass-transistor 葉節點(贏不了,可安全剪)」分開。 A storage cell is the chip's memory: its value, transferred by charge-sharing when its access transistor turns on, must propagate. Prune it and the latch loses its bit.儲存 cell 就是晶片的記憶體:存取電晶體導通時靠電荷分享傳遞的值,必須要傳出去。剪掉它,latch 就掉了那個 bit。

The key idea: we never enumerate "which nodes are registers." We let the capacitance sort them — a node that is strictly lighter than its neighbours can never be the one that decides a floating group's held value, so merging it (when both sides already agree) is provably free. That single inequality, cap(X) < cap(neighbours), is what made P-3/P-4 safe — and it generalises to the PPU without a single hand-listed node.

關鍵想法:我們從不列舉「哪些節點是暫存器」。我們讓電容去分類 —— 一個比所有鄰居都嚴格更輕的節點,永遠不可能是決定浮接群組保持值的那個,所以(在兩邊已一致時)合併它可證明是免費的。就是這條不等式 cap(X) < cap(鄰居) 讓 P-3/P-4 安全 —— 而且它能泛化到 PPU,完全不需要手列任何一個節點。

Why turn-off is easier than turn-on. Turn-OFF isolates a node (its group shrinks to just itself → it trivially holds) — unconditionally safe for a driverless single-channel leaf. Turn-ON merges it into a bigger group, where the floating capacitance tie-break can pick a different winner — which is exactly why turn-on needs the capacitance guard. An early, un-guarded turn-on attempt diverged on exactly two charge-share ALU/status-register nodes; tracing them is what surfaced the capacitance condition.

為什麼 turn-off 比 turn-on 容易。turn-OFF 把節點孤立(群組縮成自己 → 自然保持)—— 對無驅動的單通道葉節點無條件安全。turn-ON 把它併入更大的群組,浮接電容 tie-break 可能選到不同的勝者 —— 這正是 turn-on 需要電容護欄的原因。早期一個沒加護欄的 turn-on 嘗試,剛好在兩個電荷分享的 ALU/狀態暫存器節點上發散;追這兩個節點,才逼出了電容條件。

By the numbers數字總整理

What the prunes remove, per full_palette 300k half-cycles:

這些剪枝實際刪掉的量,以 full_palette 30 萬半週期計:

33.6M
fewer node re-evaluations / 300k hc每 30 萬 hc 少算的節點次數
−21%
of all RecalcNode pops, deleted所有 RecalcNode pop 被刪掉的比例
2,258
nodes carry a prune bit (1,272 + 986)帶剪枝位元的節點(1,272 + 986)
Metric (full_palette, 300k hc)指標(full_palette、30 萬 hc)before P-2P-2 之前after P-2/3/4P-2/3/4 之後
RecalcNode pops (node re-evaluations)RecalcNode pop(節點重算次數)159,646,522126,086,860
of which recompute to no change其中重算成同值84.4%80.3%
nodes masked turn-off-skip (P-2)turn-off-skip 標記節點 (P-2)1,272
nodes turn-on-un-tainted (P-3/4)turn-on 放行節點 (P-3/4)986

The remaining ~80% no-change pops are structurally irreducible: pull-up / supply-driven combinational nodes that recompute the same value after an input opens (one pull-down path closes but another still holds the node) — predicting that requires scanning the other pull-downs, i.e. exactly the work we'd be trying to avoid. An external model, asked to be brutally honest, agreed there is no cheap predictor. This is the natural floor of "doing less" on this model — why the last 80% won't prune →

剩下約 80% 的 no-change pop 是結構性不可削減的:pull-up / 供電驅動的組合邏輯節點,在某個輸入斷開後重算出同值(一條下拉關了、但另一條還拉著)—— 要預測得掃它其他的下拉路,正是想省的那份工。一個被要求殘酷誠實的外部模型也同意:沒有便宜的預測子。這就是這個模型上「少做事」的自然底線 —— 為什麼最後 80% 剪不掉 →

Result結果

P-2/3/4 lifted the engine across both back-ends, and — unlike many micro-optimisations that flip sign between RyuJIT and LLVM — this one landed positive on both, with bit-identical output:

P-2/3/4 把兩個後端都拉高,而且 —— 不像許多在 RyuJIT 與 LLVM 之間反號的微優化 —— 這次在兩邊都是正的,輸出位元完全相同:

109.4K
C# hc/s · was 101.6K (+7.7%)C# hc/s · 原 101.6K(+7.7%)
101.0K
Rust hc/s · was 91.9K (+9.9%)Rust hc/s · 原 91.9K(+9.9%)
~383×
C# gap to NES real-timeC# 離 NES 實機

Both engines still produce the same checksum 0x794A43ABDF169ADA (@300k) / 0x6D4CCBCE2E9CD599 (@1M). NES NTSC real-time needs 42,954,552 hc/s, so C# is still ~393× short — but the ceiling we once called final keeps moving, one provably-redundant event at a time. Got a faster CPU? Run the benchmark and share your numbers →

兩個引擎仍輸出相同 checksum 0x794A43ABDF169ADA(@30 萬)/ 0x6D4CCBCE2E9CD599(@100 萬)。NES NTSC 實機需要每秒 42,954,552 個 hc,所以 C# 還差約 393 倍 —— 但我們一度認定是終點的天花板,正一次一個「可證明多餘的事件」地往前移。有更快的 CPU 嗎?跑 benchmark 分享你的數據 →

Where the numbers come from數字的出處

Every figure is measured on this engine (full_palette, 300k half-cycles), not estimated. The no-change profiler is a #if DEBUG instrument kept in the tree — it counts every pop and categorises the wasted ones, with zero cost in Release. Each prune was validated bit-exact (whole-NES checksum at 300k and 1M) before timing; gains are interleaved-paired medians with a checksum guard on every round. Companion reading: why IR / codegen still hit the wall, what S1 added over MetalNES, and the study paper on when abstraction beats switch-level.

每個數字都是這個引擎上的實測(full_palette、30 萬半週期),不是估計。no-change profiler 是留在 repo 裡的 #if DEBUG 儀器 —— 它數每次 pop、把浪費的分類,Release 下零成本。每個剪枝在計時前都驗過 bit-exact(整機 checksum,30 萬與 100 萬);增益為 interleaved-paired 中位數,每一輪都有 checksum 守門。延伸閱讀:為什麼 IR / codegen 仍然撞牆S1 比 MetalNES 多了什麼、以及研究論文(抽象化何時贏過 switch-level)。

Lineage & sources淵源與來源

In one line: the methods here are established principles (Bryant's switch-level model + Ulrich's event-driven selective trace, recalled from the literature) plus contributions derived in this project (the specific prune conditions P-1 → P-4 and the storage-node detection heuristic) — and what makes those contributions trustworthy is not anyone's say-so but bit-exact measurement: the whole-NES checksum and interleaved-paired A/B that gated every one of them.

一句話:本文的方法 = 既有原理(Bryant 的 switch-level 模型 + Ulrich 的事件驅動 selective trace,從文獻喚起)+ 本專案推導的貢獻(P-1 → P-4 的具體剪枝條件,以及儲存節點偵測啟發法)—— 而讓這些貢獻可信的不是誰的一句話,而是 bit-exact 實測:整機 checksum 與 interleaved-paired A/B 為每一項把關。

Honest attribution — what is a standard principle vs. what is derived in this work. The engine's foundations are established literature; the four prunes are this project's own derivations from those principles (not lifted from a specific paper), and the memory/register detection is a derived heuristic standing on a published concept.

誠實標註 —— 哪些是標準原理、哪些是本專案衍生。引擎的基礎是既有文獻;四個剪枝是本專案那些原理自行推導出來的(不是抄自某篇論文);而記憶體/暫存器偵測是一個站在已發表概念上的衍生啟發法。

Element元素Status性質Source來源
Switch-level model — signal-strength ordering, the "largest-capacitance member holds the value" floating/charge-share rule, X handling (our priority LUT is a standard implementation of it)Switch-level 模型 —— 訊號強度排序、「最大電容成員保持值」的浮接/電荷分享規則、X 處理(我們的 priority LUT 是它的標準實作) standard標準原理 R. E. Bryant, MOSSIM: A Switch-Level Simulator for MOS LSI, Proc. 18th DAC, 1981 · R. E. Bryant, A Switch-Level Model and Simulator for MOS Digital Systems, IEEE Trans. Computers, C-33(2), 1984
Event-driven settle — only re-evaluate the sub-network whose inputs changed (selective trace)事件驅動收斂 —— 只重算輸入有變的子網路(selective trace) standard標準原理 E. G. Ulrich, Exclusive simulation of activity in digital networks, Comm. ACM, 12(2), 1969
Event filtering — not scheduling a successor when the change provably can't alter its value. The four prunes (P-1 → P-4) are specific, derived applications of this on the Bryant model.事件過濾 —— 當變化可證明不會改變後繼節點的值時就不排程它。四個剪枝(P-1 → P-4)是這個原理在 Bryant 模型上的具體、衍生應用。 standard principle + derived here標準原理 + 本專案衍生 principle: a standard optimization of selective-trace (textbook: M. Abramovici, M. Breuer, A. Friedman, Digital Systems Testing and Testable Design, IEEE Press, 1990). The specific P-1 → P-4 conditions: derived in this work.原理:selective-trace 的標準最佳化(教科書:M. Abramovici、M. Breuer、A. Friedman,Digital Systems Testing and Testable Design,IEEE Press,1990)。P-1 → P-4 的具體條件:本專案推導。
Memory / register / dynamic-node detection — distinguishing charge-storage (no-pull-up, capacitance-held latch / dynamic cell) from static combinational logic, so the prunes exclude storage記憶體/暫存器/動態節點偵測 —— 區分電荷儲存(無 pull-up、靠電容保持的 latch / 動態 cell)與靜態組合邏輯,讓剪枝排除儲存節點 concept published + heuristic derived here概念已發表 + 啟發法本專案衍生 concept (state-retaining vs combinational MOS nodes): R. E. Bryant, Boolean Analysis of MOS Circuits, IEEE Trans. CAD, CAD-6(4), 1987 · node-capacitance analysis: J. K. Ousterhout, Crystal: A Timing Analyzer for nMOS VLSI Circuits, 3rd Caltech Conf. on VLSI, 1983. Our specific test (no-PullUp + connection-count capacitance + ForceCompute flag, and the cap < neighbours tie-break guard): derived in this work.概念(MOS 的狀態保持 vs 組合節點):R. E. Bryant,Boolean Analysis of MOS Circuits,IEEE Trans. CAD,CAD-6(4),1987 · 節點電容分析:J. K. Ousterhout,Crystal: A Timing Analyzer for nMOS VLSI Circuits,3rd Caltech Conf. on VLSI,1983。我們的具體判據(無 PullUp + 連線數電容 + ForceCompute 旗標,以及 cap < 鄰居 的 tie-break 護欄):本專案推導。
Netlist data — the 2A03 / 2C02 transistor netlists (polygon-extracted from die photos)網表資料 —— 2A03 / 2C02 電晶體網表(從晶片照片多邊形提取) data source資料來源 Visual6502 / Visual 2A03 / Visual 2C02 — G. James, B. Silverman, B. Silverman, Visualizing a Classic CPU in Action: The 6502, ACM SIGGRAPH 2010 Talks (visual6502.org). Netlist data is CC-BY-NC-SA.
Per-node metadata + board model — pull-up / weak-depletion / ForceCompute flags, module/memory handlers每節點 metadata + 板級模型 —— pull-up / weak-depletion / ForceCompute 旗標、模組/記憶體 handler data source資料來源 MetalNES — Ian Addis (iaddis), github.com/iaddis/metalnes (the S1 reference implementation)

In short: the model is Bryant's and the event-driven settle is Ulrich's; the prunes apply standard event-filtering with conditions derived here; and the storage-node exclusion is a derived heuristic resting on Bryant's state-retaining-node concept and Ousterhout-style node-capacitance analysis — never a profiled/learned mask (which would not be bit-exact; see the frontier).

一句話:模型是 Bryant 的、事件驅動收斂是 Ulrich 的;剪枝是套用標準的事件過濾、條件由本專案推導;而儲存節點的排除是一個衍生啟發法,立基於 Bryant 的「狀態保持節點」概念與 Ousterhout 式的節點電容分析 —— 絕非剖析/學習來的 mask(那不會 bit-exact,見前緣一文)。