AprVisual — the frontier: why the last 80% won't prune

The remaining waste, precisely剩下的浪費,精確地說

After the prunes, ~80% of the still-wasted re-evaluations are pull-up / supply-driven combinational nodes that recompute the same value after an input gate turns off. Picture a node held low by two parallel pull-down paths, A and B. When the gate of A opens, the engine re-queues the node and re-resolves it — but B still holds it at 0, so nothing changed. The re-evaluation was wasted.

剪枝之後,仍被浪費的重算裡有 ~80% 是 pull-up / 供電驅動的組合邏輯節點,在某個輸入 gate 關閉後重算成同值。想像一個被兩條並聯下拉路 A、B 拉低的節點。當 A 的 gate 打開,引擎把節點重新入隊、重新解析 —— 但 B 還拉著它在 0,所以什麼都沒變。這次重算是白做的。

The catch, in one sentence: whether that event actually changes the node is runtime / data-dependent — it depends on the live state of the other pull-down. That single fact is what kills both of the ideas below.

關鍵一句話:那次事件到底會不會改變節點,是執行期 / 資料相依的 —— 它取決於另一條下拉路當下的狀態。就是這一件事,讓底下兩個想法都成立不了。

Idea A — learn the waste from a trace (and skip it)想法 A —— 從一條 trace 學出浪費(然後跳過)

Tempting and simple: in a DEBUG build, run 1,000,000 half-cycles, record every node that never changed, write it to a file. In Release, load the file and mark those nodes "skip" before the hot loop. It's profile-guided optimisation for a simulator.

很誘人、很簡單:在 DEBUG build 跑 100 萬半週期,記下每個從沒變過的節點、寫成檔。Release 時載入,進熱迴圈前把那些節點標成「跳過」。這就是給模擬器用的 profile-guided optimisation。

Why it fails: it is not bit-exact. "Never changed in this 1M trace" is a sample, not a proof. The whole point of a switch-level sim is to be exact for any program — and a learned mask is exact only for the one trace it learned:

為什麼失敗:它不是 bit-exact。「這 1M trace 裡從沒變」是取樣,不是證明。switch-level 模擬的全部意義就是對任何程式都精確 —— 而學來的 mask 只對它學的那條 trace 精確:

Program difference. full_palette is a static test pattern with no controller input, exercising one small code path. Run Super Mario Bros., or press a button, and a whole datapath / ALU / control region that "never moved" during profiling comes alive — those nodes are now changing, and we'd be skipping them.
程式差異。full_palette 是靜態測試圖、不按搖桿、只走一小段程式。換成 Super Mario Bros.、或按一個鍵,一整塊在 profile 時「從沒動」的資料路徑 / ALU / 控制邏輯就活了 —— 那些節點現在會變,而我們卻在跳過它們。
Step-count difference. A node tied to a counter, a timer, a rare carry/overflow, or a late branch can be unchanged for 1M half-cycles and flip at 1.5M. A finite observation can't prove an infinite property — you can't tell "never changes" apart from "hasn't changed yet."
步數差異。綁在計數器、計時器、罕見進位/溢位、或很後面才走到的分支上的節點,可能在 100 萬半週期內不變、卻在 150 萬時翻。有限的觀察證明不了無限的性質 —— 你分不出「永不變」和「只是還沒變到」。

Skip a node that should have changed and the simulation silently diverges — it stops being the golden model. (This is, in spirit, the behavioural-abstraction route (S2 / "Escape-1") the project already explored: trade exactness for speed. It concluded the abstraction is real but doesn't beat the event-driven engine single-core.)

跳過一個該變的節點,模擬就靜默發散 —— 它不再是 golden 模型。(本質上這就是專案早已探索過的行為抽象路線(S2 /「Escape-1」):用精確換速度。結論是抽象確實存在,但單核打不過事件驅動引擎。)

Idea B — reduce the logic, statically (the better idea)想法 B —— 靜態地縮減邏輯(更高明的想法)

A sharper instinct: those nodes mostly exist to express logic-layer computation. Any logic has an equivalence-reduction space — and crucially that reduction is invariant (a property of the Boolean function, not of any input trace). So, unlike a profiled mask, a logic-equivalence reduction would be bit-exact. Merge two nodes that compute the same function; remove a transistor that's logically redundant. How much is there?

一個更敏銳的直覺:那些節點大多是為了表現邏輯層的計算。任何邏輯都有等價縮減空間 —— 而且關鍵是,那個縮減是不變的(它是 Boolean 函數的性質,跟任何輸入 trace 無關)。所以,不像剖析來的 mask,邏輯等價縮減會是 bit-exact。把算出同一個函數的兩個節點合併;移除邏輯上多餘的電晶體。到底有多少?

In pure logic the instinct is correct, and the space is even huge — an earlier investigation measured that ~98.9% of the chip is reducible to logic + registers. But it does not transfer to the switch level, for one deep reason:

在純邏輯裡這個直覺是對的,空間甚至很大 —— 早先的研究量過,晶片有 ~98.9% 可化簡成邏輯 + 暫存器。但它無法轉移到 switch-level,原因只有一個、但很深:

Logically-equivalent ≠ switch-level-equivalent. A switch-level node isn't a Boolean value — it carries capacitance. Two nodes can settle to the same logic value yet behave differently while floating (charge-sharing) or in a tie-break (the largest-capacitance member wins). Merge them, or delete a "redundant" pull-down, and you change a connected group's total/maximum capacitance — which can flip the floating tie-break and break bit-exactness. The physics carries information that the logic doesn't.

邏輯等價 ≠ switch-level 等價。switch-level 的節點不是一個 Boolean 值 —— 它帶著電容。兩個節點可以穩態解析成同一個邏輯值,卻在浮接時(charge-sharing)或 tie-break 時(最大電容的成員勝出)行為不同。把它們合併、或刪掉一個「多餘的」下拉,你就改了某個連通群組的總/最大電容 —— 可能翻轉 floating tie-break、破壞 bit-exact。物理帶著邏輯沒有的資訊。

On top of that, the structure is hostile: ~94% of nodes sit in one bidirectional pass-transistor SCC, where "a Boolean function of fixed inputs" isn't even well-defined; only ~6% is feed-forward. And that 6% — a hand-laid 1970s NMOS layout — has almost no functional duplicates, because the few repeated bits of logic (fan-out buffers) were given different capacitance on purpose. The verdict, from a logic-synthesis/EDA review asked to be brutally honest: a realistic node reduction of < 0.1% and 0% measurable single-core speedup. The structural merges that are safe (always-on shorts, dead transistors) are already done by the offline lowering pass.

更糟的是結構不利:~94% 的節點在同一個雙向 pass-transistor SCC 裡,那裡「固定輸入的 Boolean 函數」根本沒定義;只有 ~6% 是 feed-forward。而那 6% —— 一張 1970 年代手工 layout 的 NMOS —— 幾乎沒有功能性重複,因為少數重複的邏輯(fan-out buffer)是刻意給了不同的電容。一份被要求殘酷誠實的 logic-synthesis/EDA 評估給的判定:實際節點縮減 < 0.1%、單核加速 0%。真正安全的結構性合併(always-on short、死電晶體)離線 lowering 早就做完了。

The punchline: the waste is the price of the speed結局:浪費就是速度的代價

Here is why both ideas had to fail, and it's the most useful thing on this page. The ~80% of re-evaluations that "change nothing" are not a flaw — they are the cost of the event-driven engine's sparsity. The engine is fast precisely because it only touches the few hundred nodes that might have changed each half-cycle, and it touches each one in local ignorance: a node re-checks itself because it cannot cheaply know that a sibling pull-down still holds it. That ignorance is what lets it skip the other ~14,000 nodes.

這裡說明為什麼兩個想法都注定失敗,而這是整頁最有用的一點。那 ~80% 「什麼都沒改」的重算不是缺陷 —— 它們是事件驅動引擎稀疏性的代價。引擎之所以快,正是因為它每半週期只碰那幾百個可能變了的節點,而且是在局部無知中碰每一個:一個節點重新檢查自己,是因為它沒辦法便宜地知道某條兄弟下拉路還拉著它。正是那個無知,讓它能跳過其他約 14,000 個節點。

To remove the ignorance you must add knowledge — and there are only two ways, both already measured, both losses:

要消除無知,就得加上知識 —— 而方法只有兩種,都已實測,都是負的:

Add knowledge by…加上知識的方法…	What it costs代價
Global state全域狀態 maintain an "active pull-down count" per node每節點維護「ON 下拉數」計數器	The counter must be updated on every gate flip — the write path runs ~10× more often than the read it saves. Measured −6% (the state-caching fallacy). A second, sharper attempt — caching the single "dominant driver" id instead of a count — was bit-exact and genuinely pruned (+3.4%), yet still lost −12% to −15.4% of maintenance: the P-5 case →計數器得在每次 gate 翻轉時更新 —— 寫路徑比它省下的讀路徑熱約 10 倍。實測 −6%(state-caching 謬誤)。第二次更精準的嘗試 —— 改快取單一「dominant driver」id 而非計數 —— bit-exact、且真的有剪枝(+3.4%),卻仍因 −15.4% 的維護成本而淨虧 −12%:P-5 案例 →
Bigger cones更大的 cone fold logic into composite gates / IR / compiled把邏輯折成複合閘 / IR / 編譯	A node now covers a whole logic cone — which loses the sparsity (you re-evaluate more) and pays i-cache + dispatch. Measured 3× to 84× slower.一個節點現在覆蓋整個邏輯 cone —— 這失去稀疏性(你重算更多)、付 i-cache + dispatch。實測慢 3 到 84 倍。

Add knowledge by…加上知識的方法…

What it costs代價

Global state全域狀態
maintain an "active pull-down count" per node每節點維護「ON 下拉數」計數器

The counter must be updated on every gate flip — the write path runs ~10× more often than the read it saves. Measured −6% (the state-caching fallacy). A second, sharper attempt — caching the single "dominant driver" id instead of a count — was bit-exact and genuinely pruned (+3.4%), yet still lost −12% to −15.4% of maintenance: the P-5 case →計數器得在每次 gate 翻轉時更新 —— 寫路徑比它省下的讀路徑熱約 10 倍。實測 −6%(state-caching 謬誤)。第二次更精準的嘗試 —— 改快取單一「dominant driver」id 而非計數 —— bit-exact、且真的有剪枝(+3.4%),卻仍因 −15.4% 的維護成本而淨虧 −12%:P-5 案例 →

Bigger cones更大的 cone
fold logic into composite gates / IR / compiled把邏輯折成複合閘 / IR / 編譯

A node now covers a whole logic cone — which loses the sparsity (you re-evaluate more) and pays i-cache + dispatch. Measured 3× to 84× slower.一個節點現在覆蓋整個邏輯 cone —— 這失去稀疏性(你重算更多)、付 i-cache + dispatch。實測慢 3 到 84 倍。

Small cones can't have the masking effect (so they can't be pruned away); large cones lose the sparsity (so they're slower). They are two sides of one coin. The 80% waste and the engine's speed are the same phenomenon seen from two directions.

小 cone 沒有 masking 效應(所以剪不掉);大 cone 失去稀疏性(所以更慢)。這是同一枚硬幣的兩面。那 80% 的浪費,和引擎的速度,是同一個現象的兩個方向。

The Pareto frontierPareto 前緣

Under all three constraints together — event-driven, bit-exact (full Bryant switch-level semantics, per-node), and single CPU core — the engine is now at its Pareto frontier. The provably-redundant work has been removed (P-1 → P-4, lowering, the fast-path); what remains is irreducible without paying somewhere else. Going faster means relaxing exactly one of the three constraints:

在三個約束同時成立下 —— 事件驅動、bit-exact(完整 Bryant switch-level 語意、逐節點)、單核 CPU —— 引擎現在就在它的 Pareto 前緣。可證明多餘的工已經移除(P-1 → P-4、lowering、fast-path);剩下的,不在別處付代價就降不下來。要更快,就得放寬三個約束中的剛好一個:

faster core更快的核

relax "single core" → a higher-IPC / lower-latency CPU. The leaderboard's #1 is exactly this.放寬「單核」→ 更高 IPC / 更低延遲的 CPU。排行榜第一名正是如此。

approximate近似

relax "bit-exact" → behavioural abstraction (S2). Real, but a different artifact, not the golden model.放寬「bit-exact」→ 行為抽象(S2)。可行,但是另一個產物,不是 golden 模型。

re-architect換架構

relax "event-driven" → batch / GPU. Measured slower for one instance; only many-instance throughput could win, which was never the goal.放寬「事件驅動」→ 批次 / GPU。單實例實測更慢;只有多實例吞吐可能贏,而那從不是目標。

None of those makes the existing artifact faster; each builds a different one. So for the golden, bit-exact, single-core switch-level engine, this is the floor — reached not by giving up, but by proving, twice, that the floor is real.

那些都不會讓現有的產物變快;每一個都是在建造另一個產物。所以對 golden、bit-exact、單核的 switch-level 引擎來說,這就是底 —— 不是放棄而到的底,而是兩次證明「這個底是真的」之後到的底。

Where the conclusions come from結論的出處

The waste profile (84% no-change, the PullUp/Supply breakdown) is from a #if DEBUG instrument kept in the engine. The counter (−6%) and IR/codegen (3–84×) figures are this project's own measured results. The logic-synthesis review of Ideas A and B was an external EDA model asked, explicitly, to be brutally honest about whether the invariant logic-reduction space is exploitable under these constraints — its verdict was "< 0.1% nodes, 0% speedup, not worth pursuing." Companion pages: the prunes that did work (P-1 → P-4), why IR / codegen hit the wall, and the study paper on when abstraction beats switch-level.

浪費 profile(84% no-change、PullUp/Supply 拆解)來自留在引擎裡的 #if DEBUG 儀器。計數器(−6%)與 IR/codegen(3–84×)的數字是本專案自己的實測。想法 A、B 的 logic-synthesis 評估,是一個被明確要求「殘酷誠實」的外部 EDA 模型,評估在這些約束下不變邏輯縮減空間是否可利用 —— 它的判定是「< 0.1% 節點、0% 加速、不值得追」。延伸頁面:真的有效的剪枝(P-1 → P-4)、為什麼 IR / codegen 撞牆、以及研究論文(抽象化何時贏過 switch-level)。