AprVisual — research landscape: where switch-level simulation acceleration stands

The short version先講結論

For our regime — single-core, event-driven, switch-level, extremely low activity (~1.4 nodes resolved per event) — the algorithmic research space looks close to its ceiling. The two strategy families we rely on (structural reduction and common-case fast-pathing) matured between the 1980s and the mid-2000s, and the last ~20 years brought no disruptive algorithmic breakthrough in this specific niche. The gains since then are systems / data-oriented engineering, not new algorithms.

就我們這個情境而言 —— 單核、event-driven、開關級、極低 activity(平均每個 event 只解析 ~1.4 個節點)—— 演算法研究空間看起來已接近天花板。我們倚賴的兩條策略(結構化簡、common-case fast-path)在 1980s 到 2000s 中期就成熟了,近 20 年在這個特定利基沒有顛覆性的演算法突破。此後的進步是系統 / 資料導向的工程,不是新演算法。

So short-term, the realistic lever is not "find a smarter graph algorithm" — it's to validate the classic methods on a real, full netlist, and compose the combination that actually runs fastest on real hardware. That is exactly where this project puts its effort, and where it claims its (modest, honest) value.

所以短期內,務實的槓桿不是「找一個更聰明的圖論演算法」,而是把這些經典方法在一顆真實、完整的網表上驗證,並組合出在真實硬體上實際跑最快的搭配。這正是本專案投入的地方,也是它(謙遜而誠實的)價值所在。

One honest caveat about this very conclusion: "the ceiling is reached" is a claim we (and an external LLM we consulted) made about our own engine once before — and then R-1 beat it by +18% by measuring rather than assuming. So read "near the ceiling" as "no known algorithm left to copy from the literature", not "physically impossible to improve". The two are very different.

對這個結論本身的一個誠實但書:「天花板到了」這句話,我們(以及我們諮詢的外部 LLM)曾對自己的引擎講過一次 —— 然後 R-1 靠實測而非假設,把它打破 +18%。所以「接近天花板」要理解成「文獻上沒有現成可抄的新演算法」,而不是「物理上不可能再快」。這兩者差很多。

The two families, in academic terms用學術詞彙講這兩條策略

We use two engineering names; here is what they map to in the literature, so an academic reader can connect them to existing work.

我們用兩個工程名稱;以下是它們在文獻中的對應,讓學術背景的讀者能接回既有研究。

Our name我們的稱呼	Academic / EDA equivalent學術 / EDA 對應
lowering (pre-simulation netlist simplification)(模擬前的網表化簡)	Logic-network structural reduction: constant propagation / folding, dead-gate (constant-driven) elimination, structural hashing (strashing), AIG rewriting & balancing, SAT-sweeping / functional node merging.邏輯網路結構化簡:常數傳播 / 摺疊、死閘(被常數驅動)消除、structural hashing(strashing)、AIG rewriting 與 balancing、SAT-sweeping / 函式等價節點合併。
fast-path (run-time common-case specialization)(執行期 common-case 特化)	Event-driven common-case specialization: selective-trace, activity-directed simulation, event filtering / activity bypass — recognizing the cheap case (a single-node resolve / a group that won't grow) and skipping the general group-resolution machinery.event-driven common-case 特化:selective-trace、activity-directed simulation、event filtering / activity bypass —— 辨識便宜情況(只解析單節點 / group 不會擴張),跳過通用 group-resolution。

Our name我們的稱呼

Academic / EDA equivalent學術 / EDA 對應

lowering
(pre-simulation netlist simplification)(模擬前的網表化簡)

Logic-network structural reduction: constant propagation / folding, dead-gate (constant-driven) elimination, structural hashing (strashing), AIG rewriting & balancing, SAT-sweeping / functional node merging.邏輯網路結構化簡:常數傳播 / 摺疊、死閘(被常數驅動)消除、structural hashing(strashing)、AIG rewriting 與 balancing、SAT-sweeping / 函式等價節點合併。

fast-path
(run-time common-case specialization)(執行期 common-case 特化)

Event-driven common-case specialization: selective-trace, activity-directed simulation, event filtering / activity bypass — recognizing the cheap case (a single-node resolve / a group that won't grow) and skipping the general group-resolution machinery.event-driven common-case 特化:selective-trace、activity-directed simulation、event filtering / activity bypass —— 辨識便宜情況(只解析單節點 / group 不會擴張),跳過通用 group-resolution。

The underlying simulation model is Bryant's switch-level model; the strategy taxonomy (event-driven vs levelized vs oblivious / compiled) is textbook. We borrow compiler ("lowering") and systems ("fast-path") vocabulary — the equivalent EDA techniques are all classic.

底層模擬模型是 Bryant 的開關級模型;策略分類(event-driven 對 levelized 對 oblivious / compiled)是教科書級。我們借用編譯器(lowering)與系統(fast-path)的詞彙 —— 對應的 EDA 技術都是經典。

Where each idea actually peaked — a 40-year timeline每個想法實際在哪一年成熟 —— 40 年時間軸

The telling pattern: the load-bearing algorithmic ideas all date from 1981–2006. After that, the data-structure work (MIG, 2014) is real but doesn't help low-activity simulation, and everything else is engineering.

關鍵在於:撐起這領域的演算法想法,全部出自 1981–2006。之後的資料結構工作(MIG,2014)是真的,但對低-activity 模擬沒幫助,其餘都是工程。

Year年份	Milestone里程碑	Still the state of the art for us?對我們仍是最先進?
~1981–84	Bryant — switch-level model & MOSSIM (the model we simulate)Bryant —— 開關級模型與 MOSSIM(我們模擬的就是這個模型)	✓ foundational地基
~1980s	Ulrich et al. — selective-trace / activity-directed event-driven simulationUlrich 等 —— selective-trace / activity-directed event-driven 模擬	✓ our "fast-path" is this我們的 fast-path 就是這個
1987	COSMOS (Bryant et al.) — symbolic switch-level analysis: statically extract channel-connected components (CCC) into Boolean / BDDCOSMOS(Bryant 等)—— 符號化開關級分析:把 channel-connected component(CCC)靜態提取成 Boolean / BDD	~ the "static extraction" idea; we tested its modern form (IR/codegen) — slower for us「靜態提取」的想法;我們測過它的現代版(IR/codegen)—— 對我們更慢
2006	Berkeley ABC — DAG-aware AIG rewriting (Mishchenko et al.) + SAT-sweepingBerkeley ABC —— DAG-aware AIG rewriting(Mishchenko 等)+ SAT-sweeping	✓ still the structural-reduction champion至今仍是結構化簡霸主
2014	MIG — Majority-Inverter Graph (Amarù, Gaillardon, De Micheli, EPFL)MIG —— Majority-Inverter Graph(Amarù、Gaillardon、De Micheli,EPFL)	✗ great for synthesis/depth; hurts a bidirectional pass-transistor sim對合成/深度很好;對雙向 pass-transistor 模擬反而有害
2005–2025	Everything that actually got faster: cache-conscious / data-oriented layout (SoA), flat double-buffered event arrays, SIMD bit-parallel, devirtualization真正變快的全部:cache-conscious / 資料導向佈局(SoA)、扁平雙緩衝事件陣列、SIMD bit-parallel、去虛擬化	✓ engineering, not algorithm工程,不是演算法

What the last 20 years actually delivered (and what's hype)近 20 年真正帶來了什麼(以及什麼是炒作)

Real, effective — but engineering / systems, not new algorithms真實有效 —— 但是工程 / 系統,不是新演算法

✓ Data-oriented layout資料導向佈局 — SoA, byte/ushort fields, packing the hot record into a cache line, contiguous fan-out lists. When activity is ~1.4, the bottleneck is L1/L2 cache misses and branch mispredicts — not algorithmic complexity. This is the single biggest single-core lever (and where our own +12% came from).SoA、byte/ushort 欄位、把熱記錄壓進一條 cache line、連續的 fan-out 列表。activity ~1.4 時,瓶頸是 L1/L2 cache miss 與分支誤判,不是演算法複雜度。這是單核最大的槓桿(我們自己的 +12% 也來自這)。
✓ Flat double-buffered event arrays扁平雙緩衝事件陣列 — drop priority queues / calendar wheels for zero/unit-delay; just a current + next array, scanned linearly. We already do this.在零/單位延遲下拋棄 priority queue / calendar wheel;只用 current + next 陣列、線性掃描。我們已經這樣做。
✓ Compiler-baseline awareness編譯器 baseline 意識 — devirtualization / making the hot branch 100% predictable; and the finding that the same change can be +C# / −Rust. Measured per engine, never assumed.去虛擬化 / 讓熱分支 100% 可預測;以及「同一改動可能 +C# / −Rust」的發現。逐引擎實測,從不假設。

Looks clever on paper, loses in practice (for low activity)紙上漂亮、實務上輸(對低 activity)

✗ ML-directed event prediction (GNN/RNN)ML 預測事件(GNN/RNN) — inference overhead dwarfs a 1.4-node BFS by orders of magnitude.推論成本比一次 1.4 節點的 BFS 高出好幾個數量級。
✗ Fancy schedulers (splay/heap/calendar queues)花俏排程器(splay/heap/calendar queue) — balancing overhead beats a plain array.維持平衡的成本大於直接用陣列。
✗ Parallel / GPU / FPGA / RTL-compile平行 / GPU / FPGA / RTL-compile — these are the real frontier for other regimes (huge designs, high activity, many instances) — but for a single low-activity instance they lose: we measured per-chip threading ~15× slower, bit-parallel BFS ~156× slower, batch AOT/codegen 3–6× slower, and the GPU (bit-sliced) backend ~10–18× slower (it re-evaluates ~14,700 nodes when only ~1.4 actually changed). And no — matrixizing the netlist SPICE-style doesn't unlock a "GPU one-shot": it removes only the node-ordering dependency, but the binding dependencies just relocate to the iterate-to-converge passes and the sequential per-half-cycle time-step, and a full-matrix solve is oblivious (computes every node). Our problem is latency- and sparsity-bound, not throughput-bound — GPU is the right tool for many instances, not one.這些在別的情境(超大設計、高 activity、多實例)才是真正前沿 —— 但對單一低-activity 實例會輸:我們實測逐晶片多執行緒慢 ~15×、bit-parallel BFS 慢 ~156×、批次 AOT/codegen 慢 3–6×、GPU(bit-sliced)後端慢 ~10–18×(平均只有 ~1.4 個節點變,它卻重算 ~14,700 個)。而且「矩陣化就能 GPU 一次算」不成立:矩陣只殺掉節點走訪順序的相依,真正綁住的相依性會搬到「迭代到收斂的 pass」與「每半週期的時間序列」,而矩陣全解是 oblivious(算每一個節點)。我們是延遲 + 稀疏 bound,不是吞吐 bound —— GPU 是給很多台 instance 用的,不是一台。

The important nuance: the field did advance 2005–2025 — but in the directions we deliberately excluded (parallelism, GPU, hardware emulation, compiled cycle-based simulation à la Verilator). Those win for big/high-activity/many-instance workloads. For one low-activity NES on one core, they don't — which is exactly why "no breakthrough" is true for us, not for the field at large.

重要的細微差別:這領域 2005–2025 確實有進步 —— 但都在我們刻意排除的方向(平行、GPU、硬體模擬、Verilator 式編譯 cycle-based 模擬)。那些對大型/高-activity/多實例工作負載會贏。對「單核、單顆、低-activity 的 NES」不會 —— 這正是為什麼「沒有突破」對我們成立,而非對整個領域成立。

How we checked this我們怎麼查證的

This summary isn't one person's opinion. It triangulates three sources: (1) our own ~3-week optimization campaign and its documented dead-end catalogue; (2) two independent queries to a frontier LLM (Gemini 3.1 Pro), framed in strict academic vocabulary and asked to honestly separate "genuinely effective" from "paper-elegant"; and (3) verification of the specific citations the LLM returned.

這份整理不是一個人的意見,而是三方交叉比對:(1) 我們自己約三週的優化實驗與其記錄的死路清單;(2) 兩次獨立詢問前沿 LLM(Gemini 3.1 Pro),用嚴格學術詞彙提問、並要求誠實區分「真正有效」與「紙上漂亮」;(3) 對 LLM 回傳的具體 cite 逐一查證。

On verification — important: the two LLM runs agreed on the conclusion and on the load-bearing citations (MIG 2014, ABC/AIG rewriting 2006, COSMOS 1987, selective-trace). We confirmed those independently. We also caught the LLM returning one citation that does not exist (a plausible-looking "cache-efficient simulator" paper that no database has) — so it was dropped. We only list references we could verify. (This is the same "measure / verify, don't assume" discipline the whole project runs on.)

關於查證 —— 重要:兩次 LLM 回答在結論與關鍵 cite(MIG 2014、ABC/AIG rewriting 2006、COSMOS 1987、selective-trace)上一致,我們也獨立確認了。同時我們也抓到 LLM 回傳了一個並不存在的 cite(一篇看起來很合理、但任何資料庫都查不到的「cache-efficient simulator」論文)—— 所以把它剔除了。本頁只列出我們能查證的參考。(這正是整個專案奉行的「實測 / 查證,不要假設」紀律。)

What this means for AprVisual這對 AprVisual 意味著什麼

It clarifies the project's positioning honestly:

它把專案的定位講清楚了:

We are not chasing a missing algorithm — the literature says there isn't one to copy for this regime.
我們不是在追一個缺失的演算法 —— 文獻顯示這個情境沒有現成可抄的。
The value is engineering validation + the fastest combination: take the classic methods (Bryant model, selective-trace fast-path, structural lowering), prove them on a real full NES netlist, cross-check on two bit-identical engines (C# + Rust), and measure every change.
價值在工程驗證 + 最快搭配:把經典方法(Bryant 模型、selective-trace fast-path、結構化 lowering)拿來,在一顆真實完整的 NES 網表上驗證,用兩個位元一致的引擎(C# + Rust)交叉比對,並逐項實測。
R-1 and the P-1 → P-4 prunes are the proof that this is worth doing. None is a new algorithm — each is a measured/provable realization that a large fraction of events are trivial (R-1: ~half are dynamic singletons; P-1 → P-4: ~84% of re-evaluations change nothing, and ~21% of those are provably skippable), which a "ceiling reached" assumption had hidden. Those measurements were worth +18% (R-1) and another ~+8% C# / ~+10% Rust (P-1 → P-4), all bit-exact.
R-1 與 P-1 → P-4 剪枝正是「這值得做」的證明。沒有一個是新演算法 —— 每一個都是實測/可證明得到的領悟:很大一部分 event 其實 trivial(R-1:約一半是動態 singleton;P-1 → P-4:約 84% 的重算什麼都沒改,其中約 21% 可被證明跳過),這些都被「天花板到了」的假設遮住了。那些實測分別值 +18%(R-1)與再 +約 8% C# / +約 10% Rust(P-1 → P-4),全部 bit-exact。

Bottom line: short-term, the breakthrough space is essentially tapped; the realistic, valuable work is rigorous engineering validation and finding the best combination on real hardware. A genuine leap would require a different computational model — and we've already tested the obvious candidates (IR/codegen, GPU, parallel) and shown they lose for this low-activity, single-instance regime. So we keep the engine honest, fast, and verifiable, and we keep measuring.

底線:短期內,突破空間基本上見底了;務實而有價值的工作,是嚴謹的工程驗證與在真實硬體上找出最佳搭配。真正的躍進需要換一個計算模型 —— 而我們已經測過明顯的候選(IR/codegen、GPU、平行),證明它們在這個低-activity、單實例情境會輸。所以我們把引擎保持誠實、快速、可驗證,然後持續實測。

Verified references查證過的參考

Amarù, Gaillardon, De Micheli — Majority-Inverter Graph: A Novel Data-Structure and Algorithms for Efficient Logic Optimization, DAC 2014 (EPFL Infoscience)
Berkeley ABC — logic synthesis & verification (AIG rewriting / SAT-sweeping; Mishchenko et al., DAC 2006)
R. E. Bryant — A Switch-Level Model and Simulator for MOS Digital Systems, IEEE Trans. Computers, 1984; and COSMOS (Bryant, Beatty, et al.), A Compiled Simulator for MOS Circuits, DAC 1987.R. E. Bryant —《A Switch-Level Model and Simulator for MOS Digital Systems》,IEEE Trans. Computers,1984;以及 COSMOS(Bryant、Beatty 等),《A Compiled Simulator for MOS Circuits》,DAC 1987。
E. G. Ulrich — selective-trace / event-driven logic simulation (foundational, 1960s–80s).E. G. Ulrich —— selective-trace / event-driven 邏輯模擬(奠基性,1960s–80s)。

See also AprVisual's own optimization notes and the simulator comparison for the measured numbers behind the "engineering, not algorithm" claim.

「工程而非演算法」這個說法背後的實測數字,另見 AprVisual 自己的優化筆記與模擬器比較。