First: the workload that breaks your intuition先看:那個會讓直覺失效的工作負載
Almost every speedup intuition is calibrated for a workload that is big, dense, throughput-bound, and continuous. This simulator is the exact opposite on every axis — that mismatch is the whole story:
幾乎所有「加速直覺」都是為「大、密、吞吐受限、連續」的工作負載校準的。這個模擬器在每一個軸上都剛好相反 —— 這個錯配就是整篇文章的核心:
| Axis面向 | Where the "obvious" tricks shine直覺技巧發光的地方 | This simulator (switch-level NES)這個模擬器(switch-level NES) |
|---|---|---|
| Work per step每步工作量 | large frontiers, dense matrices大 frontier、密集矩陣 | ~1.4 nodes change per event每個 event 平均只變 ~1.4 個節點 |
| Bound by受限於 | throughput (do lots in parallel)吞吐(大量平行) | latency (each half-cycle gates the next)延遲(每半週期卡住下一個) |
| Working set工作集 | spills cache; memory-bandwidth bound溢出快取;記憶體頻寬受限 | fits in 32 KB L1d塞進 32 KB L1d |
| Math運算 | continuous (linear algebra, floats)連續(線性代數、浮點) | discrete logic (0/1, GND-wins)離散邏輯(0/1、GND 贏) |
| Structure結構 | parallel / feed-forward / partitionable可平行 / 前饋 / 可切分 | ~94% one bidirectional feedback SCC; sequential in time~94% 一個雙向回饋 SCC;時間上序列 |
| Instances實例數 | many independent ones很多獨立的 | one (we want a single NES fast)一個(我們要單台 NES 快) |
Keep that table in mind. Every "obvious" idea below is excellent for the left column and loses on the right — and the numbers are ours, measured on this engine, not hand-waving.
記住這張表。下面每個「直覺」點子對左欄都很棒、對右欄都輸 —— 而且數字是我們自己在這個引擎上量的,不是嘴砲。
The tour — six ideas everyone proposes巡禮 —— 每個人都會提的六個點子
| "Just…"「不就…」 | Why it sounds right為何聽起來對 | What actually happened實測結果 | The real reason真正原因 |
|---|---|---|---|
| …make it a matrix & solve it…做成矩陣解掉 | SPICE does this; the dependency/ordering problem vanishes (LU ignores cycles).SPICE 就這樣做;相依/順序問題消失(LU 不在乎環)。 | ✗ it's the thing switch-level was invented to escape; the matrix-family (oblivious) measured 121× slower.這正是 switch-level 設計來逃離的;矩陣家族(oblivious)實測慢 121×。 | solving the whole system every step throws away the ~1.4-node sparsity.每步解整個系統 = 丟掉 ~1.4 節點的稀疏性。 |
| …run it on the GPU…丟 GPU 跑 | thousands of cores; "no dependency in the matrix" → compute all at once.幾千核;「矩陣裡沒相依」→ 一次全算。 | ✗ GPU (bit-sliced) backend measured ~10–18× slower.GPU(bit-sliced)後端實測慢 ~10–18×。 | latency, not throughput: the dependency relocates to per-pass & per-half-cycle barriers, and a CPU↔GPU round-trip per half-cycle dwarfs 1.4 nodes.是延遲不是吞吐:相依性搬到每 pass / 每半週期的 barrier,而每半週期一次 CPU↔GPU 來回遠大於 1.4 個節點。 |
| …use more threads…開多執行緒 | 8+ cores sitting idle; parallelize the settle.8+ 核閒著;把 settle 平行。 | ✗ per-chip threading measured ~15× slower.逐晶片多執行緒實測慢 ~15×。 | a barrier every half-cycle + ~1.4-node waves: sync cost (MESI ~40–100 ns) dwarfs the work (tens of ns). Visual NES hit the same wall in 2017.每半週期一個 barrier + ~1.4 節點的波:同步成本(MESI ~40–100 ns)遠大於工作(數十 ns)。Visual NES 2017 也撞同一面牆。 |
| …compile it (IR / codegen)…編譯成 code(IR / codegen) | native code beats interpreting a graph; surely faster.原生碼勝過直譯圖;一定更快。 | ✗ batch AOT/codegen measured 3–6× slower; HDL/GPU emit worse.批次 AOT/codegen 實測慢 3–6×;HDL/GPU emit 更慘。 | batch eval re-computes ~14,700 nodes when ~hundreds change; and the 94% bidirectional SCC won't turn into feed-forward code.批次評估在 ~幾百個變時重算 ~14,700 個;且 94% 雙向 SCC 變不成前饋碼。 |
| …use a smarter search algorithm…用更聰明的搜尋演算法 | dynamic-connectivity / direction-optimizing BFS have better big-O.dynamic-connectivity / direction-optimizing BFS 有更好的 big-O。 | ✗ bit-parallel/dense BFS 156× slower; RCM reorder ~1.0× (nothing).bit-parallel/dense BFS 慢 156×;RCM 重排 ~1.0×(沒用)。 | those win on huge frontiers; their constants + extra per-node memory crush a 2–4-node walk and break L1d residency.那些贏在大 frontier;它們的常數 + 每節點額外記憶體壓垮 2–4 節點的 walk,還破壞 L1d。 |
| …cache / memoize results…快取 / memoize 結果 | stop recomputing the same thing.別重算同樣的東西。 | ✗ generation-counter −3.9%; per-node active-counter −6%.世代計數器 −3.9%;每節點 active 計數器 −6%。 | the result is state-dependent; maintaining/keying the cache costs more than recomputing 1.4 nodes, and the extra state spills L1d.結果隨狀態變;維護/算 cache key 比重算 1.4 個節點還貴,額外狀態又溢出 L1d。 |
Two spotlights — the matrix and the GPU兩個聚光燈 —— 矩陣與 GPU
"Make it a matrix — then there's no dependency problem"「做成矩陣 —— 就沒有相依性問題了」
True: a matrix solve (SPICE-style nodal analysis, Gx=b) dissolves the ordering/SCC dependency — linear algebra doesn't care about cycles. But the dependency doesn't disappear; it relocates. You never solve a circuit in "one shot" — direct LU is itself sequential, and discrete logic is solved by relaxation (iterate to a fixpoint; our settle runs 12–45 passes), so a barrier now sits between passes instead of between nodes. And a matrix solve is oblivious — it touches every node, throwing away the fact that only ~1.4 changed. That's precisely why switch-level (Bryant, 1980s) was invented to avoid the matrix, and why our oblivious experiments ran 121× slower.
沒錯:矩陣解(SPICE 式 nodal analysis,Gx=b)化掉了順序/SCC 相依 —— 線性代數不在乎環。但相依性沒有消失,只是換了位置。你從來不是「一次」解完電路 —— 直接 LU 本身序列,離散邏輯靠 relaxation 迭代到 fixpoint(我們的 settle 跑 12–45 passes),所以 barrier 從「節點間」搬到「pass 之間」。而且矩陣解是 oblivious —— 碰每一個節點,丟掉「只有 ~1.4 個變」這件事。這正是 switch-level(Bryant, 1980s)被發明來避開矩陣的原因,也是我們 oblivious 實驗慢 121× 的原因。
"There's no dependency now — so just GPU it all at once"「現在沒相依性了 —— 那就 GPU 一次全算」
The GPU is a throughput machine: it needs thousands of independent work-items to fill its cores. Our problem is latency: each half-cycle must finish before the next can start (you can't parallelize across time for one instance), and within a half-cycle only ~1.4 nodes change. Doing the full-matrix/oblivious update on the GPU means doing ~10,000× the necessary work, plus a CPU↔GPU round-trip every half-cycle. Measured: ~10–18× slower. The GPU's real home is many independent instances (bit-slice thousands of NESes) — a different goal, and even then lane divergence forces an oblivious per-lane update, so you'd need to simulate >121 instances just to break even with one CPU core running event-driven. It still wouldn't make a single NES hit real time.
GPU 是吞吐機器:要幾千份獨立工作才餵得飽核心。我們的問題是延遲:每半週期必須先做完才能進下一個(單一 instance 無法跨時間平行),而半週期內只有 ~1.4 個節點變。在 GPU 上做矩陣全解/oblivious 更新 = 做 ~一萬倍多餘的工作,還每半週期一次 CPU↔GPU 來回。實測:慢 ~10–18×。GPU 真正的家是很多獨立 instance(bit-slice 上千台 NES)—— 那是另一個目標,而且即使如此,lane 發散逼每 lane 都 oblivious 更新,你得模擬 >121 台才追平「一個 CPU 核跑事件驅動」。而且單台 NES 仍然到不了 real time。
The one principle behind all six貫穿六者的那條原則
An optimization is not "good" or "bad" in the abstract — it's a bet on the shape of the workload. Matrices, GPUs, threads, codegen, fancy graph algorithms, and caches are all bets on big, dense, throughput-bound, parallel, continuous, many-instance. This simulator is tiny-per-event, sparse, latency-bound, sequential-in-time, discrete, single-instance. Every "obvious" idea is the right answer to a question we're not asking.
一個優化沒有抽象的「好」或「壞」—— 它是對「工作負載長相」的一個賭注。矩陣、GPU、執行緒、codegen、花俏圖演算法、快取,全都賭在大、密、吞吐受限、可平行、連續、多實例。而這個模擬器是每事件極小、稀疏、延遲受限、時間序列、離散、單實例。每個「直覺」點子,都是在回答一個我們沒在問的問題。
So what does work here? The opposite-shaped tools: shrink the hot data so it stays in L1d, keep the event-driven sparsity (only touch what changed), remove work and branches from the tiny inner loop, and — the biggest win of all — notice that ~70% of events are trivial singletons and resolve them in O(1) (our "R-1", +18%). None of those are glamorous; all of them are matched to the workload.
那這裡什麼有用?形狀相反的工具:把熱資料縮到留在 L1d、保住事件驅動的稀疏性(只碰有變的)、把工作與分支移出極小的內迴圈,以及 —— 最大的一勝 —— 發現約 70% 的 event 是 trivial singleton、用 O(1) 解掉(我們的「R-1」,+18%)。沒有一個是炫的;每一個都跟工作負載對得上。
The meta-meta-lesson: measure before you assume. Several of these were tried because they "obviously" should help; the only way to know was to build the minimal version and benchmark it (bit-exact). The ideas that survived weren't the elegant ones — they were the ones the hardware actually liked.
更上一層的教訓:假設之前先量。上面好幾個都是因為「明明該有用」才去試的;唯一知道答案的方法,是做出最小版本、benchmark(且 bit-exact)。活下來的不是優雅的點子,而是硬體真的喜歡的那些。
Where the numbers come from數字的出處
Every figure here is a measurement on this engine (AMD Ryzen 7 3700X, full_palette), not an estimate. The dead-end catalogue and per-change A/B numbers live in the repository's optimization notes; the field context (why academia has no 20-year breakthrough for this regime) is in the research landscape; how AprVisual compares to other netlist simulators is in the comparison.
這裡每個數字都是這個引擎上的實測(AMD Ryzen 7 3700X、full_palette),不是估計。死路清單與逐項 A/B 數字在 repo 的優化筆記裡;領域脈絡(為何學界對這個 regime 近 20 年沒突破)見研究現況;AprVisual 與其他 netlist 模擬器的比較見比較頁。