First, what S1 reproduces faithfully先講:S1 忠實重現了什麼
The golden core is MetalNES's, ported as-is — not claimed as ours: the connected-group BFS resolution (computeNodeGroup, L1583), the 256-entry flags→state lookup table (getNodeValue, L2005), the per-node c1c2 / gnd / pwr transistor sub-lists with early-break (addNodeToGroup, L1962–1999), the largest-capacitance float tie-break (!_group_flags → _max_state, L2017), the double-buffered settle queue (processQueue, L1536), and "a connection is a permanently-on transistor".
golden core 是 MetalNES 的、原樣移植,不算我們的:connected-group BFS 解析(computeNodeGroup,L1583)、256 項 flags→state 查表(getNodeValue,L2005)、每節點 c1c2 / gnd / pwr 電晶體子列表含 early-break(addNodeToGroup,L1962–1999)、最大電容 float tie-break(!_group_flags → _max_state,L2017)、double-buffered 收斂佇列(processQueue,L1536),以及「connection 就是一顆永遠導通的電晶體」。
A. Skip work省掉工作量 — ⚡
| S1 extensionS1 延伸 | What it does & vs MetalNES作用 & 對比 MetalNES |
|---|---|
| ⚡ Pure-logic fast-path純邏輯 fast-path | ~27% of nodes provably form a singleton group → resolved O(1), skipping the BFS. MetalNES recalcNode (L1866) calls computeNodeGroup for every node — no fast-path.~27% 節點必為單節點 group → O(1) 解析、跳過 BFS。MetalNES recalcNode(L1866)對每個節點都呼叫 computeNodeGroup,無 fast-path。 |
| ⚡ S1.5 lowering pre-passS1.5 lowering 前置 | Before simulating: union-find merge of always-on shorts, drop dead (gate=GND) transistors, dedup + dense compaction — 15,164→14,723 nodes, 27,305→26,775 transistors. MetalNES has no lowering; it simulates the raw assembled netlist.模擬前:union-find 合併永遠導通短路、去 gate=GND 死電晶體、dedup + 緻密重編號 —— 15,164→14,723 節點、27,305→26,775 電晶體。MetalNES 無 lowering,直接算組裝後原始 netlist。 |
| ⚡ O(1) in-group dedupO(1) group 去重 | A per-node _inGroup flag (cleared only on touched entries). MetalNES (L1934–1938) linearly scans the whole current _group on every node add.每節點一個 _inGroup 旗標(只清碰過的)。MetalNES(L1934–1938)每加一節點都線性掃描整個 _group。 |
| ⚡ Deferred capacitance read延後讀電容值 | The capacitance field is read only in the rare floating branch (<1% of walks). MetalNES (L1944–1947) updates max-capacitance/state on every node add. +12% on C#.電容欄位只在罕見的 floating 分支讀(<1% 的 walk)。MetalNES(L1944–1947)每加一節點都更新 max-capacitance/state。C# +12%。 |
| ⚡ Build-time pruningbuild 期剪枝 | When flattening the transistor lists, gate==GND transistors (can never conduct) are skipped outright, so they never appear in the hot walk.攤平電晶體列表時,gate==GND(永不導通)的電晶體直接跳過,熱迴圈裡根本不會出現。 |
Is the fast-path maxed out? Probably not — but it's delicate. It statically covers 26.7% of nodes, yet at runtime ~70% of all recalcs resolve to a singleton group — the other ~51.5% are dynamically singleton (a node whose pass-transistors all happen to be off this half-cycle). Fast-pathing those too is the obvious next step. The catch is twofold: (1) detecting it adds a branch to the very hottest loop, which can be a net loss — the same change was +0.4–1.3% on C# but −1.9–2.5% on Rust when we widened the static fast-path (see the compiler-divergence note); and (2) the floating "hold-previous" resolution is easy to get subtly wrong. So there's likely headroom here, but only via a bit-exact, per-engine-measured change — not a free win. fast-path 到頂了嗎?大概還沒 —— 但很細膩。它靜態覆蓋 26.7% 的節點,但執行期約 70% 的 recalc 結果其實是 singleton —— 另外約 51.5% 是動態 singleton(某節點的 pass-transistor 這個半週期剛好全關)。把這些也納入 fast-path 是顯而易見的下一步。難處有二:(1)偵測它要在最熱的迴圈多加一個分支,可能淨負 —— 我們放寬靜態 fast-path 時,同一改動在 C# +0.4–1.3%、在 Rust 卻 −1.9–2.5%(見編譯器分歧那段);(2)floating「維持前值」的解析極易寫錯。所以這裡很可能有空間,但只能透過位元精確、逐引擎實測的改動取得 —— 不是免費的午餐。
B. Shrink the hot data縮小熱資料 — 🧱 (the single biggest lever)
| S1 extensionS1 延伸 | What it does & vs MetalNES作用 & 對比 MetalNES |
|---|---|
| 🧱 Hot/cold NodeInfo split (SoA)hot/cold NodeInfo 拆分 | The hot record is exactly 16 bytes (¼ cache line): flags + the 3 channel sub-list indices. Cold fields (capacitance, fanout-gate list) live in separate arrays. MetalNES keeps everything in one nodeinfo struct.熱 record 剛好 16 bytes(¼ cache line):flags + 3 個通道子列表索引。冷欄位(電容、fanout gate 列表)放獨立陣列。MetalNES 全擠在一個 nodeinfo struct。 |
| 🧱 ushort transistor listushort 電晶體列表 | Node ids are <65K, so the flattened adjacency uses ushort — the hottest array shrinks 697KB→350KB. MetalNES uses transistorID (int).node id <65K,攤平的 adjacency 用 ushort —— 最熱陣列 697KB→350KB。MetalNES 用 transistorID(int)。 |
| 🧱 byte recalc hashbyte recalc hash | The "already-queued" marker is a byte (0/1): 58KB→14KB, fits L1d alongside the node states. MetalNES uses std::vector<int> (L167).「已入列」標記用 byte(0/1):58KB→14KB,能與 node states 一起塞進 L1d。MetalNES 用 std::vector<int>(L167)。 |
| 🧱 Unmanaged aligned arraysunmanaged 對齊陣列 | All hot arrays are NativeMemory.AlignedAlloc'd — no GC, controlled alignment, straight pointer loads. MetalNES uses std::vector.所有熱陣列用 NativeMemory.AlignedAlloc —— 零 GC、可控對齊、純指標讀取。MetalNES 用 std::vector。 |
| 🧱 Direct callback tablecallback 直查表 | The callback branch indexes a flat _callbackByNode[] array instead of hopping through the managed Nodes[nn].Callback object graph (MetalNES: _nodes[gnn]->callback, L1901).callback 分支直接索引扁平的 _callbackByNode[] 陣列,不走受管 Nodes[nn].Callback 物件圖(MetalNES:_nodes[gnn]->callback,L1901)。 |
C. Tighten the hot loop收緊熱迴圈 — ⚡
| S1 extensionS1 延伸 | What it does & vs MetalNES作用 & 對比 MetalNES |
|---|---|
| ⚡ setNodeState reworksetNodeState 重寫 | Loop-unswitch on the new state (the gate-high case has no c2-enqueue at all), inline enqueue (queue cursors hoisted to locals, no function call), and c1 skips the supply check (normalized at build). MetalNES setNodeState (L1597) re-reads the node, calls enqueueNode, checks is_pwr_gnd on c1, no unswitch. This is the highest-leverage path — setNodeState runs ~10× more often than recalcNode.依新狀態做 loop-unswitch(gate-high 那支完全不 enqueue c2)、inline enqueue(佇列游標提到 locals、不走函式呼叫)、c1 跳過 supply 檢查(build 期已正規化)。MetalNES setNodeState(L1597)每次重讀節點、呼叫 enqueueNode、c1 仍查 is_pwr_gnd、無 unswitch。這是槓桿最大的路徑 —— setNodeState 跑得比 recalcNode 多約 10 倍。 |
| ⚡ Iterative BFS (C#)iterative BFS(C#) | The group walk is rewritten iteratively so the .NET JIT inlines the whole chain (+~3% C#). MetalNES's addNodeToGroup is recursive (self-call at L1970). Rust keeps it recursive — LLVM already inlines that well.group walk 改成 iterative,讓 .NET JIT inline 整條鏈(C# +~3%)。MetalNES addNodeToGroup 是遞迴(L1970 自呼叫)。Rust 維持遞迴 —— LLVM 本來就 inline 得好。 |
| 🔬 Hard settle cap收斂硬上限 | A non-converging region aborts after 128 passes and clears the queue, so it can never hang the run. MetalNES only prints a warning at 100 passes (L1544) with no hard cap.不收斂的區域跑 128 次就中止 + 清隊列,絕不會卡死整次模擬。MetalNES 只在 100 次印 warning(L1544),無硬上限。 |
D. Engine, verification & handlers引擎、驗證與 handler — 🧪 / ⚡
| S1 extensionS1 延伸 | What it does & vs MetalNES作用 & 對比 MetalNES |
|---|---|
| 🧪 Twin C# + Rust enginesC# + Rust 雙引擎 | Two independent codegens kept bit-identical (same checksum) — cross-validation and a built-in performance comparison; the Rust side runs from a precomputed snapshot. MetalNES is a single C++ engine.兩個獨立 codegen 保持位元完全相同(checksum 一致)—— 互相驗證 + 內建效能對照;Rust 端從預算快照啟動。MetalNES 是單一 C++ 引擎。 |
| ⚡ Batch-settle memory handlers記憶體 handler batch-settle | A memory read drives the whole data bus by enqueuing every bit (SetHigh/Low-queued), then settles once — rather than settling per bit.記憶體讀取時把整條 data bus 的每個 bit 都 enqueue(SetHigh/Low-queued),迴圈後一次收斂 —— 而非逐 bit 收斂。 |
E. Fidelity extensions保真延伸 — 🔬 (not speed, but real design deltas)
| S1 extensionS1 延伸 | What it does & vs MetalNES作用 & 對比 MetalNES |
|---|---|
| 🔬 Keep pull-downs保留 pull-down | Both '+' (pull-up) and '-' (pull-down) segdefs are kept. MetalNES kept only '+'.segdef 的 '+'(pull-up)與 '-'(pull-down)都保留。MetalNES 只留 '+'。 |
| 🔬 Use the weak flag使用 weak 旗標 | The 7th transdefs column (weak / depletion-load) is actually used to separate strong pull-down from weak pull-up. MetalNES reads it but barely uses it.transdefs 第 7 欄(weak / depletion-load)真的拿來區分 strong pull-down 與 weak pull-up。MetalNES 讀了卻幾乎沒用。 |
| 🔬 NodeValue enumNodeValue 列舉 | A 4-state result (Low / High / HoldPrevious / Undefined) replaces MetalNES's bool. Honest caveat: the hot path still resolves via the byte LUT and behaves identically to MetalNES today — HoldPrevious is a planned semantic refinement, not yet a live branch.用 4 態結果(Low / High / HoldPrevious / Undefined)取代 MetalNES 的 bool。誠實說明:熱路徑目前仍經 byte LUT、行為與 MetalNES 一致 —— HoldPrevious 是規劃中的語意精修,還沒實際分流。 |
In one sentence一句話總結
The eight performance-effective extensions cluster in three places — skip work (fast-path, lowering, build-prune), shrink the hot data (hot/cold split, ushort/byte, unmanaged SoA), and tighten the hot loop (setNodeState rework, iterative-BFS-for-JIT). This independently matches what the Visual NES author found in 2017: the wins come from less work + smaller data + tighter codegen, not from higher-level data structures or cleverer algorithms.
八項對效能有效的延伸集中在三處 —— 省工(fast-path、lowering、build 剪枝)、縮熱資料(hot/cold 拆分、ushort/byte、unmanaged SoA)、收緊熱迴圈(setNodeState 重寫、iterative-BFS 解鎖 JIT)。這獨立印證了 Visual NES 作者 2017 年的結論:勝利來自更少工作 + 更小資料 + 更緊 codegen,不是更高階的資料結構或更聰明的演算法。
Where these ideas come from — credits & further reading這些想法的源頭 —— 出處與延伸閱讀
Almost none of these techniques are ours. We applied (and often failed with) ideas from a long line of research and practice. Whether a strategy won or lost for us, credit belongs to its originators. References below are ones we're confident of; a couple of attributions are marked where the canonical source is fuzzy.
這些技術幾乎沒有一個是我們原創的。我們只是套用(且常常失敗)了一長串研究與實務的點子。無論某個策略對我們是賺是賠,功勞都屬於原創者。以下是我們有把握的引用;少數來源較模糊的會標注。
| Used here for…我們用在… | Origin源頭 |
|---|---|
| The whole switch-level model (transistors as switches, settle by relaxation)整個開關級模型(電晶體=開關、鬆弛收斂) | R. E. Bryant, "A Switch-Level Model and Simulator for MOS Digital Systems", IEEE Trans. Computers C-33(2), 1984 (from his 1981 MIT PhD work, MOSSIM). |
| The settle loop = worklist / event-driven fixpoint iterationsettle 迴圈 = worklist / 事件驅動 fixpoint 迭代 | G. A. Kildall, "A Unified Approach to Global Program Optimization", POPL 1973 (iterative dataflow to a fixpoint); plus classic discrete-event simulation. |
| S1.5 lowering — merging always-on shorts into one nodeS1.5 lowering — 合併永遠導通短路成一個節點 | Union-Find / disjoint-set: B. A. Galler & M. J. Fisher, CACM 7(5), 1964; near-linear analysis (path compression + union by rank) R. E. Tarjan, JACM 22(2), 1975. |
| Loop / cycle detection (the S2 IR stage)迴路 / 環偵測(S2 IR 階段) | R. E. Tarjan, "Depth-First Search and Linear Graph Algorithms", SIAM J. Computing 1(2), 1972 (linear-time strongly-connected components). |
| Hot/cold split + Structure-of-Arrays (the biggest real win)hot/cold 拆分 + Structure-of-Arrays(最大的真實勝利) | Data-Oriented Design — M. Acton, "Data-Oriented Design and C++", CppCon 2014 (the canonical modern talk; SoA itself is a long-standing vectorization idea, not a single paper). |
| Knowing memory latency — not compute — is the wall認清瓶頸是記憶體延遲、不是運算 | Wm. A. Wulf & S. A. McKee, "Hitting the Memory Wall: Implications of the Obvious", ACM SIGARCH CAN 23(1), 1995. |
| RCM cache-locality renumbering (failed for us)RCM cache 區域性重編號 (對我們失敗) | E. Cuthill & J. McKee, "Reducing the Bandwidth of Sparse Symmetric Matrices", ACM Nat. Conf. 1969; the reverse (RCM) variant is attributed to J. A. George, 1971. |
| Bit-parallel / dense-scan BFS (failed for us)bit-parallel / dense-scan BFS (對我們失敗) | S. Beamer, K. Asanović, D. Patterson, "Direction-Optimizing Breadth-First Search", SC 2012; the Ligra framework, J. Shun & G. E. Blelloch, PPoPP 2013. Bit-packed boolean-graph roots: the "Four Russians" method (Arlazarov, Dinic, Kronrod, Faradzhev, 1970). [bit-packing root — approximate][bit-packing 根源 — 近似] |
| Profile-guided optimization (PGO), as Visual NES also usedPGO(Visual NES 也用過) | K. Pettis & R. C. Hansen, "Profile Guided Code Positioning", PLDI 1990. |
If we've mis-credited anything, please tell us via the feedback form — we'd rather fix it.
若有任何誤標,請透過意見回饋表單告訴我們 —— 我們很樂意更正。