AprVisual S1 — design extensions over MetalNES

First, what S1 reproduces faithfully先講:S1 忠實重現了什麼

The golden core is MetalNES's, ported as-is — not claimed as ours: the connected-group BFS resolution (computeNodeGroup, L1583), the 256-entry flags→state lookup table (getNodeValue, L2005), the per-node c1c2 / gnd / pwr transistor sub-lists with early-break (addNodeToGroup, L1962–1999), the largest-capacitance float tie-break (!_group_flags → _max_state, L2017), the double-buffered settle queue (processQueue, L1536), and "a connection is a permanently-on transistor".

golden core 是 MetalNES 的、原樣移植,不算我們的:connected-group BFS 解析(computeNodeGroup,L1583)、256 項 flags→state 查表(getNodeValue,L2005)、每節點 c1c2 / gnd / pwr 電晶體子列表含 early-break(addNodeToGroup,L1962–1999)、最大電容 float tie-break(!_group_flags → _max_state,L2017)、double-buffered 收斂佇列(processQueue,L1536),以及「connection 就是一顆永遠導通的電晶體」。

⚡ speed效能 🧱 memory / cache記憶體 / cache 🔬 fidelity保真 🧪 engine / method引擎 / 方法

A. Skip work省掉工作量 — ⚡

S1 extensionS1 延伸	What it does & vs MetalNES作用 & 對比 MetalNES
⚡ Pure-logic fast-path純邏輯 fast-path	~27% of nodes provably form a singleton group → resolved O(1), skipping the BFS. MetalNES `recalcNode` (L1866) calls `computeNodeGroup` for every node — no fast-path.~27% 節點必為單節點 group → O(1) 解析、跳過 BFS。MetalNES `recalcNode`(L1866)對每個節點都呼叫 `computeNodeGroup`,無 fast-path。
⚡ S1.5 lowering pre-passS1.5 lowering 前置	Before simulating: union-find merge of always-on shorts, drop dead (gate=GND) transistors, dedup + dense compaction — 15,164→14,723 nodes, 27,305→26,775 transistors. MetalNES has no lowering; it simulates the raw assembled netlist.模擬前:union-find 合併永遠導通短路、去 gate=GND 死電晶體、dedup + 緻密重編號 —— 15,164→14,723 節點、27,305→26,775 電晶體。MetalNES 無 lowering,直接算組裝後原始 netlist。
⚡ O(1) in-group dedupO(1) group 去重	A per-node `_inGroup` flag (cleared only on touched entries). MetalNES (L1934–1938) linearly scans the whole current `_group` on every node add.每節點一個 `_inGroup` 旗標(只清碰過的)。MetalNES(L1934–1938)每加一節點都線性掃描整個 `_group`。
⚡ Deferred capacitance read延後讀電容值	The capacitance field is read only in the rare floating branch (<1% of walks). MetalNES (L1944–1947) updates max-capacitance/state on every node add. +12% on C#.電容欄位只在罕見的 floating 分支讀(<1% 的 walk)。MetalNES(L1944–1947)每加一節點都更新 max-capacitance/state。C# +12%。
⚡ Build-time pruningbuild 期剪枝	When flattening the transistor lists, gate==GND transistors (can never conduct) are skipped outright, so they never appear in the hot walk.攤平電晶體列表時,gate==GND(永不導通)的電晶體直接跳過,熱迴圈裡根本不會出現。

Is the fast-path maxed out? Probably not — but it's delicate. It statically covers 26.7% of nodes, yet at runtime ~70% of all recalcs resolve to a singleton group — the other ~51.5% are dynamically singleton (a node whose pass-transistors all happen to be off this half-cycle). Fast-pathing those too is the obvious next step. The catch is twofold: (1) detecting it adds a branch to the very hottest loop, which can be a net loss — the same change was +0.4–1.3% on C# but −1.9–2.5% on Rust when we widened the static fast-path (see the compiler-divergence note); and (2) the floating "hold-previous" resolution is easy to get subtly wrong. So there's likely headroom here, but only via a bit-exact, per-engine-measured change — not a free win. fast-path 到頂了嗎?大概還沒 —— 但很細膩。它靜態覆蓋 26.7% 的節點,但執行期約 70% 的 recalc 結果其實是 singleton —— 另外約 51.5% 是動態 singleton(某節點的 pass-transistor 這個半週期剛好全關)。把這些也納入 fast-path 是顯而易見的下一步。難處有二:(1)偵測它要在最熱的迴圈多加一個分支,可能淨負 —— 我們放寬靜態 fast-path 時,同一改動在 C# +0.4–1.3%、在 Rust 卻 −1.9–2.5%(見編譯器分歧那段);(2)floating「維持前值」的解析極易寫錯。所以這裡很可能有空間,但只能透過位元精確、逐引擎實測的改動取得 —— 不是免費的午餐。

B. Shrink the hot data縮小熱資料 — 🧱 (the single biggest lever)

S1 extensionS1 延伸	What it does & vs MetalNES作用 & 對比 MetalNES
🧱 Hot/cold NodeInfo split (SoA)hot/cold NodeInfo 拆分	The hot record is exactly 16 bytes (¼ cache line): flags + the 3 channel sub-list indices. Cold fields (capacitance, fanout-gate list) live in separate arrays. MetalNES keeps everything in one `nodeinfo` struct.熱 record 剛好 16 bytes(¼ cache line):flags + 3 個通道子列表索引。冷欄位(電容、fanout gate 列表)放獨立陣列。MetalNES 全擠在一個 `nodeinfo` struct。
🧱 ushort transistor listushort 電晶體列表	Node ids are <65K, so the flattened adjacency uses `ushort` — the hottest array shrinks 697KB→350KB. MetalNES uses `transistorID` (int).node id <65K,攤平的 adjacency 用 `ushort` —— 最熱陣列 697KB→350KB。MetalNES 用 `transistorID`(int)。
🧱 byte recalc hashbyte recalc hash	The "already-queued" marker is a `byte` (0/1): 58KB→14KB, fits L1d alongside the node states. MetalNES uses `std::vector<int>` (L167).「已入列」標記用 `byte`(0/1):58KB→14KB,能與 node states 一起塞進 L1d。MetalNES 用 `std::vector<int>`(L167)。
🧱 Unmanaged aligned arraysunmanaged 對齊陣列	All hot arrays are `NativeMemory.AlignedAlloc`'d — no GC, controlled alignment, straight pointer loads. MetalNES uses `std::vector`.所有熱陣列用 `NativeMemory.AlignedAlloc` —— 零 GC、可控對齊、純指標讀取。MetalNES 用 `std::vector`。
🧱 Direct callback tablecallback 直查表	The callback branch indexes a flat `_callbackByNode[]` array instead of hopping through the managed `Nodes[nn].Callback` object graph (MetalNES: `_nodes[gnn]->callback`, L1901).callback 分支直接索引扁平的 `_callbackByNode[]` 陣列,不走受管 `Nodes[nn].Callback` 物件圖(MetalNES:`_nodes[gnn]->callback`,L1901)。

C. Tighten the hot loop收緊熱迴圈 — ⚡

S1 extensionS1 延伸	What it does & vs MetalNES作用 & 對比 MetalNES
⚡ setNodeState reworksetNodeState 重寫	Loop-unswitch on the new state (the gate-high case has no c2-enqueue at all), inline enqueue (queue cursors hoisted to locals, no function call), and c1 skips the supply check (normalized at build). MetalNES `setNodeState` (L1597) re-reads the node, calls `enqueueNode`, checks `is_pwr_gnd` on c1, no unswitch. This is the highest-leverage path — setNodeState runs ~10× more often than recalcNode.依新狀態做 loop-unswitch(gate-high 那支完全不 enqueue c2)、inline enqueue(佇列游標提到 locals、不走函式呼叫)、c1 跳過 supply 檢查(build 期已正規化)。MetalNES `setNodeState`(L1597)每次重讀節點、呼叫 `enqueueNode`、c1 仍查 `is_pwr_gnd`、無 unswitch。這是槓桿最大的路徑 —— setNodeState 跑得比 recalcNode 多約 10 倍。
⚡ Iterative BFS (C#)iterative BFS(C#)	The group walk is rewritten iteratively so the .NET JIT inlines the whole chain (+~3% C#). MetalNES's `addNodeToGroup` is recursive (self-call at L1970). Rust keeps it recursive — LLVM already inlines that well.group walk 改成 iterative,讓 .NET JIT inline 整條鏈(C# +~3%)。MetalNES `addNodeToGroup` 是遞迴(L1970 自呼叫)。Rust 維持遞迴 —— LLVM 本來就 inline 得好。
🔬 Hard settle cap收斂硬上限	A non-converging region aborts after 128 passes and clears the queue, so it can never hang the run. MetalNES only prints a warning at 100 passes (L1544) with no hard cap.不收斂的區域跑 128 次就中止 + 清隊列,絕不會卡死整次模擬。MetalNES 只在 100 次印 warning(L1544),無硬上限。

D. Engine, verification & handlers引擎、驗證與 handler — 🧪 / ⚡

S1 extensionS1 延伸	What it does & vs MetalNES作用 & 對比 MetalNES
🧪 Twin C# + Rust enginesC# + Rust 雙引擎	Two independent codegens kept bit-identical (same checksum) — cross-validation and a built-in performance comparison; the Rust side runs from a precomputed snapshot. MetalNES is a single C++ engine.兩個獨立 codegen 保持位元完全相同(checksum 一致)—— 互相驗證 + 內建效能對照;Rust 端從預算快照啟動。MetalNES 是單一 C++ 引擎。
⚡ Batch-settle memory handlers記憶體 handler batch-settle	A memory read drives the whole data bus by enqueuing every bit (SetHigh/Low-queued), then settles once — rather than settling per bit.記憶體讀取時把整條 data bus 的每個 bit 都 enqueue(SetHigh/Low-queued),迴圈後一次收斂 —— 而非逐 bit 收斂。

E. Fidelity extensions保真延伸 — 🔬 (not speed, but real design deltas)

S1 extensionS1 延伸	What it does & vs MetalNES作用 & 對比 MetalNES
🔬 Keep pull-downs保留 pull-down	Both `'+'` (pull-up) and `'-'` (pull-down) segdefs are kept. MetalNES kept only `'+'`.segdef 的 `'+'`(pull-up)與 `'-'`(pull-down)都保留。MetalNES 只留 `'+'`。
🔬 Use the weak flag使用 weak 旗標	The 7th transdefs column (weak / depletion-load) is actually used to separate strong pull-down from weak pull-up. MetalNES reads it but barely uses it.transdefs 第 7 欄(weak / depletion-load)真的拿來區分 strong pull-down 與 weak pull-up。MetalNES 讀了卻幾乎沒用。
🔬 NodeValue enumNodeValue 列舉	A 4-state result (Low / High / HoldPrevious / Undefined) replaces MetalNES's bool. Honest caveat: the hot path still resolves via the byte LUT and behaves identically to MetalNES today — HoldPrevious is a planned semantic refinement, not yet a live branch.用 4 態結果(Low / High / HoldPrevious / Undefined)取代 MetalNES 的 bool。誠實說明:熱路徑目前仍經 byte LUT、行為與 MetalNES 一致 —— HoldPrevious 是規劃中的語意精修,還沒實際分流。

In one sentence一句話總結

The eight performance-effective extensions cluster in three places — skip work (fast-path, lowering, build-prune), shrink the hot data (hot/cold split, ushort/byte, unmanaged SoA), and tighten the hot loop (setNodeState rework, iterative-BFS-for-JIT). This independently matches what the Visual NES author found in 2017: the wins come from less work + smaller data + tighter codegen, not from higher-level data structures. The biggest "less work" wins since — R-1 and the P-1 → P-4 event-count prunes — sharpen rather than contradict that: they are algorithmic (graph-level), but in the same spirit — provably doing less, never a cleverer data structure.

八項對效能有效的延伸集中在三處 —— 省工(fast-path、lowering、build 剪枝)、縮熱資料(hot/cold 拆分、ushort/byte、unmanaged SoA)、收緊熱迴圈(setNodeState 重寫、iterative-BFS 解鎖 JIT)。這獨立印證了 Visual NES 作者 2017 年的結論:勝利來自更少工作 + 更小資料 + 更緊 codegen,不是更高階的資料結構。之後最大的幾個「少做事」勝利 —— R-1 與 P-1 → P-4 事件數剪枝 —— 是強化而非推翻這點:它們確實是演算法層(圖層級)的,但精神一致 —— 可證明地少做,而非更聰明的資料結構。

Where these ideas come from — credits & further reading這些想法的源頭 —— 出處與延伸閱讀

Almost none of these techniques are ours. We applied (and often failed with) ideas from a long line of research and practice. Whether a strategy won or lost for us, credit belongs to its originators. References below are ones we're confident of; a couple of attributions are marked where the canonical source is fuzzy.

這些技術幾乎沒有一個是我們原創的。我們只是套用(且常常失敗)了一長串研究與實務的點子。無論某個策略對我們是賺是賠,功勞都屬於原創者。以下是我們有把握的引用;少數來源較模糊的會標注。

Used here for…我們用在…	Origin源頭
The whole switch-level model (transistors as switches, settle by relaxation)整個開關級模型(電晶體=開關、鬆弛收斂)	R. E. Bryant, "A Switch-Level Model and Simulator for MOS Digital Systems", IEEE Trans. Computers C-33(2), 1984 (from his 1981 MIT PhD work, MOSSIM).
The settle loop = worklist / event-driven fixpoint iterationsettle 迴圈 = worklist / 事件驅動 fixpoint 迭代	G. A. Kildall, "A Unified Approach to Global Program Optimization", POPL 1973 (iterative dataflow to a fixpoint); plus classic discrete-event simulation.
S1.5 lowering — merging always-on shorts into one nodeS1.5 lowering — 合併永遠導通短路成一個節點	Union-Find / disjoint-set: B. A. Galler & M. J. Fisher, CACM 7(5), 1964; near-linear analysis (path compression + union by rank) R. E. Tarjan, JACM 22(2), 1975.
Loop / cycle detection (the S2 IR stage)迴路 / 環偵測(S2 IR 階段)	R. E. Tarjan, "Depth-First Search and Linear Graph Algorithms", SIAM J. Computing 1(2), 1972 (linear-time strongly-connected components).
Hot/cold split + Structure-of-Arrays (the biggest real win)hot/cold 拆分 + Structure-of-Arrays(最大的真實勝利)	Data-Oriented Design — M. Acton, "Data-Oriented Design and C++", CppCon 2014 (the canonical modern talk; SoA itself is a long-standing vectorization idea, not a single paper).
Knowing memory latency — not compute — is the wall認清瓶頸是記憶體延遲、不是運算	Wm. A. Wulf & S. A. McKee, "Hitting the Memory Wall: Implications of the Obvious", ACM SIGARCH CAN 23(1), 1995.
RCM cache-locality renumbering (failed for us — though renumbering itself later shipped +3.6% under a different objective)RCM cache 區域性重編號 (對我們失敗 —— 但重編號本身後來換了目標函數,以 +3.6% 出貨)	E. Cuthill & J. McKee, "Reducing the Bandwidth of Sparse Symmetric Matrices", ACM Nat. Conf. 1969; the reverse (RCM) variant is attributed to J. A. George, 1971.
Bit-parallel / dense-scan BFS (failed for us)bit-parallel / dense-scan BFS (對我們失敗)	S. Beamer, K. Asanović, D. Patterson, "Direction-Optimizing Breadth-First Search", SC 2012; the Ligra framework, J. Shun & G. E. Blelloch, PPoPP 2013. Bit-packed boolean-graph roots: the "Four Russians" method (Arlazarov, Dinic, Kronrod, Faradzhev, 1970). [bit-packing root — approximate][bit-packing 根源 — 近似]
Profile-guided optimization (PGO), as Visual NES also usedPGO(Visual NES 也用過)	K. Pettis & R. C. Hansen, "Profile Guided Code Positioning", PLDI 1990.

If we've mis-credited anything, please tell us via the feedback form — we'd rather fix it.

若有任何誤標,請透過意見回饋表單告訴我們 —— 我們很樂意更正。