One algorithm, three generations

Where AprVisual comes from: the visual6502 → MetalNES → AprVisual.S1 lineage, what each generation added, a measured four-way comparison of the whole family on one machine, and the original contributions we put on top. In one sentence: visual6502 invented the algorithm, MetalNES composed it into a NES, and AprVisual turned it into a bit-verifiable research engine that is ~2.5× faster and adds original techniques of its own.

The family tree

Everything below descends from one piece of JavaScript — visual6502's chipsim.js — but the branches differ wildly in how faithful (or ambitious) they are. That alone explains most of the performance spread.

Visual6502  chipsim.js  (JavaScript, MIT) ─── the common ancestor
├── Visual 2A03 / 2C02 (JS, Quietust) ─ chipsim.js applied to the two NES chips
│      └── VisualNes (C++/C#) ──── a line-by-line port of those two JS sims, wired into a NES
├── perfect6502 (C) ────────────── an optimized REWRITE of chipsim.js, 6502 only
└── MetalNES (C++) ─────────────── a re-engineered descendant (LUT/handlers/modules/analog), whole NES
        └── AprVisual S1 (C#) ──── our port of MetalNES + our own optimization program
                └── rust-s1 (Rust) ─ a bit-exact twin of S1
| Project | Relationship to the JS | Evidence |
| --- | --- | --- |
| VisualNes | line-by-line port | chipsim.cpp keeps the JS inefficiencies (O(n²) find() dedup) and even JS leftovers (eval(readTriggers[a]) as comments) |
| perfect6502 | optimized rewrite (6502 only) | README: "derived from the JavaScript visual6502 implementation"; rebuilt in C with bitmap states + precomputed dependant lists |
| MetalNES | re-engineered descendant | adds things chipsim.js never had: the 256-entry LUT, behavioral handlers, a module system, analog video/audio ladders |
| AprVisual S1 | MetalNES port + our optimizations | our SetNodeState/ProcessQueue/tlist layout = MetalNES's pre-prune prototype, function-for-function; then R-1, P-1→P-4, range-prune, the self-captured key, B1… |

The shared semantics all generations run: event-driven settle; per event, build the conducting group (BFS over ON pass transistors), OR the members' flags, resolve by priority (GND wins → VCC/pull-up → external drive → hold); a purely-floating group resolves to the largest-capacitance member's previous state. In the literature this is Bryant's MOSSIM II model minus the X state — see where our work sits in the literature.
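
The resolution rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from any of the engines: the flag constants, the `channels` adjacency format, and the function names are all hypothetical, and power rails are modeled as flags on nodes rather than as nodes of their own.

```python
from collections import deque

# Hypothetical flag bits; the real engines use their own encodings.
GND, VCC, PULL_UP, EXT_HIGH, EXT_LOW = 1, 2, 4, 8, 16


def conducting_group(start, channels, gate_is_on):
    """BFS from `start` across pass transistors whose gate is currently ON.

    `channels` maps a node to a list of (gate_node, other_node) pairs.
    """
    group = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for gate, other in channels.get(node, ()):
            if gate_is_on(gate) and other not in group:
                group.add(other)
                queue.append(other)
    return group


def resolve(group, flags, capacitance, previous_state):
    """OR the members' flags, then pick the group's level by priority."""
    merged = 0
    for node in group:
        merged |= flags.get(node, 0)
    if merged & GND:                 # GND wins
        return False
    if merged & (VCC | PULL_UP):     # then VCC / pull-up
        return True
    if merged & EXT_HIGH:            # then external drive
        return True
    if merged & EXT_LOW:
        return False
    # Purely floating: hold the largest-capacitance member's previous state.
    keeper = max(group, key=lambda node: capacitance.get(node, 0))
    return previous_state[keeper]


# Three nodes in a chain; gate 10 is ON, gate 11 is OFF.
channels = {1: [(10, 2)], 2: [(10, 1), (11, 3)], 3: [(11, 2)]}
state = {1: False, 2: True, 3: True, 10: True, 11: False}
group = conducting_group(1, channels, lambda gate: state[gate])
floating_level = resolve(group, {}, {1: 1, 2: 5}, state)
```

In this toy netlist the group seen from node 1 is {1, 2}, because gate 11 is OFF and cuts node 3 away. With no flags set, the group floats and takes node 2's previous state, since node 2 has the larger capacitance.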

What MetalNES added over visual6502

The measured four-way comparison (one machine, one ROM, one unit)

We didn't take anyone's word for it: we compiled and benchmarked the whole family ourselves (2026-06-08, Ryzen 7 3700X, full_palette.nes, NES master-clock half-cycles). VisualNes builds cleanly headless; MetalNES needed real surgery: it is a macOS Metal GUI app, so we stubbed the GUI/raster/audio layers and drove the wire core directly, measuring engine + logic + RAM/ROM/bus, the same scope as our benchmark.

| Project | Scope | Lang | hc/s | Algorithmic character |
| --- | --- | --- | --- | --- |
| VisualNes | whole NES | C++ | ~24K | literal chipsim.js port: O(n²) group dedup, vector+shared_ptr in the hot path, zero prunes; the "unoptimized chipsim" baseline |
| MetalNES | whole NES | C++ | ~54K | our direct ancestor: flags-OR→256-LUT, double-buffered waves, single-sided turn-on, but no prunes and std::vector data |
| perfect6502 | 6502 only | C | ~29K * | a genuinely optimized chipsim rewrite: bitmap states, precomputed dependant lists, single-sided turn-on, but no charge-hold model at all |
| AprVisual S1 | whole NES | C# | ~108K (2026-06-08); ~135.9K (now) | the MetalNES algorithm + R-1 dynamic singleton + P-1→P-4 prunes + unmanaged SoA layout at the time of the comparison; since extended with range-prune, the self-captured locality key and the B1 pair path; bit-exact (golden checksums + a 10M-half-cycle SMB1 gate) throughout, ~316× from real time |
| rust-s1 | whole NES | Rust | ~118.5K (now) | an independently implemented, bit-identical twin of S1 (same goldens); the per-platform adoption record (e.g. B1: +7–9% on C# at boost vs +14.5% on Rust) doubles as cross-engine validation |

* perfect6502's "hc" is a 6502-clock half-cycle; ours is a NES master-clock half-cycle (the 6502 runs at master÷12). The units differ by an order of magnitude, so its 29K must not be ranked against the whole-console rows — it sits on a separate (CPU-only) branch. See the unit conversions.

The cleanest finding is a progress bar: 24K (VisualNes — unoptimized chipsim) → ~54K (MetalNES — a better-engineered ancestor) → ~108K (S1 at the time of the comparison). MetalNES→S1 = +100% on the same algorithm, same scope, same unit, same machine, bit-exact — that delta is our prune family + dynamic fast-path + data layout. Nothing in any sibling could be "carried over to beat us"; instead all three independently confirmed our lineage is faithful and our optimizations lead. (S1 has since reached ~135.9K with the range-prune + self-captured key + B1 pair path — ~2.5× the ancestor.)
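
The ratios quoted above are plain arithmetic on the table's hc/s column, and can be checked with a short script. One value here is an assumption not stated in the text: the real console's master clock is taken as the NTSC figure of 21,477,272 Hz, so real hardware runs twice that many master-clock half-cycles per second.

```python
# Assumption: NTSC NES master clock; the page does not state the figure.
NTSC_MASTER_HZ = 21_477_272
REAL_HALF_CYCLES_PER_S = 2 * NTSC_MASTER_HZ

# Whole-console rows only (same scope, same unit), in half-cycles per second.
RATES = {
    "VisualNes": 24_000,
    "MetalNES": 54_000,
    "S1 (2026-06-08)": 108_000,
    "S1 (now)": 135_900,
}


def speedup(new_rate, old_rate):
    """How many times faster new_rate is than old_rate."""
    return new_rate / old_rate


def slowdown_vs_real_time(rate):
    """How many times slower than a real console a simulator rate is."""
    return REAL_HALF_CYCLES_PER_S / rate


for name, rate in RATES.items():
    print(f"{name:16s} {speedup(rate, RATES['MetalNES']):5.2f}x MetalNES, "
          f"{slowdown_vs_real_time(rate):7.1f}x slower than real time")
```

Under that assumption the numbers line up with the text: 108K over 54K is exactly 2.0 (+100%), 135.9K over 54K is about 2.52, and 135.9K sits about 316× below real time. perfect6502 is left out deliberately, since its unit and scope differ.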

What we looked for in each sibling (and what we found)

What AprVisual added, and the original finale

Correctness first: bit-exact golden checksums (full-state FNV-1a at 300k/400k/1M half-cycles + a 10M-half-cycle SMB1 gate), a C# engine and an independently implemented Rust twin producing bit-identical output, both '+'/'-' pulls kept, and "driven high" explicitly distinguished from "floating hold". Every optimization below shipped only after passing those gates, measured with interleaved-paired A/B.
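
A golden-checksum gate of this kind can be sketched as follows. The FNV-1a constants are the standard 64-bit ones; everything else is an assumption made for illustration: the text does not say which FNV width the engines use, how node state is serialized, or what the harness looks like, so `state_checksum` and `first_divergence` are hypothetical helpers.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = (1 << 64) - 1


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def state_checksum(node_states) -> int:
    """Hash a full node-state vector (hypothetical one-byte-per-node layout)."""
    return fnv1a_64(bytes(1 if s else 0 for s in node_states))


def first_divergence(snapshots_a, snapshots_b):
    """Compare two engines' snapshots, keyed by half-cycle checkpoint.

    Returns the earliest checkpoint whose checksums differ, or None if
    every checkpoint matches.
    """
    for checkpoint in sorted(snapshots_a):
        if state_checksum(snapshots_a[checkpoint]) != state_checksum(snapshots_b[checkpoint]):
            return checkpoint
    return None


# Toy snapshots at the three checkpoints named in the text.
engine_a = {300_000: [0, 1, 1], 400_000: [1, 1, 0], 1_000_000: [0, 0, 1]}
engine_b = {300_000: [0, 1, 1], 400_000: [1, 0, 0], 1_000_000: [0, 0, 1]}
```

Here `first_divergence(engine_a, engine_a)` returns None, while comparing `engine_a` with `engine_b` reports 400000, the first checkpoint where a single node differs. A gate like this only accepts an optimization when the result is None.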

| Layer | What | Gain |
| --- | --- | --- |
| Data layout | unmanaged SoA hot data, 16-byte NodeInfo with inline payload, zero bounds checks; lowering (always-on shorts merged) | base, +3.7%, +4.2% |
| Fast path | static singleton O(1); R-1 dynamic singleton (all pass gates OFF ⇒ group is {nn} right now) | +18.6% / +12.5% |
| Prunes | P-1 same-state turn-on prune + P-2→P-4 (isolation + capacitance-guarded extensions); delete ~21% of all re-evaluations at the source (how they work) | +11.85%, +7.7%/+10% |
| Renumbering | range-prune + the self-captured first-touch key: the original RCM-improved design (below) | +3.6%, +6.2% |
| Fast path | B1 pair path: provably-two-node groups resolved inline (size-2 = 77% of all walks) | +7–9% (boost) / +14.5% |
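
The R-1 row states its own condition: if every pass gate touching a node is OFF, the conducting group is just that node at this moment, so no group walk is needed. A minimal sketch of that check, assuming a hypothetical `channels` map from a node to (gate_node, other_node) pairs; the real engine's data layout and fallback path are not shown on this page.

```python
def dynamic_singleton(node, channels, gate_is_on):
    """Return {node} if no pass transistor touching `node` conducts, else None.

    None means the caller must fall back to the full group walk.
    """
    for gate, _other in channels.get(node, ()):
        if gate_is_on(gate):
            return None
    return {node}


# Node 2 touches two pass transistors, gated by nodes 10 and 11.
channels = {2: [(10, 1), (11, 3)]}
gates_off = {10: False, 11: False}
gates_mixed = {10: False, 11: True}
```

With both gates OFF, `dynamic_singleton(2, channels, gates_off.get)` returns {2}; with gate 11 ON it returns None. A node with no pass transistors at all also comes back as a singleton, which corresponds to the static case in the same row.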

The finale: the original RCM-improved design

Classic RCM renumbers nodes by graph adjacency for cache locality — we measured it useless here (the hot set is already cache-resident; the bound is dependent-load chains). The reversal came from changing the objective of renumbering, twice: (1) range-prune — sort by static prune class, so the hottest loop's mask lookups become register range compares, verified against recomputed ground truth at every reset with an automatic safe fallback; (2) the self-captured first-touch key — the engine warms itself up at load and records the production cascade's true first-pop order through a cold instrumented copy of the settle loop, then rebuilds with that order. No file, no flag, any ROM, immune to workload drift. The decisive empirical finding: the key's value is the pruned cascade's order, not cache-line density. The full story →
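
Both renumbering ideas can be illustrated with toy data. This is a sketch under assumptions, not the engine's code: the class encoding, the function names, and the trace format are hypothetical, and the page does not say how the two keys are combined, so they are shown as separate helpers.

```python
def renumber_by_class(node_class):
    """Sort nodes by static prune class (stable), returning new ids and bounds.

    node_class[old_id] is a small integer class. After renumbering, class c
    occupies the contiguous id range starts[c] <= new_id < starts[c + 1].
    """
    order = sorted(range(len(node_class)), key=lambda n: node_class[n])
    new_of_old = [0] * len(node_class)
    for new_id, old_id in enumerate(order):
        new_of_old[old_id] = new_id
    starts = [0] * (max(node_class, default=-1) + 2)
    for c in node_class:
        starts[c + 1] += 1
    for c in range(len(starts) - 1):
        starts[c + 1] += starts[c]
    return new_of_old, starts


def class_by_range(new_id, starts):
    """Recover a node's class from range compares alone, with no mask table."""
    for c in range(len(starts) - 1):
        if starts[c] <= new_id < starts[c + 1]:
            return c
    raise ValueError("id out of range")


def ranges_are_valid(node_class, new_of_old, starts):
    """Check the ranges against recomputed ground truth; on False, fall back."""
    return all(class_by_range(new_of_old[n], starts) == node_class[n]
               for n in range(len(node_class)))


def first_touch_order(pop_trace, node_count):
    """Order nodes by first appearance in a recorded pop trace.

    Nodes never touched during warm-up keep their relative order at the end.
    """
    seen, order = set(), []
    for node in pop_trace:
        if node not in seen:
            seen.add(node)
            order.append(node)
    order.extend(n for n in range(node_count) if n not in seen)
    return order


node_class = [1, 0, 1, 0, 2]
new_of_old, starts = renumber_by_class(node_class)
```

For this input the class-0 nodes land in ids 0 to 1, class 1 in ids 2 to 3, and class 2 in id 4, so a class test in the hot loop reduces to comparing an id against a couple of bounds. `ranges_are_valid` plays the role of the per-reset verification: if it ever fails, the safe move is to go back to the table lookup.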

Where does all this sit relative to MOSSIM II, IRSIM, COSMOS and the academic literature — and which parts are genuinely original? That got its own page: prior art & original contributions →
