AprVisual — the lineage: visual6502 → MetalNES → AprVisual.S1

The family tree

Everything below descends from one piece of JavaScript — visual6502's chipsim.js — but the branches differ wildly in how faithful (or ambitious) they are. That alone explains most of the performance spread.

Visual6502  chipsim.js  (JavaScript, MIT) ─── the common ancestor
├── Visual 2A03 / 2C02 (JS, Quietust) ─ chipsim.js applied to the two NES chips
│      └── VisualNes (C++/C#) ──── a line-by-line port of those two JS sims, wired into a NES
├── perfect6502 (C) ────────────── an optimized REWRITE of chipsim.js, 6502 only
└── MetalNES (C++) ─────────────── a re-engineered descendant (LUT/handlers/modules/analog), whole NES
        └── AprVisual S1 (C#) ──── our port of MetalNES + our own optimization program
                └── rust-s1 (Rust) ─ a bit-exact twin of S1

Project	Relationship to the JS	Evidence
VisualNes	line-by-line port	`chipsim.cpp` keeps the JS inefficiencies (O(n²) `find()` dedup) and even JS leftovers (`eval(readTriggers[a])` as comments)
perfect6502	optimized rewrite (6502 only)	README: "derived from the JavaScript visual6502 implementation"; rebuilt in C with bitmap states + precomputed dependant lists
MetalNES	re-engineered descendant	adds things chipsim.js never had: the 256-entry LUT, behavioral handlers, a module system, analog video/audio ladders
AprVisual S1	MetalNES port + our optimizations	our `SetNodeState`/`ProcessQueue`/tlist layout = MetalNES's pre-prune prototype, function-for-function; then R-1, P-1→P-4, range-prune, the self-captured key, B1…

The shared semantics all generations run: event-driven settle; per event, build the conducting group (BFS over ON pass transistors), OR the members' flags, resolve by priority (GND wins → VCC/pull-up → external drive → hold); a purely-floating group resolves to the largest-capacitance member's previous state. In the literature this is Bryant's MOSSIM II model minus the X state — see where our work sits in the literature.
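
The resolution rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any engine's actual code: the flag names, the `Node` layout and the `(gate, other)` channel encoding are hypothetical, and checking an external low drive before an external high drive is an arbitrary choice made here.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical flag bits; each real engine has its own encoding.
GND, VCC, PULLUP, EXT_LOW, EXT_HIGH = 1, 2, 4, 8, 16

@dataclass
class Node:
    flags: int = 0
    state: bool = False
    capacitance: int = 1
    # One (gate_node, other_node) pair per pass transistor touching this node.
    channels: list = field(default_factory=list)

def build_group(nodes, start):
    """BFS over pass transistors whose gate node is currently high."""
    group, seen, queue = [], {start}, deque([start])
    while queue:
        n = queue.popleft()
        group.append(n)
        # Supply rails join the group but the walk does not continue through them.
        if nodes[n].flags & (GND | VCC):
            continue
        for gate, other in nodes[n].channels:
            if nodes[gate].state and other not in seen:
                seen.add(other)
                queue.append(other)
    return group

def resolve(nodes, group):
    """OR the members' flags, then apply the priority ladder."""
    flags = 0
    for n in group:
        flags |= nodes[n].flags
    if flags & GND:
        return False
    if flags & (VCC | PULLUP):
        return True
    if flags & EXT_LOW:
        return False
    if flags & EXT_HIGH:
        return True
    # Purely floating: keep the previous state of the largest-capacitance member.
    biggest = max(group, key=lambda n: nodes[n].capacitance)
    return nodes[biggest].state
```

Every member of the group then takes the resolved value, and any transistor whose gate changed as a result queues further events until the network settles.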

What MetalNES added over visual6502

Whole-system composition — not just one CPU: the Visual2A03 + Visual2C02 netlists plus the board's TTL chips, composed into a complete NES-001 via .js module definitions (pins / modules / connections / pullups / forceCompute / memory).
An optimized C++ engine — native data structures and the 256-entry FlagsToState lookup table: one OR-ed flag byte → one table read replaces the priority-ladder branches (and distinguishes "driven high" from "hold previous").
Behavioral memory — RAM/ROM as handlers (the callback = fake-transistor mechanism) instead of transistors, removing the largest mass of pointless simulation.
ForceCompute — the special Gnd+Pwr-cancel resolution for certain bus nodes; plus analog video/audio voltage ladders (beyond our digital scope — depth, not speed).
Trade-offs: only the '+' pull-ups from segdefs are kept (the '-' column is dropped), and there is no bit-exact verification methodology.
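
The lookup-table idea can be illustrated as follows. This is a sketch under assumptions, not MetalNES's actual table: the flag bit assignments and the three result codes are hypothetical. It shows how a branchy priority ladder is evaluated once for all 256 possible flag bytes, so the hot loop needs only one indexed read per group.

```python
# Hypothetical flag bits and result codes, chosen for illustration only.
GND, VCC, PULLUP, EXT_LOW, EXT_HIGH = 1, 2, 4, 8, 16
LOW, HIGH, HOLD = 0, 1, 2

def ladder(flags):
    """The branchy priority ladder, used only while building the table."""
    if flags & GND:
        return LOW
    if flags & (VCC | PULLUP):
        return HIGH
    if flags & EXT_LOW:
        return LOW
    if flags & EXT_HIGH:
        return HIGH
    return HOLD  # nothing drives the group: keep the previous state

# One entry per possible OR-ed flag byte.
FLAGS_TO_STATE = bytes(ladder(f) for f in range(256))

def resolve_group(member_flags):
    """OR the members' flags, then do a single table read."""
    acc = 0
    for f in member_flags:
        acc |= f
    return FLAGS_TO_STATE[acc]
```

Keeping `HOLD` as its own code is what lets the caller tell "driven high" apart from "hold previous".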

The measured four-way comparison (one machine, one ROM, one unit)

We didn't take anyone's word for it: we compiled and benchmarked the whole family ourselves (2026-06-08, Ryzen 7 3700X, full_palette.nes, NES master-clock half-cycles). VisualNes builds cleanly headless; MetalNES needed real surgery (it's a macOS Metal GUI app, so we stubbed the GUI/raster/audio layers and drove the wire core directly, measuring engine + logic + RAM/ROM/bus, the same scope as our benchmark).

Project	Scope	Lang	hc/s (half-cycles per second)	Algorithmic character
VisualNes	whole NES	C++	~24K	literal chipsim.js port: O(n²) group dedup, `vector`+`shared_ptr` in the hot path, zero prunes; the "unoptimized chipsim" baseline
MetalNES	whole NES	C++	~54K	our direct ancestor: flags-OR→256-LUT, double-buffered waves, single-sided turn-on, but no prunes and `std::vector` data
perfect6502	6502 only	C	~29K *	a genuinely optimized chipsim rewrite: bitmap states, precomputed dependant lists, single-sided turn-on, but no charge-hold model at all
AprVisual S1	whole NES	C#	~108K (2026-06-08) → ~135.9K (now)	the MetalNES algorithm + R-1 dynamic singleton + P-1→P-4 prunes + unmanaged SoA layout at the time of the comparison; since extended with range-prune, the self-captured locality key and the B1 pair path. Bit-exact (golden checksums + a 10M-half-cycle SMB1 gate) throughout; ~316× slower than real time
rust-s1	whole NES	Rust	~118.5K (now)	an independently implemented, bit-identical twin of S1 (same goldens); the per-platform adoption record (e.g. B1: +7–9% on C# at boost vs +14.5% on Rust) doubles as cross-engine validation

* perfect6502's "hc" is a 6502-clock half-cycle; ours is a NES master-clock half-cycle (the 6502 runs at master÷12). The units differ by an order of magnitude, so its 29K must not be ranked against the whole-console rows — it sits on a separate (CPU-only) branch. See the unit conversions.

The cleanest finding is a progress bar: 24K (VisualNes — unoptimized chipsim) → ~54K (MetalNES — a better-engineered ancestor) → ~108K (S1 at the time of the comparison). MetalNES→S1 = +100% on the same algorithm, same scope, same unit, same machine, bit-exact — that delta is our prune family + dynamic fast-path + data layout. Nothing in any sibling could be "carried over to beat us"; instead all three independently confirmed our lineage is faithful and our optimizations lead. (S1 has since reached ~135.9K with the range-prune + self-captured key + B1 pair path — ~2.5× the ancestor.)
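
The ratios quoted here can be rechecked with simple arithmetic. The sketch below assumes the NTSC NES master clock of roughly 21.477272 MHz, which is not stated on this page but reproduces the ~316× figure; the dictionary keys are labels invented for this example, and the rates are the rounded values from the table.

```python
# Assumption: NTSC master clock (about 21.477272 MHz), two half-cycles per cycle.
NTSC_MASTER_HZ = 21_477_272
REAL_TIME_HC_PER_S = 2 * NTSC_MASTER_HZ

# Rounded throughput figures from the comparison, in master-clock half-cycles/s.
rates = {
    "VisualNes": 24_000,
    "MetalNES": 54_000,
    "S1 (2026-06-08)": 108_000,
    "S1 (now)": 135_900,
}

def ratio(faster, slower):
    """How many times faster one engine is than another."""
    return rates[faster] / rates[slower]

def slowdown(hc_per_s):
    """How many times slower than real hardware a given throughput is."""
    return REAL_TIME_HC_PER_S / hc_per_s

print(f"MetalNES over VisualNes: {ratio('MetalNES', 'VisualNes'):.2f}x")
print(f"S1 at comparison over MetalNES: {ratio('S1 (2026-06-08)', 'MetalNES'):.2f}x")
print(f"S1 now over MetalNES: {ratio('S1 (now)', 'MetalNES'):.2f}x")
print(f"S1 now vs real time: {slowdown(rates['S1 (now)']):.0f}x slower")
```

perfect6502 is left out on purpose: its rate counts 6502-clock half-cycles, each of which spans 12 master-clock half-cycles, so it does not belong in the same dictionary.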

What we looked for in each sibling (and what we found)

perfect6502: its precomputed per-node dependant lists (flat CSR, rising/falling split) are cache-friendly — but flattening them would destroy the (c1,c2) pairing our P-1 prune needs, and the prune saves more. Its 1-bit bitmap states are 8× denser — but our flags/prunes need bytes, and our own bit-parallel experiments were measured dead ends. It has no charge-hold model at all (floating → 0) — survivable for a bare 6502, impossible for a whole NES.
VisualNes: honestly, nothing to take — it is a correctness-reference-grade port. Its value is being the unoptimized baseline that quantifies the family's spread.
MetalNES: it IS the source of our algorithm, so nothing new to learn — we took everything and kept going. What it uniquely has is the analog video/audio ladder simulation (depth we deliberately scope out).

What AprVisual added, and the original finale

Correctness first: bit-exact golden checksums (full-state FNV-1a at 300k/400k/1M half-cycles + a 10M-half-cycle SMB1 gate), a C# engine and an independently implemented Rust twin producing bit-identical output, both '+'/'-' pulls kept, and "driven high" explicitly distinguished from "floating hold". Every optimization below shipped only after passing those gates, measured with interleaved-paired A/B.
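
A golden-checksum gate of this kind can be sketched as below. The 64-bit FNV-1a constants are the standard published ones, but the hash width, the `snapshot` callable and the shape of the `goldens` mapping are assumptions made for illustration; the page does not say how the engine serializes its full state.

```python
# Standard 64-bit FNV-1a parameters. Using the 64-bit variant is an assumption.
FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = (1 << 64) - 1

def fnv1a_64(data: bytes, h: int = FNV64_OFFSET) -> int:
    """FNV-1a: XOR each byte in, then multiply by the prime (mod 2**64)."""
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h

def first_mismatch(snapshot, goldens):
    """Return the first checkpoint whose checksum differs, or None.

    `snapshot(half_cycle)` is a hypothetical callable returning the engine's
    full state as bytes at that half-cycle; `goldens` maps half-cycle counts
    to the recorded checksums.
    """
    for half_cycle in sorted(goldens):
        if fnv1a_64(snapshot(half_cycle)) != goldens[half_cycle]:
            return half_cycle
    return None
```

Because the hash can be continued from a previous value (pass it as `h`), a large state can be checksummed in chunks without building one big buffer.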

Layer	What	Gain
Data layout	unmanaged SoA hot data, 16-byte NodeInfo with inline payload, zero bounds checks; lowering (always-on shorts merged)	base +3.7% +4.2%
Fast path	static singleton O(1); R-1 dynamic singleton (all pass gates OFF ⇒ group is {nn} right now)	+18.6% / +12.5%
Prunes	P-1 same-state turn-on prune + P-2→P-4 (isolation + capacitance-guarded extensions): delete ~21% of all re-evaluations at the source (how they work)	+11.85% +7.7%/+10%
Renumbering	range-prune + the self-captured first-touch key: the original RCM-improved design (below)	+3.6% +6.2%
Fast path	B1 pair path: provably-two-node groups resolved inline (size-2 = 77% of all walks)	+7–9% (boost) / +14.5%
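
The two fast-path rows can be pictured with a small classifier. This is a sketch only: the R-1 test follows the condition given in the table (no conducting pass gate on the node), while the two-node test is one sufficient condition assumed here, since the page does not spell out how B1 proves a group has exactly two members. The `channels` and `state` shapes are hypothetical.

```python
def conducting_neighbours(node, channels, state):
    """Nodes reachable from `node` through a pass transistor whose gate is high.

    `channels[node]` is a list of (gate_node, other_node) pairs and
    `state[gate_node]` is the gate's current logic level.
    """
    return [other for gate, other in channels[node] if state[gate]]

def classify_group(node, channels, state):
    """Return the group as a tuple when it is provably tiny, else None."""
    near = conducting_neighbours(node, channels, state)
    if not near:
        return (node,)  # R-1 style: every pass gate is off, so the group is {node}
    if len(set(near)) == 1:
        other = near[0]
        # Pair case: the only neighbour conducts back to `node` and nowhere else.
        if set(conducting_neighbours(other, channels, state)) <= {node}:
            return (node, other)
    return None  # fall back to the general BFS walk
```

A caller would resolve the returned one- or two-node group inline and start the full walk only when the result is `None`.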

The finale: the original RCM-improved design

Classic RCM renumbers nodes by graph adjacency for cache locality — we measured it useless here (the hot set is already cache-resident; the bound is dependent-load chains). The reversal came from changing the objective of renumbering, twice: (1) range-prune — sort by static prune class, so the hottest loop's mask lookups become register range compares, verified against recomputed ground truth at every reset with an automatic safe fallback; (2) the self-captured first-touch key — the engine warms itself up at load and records the production cascade's true first-pop order through a cold instrumented copy of the settle loop, then rebuilds with that order. No file, no flag, any ROM, immune to workload drift. The decisive empirical finding: the key's value is the pruned cascade's order, not cache-line density. The full story →
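
The two renumbering objectives can be sketched together. Everything here is illustrative: the function names are hypothetical, and combining the two keys as "class first, first-touch order within a class" is an assumption about how they compose, not something this page states. The point is that once ids are grouped by class, a membership test becomes two integer compares, and the ranges can be checked against the recomputed classes before they are trusted.

```python
def first_touch_order(pop_trace, node_count):
    """Old node ids in order of first pop; never-popped nodes follow in id order."""
    seen, order = set(), []
    for n in pop_trace:
        if n not in seen:
            seen.add(n)
            order.append(n)
    order.extend(n for n in range(node_count) if n not in seen)
    return order

def renumber_by_class(order, prune_class):
    """Stable-sort by static prune class; return old->new ids and class ranges."""
    ranked = sorted(order, key=lambda old: prune_class[old])  # stable within a class
    new_id = {old: new for new, old in enumerate(ranked)}
    ranges = {}
    for new, old in enumerate(ranked):
        lo, _ = ranges.get(prune_class[old], (new, new))
        ranges[prune_class[old]] = (lo, new + 1)  # half-open [lo, hi)
    return new_id, ranges

def ranges_match_ground_truth(new_id, ranges, prune_class):
    """Check that 'id falls in range' agrees with the recomputed class everywhere."""
    return all(
        (lo <= new_id[old] < hi) == (prune_class[old] == cls)
        for old in new_id
        for cls, (lo, hi) in ranges.items()
    )
```

If the check fails, a caller would fall back to the per-node class lookup rather than trust the ranges.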

Where does all this sit relative to MOSSIM II, IRSIM, COSMOS and the academic literature — and which parts are genuinely original? That got its own page: prior art & original contributions →
