AprVisual

A Visual6502-style switch-level NES simulator — and an honest log of what made it faster, and what didn't.


Transistor-accurate NES (2A03 + 2C02) simulation in software. This site is an experience-sharing / research log; contributions and faster hardware results are very welcome.


⬇ Download benchmark | View on GitHub

Last updated: 2026-05-30 13:29 (UTC)

01 What AprVisual is, and the wall we hit

AprVisual started from one goal: take Visual6502-style transistor netlists of the NES CPU (2A03) and PPU (2C02), simulate them exactly at the switch level, and push that simulation as close to real time as software allows. Pure netlist simulation is inherently expensive: the gap to real silicon speed is on the order of hundreds of times, and that gap is algorithmic, not a matter of coding detail.

So the original blueprint was an automated pipeline (fully programmatic, no hand-tuning per chip) that lifts the netlist into higher-level abstractions for speed. Four stages:


S1 Switch-level engine: a from-scratch C# rewrite of MetalNES's wire / group-resolution core. The foundation.

S2 Netlist → IR: loop/SCC detection, extraction of boolean logic into an Expr IR, and an IR interpreter.

S3 CPU proof: the IR must be per-node equivalent to S1 and measurably faster than the raw switch-level interpreter.

S4 Codegen + GPU: emit C++/Verilog plus a bit-sliced CUDA/GPU kernel, with per-node equivalence to the CPU IR.

The counter-intuitive part

When you ask an AI — or most software engineers — how to speed this up, the answer is almost always the same: build an IR, do codegen, or throw it on a GPU. We actually built and verified those paths. And we found problems nobody mentioned up front:


Current direction: push S1, the pure switch-level engine, to its absolute limit. The IR/codegen path is parked, not deleted; it is a verified dead end for the performance goal, not for correctness.

01·5 What actually runs as transistors

To stay honest about fidelity, here is exactly what is simulated as a real transistor netlist versus a software behavioral model. The CPU, the PPU, the board's 74-series logic chips, and both controllers run 100% as transistor netlists. Only the memory cell arrays, the master oscillator, and the video output are behavioral; the CIC is a stub.

Component | Simulation | Transistors
2C02 (PPU) | ✅ transistor netlist | 16,794
2A03 (CPU, modified 6502 + APU) | ✅ transistor netlist | 10,946
2× controllers (nes-pad → 4021 → 8× pslatch) | ✅ transistor netlist | 352
74LS373 (address latch) | ✅ transistor netlist | 82
74LS139 (address decode) | ✅ transistor netlist | 38
2× 74LS368 (controller tri-state buffers) | ✅ transistor netlist | 28
74HC04 (PPU A13 inverter) | ✅ transistor netlist | 6
RAM / VRAM / PRG / CHR ROM | 🔧 hybrid: control pins netlist + cell array behavioral | ~68 (pins only)
CIC (nes-cic1) | ⚠️ stub: reset inverter only, not a real lockout chip | 1
Clock / video output | 🔧 pure software handler | -
Totals: 27,305 transistors and 15,164 nodes as assembled; 26,775 transistors and 14,723 nodes after lowering.

Could this be smaller, as a future speed-up? Possibly. Fewer nodes means smaller hot arrays and fewer nodes that can ever go dirty, so shrinking the netlist itself is a candidate breakthrough. But the easy, safe cuts are already taken (the lowering above). Going deeper (folding constant / supply-tied nodes, collapsing logic-equivalent sub-networks) runs straight into the correctness traps that sank earlier attempts: cross-coupled latches with two stable states (which made prune-merge render a black screen) and the floating-capacitance tie-break (which "dead-end skip" broke). So it is a real but delicate avenue: any further reduction must be proven per-node equivalent and verified against a PPU visual frame, not just a CPU checksum.

02 What worked, what didn't: the interesting part

S1's base is an equivalent reimplementation of large parts of MetalNES — specifically its wire_compute group-resolution core, which is itself an optimized port of Visual6502's chipsim. On top of that foundation we ran a long campaign of optimization strategies. They split into two layers, and both are full of results that contradict intuition.


Built on MetalNES: what S1 actually adds on top

S1 faithfully reproduces MetalNES's golden core — the connected-group BFS resolution, the 256-entry flags→state lookup table, the per-node c1c2/gnd/pwr transistor sub-lists with early-break, and the largest-capacitance float tie-break. Those are MetalNES's, ported as-is. On top of that, S1 adds these processing-level optimizations, each verified bit-identical to the un-optimized model:


S1 addition | vs MetalNES
Pure-logic fast-path | O(1) resolve for the ~27% of nodes that provably form a singleton group; skips the BFS entirely. MetalNES runs the full group walk for every node.
S1.5 lowering pre-pass | Before simulating: union-find merge of always-on shorts, drop dead (gate=GND) transistors, dedup + dense compaction → 15,164→14,723 nodes, 27,305→26,775 transistors. MetalNES simulates the raw assembled netlist.
O(1) in-group dedup | A per-node _inGroup flag. MetalNES linearly scans the whole current group on every node add.
Deferred capacitance read | MetalNES updates max-capacitance/state on every node add; S1 defers that read to the rare floating branch (<1% of walks) → +12% on C#.
Hot-data shrink + SoA | byte states, ushort node ids, hot/cold split packed to a quarter cache line, unmanaged arrays, versus MetalNES's std::vector<node_info> with int fields. The single biggest lever.
Iterative BFS (C#) | MetalNES's group walk is recursive; S1 makes it iterative so the .NET JIT inlines the whole chain (about +3% on C#). Left recursive on Rust, where LLVM already inlines it.
Twin C# + Rust engines | Two independent codegens, bit-identical (same checksum): cross-validation and a built-in performance comparison.
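Two of the rows above, the per-node `_inGroup` flag and the iterative group walk, can be illustrated together. The following is a minimal Python sketch, not the project's C# or Rust code: the `conducting_neighbors` callback, the supply node ids, and the flag-reset step at the end are assumptions made for illustration.

```python
def collect_group(start, conducting_neighbors, in_group, vcc, gnd):
    """Collect the nodes connected to `start` through conducting transistors,
    using an explicit stack instead of recursion."""
    group = []
    stack = [start]
    while stack:
        node = stack.pop()
        if in_group[node]:           # O(1) membership test, no scan of `group`
            continue
        in_group[node] = True
        group.append(node)
        if node == vcc or node == gnd:
            continue                 # supplies join the group but are not walked through
        for other in conducting_neighbors(node):
            if not in_group[other]:
                stack.append(other)
    for node in group:               # clear the flags so the array can be reused
        in_group[node] = False
    return group


# Hypothetical 6-node example: node 0 is gnd, node 1 is vcc.
links = {2: [3], 3: [2, 4], 4: [3, 0], 5: []}
flags = [False] * 6
group = collect_group(2, lambda n: links.get(n, []), flags, vcc=1, gnd=0)
```

The flag array costs one byte (or bit) per node and replaces a linear scan of the current group with a single indexed read, which is the difference the table row describes.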

So the fast-path and the lowering pre-pass are genuine, effective processing strategies S1 adds beyond the reference — the "no breakthrough" caveat below is about the broad search for new algorithms, most of which were dead-ends.
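The lowering pre-pass can be sketched in a few lines. This is a hedged Python illustration under stated assumptions, not the S1.5 implementation: transistors are modeled as hypothetical `(gate, c1, c2)` tuples, "always-on" is taken to mean a gate tied to the supply node, and real netlist details such as pull-ups are ignored.

```python
def lower(num_nodes, transistors, vcc, gnd):
    """transistors: list of (gate, c1, c2) node ids.
    Returns (new_node_count, new_transistors, old_to_new)."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    # 1. Merge the channel ends of every always-on transistor. Repeat until
    #    stable, since a merge can tie another gate to the supply.
    changed = True
    while changed:
        changed = False
        for gate, c1, c2 in transistors:
            a, b = find(c1), find(c2)
            if find(gate) == find(vcc) and a != b:
                parent[max(a, b)] = min(a, b)
                changed = True

    # 2. Drop merged shorts, dead (gate=GND) transistors, and self-loops;
    #    the set removes duplicates.
    kept = set()
    for gate, c1, c2 in transistors:
        g, a, b = find(gate), find(c1), find(c2)
        if g == find(vcc) or g == find(gnd) or a == b:
            continue
        kept.add((g, min(a, b), max(a, b)))

    # 3. Dense compaction: renumber the surviving roots 0..k-1.
    roots = sorted({find(n) for n in range(num_nodes)})
    new_id = {r: i for i, r in enumerate(roots)}
    old_to_new = [new_id[find(n)] for n in range(num_nodes)]
    new_transistors = sorted((new_id[g], new_id[a], new_id[b]) for g, a, b in kept)
    return len(roots), new_transistors, old_to_new


# Hypothetical netlist: node 0 is gnd, node 1 is vcc.
netlist = [(1, 2, 3), (0, 3, 4), (5, 3, 4), (5, 2, 4)]
count, lowered, mapping = lower(6, netlist, vcc=1, gnd=0)
```

In this toy netlist the always-on transistor merges nodes 2 and 3, the gate=GND transistor disappears, and the two remaining transistors become duplicates of each other, so four transistors on six nodes reduce to one transistor on five nodes.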


New here? Switch-level simulation primer →

MetalNES vs Visual6502 →   S1 vs MetalNES →

The full lineage: switch-level model (Bryant, 1980s) → Visual6502 (chipsim.js) → MetalNES → AprVisual S1 — an educational walk-through with source-line citations at each step.


Layer A: algorithmic (graph / topology / matrix)

Two strategies here stuck and are real wins: the pure-logic fast-path and the lowering pre-pass (above). Beyond those, there was no sweeping algorithmic breakthrough. Our role was to drive an AI to propose, implement, and benchmark dozens of mathematically elegant strategies, then keep only the few that survive engineering reality. The other big contribution is the negative catalogue: cases that should win on paper but lose on real hardware, and exactly why.

Layer B: programming / compiler

This is the more familiar territory: hot-loop shape, memory layout, branch behaviour, what the JIT/LLVM will and won't do. Many proposals were made here, and the AI did the coding and verification. Plenty of "this is obviously better" ideas turned out worse, alongside a few real, repeatable wins.

Negative cases (and why they fail)

An important caveat: most of these failures are not flaws in the strategy itself; they are mismatches with the hardware it ran on. A method can be perfectly correct, even more elegant on paper, yet collapse the moment its instruction footprint (i-cache) or working set (d-cache) spills out of cache and starts thrashing. So what failed wasn't the idea; the idea didn't fit this CPU's operating envelope. The flip side is the interesting part: as hardware advances (larger caches, more memory bandwidth), some of these more aggressive strategies could turn out to be exactly the right fit.

Wins (what actually moved the needle)

The meta-lesson: the same change can be +C# and −Rust

A recurring, important finding: an identical source-level change is often a win on one compiler and a loss on the other. Example — widening a fast-path classifier (which adds one branch to the hottest inlined function) measured +0.4–1.3% on C#/.NET-JIT but −1.9–2.5% on Rust/LLVM, because under LLVM's already-tight codegen that one extra branch costs more than the work it saves. Never sync a hot-path change across engines blindly — measure each. And for sub-2% effects, batched A/B is untrustworthy; we use interleaved paired measurement (alternate builds each round) to beat time-drift.
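The interleaved paired measurement mentioned above can be sketched as follows. This is a minimal Python illustration, not the project's benchmark harness: the function names, the per-round order swap, and the use of the median ratio are assumptions chosen for the example.

```python
import statistics
import time


def time_once(run):
    """Wall-clock one invocation of `run`."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


def interleaved_paired(run_a, run_b, rounds=10, timer=time_once):
    """Time build A and build B back to back in every round and return the
    median of the per-round ratio time_b / time_a (below 1.0: B is faster)."""
    ratios = []
    for i in range(rounds):
        # Swap the order each round so neither build always runs first.
        if i % 2 == 0:
            t_a = timer(run_a)
            t_b = timer(run_b)
        else:
            t_b = timer(run_b)
            t_a = timer(run_a)
        ratios.append(t_b / t_a)
    return statistics.median(ratios)
```

Because each ratio compares two runs taken moments apart, slow drift (thermal throttling, background load) affects both sides of a pair almost equally. A batched A/B run, by contrast, measures all of A before all of B, so the drift lands entirely on one build.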


03 Where it stands, and a call to help

Real time was known to be out of reach from day one. But once the engine got fast enough, it crossed into being a genuinely useful, verifiable tool. The numbers below are our actual run on the dev machine (the screenshot at the bottom of this section), benchmarking 300,000 master half-cycles of full_palette:


Rust: 71.9K hc/s · 9.94 s/frame · 598× gap to real-time (0.167% of real-time speed)
C#: 67.3K hc/s · 10.62 s/frame · 638× gap to real-time (0.157% of real-time speed)

On this machine Rust renders a frame in 9.94 s and C# in 10.62 s (both engines produce a bit-identical result — same checksum 0x794A43A8DF169ADA). NES NTSC real-time needs 42,954,552 hc/s, so we're still ~600× short — the near-term goal is to stably hold one frame under 10 seconds. That gap is exactly why outside ideas and faster machines are valuable.
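The relationship between hc/s, seconds per frame, and the real-time gap can be checked with a few lines of arithmetic. In this Python sketch the real-time rate comes from the paragraph above; the half-cycles-per-frame constant is an assumption based on the usual NTSC frame shape (341 dots × 262 scanlines, 4 master cycles per PPU dot, 2 half-cycles per master cycle) and is not stated on this page.

```python
REALTIME_HC_PER_S = 42_954_552      # NES NTSC real-time rate quoted above

# Assumption: 341 dots x 262 scanlines, 4 master cycles per PPU dot,
# 2 half-cycles per master cycle.
HC_PER_FRAME = 341 * 262 * 4 * 2


def seconds_per_frame(hc_per_s):
    return HC_PER_FRAME / hc_per_s


def realtime_gap(hc_per_s):
    return REALTIME_HC_PER_S / hc_per_s


for name, rate in (("Rust", 71_900), ("C#", 67_300)):
    gap = realtime_gap(rate)
    print(f"{name}: {seconds_per_frame(rate):.2f} s/frame, "
          f"{gap:.1f}x short of real time, {100 / gap:.3f}% of real-time speed")
```

Fed the rounded 71.9K and 67.3K figures, this reproduces the 9.94 s and 10.62 s frame times. The Rust gap comes out slightly under the 598× shown, presumably because the displayed hc/s value is itself rounded.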


The machine these numbers came from

CPU | AMD Ryzen 7 3700X: 8 cores / 16 threads, Zen 2, 3.6 GHz base (~4.4 GHz boost)
Cache | L1d 32 KB and L1i 32 KB per core, L2 512 KB per core, L3 32 MB
Memory | 32 GB DDR4-2133 (4 × 8 GB)
OS | Windows 11 Home (build 26200)

Note: the engine is single-threaded and memory-latency bound. Its speed tracks per-core IPC and L1/L2 latency far more than core count, so a faster single core will beat more cores here.

Have a faster CPU? Run the benchmark and share your numbers. The package is portable, self-contained (no .NET install), and ships both the C# and Rust engines for Windows + macOS (Apple Silicon).


⬇ Download AprVisualBenchMark.zip   All releases

📊 Then upload your result to the community leaderboard ↗

[Screenshot: AprVisual benchmark result on the dev machine]
Benchmark output on the dev machine: the PERFORMANCE block reports hc/s, the real-time gap, and seconds per frame.

04 How it compares to other netlist NES / 6502 simulators

Speed claims across these projects use different units (6502 clocks, chip steps, trace lines, master half-cycles), so this is not a clean apples-to-apples table — hardware era, tracing, feature completeness and frame definitions all differ. With that caveat, here's the public picture:


Project | Scope | Public speed | ~ per frame
Visual6502 (JS) | 6502 transistor-level | ~1 clock/s (animated) to 250 Hz+ (expert) | n/a (CPU only)
perfect6502 | 6502 NMOS netlist (C) | ~1/30 of a 1 MHz 6502 on a 2025 CPU | n/a (CPU only)
Visual NES | Visual 2A03 + 2C02 (C++/C#) | ~5000 Hz (dual-chip), ~7500 Hz after data shrink + PGO (2017, old i5) | ~30–60 s
MetalNES | full NES-001 board, transistor-level | user/press reports (not a benchmark) | ~1–2 min
AprVisual S1 (C#) | NES switch-level, pure BFS | 67.3K hc/s (this machine, 300k hc) | 10.62 s
AprVisual rust-s1 | NES switch-level, pure BFS | 71.9K hc/s (this machine, 300k hc) | 9.94 s

Takeaways

Sources: Visual6502 & NESdev wiki, perfect6502 README, Visual NES README + author's 2017 nesdev threads, MetalNES README + press/HN reports. AprVisual figures are this machine's run (full_palette, 300k hc).

Read the full comparison →

Per-project breakdowns, the unit caveats, frame-time derivations, and all sources.