01 What AprVisual is, and the wall we hit
AprVisual started from one goal: take Visual6502-style transistor netlists of the NES CPU (2A03) and PPU (2C02) and simulate them exactly at the switch level, while pushing that simulation as close to real time as software possibly can. Pure netlist simulation is a fundamentally heavy thing — the gap to real silicon speed is on the order of hundreds of times, and that gap is essentially algorithmic, not a coding detail.
So the original blueprint was an automated pipeline (fully programmatic, no hand-tuning per chip) that lifts the netlist into higher-level abstractions for speed. Four stages:
S1 Switch-level engine: a from-scratch C# rewrite of MetalNES's wire / group-resolution core. The foundation of the whole project.
S2 Netlist → IR: loop/SCC detection, extraction of boolean logic into an Expr IR, and an IR interpreter.
S3 CPU proof: the IR must be per-node equivalent to S1 and measurably faster than the raw switch-level interpreter.
S4 Codegen + GPU: emit C++/Verilog plus a bit-sliced CUDA/GPU kernel, with per-node equivalence to the CPU IR.
The counter-intuitive part
When you ask an AI — or most software engineers — how to speed this up, the answer is almost always the same: build an IR, do codegen, or throw it on a GPU. We actually built and verified those paths. And we found problems nobody mentioned up front:
- After IR + codegen, the generated code bloats massively. We can automatically extract the netlist into logic gates (it works, in theory and in practice), but the generated code explodes in size and runs worse, sometimes slower than simulating the netlist directly.
- Information loss, especially around timing and correctness verification, makes the abstracted model hard to trust against the golden switch-level model.
- We even auto-generated hardware-targeted code (Verilog/HLSL), but the whole point is to solve this in software, not to hand the performance problem to hardware.
- Going further (abstracting gates into components or a behavioral layer) inevitably needs manual intervention and contradicts the project's whole intent: a real, solid hardware simulation. Keep abstracting and you've lost the point.
Current direction: push S1, the pure switch-level engine, to its absolute limit. The IR/codegen path is parked, not deleted; it is a verified dead end for the performance goal, not for correctness.
01·5 What actually runs as transistors, and the total scale
Staying honest about fidelity matters, so here's exactly what is simulated as a real transistor netlist versus a software behavioral model. Every NES internal logic chip runs 100% transistor netlist — the CPU, the PPU, all four board TTL chips, and both controllers. Only memory cell arrays, the master oscillator, and the video output are behavioral.
| Component | Simulation | Transistors |
|---|---|---|
| 2C02 (PPU) | ✅ transistor netlist | 16,794 |
| 2A03 (CPU, modified 6502 + APU) | ✅ transistor netlist | 10,946 |
| 2× controllers (nes-pad → 4021 → 8× pslatch) | ✅ transistor netlist | 352 |
| 74LS373 (address latch) | ✅ transistor netlist | 82 |
| 74LS139 (address decode) | ✅ transistor netlist | 38 |
| 2× 74LS368 (controller tri-state buffers) | ✅ transistor netlist | 28 |
| 74HC04 (PPU A13 inverter) | ✅ transistor netlist | 6 |
| RAM / VRAM / PRG / CHR ROM | 🔧 hybrid: control pins netlist + cell array behavioral | ~68 (pins only) |
| CIC (nes-cic1) | ⚠️ stub: reset inverter only, not a real lockout chip | 1 |
| clock / video output | 🔧 pure software handler | n/a |
- The PPU and CPU alone are ~98% of all transistors; every TTL chip, controller, CIC and memory control-pin together is under 2%.
- Memory cell arrays (2 KB ×2 RAM/VRAM, 32 KB PRG, 8 KB CHR, optional 8 KB work RAM) are not counted — they're behavioral byte arrays. Simulating them as real 6T cells would add hundreds of thousands of transistors.
- CIC is a 1-transistor reset-inverter stub, not a real region-lockout chip — NROM doesn't need lockout, so its 4-bit MCU was never modelled.
- After S1.5 lowering (merge always-on shorts, drop dead transistors, dedup), the engine actually simulates 26,775 transistors over 14,723 nodes — that's the real per-half-cycle workload.
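The lowering steps named in the last bullet (merge always-on shorts, drop dead transistors, dedup, renumber densely) can be sketched in Rust. This is a minimal illustration under stated assumptions, not the project's code: the `Transistor` record, the `GND`/`PWR` node ids and the `lower` function are hypothetical names, and the sketch assumes NMOS-style semantics where a gate tied to ground never conducts and a gate tied to power always does.

```rust
use std::collections::HashSet;

pub const GND: u32 = 0; // assumed id of the ground node
pub const PWR: u32 = 1; // assumed id of the power node

/// Hypothetical transistor record: `gate` switches the channel between `c1` and `c2`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Transistor {
    pub gate: u32,
    pub c1: u32,
    pub c2: u32,
}

/// Union-find lookup with path halving.
fn find(parent: &mut [u32], mut x: u32) -> u32 {
    while parent[x as usize] != x {
        let grand = parent[parent[x as usize] as usize];
        parent[x as usize] = grand;
        x = grand;
    }
    x
}

/// Returns the lowered transistor list and the node count after renumbering.
pub fn lower(node_count: u32, transistors: &[Transistor]) -> (Vec<Transistor>, u32) {
    let mut parent: Vec<u32> = (0..node_count).collect();

    // 1. Merge always-on shorts. Repeat until stable, because a merge can tie
    //    another gate to PWR. The smaller id becomes the representative, so the
    //    supplies keep their ids (assuming PWR is never shorted to GND).
    loop {
        let mut changed = false;
        for t in transistors {
            if find(&mut parent, t.gate) != PWR {
                continue;
            }
            let a = find(&mut parent, t.c1);
            let b = find(&mut parent, t.c2);
            if a != b {
                parent[a.max(b) as usize] = a.min(b);
                changed = true;
            }
        }
        if !changed {
            break;
        }
    }

    // 2. Drop dead transistors (gate on GND) and collapsed channels, then dedup.
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for t in transistors {
        let gate = find(&mut parent, t.gate);
        let a = find(&mut parent, t.c1);
        let b = find(&mut parent, t.c2);
        if gate == GND || a == b {
            continue;
        }
        let lowered = Transistor { gate, c1: a.min(b), c2: a.max(b) };
        if seen.insert(lowered) {
            out.push(lowered);
        }
    }

    // 3. Dense renumbering: representatives get consecutive ids in ascending order.
    let mut remap = vec![u32::MAX; node_count as usize];
    let mut next = 0;
    for n in 0..node_count {
        let root = find(&mut parent, n) as usize;
        if remap[root] == u32::MAX {
            remap[root] = next;
            next += 1;
        }
        remap[n as usize] = remap[root];
    }
    for t in &mut out {
        t.gate = remap[t.gate as usize];
        t.c1 = remap[t.c1 as usize];
        t.c2 = remap[t.c2 as usize];
    }
    (out, next)
}
```

Each of the three steps is the "easy, safe" kind of cut: none of them has to reason about latch state or floating charge, which is where the deeper reductions get into trouble.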
Could this be smaller, as a future speed-up? Possibly. Fewer nodes means smaller hot arrays and fewer nodes that can ever go dirty, so shrinking the netlist itself is a candidate breakthrough. But the easy, safe cuts are already taken (the lowering above). Going deeper (folding constant / supply-tied nodes, collapsing logic-equivalent sub-networks) runs straight into the correctness traps that sank earlier attempts: cross-coupled latches with two stable states (which made prune-merge render a black screen) and the floating-capacitance tie-break (which "dead-end skip" broke). So it is a real but delicate avenue: any further reduction must be proven per-node equivalent and verified against a PPU visual frame, not just a CPU checksum.
02 What worked, what didn't: the interesting part
S1's base is an equivalent reimplementation of large parts of MetalNES — specifically its wire_compute group-resolution core, which is itself an optimized port of Visual6502's chipsim. On top of that foundation we ran a long campaign of optimization strategies. They split into two layers, and both are full of results that contradict intuition.
Built on MetalNES: what S1 actually adds on top
S1 faithfully reproduces MetalNES's golden core — the connected-group BFS resolution, the 256-entry flags→state lookup table, the per-node c1c2/gnd/pwr transistor sub-lists with early-break, and the largest-capacitance float tie-break. Those are MetalNES's, ported as-is. On top of that, S1 adds these processing-level optimizations, each verified bit-identical to the un-optimized model:
| S1 addition | vs MetalNES |
|---|---|
| Pure-logic fast-path | O(1) resolve for the ~27% of nodes that provably form a singleton group; skips the BFS entirely. MetalNES runs the full group walk for every node. |
| S1.5 lowering pre-pass | Before simulating: union-find merge of always-on shorts, drop dead (gate=GND) transistors, dedup + dense compaction → 15,164→14,723 nodes, 27,305→26,775 transistors. MetalNES simulates the raw assembled netlist. |
| O(1) in-group dedup | A per-node _inGroup flag. MetalNES linearly scans the whole current group on every node add. |
| Deferred capacitance read | MetalNES updates max-capacitance/state on every node add; S1 defers that read to the rare floating branch (<1% of walks) → +12% on C#. |
| Hot-data shrink + SoA | byte states, ushort node ids, hot/cold split packed to a quarter cache line, unmanaged arrays, versus MetalNES's std::vector<node_info> with int fields. The single biggest lever. |
| Iterative BFS (C#) | MetalNES's group walk is recursive; S1 makes it iterative so the .NET JIT inlines the whole chain (+~3% C#). Left recursive on Rust, where LLVM already inlines it. |
| Twin C# + Rust engines | Two independent codegens, bit-identical (same checksum): cross-validation and a built-in performance comparison. |
So the fast-path and the lowering pre-pass are genuine, effective processing strategies S1 adds beyond the reference — the "no breakthrough" caveat below is about the broad search for new algorithms, most of which were dead-ends.
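A minimal Rust sketch of the group walk described in the table above, combining three of the listed ideas: an iterative worklist instead of recursion, a per-node in-group flag for O(1) dedup, and a capacitance read deferred to the floating branch. All names (`Net`, `connect`, `resolve_group`) are hypothetical. Each node keeps a single channel list rather than the c1c2/gnd/pwr sub-lists, and the resolution priority (ground, then power or pull-up, then largest capacitance) is written as plain branches instead of the 256-entry lookup table, so treat it as an illustration of the idea rather than either engine's code.

```rust
pub const GND: u32 = 0; // assumed id of the ground node
pub const PWR: u32 = 1; // assumed id of the power node

pub struct Net {
    pub state: Vec<bool>,               // current logic level per node
    pub pullup: Vec<bool>,              // node has a pull-up
    pub cap: Vec<u32>,                  // capacitance, only read when a group floats
    pub channels: Vec<Vec<(u32, u32)>>, // per node: (gate node, node at the other end)
    in_group: Vec<bool>,                // O(1) membership flag for the current walk
    group: Vec<u32>,                    // worklist and result of the current walk
}

impl Net {
    pub fn new(n: usize) -> Self {
        let mut state = vec![false; n];
        state[PWR as usize] = true;
        Net {
            state,
            pullup: vec![false; n],
            cap: vec![0; n],
            channels: vec![Vec::new(); n],
            in_group: vec![false; n],
            group: Vec::new(),
        }
    }

    /// Adds a transistor whose channel joins `a` and `b` while `gate` is high.
    pub fn connect(&mut self, gate: u32, a: u32, b: u32) {
        self.channels[a as usize].push((gate, b));
        self.channels[b as usize].push((gate, a));
    }

    /// Walks the group reachable from `start` (assumed not to be a supply node),
    /// writes the resolved level to every member, and returns that level.
    pub fn resolve_group(&mut self, start: u32) -> bool {
        self.group.clear();
        self.group.push(start);
        self.in_group[start as usize] = true;
        let (mut has_gnd, mut has_pwr, mut has_pullup) = (false, false, false);

        // Iterative walk: the group vector doubles as the BFS queue.
        let mut next = 0;
        while next < self.group.len() {
            let node = self.group[next] as usize;
            next += 1;
            has_pullup |= self.pullup[node];
            for k in 0..self.channels[node].len() {
                let (gate, other) = self.channels[node][k];
                if !self.state[gate as usize] {
                    continue; // transistor is off
                }
                match other {
                    GND => has_gnd = true,
                    PWR => has_pwr = true,
                    _ => {
                        if !self.in_group[other as usize] {
                            self.in_group[other as usize] = true;
                            self.group.push(other);
                        }
                    }
                }
            }
        }

        let level = if has_gnd {
            false
        } else if has_pwr || has_pullup {
            true
        } else {
            // Rare floating branch: only here is the capacitance array touched.
            let mut best = self.group[0] as usize;
            for &n in &self.group {
                if self.cap[n as usize] > self.cap[best] {
                    best = n as usize;
                }
            }
            self.state[best]
        };

        // Clear flags for exactly the nodes visited, and commit the level.
        for k in 0..self.group.len() {
            let node = self.group[k] as usize;
            self.in_group[node] = false;
            self.state[node] = level;
        }
        level
    }
}
```

In this shape a singleton group costs one queue push, one pass over the node's channels and one flag clear, which is why the per-visit path is the place where every removed read or branch shows up in the benchmark.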
New here? Switch-level simulation primer →
MetalNES vs Visual6502 → S1 vs MetalNES →
The full lineage: switch-level model (Bryant, 1980s) → Visual6502 (chipsim.js) → MetalNES → AprVisual S1 — an educational walk-through with source-line citations at each step.
Layer A: algorithmic (graph / topology / matrix)
Two strategies here stuck and are real wins: the pure-logic fast-path and the lowering pre-pass (see the table above). Beyond those, there was no sweeping algorithmic breakthrough: our role was to drive an AI to propose, implement, and benchmark dozens of mathematically elegant strategies, then keep only the few that survived engineering reality. The other big contribution is the negative catalogue: cases that should win on paper but lose on real hardware, and exactly why.
Layer B: programming / compiler
This is the more familiar territory: hot-loop shape, memory layout, branch behaviour, what the JIT/LLVM will and won't do. Many ideas were proposed here, with an AI doing the coding and verification. Plenty of "this is obviously better" ideas turned out worse, alongside a few real, repeatable wins.
Negative cases (and why they fail)
- ✗ Multi-threading / per-chip parallelism: ~15× slower. The simulation settles in tiny waves (the average connected group is ~1.4 nodes); per-wave work is far too small to amortize thread sync, and a barrier every half-cycle dominates. Visual NES's author reported the same lock-contention wall in 2017.
- ✗ Bit-parallel / dense-scan BFS: bit-identical results but ~156× slower. Bit-parallel machinery is built for huge frontiers; its fixed overhead swamps the 99% of walks that touch one or two nodes.
- ✗ Cache-locality renumbering (RCM / Cuthill-McKee): predicted 1.2–1.5×; measured (C#) ~1.04× (boot) / ~0.98× (steady), i.e. nothing, even slightly negative. The hot working set already fits in L1d; reordering helps a cache capacity problem we don't have, and the access is jumpy/sparse, not a sequential scan.
- ✗ "Dead-end" / unobserved-node skip: broke correctness. A node with no fan-out gates still flows its state into any group it's pulled into; "no observer" was an illusion.
- ✗ Fancy data structures (hashset, presence arrays, counters): net-negative, matching Visual NES's findings. Maintaining a per-node supply-counter cost −6% because the write path runs ~10× more often than the read it was meant to speed up.
- ✗ Observability merge-pruning ("prune-merge"): collapsing nodes whose individual state can't be independently observed. It looked like an early ~1.3× win in an experimental branch, but it broke PPU rendering, and once the JIT inline cascade landed it went net-negative on the optimized S1 baseline (−4% C# / −7% Rust). A speedup that survived neither correctness nor a faster baseline.
- ✗ Levelized scheduling: ordering each settle wave by topological level to cut redundant re-evaluations. It did reduce the work slightly (dirty-set −1.3%, fewer glitches), but the counting-sort to maintain that order cost −15% hc/s: net negative. Root cause: ~94.5% of the netlist is one giant pass-transistor-coupled SCC, so there's no useful static topological order to exploit. (The structure was the wall, not the idea.)
- ✗ "Oblivious" full-sweep evaluation: dropping the dirty-set and recomputing every node each half-cycle. Measured ~121× slower: it re-evaluates ~14,700 nodes (× several sweeps to a fixpoint) when only a few hundred actually changed. The same "recompute everything" cause is exactly why the batch AOT/codegen (~3–6× slower) and GPU (~10–18× slower) backends also lost to the event-driven interpreter.
- ✗ Generation-counter dedup: replacing "clear the visited flags" with a monotonic epoch counter to skip the clear. −3.9%: the wider counter and its periodic reset cost more than the per-call clear it removed.
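To make the last trade-off concrete, here is a minimal Rust sketch of a generation-counter visited set. `EpochSet` and its methods are hypothetical names, and the 16-bit stamp width is an assumption chosen for illustration; the point is only to show where the wider per-node stamp and the periodic reset come from.

```rust
/// Visited set that is "cleared" by bumping a counter instead of rewriting flags.
pub struct EpochSet {
    stamp: Vec<u16>, // one stamp per node, wider than a 1-byte flag
    epoch: u16,      // stamp value that means "visited in the current walk"
}

impl EpochSet {
    pub fn new(n: usize) -> Self {
        // Stamps start at 0 and the epoch at 1, so nothing is visited initially.
        EpochSet { stamp: vec![0; n], epoch: 1 }
    }

    /// Starts a new walk. O(1), except when the counter wraps around.
    pub fn begin(&mut self) {
        self.epoch = self.epoch.wrapping_add(1);
        if self.epoch == 0 {
            // Wrapped: old stamps could alias future epochs, so reset everything.
            self.stamp.fill(0);
            self.epoch = 1;
        }
    }

    /// Returns true the first time `n` is inserted during the current walk.
    pub fn insert(&mut self, n: usize) -> bool {
        if self.stamp[n] == self.epoch {
            false
        } else {
            self.stamp[n] = self.epoch;
            true
        }
    }
}
```

When a typical walk touches one or two nodes, the flag-based alternative only has to clear one or two bytes afterwards, so there is very little clearing cost for the counter to win back.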
An important caveat: most of these failures are not flaws in the strategy itself; they are mismatches with the hardware it ran on. A method can be perfectly correct, even more elegant on paper, yet collapse the moment its instruction footprint (i-cache) or working set (d-cache) spills out of cache and starts thrashing. So what failed wasn't the idea; the idea just didn't fit this CPU's operating envelope. The flip side is the interesting part: as hardware advances (larger caches, more memory bandwidth), some of these more aggressive strategies could turn out to be exactly the right fit.
Wins (what actually moved the needle)
- ✓ Memory shrink: the biggest, most reliable lever. int → ushort/byte, hot/cold struct splitting (SoA), packing the hot per-node record to a quarter cache line. Keeping the working set in L1d matters more than any clever loop.
- ✓ Unlocking the JIT inline cascade: rewriting the recursive group walk as an iterative loop let the .NET JIT inline the whole chain (+~3% on C#). On Rust the same rewrite does not help, because LLVM already inlines the recursive form well.
- ✓ Removing work from the hot loop: deferring a rarely needed field read out of the per-visit path was +12% on C#; bounds-check elision was +12% on Rust.
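As an illustration of the memory-shrink item, the sketch below shows what a hot/cold split with a 16-byte hot record might look like in Rust. The field set is hypothetical (the actual engine layout is not given here), and the 64-byte cache line is an assumption that holds for typical x86-64 parts.

```rust
use std::mem::size_of;

pub const CACHE_LINE: usize = 64; // assumed line size, typical for x86-64

/// Hot record: only what the walk reads on every visit (hypothetical fields).
#[repr(C)]
#[derive(Clone, Copy, Default)]
pub struct HotNode {
    pub first_channel: u32, // offset into a shared array of u16 neighbour ids
    pub first_gate: u32,    // offset into a shared array of u16 transistor ids
    pub channel_count: u16,
    pub gate_count: u16,
    pub state: u8,
    pub flags: u8,
}

/// Cold record: rarely read, stored separately so it never shares a line with hot data.
#[derive(Clone, Copy, Default)]
pub struct ColdNode {
    pub capacitance: u32,
    pub name_index: u32,
}

// Compile-time check: exactly four hot records fit in one cache line.
const _: () = assert!(size_of::<HotNode>() * 4 == CACHE_LINE);

/// Two parallel arrays indexed by node id.
pub struct Nodes {
    pub hot: Vec<HotNode>,
    pub cold: Vec<ColdNode>,
}

impl Nodes {
    pub fn with_len(n: usize) -> Self {
        Nodes {
            hot: vec![HotNode::default(); n],
            cold: vec![ColdNode::default(); n],
        }
    }
}
```

The compile-time assertion is the useful part of the design: if someone later adds a field that pushes the hot record past 16 bytes, the build fails instead of the benchmark quietly regressing.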
The meta-lesson: the same change can be +C# and −Rust
A recurring, important finding: an identical source-level change is often a win on one compiler and a loss on the other. Example — widening a fast-path classifier (which adds one branch to the hottest inlined function) measured +0.4–1.3% on C#/.NET-JIT but −1.9–2.5% on Rust/LLVM, because under LLVM's already-tight codegen that one extra branch costs more than the work it saves. Never sync a hot-path change across engines blindly — measure each. And for sub-2% effects, batched A/B is untrustworthy; we use interleaved paired measurement (alternate builds each round) to beat time-drift.
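The interleaved paired measurement mentioned above can be sketched as follows. This is an assumption-laden illustration: `interleaved`, `median_ratio` and `seconds` are hypothetical names, and the two variants are timed as in-process closures, whereas alternating two separate builds would mean launching two executables in the same alternating pattern.

```rust
use std::time::Instant;

/// Wall-clock seconds for one run of `run`.
fn seconds<F: FnMut()>(run: &mut F) -> f64 {
    let start = Instant::now();
    run();
    start.elapsed().as_secs_f64()
}

/// Median of the per-round ratios b / a. Pairing by round cancels slow drift,
/// because both samples of a pair are taken under nearly the same conditions.
pub fn median_ratio(a_secs: &[f64], b_secs: &[f64]) -> f64 {
    let mut ratios: Vec<f64> = a_secs.iter().zip(b_secs).map(|(a, b)| b / a).collect();
    let n = ratios.len();
    assert!(n > 0, "need at least one round");
    ratios.sort_by(|x, y| x.total_cmp(y));
    if n % 2 == 1 {
        ratios[n / 2]
    } else {
        (ratios[n / 2 - 1] + ratios[n / 2]) / 2.0
    }
}

/// Runs `a` and `b` alternately, swapping which goes first each round,
/// and returns the median time ratio b / a (above 1.0 means `b` is slower).
pub fn interleaved<A: FnMut(), B: FnMut()>(rounds: usize, mut a: A, mut b: B) -> f64 {
    let (mut ta, mut tb) = (Vec::new(), Vec::new());
    for round in 0..rounds {
        if round % 2 == 0 {
            ta.push(seconds(&mut a));
            tb.push(seconds(&mut b));
        } else {
            tb.push(seconds(&mut b));
            ta.push(seconds(&mut a));
        }
    }
    median_ratio(&ta, &tb)
}
```

With samples such as `a = [1.0, 2.0, 4.0]` and `b = [1.1, 2.2, 4.4]` seconds, the machine slows down fourfold over the session yet every paired ratio stays at about 1.1; a batched comparison of all `a` runs followed by all `b` runs would mostly measure the drift.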
03 Where it stands, and a call to help
Real time was known to be out of reach from day one. But once the engine got fast enough, it crossed into being a genuinely useful, verifiable tool. The numbers below are our actual run on the dev machine (the screenshot at the bottom of this section), benchmarking 300,000 master half-cycles of full_palette:
On this machine Rust renders a frame in 9.94 s and C# in 10.62 s (both engines produce a bit-identical result — same checksum 0x794A43A8DF169ADA). NES NTSC real-time needs 42,954,552 hc/s, so we're still ~600× short — the near-term goal is to stably hold one frame under 10 seconds. That gap is exactly why outside ideas and faster machines are valuable.
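The arithmetic behind those figures is easy to reproduce. The sketch below uses the real-time requirement quoted above and the throughput reported for this run (71.9K hc/s for Rust, 67.3K hc/s for C#); the function names are hypothetical and exist only for this example.

```rust
/// Half-cycles per second needed for NTSC real time, as quoted in the text.
pub const REALTIME_HC_PER_SEC: f64 = 42_954_552.0;

/// How many times faster the engine would have to be to reach real time.
pub fn realtime_gap(hc_per_sec: f64) -> f64 {
    REALTIME_HC_PER_SEC / hc_per_sec
}

/// Half-cycles simulated in one frame, from throughput and frame time.
pub fn hc_per_frame(hc_per_sec: f64, secs_per_frame: f64) -> f64 {
    hc_per_sec * secs_per_frame
}

/// Throughput needed to finish a frame of `hc_in_frame` within `target_secs`.
pub fn required_rate(hc_in_frame: f64, target_secs: f64) -> f64 {
    hc_in_frame / target_secs
}
```

Plugging in the numbers: `realtime_gap(71_900.0)` is about 597 and `realtime_gap(67_300.0)` about 638, which is where the "~600×" comes from. Both engines work out to roughly 714.7K half-cycles per frame (71,900 × 9.94 and 67,300 × 10.62), so holding a frame under 10 seconds needs about 71.5K hc/s.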
The machine these numbers came from
| CPU | AMD Ryzen 7 3700X, 8 cores / 16 threads, Zen 2, 3.6 GHz base (~4.4 GHz boost) |
|---|---|
| Cache | L1d 32 KB and L1i 32 KB per core, L2 512 KB per core, L3 32 MB |
| Memory | 32 GB DDR4-2133 (4 × 8 GB) |
| OS | Windows 11 Home (build 26200) |
Note: the engine is single-threaded and memory-latency bound. Its speed tracks per-core IPC and L1/L2 latency far more than core count, so a faster single core will beat more cores here.
Have a faster CPU? Run the benchmark and share your numbers. The package is portable, self-contained (no .NET install), and ships both the C# and Rust engines for Windows + macOS (Apple Silicon).
⬇ Download AprVisualBenchMark.zip · All releases
📊 Then upload your result to the community leaderboard ↗
04 How it compares to other netlist NES / 6502 simulators
Speed claims across these projects use different units (6502 clocks, chip steps, trace lines, master half-cycles), so this is not a clean apples-to-apples table — hardware era, tracing, feature completeness and frame definitions all differ. With that caveat, here's the public picture:
| Project | Scope | Public speed | ~ per frame |
|---|---|---|---|
| Visual6502 (JS) | 6502 transistor-level | ~1 clock/s (animated) – 250 Hz+ (expert) | n/a (CPU only) |
| perfect6502 | 6502 NMOS netlist (C) | ~1/30 of a 1 MHz 6502 on a 2025 CPU | n/a (CPU only) |
| Visual NES | Visual 2A03 + 2C02 (C++/C#) | ~5000 Hz (dual-chip), ~7500 Hz after data shrink + PGO (2017, old i5) | ~30–60 s |
| MetalNES | full NES-001 board, transistor-level | user/press reports (not a benchmark) | ~1–2 min |
| AprVisual S1 (C#) | NES switch-level, pure BFS | 67.3K hc/s (this machine, 300k hc) | 10.62 s |
| AprVisual rust-s1 | NES switch-level, pure BFS | 71.9K hc/s (this machine, 300k hc) | 9.94 s |
Takeaways
- Full NES transistor-level software simulation publicly lands in the tens-of-seconds to minutes per frame range — never the FPS range of an ordinary emulator.
- AprVisual's current numbers sit clearly above the public early figures for Visual NES and MetalNES — but note those are from a 2017 i5 (often with tracing) and 2022-era user reports, so it is not a fair head-to-head.
- Our experience matches Visual NES's exactly: the hot spot is the recursive group / connected-component search, and data shrink + cache + PGO beat fancier data structures. Independent projects converging on the same conclusion is a strong signal.
- perfect6502 shows even a single 6502 netlist, in highly-optimized C on a 2025 CPU, only hits ~1/30 real time — transistor-level exactness is just expensive.
Sources: Visual6502 & NESdev wiki, perfect6502 README, Visual NES README + author's 2017 nesdev threads, MetalNES README + press/HN reports. AprVisual figures are this machine's run (full_palette, 300k hc).
Read the full comparison →
Per-project breakdowns, the unit caveats, frame-time derivations, and all sources.