The overfitting test: three more chips, same engine
We took three standalone netlists straight from the Visual 6502 project and ran them on the same WireCore engine. None has a cartridge, a PPU, or a test ROM, so there is nothing to load. Instead we drive each chip at the pin boundary with an infinite NOP sled: whenever the CPU asserts a read, the harness forces the data bus to that chip's NOP opcode (0xEA for the 6502, 0x01 for the 6800, 0x00 for the Z80) rather than consulting memory. The program counter then walks forever through an endless field of no-ops — a clean, perfectly reproducible workload that exercises fetch, decode, and the entire timing state machine without any program at all.
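The pin-boundary idea can be sketched in a few lines. This is a minimal illustration, not the project's harness: the class name `NopSledBus` and its methods are hypothetical, and dropping writes is an assumption. Only the three NOP opcode values come from the text.

```python
# NOP opcodes per chip, as given in the text above.
NOP_OPCODE = {"6502": 0xEA, "6800": 0x01, "z80": 0x00}


class NopSledBus:
    """Hypothetical pin-boundary harness: there is no memory behind the bus."""

    def __init__(self, chip: str) -> None:
        self.nop = NOP_OPCODE[chip]
        self.reads = 0
        self.writes_ignored = 0

    def on_read(self, address: int) -> int:
        # Every read returns the NOP opcode, whatever the address is.
        self.reads += 1
        return self.nop

    def on_write(self, address: int, value: int) -> None:
        # Assumption: writes are dropped, since there is nothing to store to.
        self.writes_ignored += 1


bus = NopSledBus("6502")
fetched = [bus.on_read(addr) for addr in (0x0000, 0x0001, 0xFFFC)]
```

Because the returned byte never depends on the address, the workload is identical on every run, which is what makes the sled reproducible.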
The simulation core never changed. The existing parser already understood raw segdefs/transdefs/nodenames arrays, so loading a new chip needed only a small loader plus a ~20-line per-chip clock/bus driver (the 6800's two-phase φ1/φ2 clock and the Z80's _rd/_mreq/_wr protocol differ from the 6502's). The whole pipeline — lowering, the P-1…P-4 prunes, the class-major renumber, the self-captured relayout, the unmanaged hot path — applies unchanged. All three power on and execute (the address bus advances as the program counter increments). The engine is not specialized to the NES 6502.
Splitting language from algorithm
A headline like "584× faster than the original" is almost meaningless, because it mixes two completely different things: the original sim is JavaScript, ours is C#. To separate them we transliterated the reference algorithm itself — the recursive group-walk of chipsim.js, object-per-node, no prunes, no fast path, no struct-of-arrays — into C#. Its transistor counts match the JavaScript original exactly (3510 / 3923 / 6781), so it is a faithful port, not a strawman. Now three engines run the identical NOP sled, on the same machine, thread-pinned:
| Chip | Nodes | Visual 6502 JS (Node.js) | C# naive | AprVisual | Language | Algorithm |
|---|---|---|---|---|---|---|
| MOS 6502 | 1704 | 249 | 17,914 | 157,154 | ~72× | ~8.8× |
| Motorola 6800 | 2012 | 149 | 12,582 | 94,921 | ~84× | ~7.5× |
| Zilog Z80 | 3595 | 166 | 12,255 | 62,449 | ~74× | ~5.1× |
Half-cycles per second. "Language" = JavaScript→C# at the same (naive) algorithm; "Algorithm" = naive→AprVisual in the same (C#) language. Every configuration is bit-exact: the per-node checksum is invariant across no-renumber / renumber / BFS-key / self-capture, so the renumber and locality machinery is performance-only. (Updated 2026-06-19: the AprVisual / "C# ours" columns were re-measured on the current synced engine — release benchmark-2026.06.19, best of 5, boost clock; the JS / C# naive / C++ columns are the prior-engine snapshot.)
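As a check on how the two ratio columns are derived, the following sketch recomputes them from the half-cycle rates in the table. The dictionary layout and the `factors` helper are illustrative names, not part of the project's tooling.

```python
# Half-cycles per second, copied from the table above.
RATES = {
    "MOS 6502": {"js": 249, "naive": 17_914, "ours": 157_154},
    "Motorola 6800": {"js": 149, "naive": 12_582, "ours": 94_921},
    "Zilog Z80": {"js": 166, "naive": 12_255, "ours": 62_449},
}


def factors(row):
    """Return (language, algorithm) speedup factors for one chip."""
    language = row["naive"] / row["js"]      # JavaScript -> C#, same naive algorithm
    algorithm = row["ours"] / row["naive"]   # naive -> AprVisual, same language (C#)
    return language, algorithm


for chip, row in RATES.items():
    language, algorithm = factors(row)
    print(f"{chip}: language ~{language:.0f}x, algorithm ~{algorithm:.1f}x")
```

Each factor holds one variable fixed: the language factor compares two runs of the same naive algorithm, and the algorithm factor compares two C# engines.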
The honest headline
The raw JavaScript→AprVisual ratio is enormous — 376× to 599× — but it factors cleanly: roughly 70–85× is just JavaScript versus C# at the identical algorithm, and only about 5–9× is our algorithm, measured in a single controlled language. That ~5–9× is the number to believe and cite — and it agrees with the ~2.5× our engine beats its optimized C++ ancestor (MetalNES), since a tuned C++ baseline is itself several times faster than the naive port. It would have been easy, and wrong, to sell the 584× as the contribution.
Native or managed? We ported the algorithms to C++ too
The "language" factor above bundles JavaScript and C# together — so we asked the obvious question directly: is managed C# the slow part, and would native C++ be much faster? We transliterated the same naive algorithm into C++ (clang -O3), and then ported our full optimized engine to C++ too — the whole hot path, including the cache layout (a packed 16-byte node struct), the fast-path dispatch, the two-node pair path, and the class-major range-prune. Both load the identical netlist; the full port is validated bit-exact against C# (the per-node checksum matched on all three chips).
| Chip | C# naive | C++ naive | C++ ours-full | C# ours (full) |
|---|---|---|---|---|
| MOS 6502 | 17,914 | 26,239 | 118,918 | 157,154 |
| Motorola 6800 | 12,582 | 17,137 | 61,630 | 94,921 |
| Zilog Z80 | 12,255 | 18,452 | 40,255 | 62,449 |
The language gap is small — and it flips with tuning. At the naive algorithm, native C++ is ~1.4–1.5× faster than managed C#. But at the full optimized engine, C# is ~1.27–1.56× faster than a faithful C++ port — an inversion. The reason: the C# hot path is heavily micro-tuned for the .NET JIT + dynamic PGO (the inline cascade, 64-bit dual-loads, profile-ordered branches); the straightforward clang -O3 port replicates the algorithm and the cache layout, but not that instruction-level tuning and has no PGO. So this is "heavily-tuned managed beats faithful-but-untuned native" — not a language ceiling; equal tuning effort would likely flip it back. Across JS→C#→C++ the language never moves the needle more than ~1.5×; the ~70–85× outlier is purely the JavaScript interpreter, and our algorithm contributes ~5–9×.
A side-lesson: the cache layout is an optimization. Our first C++ port used split per-field arrays (one for flags, one for the channel payload, …) and lost to C#. It only caught up once it used the same packed 16-byte node struct (four per cache line) — reading one node then touches one cache line, not five. A faithful port of "our algorithm" has to port the layout, not just the logic.
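The layout arithmetic behind that lesson can be checked with a small sketch. The field names and widths below are hypothetical; only the 16-byte total and the "four per cache line" figure come from the text, and the 64-byte cache line is an assumption that matches common x64 and Cortex-A76 parts.

```python
import ctypes

CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines


class PackedNode(ctypes.Structure):
    # Hypothetical fields: five values that a split layout would keep in five arrays.
    _fields_ = [
        ("flags", ctypes.c_uint16),
        ("kind", ctypes.c_uint16),
        ("chan_a", ctypes.c_uint32),
        ("chan_b", ctypes.c_uint32),
        ("group", ctypes.c_uint32),
    ]


NODE_BYTES = ctypes.sizeof(PackedNode)            # 2 + 2 + 4 + 4 + 4 bytes
NODES_PER_LINE = CACHE_LINE_BYTES // NODE_BYTES


def lines_touched_packed(index: int) -> int:
    """Cache lines covered by node `index`, assuming a 16-byte-aligned array base."""
    first = (index * NODE_BYTES) // CACHE_LINE_BYTES
    last = (index * NODE_BYTES + NODE_BYTES - 1) // CACHE_LINE_BYTES
    return last - first + 1


# In a split layout each field lives in its own array, so one node read
# can touch as many cache lines as there are fields.
LINES_TOUCHED_SPLIT = len(PackedNode._fields_)
```

With natural alignment the struct has no padding, so a node never straddles a cache line as long as the array itself starts on a 16-byte boundary.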
A second CPU architecture: does it all hold on ARM?
Everything above ran on one x64 desktop (AMD Zen 2). A natural question is whether these findings are specific to that CPU. So we re-ran the entire study on a completely different machine — a Raspberry Pi 5 (Arm Cortex-A76, ARMv8.2, 4 cores @ 2.4 GHz) — a different instruction set, microarchitecture, and memory system, with the same .NET 11 + Node + compiler toolchain (here g++ -O3 in place of clang). The engine source was not touched — only the build needed an arm64 target flag.
Bit-exact across two instruction sets. The full NES engine (S1) produced the identical full-state checksum on ARM as on x64 — 0x9174E19D961CB6E5, bit-for-bit — confirming the simulation is deterministic across completely different hardware (pure integer/bit logic, no floating point). On ARM it sustains ~71.7K hc/s (≈0.52× the desktop's ~138.7K — the clock + microarchitecture gap), and it became the first ARM entry on the public leaderboard.
The cross-language comparison reproduces too — same NOP sled, three bare CPUs, all the engines, on the A76:
| Chip | JS naive | C# naive | C++ naive | C++ ours-full | C# ours |
|---|---|---|---|---|---|
| MOS 6502 | 148 | 10,909 | 16,720 | 69,163 | 74,955 |
| Motorola 6800 | 89 | 6,818 | 11,124 | 35,566 | 38,116 |
| Zilog Z80 | 99 | 7,063 | 12,282 | 27,631 | 28,356 |
Pi 5 (Cortex-A76), hc/s median, NOP sled, unpinned. Every "ours" engine is bit-exact per chip (C# ours = C++ ours-core = C++ ours-full; checksums 0x97C10609CF8BF86F / 0x713AF51D83FAAC5E / 0x0CE4E8586F898641 for 6502 / 6800 / z80).
Every structural conclusion holds on the new CPU. The language dividend (JS→compiled) is ~71–77× on ARM (x64: ~70–85×). The algorithmic gain (naive→ours, same language) is ~4–7× (x64: ~5–9×). And the inversion still holds — heavily-tuned C# beats the faithful C++ port — but it shrinks to ~1.03–1.08× on ARM (x64 was ~1.27–1.56×). So on ARM the two compiled languages sit within ~8% of each other: the language axis matters even less here. (Most likely the .NET arm64 JIT + dynamic PGO is less mature than on x64, and/or g++ on the A76 does relatively better than clang did on x64 — either way it is tuning/runtime, not a language ceiling.) The fuzz-vs-NOP sign reproduces as well (6502 and Z80 slower under fuzz, 6800 faster). Net: none of these findings is overfit to one microarchitecture.
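The three ARM ratios quoted above follow directly from the Pi 5 table. A short sketch that recomputes them; the dictionary keys and the `ratios` helper are illustrative names only.

```python
# Pi 5 (Cortex-A76) half-cycles per second, copied from the table above.
PI5 = {
    "MOS 6502": {"js": 148, "cs_naive": 10_909, "cpp_ours": 69_163, "cs_ours": 74_955},
    "Motorola 6800": {"js": 89, "cs_naive": 6_818, "cpp_ours": 35_566, "cs_ours": 38_116},
    "Zilog Z80": {"js": 99, "cs_naive": 7_063, "cpp_ours": 27_631, "cs_ours": 28_356},
}


def ratios(row):
    """Return (language, algorithm, inversion) for one chip."""
    language = row["cs_naive"] / row["js"]        # JS -> compiled, naive algorithm
    algorithm = row["cs_ours"] / row["cs_naive"]  # naive -> ours, both in C#
    inversion = row["cs_ours"] / row["cpp_ours"]  # tuned C# vs faithful C++ port
    return language, algorithm, inversion


for chip, row in PI5.items():
    language, algorithm, inversion = ratios(row)
    print(f"{chip}: language ~{language:.0f}x, "
          f"algorithm ~{algorithm:.1f}x, C#/C++ ~{inversion:.2f}x")
```

An inversion ratio above 1.0 means the C# engine is the faster of the two; on this table it stays above 1.0 for all three chips, but only barely.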
Three ways to stress a chip with no ROM
With no program to run, we drive the chip at the pin boundary three ways: the NOP sled (steady and regular), random-bus fuzzing (maximum entropy), and reset-hold (the chip frozen in reset, so only the clock tree toggles).
Fuzzing has a trap. Random opcodes include ones that halt the CPU: the 6502's KIL/JAM (×12), the 6800's HCF "halt and catch fire" (×5), and the Z80's HALT (×1). Feed one and the processor locks up, so you end up timing a dead chip rather than a stressed one. Unfiltered, the 6502 fuzz run reports a misleadingly fast 232K hc/s because the chip is jammed (address bus stuck at 0xFFFF); excluding the 12 KIL opcodes drops it to a real 71K. We exclude every halt/lock opcode (list confirmed with Gemini) so the program counter keeps advancing.
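A jam-excluding fuzz source might look like the sketch below. The function names are hypothetical, and filtering every bus byte (not only opcode fetches) is an assumption. The 6502 KIL/JAM set and the Z80 HALT byte are the well-known values; the five 6800 HCF opcodes are not listed in the text, so that chip is deliberately left out here.

```python
import random

# Halt/lock opcodes to exclude from the fuzz stream.
HALT_OPCODES = {
    "6502": frozenset({0x02, 0x12, 0x22, 0x32, 0x42, 0x52,
                       0x62, 0x72, 0x92, 0xB2, 0xD2, 0xF2}),  # KIL/JAM, 12 opcodes
    "z80": frozenset({0x76}),                                 # HALT
}


def safe_opcodes(chip):
    """All byte values that do not lock up the given chip."""
    return [op for op in range(256) if op not in HALT_OPCODES[chip]]


def fuzz_bytes(chip, count, seed=0):
    """Seeded random bus bytes drawn only from the safe set (reproducible)."""
    rng = random.Random(seed)
    pool = safe_opcodes(chip)
    return [rng.choice(pool) for _ in range(count)]


stream = fuzz_bytes("6502", 1000)
```

Seeding the generator keeps the fuzz workload repeatable between runs, which matters when comparing engines on the same byte stream.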
| Chip | NOP sled | Random fuzz (jam-excluded) | Reset-hold |
|---|---|---|---|
| MOS 6502 | 149,736 | 70,941 | 155,915 |
| Motorola 6800 | 88,313 | 114,222 | 165,764 |
| Zilog Z80 | 62,952 | 43,955 | 161,838 |
Our engine, hc/s, same machine, thread-pinned. Reset-hold is fastest and nearly chip-independent (~156–166K) — it measures the engine's base clock-tree overhead. Fuzz vs NOP is chip-dependent, and the JavaScript and C# engines agree on the sign: the 6502 and Z80 run slower under fuzz (genuine stress — varied execution churns more nodes), but the 6800 runs faster (its NOP, 0x01, is an unusually heavy instruction). So the a-priori "fuzz = worst case" expectation only holds for two of the three chips.
How fast is that, in kilohertz?
A bare CPU needs none of the NES's master-clock division (the 2A03 runs at the master clock over twelve), so one CPU cycle is just two half-cycles and the simulated clock is the half-cycle rate halved. Against the home computers these chips actually powered, one modern core gets surprisingly close:
| Chip | Simulated clock | 1980s host (real clock) | Gap to the real machine |
|---|---|---|---|
| MOS 6502 | ~79.3 kHz | Apple II / C64 (1.0 MHz) | ~13× |
| Motorola 6800 | ~48.2 kHz | SWTPC 6800 (~1 MHz) | ~21× |
| Zilog Z80 | ~31.4 kHz | ZX Spectrum / MSX / CPC (3.5–4 MHz) | ~111–127× |
A single core is thus far closer to interactively re-running these vintage machines — transistor by transistor — than it is to the whole NES (which is ~310× from real time), simply because there's one die instead of two and no clock divider in the way. Matching the originals outright stays out of reach on one core, but the gap is now a concrete number rather than an abstraction.
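The conversion behind the table is two divisions. A minimal sketch, with hypothetical helper names, using the simulated clocks as printed in the table:

```python
def simulated_clock_hz(half_cycles_per_second):
    """One CPU cycle is two half-cycles, so the clock is the hc/s rate halved."""
    return half_cycles_per_second / 2


def gap_to_real(real_hz, simulated_hz):
    """How many times slower the simulation runs than the real machine."""
    return real_hz / simulated_hz


# Simulated clocks (Hz) and real host clocks (Hz) from the table above.
GAPS = {
    "MOS 6502": gap_to_real(1.0e6, 79.3e3),
    "Motorola 6800": gap_to_real(1.0e6, 48.2e3),
    "Zilog Z80 @ 3.5 MHz": gap_to_real(3.5e6, 31.4e3),
    "Zilog Z80 @ 4 MHz": gap_to_real(4.0e6, 31.4e3),
}

for name, gap in GAPS.items():
    print(f"{name}: ~{gap:.0f}x from real time")
```

For example, `simulated_clock_hz(157_154)` gives 78,577 Hz for the 6502 NOP-sled rate quoted earlier; the table's ~79.3 kHz is slightly higher because it includes the later engine gains described in the notes.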
Where the numbers come from
Same machine (AMD Ryzen 7 3700X, thread-pinned). NOP-sled workload, 1,000,000 half-cycles per round, steady-state median of 6 rounds (the first .NET rounds run below steady state during JIT tiering / PGO warm-up). Updated 2026-06-21: the AprVisual ours engine now carries the 2026.06.20 turn-off-dedupe + GndPwr fast-path wins — bit-exact on all three chips (reproduces the published golden checksums); the kHz figures apply each chip's interleaved-A/B gain (+0.6–1.5%) to this cool-machine baseline.
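The round-and-median procedure can be sketched as follows. This is an illustration under stated assumptions, not the benchmark tool itself: `measure_hc_per_s`, `run_round`, and the injectable `clock` are hypothetical names, and discarding two warm-up rounds is a guess at how the below-steady-state rounds are handled. The round size and the six measured rounds come from the text.

```python
import statistics
import time


def measure_hc_per_s(run_round, half_cycles=1_000_000, rounds=6,
                     warmup_rounds=2, clock=time.perf_counter):
    """Median half-cycles per second over `rounds` timed rounds.

    run_round(n) must simulate n half-cycles. Warm-up rounds run first and
    are not timed (assumption: this is how JIT tiering is kept out of the result).
    """
    for _ in range(warmup_rounds):
        run_round(half_cycles)

    rates = []
    for _ in range(rounds):
        start = clock()
        run_round(half_cycles)
        elapsed = clock() - start
        rates.append(half_cycles / elapsed)

    # The median is robust to a single slow or fast outlier round.
    return statistics.median(rates)
```

Passing the clock as a parameter keeps the helper testable with a fake timer; in real use the default `time.perf_counter` is sufficient.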
Toolchains & versions. The "JavaScript" engine is not a browser run — it is the original Visual 6502 chipsim.js + wires.js + macros.js loaded verbatim and run headless under Node.js 24.13.0 (V8) by our tools/visual6502-node harness, which stubs only the DOM/UI functions — so it is the genuine site's simulation core, just without a browser (the expert-mode UI is a separate layer). C# (the naive port and the AprVisual engine) is built Release on .NET 11 (x64). C++ (both the naive port and the ours-core port) is clang/LLVM 22.1.6, -O3 -std=c++17. The C# benches are thread-pinned (--pin); the C++ binaries are not (no such flag), but the language gaps here dwarf the ~10% pinning effect. The two C++ programs load a netlist exported from C# (--export-netlist / --export-engine) so the data is provably identical — no .js parser in C++.
Tools: src/AprVisual.etc/ (--cpu-bench <dir> --chip 6502|6800|z80 [--naive] [--workload nop|fuzz|reset]), tools/visual6502-node (JS), tools/cpp-naive + tools/cpp-ours (C++). 6800 and Z80 follow the original harness in skipping the weak (depletion) transistors. Full study notes: MD/note/2026-06-15-…. Companion pages: the reducibility study (S2), the event-count prunes, self-captured relayout, the netlist-family comparison.