AprVisual — performance ↔ real-time calculator

Throughput (K hc/s)

⇅ The two boxes stay in sync.

Target FPS (fps)

FPS is capped at 60.0988, the real NES NTSC frame rate. You can't ask for more than the hardware produces.

Results:
Time per frame
Effective FPS
Times slower than real time
Percentage of real-time speed
Throughput needed for that FPS

⎇ The same throughput, expressed as the chips' clock speeds:

Simulated master clock (real NES: 21.477 MHz)
Equivalent 6502 (2A03) clock (real NES: 1.79 MHz)
Equivalent PPU (2C02) clock (real NES: 5.37 MHz)


How the numbers are derived

Everything follows from two facts about the NES and one definition of the unit:

Quantity	Value	Meaning
hc (half-cycle)	unit of work	one toggle of the master clock; the engine's fundamental step.
Real-time rate	`42,954,552 hc/s`	how many half-cycles a real NES runs every second.
NES NTSC frame rate	`60.0988 fps`	frames a real NES draws per second.
Half-cycles per frame	`≈ 714,732 hc`	= 42,954,552 ÷ 60.0988; the work in one frame.

From there, given a throughput R (in hc/s):

Seconds per frame = 714,732 ÷ R
Effective FPS = R ÷ 714,732
Slower than real time = 42,954,552 ÷ R (the "×" figure)
% of real-time speed = R ÷ 42,954,552 × 100%
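
The four formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the calculator's actual source: the names `Speed` and `from_throughput` are made up for this example, and the constants are the ones quoted in the table.

```python
from dataclasses import dataclass

# Constants quoted in the table above.
REAL_TIME_HC_PER_S = 42_954_552                 # half-cycles a real NES runs per second
NTSC_FPS = 60.0988                              # real NES NTSC frame rate
HC_PER_FRAME = REAL_TIME_HC_PER_S / NTSC_FPS    # ≈ 714,732 hc


@dataclass
class Speed:
    """Hypothetical container for the four derived figures."""
    seconds_per_frame: float
    effective_fps: float
    slowdown: float            # the "×" figure
    percent_real_time: float


def from_throughput(r_hc_per_s: float) -> Speed:
    """Convert a throughput R (hc/s) into the four derived figures."""
    if r_hc_per_s <= 0:
        raise ValueError("throughput must be positive")
    return Speed(
        seconds_per_frame=HC_PER_FRAME / r_hc_per_s,
        effective_fps=r_hc_per_s / HC_PER_FRAME,
        slowdown=REAL_TIME_HC_PER_S / r_hc_per_s,
        percent_real_time=r_hc_per_s / REAL_TIME_HC_PER_S * 100.0,
    )


if __name__ == "__main__":
    s = from_throughput(135_900)   # 135.9K hc/s
    print(f"{s.seconds_per_frame:.1f} s/frame, {s.effective_fps:.3f} FPS, "
          f"{s.slowdown:.0f}x slower, {s.percent_real_time:.2f}% of real time")
    # prints: 5.3 s/frame, 0.190 FPS, 316x slower, 0.32% of real time
```

The sketch uses the unrounded quotient for half-cycles per frame rather than 714,732, so results can differ from hand calculation in the last digits.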

Chip-clock equivalents

An hc is one half-cycle of the master clock (one toggle of the board's clk node, the engine's fundamental step), and the NES derives every other clock from it by fixed division. So a throughput R (in hc/s) maps directly onto "how fast the simulated chips are running":

Clock	Conversion	Real NES (NTSC)
Master clock	`R ÷ 2` (1 cycle = 2 half-cycles)	`21.477272 MHz`
6502 / 2A03 CPU clock	`R ÷ 24` (CPU = master ÷ 12)	`1.789773 MHz`
PPU / 2C02 pixel clock	`R ÷ 8` (PPU = master ÷ 4)	`5.369318 MHz`

Example: 135.9K hc/s ⇒ a 67.9 kHz master clock ⇒ an equivalent ~5.7 kHz 6502 and ~17.0 kHz PPU, against the real 1.79 MHz / 5.37 MHz. Be careful when comparing with other projects' "Hz" figures: many quote CPU-clock-equivalent numbers (divide ours by 24 first), and most simulate one chip, not the whole console.
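
As a rough sketch of the conversions in the table, the snippet below maps a throughput in hc/s onto the three equivalent clocks, and goes the other way for a figure quoted as a CPU clock. The function names are hypothetical and chosen for this example only; the divisors are the ones given above.

```python
def clock_equivalents(r_hc_per_s: float) -> dict[str, float]:
    """Express a throughput R (hc/s) as equivalent chip clocks in Hz."""
    master_hz = r_hc_per_s / 2        # 1 master cycle = 2 half-cycles
    return {
        "master_hz": master_hz,
        "cpu_hz": master_hz / 12,     # 6502 / 2A03: master ÷ 12, i.e. R ÷ 24
        "ppu_hz": master_hz / 4,      # PPU / 2C02:  master ÷ 4,  i.e. R ÷ 8
    }


def hc_per_s_from_cpu_hz(cpu_hz: float) -> float:
    """Convert a CPU-clock-equivalent figure back into hc/s (multiply by 24)."""
    return cpu_hz * 24


if __name__ == "__main__":
    eq = clock_equivalents(135_900)   # 135.9K hc/s
    print(f"master {eq['master_hz'] / 1e3:.2f} kHz, "
          f"6502 {eq['cpu_hz'] / 1e3:.2f} kHz, "
          f"PPU {eq['ppu_hz'] / 1e3:.2f} kHz")
```

Comparing two projects is then a matter of bringing both numbers to the same unit first, for example by passing the other project's CPU-clock figure through `hc_per_s_from_cpu_hz`. That still does not account for whether the other project simulates one chip or the whole console.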

Reality check: the dev machine's C# engine reaches ~135.9K hc/s at boost (2026-06, range-prune + self-captured locality key + the B1 pair path). Plug it in and you get ~5.3 s per frame, ~0.190 FPS, and ~316× slower than real time, i.e. about 0.32% of real-time speed. That gap is the whole point of the project: see why it can't realistically be closed on one CPU core.

Will this gap ever close? What the research found

The chip is the obstacle, not the code

The 2A03 and 2C02 are early-1980s NMOS designs. A modern chip is clean synchronous digital logic that maps neatly onto Boolean gates and registers; these chips are dense, irregular switch-level networks, where a signal propagates as conduction ripples through small clusters of transistors. The engine already runs at that network's natural minimum granularity: each event settles a conducting group that averages only ~1.4 nodes. There is no large, regular block of computation to batch, vectorize, or hand off to a GPU.

The rule: fully automatic, no manual abstraction

A deliberate constraint of this project is no hand-written abstraction. We don't author a faster behavioral model of the CPU by hand; the speed has to come purely from the program automatically shedding work (detecting what cannot change and skipping it) on a single CPU core. Under that rule, almost every acceleration strategy we tried lost to the plain event-driven interpreter:

Strategy tried	Measured result
IR interpreter	−2.5% (slower)
Ahead-of-time codegen	3–6× slower
GPU kernel (single instance)	~10.7× slower
Bit-parallel BFS	~156× slower
Per-chip multithreading	~15× slower

We even proved that ~98.9% of the chip's activity is reducible to logic + registers, yet that reduction still didn't beat the interpreter on one core, because the interpreter was already doing the minimum work per event. The full account is in "why IR / codegen hit the wall" and the study paper.

Today's wall is memory movement, not arithmetic

So the bottleneck now isn't how many operations the ALU can do; it's getting the data to the ALU. The hot loop chases pointers through node and transistor tables that don't fit in the fast caches, so it's gated by L1/L2 cache latency and memory movement, not compute. That's why throughput tracks CPU clock and cache latency far more than core count, and why any micro-optimization that doesn't shrink the data or cut cache misses tends to do nothing at all. We are sitting squarely on the memory-latency floor.

But that floor isn't fixed forever. It moves down with hardware (faster memory, larger and lower-latency caches, higher per-core IPC) and with continued, patient squeezing of the data layout. The current #1 already does ~0.16 frames/second; reaching 1 frame/second is only about a 6× climb, well within reach of better hardware plus incremental wins. Full real-time is much further (the multiplier shown above), and software cleverness alone won't close it on one core. But because the limit is a memory-latency floor rather than a fundamental compute wall, it keeps inching down as machines improve. Reaching real NES speed may not be as far off as the raw multiplier makes it look.
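
To make the "about 6×" figure concrete, here is a small sketch that turns a target FPS back into the throughput it requires, applying the same 60.0988 fps cap as the calculator. The function names are illustrative assumptions, not part of the project's code.

```python
REAL_TIME_HC_PER_S = 42_954_552                 # half-cycles per second on a real NES
NTSC_FPS = 60.0988                              # real NES NTSC frame rate
HC_PER_FRAME = REAL_TIME_HC_PER_S / NTSC_FPS    # ≈ 714,732 hc


def throughput_for_fps(target_fps: float) -> float:
    """Throughput (hc/s) needed for a target FPS, capped at the NTSC rate."""
    if target_fps <= 0:
        raise ValueError("target FPS must be positive")
    return min(target_fps, NTSC_FPS) * HC_PER_FRAME


def climb_factor(current_fps: float, target_fps: float) -> float:
    """How many times faster the engine must get to move between two FPS values."""
    return throughput_for_fps(target_fps) / throughput_for_fps(current_fps)


if __name__ == "__main__":
    print(f"1 FPS needs {throughput_for_fps(1.0):,.0f} hc/s")
    print(f"0.16 -> 1 FPS is a {climb_factor(0.16, 1.0):.2f}x climb")
    print(f"0.16 FPS -> real time is a {climb_factor(0.16, NTSC_FPS):.0f}x climb")
```

With the ~0.16 FPS quoted above, the first climb comes out at 6.25×, which the text rounds to "about 6×".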

Notes & caveats

The #1 default is fetched live from the community leaderboard: the fastest recorded result so far (anyone with a faster CPU can take the top spot by running the benchmark). If the leaderboard can't be reached, the calculator falls back to the dev machine's ~135.9K hc/s example.
The hot loop is memory-latency-bound, so a given machine's hc/s tracks its CPU clock and boost/thermal state: the same engine reads ~135.9K at boost on a cool machine, ~10% lower on a heat-soaked day, and lower still when pinned to base clock. Treat any single number as a point on that curve.
FPS here is the simulation's own throughput expressed as frames/second, not a smooth 60 fps you could play at. Don't compare it directly with another netlist sim's "Hz", which may define a step differently.