Performance ↔ real-time calculator

Type a throughput in K hc/s (thousand master half-cycles per second) and see how long one NES frame takes, the effective FPS, and how far it still is from real time. Or set a target FPS (capped at the real NES rate) and read off the throughput it would need. The default is loaded live from the #1 recorded result on the leaderboard.

Throughput (K hc/s)
the two boxes stay in sync
Target FPS (fps)

FPS is capped at 60.0988, the real NES NTSC frame rate. You can't ask for more than the hardware produces.

per frame
effective FPS
slower than real time
of real-time speed
throughput needed for that FPS

📊 See the full community leaderboard ↗

How the numbers are derived

Everything follows from two facts about the NES and one definition of the unit:

| Quantity | Value | Meaning |
|---|---|---|
| hc (half-cycle) | unit of work | One toggle of the chip's master clock; the engine's fundamental step. |
| Real-time rate | 42,954,552 hc/s | How many half-cycles a real NES runs every second. |
| NES NTSC frame rate | 60.0988 fps | Frames a real NES draws per second. |
| Half-cycles per frame | ≈ 714,732 hc | = 42,954,552 ÷ 60.0988, the work in one frame. |

From there, given a throughput R (in hc/s):
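
Written out, the conversions implied by the table look like this. It is a sketch reconstructed from the constants above, not the page's own notation: the symbols $H$ (half-cycles per frame, ≈ 714,732 hc), $R_{\text{rt}}$ (real-time rate, 42,954,552 hc/s) and $f$ (a target FPS) are names introduced here for illustration.

```latex
\begin{align*}
t_{\text{frame}} &= \frac{H}{R} && \text{seconds per frame} \\
\text{FPS}_{\text{eff}} &= \frac{R}{H} && \text{effective FPS} \\
\text{slowdown} &= \frac{R_{\text{rt}}}{R} && \text{times slower than real time} \\
\text{share} &= \frac{R}{R_{\text{rt}}} \times 100\% && \text{of real-time speed} \\
R_{\text{needed}}(f) &= f \cdot H, \quad f \le 60.0988 && \text{throughput needed for a target FPS}
\end{align*}
```

Because $H = R_{\text{rt}} / 60.0988$, the slowdown can equally be read as $60.0988 / \text{FPS}_{\text{eff}}$.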

Reality check: the dev machine's C# engine reaches ~91K hc/s at boost. Plug it in and you get ~7.9 s per frame, ~0.127 FPS, and ~473× slower than real time, i.e. about 0.21% of real-time speed. That gap is the whole point of the project: see why it can't realistically be closed on one CPU core.
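
The reality check can be reproduced with a few lines of Python. This is a minimal sketch that assumes the calculator does nothing more than divide by the table's constants; the function names `from_throughput` and `throughput_for_fps` are hypothetical and not taken from the page's own script.

```python
# Constants taken from the table above.
REAL_TIME_HC_PER_S = 42_954_552      # half-cycles a real NES runs per second
NTSC_FPS = 60.0988                   # real NES NTSC frame rate
HC_PER_FRAME = REAL_TIME_HC_PER_S / NTSC_FPS  # about 714,732 hc


def from_throughput(k_hc_per_s: float) -> dict:
    """Convert a throughput in K hc/s into the calculator's four readouts."""
    if k_hc_per_s <= 0:
        raise ValueError("throughput must be positive")
    r = k_hc_per_s * 1_000  # K hc/s -> hc/s
    return {
        "seconds_per_frame": HC_PER_FRAME / r,
        "effective_fps": r / HC_PER_FRAME,
        "times_slower": REAL_TIME_HC_PER_S / r,
        "percent_of_real_time": 100 * r / REAL_TIME_HC_PER_S,
    }


def throughput_for_fps(target_fps: float) -> float:
    """Return the K hc/s needed for a target FPS, capped at the NTSC rate."""
    fps = min(target_fps, NTSC_FPS)
    return fps * HC_PER_FRAME / 1_000


if __name__ == "__main__":
    for name, value in from_throughput(91).items():
        print(f"{name}: {value:.4g}")
    print(f"K hc/s for 1 FPS: {throughput_for_fps(1):.1f}")
```

Feeding in exactly 91 K hc/s gives figures within rounding of the ones quoted above; small differences are expected because ~91K is itself a rounded measurement.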

Will this gap ever close? What the research found

The chip is the obstacle, not the code

The 2A03 and 2C02 are early-1980s NMOS designs. Unlike a modern chip — clean synchronous digital logic that maps neatly onto Boolean gates and registers — these are dense, irregular switch-level networks, where a signal propagates as conduction ripples through small clusters of transistors. The engine already runs at that network's natural minimum granularity: each event settles an average conducting group of only ~1.4 nodes. There is simply no large, regular block of computation to batch, vectorize, or hand off to a GPU.

The rule: fully automatic, no manual abstraction

A deliberate constraint of this project is no hand-written abstraction. We don't author a faster behavioral model of the CPU by hand; the speed has to come purely from the program automatically shedding work — detecting what cannot change and skipping it — on a single CPU core. Under that rule, almost every acceleration strategy we tried lost to the plain event-driven interpreter:

| Strategy tried | Measured result |
|---|---|
| IR interpreter | −2.5% (slower) |
| Ahead-of-time codegen | 3–6× slower |
| GPU kernel (single instance) | ~10.7× slower |
| Bit-parallel BFS | ~156× slower |
| Per-chip multithreading | ~15× slower |

We even proved that ~98.9% of the chip's activity is reducible to logic + registers — yet that reduction still didn't beat the interpreter on one core, because the interpreter was already doing the minimum work per event. The full account is in why IR / codegen hit the wall and the study paper.
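
To picture what "event-driven" and "skipping what cannot change" mean, here is a deliberately tiny toy. It is not the project's engine: the real simulator resolves conducting groups of transistors at switch level, while this sketch uses a hypothetical three-gate netlist (`NETLIST`, `FANOUT` and `settle` are names invented for illustration) and only shows the scheduling idea, which is to re-evaluate a node only when one of its inputs actually changed.

```python
from collections import deque

# Hypothetical toy netlist: node -> (function, input node names).
# A gate-level stand-in for illustration only; acyclic, so no oscillation guard.
NETLIST = {
    "n1": (lambda a, b: not (a and b), ("in_a", "in_b")),  # NAND
    "n2": (lambda a: not a, ("n1",)),                      # inverter
    "n3": (lambda a, b: a or b, ("n2", "in_c")),           # OR
}

# Reverse map: which nodes must be re-evaluated when a given node changes.
FANOUT = {}
for node, (_, inputs) in NETLIST.items():
    for src in inputs:
        FANOUT.setdefault(src, []).append(node)


def settle(state, changed):
    """Propagate changes until stable; return how many evaluations were done."""
    queue = deque(changed)
    evaluations = 0
    while queue:
        src = queue.popleft()
        for node in FANOUT.get(src, []):
            func, inputs = NETLIST[node]
            new_value = func(*(state[i] for i in inputs))
            evaluations += 1
            if new_value != state[node]:  # only a real change creates new events
                state[node] = new_value
                queue.append(node)
    return evaluations


if __name__ == "__main__":
    state = {"in_a": True, "in_b": True, "in_c": False,
             "n1": False, "n2": True, "n3": True}
    state["in_c"] = True
    print(settle(state, ["in_c"]))  # n3 is re-evaluated but does not change
    state["in_a"] = False
    print(settle(state, ["in_a"]))  # the change ripples through n1, n2, n3
```

The work per input change is proportional to how far the change actually ripples, not to the size of the netlist, which is why there is so little left for a batching or vectorizing strategy to remove.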

Today's wall is memory movement, not arithmetic

So the bottleneck now isn't how many operations the ALU can do — it's getting the data to the ALU. The hot loop chases pointers through node and transistor tables that don't fit in the fast caches, so it's gated by L1/L2 cache latency and memory movement, not compute. That's why throughput tracks CPU clock and cache latency far more than core count, and why any micro-optimization that doesn't shrink the data or cut cache misses tends to do nothing at all. We are sitting squarely on the memory-latency floor.

But that floor isn't fixed forever. It moves down with hardware (faster memory, larger and lower-latency caches, higher per-core IPC) and with continued, patient squeezing of the data layout. The current #1 already does ~0.16 frames/second; reaching 1 frame/second is only about a 6× climb, well within reach of better hardware plus incremental wins. Full real time is much further (the multiplier shown above), and software cleverness alone won't close it on one core. But because the limit is a memory-latency floor rather than a fundamental compute wall, it keeps inching down as machines improve. Reaching real NES speed may not be as far off as the raw multiplier makes it look.
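
The two climbs mentioned above are simple ratios. A quick sketch, assuming the ~0.16 frames/second figure quoted for the current #1 (the live leaderboard value may differ) and the NTSC rate from the table:

```python
NTSC_FPS = 60.0988   # real NES NTSC frame rate, from the table above
current_fps = 0.16   # approximate figure quoted for the current #1

# How many times faster the engine must get to hit each milestone.
climb_to_1_fps = 1.0 / current_fps
climb_to_real_time = NTSC_FPS / current_fps

print(f"to 1 FPS:     {climb_to_1_fps:.2f}x")
print(f"to real time: {climb_to_real_time:.0f}x")
```

The first ratio is the "about 6×" in the text; the second is the much larger multiplier that separates the current leader from full real time.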

Notes & caveats