FPS is capped at 60.0988, the real NES NTSC frame rate. You can't ask for more than the hardware produces.
How the numbers are derived
Everything follows from two facts about the NES and one definition of the unit:
| Quantity | Value | Meaning |
|---|---|---|
| hc (half-cycle) | unit of work | One toggle of the chip's master clock; the engine's fundamental step. |
| Real-time rate | 42,954,552 hc/s | How many half-cycles a real NES runs every second. |
| NES NTSC frame rate | 60.0988 fps | Frames a real NES draws per second. |
| Half-cycles per frame | ≈ 714,732 hc | = 42,954,552 ÷ 60.0988; the work in one frame. |
From there, given a throughput R (in hc/s):
- Seconds per frame = 714,732 ÷ R
- Effective FPS = R ÷ 714,732
- Slower than real time = 42,954,552 ÷ R (the "×" figure)
- % of real-time speed = R ÷ 42,954,552 × 100%
Reality check: the dev machine's C# engine reaches ~91K hc/s at boost. Plug it in and you get ~7.9 s per frame, ~0.127 FPS, and ~473× slower than real time, i.e. about 0.21% of real-time speed. That gap is the whole point of the project: see why it can't realistically be closed on one CPU core.
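The four conversions above are plain arithmetic, so they can be checked with a few lines of Python. This is a minimal sketch, not the calculator's actual code: the function name `convert` and the dictionary keys are made up for illustration, and only the two constants come from the table. Because "~91K" is itself a rounded figure, the output for exactly 91,000 hc/s can differ from the quoted numbers in the last digit.

```python
# Constants taken from the table above.
REAL_TIME_HC_PER_S = 42_954_552   # half-cycles a real NES runs per second
NES_NTSC_FPS = 60.0988            # frames a real NES draws per second
HC_PER_FRAME = REAL_TIME_HC_PER_S / NES_NTSC_FPS  # about 714,732 hc


def convert(r_hc_per_s: float) -> dict:
    """Turn a throughput R (hc/s) into the four derived figures.

    `convert` and its key names are illustrative, not the project's API.
    """
    if r_hc_per_s <= 0:
        raise ValueError("throughput must be positive")
    return {
        "seconds_per_frame": HC_PER_FRAME / r_hc_per_s,
        "effective_fps": r_hc_per_s / HC_PER_FRAME,
        "times_slower": REAL_TIME_HC_PER_S / r_hc_per_s,
        "percent_of_real_time": r_hc_per_s / REAL_TIME_HC_PER_S * 100.0,
    }


if __name__ == "__main__":
    # The dev machine's rounded boost figure from the reality check.
    for name, value in convert(91_000).items():
        print(f"{name}: {value:.3f}")
```

Any leaderboard entry can be passed to the same function, since every figure on the page is a function of R alone.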
Will this gap ever close? What the research found
The chip is the obstacle, not the code
The 2A03 and 2C02 are early-1980s NMOS designs. Unlike a modern chip — clean synchronous digital logic that maps neatly onto Boolean gates and registers — these are dense, irregular switch-level networks, where a signal propagates as conduction ripples through small clusters of transistors. The engine already runs at that network's natural minimum granularity: each event settles an average conducting group of only ~1.4 nodes. There is simply no large, regular block of computation to batch, vectorize, or hand off to a GPU.
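To make "settling a conducting group" concrete, here is a generic switch-level sketch in Python. It is not the project's C# engine. The names `conducting_group`, `settle`, `channels` and `pullups`, the node ids for `GND` and `VCC`, and the rule for floating groups are all assumptions chosen for illustration. The idea shown is only this: starting from one node, walk across every transistor whose gate is currently high, then resolve the connected nodes to a single level.

```python
from collections import deque

GND, VCC = 0, 1  # hypothetical ids for the two supply nodes


def conducting_group(start, state, channels):
    """Collect every node joined to `start` through transistors that are on.

    state:    dict node -> bool (current logic level)
    channels: dict node -> list of (gate, other_node) pairs, one per
              transistor whose channel touches that node
    """
    group, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node in (GND, VCC):
            continue  # supplies end the walk; never spread through them
        for gate, other in channels.get(node, ()):
            if state[gate] and other not in group:  # NMOS: on when gate is high
                group.add(other)
                queue.append(other)
    return group


def settle(group, state, pullups):
    """Resolve one group to a single level; return the nodes that changed."""
    if GND in group:
        level = False                      # a path to ground wins
    elif VCC in group or any(n in pullups for n in group):
        level = True
    else:
        level = any(state[n] for n in group)  # floating: keep stored charge
    changed = [n for n in group if n not in (GND, VCC) and state[n] != level]
    for n in changed:
        state[n] = level
    return changed


# One NMOS inverter: node 2 is the input, node 3 the pulled-up output.
state = {GND: False, VCC: True, 2: True, 3: True}
channels = {3: [(2, GND)]}
print(settle(conducting_group(3, state, channels), state, pullups={3}))
```

In an event-driven simulator of this kind, each changed node that acts as a gate would queue its neighbours for re-evaluation. With groups averaging only ~1.4 nodes, each event is a handful of dependent lookups, which leaves nothing wide enough to batch or vectorize.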
The rule: fully automatic, no manual abstraction
A deliberate constraint of this project is no hand-written abstraction. We don't author a faster behavioral model of the CPU by hand; the speed has to come purely from the program automatically shedding work — detecting what cannot change and skipping it — on a single CPU core. Under that rule, almost every acceleration strategy we tried lost to the plain event-driven interpreter:
| Strategy tried | Measured result |
|---|---|
| IR interpreter | 2.5% slower |
| Ahead-of-time codegen | 3–6× slower |
| GPU kernel (single instance) | ~10.7× slower |
| Bit-parallel BFS | ~156× slower |
| Per-chip multithreading | ~15× slower |
We even proved that ~98.9% of the chip's activity is reducible to logic + registers — yet that reduction still didn't beat the interpreter on one core, because the interpreter was already doing the minimum work per event. The full account is in why IR / codegen hit the wall and the study paper.
Today's wall is memory movement, not arithmetic
So the bottleneck now isn't how many operations the ALU can do — it's getting the data to the ALU. The hot loop chases pointers through node and transistor tables that don't fit in the fast caches, so it's gated by L1/L2 cache latency and memory movement, not compute. That's why throughput tracks CPU clock and cache latency far more than core count, and why any micro-optimization that doesn't shrink the data or cut cache misses tends to do nothing at all. We are sitting squarely on the memory-latency floor.
But that floor isn't fixed forever. It moves down with hardware (faster memory, larger and lower-latency caches, higher per-core IPC) and with continued, patient squeezing of the data layout. The current #1 already does ~0.16 frames/second; reaching 1 frame/second is only about a 6× climb, well within reach of better hardware plus incremental wins. Full real time is much further off (the multiplier shown above), and software cleverness alone won't close it on one core. But because the limit is a memory-latency floor rather than a fundamental compute wall, it keeps inching down as machines improve. Reaching real NES speed may not be as far off as the raw multiplier makes it look.
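The size of each climb follows directly from the ~0.16 frames/second figure. The sketch below is only arithmetic: `climb_factor` is a hypothetical helper name, and 0.16 is the approximate value quoted above, not a live leaderboard read.

```python
NES_NTSC_FPS = 60.0988  # real-hardware frame rate, from the table above


def climb_factor(current_fps: float, target_fps: float) -> float:
    """How many times faster the engine must get to reach target_fps."""
    if current_fps <= 0:
        raise ValueError("current_fps must be positive")
    return target_fps / current_fps


current = 0.16  # approximate #1 figure quoted above (an assumption, not live data)
print(f"to 1 fps:     {climb_factor(current, 1.0):.2f}x")
print(f"to real time: {climb_factor(current, NES_NTSC_FPS):.0f}x")
```

With these inputs the first factor is about 6.25× and the second is in the high 300s, which is the gap between "about 6×" and the full real-time multiplier.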
Notes & caveats
- The #1 default is fetched live from the community leaderboard: the fastest recorded result so far (anyone with a faster CPU can take the top spot by running the benchmark). If the leaderboard can't be reached, the calculator falls back to the dev machine's ~91K example.
- The hot loop is memory-latency-bound, so a given machine's hc/s tracks its CPU clock and boost/thermal state: the same engine reads ~91K at boost but ~76.5K pinned at base 3.6 GHz. Treat any single number as a point on that curve.
- FPS here is the simulation's own throughput expressed as frames/second, not a smooth 60 fps you could play at. Don't compare it directly with another netlist sim's "Hz", which may define a step differently.
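To see how far a single machine's number can move, the two readings from the note above can be put through the same conversion. This is a small illustrative sketch: the `readings` dictionary and its labels are made up here, and the two values are the rounded figures from the text.

```python
HC_PER_FRAME = 42_954_552 / 60.0988  # about 714,732 hc, from the table above

readings = {              # same engine, same machine (rounded figures from the notes)
    "boost": 91_000,
    "base 3.6 GHz": 76_500,
}

for label, hc_per_s in readings.items():
    print(f"{label:>13}: {hc_per_s / HC_PER_FRAME:.3f} fps, "
          f"{HC_PER_FRAME / hc_per_s:.1f} s/frame")

# Ratio between the two clock states on one machine.
spread = readings["boost"] / readings["base 3.6 GHz"]
print(f"boost / base = {spread:.2f}x")
```

The spread comes out near 1.19×, so two submissions from identical hardware can differ by almost a fifth depending on clock state alone.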