AprVisual — the dead end that came back: renumbering, from useless to +3.6%

Act 1 — why RCM died (and deserved to)第一幕 —— RCM 為何死(而且死得有理)

Classic RCM (reverse Cuthill-McKee) renumbers a graph's vertices to reduce bandwidth — neighbours get nearby ids, so neighbouring data shares cache lines. The bet: the BFS group walk jumps between neighbours, so co-locating them cuts cache misses. Measured (2026-05): ~1.04× at boot, ~0.98× steady-state — nothing. The diagnosis then was already right: the hot working set is small enough to be cache-resident, so there was no capacity problem for locality to fix.

經典 RCM(reverse Cuthill-McKee)重編圖的頂點以縮小頻寬 —— 鄰居拿到相近的 ID,鄰接資料共享 cache line。賭注是:BFS 群走訪在鄰居間跳躍,把它們放一起能省 cache miss。實測(2026-05):開機期 ~1.04×、穩態 ~0.98× —— 等於沒有。當時的診斷其實已經對了:熱工作集小到可以常駐快取,根本沒有「容量」問題可以讓區域性去解。

Act 2 — the diagnosis that re-opened the case第二幕 —— 重啟調查的診斷

A month later, three experiments ran back-to-back, all attacking memory behaviour, all instrumented (interleaved-paired, bit-exact gates):

一個月後,三個實驗連著跑,全都攻記憶體行為、全都有儀器(interleaved-paired、位元等價閘門):

Experiment實驗	What it proved證明了什麼	Result結果
Shrink the wave queues (int→ushort, −60KB)佇列瘦身(int→ushort,−60KB)	streamed data is free (HW prefetcher)串流資料免費(硬體 prefetcher)	−0.07%
Software prefetch of the next pops' NodeInfo軟體預取下一批彈出的 NodeInfo	there is no latency left to hide沒有延遲可藏	−0.3 ~ −1.1%
Pure co-activity locality renumber (cache-line density of the per-half-cycle hot set: −45%, near the packing ideal)純共活性區域性重編號(每半週期熱集的 cache-line 密度:−45%,逼近打包理想值)	even a near-perfect locality win moves nothing連近乎完美的區域性勝利都動不了什麼	−0.36%

Three independent confirmations of one fact: the ~20 ns/event floor is not cache-miss-bound. It is dependent-load-chain bound — each event is a short chain of "load, then load what it points to, then branch on it", and the chain's length is the cost. Locality changes where bytes live; it doesn't shorten chains.

三個獨立實驗指向同一個事實:每事件約 20 奈秒的地板不是 cache miss 造成的,而是相依載入鏈:每個事件是一小串「載入 → 載入它指到的東西 → 依它分支」,成本就是鏈的長度。區域性只改變位元組住哪裡,不會縮短鏈。

Act 3 — the flip: same tool, new objective第三幕 —— 翻盤:同一把工具,新的目標

If the bound is chain length, the only lever is deleting links. The hottest loop — SetNodeState's enqueue walk, run for every transistor endpoint of every state flip — contained exactly one deletable link: the PruneMask byte load. Each endpoint check read a random byte from a 15KB array to learn a static fact ("is this node safe to skip?"). L1-resident, but still a 4-5-cycle dependent load feeding a branch, millions of times per frame.

如果瓶頸是鏈長,唯一的槓桿就是刪掉鏈節。最熱的迴圈 —— SetNodeState 的入列走訪,每次狀態翻轉的每個電晶體端點都要跑 —— 正好有一條可刪的鏈節:PruneMask 的位元組載入。每個端點檢查都要從 15KB 陣列隨機讀一個 byte,只為了知道一個靜態事實(「這個節點能不能安全跳過?」)。它常駐 L1,但仍是一條 4-5 cycle、餵給分支的相依載入,每張 frame 數百萬次。

The insight: a static fact doesn't need a lookup — it can live in the node's id itself. Sort nodes so each prune class occupies one contiguous id block, and the lookup becomes a compare against a constant already in a register:

洞察:靜態事實不需要查表 —— 它可以住在節點 ID 本身。把節點排序成「每個剪枝類別佔一段連續 ID 區間」,查表就變成「跟暫存器裡的常數比大小」:

ids:  [3 ..A)          [A ..S)        [S ..B)          [B ..end)
      skip ∩ unsafe    skip ∩ safe    no-skip ∩ safe   no-skip ∩ unsafe

turn-off skip?    c < S            (supply ids 1,2 < 3 ≤ S ride the same compare)
turn-on  unsafe?  c < A  ||  c ≥ B (c is never supply on this path)

And this is where the "dead" renumbering machinery becomes the enabler: you need a permutation pass to create those contiguous blocks. The locality result even helps in reverse — since locality is worth ~0, sacrificing it for class-major ordering costs nothing. (Within each block, ids follow a clk-BFS signal-flow order — a free static approximation of the cascade order that keeps what little locality value exists.)

這就是「死掉的」重編號機器變成賦能者的地方:要造出連續區塊,就需要一個排列(permutation)pass。而區域性的負結果在這裡反而幫忙 —— 既然區域性值 ~0,為了類別排序犧牲它毫無代價。(每個區塊內,ID 依 clk-BFS 訊號流順序排 —— 串接順序的免費靜態近似,保住僅存的一點區域性價值。)

Safety: never trust the layout, verify it安全性:不信任排列,驗證它

A wrong boundary would silently mis-prune (worst case: enqueue VCC and group-walk the power rail). So the mask is still computed at every Reset as ground truth, and every node's bits are checked against the range-implied bits before the first settle; on mismatch the engine falls back to safe-degenerate boundaries (prunes off, supply guarded, still correct). The whole thing is automatic — a two-phase load classifies under identity ids first, then rebuilds permuted — no profile file, any ROM, any netlist. Bit-exactness is engineered, not hoped for: the power-on sweep and the checksum both iterate in original id order through the permutation, so all three golden checksums hold unchanged.

邊界錯了會默默剪錯(最壞情況:把 VCC 入列、對電源軌做群走訪)。所以每次 Reset 仍然把遮罩算出來當基準真相,在第一次 settle 之前逐節點比對「範圍推導的位元」;不符就退到安全退化邊界(剪枝關閉、supply 仍守住、依然正確)。整件事全自動 —— 兩段式載入先在原始編號下分類、再重排重建 —— 不需 profile 檔、任何 ROM、任何網表。位元等價是設計出來的,不是僥倖:開機掃描與 checksum 都透過排列以原始 ID 順序迭代,所以三條 golden checksum 原封不動成立。

The measurements量測

+3.56%

C# median, 19/20 & 11/12 paired

+2.90%

Rust median, 20/20 paired

+6.17%

C# self-captured key (Act 4), 20/20C# 自我捕捉鍵(第四幕),20/20

135.9K / 118.5K

current peaks, hc/s (best of 10, incl. the later B1 pair path)目前峰值 hc/s(10 取最佳,含其後的 B1 成對路徑)

Cross-engine validation came free: the two independent implementations (C# composes the netlist; Rust remaps a snapshot) derived byte-identical block boundaries (A=460, S=1275, B=7532) and the same golden checksums. A measured co-activity profile as the locality key reached +5.12% in early tests — that observation became Act 4.

跨引擎驗證是免費送的:兩個獨立實作(C# 自組網表;Rust 重映射 snapshot)推導出完全相同的區塊邊界(A=460、S=1275、B=7532)與同一組 golden checksum。早期測試裡,用實測共活性 profile 當區域性鍵可達 +5.12% —— 這個觀察催生了第四幕。

Act 4 — the locality key learns from the chip itself第四幕 —— 區域性鍵向晶片本人學習

Within each class block, nodes still need a secondary order. The shipped version first used a static approximation (a blind BFS from the clock over signal-flow edges). But the profile experiment said a measured order is worth more — so why load a file when the engine can measure itself? The final form: at load, the engine builds itself twice. Pass 1 builds class-major with the temporary static key (its ranges verify, so the event prunes are ON), warms the chip past reset, then runs ~32K half-cycles through a cold instrumented copy of the settle loop that records each node's true first-touch order in the real, pruned production cascade — the hot loop itself stays untouched. Pass 2 rebuilds with that captured order as the secondary key. Total load cost ~1.3 s.

每個類別區塊內,節點還需要一個次要順序。出貨初版用的是靜態近似(從時脈出發、沿訊號流邊的盲 BFS)。但 profile 實驗說實測順序更值錢 —— 既然引擎可以自己量,何必載入檔案?最終形態:載入時引擎把自己蓋兩次。第一遍以暫時靜態鍵做類別為主鍵的重建(其區間通過驗證,所以事件剪枝是開的),暖機越過 reset,再用 settle 迴圈的一份冷的、加了記錄的副本跑約 3.2 萬個半週期,記下每個節點在真實、已剪枝的生產級聯中的初次觸碰順序 —— 熱迴圈本體完全不動。第二遍用捕捉到的順序當次鍵重建。總載入成本約 1.3 秒。

It measured +6.17% (20/20 paired) over the static key, and beat the loaded offline profile head-to-head by +4.94% (16/16) — online self-profiling beat stale offline profiling, the classic dynamic-vs-static PGO story, so the file mode was deleted entirely. The capture is immune by construction to workload drift (different inputs, longer runs): it is re-derived from the actual chip on every load. The diagnostic trail was decisive: a cache-line-density instrument showed that an equally-dense order captured with prunes disabled gained nothing — the active ingredient is the order's lineage (matching the real pruned event schedule), not line packing.

實測比靜態鍵快 +6.17%(20/20 配對全勝),而且和載入式離線 profile 正面對決再贏 +4.94%(16/16) —— 線上自我剖析打敗過期的離線剖析,正是動態 vs 靜態 PGO 的經典劇本,於是檔案模式整個刪除。這個捕捉法在構造上免疫於工作負載漂移(不同輸入、更長的執行):每次載入都從晶片本人重新導出。診斷線索也很關鍵:cache line 密度儀表顯示,在剪枝關閉下捕捉到的同等密度順序毫無增益 —— 有效成分是順序的血統(吻合真實的已剪枝事件排程),不是行打包。

Hardware counters later confirmed the "order, not density" reading directly (2026-06-12 ETW PMC): the class-major relayout step cuts L1d misses from 27.3 to 13.2 per 1000 instructions, and the self-capture step then adds its +5–6% with no further miss reduction at all — the gain rides the dependent-load chain's order, not the cache.

硬體計數器後來直接證實了「順序、不是密度」這個判讀(2026-06-12 ETW PMC):類別為主鍵重排那一步把 L1d miss 從每千指令 27.3 砍到 13.2,而自我捕捉那一步再加 +5~6% 時 miss 完全沒有再降 —— 收益騎在相依載入鏈的順序上,不在快取上。

And one more twist: the identical capture ported to Rust measured −1.08% (0/20) — the starkest cross-engine sign-flip of the project (C# +6.17% vs Rust −1.08% for the same key). The Rust engine keeps the static BFS key; per-platform adoption, as always.

還有一個反轉:一模一樣的捕捉法移植到 Rust 實測 −1.08%(0/20) —— 是本專案至今最戲劇化的跨引擎正負翻轉(同一把鍵,C# +6.17%、Rust −1.08%)。Rust 引擎維持靜態 BFS 鍵;一如既往,各平台各自採用。

Old RCM vs. new renumbering — what actually differs舊 RCM vs 新重編號 —— 到底差在哪

	RCM (2026-05, dead)RCM(2026-05,死)	Class-major (2026-06, shipped)類別為主鍵(2026-06,出貨)
Objective目標函數	cache locality / graph bandwidthcache 區域性 / 圖頻寬	encode static facts as id ranges把靜態事實編進 ID 區間
Sort key排序鍵	graph adjacency (BFS levels)圖鄰接性(BFS 層)	prune class (major) + clk signal-flow (minor)剪枝類別(主)+ clk 訊號流(次)
What it tries to remove想移除的東西	cache missescache miss	a dependent LOAD (chain link) per endpoint check + a 15KB array每端點檢查一條相依載入(鏈節)+ 一條 15KB 陣列
Why that matters here為何在這裡有效	it doesn't — hot set already resident無效 —— 熱集已常駐	the engine is chain-bound; deleting links is THE lever引擎被鏈長綁死;刪鏈節正是那根槓桿
Hot-path change熱路徑變更	none (data moves, code identical)無(資料搬家,程式碼不變)	mask loads → register compares; mask array freed遮罩載入 → 暫存器比較;遮罩陣列釋放
Result結果	~1.0×	+3.56% C# / +2.90% Rust

The lesson教訓

Dead ends are indexed by objective, not by tool. "Renumbering is dead" was true for the question RCM asked (locality) and false for a question nobody had asked yet (range-encoding static facts). The general form, for any engine that is dependent-chain-bound rather than miss-bound: every static per-element fact currently answered by a lookup table is a candidate to be answered by the element's position instead — sort once at load, compare forever at run time. The prune masks were the hottest such table here; the fast-path class byte is the next candidate on the list.

死路的索引鍵是目標函數,不是工具。「重編號已死」對 RCM 問的那個問題(區域性)為真,對一個當時還沒人問的問題(把靜態事實編成區間)為假。對任何被相依鏈而非 cache miss 綁死的引擎,通式是:每一個目前靠查表回答的「靜態的每元素事實」,都是候選 —— 改用元素的「位置」來回答:載入時排序一次,執行期永遠用比較。剪枝遮罩是這裡最熱的一張表;fast-path 類別位元組是清單上的下一個候選。

Measurements: full_palette 400k hc, --pin, interleaved-paired with checksum gates; goldens 0x794A43ABDF169ADA @300k / 0x9174E19D961CB6E5 @400k / 0x6D4CCBCE2E9CD599 @1M all unchanged. C# commits 51e046d (range-prune) / 3e4a571 (self-capture) / a553d38 (file-mode removal), Rust 665acf5. AMD Ryzen 7 3700X. 2026-06-10/11.

量測:full_palette 40 萬 hc、--pin、interleaved-paired + checksum 閘門;golden 0x794A43ABDF169ADA @30 萬 / 0x9174E19D961CB6E5 @40 萬 / 0x6D4CCBCE2E9CD599 @100 萬全數不變。C# commit 51e046d(範圍剪枝)/ 3e4a571(自我捕捉)/ a553d38(移除檔案模式),Rust 665acf5。AMD Ryzen 7 3700X。2026-06-10/11。