Memory & L-cache optimization, by example記憶體與 L-cache 優化(用我們的 code 當例子)

AprVisual's switch-level engine isn't limited by how fast the CPU computes — it's limited by how fast memory feeds it. Hardware counters say the hot loop suffers ~805 data-cache misses for every instruction-cache miss: it spends its life waiting on the cache hierarchy, not the ALU. This page is a practical tour of the L1/L2/L3 cache concepts that matter and the concrete C# techniques we used to shrink and reshape data so it fits — each one shown with the real engine code and the measured result. Including the ones that didn't work, and why.

AprVisual 的開關級引擎瓶頸不在「CPU 算多快」,而在「記憶體餵多快」。硬體計數器顯示熱迴圈每 1 次指令快取 miss 就吃 ~805 次資料快取 miss:它一輩子都在等快取層級,不是等 ALU。這頁用實務角度帶你看重要的 L1/L2/L3 快取觀念,以及我們實際用來「把資料縮小、重排到能塞進快取」的 C# 技巧 —— 每一招都附上真實引擎程式碼與實測結果,包含那些沒效的,以及為什麼。

0. Why memory, not compute0. 為什麼是記憶體,不是運算

The engine walks tiny groups of connected transistors ~83,000 times a second, reading per-node records by scattered node id. We profiled it with hardware performance counters (PerfView/ETW PMU on a Ryzen 7 3700X). The hot method's miss profile:

引擎每秒走 ~83,000 次「連通電晶體小群組」,以散亂的 node id 讀取每節點記錄。我們用硬體效能計數器量它(Ryzen 7 3700X 上的 PerfView/ETW PMU)。熱方法的 miss 分布:

counter (hot method)計數器(熱方法)per 1000 instrmeaning意義
D-cache misses~50~1 every 20 cycles — dominant~每 20 cycle 一次 —— 主導
Branch mispredicts~6secondary次要
I-cache misses~0.8noise floor (the 4.6 KB loop fits L1i)雜訊底(4.6 KB 迴圈塞得進 L1i)

D-cache : I-cache ≈ 805 : 1. So code optimizations (unrolling, branchless tricks, fancy codegen) barely move the needle — the CPU is stalled on data. The entire game is getting the right bytes into the right cache level. That is what every technique below is about.

D-cache : I-cache ≈ 805 : 1。所以程式碼層級的優化(展開、無分支技巧、花俏 codegen)幾乎沒用 —— CPU 是卡在等資料。整場遊戲就是「把對的 bytes 放進對的快取層級」。下面每一招都是在做這件事。

1. L-cache in five minutes1. 五分鐘看懂 L-cache

A modern CPU core can't reach RAM directly at speed; it has a hierarchy of small, fast caches in front of it. Rough Zen 2 numbers:

現代 CPU 核心不能全速直接碰 RAM;它前面有一層層又小又快的快取。Zen 2 的粗略數字:

levelsizelatency延遲scope範圍
L1d32 KB~4–5 cycper core每核
L2512 KB~12 cycper core每核
L3~16–32 MB~40 cycshared共享
DRAMGBs~200–300 cyc

Five facts that drive every decision on this page:

驅動這頁所有決策的五個事實:

2. Technique — Structure of Arrays (split hot from cold)2. 技巧 —— SoA 結構體陣列(熱冷分離)

The classic mistake (Array of Structures, AoS): one fat struct per node with every field. When the hot loop reads just one field, the cache line it pulls in is mostly cold fields you won't use — wasted bandwidth, fewer useful records per line. SoA splits the fields into parallel arrays so a line holds only what you're touching.

經典錯誤(AoS,陣列結構體):每個節點一個塞滿所有欄位的胖結構。熱迴圈只讀其中一個欄位時,拉進來的 cache line 大半是用不到的冷欄位 —— 浪費頻寬、每條 line 裝得下的有用記錄變少。SoA 把欄位拆成平行陣列,讓一條 line 只裝你正在碰的東西。

We keep a 16-byte hot record per node, and split the rarely-read fields into their own arrays so they never evict hot lines:

我們每節點留一個 16-byte 的熱記錄,把很少讀的欄位拆成獨立陣列,讓它們永遠不會擠掉熱的 line:

// Hot: read on every BFS visit. 16 bytes -> 4 records per 64-byte cache line.
public static NodeInfo* NodeInfos;

// Cold: split out. Touched only on rare paths, so they never pollute the hot lines.
public static int* NodeConnections;  // only the (rare) floating tie-break reads this
public static int* NodeTlistGates;   // only SetNodeState's fanout writeback reads this

Same data, but the bytes the BFS hammers are now dense — more useful nodes per fetched line, fewer lines, fewer misses.

同樣的資料,但 BFS 猛敲的那些 bytes 現在很密 —— 每抓一條 line 有更多有用節點、line 更少、miss 更少。

3. Technique — pack the hot struct (32 B → 16 B)3. 技巧 —— 把熱結構打包(32 B → 16 B)

The hot record above is 16 bytes on purpose: a 64-byte cache line then holds exactly four nodes. We got there from 32 bytes with an explicit-layout union — overlapping two views of the same 12 bytes, plus packing two small counts into nibbles of one byte:

上面那個熱記錄刻意做成 16 bytes:這樣一條 64-byte cache line 剛好裝四個節點。我們用 explicit-layout union 從 32 bytes 縮下來 —— 讓同 12 bytes 有兩種疊放的 view,再把兩個小計數塞進一個 byte 的高低 nibble:

[StructLayout(LayoutKind.Explicit, Size = 16)]   // 4 per 64-byte line
internal unsafe struct NodeInfo {
    [FieldOffset(0)] public NodeFlags Flags;
    [FieldOffset(1)] public byte Inline;        // 1 = gates live inline (below); 0 = use the Tlist indices
    [FieldOffset(2)] public byte C1c2Count;
    [FieldOffset(3)] public byte GndPwr;        // TWO 4-bit counts packed into one byte (nibbles)

    // 12-byte union @ offset 4: a node uses EITHER the inline gates OR the 3 list indices.
    [FieldOffset(4)]  public fixed ushort InlinePayload[6];  // small-fanout: gates stored right here
    [FieldOffset(4)]  public int TlistC1c2s;   // high-fanout: index into the flat TransistorList
    [FieldOffset(8)]  public int TlistC1gnd;
    [FieldOffset(12)] public int TlistC1pwr;
}

Measured: 32→16 B halved the array (460 KB → 230 KB) and cut hot-method D-cache misses by ~7% (A/B, bit-exact). Net throughput +0.75%.

實測:32→16 B 把陣列砍半(460 KB → 230 KB),熱方法 D-cache miss 降 ~7%(A/B、bit-exact),吞吐量淨 +0.75%

The discipline (this is the important part): shrinking only pays when it moves the array across a cache threshold. NodeInfos at 460 KB spilled L2, so halving it helped. When we tried the same int→ushort shrink on a different array that already fit in L1d, it was pure noise — there was no threshold left to cross. Always measure; "smaller" is not automatically "faster."

紀律(這才是重點):縮小只有在「讓陣列跨過某個快取門檻」時才划算。NodeInfos 在 460 KB 會溢出 L2,所以砍半有用;但我們把同樣的 int→ushort 用在一個本來就塞得進 L1d 的陣列上,結果是純雜訊 —— 沒有門檻可跨了。一定要量;「比較小」不等於「比較快」。

4. Technique — pick the narrowest element type4. 技巧 —— 選最窄的元素型別

A scratch array of int when the values are only 0/1, or only ever node ids below 65,535, is 2–4× bigger than it needs to be — and that's 2–4× more cache lines and L1d pressure. We size every hot array to its actual value range:

當值只有 0/1、或只會是小於 65,535 的 node id,卻用 int 開 scratch 陣列,就比需要的大 2–4 倍 —— 等於多 2–4 倍的 cache line 與 L1d 壓力。我們把每個熱陣列都壓到它真正的值域:

public static ushort* TransistorList;  // node ids < 65K -> u16 not int: 697 KB -> 350 KB
public static byte*   RecalcHash;      // dirty flag, 0/1 only -> byte not int: 58 KB -> 14 KB (now fits L1d)
private static ushort* _groupBuf;      // node ids in current group -> u16: 29 KB vs 58 KB
private static byte*   _inGroup;        // per-node "in group?" flag, 0/1 -> byte: 14 KB

The two byte arrays (RecalcHash, _inGroup) are touched constantly during settling; getting them down to 14 KB each means they live resident in L1d right alongside NodeStates — the difference between a 4-cycle hit and a 40-cycle L3 trip, every access.

那兩個 byte 陣列(RecalcHash_inGroup)在收斂時被狂碰;壓到各 14 KB 表示它們能和 NodeStates 一起常駐 L1d —— 每次存取就是「4 cycle 命中」與「40 cycle 跑 L3」的差別。

Caveat that bit us: a narrow type can be slower if every access needs a widening conversion in a tight loop and the array was already cache-resident. We tried u16 for the group buffer on the Rust twin — the per-access as u16 / as i32 conversions cost more than the (already-zero) cache benefit: −2.4%. Narrow the big spilling arrays, not the tiny resident ones.

咬過我們的陷阱:窄型別若在緊迴圈裡每次存取都要做擴展轉型、而且陣列本來就常駐快取,反而會更慢。我們在 Rust 雙生引擎上把 group buffer 改 u16 —— 每次 as u16 / as i32 轉型的成本蓋過了(本來就是零的)快取好處:−2.4%。要縮的是「會溢出的大陣列」,不是「已常駐的小陣列」。

5. Technique — flatten the graph + inline the common case5. 技巧 —— 把圖攤平 + 把常見情況內聯

A netlist is a graph. The naive representation — each node holding a List<Transistor> of heap objects — is a pointer-chasing nightmare: every neighbour is a random heap address, every hop a fresh cache miss, and the prefetcher is blind. Instead we flatten the whole adjacency into one big array of 0-terminated sub-lists:

netlist 是一張圖。最天真的表示法 —— 每個節點持有一個 List<Transistor> 的堆物件 —— 是指標追逐的惡夢:每個鄰居都是隨機堆位址、每跳一次新的 cache miss、預取器完全瞎掉。我們改成把整個鄰接攤平成一個大陣列,內含以 0 結尾的子清單:

// One big array; a node's neighbours are CONTIGUOUS -> the prefetcher streams them.
var tl = new List<int> { 0 };   // index 0 = shared sentinel/terminator
int AddSubList(List<int> sub) { int i = tl.Count; tl.AddRange(sub); tl.Add(0); return i; }
// A node just stores an int index into `tl`; walking its neighbours is a linear scan.

Walking a node's channels is now a sequential scan of adjacent ushorts — exactly what hardware prefetchers love. And we go one step further: for the common small-fanout node, the gates are stored inline in the 16-byte record itself (the union in §3). That means the single cache line you already fetched to read the node's flags also contains its neighbours — zero extra misses for ~96% of nodes.

走一個節點的通道,現在是對相鄰 ushort循序掃描 —— 正是硬體預取器最愛的。而且我們再進一步:對常見的小扇出節點,gate 直接內聯存在那個 16-byte 記錄裡(§3 的 union)。意思是「你為了讀節點 flags 而已經抓進來的那一條 cache line」同時就含著它的鄰居 —— 約 96% 的節點完全不用多一次 miss。

6. Technique — unmanaged memory + wide loads6. 技巧 —— 非託管記憶體 + 寬載入

All the hot arrays are unmanaged (NativeMemory.AlignedAlloc, raw byte*/ushort*/int*). Two wins: (a) zero GC — no managed object headers between our data, no gen-2 scans walking it; (b) raw-pointer access skips array bounds-checks in the hot loop. The data is exactly the bytes we want, nothing else in the lines.

所有熱陣列都是非託管(NativeMemory.AlignedAlloc、裸 byte*/ushort*/int*)。兩個好處:(a) 零 GC —— 資料之間沒有託管物件標頭、沒有 gen-2 掃描來爬它;(b) 裸指標存取省掉熱迴圈的陣列邊界檢查。line 裡就是我們要的那些 bytes,沒有別的。

On top of that, when the data is contiguous, read it wide. The fanout walk reads two (gate,other) pairs — four ushorts — in a single 64-bit load instead of four separate ones:

在這之上,當資料確實連續時,就寬讀。扇出走訪一次讀兩組 (gate,other) —— 四個 ushort —— 用單一 64-bit 載入,而不是分四次:

// Read 4 ushorts in ONE 64-bit load (x64 little-endian). +1.2%, bit-exact.
ulong quad = Unsafe.ReadUnaligned<ulong>(p);
int c1a = (ushort)quad;          // pair 1 lo
int c2a = (ushort)(quad >> 16);   // pair 1 hi
int c1b = (ushort)(quad >> 32);   // pair 2 lo
int c2b = (ushort)(quad >> 48);   // pair 2 hi

This one is subtle: it helps the sequential enqueue walk (whose loop/load overhead is not hidden under the random-gather stalls), even though the engine overall is latency-bound. Knowing which part of a loop is latency-bound vs overhead-bound is the whole skill.

這招很微妙:它幫到的是循序的入列走訪(它的迴圈/載入開銷沒有被隨機 gather 的 stall 藏住),即使整個引擎是 latency-bound。能分辨「迴圈的哪一段是 latency-bound、哪一段是 overhead-bound」,就是全部的功力所在。

The same wide load was later applied to the group-walk's high-fanout channel list (the long, contiguous walks) for another +1.4%. Interesting twist: it ported to the Rust twin bigger than C# — +2.35%. Rust has no inline-payload split, so every group walk uses the flat transistor list and the wide load applies broadly; on C# only the ~4% high-fanout overflow case reaches the flat list (the common case is already inlined into the node record, §5). The technique is universal; how much it buys depends on how much of the traffic flows through the contiguous path.

同樣的寬讀取後來也用到 group-walk 的高 fanout channel 清單(那些長且連續的走訪)上,再賺 +1.4%。有趣的轉折:它移植到 Rust 雙生引擎比 C# 更大 —— +2.35%。Rust 沒有 inline-payload 拆分,所以每次 group walk 都走平坦電晶體清單,寬讀取適用範圍更廣;在 C# 上只有 ~4% 的高 fanout overflow 情況會走到平坦清單(常見情況已內聯進節點記錄,§5)。技巧是通用的;賺多少取決於有多少流量走連續路徑。

7. Technique — keep loop-invariants in registers7. 技巧 —— 把迴圈不變量留在暫存器

Once the data is cache-resident, the next cost is purely instruction-level — and the biggest single win of the latest batch (+3.2%) wasn't a layout change at all. The group-BFS reads several static base pointers every iteration (NodeStates, TransistorList, the group buffer, the node array…). A static field is a memory location: the JIT must re-load it each time it can't prove it unchanged, and a function call in the loop body can force exactly that. Hoist them into locals at the top of the method and the JIT keeps them in registers across the whole walk:

資料常駐 cache 之後,下一個成本純粹在指令層 —— 而最新這批最大的單一勝利(+3.2%)根本不是佈局改動。group-BFS 每一輪都讀好幾個靜態基底指標(NodeStatesTransistorList、group buffer、節點陣列…)。靜態欄位就是一個記憶體位置:JIT 只要無法證明它沒變,每次就得重讀,而迴圈體裡的函式呼叫正好會逼出這種重讀。把它們在方法開頭提升成區域變數,JIT 整趟走訪就能把它們放在暫存器:

// Before: every iteration re-reads the static fields from memory.
while (readIndex < _groupCount) {
    int nn = _groupBuf[readIndex++];
    if (NodeStates[...] != 0) { ...; AddNode(o); }   // call can clobber the JIT's assumptions
}

// After: hoist statics to locals + manual-inline the add. +3.2%, bit-exact.
byte* nodeStates = NodeStates;  ushort* groupBuf = _groupBuf;  ushort* transList = TransistorList;
int gc = _groupCount;  NodeFlags gf = _groupFlags;     // kept in registers the whole walk
while (readIndex < gc) {
    int nn = groupBuf[readIndex++];
    if (nodeStates[...] != 0) { /* dedup + push + flags-OR, inlined — no call */ }
}
_groupCount = gc;  _groupFlags = gf;                     // write back once at the end

A companion trick in the O(1) fast-path resolver: keep the flag accumulation in integer registers instead of an enum / byte. The original cast to a flags enum and accumulated into a byte — both force the JIT through narrowing/widening it can't always fold. Reading the flags as int and OR-ing shifted 0/1 lane results (flags |= anyGnd << 5) keeps everything in a register and indexes the 256-entry LUT directly — +1.2%, bit-exact.

O(1) fast-path 解析器裡有個搭配技巧:把旗標累加留在整數暫存器,而不是 enum / byte。原本轉成 flags enum、再累加進 byte —— 兩者都逼 JIT 走它不一定折得掉的窄化/寬化。把旗標當 int 讀、OR 進位移後的 0/1 結果(flags |= anyGnd << 5),全程留在暫存器、直接索引 256 格 LUT —— +1.2%,位元完全相同。

Why this is C#-shaped. The hoist is structural to an iterative BFS — there's one stack frame whose locals live across the whole walk. The Rust twin's group walk is recursive, and you can't hoist an invariant across a recursive call boundary (an iterative rewrite there measured −1.3%), so this win doesn't port. The int-register fast-path is a no-op on Rust, which already used u8 flags. Same lesson as everywhere on this project: a real win on one compiler can be nothing — or negative — on the other; measure each, never sync blindly.

為什麼這是「C# 形狀」的。提升是迭代式 BFS 的結構特性 —— 只有一個堆疊框,它的區域變數活過整趟走訪。Rust 雙生引擎的 group walk 是遞迴的,你沒辦法把不變量提升越過遞迴呼叫邊界(在那邊改寫成迭代量到 −1.3%),所以這個勝利無法移植。整數暫存器 fast-path 在 Rust 是 no-op,它本來就用 u8 旗標。和這個專案到處都一樣的教訓:在一個編譯器是真勝利,在另一個可能是零、甚至是負;各自實測,絕不盲目同步。

8. The other half of the skill — measure, and respect the wall8. 功力的另一半 —— 量測,並尊重那道牆

Memory optimization is dense with plausible ideas that are actually noise or worse. The only way to tell is to measure — interleaved-paired (alternate the two builds so thermal drift cancels), bit-exact (same whole-NES checksum), across multiple batches (a single batch can fluke). Things that sounded great and failed on this engine:

記憶體優化充滿「聽起來很合理、實際是雜訊或更糟」的點子。唯一的判別方法就是量 —— interleaved-paired(交錯兩個 build 讓溫漂抵消)、bit-exact(整機 checksum 相同)、跨多批(單批會僥倖)。在這顆引擎上聽起來很棒卻失敗的:

The deepest lesson — latency-bound ≠ throughput-bound. The 16-byte pack cut D-cache misses ~7%, but throughput only rose +0.75%. Why? The misses sit on a serialized dependency chain (read record → get neighbour ids → read their state → decide next). Time is set by the chain's latency, not the miss count. You can cut the count and barely move the clock. That's the floor: random pointer-chase over a multi-hundred-KB working set, bound by DRAM latency the silicon's memory parallelism is already maxing out. No flag, no codegen, no prefetch beats it — only changing the access pattern or the hardware does.

最深的一課 —— latency-bound ≠ throughput-bound。16-byte 打包把 D-cache miss 砍了 ~7%,但吞吐量只漲 +0.75%。為什麼?那些 miss 落在一條序列化的相依鏈上(讀記錄 → 拿到鄰居 id → 讀它們的狀態 → 決定下一步)。時間由「鏈的延遲」決定,不是「miss 數量」。你把數量砍掉,時鐘幾乎不動。這就是地板:對幾百 KB 工作集的隨機指標追逐,被 DRAM 延遲綁住,而矽晶片的記憶體平行度早已用滿。沒有任何旗標、codegen、prefetch 贏得了它 —— 只有改存取樣式或換硬體才行。

Takeaways重點整理

Next: why IR / codegen still hit the wall (the full PMU data) →下一篇:為什麼 IR / codegen 仍撞牆(完整 PMU 數據) →
Study: when does abstraction beat switch-level? →研究:抽象化何時贏過開關級? →