AprVisual — a correct optimization that still lost: the dominant-driver bypass (P-5)

The idea想法

The engine's biggest open cost is re-evaluating a node after an input opens, only to find it unchanged because another driver still holds it (see the frontier). The idea: store, per node, the single transistor whose conduction currently determines its value — its DominantGate. Then on any turn-off of gate g, an endpoint c can be skipped if DominantGate[c] != 0 && DominantGate[c] != g: c is held by a different single driver, so g opening can't change it. One uint16 compare to delete a whole re-evaluation.

引擎最大的未解成本,是「某個輸入斷開後重算一個節點、卻發現它沒變(因為另一個驅動源還拉著它)」(見前緣一文)。想法:替每個節點存下「此刻決定它的值的那一顆電晶體」—— 它的 DominantGate。然後任何閘 g 關斷時,端點 c 若滿足 DominantGate[c] != 0 && DominantGate[c] != g 就能跳過:c 由另一個單一驅動源拉住,所以 g 開路改變不了它。一個 uint16 比對,刪掉一整次重算。

It was provably correct — and it fired它可證明正確 —— 而且真的會觸發

The soundness is subtle but holds. For a pull-down turn-off the condition can never be true — if g was conducting and a different single driver also held c, that's two conductors, so DominantGate would already be 0 (a contradiction). Where it does fire is a pass-transistor disconnect: a node with its own single supply driver, momentarily wired to a neighbour, when that wire opens — the node keeps its own driver's value, unchanged. Verified the hard way: the whole-NES checksum stayed 0x794A43ABDF169ADA (@300k) and 0x6D4CCBCE2E9CD599 (@1M), bit-for-bit. This was not an approximation — it was a real, sound prune.

健全性很微妙但成立。對下拉關斷,這個條件永遠不可能為真 —— 若 g 在導通、又有另一個單一驅動源也拉住 c,那就是兩個導通源,DominantGate 早就會是 0(矛盾)。它真正會觸發的是 pass-transistor 斷開:一個有自己單一供電驅動的節點,一時被接到鄰居,當那條線開路時 —— 節點保持自己驅動源的值、不變。用最硬的方式驗過:整機 checksum 維持 0x794A43ABDF169ADA(@30 萬)與 0x6D4CCBCE2E9CD599(@100 萬),位元完全相同。這不是近似 —— 是真實、健全的剪枝。

Then the measurement然後是量測

Interleaved-paired A/B, full_palette 300k, vs the P-4 baseline. The decisive trick was to measure the maintenance alone (keep the bookkeeping, disable the skip) and subtract:

interleaved-paired A/B,full_palette 30 萬,對 P-4 baseline。決定性的做法是單獨量「維護」(留簿記、關掉跳過)再相減:

Variant變體	vs baseline	means代表
Full P-5 (maintain + bypass)完整 P-5(維護 + 旁路)	−12.05%	= −(maintenance) + (bypass)
Maintenance only (bypass disabled)只維護(關掉旁路)	−15.41%	= −(maintenance)
⇒ the bypass itself⇒ 旁路本身	+3.4%	it works — it really prunes有效 —— 真的剪掉事件
⇒ the maintenance⇒ 維護	−15.4%	the killer — ~4.5× the benefit殺手 —— 約是收益的 4.5 倍

The bypass earns +3.4% of real pruning. But to know who the current dominant driver is, every node resolution must (a) turn the deliberately branchless driver scan (a prior measured win) into a branchy count-and-capture, and (b) write the result into a new ~29 KB DominantGate array — ~117M writes per frame, a fresh random-access stream on a memory-latency-bound loop. That maintenance is −15.4%, ~4.5× the +3.4% it enables. 0 of 16 paired rounds won.

旁路賺到 +3.4% 的真實剪枝。但要知道當下的 dominant driver 是誰,每次節點解析都得 (a) 把刻意 branchless 的驅動掃描(先前量到的勝利)改成 branchy 的 count-and-capture,並 (b) 把結果寫進一條新的 ~29 KB DominantGate array —— 每張畫面 ~117M 次寫入,在 memory-latency-bound 迴圈上多一條隨機存取 stream。那份維護是 −15.4%,約是它換來的 +3.4% 的 4.5 倍。16 輪配對 0 勝。

The lesson: static fact vs maintained state教訓:靜態事實 vs 要維護的狀態

Why P-1 → P-4 won and P-5 lost is the same one thing. The prunes that shipped guard their skip with a static fact computed once at load (a pull-up flag, a capacitance comparison) — free in the hot loop. P-5's skip needs a runtime fact that changes every cycle — "which driver holds this node right now" — and the only way to have it is to maintain it, which costs more than the skip saves. A correct optimization is necessary but not sufficient; it also has to be cheap to know it's safe to apply.

P-1 → P-4 為何贏、P-5 為何輸,是同一件事。上線的剪枝用「載入時算一次的靜態事實」(上拉旗標、電容比較)守住它的跳過 —— 熱迴圈裡免費。P-5 的跳過需要一個「每個 cycle 都在變的執行期事實」——「此刻是哪個驅動源拉住這個節點」—— 而要擁有它,只能去維護它,維護成本超過跳過省下的。一個優化「正確」是必要、但不充分;它還得「判斷它能安全套用」這件事本身夠便宜。

This is the third time the same wall appears, each from a different angle: a maintained "active driver count" (−6%), this maintained "dominant driver id" (−15.4% maintenance), and the irreducible structural waste (the frontier). All three say: knowing which driver currently holds a node is runtime state, and on this latency-bound engine, maintaining that state costs more than the work it would save. P-5's value is that it puts a precise, bit-exact, positive number (+3.4%) on the prize behind that wall — and an equally precise number (−15.4%) on the toll to reach it.

這是同一道牆第三次出現,每次角度不同:一個維護的「active driver 計數」(−6%)、這個維護的「dominant driver id」(維護 −15.4%)、以及不可削減的結構性浪費(前緣)。三者都在說:「某節點此刻被哪個驅動源拉住」是執行期狀態,而在這個 latency-bound 引擎上,維護那個狀態的成本超過它能省下的工。P-5 的價值在於:它替那道牆後的獎賞釘上了一個精確、bit-exact、正的數字(+3.4%),也替抵達它的過路費釘上了一個同樣精確的數字(−15.4%)。

Where the numbers come from數字的出處

Idea proposed in a deliberately adversarial consult with an external EDA model (it surfaces the most useful ideas only under pressure — the same consult that earlier produced the capacitance condition behind P-3). The +3.4% / −15.4% split is interleaved-paired (alternating builds, median of 12–16 rounds, checksum-guarded every round) on an AMD Ryzen 7 3700X. The bypass was reverted; the engine stays at the P-4 baseline. Companion pages: the prunes that won (P-1 → P-4), the frontier (why the last 80% won't prune), why IR / codegen hit the wall.

想法來自一次刻意對抗式的外部 EDA 模型諮詢(它只在被逼問時才吐出最有用的點子 —— 就是先前產出 P-3 電容條件的同一次諮詢)。+3.4% / −15.4% 的拆分是 interleaved-paired(輪流換 build、12–16 輪取中位數、每輪 checksum 守門),在 AMD Ryzen 7 3700X 上量測。旁路已退回;引擎維持在 P-4 baseline。延伸頁面:真的贏的剪枝(P-1 → P-4)、前緣(為何最後 80% 剪不掉)、為什麼 IR / codegen 撞牆。

Update (2026-06-10): a from-scratch reconstruction — the prize was bigger, and it still lost更新(2026-06-10):從頭重建一次 —— 獎賞其實更大,但仍然輸

The original prototype above was never committed, so the technique was rebuilt from spec on a research branch and pushed to its limit. Three findings sharpen the story.

上面那個原型從沒進過版控,所以在一條研究分支上從規格重建、並推到極限。三個發現讓故事更清楚。

1. The prize is far bigger than +3.4%: +13.76%1. 獎賞遠大於 +3.4%:+13.76%

The original stored a 16-bit dominant-gate id and only skipped when a node was held by exactly one driver. A cheaper, sounder rule replaces it with a 1-bit "pinned by my own supply" flag: a node is pinned iff its current value is held by its own gnd/pwr/pull-up regardless of any pass connection ((state 0 & ≥1 own ON gnd) ∨ (state 1 & (≥1 own ON pwr ∨ pull-up))). That pins more nodes and, measured by disabling only the skip (maintenance kept), the bypass alone is worth +13.76% — it suppresses ~40% of all node re-evaluations. The +3.4% was an artefact of the over-restrictive "exactly one" form, not the ceiling.

原版存一個 16-bit 的支配閘 id,只在節點由恰好一個驅動源拉住時才跳。一個更便宜、也更健全的規則改用 1-bit「被自己的供電拉住」旗標:節點 pinned ⇔ 它當前的值是由自身的 gnd/pwr/上拉決定、與任何 pass 連接無關((state 0 且自身 ≥1 個 ON gnd)∨(state 1 且(自身 ≥1 個 ON pwr ∨ 上拉)))。這會 pin 更多節點;用「只關掉 skip、維護照留」量出來,旁路本身值 +13.76% —— 抑制了約 40% 的節點重算。+3.4% 是「恰好一個」那個過嚴版本的產物,不是天花板。

2. The full cost decomposition2. 完整成本拆解

Term項目	vs P-4	what it is是什麼
skip benefitskip 收益	+13.76%	re-evaluations the bypass deletes旁路刪掉的重算
maintenance維護	−15.6%	computing + writing the pinned flag every resolution (the killer)每次解析算+寫 pinned(殺手)
storage tax儲存稅	−4.98% or或	where the bit lives — see below這個 bit 放哪 —— 見下

3. There is no free place to keep the bit (the i-vs-ii trade)3. 這個 bit 沒有免費的家(i 對 ii 的取捨)

Option (i) packs the pinned bit into the (1-byte) NodeStates array so its write rides the mandatory state store (free) — but then every hot conduction read must mask & StateBit to ignore it, and that mask, on the engine's hottest random gather, costs a measured −4.98% floor. Option (ii) keeps the bit in its own 1-byte array so NodeStates stays pure and no mask is needed — but now the flag is its own ~50M-writes/run random stream. Net: (i) = −8.36%, (ii) = −6.84% (the best result). They are within a few points of each other because you are trading a read-side mask tax for a write-side stream — the per-resolution runtime fact has to be paid for one way or the other.

選項 (i) 把 pinned bit 塞進(1-byte 的)NodeStates,寫入就搭上那筆必做的 state store(免費)—— 但接著每一次熱導通讀取都得遮 & StateBit 才能忽略它,而這個遮罩在引擎最熱的隨機 gather 上量到 −4.98% 的地板。選項 (ii) 把 bit 放自己的 1-byte 陣列,NodeStates 保持純淨、不需遮罩 —— 但這旗標就成了一條每次約 5000 萬筆寫入的隨機 stream。淨值:(i) = −8.36%,(ii) = −6.84%(最佳)。兩者只差幾個百分點,因為你只是把「讀側遮罩稅」換成「寫側 stream」—— 那個「每次解析的執行期事實」總得有人付帳。

A last attempt fused the maintenance into the group walk (which already scans each member's supply for the group flag, so the per-member bits look free) — but it was neutral: the walk visits far more members than the small candidate set that benefits, so capturing for all of them costs as much as the targeted re-scan it replaced.

最後一招把維護融進 group walk(它本來就為了 group flag 逐成員掃供電,所以逐成員的 bit 看似免費)—— 但結果中性:walk 走訪的成員遠多於受益的那一小撮候選,替全部成員 capture 的成本 ≈ 它取代掉的針對性重掃。

The fuller verdict. The bypass is a genuine, large win (+13.76%) — much bigger than the original number suggested. It still loses, best case −6.84%, because keeping the one runtime fact it needs — "is this node held by its own driver right now" — costs a ~7% floor (maintenance + storage) on a memory-latency-bound engine, however cheaply you encode it (16-bit id, 1 bit packed, or 1 bit in its own array; scan separate or fused). It is the same wall as the frontier and the active-driver-count attempt, now mapped with a precise prize (+13.76%) and a precise, irreducible toll (≈−20% combined). Correct, and the prize is real — but a maintained per-resolution fact can't be cheaper than the cheap work it deletes here.

更完整的判決。這個旁路是真實、且很大的勝利(+13.76%)—— 遠比原本的數字大。它仍然輸,最佳 −6.84%,因為維護它需要的那個執行期事實 ——「此刻這個節點是不是被自己的驅動源拉住」—— 在 memory-latency-bound 引擎上構成 ~7% 的地板(維護 + 儲存),無論你編碼得多便宜(16-bit id、塞 1 bit、或獨立 1 byte;掃描分開或融合)。這跟前緣、跟 active-driver 計數那次是同一道牆,只是現在連精確的獎賞(+13.76%)和精確、不可削減的過路費(合計約 −20%)都標出來了。正確、獎賞也真實 —— 但在這裡,一個「每次解析都要維護的事實」不可能比它刪掉的那些便宜工還便宜。

Reconstruction + measurements: AMD Ryzen 7 3700X, full_palette 300k/1M, interleaved-paired, every variant bit-exact (golden 0x794A43ABDF169ADA / 0x6D4CCBCE2E9CD599). Code lives on the dominant-bypass branch as a complete negative-result record.

重建 + 量測:AMD Ryzen 7 3700X、full_palette 30 萬/100 萬、interleaved-paired,每個變體都 bit-exact(golden 0x794A43ABDF169ADA / 0x6D4CCBCE2E9CD599)。程式碼留在 dominant-bypass 分支,作為完整的負結果紀錄。

Update (2026-06-11): the maintenance was deleted entirely — and it was still a tie更新(2026-06-11):維護被整個刪掉了 —— 結果仍然是平手

After the range-prune/renumber breakthroughs, this dead end got one more adversarial pass (four independent review agents + a zero-cost event-split experiment). The insight: part of the pinned predicate decomposes into static structure × live NodeStates — and NodeStates is maintained by the engine anyway. For the class "exactly one own gnd channel, clean component" (6,338 nodes), the skip can be derived at the skip site: skip a turn-off endpoint c iff NodeStates[ProbeGate[c]] != 0 — that gnd channel's gate is ON, Gnd outranks everything, removal can't change it, transients included. A read-only table, zero maintenance, zero writes, zero new branches; the ~7% floor is simply gone. It is bit-exact through every gate we have (Debug + 300k/400k/1M goldens, selftest, and a 10M-half-cycle SMB1 run — sprite-heavy, exercising the FC/OAM region full_palette never touches), and it suppresses 5.3M re-evaluations per 100k half-cycles (13.6% of pops).

在範圍剪枝/重編號的突破之後,這條死路又被對抗式地審了一輪(四個獨立評審代理 + 一個零成本的事件拆分實驗)。洞見:pinned 謂詞的一部分可分解為「靜態結構 × 即時 NodeStates」—— 而 NodeStates 引擎本來就在維護。對「恰好一個自有 gnd 通道、乾淨元件」這一類(6,338 個節點),skip 可以在現場推導:關斷端點 c 可跳 ⇔ NodeStates[ProbeGate[c]] != 0 —— 那個 gnd 通道的閘極正導通,Gnd 優先權最高,拆路改變不了它,連暫態都成立。一張唯讀表、零維護、零寫入、零新分支;~7% 的地板直接消失。它通過我們所有的門(Debug + 30萬/40萬/100萬 golden、selftest、再加 1000 萬半週期的 SMB1 —— 精靈密集,會運動到 full_palette 從不碰的 FC/OAM 區),每 10 萬半週期壓掉 530 萬次重算(13.6% of pops)。

maintenance cost (the old killer)維護成本(舊殺手)

5.3M

re-evals suppressed / 100k hc, bit-exact每 10 萬 hc 壓掉的重算,bit-exact

+0.16%

median, 10/20 paired — a TIE中位數,10/20 配對 —— 平手

And it measured neutral: +0.16% median / −0.17% trimmed, 10/20 paired (a heavier range-encoded variant measured −0.51%). The arithmetic is exact and merciless: the turn-off walk performs 9.4 endpoint tests per fired skip (50.1M tests for 5.3M skips), and the pops this class avoids are the engine's cheapest (single-gnd leaf resolutions, ~30 cycles) — so break-even demands the test cost under ~3 cycles, which is precisely what two branch-fed loads cost. Meanwhile the bisect produced the sharper half of the verdict: the valuable leg — "PullUp pins the node high" (64% of the old +13.76% mass) — diverged the checksum when derived from live state, because its proof binds to the value at the node's last resolution; mid-settle transients break it. The maintained pinned bit was never an implementation detail — it is that snapshot. The wall, finally measured to the centimetre: the profitable legs require maintenance (−7% floor); the maintenance-free leg is profit-zero.

而它量出來是中性:中位數 +0.16% / 修剪平均 −0.17%、配對 10/20(較重的範圍編碼變體 −0.51%)。算式精確而無情:關斷走訪每換到 1 次跳過要做 9.4 次端點測試(50.1M 次測試換 5.3M 次跳過),而這一類避開的 pop 是引擎裡最便宜的(單 gnd 葉節點解析,~30 cycles)—— 損益兩平要求每次測試低於 ~3 cycles,恰好就是兩個 branch-fed 載入的價錢。同時,二分法給出了判決中更鋒利的那一半:值錢的那條腿 ——「PullUp 把節點撐在高位」(佔舊 +13.76% 的 64%)—— 用即時狀態推導會讓 checksum 發散,因為它的證明綁定在節點上次解析時的值;settle 中途的暫態會打破它。維護的 pinned 位元從來不是實作細節 —— 它就是那個快照。這道牆終於量到公分級:有利潤的腿需要維護(−7% 地板);免維護的腿利潤為零。

Bonus finding from the adversarial review: the 2026-06-10 branch carried two latent soundness holes its goldens never exercised (ForceCompute-component mates break the pinned-high proof — the FC set is the 2C02 sprite/OAM bus, quiet in full_palette; and a within-member-loop staleness window). Its clean checksums were survivorship, not proof — which is why SMB1 joined the gate set. Code: branch dominant-bypass-2; event split: the [E0] instrument on branch dominant-bypass. 2026-06-11.

對抗式評審的額外收穫:2026-06-10 的分支帶著兩個 golden 從未運動到的潛伏健全性破洞(ForceCompute 元件隊友會打破 pinned-high 證明 —— FC 集合就是 2C02 的精靈/OAM 匯流排,在 full_palette 裡安靜無事;以及成員迴圈內的過期窗)。它乾淨的 checksum 是倖存者偏差、不是證明 —— 所以 SMB1 從此加入門檻組。程式碼:分支 dominant-bypass-2;事件拆分:分支 dominant-bypass 上的 [E0] 儀器。2026-06-11。