Switch-level simulation — a primer (AprVisual)

Where this sits: the levels of circuit simulation先定位:電路模擬的幾個層級

A chip can be simulated at many levels of abstraction, trading accuracy for speed:

一顆晶片可以在不同抽象層級上模擬,在「準確」與「快」之間取捨:

Level層級	Models…模擬的是…	Accuracy / speed準確 / 速度
Analog (SPICE)類比(SPICE)	voltages & currents, differential equations電壓電流、微分方程	most accurate, slowest最準、最慢
Switch-level開關級	transistors as on/off switches, nodes as 0/1電晶體當開/關開關、節點當 0/1	transistor-exact logic, slow電晶體精準邏輯、慢
Gate-level閘級	AND/OR/NOT gatesAND/OR/NOT 邏輯閘	abstract, fast抽象、快
RTL / behavioralRTL / 行為	registers, opcodes (a normal emulator)暫存器、opcode(一般模擬器)	most abstract, fastest最抽象、最快

A normal NES emulator works at the top (it executes opcodes). Switch-level is near the bottom: it doesn't know what an "instruction" is — it only knows transistors and wires, and lets the CPU's behavior emerge from the physics. That's why it's bit-for-bit faithful to the real silicon, and also why it's hundreds of times slower than real time.

一般 NES 模擬器在最上層(執行 opcode)。開關級在接近最底層:它根本不知道什麼是「指令」—— 它只知道電晶體與接線,讓 CPU 的行為從物理自然湧現。這就是為什麼它對真實矽晶逐位元忠實,也是為什麼它比實機慢數百倍。

The academic root學術源頭

The idea of treating a MOS circuit as a network of switches instead of solving its analog equations was formalized by Randal E. Bryant in the early 1980s — the switch-level model (MOSSIM, 1981; "A Switch-Level Model and Simulator for MOS Digital Systems", IEEE Transactions on Computers, 1984). The model's essence:

把 MOS 電路當成一張開關網路、而不是去解它的類比方程,這個想法由 Randal E. Bryant 在 1980 年代初形式化 —— 即開關級模型(MOSSIM, 1981;"A Switch-Level Model and Simulator for MOS Digital Systems", IEEE Transactions on Computers, 1984)。模型的精髓:

A transistor is a bidirectional switch: its gate terminal decides whether its two channel terminals are connected.
一顆電晶體是一個雙向開關:它的 gate 端決定它的兩個 channel 端是否相連。
A wire is a node with a logic state (0 / 1 / floating) and a strength (how hard it's driven — a fat bus vs a tiny gate).
一條線是一個節點,帶邏輯狀態(0 / 1 / floating)與強度(被驅動得多強 —— 粗匯流排 vs 小閘極)。
The circuit's value is found not by formulas but by relaxation: keep re-resolving the connected nodes until nothing changes (a steady state, or fixpoint).
電路的值不是用公式算出來,而是用鬆弛迭代(relaxation):反覆重新解析相連的節點,直到不再變化(穩態,或稱不動點 fixpoint)。

Switch-level sits in the sweet spot: it captures things gate-level models miss — pass transistors, bidirectional buses, charge sharing, ratioed (pull-up/pull-down) logic — without paying SPICE's differential-equation cost. Visual6502 is one concrete, beautifully small implementation of this tradition (it doesn't cite a specific paper, but the model is squarely switch-level).

開關級正好在甜蜜點:它抓得到閘級模型抓不到的東西 —— pass transistor、雙向匯流排、電荷分享、比例式(pull-up/pull-down)邏輯 —— 又不用付 SPICE 的微分方程代價。Visual6502 就是這個傳統的一個具體、極精簡的實作(它沒有明確引用某篇論文,但模型完全是開關級的)。

What is a netlist, concretely?netlist 具體是什麼?

A chip is a flat layout of doped silicon and metal. "Decapping" a 6502 and photographing the die lets you trace, by hand and tooling, every wire and every transistor. The result is a netlist — pure connectivity, no behavior. Visual6502 stores it as three lists:

一顆晶片是一片摻雜矽與金屬的平面佈局。把 6502「開蓋」拍下晶粒,就能(靠人工與工具)描出每一條線、每一顆電晶體。結果就是一張 netlist —— 純粹的連線關係,沒有行為。Visual6502 用三個列表存它:

nodenames / segdefs — the wires (nodes), and human names for the important ones.所有的線(節點),以及重要節點的人類命名。
transdefs — each transistor as (gate, c1, c2): which node is its gate, and which two nodes its channel connects.每顆電晶體 (gate, c1, c2):gate 接哪個節點、channel 連起哪兩個節點。

So the entire CPU reduces to: a set of nodes and a set of three-terminal transistors. Nothing else. The job of the simulator is to make this static connectivity come alive — to compute, cycle by cycle, what value every node settles to.

於是整顆 CPU 就化約成:一堆節點 + 一堆三端電晶體。沒有別的了。模擬器的工作,就是讓這張靜態連線活起來 —— 一個週期一個週期地算出每個節點最後穩定在什麼值。

The graph & topology view圖學與拓樸的看法

This is where graph theory makes everything click. Treat the netlist as a graph:

這裡用圖學來看,一切就通了。把 netlist 當成一張圖:

Vertices = nodes (wires). Edges = transistor channels — but a special kind: a channel edge between c1 and c2 exists only when its gate vertex is high. So the "live" graph changes shape every cycle as gates switch.
頂點 = 節點(線)。邊 = 電晶體的 channel —— 但是特殊的邊:c1↔c2 這條邊只有在它的 gate 頂點為高時才存在。所以這張「導通中」的圖每個週期都在變形,隨著 gate 開關。
A "group" = a connected component of that live subgraph. All nodes reachable through currently-on channels are electrically one wire and must share one value. Finding it is a classic graph traversal (BFS / DFS), or union-find.
一個「group」= 那張導通子圖的一個連通分量。所有透過目前導通 channel 可達的節點,在電氣上就是同一條線,必須共用一個值。找它就是經典的圖走訪(BFS / DFS),或併查集 union-find。
The circuit has feedback loops (latches, flip-flops = cross-coupled gates), so the dependency graph has cycles — it is not a DAG. You can't evaluate it in one topological pass; you must iterate to a fixpoint. Those feedback loops are strongly connected components (SCCs), and they're exactly why simulation has to "settle" over several passes.
電路有回授迴路(latch、flip-flop = 交叉耦合的閘),所以相依圖有環 —— 它不是 DAG。不能一次拓樸排序算完;必須迭代到不動點。那些回授迴路就是強連通分量(SCC),也正是模擬必須跑好幾輪才「收斂」的原因。
You don't re-evaluate the whole chip every step — only nodes whose inputs just changed. That set of pending work is a worklist (a queue); propagating changes through it is the textbook event-driven / worklist algorithm.
每一步不會重算整顆晶片 —— 只算輸入剛變動的節點。這份「待處理」的集合就是一個工作佇列(worklist / queue);讓變化沿著它傳播,就是教科書上的事件驅動 / worklist 演算法。

Breadth-first search (BFS), concretely廣度優先搜尋(BFS),具體來說

BFS is the textbook way to explore a graph outward in rings from a starting vertex: visit the start, then all its neighbours, then their neighbours, and so on. It uses a FIFO queue — enqueue the start; repeatedly dequeue a vertex and enqueue any neighbour you haven't seen yet; stop when the queue is empty. A "visited" mark on each vertex stops you from re-adding one. (Its sibling, DFS, dives deep down one path first using a stack or recursion — it visits the same set of vertices, just in a different order.)

BFS 是教科書上「從起點一圈一圈往外探索圖」的方法:先走起點,再走它所有鄰居,再走鄰居的鄰居,以此類推。它用一個 FIFO 佇列 —— 把起點入列;反覆取出一個頂點、把還沒看過的鄰居入列;佇列空了就停。每個頂點上的「已訪」標記防止重複加入。(它的兄弟 DFS 用堆疊或遞迴先一路往深處鑽 —— 走訪的頂點集合相同,只是順序不同。)

BFS shows up in two distinct places in this simulator, and it's worth not confusing them:

BFS 在這個模擬器裡出現在兩個不同的地方,別把它們搞混:

Collecting one group (inside step A): start from the node being evaluated, and BFS outward through conducting channels only. When the queue drains, you've visited exactly one connected component — the group. Visual6502 and MetalNES write this as recursion (i.e. DFS); AprVisual's C# engine rewrites it as an explicit iterative queue (BFS) so the JIT can inline the loop. Same set of nodes either way.
收集一個 group(在步驟 A 內):從被評估的節點出發,只沿著導通的 channel 做 BFS 往外擴。佇列抽乾時,你剛好走訪完一個連通分量 —— 就是那個 group。Visual6502 與 MetalNES 用遞迴寫(即 DFS);AprVisual 的 C# 引擎改寫成顯式 iterative 佇列(BFS),讓 JIT 能 inline 迴圈。兩種寫法走訪的節點集合相同。
The settle wavefront (steps C→D): the dirty-node worklist is itself a breadth-first wave across the whole chip — round 1 is the nodes touched by the clock edge, round 2 is their downstream neighbours, and so on until the wave dies out. The "current list / next list" double-buffer literally is the BFS levels.
settle 波前(步驟 C→D):髒節點 worklist 本身就是一道掃過整顆晶片的廣度優先波 —— 第 1 輪是被 clock 邊緣碰到的節點,第 2 輪是它們的下游鄰居,以此類推,直到波消散。那個「current list / next list」雙緩衝,字面上就是 BFS 的層級。

Why the visited mark is essential: the circuit graph has cycles (feedback). Without the visited set, BFS would re-enqueue the same nodes forever. The mark makes it terminate and run in O(V + E) — linear in the nodes and channels it touches. A single traversal is cheap; the simulator is slow only because it runs millions of these tiny traversals per frame.

為什麼「已訪」標記不可少:電路圖有環(回授)。沒有 visited 集合,BFS 會永遠重複把同樣的節點入列。標記讓它終止,並以 O(V + E) 執行 —— 與它碰到的節點數與 channel 數成線性。單次走訪很便宜;模擬之所以慢,只是因為一張畫面要跑數百萬次這種小走訪。

The computation, step by step運算流程,一步一步看

Your two-part intuition is exactly right — detect conduction, then queue the affected nodes and update state. Here is the full loop, with the resolve step in the middle:

你的兩段直覺完全正確 —— 偵測導通,然後把受影響的節點放進 queue 做狀態更新。以下是完整迴圈,中間多一個「解析」步驟:

ADetect conduction → collect the group偵測導通 → 收集 group

Start from a node that needs evaluating. Look at its channel transistors; a channel conducts iff its gate node is currently high. Walk outward only through conducting channels, gathering every reachable node. That set is the connected group (= a connected component of the live subgraph). Nodes connected only through an off transistor are not part of it.

從一個需要評估的節點出發。看它的 channel 電晶體;一條 channel 只有在它的 gate 節點目前為高時才導通。只沿著導通的 channel 往外走,收集所有可達的節點。這個集合就是連通 group(= 導通子圖的一個連通分量)。只透過關閉電晶體相連的節點不算在內。

BResolve the group's shared value解析 group 的共用值

The whole group is one electrical wire now, so it has one value. Resolve it by priority:

整個 group 現在是同一條電氣線,所以只有一個值。用優先序解析:

a conducting path to GND → the group is 0 (ground wins);
有導通路徑到 GND → group 為 0(接地最強);
else a path to VCC → 1;
否則有路徑到 VCC → 1;
else an external drive (set-high / set-low);
否則外部驅動(set-high / set-low);
else a pull-up / pull-down (a weak default);
否則 pull-up / pull-down(弱預設值);
else the group is floating → it holds its previous value via parasitic charge; the highest-capacitance member wins.
否則 group 是 floating → 靠寄生電荷維持先前的值;電容最大的成員勝出。

CWrite back & enqueue the neighbours寫回 & 把鄰居放入 queue

Write the resolved value to every node in the group. For any node whose value actually changed, the transistors it gates just flipped on or off — so the channel edges around them changed, and their neighbour nodes may now need re-evaluating. Push those neighbours into the worklist. (This is the "put the affected nodes into a queue" step.) Unchanged nodes propagate nothing.

把解析出的值寫進 group 裡每個節點。對任何真的變了值的節點,它所控制(gate)的電晶體剛剛開或關了 —— 所以它們周圍的 channel 邊變了,鄰居節點可能要重算。把那些鄰居推進 worklist。(這就是「把受影響節點放進 queue」那一步。)沒變的節點不會傳播任何東西。

DRepeat until quiescent (the fixpoint)重複到靜止(不動點)

Drain the queue, node by node, running A–C on each. Every change can trigger more changes; you loop until a whole pass produces no change — the network has settled. A loop limiter (~100 passes) guards against a non-converging (oscillating / metastable) region. One half clock cycle of the chip = toggle the clock node, then settle the entire network to its new fixpoint. Do that ~24 times and you've simulated one CPU cycle; ~715,000 times and you've simulated one video frame.

把佇列一個節點一個節點抽乾,每個都跑 A–C。每個變化可能引發更多變化;一直迴圈,直到某一整輪沒有任何節點改變 —— 網路收斂了。一個迴圈上限(約 100 輪)防止不收斂(振盪 / 介穩)的區域卡住。晶片的一個半時脈週期 = 把 clock 節點翻轉一次,然後把整張網路 settle 到新的不動點。做約 24 次就模擬了一個 CPU 週期;約 715,000 次就模擬了一張畫面。

A tiny worked example — one NMOS inverter一個小範例 —— 一個 NMOS 反相器

        VCC
         |
       [pull-up]        in ── gate of T
         |                     │
  out ───┴──────[ T ]─────── GND
                (channel: out ↔ GND, gated by "in")

If in = 1: T's gate is high → its channel conducts → group = {out, GND} → a path to GND exists → out resolves to 0.
If in = 0: T's gate is low → channel is off → group = {out} alone → no GND/VCC path, but the pull-up gives 1 → out resolves to 1.
That's an inverter — and notice we never wrote "NOT". It emerged from conduction + resolution. A whole 6502 is just millions of these interactions settling together.

若 in = 1:T 的 gate 為高 → channel 導通 → group = {out, GND} → 有路徑到 GND → out 解析為 0。
若 in = 0:T 的 gate 為低 → channel 關閉 → group = {out} 單獨 → 沒有 GND/VCC 路徑,但 pull-up 給 1 → out 解析為 1。
這就是一個反相器 —— 注意我們從沒寫過「NOT」。它是從導通 + 解析湧現出來的。整顆 6502 不過就是幾千個這種互動一起收斂。

Why this is the whole story為什麼這就是全貌

Everything else on this site is an engineering refinement of these four steps. The cost is brutal because step A–C runs for every changed node, every half cycle — the average group is tiny (~1.4 nodes), but there are millions of them per frame. So the entire optimization story is about doing A–C with less work, smaller data, and tighter code — not a cleverer data structure. (The single biggest lever is doing A–C fewer times: the event-count prunes prove that ~21% of those re-evaluations can't change anything, and skip them outright.)

本站其他一切,都是這四個步驟的工程精修。代價之所以殘酷,是因為 A–C 對每個變動節點、每個半週期都要跑一次 —— 平均一個 group 很小(~1.4 個節點),但一張畫面有數百萬個。所以整個優化故事,講的是「用更少工作、更小資料、更緊的程式碼去做 A–C」—— 而不是更聰明的資料結構。(最大的單一槓桿是「少做幾次 A–C」:事件數剪枝證明其中約 21% 的重算不可能改變任何東西,直接跳過。)

→ Next: Visual6502 → what MetalNES added → what AprVisual S1 added→ 接著看:Visual6502 → MetalNES 加了什麼 → AprVisual S1 加了什麼