Qianliang's blog

[Paper Notes] Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization

NSDI'18, by Google

TL;DR: In Andromeda, each NIC queue has one busy-polling userspace thread. For every packet it does one hash lookup, grabs a pre-computed action, edits headers, and transmits, with no locks, no syscalls, and no kernel path, so common VM-to-VM traffic crosses the host in ~0.3 µs; heavyweight features are offloaded to helper threads (Coprocessors).


1 Problem / Motivation


2 Key Idea

“Compile everything once, run packets in one tight loop.”

  1. Decouple control from data
    • The control plane (Fabric-Manager, VMCs, vswitchd) computes per-flow policy once, on a flow miss.
    • The dataplane (Fast-Path engine) is a pipeline of packet-processing push/pull elements (think of each as a unit of work, e.g. routing, fan-in/out, flow monitoring/debugging).
  2. OS-bypass, busy-polling userspace dataplane (Andromeda 2.0)
    • Maps all guest pages; copies directly between guest rings and the NIC.
    • One core = one RX/TX queue; zero locks and zero syscalls in steady state.
  3. Flow Table (FT) + Action Table
    • A single hash on the 3-tuple (falling back to the 5-tuple only when needed) → dense flowIndex.
    • An Action entry holds: a header-rewrite template, the dst vPort (the forwarding decision), and bitmaps of micro-stages (additional processing, e.g. on a Coprocessor).
  4. Stage bitmaps
    • Fast-Path stages (<50 cycles): encap, VLAN, tiny firewall port-mask, conntrack bump.
    • Coprocessor stages: heavy ACLs, IPSec, shaping; handled by a helper thread or SmartNIC.

3 Control Plane Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│  ☁ Tenant API  (UI / gcloud / Terraform)                                    │
└───────────────┬─────────────────────────────────────────────────────────────┘
                │ 1  declarative intent: “Create VPC X, subnet Y, firewall Z”
                ▼
┌────────────────────  Global fabric manager  ────────────────────────────────┐
│ **Fabric-Manager (FM)**                                                     │
│ • Stores tenant/network objects in a Paxos-replicated datastore             │
│ • Runs the *compiler* that turns high-level objects into **abstract port &  │
│   flow specs** (no physical locations yet)                                  │
└───────────────┬─────────────────────────────────────────────────────────────┘
                │ 2  pushes abstract specs + host/VM inventory
                ▼
┌──────────── Cluster-scope control layer (tens of servers) ──────────────────┐
│ **VMCs – “Virtual Machine Controllers”**                                    │
│ • Cluster is sharded by VM-UUID hash; each shard has a *primary* VMC        │
│ • On startup, subscribes to “its” slice of VM hosts via FM                  │
│ • When a host switch connects, the VMC reads its *actual* OpenFlow tables,  │
│   diffs against desired state, then issues incremental updates              │
│ • All control logic (LB backend pick, NAT port allocation, hairpin rule     │
│   creation on live migration) happens here **once per flow miss**           │
│ • Replicated with leader election; a shard can migrate without host impact  │
└───────────────┬─────────────────────────────────────────────────────────────┘
                │ 3  RPC (gRPC) carrying high-level OpenFlow commands
                ▼
┌───────────── Thin stateless proxy layer – “always up” ──────────────────────┐
│ **OpenFlow Front Ends (OFEs)**                                              │
│ • One per VM host (usually on an out-of-band mgmt VM)                       │
│ • Terminates TLS from the switch, multiplexes to the correct VMC            │
│ • Keeps a persistent TCP channel so the dataplane never notices             │
└───────────────┬─────────────────────────────────────────────────────────────┘
                │ 4  OpenFlow messages (add-flow / modify-flow / pkt-in …)
                ▼
┌────────────── Host-local control agents ────────────────────────────────────┐
│ **vswitchd** (in VMM)                                                       │
│ • Owns the **Flow Table (FT) & Action Table**                               │
│ • Installs results computed by the VMC                                      │
│ • Handles the very first packet of a new 5-tuple (flow miss)                │
│ **Flow-Miss Coprocessor**                                                   │
│ • Shuttles that miss up to vswitchd without blocking the Fast Path          │
└───────────────┬─────────────────────────────────────────────────────────────┘
                │ 5  shared-memory ring to…
                ▼
┌────────────────  Userspace Fast-Path engine  ───────────────────────────────┐
│ Busy-polling, lock-free dataplane (~300 ns / pkt)                           │
└─────────────────────────────────────────────────────────────────────────────┘

Layer Comparison:

| Layer | Scope | State it keeps | Why it exists |
|-------|-------|----------------|---------------|
| FM | whole planet | tenant objects, ACLs, routes, VM inventory | Authoritative source of truth; slow-changing, strongly consistent |
| VMC | single cluster (≈10k hosts) | desired flow entries for its shard, plus lightweight conntrack | Absorbs churn (VM boots, migrations) and compiles intent into concrete OpenFlow |
| OFE | single host | almost none (just in-flight msgs) | Provides a stable TCP endpoint so switches survive VMC restarts/re-shards |
| vswitchd | per host | hot FT/Action cache, minified firewall masks | Keeps control logic off the dataplane core; still far lighter than FM/VMC |
| Fast Path | per core | nothing persistent | Raw performance; executes pre-baked “recipes” with no locks or syscalls |
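The “once per flow miss” contract between the host and the VMC can be sketched as a toy model (names are invented, and the real path goes through the OFE and OpenFlow messages, not a local function call):

```python
# Illustrative sketch: control logic runs once per flow miss, then is cached.
flow_table = {}     # 3-tuple -> flowIndex
actions = []        # flowIndex -> compiled action

def vmc_compile(tup):
    """Stand-in for the VMC: pick an LB backend, allocate a NAT port, etc."""
    return {"dst_vport": hash(tup) % 8, "tmpl": b"[encap]"}

def lookup_or_miss(tup):
    idx = flow_table.get(tup)
    if idx is None:                      # flow miss: control plane runs, once
        actions.append(vmc_compile(tup))
        idx = flow_table[tup] = len(actions) - 1
    return actions[idx]                  # later packets: one hash lookup
```

Only the first packet of a flow pays the control-plane cost; everything after it is a table hit on the host.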

4 Data Plane Architecture

                ┌────────────  TCP/IP guests  ────────────┐
   VM memory    │ virtio-net rings (shared pages)         │
                └────────────┬────────────────────────────┘
                             │ kick / MSI-X
          only I/O           ▼
  ┌ VMM (qemu-kvm helper) ──────────────┐
  │ • Owns virtio rings & injects ints  │
  │ • Very thin after A-2.0             │
  └────────────┬────────────────────────┘  SPSC shared-mem rings
               │ pkt desc (+ tiny hdr)
               ▼
  ┌────────────┴────────────┐   one per RX/TX queue pair, CPU-pinned
  │  **Fast-Path engine**   │   (DPDK-style busy poll)
  │  • hugepage mempool     │
  │  • parses L2/L3 once    │
  │  • FT hash → flowIndex  │
  │  • Action[flowIndex]:   │→ optional enqueue to
  │      − header tmpl      │  Coprocessor ring(s)
  │      − dst vPort        │
  │      − fastStageBits    │
  │      − coproBits        │
  └──────┬───────────┬──────┘
         │           │
         │           └────────── missed? → Flow-Miss Coprocessor
         │                          (shared queue, non-blocking)
         │                                  │
         ▼                                  ▼
      NIC DMA                      vswitchd (control daemon)

Data Structure

| Object | Stored in | Lifetime | Purpose |
|--------|-----------|----------|---------|
| FT hash table | hugepage RAM, per engine | per-flow | 3- or 5-tuple → dense flowIndex |
| Action array | same | per-flow | header template, dst vPort, fastStageBits, coproBits |
| Stage bitmaps | per Action | const | encode micro-ops (<50 cycles) or Coprocessor hand-off |
| SPSC rings | shared memory | long-lived | VMM ↔ Fast Path, Fast Path ↔ Copro |
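The SPSC rings in the table rely on a single-writer-per-field discipline: the producer only writes the tail, the consumer only writes the head, so no locks are needed. A minimal illustrative sketch (the real rings are lock-free arrays in shared memory, not Python objects):

```python
# Illustrative single-producer/single-consumer ring.
class SpscRing:
    def __init__(self, size: int):
        assert size & (size - 1) == 0, "power-of-two size for cheap masking"
        self.buf = [None] * size
        self.mask = size - 1
        self.head = 0   # written only by the consumer
        self.tail = 0   # written only by the producer

    def push(self, item) -> bool:
        if self.tail - self.head == len(self.buf):
            return False                      # full: caller retries, never blocks
        self.buf[self.tail & self.mask] = item
        self.tail += 1                        # publishing store
        return True

    def pop(self):
        if self.head == self.tail:
            return None                       # empty: busy-poller just re-polls
        item = self.buf[self.head & self.mask]
        self.head += 1
        return item
```

Both operations are non-blocking, which is what lets the Fast Path hand work to a Coprocessor (or receive from the VMM) without ever stalling its poll loop.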

Packet Lifetime

  1. RX DMA lands in a NIC queue → the Fast-Path core polls it.
  2. Parse outer/inner headers, form the 3-tuple → FT lookup.
    • Miss? → recompute with the 5-tuple → retry.
    • Double miss? → enqueue to the Flow-Miss Coprocessor.
  3. Hit → fetch the Action:
    • Apply the header-rewrite & encap template.
    • If fastStageBits ≠ 0, run the inlined micro-functions (VLAN_PUSH, FW_PORTMASK, CONNTRACK_INC, …).
    • If coproBits ≠ 0, push a packet pointer onto the Coprocessor ring.
  4. Transmit to the dst vPort (host-local virtio queue) or the physical NIC TX queue.
  5. TX from a VM is symmetric (virtio → Fast Path → NIC).

Per-packet cost in the common VM↔VM overlay case: ≈300 ns (one cache line).
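Steps 1–3 of the lifetime above can be sketched as a toy classifier (tuple formats and table contents are invented; the real FT is a hugepage hash table, not a dict):

```python
# Illustrative sketch: 3-tuple lookup, 5-tuple fallback, double-miss hand-off.
from collections import deque

flow_table = {("10.0.0.1", "10.0.0.2", 6): 0}          # 3-tuple -> flowIndex
flow_table[("10.0.0.1", "10.0.0.3", 6, 4242, 80)] = 1  # narrower 5-tuple entry
miss_ring = deque()                                     # to Flow-Miss Copro

def classify(src, dst, proto, sport, dport):
    idx = flow_table.get((src, dst, proto))             # one hash, common case
    if idx is None:                                     # miss: try 5-tuple
        idx = flow_table.get((src, dst, proto, sport, dport))
    if idx is None:                                     # double miss: hand off,
        miss_ring.append((src, dst, proto, sport, dport))  # don't block
    return idx
```

The 3-tuple-first order is what keeps the common case at a single hash probe; only flows that need per-connection policy pay for the 5-tuple.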


5 Middlebox Functions


6 Live-Migration Trick


7 Why It Matters