[Paper Notes] Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization
NSDI'18, by Google
TL;DR: In Andromeda, each NIC queue has one busy-polling userspace thread. For every packet it does one hash lookup, grabs a pre-computed action, edits headers, and transmits (no locks, syscalls, or kernel path), so common VM-to-VM traffic crosses the host in ~0.3 µs, while heavyweight features are offloaded to helper threads (coprocessors).
1 Problem / Motivation
- Cloud VMs demand hardware-like bandwidth and tail latency (<100 µs p99), yet must share hosts with thousands of other tenants.
- The 2012-era stack (UDP tap inside the VMM → Linux kernel) hit CPU, latency, and feature ceilings at ~2 Gb/s per core and offered no in-cloud firewalls, LB, or NAT.
- Need a design that gives performance, rich middlebox functions, isolation, and easy global live migration, all while being deployable on commodity servers/NICs.
2 Key Idea
"Compile everything once, run packets in one tight loop."
- Decouple control vs. data
  - Control plane (Fabric-Manager, VMCs, vswitchd) computes per-flow policy once, on a miss.
  - Dataplane (Fast Path engine) is a set of packet-processing push/pull elements (think of each as a unit of work, e.g. routing, fan-in/out, flow monitoring/debugging).
- OS-bypass, busy-polling userspace dataplane (Andromeda 2.0)
  - Maps all guest pages, copies directly between guest rings and NIC.
  - One core = one RX/TX queue, zero locks, zero syscalls in steady state.
- Flow Table (FT) + Action Table
  - Single hash on the 3-tuple (falls back to the 5-tuple only when needed) → dense flow index.
  - Action entry holds: header-rewrite template, dst vPort (quick forwarding decision), bitmap of micro-stages (extra processing, e.g. coprocessor hand-off).
- Stage bitmaps
  - Fast-Path stages (<50 cycles): encap, VLAN, tiny firewall port-mask, conntrack bump.
  - Coprocessor stages: heavy ACLs, IPSec, shaping; handled by a helper thread/SmartNIC.
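A minimal Python sketch of the FT + Action Table idea (all names here, `FlowTable`, `Action`, and the stage-bit constants, are my own illustrations; the real engine is C++ over hugepages):

```python
from dataclasses import dataclass

# Fast-path micro-stages encoded as bits; a zero bitmap is the common case:
# rewrite headers and forward, nothing else. (Hypothetical bit assignments.)
VLAN_PUSH     = 1 << 0
FW_PORTMASK   = 1 << 1
CONNTRACK_INC = 1 << 2

@dataclass
class Action:
    header_template: bytes    # pre-computed rewrite/encap bytes
    dst_vport: int            # forwarding decision, made once at miss time
    fast_stage_bits: int = 0  # inline micro-ops (< 50 cycles each)
    copro_bits: int = 0       # heavy work handed to a coprocessor thread

class FlowTable:
    def __init__(self):
        self.ft = {}        # 3- or 5-tuple -> dense flow index
        self.actions = []   # Action array, indexed by flow index

    def install(self, key, action):
        """Control plane (vswitchd) installs the result of a flow miss."""
        self.ft[key] = len(self.actions)
        self.actions.append(action)

    def lookup(self, key):
        """One hash probe on the hot path; None signals a flow miss."""
        idx = self.ft.get(key)
        return self.actions[idx] if idx is not None else None
```

The point of the split table is that the per-packet work is a single hash probe plus an index into a pre-baked action array; all policy evaluation happened earlier, at install time.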
3 Control Plane Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│                     Tenant API (UI / gcloud / Terraform)
└───────────────┬──────────────────────────────────────────────────────────────┘
                │ 1  declarative intent: "Create VPC X, subnet Y, firewall Z"
                ▼
┌────────────────────────── Global fabric manager ─────────────────────────────┐
│ **Fabric-Manager (FM)**
│  • Stores tenant/network objects in a Paxos-replicated datastore
│  • Runs the *compiler* that turns high-level objects into **abstract port &
│    flow specs** (no physical locations yet)
└───────────────┬──────────────────────────────────────────────────────────────┘
                │ 2  pushes abstract specs + host/VM inventory
                ▼
┌────────────── Cluster-scope control layer (tens of servers) ─────────────────┐
│ **VMCs ("Virtual Machine Controllers")**
│  • Cluster is sharded by VM-UUID hash; each shard has a *primary* VMC
│  • On startup subscribes to "its" slice of VM hosts via FM
│  • When a host switch connects, the VMC reads its *actual* OpenFlow tables,
│    diffs against desired state, then issues incremental updates
│  • All control logic (LB backend pick, NAT port allocation, hairpin rule
│    creation on live migration) happens here **once per flow miss**
│  • Replicated with leader election; a shard can migrate without host impact
└───────────────┬──────────────────────────────────────────────────────────────┘
                │ 3  RPC (gRPC) carrying high-level OpenFlow commands
                ▼
┌─────────────── Thin stateless proxy layer ("always up") ─────────────────────┐
│ **OpenFlow Front Ends (OFEs)**
│  • One per VM host (usually on an out-of-band mgmt VM)
│  • Terminates TLS from the switch, multiplexes to the correct VMC
│  • Keeps a persistent TCP channel so the data plane never notices
└───────────────┬──────────────────────────────────────────────────────────────┘
                │ 4  OpenFlow messages (add-flow / modify-flow / pkt-in …)
                ▼
┌──────────────────────── Host-local control agents ───────────────────────────┐
│ **vswitchd** (in VMM)
│  • Owns the **Flow Table (FT) & Action Table**
│  • Installs results computed by the VMC
│  • Handles the very first packet of a new 5-tuple (flow miss)
│ **Flow-Miss Coprocessor**
│  • Shuttles that miss up to vswitchd without blocking the Fast Path
└───────────────┬──────────────────────────────────────────────────────────────┘
                │ 5  shared-memory ring to…
                ▼
┌─────────────────────── Userspace Fast-Path engine ───────────────────────────┐
│ Busy-polling, lock-free dataplane (~300 ns / pkt)
└──────────────────────────────────────────────────────────────────────────────┘
Layer Comparison:
| Layer | Scope | What state it keeps | Why it exists |
|---|---|---|---|
| FM | whole planet | tenant objects, ACLs, routes, VM inventory | Authoritative source of truth; slow-changing, strongly consistent |
| VMC | single cluster (~10 k hosts) | desired flow entries for its shard, plus lightweight conn-track | Absorbs churn (VM boots, migrations) and compiles intent into concrete OpenFlow |
| OFE | single host | almost none (just in-flight msgs) | Provides a stable TCP endpoint so switches survive VMC restarts/re-shards |
| vswitchd | per host | hot FT / actions cache, minified firewall masks | Keeps control logic off the dataplane core; still far lighter than FM/VMC |
| Fast Path | per core | nothing persistent | Raw performance; executes pre-baked "recipes" with no locks/syscalls |
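The VMC behavior of reading the switch's *actual* OpenFlow tables and issuing only incremental updates is, at its core, a set diff of desired vs. actual state. A minimal sketch (the `diff_flows` helper and the dict shapes are my assumptions, not the paper's API):

```python
def diff_flows(desired, actual):
    """Reconcile desired vs. actual flow state.
    Both arguments are dicts mapping a flow match to its action.
    Returns (adds, mods, dels): entries to add, modify, and delete."""
    adds = {k: v for k, v in desired.items() if k not in actual}
    mods = {k: v for k, v in desired.items()
            if k in actual and actual[k] != v}
    dels = [k for k in actual if k not in desired]
    return adds, mods, dels
```

Issuing only the diff (rather than reprogramming the whole table) is what lets a VMC absorb churn from VM boots and migrations without disturbing unrelated flows.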
4 Data Plane Architecture
┌────────────────────────── TCP/IP guests ─────────────────────────────────────┐
│ VM memory: virtio-net rings (shared pages)
└───────────────┬──────────────────────────────────────────────────────────────┘
                │ kick / MSI-X (the only I/O)
                ▼
┌─────────────────────── VMM (qemu-kvm helper) ────────────────────────────────┐
│  • Owns virtio rings & injects interrupts
│  • Very thin after Andromeda 2.0
└───────────────┬──────────────────────────────────────────────────────────────┘
                │ SPSC shared-mem rings: pkt-desc (+ tiny hdr)
                ▼
┌──────────────────────── Fast-Path engine ────────────────────────────────────┐
│ one per RX/TX queue pair, cpu-pinned (DPDK-style busy poll)
│  • hugepage mempool
│  • parses L2/L3 once
│  • FT hash → flowIndex
│  • Action[flowIndex]:                  ──▶ optional enqueue to
│      ├ header tmpl                         Coprocessor ring(s)
│      ├ dst vPort
│      ├ fastStageBits
│      └ coproBits
└───────┬──────────────────────────┬───────────────────────────────────────────┘
        │                          │ missed? → Flow-Miss Coprocessor
        │                          │ (shared queue, non-blocking)
        ▼                          ▼
     NIC DMA              vswitchd (control daemon)
Data Structures
| Object | Stored in | Lifetime | Purpose |
|---|---|---|---|
| FT hash table | hugepage RAM, per engine | per-flow | 3- or 5-tuple → dense flowIndex |
| Action array | same | per-flow | header template, dst port, fastStageBits, coproBits |
| Stage bitmaps | per action | const | encode micro-ops (<50 cycles) or Coprocessor hand-off |
| SPSC rings | shared memory | long-lived | VMM → FastPath, FastPath → Copro |
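The SPSC rings can be sketched as lock-free power-of-two ring buffers. This is an illustrative Python model (`SpscRing` is my name; the real rings live in shared hugepages and carry packet descriptors, not Python objects):

```python
class SpscRing:
    """Single-producer/single-consumer ring: each index has exactly one
    writer, so no locks or atomics-with-contention are needed."""

    def __init__(self, size):
        assert size & (size - 1) == 0, "power-of-two size for cheap masking"
        self.buf = [None] * size
        self.mask = size - 1
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def push(self, item):
        if self.tail - self.head == len(self.buf):
            return False                      # ring full; producer retries
        self.buf[self.tail & self.mask] = item
        self.tail += 1                        # single writer: no lock
        return True

    def pop(self):
        if self.head == self.tail:
            return None                       # empty; busy-poller spins
        item = self.buf[self.head & self.mask]
        self.head += 1
        return item
```

The one-writer-per-index discipline is what makes the VMM → FastPath and FastPath → Copro channels lock-free in steady state.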
Packet Lifetime
- RX DMA lands in a NIC queue → FastPath core polls it.
- Parses outer/inner hdrs, forms the 3-tuple → FT lookup.
  - miss? → recompute the 5-tuple → retry.
  - double miss? → enqueue to the Flow-Miss Coprocessor.
- Hit → fetch the Action:
  - apply the header-rewrite & encap template.
  - if `fastStageBits ≠ 0`: run inlined micro-functions (`VLAN_PUSH`, `FW_PORTMASK`, `CONNTRACK_INC`, …).
  - if `coproBits ≠ 0`: push a pointer to the Coprocessor ring.
- Transmit to the dst vPort (host-local virtio queue) or physical NIC TX.
- For TX from the VM the path is symmetric (virtio → FastPath → NIC).

Per-packet cost in the common VM-to-VM overlay case: ≈300 ns (one cacheline).
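The packet lifetime above can be condensed into a hedged Python skeleton of one fast-path iteration (dict-based packets and tables stand in for the real C++ descriptors; `process_packet` and its parameters are my own names):

```python
def process_packet(pkt, flow_table, miss_queue, stages, tx_queues):
    """One fast-path iteration for a received packet.
    pkt: dict with '3t' (3-tuple), '5t' (5-tuple), 'hdr' (header bytes).
    flow_table: dict mapping a tuple key to an action dict.
    stages: dict mapping a bit position to an inline micro-function."""
    # One hash on the 3-tuple, falling back to the 5-tuple on a miss.
    action = flow_table.get(pkt['3t']) or flow_table.get(pkt['5t'])
    if action is None:
        # Double miss: hand off to the Flow-Miss Coprocessor without
        # blocking; vswitchd will compute and install the action.
        miss_queue.append(pkt)
        return
    # Hit: apply the pre-computed header-rewrite/encap template.
    pkt['hdr'] = action['template']
    # Run inline micro-stages, lowest bit first (< 50 cycles each in C++).
    bits, pos = action['fast_bits'], 0
    while bits:
        if bits & 1:
            stages[pos](pkt)
        bits >>= 1
        pos += 1
    # Transmit: host-local virtio queue or physical NIC TX.
    tx_queues[action['dst_vport']].append(pkt)
```

Everything expensive happened earlier, at install time; the hot loop is a lookup, a template copy, a bit scan, and an enqueue.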
5 Middlebox Functions
- On a flow miss vswitchd runs the expensive stuff once (full firewall rules, VIP backend pick, NAT port alloc) and writes the result into the action entry.
- Common VM-to-VM flows → bitmap = 0 → no per-packet penalty.
- For a stateful firewall: if the rules say "always allow", the bitmap stays clear; otherwise a tiny port-range check & optional conntrack counter are enabled.
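A hedged sketch of the compile-once idea: on a flow miss, evaluate the full firewall policy a single time and bake the verdict into the action entry, so most flows pay nothing per packet. The rule/flow dict format, `compile_firewall`, and the bit constant are my assumptions, not the paper's data model:

```python
FW_PORTMASK = 1 << 1  # hypothetical stage bit for the port-range check

def compile_firewall(rules, flow):
    """Evaluate the full rule list once, at flow-miss time.
    Returns (allowed, fast_stage_bits, port_ranges) to bake into the
    flow's action entry; per-packet work is only what the bits enable."""
    for rule in rules:
        if rule['src_net'] == flow['src_net'] and rule['proto'] == flow['proto']:
            if rule['ports'] is None:
                # "Always allow": bitmap stays clear, zero per-packet cost.
                return True, 0, None
            # Otherwise enable the tiny per-packet port-range check.
            return True, FW_PORTMASK, rule['ports']
    return False, 0, None  # no match: deny, install nothing
```

The shape matters more than the details: policy evaluation runs once per flow in vswitchd, and the dataplane only ever sees the compiled result.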
6 Live-Migration Trick
- During the routing-table blackout the source host installs a hairpin flow: any packets still sent to the old location are tunneled to the destination host.
- Once the fabric converges, the hairpin is removed; zero loss to the VM.
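The hairpin trick can be modeled in a few lines; `install_hairpin`, `forward`, and the dict-based flow table are illustrative stand-ins for the real flow entries:

```python
def install_hairpin(flow_table, dst_host):
    """On the source host, after the VM has moved: redirect every
    still-installed flow for that VM toward its new host."""
    for action in flow_table.values():
        action['hairpin_to'] = dst_host

def forward(flow_table, key, delivered, tunneled):
    """Delivery decision for one packet addressed to the (possibly
    migrated) VM; delivered/tunneled are sinks standing in for real I/O."""
    act = flow_table[key]
    if act.get('hairpin_to'):
        tunneled.append((act['hairpin_to'], key))  # bounce to new host
    else:
        delivered.append(key)                      # normal local delivery
```

Because stale senders keep hitting the old host's flow table, the hairpin catches their packets until the fabric converges; then the entries are simply deleted.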
7 Why It Matters
- Performance: p99 latency < 100 µs, 10+ Gb/s per core; scales with NIC queues, not CPU sockets.
- Economics: saves hardware middleboxes; host CPU per Gb cut by ≥5× vs. the 2012 design.
- Feature velocity: new middlebox logic drops into control-plane compilers, not datapath rewrites.
- Isolation: per-flow actions + conntrack keep tenants from interfering.
- Global operations: hairpin flows plus controller partitioning allow cross-cluster live migration.