[Paper Notes] Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents
OSDI’24, by Stanford/Purdue et al.
TL;DR: Caravan is a control-plane stack that keeps in-network ML models fresh at runtime by (1) auto-labeling sampled traffic with labeling agents (heuristics/ACLs, DNNs, and foundation models), (2) watching a lightweight accuracy proxy for drift, and (3) retraining only when needed. In experiments it boosts F1 by ~30.3% over offline models and cuts GPU time by 61.3% vs continuous retraining; simple windowed triggers can save ~74.6% GPU time with negligible accuracy loss. It also runs at line rate on a Taurus FPGA testbed.
1 Goal & scope
Keep data-plane ML (on switches, SmartNICs, FPGAs) accurate under traffic changes and concept/data drift, without relying on ground-truth labels at runtime. Target use cases include intrusion detection and IoT traffic classification; learning happens online in the control plane while inference stays in the data plane.
2 Core ideas
- Labeling agent = a wrapper that asks multiple knowledge sources to label the latest window of flows, then aggregates their answers (e.g., by majority voting). Sources plug in via a simple label() function (see the sketch after this list).
- Two kinds of sources, used differently:
  - Fast but noisy (heuristics, ACLs): produce weak-supervision labels for only the high-confidence subset, e.g. an IP-based blacklist.
  - Accurate but slow/expensive (foundation models, domain experts): use sparingly and distill into rules, e.g. prompt an LLM to generate ruleset code that can be cached.
- Rule cache: when you do invoke an LLM, also ask it to emit labeling rules/heuristics you can reuse for several windows; refresh them periodically because rules go stale as traffic drifts (a sketch of this pattern appears at the end of section 5).
- Accuracy proxy: compute an F1-like metric against the generated labels (not ground truth); a relative drop signals drift and drives retraining triggers instead of time-based retraining.
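As a concrete illustration (not Caravan's actual API), here is a minimal Python sketch of a labeling agent with one fast-but-noisy source plus the F1-like accuracy proxy; the names `BlacklistSource`, `agent_label`, and `accuracy_proxy` are hypothetical, my own shorthand for the paper's label() interface and proxy metric.

```python
from collections import Counter
from sklearn.metrics import f1_score  # proxy metric computed over generated labels

# Hypothetical fast-but-noisy source: labels only flows it is confident about.
class BlacklistSource:
    def __init__(self, bad_ips):
        self.bad_ips = set(bad_ips)

    def label(self, flow):
        # Weak supervision: answer only on the high-confidence subset.
        if flow["dst_ip"] in self.bad_ips:
            return 1          # malicious
        return None           # abstain on everything else

def agent_label(flows, sources):
    """Ask every source to label the window, then majority-vote per flow."""
    labels = []
    for flow in flows:
        votes = [s.label(flow) for s in sources]
        votes = [v for v in votes if v is not None]   # drop abstentions
        labels.append(Counter(votes).most_common(1)[0][0] if votes else None)
    return labels

def accuracy_proxy(predictions, generated_labels):
    """F1-like proxy against generated (not ground-truth) labels."""
    pairs = [(p, y) for p, y in zip(predictions, generated_labels) if y is not None]
    if not pairs:
        return None
    preds, labels = zip(*pairs)
    return f1_score(labels, preds)
```

Slow, expensive sources would expose the same label() interface but be consulted far less often (see the rule-cache pattern sketched in section 5).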
3 Main workflow
- Sample recent flows and the model's predictions from the in-network device into a streaming DB, e.g. InfluxDB.
- Label the window via the agent (weak labels from fast sources; occasional LLM → rule cache).
- Validate with the accuracy proxy; if it drops or a trigger/event fires, retrain on a class-balanced subset (using iCaRL) and update the in-network weights; otherwise, keep going. (A loop sketch follows this list.)
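A rough sketch of that loop under the same assumptions, reusing `agent_label` and `accuracy_proxy` from the earlier sketch; `sample_window`, `retrain`, and `push_weights` are caller-supplied placeholders for the streaming DB, trainer, and data-plane update path, and the class-balanced sampling is only a simplified stand-in for iCaRL.

```python
import random
from collections import defaultdict

def class_balanced_subset(flows, labels, per_class=256):
    """Simplified class-balanced sampling, a stand-in for iCaRL-style rehearsal."""
    by_class = defaultdict(list)
    for flow, y in zip(flows, labels):
        if y is not None:
            by_class[y].append(flow)
    return [(f, y) for y, fs in by_class.items()
            for f in random.sample(fs, min(per_class, len(fs)))]

def control_loop(sample_window, sources, retrain, push_weights, drop=0.1):
    """Per window: sample -> label -> validate proxy -> retrain only on a drop."""
    baseline = None
    while True:
        flows, preds = sample_window()            # recent flows + in-network predictions
        labels = agent_label(flows, sources)      # weak labels from the labeling agent
        proxy = accuracy_proxy(preds, labels)
        if proxy is None:
            continue                              # no confident labels this window
        if baseline is None:
            baseline = proxy
        # A relative drop in the proxy is treated as drift -> selective retraining.
        if proxy < baseline * (1 - drop):
            retrain(class_balanced_subset(flows, labels))
            push_weights()                        # refresh the in-network model's weights
            baseline = proxy
```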
4 Key results
- +30.3% F1 vs offline models across tasks (simulation); 61.3% less GPU time than continuous retraining, thanks to selective triggers.
- Window trigger (every 5–10 windows) ≈ same accuracy, ~74.55% less GPU time.
- Line-rate end-to-end on the Taurus FPGA testbed; ~30% F1 lift over the static (offline-trained) model.
5 Techniques worth learning
- Treat existing tools (ACLs, heuristics, smaller DNNs, LLMs) as labeling sources; aggregate them instead of chasing a single oracle.
- Prefer weak supervision from fast sources over forcing full-coverage noisy labels.
- Use an LLM → rule-cache pattern to amortize LLM cost; refresh rules when the proxy suggests drift (sketched below).
- Monitor relative accuracy (proxy) and fire retraining triggers; avoid always-on retraining.
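A sketch of how the LLM → rule-cache pattern could be wired up, with heavy assumptions: `query_llm` is a placeholder for whatever LLM client is available, the rule format (a generated Python function) is my guess rather than the paper's, and exec-ing generated code is shown only for illustration.

```python
class RuleCache:
    """Cache LLM-distilled labeling rules; refresh when stale or when drift is signaled."""

    def __init__(self, query_llm, max_age_windows=10):
        self.query_llm = query_llm    # placeholder: callable, prompt -> rule source code
        self.max_age = max_age_windows
        self.rule = None              # compiled predicate: flow dict -> 1 / 0 / None
        self.age = 0

    def label(self, flow):
        if self.rule is None or self.age > self.max_age:
            self._refresh(flow)
        return self.rule(flow)

    def on_new_window(self, drift_detected):
        self.age += 1
        if drift_detected:
            self.rule = None          # force a fresh LLM query next window

    def _refresh(self, example_flow):
        # Ask the LLM for a small labeling function instead of per-flow answers,
        # so one expensive query is amortized over many windows.
        code = self.query_llm(
            "Write a Python function rule(flow) that returns 1 for malicious flows, "
            "0 for benign flows, or None if unsure. "
            f"Flows are dicts shaped like this example: {example_flow}"
        )
        namespace = {}
        exec(code, namespace)         # trusting generated code; illustration only
        self.rule = namespace["rule"]
        self.age = 0
```

Because it exposes the same label() interface as the fast sources, the cache can be dropped straight into the labeling agent's source list.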
6 One-liners to remember
Label via many imperfect sources, not one perfect one; use weak labels + cached rules; detect drift with an accuracy proxy; retrain selectively → higher F1, far less GPU, still line-rate.
7 Comments
This paper addresses an important and challenging problem: how to analyze (e.g., classify) large volumes of raw, unlabeled streaming data (e.g., network traffic) efficiently, i.e., at lower cost. One key idea is to use ML methods (a natural choice, e.g., an ordinary DNN or an LLM) to generate a reusable ruleset. As for when to retrain, instead of insisting on an extremely accurate way to monitor performance, it uses a proxy to detect degradation, a smart way to sidestep a fundamentally hard problem.