[Paper Notes] Google Borg (Predecessor of Kubernetes)

26 May, 2025

EuroSys'15, by Google

Cluster management, inspiring K8S

further reading: Borg, Omega, and Kubernetes: Lessons learned from three container-management systems over a decade

TL; DR: cluster management for developer/admin, aimed for reliability/availability, efficiency, simplicity.

Components

job: user submission unit, consists of one or more tasks.

Jobs have constraints on machines with particular attributes (e.g., CPU, OS).
priority: represents the priority of scheduling, higher one can preempt the lower one.

task: a set of Linux processes running in a container on a machine.

cell: a set of machines as a cluster unit to run jobs. Median cell size is 10k.

workload: 1) long-running services: latency sensitive, e.g., Gmail, Google Docs, web search. 2) batch job: best-effect.

Architecture

Borgmaster

Each cell has a logically centralized controller Borgmaster. Each Borgmaster consists of two processes: the main Borgmaster process and a separate scheduler:

main Borgmaster: handles client RPCs; manages state machines; communicates with Borglet; etc. Each main Borgmaster has five replications with Paxos-based store on each replica's local disk.
scheduler: scans the pending queue asynchronously where jobs are added by Borgmaster; assigns tasks to machines.

Scheduling Algorithm

The scheduler will asign tasks in jobs to machines based on the scheduling algorithm, which has two parts: feasibility checking an scoring:

feasibility checking: finds a set of machines that 1) meet the task's constraints, and 2) have enough avaiable space;
scoring: hybrid methods of E-PVM and "best fit". E-PVM ends up spreading load across machines, while best fit tries to fill machines as tightly as possible.

Performance metrics: task startup latency (the time from job submission to a task running), median is 25s. Package installation takes 80% time. Some optimizations: 1) assign tasks to machines that already have the necessary packages, 2) distributes packages to machines in parallel.

Borglet

The Borglet is a local Borg agent that is present on every machine in a cell (a local controller compared with Borgmaster as global controller in the scope of a cell).

The Borglet can 1) start, stop, and restart (if fail) tasks; 2) manage local resources (e.g. OS kernel settings); 3) report the state of the machien to Borgmaster and other monitoring systems, etc.

The Borgmaster control the Borglets by periodically polling each Borglet's state and send it requests.

Scalability

A single Borgmaster can: 1) manage many thousands of machines in a cell, and 2) cells have arrival rates above 10,000 task per minute.

Some techniques to improve scalability: 1) split the scheduler into a separate process that can operate in parallel; 2) to improve response time, add separate threads to talk to Borglets and respond to read-only RPCs (updates should go through master); 3) score caching: cache the scores until the properties of machine or task change; 4) do scheduling for one task per equivalence class (with identical requirements); 5) examine machines for scheduling in random order.

Availability

Some sources of task eviction are: preemption (most for non-prod tasks), machine failure, machine shutdown (most for prod task), out of resources, etc.

Borg use some techniques to mitigate the impart of task evictions:

automatically reschedule evicted tasks;
spread tasks of a job across failure domains (e.g., machines, raks, and power domains);
limit the rate/number of task disruption from a job that can be simultaneously down during maintenance (e.g., OS/machine upgrades);
use declarative desired-state representations (used in k8s) and idempotent mutating operations (for consistency against failure);
etc.

A key design feature: let already-running tasks continue to run even if the Borgmaster/task's Borglet goes down.