[Paper Notes] Google Borg (Predecessor of Kubernetes)
EuroSys'15, by Google
Cluster management, inspiring K8S
further reading: Borg, Omega, and Kubernetes: Lessons learned from three container-management systems over a decade
TL; DR: cluster management for developer/admin, aimed for reliability/availability, efficiency, simplicity.
Components
job: user submission unit, consists of one or more tasks.
- Jobs have constraints on machines with particular attributes (e.g., CPU, OS).
- priority: represents the priority of scheduling, higher one can preempt the lower one.
task: a set of Linux processes running in a container on a machine.
cell: a set of machines as a cluster unit to run jobs. Median cell size is 10k.
workload: 1) long-running services: latency sensitive, e.g., Gmail, Google Docs, web search. 2) batch job: best-effect.
Architecture
Borgmaster
Each cell has a logically centralized controller Borgmaster. Each Borgmaster consists of two processes: the main Borgmaster process and a separate scheduler:
- main Borgmaster: handles client RPCs; manages state machines; communicates with Borglet; etc. Each main Borgmaster has five replications with Paxos-based store on each replica's local disk.
- scheduler: scans the pending queue asynchronously where jobs are added by Borgmaster; assigns tasks to machines.
Scheduling Algorithm
The scheduler will asign tasks in jobs to machines based on the scheduling algorithm, which has two parts: feasibility checking an scoring:
- feasibility checking: finds a set of machines that 1) meet the task's constraints, and 2) have enough avaiable space;
- scoring: hybrid methods of E-PVM and "best fit". E-PVM ends up spreading load across machines, while best fit tries to fill machines as tightly as possible.
Performance metrics: task startup latency (the time from job submission to a task running), median is 25s. Package installation takes 80% time. Some optimizations: 1) assign tasks to machines that already have the necessary packages, 2) distributes packages to machines in parallel.
Borglet
The Borglet is a local Borg agent that is present on every machine in a cell (a local controller compared with Borgmaster as global controller in the scope of a cell).
The Borglet can 1) start, stop, and restart (if fail) tasks; 2) manage local resources (e.g. OS kernel settings); 3) report the state of the machien to Borgmaster and other monitoring systems, etc.
The Borgmaster control the Borglets by periodically polling each Borglet's state and send it requests.
Scalability
A single Borgmaster can: 1) manage many thousands of machines in a cell, and 2) cells have arrival rates above 10,000 task per minute.
Some techniques to improve scalability: 1) split the scheduler into a separate process that can operate in parallel; 2) to improve response time, add separate threads to talk to Borglets and respond to read-only RPCs (updates should go through master); 3) score caching: cache the scores until the properties of machine or task change; 4) do scheduling for one task per equivalence class (with identical requirements); 5) examine machines for scheduling in random order.
Availability
Some sources of task eviction are: preemption (most for non-prod
tasks), machine failure, machine shutdown (most for prod
task), out of resources, etc.
Borg use some techniques to mitigate the impart of task evictions:
- automatically reschedule evicted tasks;
- spread tasks of a job across failure domains (e.g., machines, raks, and power domains);
- limit the rate/number of task disruption from a job that can be simultaneously down during maintenance (e.g., OS/machine upgrades);
- use declarative desired-state representations (used in k8s) and idempotent mutating operations (for consistency against failure);
- etc.
A key design feature: let already-running tasks continue to run even if the Borgmaster/task's Borglet goes down.