Qianliang's blog

[Paper Notes] Google Borg (Predecessor of Kubernetes)

EuroSys'15, by Google

Cluster management, inspiring K8S

further reading: Borg, Omega, and Kubernetes: Lessons learned from three container-management systems over a decade

TL; DR: cluster management for developer/admin, aimed for reliability/availability, efficiency, simplicity.

Components

job: user submission unit, consists of one or more tasks.

task: a set of Linux processes running in a container on a machine.

cell: a set of machines as a cluster unit to run jobs. Median cell size is 10k.

workload: 1) long-running services: latency sensitive, e.g., Gmail, Google Docs, web search. 2) batch job: best-effect.

Architecture

Borgmaster

Each cell has a logically centralized controller Borgmaster. Each Borgmaster consists of two processes: the main Borgmaster process and a separate scheduler:

Scheduling Algorithm

The scheduler will asign tasks in jobs to machines based on the scheduling algorithm, which has two parts: feasibility checking an scoring:

Performance metrics: task startup latency (the time from job submission to a task running), median is 25s. Package installation takes 80% time. Some optimizations: 1) assign tasks to machines that already have the necessary packages, 2) distributes packages to machines in parallel.

Borglet

The Borglet is a local Borg agent that is present on every machine in a cell (a local controller compared with Borgmaster as global controller in the scope of a cell).

The Borglet can 1) start, stop, and restart (if fail) tasks; 2) manage local resources (e.g. OS kernel settings); 3) report the state of the machien to Borgmaster and other monitoring systems, etc.

The Borgmaster control the Borglets by periodically polling each Borglet's state and send it requests.

Scalability

A single Borgmaster can: 1) manage many thousands of machines in a cell, and 2) cells have arrival rates above 10,000 task per minute.

Some techniques to improve scalability: 1) split the scheduler into a separate process that can operate in parallel; 2) to improve response time, add separate threads to talk to Borglets and respond to read-only RPCs (updates should go through master); 3) score caching: cache the scores until the properties of machine or task change; 4) do scheduling for one task per equivalence class (with identical requirements); 5) examine machines for scheduling in random order.

Availability

Some sources of task eviction are: preemption (most for non-prod tasks), machine failure, machine shutdown (most for prod task), out of resources, etc.

Borg use some techniques to mitigate the impart of task evictions:

A key design feature: let already-running tasks continue to run even if the Borgmaster/task's Borglet goes down.