Qianliang's blog

[Book Notes] Site Reliability Engineering: How Google Runs Production Systems

Chapter: The Production Environment at Google, from the Viewpoint of an SRE

Hardware

Machine Manager (Borg)

Storage

Storage is built from multiple layers, from lowest (closest to bare metal) to highest:

Networking

Google uses OpenFlow-based software-defined networking (SDN): instead of "smart" routing hardware, it uses less expensive "dumb" switching components, and a central controller computes the network paths.
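To make the "central controller" idea concrete, here is a minimal sketch in Python. It is not Google's controller and does not speak OpenFlow; the topology, the switch names, and the functions `shortest_path` and `build_flow_tables` are all made up for illustration. The point is only that path computation happens in one place, and each switch just stores a next hop.

```python
from collections import deque

# Hypothetical topology: switch name -> directly connected switches.
TOPOLOGY = {
    "s1": ["s2", "s3"],
    "s2": ["s1", "s4"],
    "s3": ["s1", "s4"],
    "s4": ["s2", "s3"],
}


def shortest_path(topology, src, dst):
    """Breadth-first search for a fewest-hops path from src to dst."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent links back to src, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in topology[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # dst is unreachable from src


def build_flow_tables(topology, src, dst):
    """Controller side: turn one path into per-switch next-hop entries."""
    path = shortest_path(topology, src, dst)
    if path is None:
        return {}
    # Each "dumb" switch only stores: for destination dst, forward to nxt.
    return {hop: {dst: nxt} for hop, nxt in zip(path, path[1:])}


flow_tables = build_flow_tables(TOPOLOGY, "s1", "s4")
print(flow_tables)
```

A real controller would also weigh link capacity and failures; fewest hops is used here only because it is the simplest path metric.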

Centralized Traffic Engineering: The Bandwidth Enforcer (BwE) manages the available bandwidth to maximize the average available bandwidth.

Load Balancing: Google's Global Software Load Balancer (GSLB) performs load balancing on three levels:

Lock Service

The Chubby lock service provides a filesystem-like API for maintaining locks. Common use cases include 1) master election and 2) storing BNS entries.
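A rough sketch of how master election can sit on top of a lock with a file-like path. This is a toy in-memory stand-in, not the Chubby API: `ToyLockService`, `elect_master`, and the path `/toy/service/master` are hypothetical names, and real concerns such as sessions, leases, and replication are left out.

```python
class ToyLockService:
    """In-memory stand-in for a lock service with file-like paths."""

    def __init__(self):
        self._holders = {}  # lock path -> name of the current holder

    def try_acquire(self, path, client):
        """Grant the lock if it is free or already held by `client`."""
        holder = self._holders.setdefault(path, client)
        return holder == client

    def release(self, path, client):
        """Release the lock, but only if `client` is the one holding it."""
        if self._holders.get(path) == client:
            del self._holders[path]

    def holder(self, path):
        """Return the current holder of the lock, or None if it is free."""
        return self._holders.get(path)


def elect_master(service, path, candidates):
    """Every candidate races for the same lock; the holder is the master."""
    for candidate in candidates:
        service.try_acquire(path, candidate)
    return service.holder(path)


locks = ToyLockService()
path = "/toy/service/master"  # hypothetical lock path
first = elect_master(locks, path, ["replica-0", "replica-1", "replica-2"])
print(first)

# If the master gives up the lock, the remaining replicas can race again.
locks.release(path, first)
second = elect_master(locks, path, ["replica-1", "replica-2"])
print(second)
```

Whoever holds the lock is the master, so the other replicas only need to check the lock holder instead of running their own agreement protocol.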

Monitoring and Alerting

The Borgmon monitoring system regularly retrieves metrics from monitored servers and uses them to generate alerts or to build historical overviews.
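The scrape-then-alert loop can be sketched in a few lines. This is only an illustration of the idea, not Borgmon's rule language or data model: `ToyMonitor`, the server names, and the `error_ratio` metric are assumptions made up for this example.

```python
from collections import defaultdict


class ToyMonitor:
    """Pulls metrics from targets, keeps history, and evaluates a threshold."""

    def __init__(self, targets):
        # targets: server name -> zero-argument callable returning {metric: value}
        self.targets = targets
        # (server, metric) -> list of (timestamp, value) samples, oldest first
        self.series = defaultdict(list)

    def scrape(self, timestamp):
        """Retrieve the current metrics from every monitored server."""
        for server, fetch in self.targets.items():
            for metric, value in fetch().items():
                self.series[(server, metric)].append((timestamp, value))

    def alerts(self, metric, threshold):
        """Return servers whose latest sample of `metric` exceeds `threshold`."""
        firing = []
        for (server, name), samples in self.series.items():
            if name == metric and samples[-1][1] > threshold:
                firing.append(server)
        return sorted(firing)


monitor = ToyMonitor({
    "web-1": lambda: {"error_ratio": 0.01},
    "web-2": lambda: {"error_ratio": 0.20},
})
monitor.scrape(timestamp=0)
monitor.scrape(timestamp=60)
print(monitor.alerts("error_ratio", threshold=0.05))
```

The stored samples in `monitor.series` double as the "historical overview": the same data that drives alerts can be plotted over time.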

Life of a Request

The life of a web request to some xxx.google.com service:

  1. The user's browser asks a DNS server to resolve the request address; the DNS server talks to GSLB to pick an IP address for this user.
  2. The browser uses this IP address to connect to the HTTP server (named Google Frontend, GFE).
  3. The GFE terminates the TCP connection and looks up which service is requested, then uses GSLB to find the address of the corresponding application frontend server and sends it an RPC containing the HTTP request.
  4. The frontend server analyzes the RPC request and constructs a protobuf for it. It then asks GSLB for the BNS address of a suitable (e.g., lightly loaded) backend server.
  5. The backend contacts the Bigtable storage service to fetch the requested data and returns it.
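The five steps above can be traced with a toy simulation. Everything here is a stand-in: plain dicts play the roles of GSLB and Bigtable, a dict replaces the protobuf, function calls replace RPCs, and all names and values (`FRONTEND_IPS`, `BACKEND_LOAD`, `handle`, the IP address) are hypothetical.

```python
# Hypothetical registries standing in for GSLB and Bigtable.
FRONTEND_IPS = {"xxx.google.com": ["203.0.113.10"]}  # documentation-range IP
BACKEND_LOAD = {"backend-a": 0.9, "backend-b": 0.2}  # backend name -> load
TABLE = {"hello": "world"}                           # toy stand-in for Bigtable


def dns_resolve(hostname):
    """Step 1: DNS asks the (toy) GSLB for an IP address for this user."""
    return FRONTEND_IPS[hostname][0]


def pick_backend():
    """Step 4: choose the least loaded backend, as a stand-in for GSLB."""
    return min(BACKEND_LOAD, key=BACKEND_LOAD.get)


def backend_lookup(backend, key):
    """Step 5: the backend reads from storage and returns the value."""
    return {"served_by": backend, "value": TABLE.get(key)}


def app_frontend(http_request):
    """Step 4: build a structured request (a dict here, not a protobuf)."""
    request = {"key": http_request["query"]}
    return backend_lookup(pick_backend(), request["key"])


# Step 3: the GFE's view of which application frontend serves which host.
SERVICES = {"xxx.google.com": app_frontend}


def gfe(ip, http_request):
    """Steps 2-3: accept the connection, find the service, forward the call."""
    service = SERVICES[http_request["host"]]
    return service(http_request)


def handle(hostname, query):
    """Run one request through all five steps."""
    ip = dns_resolve(hostname)
    return gfe(ip, {"host": hostname, "query": query})


response = handle("xxx.google.com", "hello")
print(response)
```

Note that the load balancer is consulted more than once along the way (for the GFE address, for the application frontend, and for the backend), which is why GSLB shows up in several steps of the list.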