[Book Notes] Site Reliability Engineering: How Google Runs Production Systems
Chapter: The Production Environment at Google, from the Viewpoint of an SRE
Hardware
- (almost) homogeneous
- machine: hardware
- server: software
- hierarchy: machine -> rack -> row -> cluster -> datacenter -> campus
Machine Manager (Borg)
- Users submit jobs, each of which may consist of multiple tasks.
- Borg allocates the tasks to machines and keeps monitoring them.
- Borg maintains the Borg Naming Service (BNS) for task discovery and communication.
- BNS entry format:
  /bns/<cluster>/<user>/<job name>/<task number> -> <ip address>:<port>
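To make the entry format concrete, here is a minimal Python sketch of parsing a BNS-style path and resolving it to an address. The names `BnsPath`, `parse_bns_path`, `resolve`, the `registry` table, and the sample values are all hypothetical illustrations; this is not Borg's actual API.

```python
from typing import Dict, NamedTuple, Tuple


class BnsPath(NamedTuple):
    """The four components of /bns/<cluster>/<user>/<job name>/<task number>."""
    cluster: str
    user: str
    job: str
    task: int


def parse_bns_path(path: str) -> BnsPath:
    """Split a BNS-style path into its components (hypothetical helper)."""
    parts = path.strip("/").split("/")
    if len(parts) != 5 or parts[0] != "bns":
        raise ValueError(f"not a BNS path: {path!r}")
    _, cluster, user, job, task = parts
    return BnsPath(cluster, user, job, int(task))


# Hypothetical in-memory table standing in for the naming service:
# BNS path -> (ip address, port).
registry: Dict[BnsPath, Tuple[str, int]] = {
    BnsPath("cluster-a", "alice", "webserver", 0): ("10.0.0.5", 8080),
}


def resolve(path: str) -> Tuple[str, int]:
    """Look up the address a task is currently reachable at."""
    return registry[parse_bns_path(path)]
```

A client would call `resolve("/bns/cluster-a/alice/webserver/0")` instead of hard-coding an address, so a task can be rescheduled onto another machine without clients noticing.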
Storage
The storage consists of multiple layers, from lower (close to bare-metal) to higher:
- D (for disk): the fileserver running on top of the disk/flash storage of the machines in a cluster.
- Colossus (successor of GFS): the cluster-wide filesystem offering usual filesystem semantics, replication and encryption.
- Bigtable: a NoSQL database system that is essentially a sparse, distributed, persistent, multidimensional sorted map indexed by row key, column key, and timestamp.
- consistency model: supports eventually consistent, cross-datacenter replication.
- Spanner: provides an SQL-like interface for users who require real consistency across the world.
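The Bigtable data model in the list above (a sparse sorted map indexed by row key, column key, and timestamp) can be sketched with a small single-process Python class. `SparseTable` and its methods are hypothetical names chosen for illustration; the sketch ignores distribution, persistence, and replication entirely.

```python
from typing import Dict, Iterator, List, Optional, Tuple


class SparseTable:
    """Toy map: (row key, column key, timestamp) -> value."""

    def __init__(self) -> None:
        # Only cells that were written take up space (the "sparse" part).
        self._cells: Dict[Tuple[str, str], List[Tuple[int, bytes]]] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        # Keep the newest version first.
        versions.sort(key=lambda v: v[0], reverse=True)

    def get(self, row: str, column: str, at: Optional[int] = None) -> Optional[bytes]:
        """Return the newest value, or the newest value with timestamp <= at."""
        for ts, value in self._cells.get((row, column), []):
            if at is None or ts <= at:
                return value
        return None

    def scan(self, start_row: str, end_row: str) -> Iterator[Tuple[str, str, bytes]]:
        """Yield the latest cells for rows in [start_row, end_row), in sorted order."""
        keys = sorted(k for k in self._cells if start_row <= k[0] < end_row)
        for row, column in keys:
            yield row, column, self._cells[(row, column)][0][1]
```

Because rows are kept in sorted order, range scans over neighboring row keys are cheap, which is why row-key design matters in this kind of store.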
Networking
Google uses OpenFlow-based SDN: less expensive "dumb" switching components with a central controller that computes the network paths, instead of "smart" routing hardware.
Centralized Traffic Engineering: The Bandwidth Enforcer (BwE) manages the available bandwidth to maximize the average available bandwidth.
Load Balancing: Google's Global Software Load Balancer (GSLB) performs load balancing on three levels:
- geographic load balancing for DNS requests
- load balancing at a user service level (e.g. YouTube)
- load balancing at the RPC level
Lock Service
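The Chubby lock service provides a filesystem-like API for maintaining locks. Common use cases include: 1) master election, 2) storing small pieces of data that must be consistent, such as the BNS mapping from paths to <ip address>:<port> pairs.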
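As a rough illustration of master election on top of a lock service, the following Python sketch uses an in-memory stand-in. `LockService`, its methods, and the lock path are hypothetical and are not Chubby's real API; a real lock service also handles sessions, leases, and failures, which this sketch omits.

```python
from typing import Dict, Optional


class LockService:
    """In-memory stand-in for a lock service with path-named locks."""

    def __init__(self) -> None:
        self._holders: Dict[str, str] = {}

    def try_acquire(self, path: str, owner: str) -> bool:
        """Take the lock if it is free; return True if `owner` holds it afterwards."""
        return self._holders.setdefault(path, owner) == owner

    def release(self, path: str, owner: str) -> None:
        """Release the lock, but only if `owner` is the current holder."""
        if self._holders.get(path) == owner:
            del self._holders[path]

    def holder(self, path: str) -> Optional[str]:
        return self._holders.get(path)


# Hypothetical lock path; every replica of a service tries to grab the same one.
MASTER_LOCK = "/locks/my-service/master"


def elect(service: LockService, candidate: str) -> bool:
    """A replica becomes master only if it wins the lock."""
    return service.try_acquire(MASTER_LOCK, candidate)
```

All replicas call `elect`; exactly one gets `True` and acts as master, and the others can read `holder(MASTER_LOCK)` to find out who won.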
Monitoring and Alerting
The Borgmon monitoring system regularly retrieves ("scrapes") metrics from monitored servers and uses them to generate alerts or to store historic overviews.
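The scrape-then-alert loop can be sketched in a few lines of Python. The `Monitor` class, the metric name, and the threshold are assumptions made for illustration only; they do not reflect Borgmon's actual rule language or interfaces.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A target is any callable returning its current metrics (hypothetical interface).
Target = Callable[[], Dict[str, float]]


class Monitor:
    def __init__(self, targets: Dict[str, Target]) -> None:
        self.targets = targets
        # (target name, metric name) -> list of (timestamp, value) samples.
        self.history: Dict[Tuple[str, str], List[Tuple[int, float]]] = defaultdict(list)

    def scrape(self, now: int) -> None:
        """Pull metrics from every target and append them to the history."""
        for name, fetch in self.targets.items():
            for metric, value in fetch().items():
                self.history[(name, metric)].append((now, value))

    def alerts(self, metric: str, threshold: float) -> List[str]:
        """Return targets whose most recent sample of `metric` exceeds `threshold`."""
        firing = [
            name
            for (name, m), samples in self.history.items()
            if m == metric and samples[-1][1] > threshold
        ]
        return sorted(firing)
```

The stored `history` is what serves the "historic overview" use case, while `alerts` evaluates a simple rule against the latest samples.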
Life of a Request
The life of a web request to some xxx.google.com service:
- The user's browser resolves the service's hostname through its DNS server, which consults GSLB to pick an IP address for this user.
- The browser uses this IP to connect to the HTTP server (named Google Frontend, GFE).
- GFE terminates the TCP connection and determines which service is requested, then uses GSLB to find the address of a corresponding application frontend server and sends it an RPC containing the HTTP request.
- The frontend server analyzes the request and constructs a protobuf for it. It then asks GSLB for the BNS address of a suitable (e.g., unloaded) backend server and sends the protobuf to that backend.
- The backend contacts the Bigtable storage service to get the requested data and returns it to the frontend server.
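The hops above can be traced with a toy, single-process Python sketch. Every name here (`browser`, `gfe`, `app_frontend`, `backend`, `pick_least_loaded`, the sample data and load figures) is a hypothetical stand-in; real components communicate over the network via RPC, and a plain dict stands in for the protobuf.

```python
from typing import Callable, Dict

# Stand-in for the Bigtable storage service.
STORAGE: Dict[str, str] = {"hamlet": "5 matches"}


def backend(message: Dict[str, str]) -> Dict[str, str]:
    """Backend server: fetch the requested data from storage."""
    return {"result": STORAGE.get(message["query"], "")}


def pick_least_loaded(load: Dict[str, int]) -> str:
    """Stand-in for GSLB: choose the server with the lowest reported load."""
    return min(load, key=load.get)


BACKENDS: Dict[str, Callable[[Dict[str, str]], Dict[str, str]]] = {
    "backend-0": backend,
    "backend-1": backend,
}
BACKEND_LOAD: Dict[str, int] = {"backend-0": 7, "backend-1": 2}


def app_frontend(http_request: str) -> str:
    """Application frontend: parse the request, build a message, call a backend."""
    query = http_request.split("q=", 1)[1]
    message = {"query": query}  # a dict standing in for a protobuf
    chosen = pick_least_loaded(BACKEND_LOAD)
    reply = BACKENDS[chosen](message)
    return f"{chosen}: {reply['result']}"


def gfe(http_request: str) -> str:
    """GFE: terminate the client connection and forward to the app frontend."""
    return app_frontend(http_request)


def browser(url_path: str) -> str:
    """Browser: after DNS resolution, send the HTTP request to the GFE."""
    return gfe(f"GET {url_path}")
```

Calling `browser("/search?q=hamlet")` walks the whole chain and picks `backend-1`, since it reports the lower load in this made-up table.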