Qianliang's blog

[Paper Notes] Some Classic Distributed Systems

Scaling Memcache at Facebook

NSDI'13, by Facebook

TL;DR: Lessons learned from building a large-scale caching system at Facebook

Main goal: Show the important schemes of system design at different scales

Why read it:

Designs:

Single server:

In a cluster:

In a region:

Across regions:

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

NSDI'12, by Matei Zaharia, UC Berkeley

TL;DR: A data-sharing abstraction used by Spark; keeps data in memory; achieves fault tolerance

Main goal: Improve performance -> reduce disk/network -> reuse in-memory data / improve data locality

Key ideas:

Programming model:
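To make "keep data in memory" and "fault tolerance" concrete, here is a minimal single-process sketch. It assumes the usual reading of the paper: transformations are lazy, a dataset can be pinned in memory, and a lost in-memory copy is rebuilt by replaying the steps that produced it. `ToyRDD`, `lose_memory`, and `computations` are hypothetical names for illustration, not Spark's API.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute: Callable[[], List], parent: Optional["ToyRDD"] = None):
        self._compute = compute            # how to rebuild this dataset from its parent
        self.parent = parent               # previous dataset in the chain of steps
        self._persist = False
        self._cache: Optional[List] = None
        self.computations = 0              # how many times the data was (re)built

    @classmethod
    def from_list(cls, data) -> "ToyRDD":
        stable = list(data)                # stands in for input on stable storage
        return cls(lambda: list(stable))

    def map(self, fn: Callable) -> "ToyRDD":
        # Lazy: only records the step; nothing runs until collect().
        return ToyRDD(lambda: [fn(x) for x in self.collect()], parent=self)

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)], parent=self)

    def persist(self) -> "ToyRDD":
        # Ask to keep the computed data in memory for reuse.
        self._persist = True
        return self

    def collect(self) -> List:
        if self._cache is not None:
            return list(self._cache)       # reuse the in-memory copy
        data = self._compute()
        self.computations += 1
        if self._persist:
            self._cache = data
        return list(data)

    def lose_memory(self) -> None:
        # Simulate a failure that drops the in-memory copy.
        self._cache = None


base = ToyRDD.from_list(range(10))
squares = base.map(lambda x: x * x).persist()
evens = squares.filter(lambda x: x % 2 == 0)

first = evens.collect()      # builds squares once and keeps it in memory
second = evens.collect()     # reuses the in-memory squares
squares.lose_memory()        # simulated failure
third = evens.collect()      # squares is rebuilt from base, result is unchanged
```

The second `collect()` does not rebuild `squares`, which is the "reuse in-memory data" part of the main goal; after the simulated failure, only the lost dataset is recomputed from its parent rather than restored from a replica.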

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

NSDI'11, by UC Berkeley

TL;DR: Resource manager used by Spark; splits scheduling between two parties, with Mesos offering resources and each framework picking which offers to accept

Research problem: Build a scalable and efficient scheduler for different frameworks

Key Ideas:
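The offer/pick split from the TL;DR can be sketched as a toy loop. This is only an illustration under simplifying assumptions: `ToyMaster`, `Framework`, and `Offer` are hypothetical names, offers are whole nodes, and frameworks are offered resources in list order (a real allocator would apply its own policy to decide who gets offered what).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Offer:
    node: str
    cpus: int
    mem_gb: int


@dataclass
class Framework:
    name: str
    # The framework's own scheduling policy: given the offers, return the ones it accepts.
    pick: Callable[[List[Offer]], List[Offer]]


class ToyMaster:
    """Hypothetical master: it only offers resources, it never places tasks itself."""

    def __init__(self, free: List[Offer]):
        self.free = list(free)
        self.allocations: Dict[str, List[Offer]] = {}

    def offer_round(self, frameworks: List[Framework]) -> None:
        for fw in frameworks:
            granted = self.allocations.setdefault(fw.name, [])
            # Party 1 (master) offers whatever is free right now.
            # Party 2 (framework) picks; anything it declines stays free.
            for offer in fw.pick(list(self.free)):
                if offer in self.free:     # ignore stale or duplicate picks
                    self.free.remove(offer)
                    granted.append(offer)


master = ToyMaster([Offer("n1", 4, 16), Offer("n2", 2, 4), Offer("n3", 8, 32)])
big_mem = Framework("big-mem", lambda offers: [o for o in offers if o.mem_gb >= 16])
batch = Framework("batch", lambda offers: offers[:1])
master.offer_round([big_mem, batch])
```

Here `big-mem` declines the small node, which then gets offered to `batch`. The master never needs to understand either framework's placement constraints, which is what keeps it simple enough to scale across different frameworks.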

Datacenter RPCs can be General and Fast

NSDI'19, by Anuj Kalia, CMU

Goal: Break the performance/generality trade-off for datacenter (DC) RPCs

Background

API

Design

High level idea: optimize for the common cases

Scalability

RDMA writes: limited NIC SRAM

-> Choose user-space packet I/O for eRPC

Zero-Copy

Key idea: message buffer management

Sessions

Congestion Control

Uses Google's Timely as the congestion control algorithm (CCA): RTT-based; an RTT increase signals congestion, so the sender decreases its sending rate
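The RTT-based rule above can be sketched as a small rate controller. This is a simplified take on a Timely-style update, not eRPC's actual code: the class name `ToyTimely`, all parameter values, and the units are assumptions, and refinements of the published algorithm are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTimely:
    """Hypothetical, simplified RTT-gradient rate controller (Timely-style)."""
    rate: float                  # current sending rate (arbitrary units)
    min_rtt: float               # baseline RTT, used to normalize the gradient
    t_low: float                 # below this RTT, always increase
    t_high: float                # above this RTT, always decrease
    alpha: float = 0.5           # EWMA weight for the RTT difference (assumed value)
    beta: float = 0.5            # multiplicative-decrease factor (assumed value)
    delta: float = 0.5           # additive-increase step (assumed value)
    min_rate: float = 0.1
    max_rate: float = 100.0
    prev_rtt: Optional[float] = None
    rtt_diff: float = 0.0        # smoothed difference between consecutive RTTs

    def on_rtt_sample(self, rtt: float) -> float:
        if self.prev_rtt is None:          # first sample: no gradient yet
            self.prev_rtt = rtt
            return self.rate

        new_diff = rtt - self.prev_rtt
        self.prev_rtt = rtt
        self.rtt_diff = (1 - self.alpha) * self.rtt_diff + self.alpha * new_diff
        gradient = self.rtt_diff / self.min_rtt

        if rtt < self.t_low:
            self.rate += self.delta        # queues are short: probe for bandwidth
        elif rtt > self.t_high:
            self.rate *= 1 - self.beta * (1 - self.t_high / rtt)
        elif gradient <= 0:
            self.rate += self.delta        # RTT flat or falling: no congestion
        else:
            # RTT rising -> congestion -> back off in proportion to the gradient.
            self.rate *= 1 - self.beta * min(gradient, 1.0)

        self.rate = max(self.min_rate, min(self.max_rate, self.rate))
        return self.rate


cc = ToyTimely(rate=10.0, min_rtt=10.0, t_low=20.0, t_high=200.0)
r0 = cc.on_rtt_sample(50.0)    # first sample, rate unchanged
r1 = cc.on_rtt_sample(60.0)    # RTT rising  -> rate drops
r2 = cc.on_rtt_sample(40.0)    # RTT falling -> rate grows
```

The controller reacts to the change in RTT rather than to packet loss, so it can slow down while queues are still building.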

Still common-case optimizations: