6. Chapter 6: Coordination and Agreement

6.1. Introduction


Introduction

Failure assumptions and failure detector
  • Assumptions:
    • Each pair of processes is connected by reliable channels. 
    • No process failure implies a threat to the other processes’ ability to communicate. 
    • In any particular interval of time, communication between some processes may succeed while communication between others is delayed. 
      • Example: failure of a router between two networks -> network partition
    • Eventually any failed link or router will be repaired or circumvented.
    • The processes may not all be able to communicate at the same time. 
    • The processes fail only by crashing. 

      Figure 6.1: A network partition

  • Failure detector 
    • Service that processes queries about whether a particular process has failed. 
    • (a) Unreliable failure detectors
      • Produce one of two values: Unsuspected or Suspected
      • May or may not accurately reflect whether the process has actually failed.
      • Unsuspected -> the detector has recently received evidence suggesting that the process has not failed
      • Suspected -> the failure detector has some indication that the process may have failed
    • (b) Reliable failure detector
      • Always accurate in detecting a process’s failure
      • Unsuspected or Failed
      • Failed -> the detector has determined that the process has crashed.
  • Algorithm for unreliable failure detector
    • Each process p sends a ‘p is here’ message to every other process.
    • It does this every T seconds.
    • Maximum message transmission time: D seconds
    • If the local failure detector at process q does not receive a ‘p is here’ message within T + D seconds of the last one, then it reports to q that p is Suspected. 
    • If it subsequently receives a ‘p is here’ message, then it reports to q that p is OK.
    • In a synchronous system, the failure detector can be made into a reliable one. 
    • D is not an estimate but an absolute bound on message transmission times; the absence of a ‘p is here’ message within T + D seconds indicates that p has crashed.