Fault Tolerance

Introduction

Fault tolerance is an ability to provide uninterrupted access to data in presense of a failure of a cluster node. Clustered applications have, despite the significant increase in hardware reliability, a very high probability of experiencing hardware failures due to the increased number of hardware components. Therefore, tolerating node failures is very important for long-running applications.

Fault Tolerance in Cacheonix

Cacheonix is fault-tolerant. Cacheonix considers all failures as fail-stop failures. Cacheonix implements fault-tolerance by the means of creating back up copies of a value associated with a key. The number of back up copies is configurable and a range from 0 to N. Cacheonix is an N-tolerant system. It will continue to operate normally if Noperational - Nfailed >= Ntolerant. For 1-tolerant configuration the minimum number of nodes in the cluster is 3.

While this works well for describing the static case when the failure occurs once, in real life nodes fail and repair multiple times throughout the life of the cluster. A node repair occurs when a previously failed node resumes operation or a new node joins. For the purposes of this discussion these are the same events. For Cacheonix to remain N-tolerant in a dynamic system, the following condition has to be satisfied: the rate of node failures should be lesser or equal than the rate of repairs by Ntolerant. In other words cluster nodes should repair at the same rate of faster then they fail. If this condition is violated, the cluster stops being N-tolerant until the condition is restored.

Factors affecting the rate of repairs

Network

The rate of repairs is a function of key+object size and network speed. Cacheonix has to create back up copies of the key-value pairs. The bigger the size of the key-value pair (KVsize), the longer it takes to restore N-tolerance. The higher the network speed (Nspeed), the shorter it takes to restore N-tolerance.

Time to repair for an unsaturated network can be calculated as KVsize*Nbackup/Nspeed = Trepair. The repair rate can be calculated as Tinterval/Trepair =Rrate.

Big Fat Boxes

32-core machines with 64Gbytes of RAM (Big Fat Box) are not uncommon. A bigger machine means lesser maintenance. Unfortunately, this also means lesser fault tolerance. A big fat box can either pack up multiple small cluster nodes or a single larger node. Let's assume the MTBF does not depend on the size of the box. In other words, big fat boxes fail as often as the small ones. This means that to repair a single cluster node with 32 Gigabytes of heap will take 32 times longer then a node with 1Gigabyte of heap.

Conclusion

To maintain high availability:

  • Provide your Cacheonix-based application with a fast and, ideally, dedicated network infrastructure.
  • Keep individual size of cached keys and objects reasonable
  • Opt for small fast machines

Labels

 
(None)