Single Point of Failure
Identifying single points of failure (SPOFs) is a central concern in system design and architecture.
A single point of failure (SPOF) is a part of a system that, if it fails, stops the entire system from functioning. When designing a system, it is crucial to identify its single points of failure. A system may contain one or more of them, and each one needs a mitigation strategy.
For instance, in Example 1 of the diagram above, the ‘centralized coordinator’ is a single point of failure: if it fails, the entire system is paralyzed for lack of coordination. Similarly, in Example 2, the ‘Transformation Service’ is a single point of failure: it is critical to the communication between the service layer and the data service layer, so if it fails, the entire system collapses.
How do we fix single points of failure?
One approach is to add ‘redundancy’ and use a load balancer to distribute traffic across the duplicated components. Redundancy is the duplication of critical components with the intention of increasing the reliability of the system. For data nodes, we add redundancy through a process known as ‘replication’. For compute nodes, we add redundancy by adding more nodes of the same kind.
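To make the idea concrete, here is a minimal sketch of redundant compute nodes behind a load balancer. The `Node` and `LoadBalancer` classes, the `healthy` flag, and the node names are all hypothetical and exist only for illustration; a real load balancer would rely on network health checks rather than an in-process flag.

```python
class Node:
    """Hypothetical compute node; `healthy` simulates whether it is up."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class LoadBalancer:
    """Round-robin balancer that skips nodes that fail to respond."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0  # index of the node to try first on the next request

    def route(self, request):
        # Try each node at most once, starting from the round-robin cursor.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            try:
                return node.handle(request)
            except ConnectionError:
                continue  # this node is down, fall through to the next one
        raise RuntimeError("all nodes are down")


node_a, node_b = Node("node-a"), Node("node-b")
balancer = LoadBalancer([node_a, node_b])

node_a.healthy = False             # simulate a failure of one node
print(balancer.route("req-1"))     # node-b handled req-1
```

With a single node, the failure of `node-a` would have taken the service down; with two nodes, the request is still served. Note that in this sketch the balancer itself is a single component, so in practice it usually needs redundancy as well.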
Single points of failure can be analyzed at various levels of a system, as outlined below:
- Application-level SPOF – In this category, the application components are analyzed to find single points of failure. For example: web servers, databases, queues, and critical modules such as authentication servers.
- Network-level or physical SPOF – In this category, the physical network is analyzed to find single points of failure. For example: network switches, routers, proxy servers, and storage servers.
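One simple way to carry out this kind of analysis is to model the system as a graph of components and check, for each component, whether a request can still get from its source to its destination once that component is removed. The sketch below assumes a hypothetical topology (`client`, `router`, `web-1`, `web-2`, `auth`, `db`) and hypothetical helper names (`reachable`, `find_spofs`); it is one possible approach, not a standard tool.

```python
from collections import deque


def reachable(graph, source, target, removed):
    """Breadth-first search from source to target, ignoring the removed component."""
    seen = {source}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        for neighbor in graph.get(current, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def find_spofs(graph, source, target):
    """Return the components whose removal cuts source off from target."""
    components = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = components - {source, target}
    return sorted(
        c for c in candidates if not reachable(graph, source, target, removed=c)
    )


# Hypothetical request flow: each key forwards requests to the listed components.
topology = {
    "client": ["router"],
    "router": ["web-1", "web-2"],
    "web-1": ["auth"],
    "web-2": ["auth"],
    "auth": ["db"],
}

print(find_spofs(topology, "client", "db"))  # ['auth', 'router']
```

In this example the two web servers are redundant, so neither is reported, while the single `auth` server (application level) and the single `router` (network level) are both flagged.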