Fault Handling
It is common for systems to fail. Failures can be network failures, application logic failures, storage failures, infrastructure failures, and so on.
It is important for systems to handle failures gracefully and to be robust enough to keep working when they occur.
There are two error handling patterns which are commonly used in system design.
- Fail safe
- Fail fast
Let us take a look at each of these:
1. Fail safe:
In this pattern, the system is designed so that a failure doesn’t affect the overall system behavior and doesn’t disrupt normal functioning. This might temporarily hide the problem. However, the problem can resurface later as a bigger one, and at that stage it may show different symptoms, which generally makes the root cause harder to trace.
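As a minimal sketch of the fail safe idea, the Python snippet below falls back to a default value when a lookup fails, so the caller keeps working. The names `lookup_discount`, `get_discount_fail_safe`, and `DEFAULT_DISCOUNT` are hypothetical and exist only for illustration. The warning log is one possible way to keep the hidden fault traceable.

```python
import logging

logger = logging.getLogger("fail_safe_demo")

DEFAULT_DISCOUNT = 0.0  # hypothetical fallback value


def lookup_discount(customer_id, discounts):
    """Hypothetical lookup that raises KeyError for unknown customers."""
    return discounts[customer_id]


def get_discount_fail_safe(customer_id, discounts):
    """Fail safe: absorb the fault and return a default so the caller continues."""
    try:
        return lookup_discount(customer_id, discounts)
    except KeyError:
        # The fault is hidden from the caller; logging it leaves a trace
        # that can help when the problem resurfaces later.
        logger.warning("No discount found for %r, using default", customer_id)
        return DEFAULT_DISCOUNT
```

Here a missing customer does not interrupt the caller, which is exactly why the underlying data problem could go unnoticed without the log entry.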
2. Fail fast:
In this pattern, the system is designed to fail quickly when there is a fault, and the failure is allowed to disrupt the normal functioning of the application. This allows the problem to be identified and addressed quickly. The downside of this approach is that it disrupts users and normal operation. However, the short-term inconvenience usually proves beneficial and cost effective, since the problem is tackled up front.
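As a contrast, here is a minimal sketch of fail fast in Python: the function validates its input and raises immediately instead of continuing with bad data. The function name `load_config_fail_fast` and the setting names `db_url` and `timeout_seconds` are hypothetical, chosen only for illustration.

```python
def load_config_fail_fast(config):
    """Fail fast: reject a bad configuration immediately instead of guessing."""
    required = ("db_url", "timeout_seconds")  # hypothetical required settings
    missing = [key for key in required if key not in config]
    if missing:
        # Stop right away so the fault is visible at its source.
        raise ValueError(f"Missing required settings: {', '.join(missing)}")
    if config["timeout_seconds"] <= 0:
        raise ValueError("timeout_seconds must be positive")
    return config
```

A caller that passes an incomplete configuration is interrupted at once, which is disruptive but points directly at the cause.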
Retry mechanisms in fault handling:
When dealing with errors, retry mechanisms are very popular and should be considered whenever multiple systems are integrated. Let us say system A (or service A) invokes system B (or service B). If system B is down or throws an error, system A can retry the same invocation, with the same request data, periodically until system B responds successfully. This mechanism makes sure that the transactions that failed in that time window (when system B was down) are not lost.
This kind of implementation requires system A to persist the request data, so that it can be used during the retries.
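The retry flow on the system A side could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: `RetryingClient`, `ServiceBError`, and the `call_service_b` callable are hypothetical names, the in-memory `deque` stands in for real persistent storage, and the `max_rounds` limit is an added safeguard so the loop cannot run forever.

```python
import time
from collections import deque


class ServiceBError(Exception):
    """Hypothetical error raised when system B is down or fails."""


class RetryingClient:
    """System A side: store failed requests and retry them periodically."""

    def __init__(self, call_service_b, retry_interval=5.0, sleep=time.sleep):
        self._call = call_service_b      # hypothetical callable that invokes system B
        self._retry_interval = retry_interval
        self._sleep = sleep              # injectable so the wait can be skipped in tests
        self._pending = deque()          # stand-in for durable storage (assumption)

    @property
    def pending_count(self):
        return len(self._pending)

    def send(self, request):
        """Try once; on failure, keep the request data for a later retry."""
        try:
            return self._call(request)
        except ServiceBError:
            self._pending.append(request)
            return None

    def retry_pending(self, max_rounds=10):
        """Resend stored requests, unchanged, until they succeed or rounds run out."""
        results = []
        for _ in range(max_rounds):
            if not self._pending:
                break
            for _ in range(len(self._pending)):
                request = self._pending.popleft()
                try:
                    results.append(self._call(request))
                except ServiceBError:
                    self._pending.append(request)  # keep it for the next round
            if self._pending:
                self._sleep(self._retry_interval)  # wait before the next round
        return results
```

In a real system the pending requests would live in a database or a durable queue rather than in memory, so that they also survive a restart of system A.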