In this section, we will take a look at distributed architecture and the challenges it poses.
A distributed architecture is one in which resources (computing workload or data) are spread across several nodes or machines. A distributed system typically has redundant nodes, i.e. more than one node that can perform the same computation, or more than one node that stores a copy of the same data. This redundancy keeps the system available when a node fails.
However, distributed systems are not easy to design and they come with their own set of challenges. Let us take a look at the challenges they pose.
Challenges of Distributed Systems
- Being Fault tolerant
- Know and Act
- Centralized coordination
- Monitoring and Troubleshooting
1. Being Fault tolerant
In a distributed environment, the system should expect and handle machine failures (failures are more frequent when commodity hardware is used). There are two aspects to a fault-tolerant system:
- If a machine that was performing a computation fails, another machine should be able to perform the same computation.
- If a machine that was storing data fails, a copy of that data should be available on another machine.
The system architecture has to account for both aspects in order to make the system fault tolerant.
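The two aspects above can be illustrated with a minimal in-memory sketch. It is not a production design: the names `Node`, `ReplicatedStore`, `owners` and `run_with_failover` are hypothetical, placement is a simple byte-sum of the key, and a node "failure" is simulated by flipping a flag.

```python
class Node:
    """A hypothetical machine that can store data and run computations."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}


class ReplicatedStore:
    """Keeps each key on `replicas` different nodes (illustrative only)."""

    def __init__(self, nodes, replicas=2):
        self.nodes = nodes
        self.replicas = replicas

    def owners(self, key):
        # Deterministic placement: choose `replicas` consecutive nodes.
        start = sum(key.encode()) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def put(self, key, value):
        # Write the value to every live replica of the key.
        for node in self.owners(key):
            if node.alive:
                node.data[key] = value

    def get(self, key):
        # Read from the first live replica that holds the key.
        for node in self.owners(key):
            if node.alive and key in node.data:
                return node.data[key]
        raise KeyError(key)


def run_with_failover(task, nodes):
    """Run `task` on the first live node; returns (node name, result)."""
    for node in nodes:
        if node.alive:
            return node.name, task()
    raise RuntimeError("no live node available")
```

With three nodes and two replicas, marking the first owner of a key as failed still lets `get` return the value from the second owner. Only when every replica of a key is down does the read fail.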
2. Know and Act
In a distributed environment, when machines are added or removed, the system should learn about the change without manual intervention. It should then act on it: distribute workload to the newly added machines and stop assigning workload to the removed ones.
This process of detecting the addition or removal of machines and acting on it is critical for a distributed system.
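A minimal sketch of this idea, assuming a hypothetical `Cluster` class that tracks membership in memory and hands out work round-robin. In a real system the `join` and `leave` calls would be triggered by a discovery mechanism rather than invoked by hand.

```python
class Cluster:
    """Tracks current machines and assigns tasks only to them (illustrative)."""

    def __init__(self):
        self.members = []
        self._next = 0  # round-robin counter

    def join(self, name):
        # "Know": record a newly added machine.
        if name not in self.members:
            self.members.append(name)

    def leave(self, name):
        # "Know": forget a removed machine.
        if name in self.members:
            self.members.remove(name)

    def assign(self, task):
        # "Act": pick the next machine from the current membership only.
        if not self.members:
            raise RuntimeError("no machines available")
        machine = self.members[self._next % len(self.members)]
        self._next += 1
        return machine
```

Because `assign` always consults the current membership list, a machine that has left never receives new work, and a machine that has joined starts receiving work on a later call without any other change.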
3. Centralized coordination
Some distributed systems allow one computation to be split into smaller tasks that run on multiple machines, with the results then collated. Such systems need a central coordinator (a system or a machine) with the following characteristics:
- Knowledge of which part of the computation is assigned to which machine.
- Ability to direct other machines to carry out specific tasks.
- Ability to delegate a task to another machine if the machine it was assigned to fails.
- Ability to collate the results from all the machines to produce a final output.
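The four characteristics above can be sketched with a toy coordinator that sums a list of numbers. This is an illustration under simplifying assumptions: `Worker` and `Coordinator` are hypothetical names, workers are plain objects in the same process, and a failure is simulated by raising `ConnectionError`.

```python
class Worker:
    """A hypothetical machine that computes the sum of a chunk."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, chunk):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        return sum(chunk)


class Coordinator:
    """Splits work, tracks assignments, reassigns on failure, collates."""

    def __init__(self, workers):
        self.workers = workers
        self.assignments = {}  # chunk index -> name of worker that completed it

    def compute_sum(self, numbers, chunk_size):
        # Split the computation into smaller chunks.
        chunks = [numbers[i:i + chunk_size]
                  for i in range(0, len(numbers), chunk_size)]
        partials = [self._run_chunk(i, chunk) for i, chunk in enumerate(chunks)]
        # Collate the partial results into the final output.
        return sum(partials)

    def _run_chunk(self, index, chunk):
        # Try the preferred worker first, then delegate to the others.
        count = len(self.workers)
        for offset in range(count):
            worker = self.workers[(index + offset) % count]
            try:
                result = worker.run(chunk)
            except ConnectionError:
                continue  # this worker failed; try the next one
            self.assignments[index] = worker.name
            return result
        raise RuntimeError("all workers failed")
```

After a run, `assignments` records which worker finished each chunk, which is the kind of bookkeeping a coordinator needs in order to delegate work again when a machine fails.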
4. Monitoring and Troubleshooting
In a distributed environment with a large number of machines, monitoring which machines are up or down and tracking the health of each machine becomes very important. Monitoring is a necessity rather than an optional extra. Likewise, troubleshooting issues in a distributed environment can get challenging when the work being analyzed spans multiple machines.
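One common way to track which machines are up or down is a heartbeat: each machine reports in periodically, and a machine that has been silent for too long is treated as down. The sketch below assumes a hypothetical `HeartbeatMonitor` class and takes the current time as an explicit argument so that its behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Marks a machine as down if it has not reported within the timeout."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # machine name -> time of last heartbeat

    def heartbeat(self, machine, now):
        # Record that `machine` reported in at time `now` (in seconds).
        self.last_seen[machine] = now

    def status(self, now):
        # A machine is "up" if its last heartbeat is recent enough.
        return {
            machine: "up" if now - seen <= self.timeout_seconds else "down"
            for machine, seen in self.last_seen.items()
        }
```

In practice the `now` value would come from a clock such as `time.monotonic()`, and the timeout would be chosen relative to how often machines send heartbeats.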
Container orchestration platforms like Kubernetes and cloud platforms like Amazon Web Services offer an ecosystem for building distributed systems. Coordination services like Apache ZooKeeper provide many features for coordinating distributed systems.