Replication and Partitioning
In this section, we will take a look at two concepts which are generally used while scaling out data storage in a cloud environment. Consider a data storage system (any database), which needs to be scaled out for high availability in a cloud environment.
Replication:
In order to be highly available, the data needs to be stored in multiple nodes, so that if one node goes down, the other available nodes cater to the data requests. This process of having multiple copies of data is referred to as ‘Replication’.
Replication Factor:
A replication factor is the number of replicas (or copies) which would be maintained for each set of data. In the above diagram, the replication factor is two, since two copies are maintained apart from the primary data.
Challenges of replicating data:
Replication of data poses a very important challenge namely syncing of data. The changes made to any one copy of the data should be synced up in other nodes as well to guarantee consistency of data.
Partitioning:
Replication allows multiple copies of data to be available in multiple nodes. However, what if the data grows in size and one node is not capable of storing all the data. In such a scenario, we would need to split the data into multiple nodes, so that collectively those nodes account for the entire data. This process of splitting the data into multiple nodes is known as 'Partitioning'.
Challenges of partitioning data:
When we partition the data, the system needs to maintain book keeping of which part of data is stored in which node. Based on this information, the system needs to query the corresponding node when a request is made, in order to guarantee high performance.
Replication and Partitioning combined:
Replication and Partitioning could be combined for high availability and scalability from a data perspective. Thus we can have the data partitioned into multiple nodes and each partitioned node could be replicated across multiple nodes.
For instance, in the example above, the data (A, B, C) is partitioned into three nodes, with each node having one set of data. Also, the data is replicated across all the three nodes, in such a way that the partitioned data and replicated data are present in a mutually exclusive manner in any node. i.e., Node 1 maintains partitioned data A and replica of data B, Node 2 maintains partitioned data B and replica of data C, Node 3 maintains partitioned data C and replica of data A.
In such a setup, if node 1 goes down, the system would retrieve the data A from the replica in node 3.
Reference Implementations:
Cassandra database provides partitioning and replication of data. Mongo DB also provides partitioning through partitioned collections.