In this section, we will take a look at Kafka, which is a distributed streaming platform.
Kafka is a distributed messaging system which uses Publish/Subscribe model of communication. Kafka provides high throughput when compared with other messaging systems.
Kafka maintains the messages by writing to a file system, in contrast to other systems which maintain the messages in memory, and therefore the capacity is not limited by the RAM size.
Since messages are persisted on disk, it acts like a distributed commit log. The messages could be read by the consumer at any time and the consumer can also chose to read older messages by traversing back and forth through the message list.
Kafka uses Topics and each consumer can subscribe to one or more topics. Messages are written to the Topic by a producer and each topic maintains a commit log which stores the messages in the exact same sequence in which they were written. The index of the log is also known as offset.
Kafka works on a pull model. i.e., the consumers have to request for messages. Kafka keeps writing the messages as it arrives to a file and consumers have to maintain the index (or offset) of the messages being consumed. Thus Kafka transfers the overhead of tracking which consumer has read which message and helps in being lightweight.
Kafka supports partitioning and replication to provide high degree of scalability and redundancy. Each topic could be configured to have multiple partitions. The incoming messages are stored into one of the partitions. The Kafka cluster maintains the logs for a configurable period of time (retention period), after which the logs would be discarded.
Messages in a partition are also copies or replicated into other partitions, so that if one partition is down, the copy is available in other partitions. This architecture makes Kafka more suitable for distributed computing.
We can have cluster of multiple Kafka instances (known as brokers) and create topics whose partitions would be distributed across the broker instances. This would ensure high availability.
For example, let us say we have a cluster of 4 broker instances. If we create a topic T1, and configure 4 partitions for that topic, then the 4 partitions could be distributed across all the 4 broker instances.
Kafka uses zookeeper to maintain cluster state and also it keeps track of the index offset of each consumer. Kafka also provides a Streaming API to perform complex transformations on streaming data.
Kafka use cases:
Kafka is suitable for high throughput systems like data coming in from various sources like sensors, event streams etc. and consumers which would like to work through the messages offline or which would like to navigate through the messages.