Introduction to Apache Kafka

Senura Vihan Jayadeva
Apr 16, 2023

Introduction

When we begin learning about microservices, we encounter a plethora of concepts, patterns, protocols, and tools such as Messaging, AMQP, RabbitMQ, Event-sourcing, gRPC, CQRS, among others. However, Apache Kafka always intrigued me, and I wanted to explore it further. Unfortunately, there is a widespread misunderstanding of Kafka, with many regarding it as just another messaging system, when in reality, it has so much more to offer.

While you can use Kafka for messaging, utilizing it solely for this purpose can be inefficient. Hence, the aim of this article is to clarify how Kafka operates and the reasons why you should consider it for your projects. I will begin by presenting the fundamental concepts and debunking the notion that Kafka is merely a messaging system. Hopefully, this article will assist those new to this field, just as I am.

What is Kafka?

Apache Kafka is a distributed streaming platform, which means it can ingest and process unbounded streams of real-time data across a cluster of machines. Each machine in the cluster is called a broker, and together they make Kafka resilient, reliable, scalable, and fault-tolerant.

However, some people mistakenly think that Kafka is just another messaging system. Messaging is simply the act of sending a message from one place to another, and it involves three main actors: producers, who send messages to one or more queues; the queue, a buffer data structure that receives and delivers messages in FIFO order; and consumers, who receive messages when they’re published.

In contrast, Kafka’s distributed architecture and real-time data processing capabilities set it apart from a basic messaging system.

Kafka Fundamentals

Let’s build a working picture of the key concepts: brokers, topics, partitions, producers, and consumers. A short code sketch follows the definitions.

Brokers: In Apache Kafka, brokers are the servers that make up the cluster. Each broker stores a portion of the data published to the topics. Brokers communicate with each other to ensure that data is properly distributed and replicated across the cluster.

Topics: A topic in Apache Kafka is a category or feed name to which messages are published by producers. Topics are split into partitions, which can be spread across multiple brokers in a Kafka cluster. Each message published to a topic is delivered to one partition within the topic.

Partitions: Partitions are a way to split a topic across multiple brokers in a Kafka cluster. Each partition is an ordered, immutable sequence of messages. The partitions within a topic can be replicated across multiple brokers to ensure that data is not lost in the event of a broker failure.

Producers: Producers are the programs that publish messages to a Kafka topic. Producers can choose which partition to publish messages to or let Kafka choose for them.

Consumers: Consumers are the programs that subscribe to a Kafka topic and consume messages from it. Consumers can read messages from one or more partitions in a topic, and each consumer group can have multiple consumers.
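To make these definitions concrete, here is a minimal sketch, using Kafka’s Java AdminClient, of creating a topic with a chosen partition count and replication factor. The topic name player-actions is a hypothetical choice for the gaming example below, and it assumes a broker is reachable at localhost:9092.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a broker is running locally on the default port
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A hypothetical topic with 3 partitions, each replicated
            // to 2 brokers so a single broker failure loses no data
            NewTopic topic = new NewTopic("player-actions", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```

Once the topic exists, producers and consumers refer to it by name without caring which brokers actually host its partitions.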

Imagine a hypothetical scenario: a gaming company wants to implement a real-time analytics system to gather insights about its players’ behavior and preferences. The company has a massive user base, and it needs to handle a large volume of data in real time to make quick business decisions and provide better gaming experiences to its users.

To achieve this, we can use Apache Kafka as the backbone of the analytics system. Kafka would be used to collect, process, and store the real-time data generated by the players’ interactions with the game. The data would be organized into topics, which represent different types of information, such as player actions, game events, and player demographics.

Each partition is a unit of parallelism within Kafka, and it represents a sequence of messages that are totally ordered and immutable. The messages within a partition are assigned a sequential offset, starting from 0 and incrementing by 1 for each new message. The offset of a message within a partition is its unique identifier within that partition.

When a producer sends a message to Kafka, it specifies a partition key, which is used to determine which partition the message should be assigned to. The partition key is usually based on some attribute of the message, such as the game ID in our scenario. By using a consistent partition key, all messages related to a particular game are assigned to the same partition, ensuring that they are processed in the order they were produced.
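Here is a minimal producer sketch in Java under those assumptions; the topic name player-actions, the game-ID key, and the JSON payload are all hypothetical, and a broker is again assumed at localhost:9092.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class GameEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The game ID is the message key: Kafka hashes the key to pick
            // a partition, so every event for "game-42" lands on the same
            // partition and stays in order.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("player-actions", "game-42", "{\"action\":\"jump\"}");
            producer.send(record);
        }
    }
}
```

Because the default partitioner hashes the key, every message keyed "game-42" maps to the same partition (as long as the partition count doesn’t change), which is exactly what preserves per-game ordering.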

In this scenario, the messages related to each game are assigned to a separate partition, and within the consumer group each partition is handled by a single consumer. This allows each game to be processed independently and in parallel, without games interfering with each other. The consumer group is responsible for consuming the messages in each partition in the order in which they were produced and processing them accordingly.

How messages are produced and consumed in Kafka

In Kafka, producers generate messages and send them to a broker, which stores the messages in partitions of a topic. Consumers subscribe to one or more topics and read messages from partitions, tracking their offsets to avoid reprocessing messages. Within a consumer group, each partition is assigned to exactly one consumer, so adding consumers (up to the number of partitions) load-balances the processing of messages. Once a consumer has processed a message, it can commit its offset back to Kafka. Kafka also allows consumers to replay messages from an earlier offset in case of failures or data loss.
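Here is a minimal consumer sketch in Java, continuing with the hypothetical player-actions topic: it joins a consumer group, prints each record’s partition and offset, and commits offsets manually only after processing, so a crash before the commit causes a replay rather than data loss.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GameEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // All consumers sharing this group ID split the partitions among them
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit offsets manually after processing, so an unprocessed
        // message is re-read after a crash rather than silently skipped
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("player-actions"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // mark these offsets as processed
            }
        }
    }
}
```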

Role of ZooKeeper in managing Kafka clusters

Kafka is a distributed system that scales horizontally across multiple servers (brokers) to handle large amounts of data. To ensure that all data is available and consistent across the cluster, Kafka requires coordination among its various components. This is where ZooKeeper comes in.

ZooKeeper is a distributed coordination service that manages the configuration and coordination of distributed systems. Kafka uses ZooKeeper to keep track of the status of each broker in the cluster, including which brokers are alive, which have failed, and which broker is the leader for each partition. This information is used to route messages to the appropriate brokers and ensure that all data is available and consistent across the cluster.

ZooKeeper also stores metadata about Kafka topics, such as the number of partitions, replication factor, and which brokers are responsible for which partitions. This metadata ensures that all consumers and producers have a consistent view of the topics, even if the cluster is scaled up or down. It also coordinates partition assignment among consumer group members and elects leaders for each partition in Kafka, ensuring only one broker is responsible for accepting writes to a partition at any given time.
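As a rough illustration, the broker side of this relationship is just a few lines of configuration: a broker’s server.properties points it at the ZooKeeper ensemble. The values below are placeholders for a local single-broker setup.

```properties
# Illustrative server.properties for a local broker
# Unique ID for this broker within the cluster
broker.id=0
# ZooKeeper ensemble the broker registers itself with
zookeeper.connect=localhost:2181
# Directory where partition data is stored on disk
log.dirs=/tmp/kafka-logs
```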

In summary, ZooKeeper is a critical component in the Kafka ecosystem, ensuring the reliability, availability, and consistency of Kafka clusters.

Since this is already a long article, I will cover advanced Kafka concepts in the next one and walk through a demo application showing how to do the Kafka configurations.


Senura Vihan Jayadeva

Software Engineering undergraduate of Sri Lanka Institute of Information Technology | Physical Science Undergraduate of University of Sri Jayewardenepura