Learn to download and set up Apache Kafka and its built-in ZooKeeper from scratch. This tutorial is aimed at absolute beginners and offers some tips that will help while learning Kafka in the long run.
1. What is Apache Kafka?
Apache Kafka is a distributed streaming platform based on a publish/subscribe messaging system. Another term often used for Kafka is “distributed commit log”.
Just as we store transactional data in a database so that we can retrieve it later to make business decisions, Kafka stores data as messages. Data within Kafka is stored durably, in order, and can be read deterministically.
The main features of Kafka are its ability to scale and its protection against failures through data replication.
2. Core Concepts
Let us go through some key terms used in Apache Kafka.
The unit of data within Kafka is called a message. Think of this as a row in a database table.
A message has two parts: a key and a body. Both are simply arrays of bytes, and Kafka does not do anything magical to read or make sense of these bytes. The content can be XML, JSON, a plain string or anything else. Many Kafka developers favor Apache Avro, a serialization framework originally developed for Hadoop. Kafka does not care about the format and stores everything.
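To make the "just bytes" idea concrete, here is a minimal sketch in plain Python. It does not use any Kafka client library; the `Message` class and the two helper functions are hypothetical names invented for this illustration, and JSON is only one of many possible formats.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical stand-in for a Kafka message: two opaque byte arrays."""
    key: bytes
    body: bytes


def encode_json_message(key: str, payload: dict) -> Message:
    # The producer decides on the format; here it is UTF-8 encoded JSON.
    return Message(key=key.encode("utf-8"),
                   body=json.dumps(payload).encode("utf-8"))


def decode_json_body(message: Message) -> dict:
    # The consumer has to know the format to make sense of the bytes.
    return json.loads(message.body.decode("utf-8"))


msg = encode_json_message("user-42", {"action": "page-view", "path": "/home"})
print(msg.body)                # raw bytes, which is all the broker would see
print(decode_json_body(msg))   # the consumer turns them back into a dict
```

The point of the sketch is that interpreting the bytes is entirely the job of the producer and the consumer, which is why the two sides have to agree on a format.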
Keys are used to write messages into partitions in a more controlled manner. Kafka computes a hash of the key and uses it to find the partition number where the message has to be written (the actual logic is not quite this simple, of course).
This ensures that messages with the same key are always written to the same partition.
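The idea of hashing a key to pick a partition can be sketched in a few lines. This is a simplified illustration, not Kafka's real partitioner: the function name `choose_partition` is made up, and `zlib.crc32` is used only as a convenient deterministic hash (Kafka's Java client uses a different hash function).

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition number in the range [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # crc32 is a stand-in hash chosen for this sketch. It gives the same
    # result on every run, unlike Python's built-in hash() for bytes.
    return zlib.crc32(key) % num_partitions


for key in (b"user-1", b"user-2", b"user-1"):
    print(key, "->", choose_partition(key, 3))
```

Because the result depends only on the key and the number of partitions, both occurrences of `user-1` land in the same partition. It also shows why changing the number of partitions can move a key to a different partition.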
A batch is just a collection of messages, all of which are being produced to the same topic and partition. Messages travel over the network in batches, which makes network utilization more efficient.
Batches are also typically compressed, providing more efficient data transfer and storage at the cost of some processing power.
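A rough sketch of batching and compression, using only the Python standard library. The function names and the "JSON list, then gzip" layout are assumptions made for this example and do not reflect Kafka's actual on-wire batch format.

```python
import gzip
import json
from collections import defaultdict


def build_batches(records):
    """Group (topic, partition, body) records into one batch per topic-partition."""
    batches = defaultdict(list)
    for topic, partition, body in records:
        batches[(topic, partition)].append(body)
    return dict(batches)


def compress_batch(bodies):
    # Hypothetical layout: the whole batch as a JSON list, gzip-compressed once.
    return gzip.compress(json.dumps(bodies).encode("utf-8"))


def decompress_batch(blob):
    return json.loads(gzip.decompress(blob).decode("utf-8"))


records = [
    ("activity-log", 0, "page-view /home"),
    ("activity-log", 1, "page-view /cart"),
    ("activity-log", 0, "page-view /home"),
]
batches = build_batches(records)
blob = compress_batch(batches[("activity-log", 0)])
restored = decompress_batch(blob)
print(len(batches), "batches;", len(blob), "compressed bytes for partition 0")
```

Compressing the batch as a whole rather than message by message lets repeated content across messages compress well, at the cost of the CPU time spent compressing and decompressing.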
A Kafka topic is very similar to a database table or a folder in a filesystem. Topics are additionally broken down into a number of partitions.
For example, consider a topic named “activity-log” that has 3 partitions.
When a source system sends messages to the activity-log topic, these messages (1-n) can be stored in any of the partitions, based on load and other factors. Here, message-1 will be stored in one partition only. Similarly, message-2 will be stored in the same or another partition. No message will be stored in multiple partitions of a given topic.
Please note that if the order of messages matters, Kafka only preserves the order of messages stored within a single partition. There is no ordering guarantee across messages stored in different partitions.
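The partition behavior described above can be mimicked with a toy in-memory model. The `Topic` class below is a hypothetical teaching aid, not a Kafka API, and the round-robin placement is just one possible policy chosen for the example.

```python
class Topic:
    """A toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append to exactly one partition and return the message's offset there."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1

    def read(self, partition, from_offset=0):
        """Return the messages of one partition in the order they were written."""
        return self.partitions[partition][from_offset:]


topic = Topic("activity-log", 3)
# Spread six messages round-robin over the three partitions.
for n in range(1, 7):
    topic.append((n - 1) % 3, f"message-{n}")

print(topic.read(0))  # ['message-1', 'message-4']
print(topic.read(1))  # ['message-2', 'message-5']
```

Each partition returns its own messages in write order, but nothing in the model records whether message-2 was written before or after message-4. That is the same limitation described above: order is only meaningful inside a single partition.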
A stream is considered to be a single topic storing a certain class of data, regardless of the number of its partitions. When working with other systems, Kafka presents this topic (e.g. activity-log) as either a producer or a consumer of a stream of those messages.
2.4. Brokers and Cluster
A single Kafka server is called a broker. The broker receives messages from producer clients, assigns and maintains their offsets, and stores the messages in the storage system. It also services consumers, responding to fetch requests for partitions and responding with the messages that have been committed to disk.
Given suitable hardware, a single broker can easily handle thousands of partitions and millions of messages per second.
Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one broker also functions as the cluster controller, which is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures.
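To give a feel for the controller's bookkeeping, here is a deliberately naive sketch. The functions `assign_partitions` and `handle_broker_failure` and the broker names are invented for this example. Real Kafka relies on replication and leader election, which this sketch does not model.

```python
def assign_partitions(partitions, brokers):
    """Round-robin assignment of partition ids to broker names."""
    if not brokers:
        raise ValueError("at least one broker is required")
    return {p: brokers[i % len(brokers)] for i, p in enumerate(partitions)}


def handle_broker_failure(assignment, failed, live_brokers):
    """Move the failed broker's partitions to live brokers; keep the rest as-is."""
    orphaned = [p for p, broker in assignment.items() if broker == failed]
    updated = dict(assignment)
    if orphaned:
        updated.update(assign_partitions(orphaned, live_brokers))
    return updated


brokers = ["broker-1", "broker-2", "broker-3"]
assignment = assign_partitions([0, 1, 2, 3], brokers)
print(assignment)

# Assume the controller has noticed that broker-2 stopped responding.
after_failure = handle_broker_failure(assignment, "broker-2", ["broker-1", "broker-3"])
print(after_failure)
```

Only the partitions of the failed broker are moved, so the healthy brokers keep serving what they already had.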
3. Advantages of Kafka
- Kafka is able to provide high throughput while handling multiple producers emitting data sets to a single topic or multiple topics. This makes Kafka well suited for processing bulk events/messages from front-end systems recording page views, mouse tracking or user behavior.
- Kafka allows multiple consumers to read any single stream of messages without interfering with each other. Each message can be read any number of times because messages are durable.
- Durable messages also mean that consumers can work on historical data, although Kafka supports real-time processing as well. It also means that if some consumers are offline for a while, they do not lose any data and get it when they reconnect.
- Kafka is highly scalable, and brokers (nodes) can be added or removed at runtime. The cluster need not be stopped.
- Kafka performs well and can handle millions of records per second, given supporting hardware or infrastructure.
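The points about multiple consumers and offline consumers come down to each consumer tracking its own read position. The following is a minimal sketch of that idea in plain Python. The `Consumer` class and its `poll` method are hypothetical and only loosely inspired by how Kafka clients behave.

```python
class Consumer:
    """A toy consumer that remembers how far it has read in a shared log."""

    def __init__(self, name):
        self.name = name
        self.offset = 0  # position of the next message to read

    def poll(self, log):
        """Return the messages this consumer has not seen yet and advance its offset."""
        new_messages = log[self.offset:]
        self.offset = len(log)
        return new_messages


log = ["message-1", "message-2"]   # durable messages of a single partition
realtime = Consumer("realtime")
reporting = Consumer("reporting")

first = realtime.poll(log)         # realtime reads right away
log.append("message-3")            # more data arrives while reporting is offline
second = realtime.poll(log)        # realtime only gets the new message
catch_up = reporting.poll(log)     # reporting reconnects and misses nothing

print(first, second, catch_up)
```

Reading never removes anything from the log, so one consumer's progress has no effect on another, and a consumer that was away simply continues from its last offset.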
This post gave a high-level overview of what Kafka looks like. There is a lot more to read about once you go into the finer details.
In the next post, we will learn to download and start a Kafka broker and run some beginner’s commands.
Happy Learning !!