Learn to download and set up Apache Kafka and its built-in ZooKeeper from scratch. This tutorial is for absolute beginners and offers some tips to keep in mind while learning Kafka in the long run.
1. What is Apache Kafka?
Apache Kafka is a distributed streaming platform based on the publish/subscribe messaging pattern. Another term used for Kafka is “distributed commit log”.
Just like we store transactional data in a database so that we can retrieve it later to make business decisions, Kafka also stores data in the form of messages. Data within Kafka is stored durably, in order, and can be read deterministically.
The main features of Kafka are its scaling capability and its protection against failures through data replication.
1.1. Messages and Batches
The unit of data within Kafka is called a message. Think of it as a row in a database table.
The message has two parts – a key and a body. Both are simply arrays of bytes, and Kafka does nothing magical to read or make sense of these bytes. The body can be XML, JSON, a plain string, or anything else. Many Kafka developers favor Apache Avro, a serialization framework originally developed for Hadoop. Kafka does not care either way and stores everything as-is.
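To make this concrete, here is a tiny Python sketch (not the actual Kafka client API) showing how a key and a JSON body both end up as plain bytes by the time Kafka sees them; the key and payload values are made up for illustration:

```python
import json

def to_message(key: str, value: dict) -> tuple[bytes, bytes]:
    """Serialize a key and a JSON body into raw bytes, as Kafka stores them."""
    return key.encode("utf-8"), json.dumps(value).encode("utf-8")

# Kafka would store these byte arrays as-is, without interpreting them.
key, body = to_message("user-42", {"action": "page-view", "page": "/home"})
print(type(key), type(body))  # both are plain bytes
```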
Keys are used to write messages into partitions in a more controlled manner. Kafka computes a hash of the key and uses it to determine the partition number the message has to be written to (the actual logic is not quite this simple, of course).
This ensures that messages with the same key are always written to the same partition.
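The idea can be sketched in a few lines of Python. Note this is only the concept: the real Kafka producer uses a murmur2 hash, while this sketch uses MD5 just to get a deterministic hash from the standard library:

```python
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Simplified partitioner: hash the key and map it to a partition.
    (The real Kafka producer uses murmur2; this only shows the idea.)"""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always maps to the same partition:
p1 = choose_partition(b"user-42", 3)
p2 = choose_partition(b"user-42", 3)
assert p1 == p2
```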
A batch is just a collection of messages, all of which are being produced to the same topic and partition.
Messages move across the network in batches for more efficient network utilization.
Batches are also typically compressed, providing more efficient data transfer and storage at the cost of some processing power.
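A rough Python sketch of this trade-off, using gzip from the standard library in place of Kafka's actual batch format and compression codecs:

```python
import gzip
import json

def build_batch(messages: list[dict]) -> bytes:
    """Group messages destined for one topic-partition and gzip the batch,
    trading some CPU for a smaller network/storage footprint (a sketch only)."""
    payload = "\n".join(json.dumps(m) for m in messages).encode("utf-8")
    return gzip.compress(payload)

# Repetitive event data compresses well when batched together:
msgs = [{"id": i, "event": "page-view"} for i in range(100)]
batch = build_batch(msgs)
raw = "\n".join(json.dumps(m) for m in msgs).encode("utf-8")
print(len(batch) < len(raw))  # True: the batch is smaller than the raw messages
```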
1.2. Topics and Partitions
A Kafka topic is very similar to a database table or a folder in a filesystem. Topics are further broken down into a number of partitions.
For example, consider a topic named activity-log which has 3 partitions. When a source system sends messages to the activity-log topic, these messages (1-n) can be stored in any of the partitions based on load and various other factors.
message-1 will be stored in one partition only; similarly, message-2 will be stored in the same or another partition. No message is stored in multiple partitions of a given topic.
Please note that if there is a sequence across all messages, Kafka only guarantees the ordering of messages stored within a single partition. There is no ordering guarantee across messages stored in multiple partitions.
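A toy model in Python illustrates this: appending keyed messages to per-partition lists keeps the produced order within each partition, but not across partitions. The keys and the 3-partition topic are made up for the example, and the MD5-based partitioner is a stand-in for Kafka's real one:

```python
import hashlib
from collections import defaultdict

def partition_for(key: bytes, num_partitions: int) -> int:
    """Stand-in partitioner (Kafka itself uses murmur2)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# A toy "topic" with 3 partitions; each partition is an append-only list.
topic = defaultdict(list)
for seq in range(6):
    key = b"user-A" if seq % 2 == 0 else b"user-B"
    topic[partition_for(key, 3)].append((key, seq))

# Within each partition, sequence numbers appear in the order produced:
for partition, records in topic.items():
    seqs = [s for _, s in records]
    assert seqs == sorted(seqs)
```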
1.3. Streams
A stream is considered to be a single topic storing a certain class of data, regardless of the number of its partitions. When working with other systems, clients interact with this stream (e.g. activity-log) as either producers or consumers of those messages.
1.4. Brokers and Clusters
A single Kafka server is called a broker. The broker receives messages from producer clients, assigns and maintains their offsets, and stores the messages in the storage system.
It also services consumers, responding to fetch requests for partitions and responding with the messages that have been committed to disk.
With good hardware support, a single broker can easily handle thousands of partitions and millions of messages per second.
Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one broker will also function as the cluster controller which is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures.
- Kafka provides high throughput while handling multiple producers emitting data to a single topic or multiple topics. This makes Kafka well suited for processing bulk events/messages from front-end systems recording page views, mouse tracking, or user behavior.
- Kafka allows multiple consumers to read any single stream of messages without interfering with each other. Each message can be read any number of times because messages are durable.
- Durable messages also mean that consumers can work on historical data, though Kafka supports real-time processing as well.
It also means that if some consumers are offline for a while, they do not lose any data and receive it when they reconnect.
- Kafka is highly scalable, and brokers (nodes) can be added or removed at runtime. The cluster does not need to be stopped.
- Kafka provides excellent performance and is able to handle millions of records per second on supporting hardware and infrastructure.
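The non-interfering-consumers point above can be sketched in a few lines of Python: a partition behaves like an append-only log, and each consumer tracks its own offset into it. The `Consumer` class and `poll` method here are illustrative inventions, not the Kafka client API:

```python
# Minimal sketch: a partition is an append-only log; each consumer keeps
# its own offset, so readers never interfere with one another.
log = [f"message-{i}" for i in range(5)]  # durable, ordered records

class Consumer:
    def __init__(self):
        self.offset = 0  # position in the log, tracked per consumer

    def poll(self, max_records: int):
        records = log[self.offset:self.offset + max_records]
        self.offset += len(records)
        return records

fast, slow = Consumer(), Consumer()
assert fast.poll(5) == log          # reads everything immediately
assert slow.poll(2) == log[:2]      # lags behind, but loses nothing
assert slow.poll(10) == log[2:]     # catches up later, same messages
```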
This post gave a very high-level overview of what Kafka looks like. There are many more topics you will read about when going into finer details.
In the next post, we will learn to download and start a Kafka broker, and will run some beginner commands to get started.
Happy Learning !!
References: Apache Kafka Website