Adapting Flink Slot Sharing Group For Better Performance in Data Processing Pipeline

Nagabhushan SS
5 min read · Feb 1, 2021

This blog gives an overview of Apache Flink and describes how we adapted its Slot Sharing Group feature for better performance in a data processing pipeline. In our pipeline, the Apache Flink Kafka connector is used to pull data from Kafka, transform it, and push it back to another Kafka topic. The Flink cluster was deployed in job-cluster mode on Kubernetes.

Apache Flink Overview

Apache Flink is an open-source streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink is a top-level Apache project. It can execute both stream processing and batch processing, and it uses the Akka framework for its distributed communication.

Apache Flink uses a master-slave architecture and has three main distributed components: the Job Manager, the Task Manager, and the Job Client.

· The Job Client submits the Flink job to the Job Manager.

· The Job Manager, acting as the master, orchestrates jobs across the different Task Managers and manages resources.

· Task Managers are the actual worker nodes that perform computations on the data and update the Job Manager on their progress.

The Job Manager receives the job graph from the Job Client and converts it into an execution graph consisting of multiple tasks, which is a logical representation of the distributed execution. It converts all the operators and parallelizes them for execution on different Task Managers.

Task Managers have resource slots that are used for actual task execution. Once the Job Manager sends task information for execution, the Task Manager acknowledges it and regularly updates the Job Manager on the task execution status.

Task Slots

A Task Manager can be configured with a certain number of processing slots, which give it the ability to execute several tasks at the same time. A task slot defines a fixed slice of a Task Manager's resources.
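For reference, the number of slots per Task Manager is controlled by a single setting in flink-conf.yaml. A minimal sketch, with an illustrative value:

```yaml
# flink-conf.yaml: each Task Manager offers this many task slots;
# a value of 1 gives every task its own Task Manager process resources.
taskmanager.numberOfTaskSlots: 1
```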

Task slots in a Task Manager, each responsible for executing the task assigned to it

Data Processing Pipeline Workflow

Typically, most workflows in a data processing pipeline consist of the tasks below:

· Source: read data from a Kafka topic

· Map: transform the data through a Map operation

· Sink: pass the data to downstream components by writing it to another Kafka topic
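The three tasks above can be sketched with Flink's DataStream API. This is a minimal sketch assuming Flink 1.11 with the flink-connector-kafka dependency on the classpath; the topic names, broker address, and the trivial transformation are illustrative placeholders, not our actual job:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class PipelineJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "pipeline-consumer");   // placeholder

        env
            // Source: read records from the input Kafka topic
            .addSource(new FlinkKafkaConsumer<>("input-topic",
                    new SimpleStringSchema(), props))
            // Map: transform each record (placeholder transformation)
            .map(value -> value.toUpperCase())
            // Sink: write transformed records to another Kafka topic
            .addSink(new FlinkKafkaProducer<>("output-topic",
                    new SimpleStringSchema(), props));

        env.execute("data-processing-pipeline");
    }
}
```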

Workflow of Data Processing Pipeline

Each of the above tasks is represented in Flink as an "operator". By default, Flink allows tasks of different operators to be deployed into the same slot. In other words, on each Task Manager, data is read from Kafka, transformed, and written back to a different Kafka topic. In our data processing pipeline, the number of task slots was configured to 1 to avoid cross-thread synchronization issues.

All tasks deployed to the same task slot in a Task Manager

As you can see in the diagram above, all three operators are deployed to the same task slot and execute in the same Task Manager. Writing to Kafka is a resource-intensive operation since it includes network I/O time, and this was one of the factors degrading performance during data processing. With this topology, our performance test showed the data processing pipeline processing 10K events within 120 seconds.

Is There a Way to Improve Performance?

One obvious way to improve performance was to split messages across all Kafka partitions, and this was already taken care of by our processing pipeline. In spite of this, there was a business requirement to handle millions of messages per Kafka partition.

To overcome this performance issue, the SlotSharingGroup feature was introduced into the job, which allows greater control over operator deployment. Tasks that share the same slot sharing group can be executed in the same slot and thus share resources. Tasks with different slot sharing groups are executed in separate slots, which, with one slot per Task Manager, means separate Task Managers.
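In the DataStream API, the slot sharing group is set per operator, and operators inherit the group of their upstream operator unless overridden. A minimal sketch under the same assumptions as before (the group name "sink" is just a label we chose, and `stream`/`props` stand in for the source stream and Kafka properties shown earlier):

```java
// Source and Map stay in Flink's "default" slot sharing group, while the
// Kafka sink is placed in its own group so it lands in a separate slot
// (and, with one slot per Task Manager, on a separate Task Manager).
stream
    .map(value -> value.toUpperCase())          // stays in the "default" group
    .addSink(new FlinkKafkaProducer<>("output-topic",
            new SimpleStringSchema(), props))
    .slotSharingGroup("sink");                  // isolate the sink in its own slot
```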

Source and Map are deployed in one Task Manager while Sink is deployed in another

In the figure above, the Source and Map operations are executed in the same Task Manager while the Sink operation is executed in a different one. This way, even though the Sink is a resource-intensive operation, it no longer affects the Map transformation of messages read from the Kafka topic, allowing more messages to be consumed from Kafka; the Sink operation is now handled by another Task Manager. With this topology, our performance test showed the data processing pipeline processing 10K events within 60 seconds, an almost 100% performance improvement.

Can It Be Improved Further Through Task Parallelism?

Flink provides the capability of executing operators in parallel based on the "parallelism" configured for each operator. In our use case, the parallelism of the Sink operator was set to twice that of the Map operator.
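Parallelism is likewise set per operator via `setParallelism`. A sketch continuing the fragment above; the concrete value 4 is a placeholder, not our production setting:

```java
// Illustrative parallelism settings: the sink runs with twice as many
// parallel subtasks as the map operator.
int mapParallelism = 4;  // placeholder value

stream
    .map(value -> value.toUpperCase())
    .setParallelism(mapParallelism)
    .addSink(new FlinkKafkaProducer<>("output-topic",
            new SimpleStringSchema(), props))
    .slotSharingGroup("sink")
    .setParallelism(mapParallelism * 2);  // sink parallelism = 2x map
```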

Sink parallelism set to twice that of the Map operator

As you can see above, the parallelism of the Sink is twice that of the Map operator. This gave us a further performance improvement. Increasing it beyond twice the Map parallelism did not yield any additional improvement, since it loaded Flink with shuffling the sink data across network I/O.

References

https://ci.apache.org/projects/flink/flink-docs-release-1.11/internals/task_lifecycle.html
