How to use two Kerberos keytabs (for Kafka and Hadoop HDFS) from a Flink job on a Flink standalone cluster?


When Apache Flink jobs interact with secured enterprise data systems such as Apache Kafka (for streaming data) and Hadoop HDFS (for storage), managing more than one Kerberos identity can be complex. In this scenario each system is accessed with its own Kerberos principal and keytab. This article guides you through setting up Flink jobs on a Flink standalone cluster to use two Kerberos keytabs at the same time: one for Kafka and another for HDFS.

Understanding Kerberos and Keytabs

Kerberos is a network authentication protocol that uses tickets to let nodes communicating over a non-secure network prove their identity to one another securely. Keytabs are files that contain pairs of Kerberos principals and encrypted keys (derived from the password). They allow a principal to authenticate without human interaction, which makes them essential for services running within a cluster.

Scenario Overview

Assume you have a Flink job that reads data from a secure Kafka topic, processes it, and stores the output in a secure HDFS cluster. Both the Kafka and HDFS clusters are secured with Kerberos. Setting up this scenario involves two tasks:

Creating Keytabs: You should have one keytab file for Kafka and a separate one for HDFS.
Configuring Flink to use Keytabs: Configure the Flink cluster/application to authenticate using these keytabs.

Step-by-step Configuration

Step 1: Create and Distribute Keytabs

Ensure that you have separate keytab files for Kafka and HDFS, each containing the principals it needs. Copy both files securely to every Flink node, at the same path on each node, and make them readable only by the user that runs the Flink processes.

Example files:

kafka.keytab
hdfs.keytab
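Before starting the cluster, it can help to confirm on each node that both keytabs are actually in place. The following is a minimal sketch using only the Java standard library; the class name KeytabPreflight, the method findProblems, and the /path/to/... locations are illustrative assumptions, not part of Flink or Kerberos tooling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-flight check; names are illustrative only. */
final class KeytabPreflight {

    /** Returns one message per keytab that is missing, unreadable, or empty. */
    static List<String> findProblems(List<Path> keytabs) {
        List<String> problems = new ArrayList<>();
        for (Path keytab : keytabs) {
            if (!Files.isRegularFile(keytab)) {
                problems.add(keytab + ": not found");
            } else if (!Files.isReadable(keytab)) {
                problems.add(keytab + ": not readable by the current user");
            } else {
                try {
                    if (Files.size(keytab) == 0) {
                        problems.add(keytab + ": empty file");
                    }
                } catch (IOException e) {
                    problems.add(keytab + ": " + e.getMessage());
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Placeholder paths from this article; replace with the real locations.
        List<Path> keytabs = Arrays.asList(
                Paths.get("/path/to/kafka.keytab"),
                Paths.get("/path/to/hdfs.keytab"));
        findProblems(keytabs).forEach(System.err::println);
    }
}
```

This only checks that the files are present and readable. It does not verify that a keytab contains the expected principal; a tool such as klist is better suited for that.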

Step 2: Configure Flink

Modify the Flink configuration to include properties for both Kafka and HDFS. Flink configuration files are typically located in the $FLINK_HOME/conf directory.

Kafka Configuration

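Add the following properties to flink-conf.yaml or pass them via the command line. Flink reads a single keytab and principal from these options and uses that login for its Hadoop integration and for every JAAS context listed in security.kerberos.login.contexts, so the KafkaClient context must be listed for the Kafka connector to pick it up:

yaml

security.kerberos.login.keytab: /path/to/kafka.keytab
security.kerberos.login.principal: kafka@YOUR-REALM.COM
security.kerberos.login.contexts: KafkaClient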
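One commonly used way to give the Kafka client a keytab of its own, independent of the login configured in flink-conf.yaml, is the Kafka client property sasl.jaas.config, which carries an inline JAAS entry. The sketch below builds such properties with the standard library only. The class and method names are hypothetical, and the values SASL_PLAINTEXT and the service name kafka are assumptions that depend on how your brokers are set up.

```java
import java.util.Properties;

/** Hypothetical helper; class and method names are illustrative only. */
final class KafkaKerberosProps {

    /** Builds the inline JAAS entry that Kafka clients read from "sasl.jaas.config". */
    static String jaasConfig(String keytabPath, String principal) {
        return "com.sun.security.auth.module.Krb5LoginModule required "
                + "useKeyTab=true storeKey=true "
                + "keyTab=\"" + keytabPath + "\" "
                + "principal=\"" + principal + "\";";
    }

    /** Returns Kafka client properties for Kerberos (GSSAPI) authentication. */
    static Properties kafkaSecurityProps(String keytabPath, String principal) {
        Properties props = new Properties();
        // Assumption: brokers listen on SASL_PLAINTEXT; use SASL_SSL if they use TLS.
        props.setProperty("security.protocol", "SASL_PLAINTEXT");
        props.setProperty("sasl.mechanism", "GSSAPI");
        // Assumption: the brokers' service principal is kafka/<host>@REALM.
        props.setProperty("sasl.kerberos.service.name", "kafka");
        props.setProperty("sasl.jaas.config", jaasConfig(keytabPath, principal));
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values taken from this article.
        Properties props = kafkaSecurityProps("/path/to/kafka.keytab", "kafka@YOUR-REALM.COM");
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The resulting Properties object would be merged into the properties you pass to Flink's Kafka connector when building the source or sink. Because the JAAS entry travels with the client configuration, the keytab path must be valid on every TaskManager node.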

HDFS Configuration

Configure Hadoop's core-site.xml file, typically placed in $FLINK_HOME/conf or made available on the classpath:

xml

<property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
</property>
<property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
</property>
<property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>hdfs/_HOST@YOUR-REALM.COM</value>
</property>
<property>
    <name>dfs.namenode.keytab.file</name>
    <value>/path/to/hdfs.keytab</value>
</property>

Remember to replace _HOST with the actual hostname of the HDFS NameNode.
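To make the _HOST convention concrete, here is a small standard-library sketch of the substitution for a principal of the form service/_HOST@REALM. Hadoop ships its own utility for this (SecurityUtil.getServerPrincipal), so the class below is only an illustration. The names PrincipalPattern and resolve are hypothetical, and lowercasing the hostname is an assumption that mirrors common Kerberos practice.

```java
import java.util.Locale;

/** Hypothetical illustration of _HOST expansion; not Hadoop's implementation. */
final class PrincipalPattern {

    private static final String HOST_TOKEN = "_HOST";

    /**
     * Replaces the instance part "_HOST" of a service/instance@REALM principal
     * with the given hostname. Any other input is returned unchanged.
     */
    static String resolve(String principalPattern, String hostname) {
        int slash = principalPattern.indexOf('/');
        int at = principalPattern.indexOf('@');
        if (slash < 0 || at < slash) {
            return principalPattern; // not of the form service/instance@REALM
        }
        String instance = principalPattern.substring(slash + 1, at);
        if (!HOST_TOKEN.equals(instance)) {
            return principalPattern; // instance is already a concrete value
        }
        return principalPattern.substring(0, slash + 1)
                + hostname.toLowerCase(Locale.ROOT) // assumption: hostnames are lowercased
                + principalPattern.substring(at);
    }

    public static void main(String[] args) {
        // Example values only; nn1.example.com is a made-up NameNode hostname.
        System.out.println(resolve("hdfs/_HOST@YOUR-REALM.COM", "nn1.example.com"));
    }
}
```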

Step 3: Launching the Flink Job

When launching your Flink job, ensure that the environment variable HADOOP_CONF_DIR is set to the directory containing your core-site.xml and hdfs-site.xml:

bash

export HADOOP_CONF_DIR=$FLINK_HOME/conf

This makes sure that Flink can correctly configure its internal Hadoop filesystems to use Kerberos authentication against the HDFS cluster.
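A missing or misdirected HADOOP_CONF_DIR is easy to overlook, so a quick check before submitting the job can save debugging time. The sketch below uses only the Java standard library; the class name HadoopConfDirCheck and the method missingFiles are hypothetical helpers, not part of Flink or Hadoop.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical check that HADOOP_CONF_DIR points at the expected files. */
final class HadoopConfDirCheck {

    /** Returns a message for each problem found; an empty list means the directory looks complete. */
    static List<String> missingFiles(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        String dir = env.get("HADOOP_CONF_DIR");
        if (dir == null || dir.isEmpty()) {
            missing.add("HADOOP_CONF_DIR is not set");
            return missing;
        }
        for (String name : new String[] {"core-site.xml", "hdfs-site.xml"}) {
            Path file = Paths.get(dir, name);
            if (!Files.isRegularFile(file)) {
                missing.add(file.toString());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Checks the environment of the current process.
        missingFiles(System.getenv()).forEach(System.err::println);
    }
}
```

Passing the environment in as a map keeps the check easy to exercise with different values; in real use it would be called with System.getenv() from the same shell that launches the Flink processes.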

Key Configuration Summary

Here’s a quick reference table summarizing the configuration settings:

Component	Config File	Properties
Kafka	`flink-conf.yaml`	`security.kerberos.login.keytab`, `security.kerberos.login.principal`
HDFS	`core-site.xml`	`hadoop.security.authentication`, `dfs.namenode.keytab.file`, etc.

Conclusion

Using multiple Kerberos identities in a Flink standalone cluster requires careful configuration of both the Kafka and the HDFS security settings. When each component authenticates with the correct keytab, your Flink jobs can read from secured Kafka topics and write to secured HDFS within the same Kerberos-enabled environment. A correct Kerberos setup protects sensitive data and fits into secure corporate workflows.