How to use two Kerberos keytabs (for Kafka and Hadoop HDFS) from a Flink job on a Flink standalone cluster?
When Apache Flink jobs interact with secured enterprise data systems such as Apache Kafka (for streaming data) and Hadoop HDFS (for storage), managing more than one Kerberos identity can be complex. In the scenario covered here, each system is accessed with its own Kerberos keytab. This article walks through setting up Flink jobs on a Flink standalone cluster to use two Kerberos keytabs at the same time: one for Kafka and another for HDFS.
Understanding Kerberos and Keytabs
Kerberos is a ticket-based network authentication protocol that lets nodes communicating over an insecure network prove their identity to one another securely. A keytab is a file containing pairs of Kerberos principals and encrypted keys (derived from the principal's password). It lets a principal authenticate without human interaction, which makes keytabs essential for services running within a cluster.
Scenario Overview
Assume you have a Flink job that reads data from a secured Kafka topic, processes it, and writes the output to a secured HDFS cluster. Both the Kafka and HDFS clusters are secured with Kerberos. Setting up this scenario involves two main tasks:
- Creating Keytabs: You need separate keytab files, one for Kafka and one for HDFS.
- Configuring Flink to use Keytabs: Configure the Flink cluster/application to authenticate using these keytabs.
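To illustrate how a single JVM can hold two keytab-based identities, here is a minimal sketch of a programmatic JAAS configuration with one entry per keytab. It relies only on the JDK's JAAS classes and the standard Krb5LoginModule options. The class name TwoKeytabJaasConfig, the keytab paths, and the principals are hypothetical. KafkaClient is the context name Kafka clients look up by default, while HdfsClient is purely illustrative, since Hadoop clients typically log in through their own UserGroupInformation mechanism rather than a named JAAS context.

```java
import java.util.HashMap;
import java.util.Map;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.AppConfigurationEntry.LoginModuleControlFlag;
import javax.security.auth.login.Configuration;

/** Hypothetical helper: one JAAS entry per keytab, looked up by context name. */
final class TwoKeytabJaasConfig extends Configuration {
    private final Map<String, AppConfigurationEntry[]> entries = new HashMap<>();

    /** Registers a keytab-based Kerberos login under the given JAAS context name. */
    TwoKeytabJaasConfig add(String contextName, String keytabPath, String principal) {
        Map<String, String> options = new HashMap<>();
        options.put("useKeyTab", "true");    // authenticate from a keytab, not a password
        options.put("storeKey", "true");     // keep the key in the Subject's credentials
        options.put("doNotPrompt", "true");  // never fall back to interactive prompts
        options.put("keyTab", keytabPath);
        options.put("principal", principal);
        entries.put(contextName, new AppConfigurationEntry[] {
            new AppConfigurationEntry(
                "com.sun.security.auth.module.Krb5LoginModule",
                LoginModuleControlFlag.REQUIRED,
                options)
        });
        return this;
    }

    @Override
    public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
        return entries.get(name); // null tells JAAS the context is not configured
    }

    public static void main(String[] args) {
        // Paths and principals below are placeholders, not values from a real cluster.
        TwoKeytabJaasConfig config = new TwoKeytabJaasConfig()
            .add("KafkaClient", "/etc/security/keytabs/kafka.keytab", "flink-kafka@EXAMPLE.COM")
            .add("HdfsClient", "/etc/security/keytabs/hdfs.keytab", "flink-hdfs@EXAMPLE.COM");
        for (String context : new String[] {"KafkaClient", "HdfsClient"}) {
            AppConfigurationEntry entry = config.getAppConfigurationEntry(context)[0];
            System.out.println(context + " -> " + entry.getOptions().get("keyTab"));
        }
    }
}
```

The point of the sketch is only that each login context carries its own keytab and principal, so the two identities never have to share credentials.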
Step-by-step Configuration
Step 1: Create and Distribute Keytabs
Ensure that you have two separate keytab files, one for Kafka and one for HDFS, each containing the principals it needs. Copy both files securely to every Flink node, to the same path on each node, readable only by the user that runs Flink.
Example files:
- kafka.keytab
- hdfs.keytab
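As a rough pre-flight check after distributing the files, the following sketch tests whether a keytab is a readable regular file whose permissions are limited to the owner. KeytabCheck is a hypothetical helper, and the check assumes a POSIX file system (on other platforms the permission lookup throws UnsupportedOperationException).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical helper: sanity-checks keytab files on a Flink node. */
final class KeytabCheck {
    /** True if the keytab is a readable regular file with at most owner read/write set. */
    static boolean isOwnerOnly(Path keytab) throws IOException {
        if (!Files.isRegularFile(keytab) || !Files.isReadable(keytab)) {
            return false;
        }
        Set<PosixFilePermission> allowed =
            EnumSet.of(PosixFilePermission.OWNER_READ, PosixFilePermission.OWNER_WRITE);
        // Any group, other, or execute bit makes the file fail the check.
        return allowed.containsAll(Files.getPosixFilePermissions(keytab));
    }

    public static void main(String[] args) throws IOException {
        // Pass keytab paths as arguments, e.g. the two example files on this node.
        for (String arg : args) {
            System.out.println(arg + " owner-only: " + isOwnerOnly(Paths.get(arg)));
        }
    }
}
```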
Step 2: Configure Flink
Modify the Flink configuration to include properties for both Kafka and HDFS. Flink configuration files are typically located in the $FLINK_HOME/conf directory.
Kafka Configuration
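Add the Kerberos login properties security.kerberos.login.keytab and security.kerberos.login.principal to flink-conf.yaml, or pass them via the command line. Flink's security.kerberos.login.contexts setting (for example, Client,KafkaClient) controls which JAAS login contexts receive these credentials.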
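Flink's security.kerberos.login.keytab and security.kerberos.login.principal settings describe a single identity. One possible way to give the Kafka client a keytab of its own is therefore the Kafka client property sasl.jaas.config, which Kafka prefers over a static JAAS configuration. The sketch below builds such client properties using only java.util.Properties. The class name KafkaKerberosProps, the broker address, the keytab path, the principal, the SASL_PLAINTEXT protocol, and the service name kafka are all assumptions to adapt to your cluster.

```java
import java.util.Properties;

/** Hypothetical helper: Kafka client properties that carry their own keytab. */
final class KafkaKerberosProps {
    static Properties build(String bootstrapServers, String keytabPath, String principal) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Assumption: no TLS on the listener; use SASL_SSL if the brokers require it.
        props.setProperty("security.protocol", "SASL_PLAINTEXT");
        props.setProperty("sasl.mechanism", "GSSAPI");
        // Assumption: the brokers' Kerberos principal starts with "kafka/".
        props.setProperty("sasl.kerberos.service.name", "kafka");
        // Per-client JAAS entry: this keytab applies only to clients built from these properties.
        props.setProperty("sasl.jaas.config",
            "com.sun.security.auth.module.Krb5LoginModule required "
                + "useKeyTab=true storeKey=true "
                + "keyTab=\"" + keytabPath + "\" "
                + "principal=\"" + principal + "\";");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values, not taken from a real deployment.
        Properties props = build(
            "broker1.example.com:9092",
            "/etc/security/keytabs/kafka.keytab",
            "flink-kafka@EXAMPLE.COM");
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

In a Flink job, properties like these would be handed to the Kafka connector, for instance through the property setters on its source builder, leaving the keytab configured in flink-conf.yaml free for HDFS.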
HDFS Configuration
Configure Hadoop's core-site.xml file (for example, setting hadoop.security.authentication to kerberos). The file is typically placed in $FLINK_HOME/conf or made available on the classpath.
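Hadoop normally substitutes the _HOST placeholder in service principals with the fully qualified hostname at runtime. Replace it with the actual hostname of the HDFS NameNode only if that substitution does not resolve correctly in your environment.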
Step 3: Launching the Flink Job
When launching your Flink job, ensure that the environment variable HADOOP_CONF_DIR points to the directory containing your core-site.xml and hdfs-site.xml, for example by exporting it in the shell that starts the cluster processes and submits the job.
This makes sure that Flink can correctly configure its internal Hadoop filesystems to use Kerberos authentication against the HDFS cluster.
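A small check like the following can catch a missing or incomplete HADOOP_CONF_DIR before the job is submitted. It is a minimal sketch using only the JDK. HadoopConfDirCheck is a hypothetical helper, and it verifies only that the two files exist, not that their contents are valid.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: verifies that HADOOP_CONF_DIR holds the expected files. */
final class HadoopConfDirCheck {
    /** Returns the expected Hadoop config files that are missing from HADOOP_CONF_DIR. */
    static List<String> missingFiles(Map<String, String> env) {
        String dir = env.get("HADOOP_CONF_DIR");
        List<String> missing = new ArrayList<>();
        for (String name : new String[] {"core-site.xml", "hdfs-site.xml"}) {
            // An unset variable counts as every file being missing.
            if (dir == null || !Files.isRegularFile(Paths.get(dir, name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingFiles(System.getenv());
        if (missing.isEmpty()) {
            System.out.println("HADOOP_CONF_DIR looks complete");
        } else {
            System.out.println("Missing from HADOOP_CONF_DIR: " + missing);
        }
    }
}
```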
Key Configuration Summary
Here’s a quick reference table summarizing the configuration settings:
| Component | Config File | Properties |
| --- | --- | --- |
| Kafka | flink-conf.yaml | security.kerberos.login.keytab, security.kerberos.login.principal |
| HDFS | core-site.xml, hdfs-site.xml | hadoop.security.authentication, dfs.namenode.keytab.file, etc. |
Conclusion
Using multiple Kerberos identities on a Flink standalone cluster requires careful configuration of both the Kafka and HDFS security settings. When each component authenticates with the correct keytab file, your Flink jobs can securely read from and write to other Kerberos-protected systems, which protects sensitive data and fits into secured enterprise workflows.

