Kubernetes executor does not parallelize subDAG execution in Airflow

Introduction

In data engineering, Apache Airflow has emerged as a powerful platform for orchestrating complex workflows. Workflows are expressed as Directed Acyclic Graphs (DAGs), in which users define tasks and their dependencies. One of Airflow's key features is the ability to execute independent tasks in parallel, which improves efficiency and reduces overall execution time. However, in complex workflows that involve nested DAGs (subDAGs), the executor in use can limit that parallelism. This article examines how the KubernetesExecutor handles subDAGs in Apache Airflow and why it does not parallelize their execution.

Kubernetes Executor in Airflow

Apache Airflow supports several executors for running tasks, among them the KubernetesExecutor. This executor dynamically assigns tasks to separate pods in a Kubernetes cluster, so capacity scales with the cluster and resources are managed per task. Here’s how it typically works:

  1. Pod Creation: When a task is scheduled, a new pod is created in the Kubernetes cluster to execute it.
  2. Resource Allocation: Kubernetes dynamically allocates resources based on the pod specification.
  3. Isolation: Each task runs in its own pod, ensuring isolation and independent execution.
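
The steps above can be pictured with a small toy model. This is plain Python, not Airflow or Kubernetes code: `Pod`, `launch_pod`, and `run_tasks` are hypothetical names, a thread stands in for a pod, and resource allocation (step 2) is not modeled at all.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    """Stand-in for a Kubernetes pod that runs exactly one task (hypothetical)."""
    name: str
    task_id: str


def launch_pod(task_id: str) -> Pod:
    # Steps 1 and 3: every scheduled task gets its own dedicated pod.
    return Pod(name=f"pod-{task_id}", task_id=task_id)


def run_tasks(task_ids: list[str]) -> list[Pod]:
    # Independent tasks are submitted together, so their "pods" may run concurrently.
    # Assumes task_ids is non-empty.
    with ThreadPoolExecutor(max_workers=len(task_ids)) as pool:
        return list(pool.map(launch_pod, task_ids))


pods = run_tasks(["extract", "transform", "load"])
print([p.name for p in pods])  # ['pod-extract', 'pod-transform', 'pod-load']
```

The point of the sketch is only the one-to-one mapping between tasks and pods; the real executor delegates pod creation and scheduling to the Kubernetes API.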

The KubernetesExecutor is chosen for its ability to scale workflows according to the available resources. Now, let's explore why it doesn't parallelize subDAG executions.

SubDAGs in Airflow

A subDAG is essentially a DAG within a DAG, allowing hierarchical structuring of complex workflows. SubDAGs are defined using the SubDagOperator. They appear as a single task in the parent DAG, but they consist of multiple tasks when expanded.
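
The "single task in the parent, many tasks when expanded" idea can be sketched with a toy data structure. This is not the SubDagOperator API: `Task`, `subdag_tasks`, and `expand` are hypothetical names, and the dotted ids are only a display choice for this model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    # Non-empty only for a task that wraps a subDAG (hypothetical field).
    subdag_tasks: list[Task] = field(default_factory=list)


def expand(tasks: list[Task]) -> list[str]:
    """Flatten nested subDAG tasks into dotted 'parent.child' ids."""
    ids: list[str] = []
    for task in tasks:
        if task.subdag_tasks:
            ids.extend(f"{task.task_id}.{child}" for child in expand(task.subdag_tasks))
        else:
            ids.append(task.task_id)
    return ids


parent = [
    Task("start"),
    Task("load_tables", subdag_tasks=[Task("load_a"), Task("load_b"), Task("load_c")]),
    Task("finish"),
]

print(len(parent))     # 3: the parent DAG sees the subDAG as one task
print(expand(parent))  # ['start', 'load_tables.load_a', 'load_tables.load_b', 'load_tables.load_c', 'finish']
```

In the parent's view `load_tables` is a single node, which is why the executor treats it as one unit of work even though it contains three tasks.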

Under the KubernetesExecutor, this design has two consequences:

  • Resource Utilization: A subDAG that runs inside a single pod may underutilize the cluster, because its tasks do not spread across pods the way top-level tasks do and so do not fully leverage the distributed nature of Kubernetes.
  • Concurrency Limitations: By default, Airflow's SubDagOperator is synchronous. It waits until all subDAG tasks have completed before marking the subDAG itself as complete and letting the next downstream task in the main DAG start.
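
A rough calculation shows what these two points can cost. The sketch below assumes the subDAG's tasks are independent of one another and uses made-up durations in minutes; `sequential_minutes` and `parallel_minutes` are hypothetical helpers, not Airflow functions.

```python
import heapq


def sequential_minutes(durations):
    """Wall-clock time when the tasks run one after another in one worker."""
    return sum(durations)


def parallel_minutes(durations, max_pods):
    """Wall-clock time when up to max_pods independent tasks run at once.

    Assumes max_pods >= 1 and assigns each task to the earliest free pod.
    """
    if not durations:
        return 0
    # Each heap entry is the time at which one pod slot becomes free.
    slots = [0] * min(max_pods, len(durations))
    for duration in durations:
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + duration)
    return max(slots)


subdag = [30, 45, 60]  # illustrative task durations, in minutes

print(sequential_minutes(subdag))   # 135
print(parallel_minutes(subdag, 3))  # 60
print(parallel_minutes(subdag, 2))  # 90
```

With these numbers, running the subDAG's tasks one at a time takes more than twice as long as giving each task its own pod, and the downstream task in the main DAG waits for the whole duration either way.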
