
Cloud Composer Autoscaling:

Running Cloud Composer out of the box feels exceedingly costly. Half of the roughly $300/month price tag (in its "basic" configuration) comes from the nodes that run around the clock; the rest comes from the Cloud SQL instance and other background services.

The first question that comes to mind is how to use machines only when we need them, especially since serverless is the thing in the cloud right now. This article is the go-to source to get started with autoscaling.

Get your variables together

```bash
PROJECT=[provide your gcp project id]
COMPOSER_NAME=[provide your composer environment name]
COMPOSER_LOCATION=[provide the selected composer's location e.g. us-central1]
CLUSTER_ZONE=[provide the selected composer's zone e.g. us-central1-a]

# Look up the GKE cluster that backs the Composer environment
GKE_CLUSTER=$(gcloud composer environments describe \
${COMPOSER_NAME} \
--location ${COMPOSER_LOCATION} \
--format="value(config.gkeCluster)" \
--project ${PROJECT} | \
grep -o '[^/]*$')
```
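
A quick sanity check that the cluster lookup worked (an empty line means the describe call or the grep returned nothing):

```bash
# Should print something like "us-central1-<environment>-<hash>-gke"
echo "GKE cluster: ${GKE_CLUSTER}"
```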


Enable autoscaling
```bash
gcloud container clusters update ${GKE_CLUSTER} --enable-autoscaling \
--min-nodes 1 \
--max-nodes 10 \
--zone ${CLUSTER_ZONE} \
--node-pool=default-pool \
--project ${PROJECT}
```
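
To confirm the flag took effect, you can read the autoscaling settings back from the node pool:

```bash
# Shows enabled/minNodeCount/maxNodeCount for the default pool
gcloud container node-pools describe default-pool \
  --cluster ${GKE_CLUSTER} \
  --zone ${CLUSTER_ZONE} \
  --project ${PROJECT} \
  --format="yaml(autoscaling)"
```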

Enable kubectl in your terminal, i.e. get access to the Composer's GKE cluster

```bash
# Run this if your composer is in a private environment
COMP_IP=$(curl icanhazip.com) # if on cloud shell
# if you are on a Mac use this: COMP_IP=$(ipconfig getifaddr en0)

gcloud container clusters update $GKE_CLUSTER \
--zone=${CLUSTER_ZONE} \
--enable-master-authorized-networks \
--master-authorized-networks=${COMP_IP}/32

# This is the "vanilla" command to enable kubectl:
gcloud container clusters get-credentials ${GKE_CLUSTER} \
--zone ${CLUSTER_ZONE} --project ${PROJECT}
```
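
If the get-credentials call succeeded, kubectl can now talk to the cluster. A quick read-only check:

```bash
# Should list the GKE nodes backing the Composer environment; an auth error here
# usually means the authorized-networks or get-credentials step did not work
kubectl get nodes
```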

You are now ready to "patch" the Composer Kubernetes settings to make sure your nodes scale up and, very important, to stop the autoscaler from evicting pods that are still running tasks. Interestingly, scaling down is the real problem: the main issue with autoscaling Composer is shutting down running processes.

To do so, we update the Kubernetes configs. You will create three files and apply them to the Kubernetes environment with kubectl.

There are three things here:

  • Enabling scaling up
  • If you are running "small" machines: making sure the scheduler stays alive
  • Stopping the eviction of working pods

This enables scaling. The important metric is targetAverageValue, which determines when a new node is started. It requires further explanation, which you can read up on here. In brief: the lower the value, the sooner new nodes are provisioned. How low? That is exactly the point: low relative to how much of the given resource (in this case memory) your pods and their workloads eat up. The documentation puts it like this (in their example the target metric is CPU, which is measured in "m"):

```text
For example, if the current metric value is 200m, and the desired value is 100m, the number of replicas will be doubled, since 200.0 / 100.0 == 2.0. If the current value is instead 50m, we'll halve the number of replicas, since 50.0 / 100.0 == 0.5. We'll skip scaling if the ratio is sufficiently close to 1.0.
```
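
The same arithmetic applied to our memory-based metric, as an illustration with made-up numbers (the underlying rule is desiredReplicas = ceil(currentReplicas * currentValue / targetValue)):

```bash
# Illustration only: 3 workers currently average 2Gi of memory each and the
# HPA's targetAverageValue is 1Gi, so the desired replica count is ceil(3 * 2) = 6.
current_replicas=3
current_avg_mi=2048   # observed average memory per worker pod, in Mi
target_mi=1024        # targetAverageValue from the HPA config, in Mi
echo $(( (current_replicas * current_avg_mi + target_mi - 1) / target_mi ))  # prints 6
```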

Important when creating this config: you need to fill in the namespace value yourself, using AIRFLOW_WORKER_NS:

```bash
AIRFLOW_WORKER_NS=$(kubectl get namespaces \
| grep composer | cut -d ' ' -f1)
```

Save the following to a file named composer_airflow_worker_hpa.yaml:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: airflow-worker
  namespace: # USE ${AIRFLOW_WORKER_NS} VARIABLE DEFINED ABOVE
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: airflow-worker
  minReplicas: 1
  maxReplicas: 30 # Adjust this based on your use case
  metrics: # Adjust this based on your use case
    - type: Resource
      resource:
        name: memory
        targetAverageValue: 1Gi
```
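
If you would rather not paste the namespace in by hand, a small sed call can do it, assuming the placeholder line still starts with "  namespace:" exactly as above:

```bash
# Fill the Composer namespace into the HPA manifest
# (on macOS, use `sed -i ''` instead of `sed -i`)
sed -i "s/^  namespace:.*/  namespace: ${AIRFLOW_WORKER_NS}/" composer_airflow_worker_hpa.yaml
```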

Save the following to a file named composer_airflow_worker_patch.yaml:

```yaml
spec:
  template:
    spec:
      containers:
        - name: airflow-worker
          resources:
            requests:
              memory: 550Mi
            limits:
              memory: 1.5Gi
              cpu: 100m
        - name: gcs-syncd
          resources:
            requests:
              cpu: 10m
              memory: 50Mi
            limits:
              memory: 600Mi
```
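
Before patching, it can help to see what the worker Deployment currently requests and limits, so you know what you are about to change (read-only; the jsonpath expression is just one way to pull the fields out):

```bash
# Print the current resource requests/limits of each container in airflow-worker
kubectl get deployment airflow-worker -n ${AIRFLOW_WORKER_NS} \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'
```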

Save the following to a file named composer_airflow_scheduler_patch.yaml:

```yaml
spec:
  template:
    spec:
      containers:
        - name: airflow-scheduler
          resources:
            requests:
              memory: 550Mi
              cpu: 100m
            limits:
              memory: 1.5Gi
```

To apply the configs (run this from the directory where the configs are or adjust the path accordingly):

```bash
kubectl patch deployment airflow-worker -n ${AIRFLOW_WORKER_NS} --patch "$(cat composer_airflow_worker_patch.yaml)"
kubectl patch deployment airflow-scheduler -n ${AIRFLOW_WORKER_NS} --patch "$(cat composer_airflow_scheduler_patch.yaml)"

kubectl apply -f composer_airflow_worker_hpa.yaml
```
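
Afterwards, check that the HorizontalPodAutoscaler exists and is reporting a metric (the TARGETS column may show <unknown> for a minute or two until the first metrics arrive):

```bash
# Shows current vs. target memory and the replica range of the worker HPA
kubectl get hpa airflow-worker -n ${AIRFLOW_WORKER_NS}
```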

Then you need to remove Composer's own bottlenecks, i.e. the Airflow configuration limits. Below is my explanation of the variables.

[core]

  • max_active_runs_per_dag (CLI: core-max_active_runs_per_dag): How many runs of a single DAG may be active in parallel.
  • dag_concurrency (CLI: core-dag_concurrency): Really important! How many tasks may run concurrently across all runs of one specific DAG.
  • parallelism (CLI: core-parallelism): How many tasks may run concurrently across ALL running DAGs.
  • dagbag_import_timeout (CLI: core-dagbag_import_timeout): A utility setting that may keep the scheduler from being restarted unnecessarily: it gives the scheduler this much time to scan and import a DAG file. If you have "complicated" or large scripts and your scheduler is a little under-provisioned, it may well take some time to scan and import them.

[scheduler]

  • max_threads (CLI: scheduler-max_threads): Number of threads the scheduler is allowed to use. Twice the number of cores on the machine should be fine; it may allow the scheduler to work a tad faster.

The terms are explained in brief here; the actual Airflow documentation is here. Apply the new values like this:

```bash
gcloud composer environments update $COMPOSER_NAME \
--update-airflow-configs=core-max_active_runs_per_dag=10,core-dag_concurrency=50,core-parallelism=80,core-dagbag_import_timeout=120,scheduler-max_threads=4 \
--location $COMPOSER_LOCATION \
--project $PROJECT
```
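
The update takes a while, since Composer restarts parts of the environment. Once it finishes, you can read the overrides back to confirm they stuck:

```bash
# Print the Airflow config overrides currently set on the environment
gcloud composer environments describe ${COMPOSER_NAME} \
  --location ${COMPOSER_LOCATION} \
  --project ${PROJECT} \
  --format="yaml(config.softwareConfig.airflowConfigOverrides)"
```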

Kubernetes

You will spend most of your time in the Workloads view.
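
If you prefer the terminal to the console, the same picture is available via kubectl:

```bash
# Watch worker pods being added and removed as the HPA and node autoscaler react
kubectl get pods -n ${AIRFLOW_WORKER_NS} -w

# In a second terminal: watch the HPA's current vs. target memory and replica count
kubectl get hpa airflow-worker -n ${AIRFLOW_WORKER_NS} -w
```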