In Part 1 we diagnosed the root cause: AWS Glue’s ENI lifecycle extends 60+ minutes beyond job completion, every job category shared a single /26, and the launcher had no visibility into actual subnet state. The fix wasn’t more IPs. It was feedback.

Phase 1: Splitting by Blast Radius

We still had to deal with the shared-subnet problem before admission control could do anything useful. If one category of job poisoned the pool, admission control would just politely refuse to launch anything — SLA misses with extra steps.

So we split the /26 into three subnets. Not by capacity. By what could hurt what.

  • Subnet A (/26, us-east-1a) — critical batch (customer-analytics, product-metrics)
  • Subnet B (/26, us-east-1b) — non-critical batch (reporting, aggregations)
  • Subnet C (/27, us-east-1c) — backfill, ad hoc, manual reruns

Subnet C is intentionally small — ad-hoc failures are acceptable; the sensor here is a courtesy, not an SLA gate.

The logic we kept coming back to: if a backfill leaves 20 ENIs lingering in available for 40 minutes, we want that cleanup tail trapped. Contained in Subnet C. Not bleeding into the critical morning window in Subnet A.

We put each subnet in a different AZ, but we were careful not to oversell that to ourselves. A Glue job binds to one Connection, one subnet, one AZ. There’s no native multi-AZ for a single Glue job. All we bought was that different job categories live in different AZs. A us-east-1a event still takes out critical batch until we manually repoint the Connection.

The cutover was straightforward. Three new subnets, three new Glue Connections, DAGs repointed via Terraform. We staged it over a week — critical first (highest risk, do it while we’re paying attention), non-critical next, backfills last. Every migration window we waited for the subnet to drain to zero active ENIs before flipping traffic.
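The drain check during each migration window was scriptable. Below is a minimal sketch of what "wait for zero active ENIs" looks like, assuming boto3; the function names and polling cadence are illustrative, not lifted from our repo.

```python
import time

def count_enis(ec2, subnet_id):
    """Count every ENI in the subnet, in any state (in-use or available)."""
    resp = ec2.describe_network_interfaces(
        Filters=[{"Name": "subnet-id", "Values": [subnet_id]}]
    )
    return len(resp["NetworkInterfaces"])

def wait_for_drain(count_fn, poll_seconds=60, max_polls=120):
    """Poll until count_fn() reports zero ENIs; give up after max_polls."""
    for _ in range(max_polls):
        if count_fn() == 0:
            return True  # safe to repoint the Glue Connection
        time.sleep(poll_seconds)
    return False
```

Passing the count as a callable keeps the wait loop testable without AWS credentials.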

The 7AM failures dropped immediately. But we knew the story wasn’t over. Isolation stopped categories from starving each other. It didn’t stop one category from starving its own subnet during a bad cleanup day. We’d moved the failure mode, not killed it.

Phase 2: Teaching Airflow to Look Before It Leaps

The whiteboard insight — the launcher is blind — was still unaddressed. We’d given the launcher more rooms to work in, but it still walked into each one with its eyes closed.

So we wrote a sensor. Simple in shape: before each Glue job category runs, a custom Airflow sensor calls DescribeNetworkInterfaces filtered by subnet ID, counts ENIs across all states (in-use and available), and makes a call — launch now, or wait sixty seconds and ask again. Max wait: ten minutes, then fail loud.
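The counting half of that poke is a single EC2 call. A sketch of it, assuming boto3, with pagination in case a subnet ever holds more interfaces than one page returns:

```python
def count_subnet_enis(ec2, subnet_id):
    """Tally every ENI in the subnet. Both in-use and available ENIs
    consume an IP, so the sensor counts all states."""
    total = in_use = 0
    paginator = ec2.get_paginator("describe_network_interfaces")
    for page in paginator.paginate(
        Filters=[{"Name": "subnet-id", "Values": [subnet_id]}]
    ):
        for eni in page["NetworkInterfaces"]:
            total += 1
            if eni["Status"] == "in-use":
                in_use += 1
    return {"total": total, "in_use": in_use, "available": total - in_use}
```

The launch/wait decision itself sits on top of this count; the adaptive threshold that drives it is covered below.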

Why a Static Buffer Didn’t Work

Our first version reserved a flat 20 IPs as the “cleanup lag” buffer. It was wrong in both directions. On fast-cleanup days we were blocking launches that would have succeeded fine. On slow-cleanup days — the ones we actually cared about — 20 wasn’t enough, and jobs still failed.

We started logging cleanup lag per run and staring at the distribution. It wasn’t even close to stable. Some days it peaked at 15, some days at 45 (driven by ad-hoc manual reruns). A static number was always going to be wrong. We needed the sensor to learn.

Adaptive Threshold: Rolling Max via DynamoDB

The data path ended up clean because we resisted the urge to build anything fancy.

The table. glue-eni-observations in DynamoDB, partitioned by subnet_id, sorted by observed_at. Each record captures total_enis, in_use, available, and dag_run_id. TTL 30 days for debugging.

The writer. The sensor itself. On every poke, before making the admit/wait decision, it writes what it observed. No separate Lambda, no post-job hook — the observation always reflects exactly what the sensor saw at decision time.
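The write path is one put_item per poke. A sketch of its shape, assuming boto3; the attribute names follow the schema above, and the TTL math matches the 30-day retention. The expires_at name is ours for illustration.

```python
import time

def record_observation(table, subnet_id, counts, dag_run_id):
    """Persist what the sensor saw at decision time."""
    now = int(time.time())
    table.put_item(Item={
        "subnet_id": subnet_id,               # partition key
        "observed_at": now,                   # sort key
        "total_enis": counts["total"],
        "in_use": counts["in_use"],
        "available": counts["available"],
        "dag_run_id": dag_run_id,
        "expires_at": now + 30 * 24 * 3600,   # DynamoDB TTL attribute: 30 days
    })
```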

The reader. On each poke, the sensor reads back the last N observations for its subnet (N=5, tunable) and takes the max of total_enis. That becomes the buffer estimate:

# At decision time:
current_total = in_use + available            # every ENI in the subnet holds an IP
free_subnet_ips = subnet_size - current_total

# Buffer = worst pre-launch ENI count observed across the last N observations
cleanup_buffer = max(rolling_max_total_enis, MIN_BUFFER)

# Admit only if there's room for the new workers plus the cleanup-tail overhang
admit = free_subnet_ips >= workers_needed + cleanup_buffer

MIN_BUFFER = 10 is our floor — a guard for the first runs after rollout, before the table holds enough observations to trust.
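The read path, sketched under the same assumptions: query the last N observations for this subnet (newest first, which DynamoDB gives us by reading the sort key descending) and fold them into the buffer. The key_builder parameter stands in for boto3’s Key condition class so the logic stays testable.

```python
LOOKBACK = 5      # N most recent observations (tunable)
MIN_BUFFER = 10   # floor for cold-start runs

def recent_totals(table, key_builder, subnet_id, n=LOOKBACK):
    """Last n total_enis values for this subnet, newest first."""
    resp = table.query(
        KeyConditionExpression=key_builder("subnet_id").eq(subnet_id),
        ScanIndexForward=False,   # sort key descending: newest observations first
        Limit=n,
    )
    return [item["total_enis"] for item in resp["Items"]]

def buffer_estimate(totals):
    """Rolling max over the lookback window, floored at MIN_BUFFER."""
    return max([MIN_BUFFER, *totals])
```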

Why max, not average. We argued about this one. Average hides the worst case. If four of the last five runs saw 20 ENIs and one saw 45, average says 25, reality says 45. We’re sizing for the peak we’ve actually observed, not the one we’d prefer.

Why DynamoDB, not XCom. XCom is per-DAG-run. We needed observations shared across every DAG hitting the same subnet — critical batch, non-critical batch, and ad-hoc all contribute signal about Subnet A’s state. DynamoDB gave us one shared view keyed by subnet, which is what we actually needed.

Things That Bit Us (Or Nearly Did)

  • Worker slot exhaustion. Our first version used mode='poke', which holds the Airflow worker slot while sleeping. With multiple blocked sensors queued up, we nearly exhausted the Airflow worker pool before a single Glue job ran. Switched to mode='reschedule' — releases the slot while waiting. Non-obvious, critical.
  • EC2 API throttling. Four sensors polling every 60s is ~4 DescribeNetworkInterfaces calls/minute — well under the limit. But Describe* calls share a rate-limit bucket across the account. If other teams adopt this pattern, or someone runs heavy EC2 automation in parallel, we’ll hit throttling sooner than expected. Something to watch, not panic about yet.
  • Silent sensor waits. Early on we had a sensor hang indefinitely on a misconfigured subnet. Nothing alerted. The job just sat there looking “in progress.” We added the 10-minute timeout and explicit task failure after that. A sensor that waits forever is operationally invisible — worst failure mode in the system.
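The fail-loud contract from that last bullet is worth making concrete. In Airflow it lives in the operator settings (mode='reschedule', poke_interval, timeout); the sketch below shows the same contract as plain Python, with clock and sleep injectable so the timeout path is testable. The class and function names are ours, not Airflow’s.

```python
import time

class SensorTimeout(Exception):
    """Raised so a stuck wait fails loud instead of sitting 'in progress'."""

def run_sensor(poke, poke_interval=60, timeout=600,
               clock=time.monotonic, sleep=time.sleep):
    """Poll poke() every poke_interval seconds; raise after timeout seconds.
    In real Airflow, mode='reschedule' also releases the worker slot
    between pokes instead of holding it while sleeping."""
    deadline = clock() + timeout
    while clock() < deadline:
        if poke():
            return True
        sleep(poke_interval)
    raise SensorTimeout("subnet never freed up within the timeout window")
```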

Limitations We’ve Accepted

  • Subnet ceiling. Each subnet has its own cap. Subnet A’s 59 IPs will run out if we add critical jobs or bump worker counts. Six-month capacity review catches growth before we hit the wall.
  • Lookback tuning. Rolling max over 5 runs needs occasional adjustment. Too short and we miss cleanup patterns; too long and the buffer grows stale. Reviewed quarterly.
  • No Glue Auto Scaling. Glue 4 supports it, but the sensor can’t predict how many workers Glue will actually provision mid-run. Fixed worker counts give us deterministic IP math. Worth revisiting if workloads become more variable.

Observability

A Grafana dashboard plots ENI utilization against available capacity. The correlation between cleanup spikes and job launches is immediately visible — which, after months of staring at this problem without that view, still feels like cheating.

An alert fires at 80% utilization — above the admission threshold but below exhaustion. It fires when something bypasses the sensor: a manual run, an unexpected workload, a new service quietly grabbing ENIs in the same subnet. That alert has caught us twice — both times from developers running one-offs in the wrong subnet.
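The alert needs a signal source that doesn’t depend on the sensor, since a bypassing manual run is exactly the case where the sensor never looked. One way to feed it, sketched with boto3’s CloudWatch client; the namespace and metric name are illustrative, not necessarily what our dashboard uses.

```python
def publish_utilization(cloudwatch, subnet_id, total_enis, usable_ips):
    """Publish subnet ENI utilization as a custom CloudWatch metric,
    so an alarm at 80% can fire regardless of who launched the job."""
    pct = 100.0 * total_enis / usable_ips
    cloudwatch.put_metric_data(
        Namespace="GlueENI",
        MetricData=[{
            "MetricName": "SubnetUtilizationPct",
            "Dimensions": [{"Name": "SubnetId", "Value": subnet_id}],
            "Value": pct,
            "Unit": "Percent",
        }],
    )
    return pct
```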

Before vs After Architecture

(Architecture diagram: the shared /26 before; three isolated, sensor-gated subnets after.)

On-Call Runbook

When the 80% alert fires:

  1. Run the Step 1 ENI count query from the investigation above. Confirm actual state.
  2. Check for orphaned ENIs (status available with no active job). Once you’ve confirmed an ENI is detached from any active Glue job, it’s typically safe to delete with aws ec2 delete-network-interface. If you try before it has fully detached, expect an InvalidParameterValue error.
  3. Check for stuck job runs (STOPPING state that never resolves):
    aws glue get-job-runs --job-name <job-name> --max-results 10 \
      --query "JobRuns[*].[Id, JobRunState, StartedOn, CompletedOn]" --output table
    aws glue batch-stop-job-run --job-name <job-name> --job-run-ids <run-id>
    

    ENIs should detach within 5 minutes of a force-stop.

  4. If neither explains it: pause non-critical DAGs, admit only SLA-critical jobs manually, notify downstream stakeholders. Resume normal scheduling once utilization drops below 60%.
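Step 2 above is scriptable as a read-only helper: list candidate orphans so on-call can eyeball them before deleting anything. A sketch assuming boto3; the delete itself stays a deliberate manual command.

```python
def find_orphan_enis(ec2, subnet_id):
    """List ENIs sitting in 'available' in the subnet -- candidates for
    cleanup once confirmed detached from any active Glue job."""
    resp = ec2.describe_network_interfaces(
        Filters=[
            {"Name": "subnet-id", "Values": [subnet_id]},
            {"Name": "status", "Values": ["available"]},
        ]
    )
    return [
        (eni["NetworkInterfaceId"], eni.get("Description", ""))
        for eni in resp["NetworkInterfaces"]
    ]
```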

Hitting 80% regularly is a signal to expand subnet capacity before it becomes an incident.

Results

| Metric | Before | After (90 days) |
| --- | --- | --- |
| Job runs/week | 20 | 20 |
| IP exhaustion failures/week | ~5 | 0 |
| Weekly failure rate | ~25% | 0% |
| 7AM on-call pages (this cause) | ~3/week | 0 |
| Sensor-blocked launches (recovered in <10 min) | n/a | 12 / 360 runs (3.3%) |

All 12 blocked launches were cleanup-tail events — the sensor held the job, waited out the lag, and launched successfully within the 10-minute window. No manual intervention required.

The Lesson: Build the Observability AWS Didn’t

The IP exhaustion bug wasn’t about IPs. It was about an undocumented cleanup window in a managed service we couldn’t see into — invisible to CloudWatch, unmentioned in the Glue console. The fix wasn’t clever networking; it was building the signal AWS didn’t ship: ENI state, persisted, queryable, wired into admission. “Serverless” moves the servers out of sight, not out of your problem. Every underlying constraint — IPs, disk, memory, quotas — still binds you. AWS decides which to expose. When they don’t, silence isn’t absence of risk; it’s absence of signal. We’ve hit the same shape twice since: No space left on device during Spark checkpoints, and exit code 137 with no executor logs. Both required the same playbook — instrument what AWS didn’t, feed it back into admission or sizing, alert before the wall. We’ll write those up next.

If you take one thing from this post: assume every managed service has an observability gap, and treat finding it as part of operating the service.

Grafana Dashboard

Below is a sample Grafana dashboard we built to monitor ENI utilization across our three subnets. This dashboard surfaces the signals AWS doesn’t expose natively — real-time ENI counts, cleanup lag patterns, rolling max buffers, and the 80% alert threshold that triggers before exhaustion hits.

(Dashboard snapshot: Glue ENI — Subnet Utilization · glue-eni-observations · us-east-1 · last 6 hours · auto-refresh 60s · 1 alert firing)

  • Subnet A (/26, us-east-1a, critical batch): 71% utilization (42/59 IPs) · peak today 48 ENIs · cleanup tail avg 22 min · 2 × 10-worker jobs running
  • Subnet B (/26, us-east-1b, non-critical batch): 34% utilization (20/59 IPs) · peak today 24 ENIs · cleanup tail avg 19 min · 2 × 10-worker jobs running
  • Subnet C (/27, us-east-1c, backfill/ad hoc): 22% utilization (6/27 IPs) · peak today 10 ENIs · cleanup tail avg 31 min · 1 concurrent job
  • Sensor blocks today: 3 of 60 runs, all recovered · IP exhaustion failures: 0 (90-day streak)
  • Active alert: subnet-a utilization > 80% · fired 07:03, duration 4m 12s · cause: manual run in the wrong subnet
  • Subnet A time-series panel plots in-use, available (cleanup lag), rolling max (buffer), and the 80% alert threshold
  • Rolling max over the last 5 Subnet A observations: totals 42, 38, 35, 41, 33 → rolling max 42
