The specified subnet does not have enough free addresses to satisfy the request — The Fix
In Part 1 we diagnosed the root cause: AWS Glue’s ENI lifecycle extends 60+ minutes beyond job completion, every job category shared a single /26, and the launcher had no visibility into actual subnet state. The fix wasn’t more IPs. It was feedback.
Phase 1: Splitting by Blast Radius
We still had to deal with the shared-subnet problem before admission control could do anything useful. If one category of job poisoned the pool, admission control would just politely refuse to launch anything — SLA misses with extra steps.
So we split the /26 into three subnets. Not by capacity. By what could hurt what.
- Subnet A (/26, us-east-1a) — critical batch (customer-analytics, product-metrics)
- Subnet B (/26, us-east-1b) — non-critical batch (reporting, aggregations)
- Subnet C (/27, us-east-1c) — backfill, ad hoc, manual reruns
Subnet C is intentionally small — ad-hoc failures are acceptable, and the sensor there is a courtesy, not an SLA gate.
The logic we kept coming back to: if a backfill leaves 20 ENIs lingering in available for 40 minutes, we want that cleanup tail trapped. Contained in Subnet C. Not bleeding into the critical morning window in Subnet A.
We put each subnet in a different AZ, but we were careful not to oversell that to ourselves. A Glue job binds to one Connection, one subnet, one AZ. There’s no native multi-AZ for a single Glue job. All we bought was that different job categories live in different AZs. A us-east-1a event still takes out critical batch until we manually repoint the Connection.
The cutover was straightforward. Three new subnets, three new Glue Connections, DAGs repointed via Terraform. We staged it over a week — critical first (highest risk, do it while we’re paying attention), non-critical next, backfills last. Every migration window we waited for the subnet to drain to zero active ENIs before flipping traffic.
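The drain check we ran before each flip can be sketched as a small helper. This is an illustration, not our exact code; the function names and the split between the pure check and the API call are ours:

```python
def no_active_enis(interfaces: list) -> bool:
    """True when no ENI in the listing is still attached (status 'in-use')."""
    return all(eni.get("Status") != "in-use" for eni in interfaces)

def subnet_drained(subnet_id: str) -> bool:
    """Check whether a subnet has zero in-use ENIs before flipping traffic."""
    import boto3  # imported locally so no_active_enis stays testable offline
    ec2 = boto3.client("ec2")
    resp = ec2.describe_network_interfaces(
        Filters=[{"Name": "subnet-id", "Values": [subnet_id]}]
    )
    return no_active_enis(resp["NetworkInterfaces"])
```

We polled this until it returned true, then repointed the Glue Connection. Note that `available` ENIs may still be present at that point; only attached ones block the cutover.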
The 7AM failures dropped immediately. But we knew the story wasn’t over. Isolation stopped categories from starving each other. It didn’t stop one category from starving its own subnet during a bad cleanup day. We’d moved the failure mode, not killed it.
Phase 2: Teaching Airflow to Look Before It Leaps
The whiteboard insight — the launcher is blind — was still unaddressed. We’d given the launcher more rooms to work in, but it still walked into each one with its eyes closed.
So we wrote a sensor. Simple in shape: before each Glue job category runs, a custom Airflow sensor calls DescribeNetworkInterfaces filtered by subnet ID, counts ENIs across all states (in-use and available), and makes a call — launch now, or wait sixty seconds and ask again. Max wait: ten minutes, then fail loud.
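In shape, the check reduces to a count plus an inequality. A minimal sketch, assuming the subnet's usable IP count and per-job worker counts are known (function and parameter names are ours, not the production code):

```python
def count_subnet_enis(subnet_id: str) -> int:
    """Total ENIs in a subnet across all states, via DescribeNetworkInterfaces."""
    import boto3  # imported locally so the decision logic below is testable offline
    ec2 = boto3.client("ec2")
    total = 0
    for page in ec2.get_paginator("describe_network_interfaces").paginate(
        Filters=[{"Name": "subnet-id", "Values": [subnet_id]}]
    ):
        total += len(page["NetworkInterfaces"])
    return total

def should_launch(subnet_usable_ips: int, total_enis: int,
                  workers_needed: int, buffer: int) -> bool:
    """Admit only if free IPs cover the new workers plus a cleanup buffer."""
    free_ips = subnet_usable_ips - total_enis
    return free_ips >= workers_needed + buffer
```

In the DAG this lives inside a custom sensor whose `poke()` returns `should_launch(...)`, with `poke_interval=60` and `timeout=600` giving the sixty-second retry and ten-minute fail-loud behavior.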
Why a Static Buffer Didn’t Work
Our first version reserved a flat 20 IPs as the “cleanup lag” buffer. It was wrong in both directions. On fast-cleanup days we were blocking launches that would have succeeded fine. On slow-cleanup days — the ones we actually cared about — 20 wasn’t enough, and jobs still failed.
We started logging cleanup lag per run and staring at the distribution. It wasn’t even close to stable. Some days it peaked at 15, some days at 45 (driven by ad-hoc manual re-runs). A static number was always going to be wrong. We needed the sensor to learn.
Adaptive Threshold: Rolling Max via DynamoDB
The data path ended up clean because we resisted the urge to build anything fancy.
The table. glue-eni-observations in DynamoDB, partitioned by subnet_id, sorted by observed_at. Each record captures total_enis, in_use, available, and dag_run_id. TTL 30 days for debugging.
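Concretely, the table definition might look like the spec below. The key schema and 30-day TTL come from the text; the billing mode and the `expires_at` TTL attribute name are our choices:

```python
# Sketch of the glue-eni-observations table definition.
TABLE_SPEC = dict(
    TableName="glue-eni-observations",
    KeySchema=[
        {"AttributeName": "subnet_id", "KeyType": "HASH"},     # partition key
        {"AttributeName": "observed_at", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "subnet_id", "AttributeType": "S"},
        {"AttributeName": "observed_at", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",  # our choice; load is a few writes per poke
)

def create_observations_table():
    import boto3
    client = boto3.client("dynamodb")
    client.create_table(**TABLE_SPEC)
    # TTL is enabled separately, keyed on an epoch-seconds expires_at attribute.
    client.update_time_to_live(
        TableName=TABLE_SPEC["TableName"],
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
    )
```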
The writer. The sensor itself. On every poke, before making the admit/wait decision, it writes what it observed. No separate Lambda, no post-job hook — the observation always reflects exactly what the sensor saw at decision time.
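The write path is a single `put_item`. A sketch, where the ISO-8601 timestamp sort key and the `expires_at` TTL attribute are our conventions:

```python
import time
from datetime import datetime, timezone

TTL_DAYS = 30  # observations kept 30 days for debugging

def ttl_epoch(now_epoch: int, days: int = TTL_DAYS) -> int:
    """Epoch-seconds expiry for the DynamoDB TTL attribute, `days` from now."""
    return now_epoch + days * 86400

def record_observation(subnet_id, total, in_use, available, dag_run_id):
    """Persist exactly what the sensor saw at decision time."""
    import boto3  # imported locally so ttl_epoch stays testable offline
    table = boto3.resource("dynamodb").Table("glue-eni-observations")
    table.put_item(Item={
        "subnet_id": subnet_id,
        "observed_at": datetime.now(timezone.utc).isoformat(),  # sort key
        "total_enis": total,
        "in_use": in_use,
        "available": available,
        "dag_run_id": dag_run_id,
        "expires_at": ttl_epoch(int(time.time())),
    })
```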
The reader. On each poke, the sensor reads back the last N observations for its subnet (N=5, tunable) and takes the max of the available-ENI counts — the worst cleanup tail recently observed. That becomes the buffer estimate:
```
# At decision time:
current_total = count all ENIs in subnet (in-use + available)
free_subnet_ips = subnet_size - current_total

# Buffer = worst cleanup tail observed across last N pre-launch observations
cleanup_buffer = max(rolling_max_available_enis, MIN_BUFFER)

# Admit if there's room for new workers + expected cleanup tail overhang
admit if free_subnet_ips >= workers_needed + cleanup_buffer
```
MIN_BUFFER = 10 is our floor — a guard for the first runs after rollout, before the table has enough history.
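In code, the read side reduces to one query and a max with a floor. A sketch with helper names of our choosing:

```python
MIN_BUFFER = 10  # floor, guards the first runs when history is thin

def last_n_available(subnet_id: str, n: int = 5) -> list:
    """Fetch the `available` counts from the newest n observations."""
    import boto3  # imported locally so cleanup_buffer stays testable offline
    from boto3.dynamodb.conditions import Key
    table = boto3.resource("dynamodb").Table("glue-eni-observations")
    resp = table.query(
        KeyConditionExpression=Key("subnet_id").eq(subnet_id),
        ScanIndexForward=False,  # newest first by observed_at sort key
        Limit=n,
    )
    return [int(item["available"]) for item in resp["Items"]]

def cleanup_buffer(recent_available: list, floor: int = MIN_BUFFER) -> int:
    """Rolling max of observed cleanup tails, never below the floor."""
    return max(max(recent_available, default=0), floor)
```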
Why max, not average. We argued about this one. Average hides the worst case. If four of the last five runs saw 20 ENIs and one saw 45, average says 25, reality says 45. We’re sizing for the peak we’ve actually observed, not the one we’d prefer.
Why DynamoDB, not XCom. XCom is per-DAG-run. We needed observations shared across every DAG hitting the same subnet — critical batch, non-critical batch, and ad-hoc all contribute signal about Subnet A’s state. DynamoDB gave us one shared view keyed by subnet, which is what we actually needed.
Things That Bit Us (Or Nearly Did)
- Worker slot exhaustion. Our first version used `mode='poke'`, which holds the Airflow worker slot while sleeping. With multiple blocked sensors queued up, we nearly exhausted the Airflow worker pool before a single Glue job ran. Switched to `mode='reschedule'` — releases the slot while waiting. Non-obvious, critical.
- EC2 API throttling. Four sensors polling every 60s is ~4 `DescribeNetworkInterfaces` calls/minute — well under the limit. But `Describe*` calls share a rate-limit bucket across the account. If other teams adopt this pattern, or someone runs heavy EC2 automation in parallel, we’ll hit throttling sooner than expected. Something to watch, not panic about yet.
- Silent sensor waits. Early on we had a sensor hang indefinitely on a misconfigured subnet. Nothing alerted. The job just sat there looking “in progress.” We added the 10-minute timeout and explicit task failure after that. A sensor that waits forever is operationally invisible — the worst failure mode in the system.
Limitations We’ve Accepted
- Subnet ceiling. Each subnet has its own cap. Subnet A’s 59 IPs will run out if we add critical jobs or bump worker counts. Six-month capacity review catches growth before we hit the wall.
- Lookback tuning. Rolling max over 5 runs needs occasional adjustment. Too short and we miss cleanup patterns; too long and the buffer grows stale. Reviewed quarterly.
- No Glue Auto Scaling. Glue 4 supports it, but the sensor can’t predict how many workers Glue will actually provision mid-run. Fixed worker counts give us deterministic IP math. Worth revisiting if workloads become more variable.
Observability
A Grafana dashboard plots ENI utilization against available capacity. The correlation between cleanup spikes and job launches is immediately visible — which, after months of staring at this problem without that view, still feels like cheating.
An alert fires at 80% utilization — above the admission threshold but below exhaustion. It fires when something bypasses the sensor: a manual run, an unexpected workload, a new service quietly grabbing ENIs in the same subnet. That alert has caught us twice — both times from developers running one-offs in the wrong subnet.
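The 80% alert rides on a custom metric. A minimal publisher sketch; the `Custom/GlueENI` namespace and metric name are illustrative, not our exact production values:

```python
def subnet_utilization(total_enis: int, usable_ips: int) -> float:
    """Percent of a subnet's usable IPs currently held by ENIs."""
    return 100.0 * total_enis / usable_ips

def publish_utilization(subnet_id: str, total_enis: int, usable_ips: int = 59):
    """Push one utilization sample to CloudWatch (59 = usable IPs in a /26)."""
    import boto3  # imported locally so subnet_utilization stays testable offline
    cw = boto3.client("cloudwatch")
    cw.put_metric_data(
        Namespace="Custom/GlueENI",
        MetricData=[{
            "MetricName": "SubnetENIUtilization",
            "Dimensions": [{"Name": "SubnetId", "Value": subnet_id}],
            "Value": subnet_utilization(total_enis, usable_ips),
            "Unit": "Percent",
        }],
    )
```

The Grafana alert (or a CloudWatch alarm) then triggers on this metric crossing 80, independent of the sensor's own admission threshold.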
Before vs After Architecture
Before: every job category shared one /26 and the launcher fired blind. After: three subnets split by blast radius, with sensor-gated admission and an adaptive cleanup buffer in front of every launch.
On-Call Runbook
When the 80% alert fires:
- Run the Step 1 ENI count query from the investigation in Part 1. Confirm actual state.
- Check for orphaned ENIs (`available` with no active job). These are typically safe to remove with `aws ec2 delete-network-interface` once you’ve confirmed they’re detached from any active Glue job; if you delete before the ENI fully detaches, you’ll hit an `InvalidParameterValue` error.
- Check for stuck job runs (a `STOPPING` state that never resolves):

  ```
  aws glue get-job-runs --job-name <job-name> --max-results 10 \
    --query "JobRuns[*].[Id, JobRunState, StartedOn, CompletedOn]" --output table
  aws glue batch-stop-job-run --job-name <job-name> --job-run-ids <run-id>
  ```

  ENIs should detach within 5 minutes of a force-stop.
- If neither explains it: pause non-critical DAGs, admit only SLA-critical jobs manually, notify downstream stakeholders. Resume normal scheduling once utilization drops below 60%.
Hitting 80% regularly is a signal to expand subnet capacity before it becomes an incident.
Results
| Metric | Before | After (90 days) |
|---|---|---|
| Job runs/week | 20 | 20 |
| IP exhaustion failures/week | ~5 | 0 |
| Weekly failure rate | ~25% | 0% |
| 7AM on-call pages (this cause) | ~3/week | 0 |
| Sensor-blocked launches (recovered in <10 min) | n/a | 12 / 360 runs (3.3%) |
All 12 blocked launches were cleanup-tail events — the sensor held the job, waited out the lag, and launched successfully within the 10-minute window. No manual intervention required.
The Lesson: Build the Observability AWS Didn’t
The IP exhaustion bug wasn’t about IPs. It was about an undocumented cleanup window in a managed service we couldn’t see into — invisible to CloudWatch, unmentioned in the Glue console. The fix wasn’t clever networking; it was building the signal AWS didn’t ship: ENI state, persisted, queryable, wired into admission.

“Serverless” moves the servers out of sight, not out of your problem. Every underlying constraint — IPs, disk, memory, quotas — still binds you. AWS decides which to expose. When they don’t, silence isn’t absence of risk; it’s absence of signal.

We’ve hit the same shape twice since: No space left on device during Spark checkpoints, and exit code 137 with no executor logs. Both required the same playbook — instrument what AWS didn’t, feed it back into admission or sizing, alert before the wall. We’ll write those up next.
If you take one thing from this post: assume every managed service has an observability gap, and treat finding it as part of operating the service.
Grafana Dashboard
Below is sample data from the Grafana dashboard we built to monitor ENI utilization across our three subnets. The dashboard surfaces the signals AWS doesn’t expose natively: real-time ENI counts, cleanup lag patterns, rolling max buffers, and the 80% alert threshold that fires before exhaustion hits. The table shows five recent pre-launch observations for one subnet and the rolling max derived from them.
| run | total ENIs | in-use |
|---|---|---|
| dag_run_20260412_0703 | 42 | 30 |
| dag_run_20260412_0600 | 38 | 20 |
| dag_run_20260412_0500 | 35 | 20 |
| dag_run_20260411_1900 | 41 | 20 |
| dag_run_20260411_0700 | 33 | 20 |
| rolling max (total ENIs) → | 42 | |
References
- AWS Knowledge Center: Resolve “subnet does not have enough free addresses” in Glue
- AWS Glue does not clean up network interfaces
- AWS Glue Job Not Releasing ENIs After Completion — 1,200+ accumulated ENIs from the same root cause
- Glue Connections Running Out of IPs
- How to delete a network interface associated with a VPC
- Unable to Delete Stuck ENI in in-use state