Look Ma, No Servers!
Remember yelling “Look ma, no hands!” right before eating pavement? That’s our AWS Glue story. We adopted serverless expecting less infrastructure work. Ins...
Field notes for engineers who build and run production systems. Each post starts with a real incident and works through to the fix. No theory without a production problem behind it.
// investigations
The 7AM Ritual: Our daily batch consisted of 2 critical Glue jobs per weekday and 2 non-critical Glue jobs every 2 hours. Every morning at 7:00 AM, our batch...
In Part 1 we diagnosed the root cause: AWS Glue’s ENI lifecycle extends 60+ minutes beyond job completion, every job category shared a single /26, and the la...
The 48-Minute Silent Failure: Our product Glue job looked healthy for 48 minutes and 59 seconds. Progress metrics moving. No anomalies in the console. Then S...
Where We Left Off: Part 03 stopped the crash. S3 Shuffle moved spill off local disk, and the job stopped dying at 48 minutes. But it also gave us something t...