beyond5nines

Field notes for engineers who build and run production systems. Each post starts with a real incident and works through to the fix. No theory without a production problem behind it.

5 articles
4 topics
1 series

Series

Look Ma, No Servers!

Look Ma, No Servers!

Remember yelling “Look ma, no hands!” right before eating pavement? That’s our AWS Glue story. We adopted serverless expecting less infrastructure work. Ins...

No Space Left on Device — Trap

The 48-Minute Silent Failure: Our product Glue job looked healthy for 48 minutes and 59 seconds. Progress metrics moving. No anomalies in the console. Then S...

No Space Left on Device — Fix

Where We Left Off: Part 03 stopped the crash. S3 Shuffle moved spill off local disk, and the job stopped dying at 48 minutes. But it also gave us something t...

Topics