Elevator Pitch
- A buggy internal cleanup task, triggered by an ambiguous pending_delete API query, unintentionally withdrew BYOIP prefixes from BGP and deleted them, making some customer services unreachable for over six hours.
Key Takeaways
- The outage (6h 7m) affected a subset of BYOIP customers when ~1,100 prefixes were withdrawn; it was caused by a Cloudflare configuration/code change, not an attack.
- A task passed pending_delete with no value, so the API returned all BYOIP prefixes; the cleanup then deleted prefixes and dependent objects (including service bindings), complicating recovery.
- Cloudflare’s remediation focuses on API schema standardization, separating configured vs operational state with snapshot/health-mediated rollouts, and circuit breakers for large/rapid withdrawal actions.
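The circuit-breaker idea in the last takeaway can be illustrated with a minimal sketch. Everything named here (check_withdrawal, WithdrawalBlocked, the 1% threshold) is hypothetical and chosen for illustration; it is not Cloudflare's implementation, and it covers only the size of a withdrawal batch, not its rate.

```python
class WithdrawalBlocked(Exception):
    """Raised when a batch of withdrawals exceeds the allowed share."""


def check_withdrawal(total_prefixes: int, to_withdraw: int,
                     max_fraction: float = 0.01) -> None:
    """Refuse a batch that would withdraw too large a share of prefixes.

    max_fraction is an assumed threshold, not a value from the source.
    """
    if total_prefixes <= 0:
        raise ValueError("total_prefixes must be positive")
    fraction = to_withdraw / total_prefixes
    if fraction > max_fraction:
        raise WithdrawalBlocked(
            f"refusing to withdraw {to_withdraw} of {total_prefixes} "
            f"prefixes ({fraction:.1%} exceeds limit {max_fraction:.1%})"
        )


if __name__ == "__main__":
    check_withdrawal(6500, 10)  # small batch: allowed
    try:
        check_withdrawal(6500, 1100)  # the scale seen in the incident
    except WithdrawalBlocked as exc:
        print(exc)
```

With a guard like this, a batch on the scale of the incident (1,100 of 6,500 prefixes) would stop and wait for human review instead of proceeding automatically.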
Most Memorable Quotes
- “The issue was not caused, directly or indirectly, by a cyberattack or malicious activity of any kind.”
- “During the incident, 1,100 prefixes out of the total 6,500 were withdrawn from 17:56 to 18:46 UTC.”
- “Because the client is passing pending_delete with no value… the API server interprets this as a request for all BYOIP prefixes instead of just those prefixes that were supposed to be removed.”
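The ambiguity described in the last quote can be sketched as follows. This is a hypothetical reconstruction, not the actual API code: the handler names, the sample prefixes, and the strict validation rule are all assumptions. The lenient handler treats a pending_delete flag with no value as "no filter", while the strict one rejects it.

```python
from urllib.parse import parse_qs

# Hypothetical sample data using documentation address ranges.
PREFIXES = [
    {"cidr": "192.0.2.0/24", "pending_delete": False},
    {"cidr": "198.51.100.0/24", "pending_delete": True},
    {"cidr": "203.0.113.0/24", "pending_delete": False},
]


def list_prefixes_lenient(query: str) -> list[str]:
    """Ambiguous: a flag with an empty value silently means 'no filter'."""
    params = parse_qs(query, keep_blank_values=True)
    value = params.get("pending_delete", [""])[0]
    if value != "":
        return [p["cidr"] for p in PREFIXES if p["pending_delete"]]
    return [p["cidr"] for p in PREFIXES]  # falls through to everything


def list_prefixes_strict(query: str) -> list[str]:
    """Assumed fix: the flag must be absent or exactly 'true'/'false'."""
    params = parse_qs(query, keep_blank_values=True)
    if "pending_delete" not in params:
        return [p["cidr"] for p in PREFIXES]
    value = params["pending_delete"][0]
    if value not in ("true", "false"):
        raise ValueError(f"invalid pending_delete value: {value!r}")
    want = value == "true"
    return [p["cidr"] for p in PREFIXES if p["pending_delete"] == want]


if __name__ == "__main__":
    print(list_prefixes_lenient("pending_delete"))      # all three prefixes
    print(list_prefixes_strict("pending_delete=true"))  # only the flagged one
```

Rejecting a present-but-empty flag turns a silent "return everything" into an explicit error that the calling task has to handle.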
Source URL • Original: 3242 words • Summary: 187 words