Elevator Pitch

  • A buggy internal cleanup task, triggered by an ambiguous pending_delete API query, unintentionally withdrew BYOIP prefixes from BGP and deleted them, leaving some customer services unreachable for more than six hours.

Key Takeaways

  • The outage lasted 6h 7m and affected a subset of BYOIP customers after ~1,100 of 6,500 prefixes were withdrawn; the cause was a Cloudflare configuration/code change, not an attack.
  • A task passed pending_delete with no value, so the API returned all BYOIP prefixes; the cleanup then deleted prefixes and dependent objects (including service bindings), complicating recovery.
  • Cloudflare’s remediation has three parts: standardizing the API schema, separating configured state from operational state with snapshot-based, health-mediated rollouts, and adding circuit breakers that stop large or rapid withdrawal actions.
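
The circuit-breaker idea in the last takeaway can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Cloudflare's implementation: the class name, the exception, and the 1% threshold are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


class WithdrawalBlocked(Exception):
    """Raised when a batch would withdraw too large a share of prefixes."""


@dataclass
class WithdrawalBreaker:
    # Hypothetical guard: names and default threshold are assumptions.
    total_prefixes: int
    max_percent: int = 1  # largest share of all prefixes allowed per batch

    def limit(self) -> int:
        # Integer arithmetic avoids floating-point rounding surprises.
        return self.total_prefixes * self.max_percent // 100

    def check(self, batch: list[str]) -> list[str]:
        """Return the batch unchanged if it is small enough, else refuse."""
        if len(batch) > self.limit():
            raise WithdrawalBlocked(
                f"refusing to withdraw {len(batch)} prefixes; "
                f"limit is {self.limit()} of {self.total_prefixes}"
            )
        return batch
```

With 6,500 prefixes and a 1% limit, a batch of 1,100 withdrawals would be refused and left for a human to review, while a batch of ten would pass through.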

Most Memorable Quotes

  • “The issue was not caused, directly or indirectly, by a cyberattack or malicious activity of any kind.”
  • “During the incident, 1,100 prefixes out of the total 6,500 were withdrawn from 17:56 to 18:46 UTC.”
  • “Because the client is passing pending_delete with no value… the API server interprets this as a request for all BYOIP prefixes instead of just those prefixes that were supposed to be removed.”
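
The ambiguity described in the last quote can be illustrated with a small sketch. It is not the actual API code: the in-memory prefix list, both function names, and the choice to treat an empty value as "no filter" are assumptions made to show how a valueless flag can widen a query, and how strict validation would reject it.

```python
from urllib.parse import parse_qs

# Hypothetical data using documentation address ranges.
PREFIXES = [
    {"cidr": "192.0.2.0/24", "pending_delete": False},
    {"cidr": "198.51.100.0/24", "pending_delete": True},
    {"cidr": "203.0.113.0/24", "pending_delete": False},
]


def list_prefixes_lenient(query: str) -> list[dict]:
    """Assumed buggy behavior: an empty flag value means 'no filter'."""
    params = parse_qs(query, keep_blank_values=True)
    value = params.get("pending_delete", [""])[0]
    if value == "":
        return list(PREFIXES)  # a valueless flag silently returns everything
    want = value == "true"
    return [p for p in PREFIXES if p["pending_delete"] is want]


def list_prefixes_strict(query: str) -> list[dict]:
    """Stricter variant: a flag that is present must be 'true' or 'false'."""
    params = parse_qs(query, keep_blank_values=True)
    if "pending_delete" not in params:
        return list(PREFIXES)
    value = params["pending_delete"][0]
    if value not in ("true", "false"):
        raise ValueError(f"pending_delete must be true or false, got {value!r}")
    want = value == "true"
    return [p for p in PREFIXES if p["pending_delete"] is want]
```

Calling the lenient version with the query string `pending_delete` returns every prefix, so a cleanup task consuming that result would treat all of them as deletable. The strict version raises an error for the same input instead of guessing at the caller's intent.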

Original: 3,242 words. Summary: 187 words.