Ok. You can flap it slower and less frequently. The RFC you mentioned talks about timers on the order of a minute or so. So I would say advertise the new route for 1 minute and unconditionally restore for 10. Then only after that advertise for 2 minutes and restore for 10. There’s clearly some interval of on/off that isn’t a problem, and that’s an effective way to evaluate the impact of a deployed route change gradually over time rather than fucking up the internet for 25 minutes until someone figures out what’s going on.
And obviously you don’t do this on every individual route change - you batch them so it’s a release train.
If you think there are better techniques beyond “don’t break things”, I’m all for it.
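To make the schedule concrete, here’s a minimal sketch of that escalating soak/restore loop, assuming hypothetical advertise_change/roll_back hooks into whatever actually drives your routing config:

    import time

    # Hypothetical hooks into your routing control plane; names are illustrative only.
    def advertise_change(change_batch): ...
    def roll_back(change_batch): ...

    # Escalating soak schedule from above: advertise for 1 minute, unconditionally
    # restore for 10, then 2 minutes on / 10 off, and so on, until the batch has
    # survived long enough to be made permanent.
    SOAK_MINUTES = [1, 2, 5, 15]
    RESTORE_MINUTES = 10

    def gradual_rollout(change_batch):
        for soak in SOAK_MINUTES:
            advertise_change(change_batch)
            time.sleep(soak * 60)
            roll_back(change_batch)           # unconditional rollback between soaks
            time.sleep(RESTORE_MINUTES * 60)  # let things settle, review metrics
        advertise_change(change_batch)        # only now does the change become permanent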
This specific outage is the equivalent of this scenario:
You have an if/then statement with N conditions joined by AND. You remove one condition, leaving the rest of the statement unchanged. What happens?
The answer is that if you remove one condition, your input (in this case routes) is more likely to match N-1 conditions than N, so more input is going to be processed according to the “then” clause.
The impact of course depends on the fact that these were BGP routes, advertised to the Internet,… but the problem itself is generic.
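As a toy illustration (not the actual policy involved), dropping one ANDed condition can only widen the set of inputs that reach the “then” branch:

    # Toy illustration, not the real routing policy: each predicate must hold
    # for a route to be accepted by the "then" branch.
    def accept_before(route):
        return route["short_enough"] and route["valid_origin"] and route["trusted_peer"]

    def accept_after(route):
        # One condition removed: everything the old policy accepted is still
        # accepted, plus routes that only failed the dropped check.
        return route["short_enough"] and route["valid_origin"]

    routes = [
        {"short_enough": True, "valid_origin": True, "trusted_peer": True},
        {"short_enough": True, "valid_origin": True, "trusted_peer": False},
    ]
    print(sum(map(accept_before, routes)), sum(map(accept_after, routes)))  # prints: 1 2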
What can you do?
1) check these kinds of if/then statements with special care, analyzing under which conditions the input ends up processed by the “then” clause. This is exactly one of their follow-ups [1] (see the sketch below)
2) consider adding “global”, catch-all policies acting as an additional safety net (if applicable)
3) test your changes not just syntactically. Set up a test environment with multiple routers, apply the configuration and see what happens.
[1] Adding automatic routing policy evaluation into our CI/CD pipelines that looks specifically for empty or erroneous policy terms
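A rough sketch of what such a CI gate might look like, assuming the vendor config has already been parsed into a term -> match-conditions mapping (the parsing itself is vendor-specific and omitted here):

    # Rough sketch of a CI gate for follow-up [1]: flag policy terms whose match
    # conditions are empty, since an empty match clause typically matches everything.
    def find_empty_terms(policy):
        """policy: dict mapping term name -> list of match conditions."""
        return [name for name, conditions in policy.items() if not conditions]

    parsed_policy = {
        "reject-bogons": ["prefix-list bogons"],
        "advertise-customers": [],  # oops: no match conditions, matches every route
    }

    empty = find_empty_terms(parsed_policy)
    if empty:
        raise SystemExit(f"Policy terms with no match conditions: {empty}")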
Yeah, but all of those basically boil down to “the next outage will look different from before”, which is fine but isn’t an actual solution IMO.
My point is that if you want to prevent all future outages, you want to do that plus gradual rollouts that you don’t make permanent until you’ve observed the real-world behavior. This specific temporary rollout with automatic rollback also has the side effect that, even if you don’t do any of the “hardening steps” outlined, your system will still prevent any mistake you’ve made from rolling out and becoming permanent. Like I said, the “flapping” parameters can be tuned however you want, and you can aggregate updates into an automated “release train”. If you want, you can gate it on automated health metrics, although it can be hard to automate validating that the behavior before and after the route change is “correct” (maybe trained ML models would be helpful here).
This is btw in many ways how Google releases code into production: they bundle a bunch of PRs into a giant “publish” step, and if CI fails or anything in production fails, they automatically roll back the entire set of changes, since they can’t know which part of the release went bad. It’s a huge hammer to solve any issue they didn’t account for.
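A sketch of that “big hammer”, with apply_all/run_health_checks/revert_all as hypothetical placeholders for whatever your deployment tooling actually provides:

    # Hypothetical deployment hooks; names are placeholders, not a real API.
    def apply_all(changes): ...
    def revert_all(changes): ...
    def run_health_checks(timeout_minutes): ...

    def release_train(batched_changes):
        apply_all(batched_changes)
        healthy = run_health_checks(timeout_minutes=10)
        if not healthy:
            # No way to know which change in the batch went bad, so revert the whole set.
            revert_all(batched_changes)
        return healthy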
Rollback is of course useful when things go wrong (and by the way the routers CF use natively support rollback features). What I’m questioning is flapping as a structured way to carry out network changes.
Even a slow flap can cause issues downstream. Imagine a router handling hundreds of thousands of routes. Its software has a memory leak, so every route update it receives increases its RAM usage. A slow flap may well bring that router to a halt. Now you might say, “hey, this is not my fault”, but it is still something that could happen to your routers or your peers’.
Another aspect is that network devices can handle terabits per second of traffic. Now, a router is mostly stateless, but if you do this flapping thing on a firewall, what you get is a lot of sessions set up under behavior 1, then switched to behavior 2, and so on, which can cause high buffer utilization or packet drops.
So, yes, of course you “flap” (rollback) when things go wrong, but you probably don’t do it intentionally to test what’s going on in a network change.
> Its software has a memory leak so any route received increases its RAM usage.
Surely you realize this is a weak reason, but thought the argument against was that someone else’s misbehaving software is my problem? I mean, anyone sane in networking would treat this as not their problem (or at least work with the major providers, for whom it is, to make this possible).
However, the strongest reason I don’t buy this is that routes change regularly as a matter of course, so changing a route forward and back is no different from changing it twice. That bug would already be causing you issues, and this adds maybe a small percentage of extra advertisements.
> what you get is a lot of sessions with behavior1 and then switching to behavior2 and so on, which can cause high buffer utilization or packet drops.
Again, this largely relies on FUD rather than concrete explanations. BGP routes change regularly and often. Such issues, if they exist, are already problems, and briefly advertising a new route for a period of time as a dry run doesn’t alter them in any meaningful way. The problem is you’re treating a “flap” as somehow magically different from any normal route change, when it’s not really meaningfully so.
In the session scenario, I was talking about firewalls, not BGP routers (although, of course, you could have firewall features on a BGP router).
What I'm saying is, there are ways to validate and carry out network changes in a pretty robust way, including gradual rollout (if that's what you want), by using route or firewall rule priorities or other mechanisms.
I remain skeptical about this flapping strategy, but if it works in your setup, good for you.