SERVFAIL might not be a good enough signal for alerting. I definitely see third-party DNS providers returning SERVFAIL for their own reasons; if it's a popular host (looking at you, Route 53), then you'll proxy those through and end up alerting on AWS's issue instead of your own.
They might just want a prober; ask every server for cloudflare.com every minute. If that errors, there are big problems. (I remember Google adding itself as malware many many years ago. Nice to have a continuous check for that sort of thing, which I am sure they do now.)
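The prober idea above can be sketched in a few dozen lines of Python using only the standard library — a raw UDP DNS query per RFC 1035, no resolver libraries. The server addresses, probe target, and alerting are placeholders for illustration, not anything a real provider necessarily runs:

```python
import socket
import struct
import secrets

def build_query(name: str) -> bytes:
    """Build a minimal DNS query packet asking for an A record (RFC 1035)."""
    txid = secrets.randbits(16)
    # Header: transaction id, flags (RD=1), QDCOUNT=1, AN/NS/AR counts = 0
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def probe(server: str, name: str = "cloudflare.com", timeout: float = 2.0) -> bool:
    """Return True if the server answered the query with RCODE 0 (NOERROR)."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, 53))
        try:
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            return False
    # Reply must be at least a header and echo our transaction id
    if len(reply) < 12 or reply[:2] != query[:2]:
        return False
    rcode = reply[3] & 0x0F  # low 4 bits of the second flags byte
    return rcode == 0
```

Run `probe()` against every resolver on a one-minute timer and page on a few consecutive failures; in practice you'd also want to probe over TCP and check the answer section, but even this bare check catches "big problems."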
> They might just want a prober; ask every server for cloudflare.com every minute. If that errors, there are big problems.
Yeah this was my first thought too. Why don't they have such a system in place for as many permutations of their public services as they can think of? We're a small company and we've had this for critical stuff for several years.
In my experience it's because the monitoring system is usually controlled by another team. So they don't know what they should be testing, and the developers who do know aren't easily able to set it up as part of deployment.
Add issues like network visibility, and you wind up needing a cross-team, cross-org effort just to stand up an HTTP poller and get its results back to some central server - so going into production without it winds up being the easier path (because it'll work fine, provided nothing goes wrong).