The challenge I always struggled to explain to my leadership is that, in many cases, the time investment to build an accurate metric dashboard greatly exceeds the cost of just fixing the problem and knowing anecdotally that the fix works (I worked with ML Whisperers, people I trusted to understand the underlying problems and filter out the noise).
For example, in my case (getting machines that caused silent data corruption sent off for replacement), I could have been promoted by doing the following:
1) Finding a metric that correlated with the user pain I was fixing. In this case it would be something like "number of jobs that die with a NaN per hour", measured in an A/B test (half the jobs in the fleet have the fix enabled) and showing, with statistical significance, that our fix reduces the NaN rate (data driven; see the z-test sketch after this list)
2) Demonstrating that the NaN rate corresponds to user productivity (# of papers published, # of models trained per hour, whatever) and that high NaN rates really did have an effect (impact)
3) Filtering the data carefully, because the vast majority of NaNs are actually caused by user error, not silent data corruption (this was the actual hard part, and nobody has a better solution than "run a deterministic calculation on 8 cores and use majority vote to find the baddie"; see the majority-vote sketch after this list)
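For what it's worth, the "with statistical significance" part of step 1 is nothing exotic. A minimal sketch, assuming you already have per-arm job counts and NaN-death counts (the numbers below are made up for illustration), is a one-sided two-proportion z-test:

```python
# Minimal sketch of the A/B significance check from step 1.
# The counts are hypothetical; in practice they would come from fleet job logs.
import math

def two_proportion_z_test(failures_a, jobs_a, failures_b, jobs_b):
    """One-sided z-test: is the NaN-death rate in arm B lower than in arm A?"""
    p_a = failures_a / jobs_a
    p_b = failures_b / jobs_b
    # Pooled rate under the null hypothesis that both arms have the same rate.
    p_pool = (failures_a + failures_b) / (jobs_a + jobs_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / jobs_a + 1 / jobs_b))
    z = (p_a - p_b) / se
    # One-sided p-value from the standard normal CDF.
    p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value

# Hypothetical numbers: control fleet vs. fleet with the fix enabled.
z, p = two_proportion_z_test(failures_a=90, jobs_a=10_000,   # control: no fix
                             failures_b=40, jobs_b=10_000)   # treatment: fix enabled
print(f"z = {z:.2f}, one-sided p = {p:.4f}")  # small p => fix significantly reduces NaN deaths
```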
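And the majority-vote trick from step 3 is conceptually just this. A toy sketch, where run_reference_kernel is a hypothetical stand-in for the real workload (the genuinely hard part is making that workload bit-reproducible across machines):

```python
# Sketch of the "deterministic calculation + majority vote" corruption check from step 3.
from collections import Counter

def run_reference_kernel(device_id):
    # Placeholder: a real check would run a bit-reproducible computation (fixed seed,
    # fixed reduction order) on the device and return a digest of the output.
    return "digest-ok" if device_id != 5 else "digest-corrupt"

def find_baddies(device_ids):
    """Run the same deterministic workload everywhere; devices whose result disagrees
    with the majority are flagged as suspected silent-data-corruption sources."""
    digests = {d: run_reference_kernel(d) for d in device_ids}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return [d for d, h in digests.items() if h != majority_digest]

print(find_baddies(range(8)))  # -> [5]: the one device that disagrees with the other 7
```

The real version compares digests of the outputs across the 8 cores; whichever machine disagrees with the other seven goes on the replacement list.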
Run the above for 6 months, show it to all the execs in your division, and get a few people from Search, Ads, YouTube, or Research/DeepMind to say it increased revenue or decreased costs by 10%. Bingo: promotion, along with a full-time job maintaining a dashboard, with constant requests to add new features, fix code broken by other teams, and give even more presentations to execs on how dysfunctional it all is.
Or, I could just focus on fixing the machines, hear anecdotally from the ML Whisperers that things are working again, and go back to surfing Hacker News and picking up another 100 karma in a day.
I have always had trouble with this type of nebulous goal.