This can be true, but it depends on the other sources of error being small enough. The standard error is just a formula, and it varies inversely with the square root of the sample size, so you can trivially narrow a confidence interval by sampling more often. In this specific case, imagine you had daily measures of violent crime instead of only annual ones. You'd get much tighter error bars.
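To make the 1/√n point concrete, here's a small sketch with simulated numbers (not real crime data): each sample is four times larger than the last, and the standard error of the mean roughly halves each time.

```python
import random

# Illustrative only: the standard error of a sample mean is s / sqrt(n),
# so quadrupling n only halves the error bar. Data is simulated.
random.seed(0)
population = [random.gauss(100, 15) for _ in range(100_000)]

def standard_error(sample):
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (var / n) ** 0.5

for n in (25, 100, 400):  # each 4x larger than the last
    print(f"n={n:4d}  SE ~ {standard_error(population[:n]):.2f}")
```

This is why going from annual to daily measurements tightens the interval so much: you're multiplying n by roughly 365.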
Does that mean you should be more surprised if your predictions are wrong? It depends. You've only reduced model error, and this is the classic precision versus accuracy problem: you can very precisely estimate the wrong number. Does the model really reflect the underlying data-generating process? Are the inputs you're giving it reliable measurements? If both answers are yes, then your more precise model should be converging toward a better prediction of the true value, but if not, you're only getting a better prediction of the wrong thing.
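Here's a toy illustration of that precision-versus-accuracy failure mode, with entirely made-up numbers: a fixed reporting bias shifts every measurement, so more samples shrink the error bar around the measured mean while the estimate stays just as wrong.

```python
import random

# Hypothetical illustration: a biased measurement process. More samples
# shrink the error bar, but the bias never goes away, so we only get
# more confident about the wrong number. All values are invented.
random.seed(1)
true_rate = 400.0        # hypothetical true rate
reporting_bias = -80.0   # systematic underreporting

def measure(n):
    obs = [true_rate + reporting_bias + random.gauss(0, 30) for _ in range(n)]
    mean = sum(obs) / n
    se = (sum((x - mean) ** 2 for x in obs) / (n - 1) / n) ** 0.5
    return mean, se

for n in (10, 1000):
    mean, se = measure(n)
    print(f"n={n:5d}  estimate={mean:6.1f} +/- {se:.1f}  (truth: {true_rate})")
```

The large-n estimate converges tightly toward 320, not 400: a narrow confidence interval around the wrong value.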
We can ask these questions of this very example. Clearly, ARIMA is not a causally realistic model. Criminals don't look at last year's crime rates and decide whether to commit a crime based on that. The assumption is that, whatever actually does cause crime, it tends to happen at fairly similar levels year to year; that is, 2020 should differ more from 2010 than it does from 2019. We may not know what the causally relevant factors really are, or we may not be able to measure them, but we at least assume they follow that kind of rule. This sounds plausible to me, but is it true? We can backtest by making predictions of past years and seeing how close they are to the measured values, but whether this even works depends on the answer to the second question.
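A minimal backtest sketch, under the "this year looks like last year" assumption (a random walk, which is the simplest ARIMA(0,1,0) case). The rates below are made up for illustration, not real FBI data:

```python
# Hypothetical annual rates per 100k; invented numbers, not real data.
rates = [480, 470, 465, 455, 440, 430, 425, 415, 400, 395]

# One-step-ahead backtest: naively forecast each year as equal to the
# previous year, then measure how far off that forecast was.
errors = []
for i in range(1, len(rates)):
    prediction = rates[i - 1]
    errors.append(abs(rates[i] - prediction))

mae = sum(errors) / len(errors)
print(f"mean absolute one-step error: {mae:.1f}")
```

If the "similar levels year to year" assumption holds, this error stays small relative to the decade-scale drift; if it doesn't, even a well-fitted ARIMA is modeling the wrong regularity.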
So then the second question: is the national violent crime data actually reliable? I don't know the answer, but it certainly isn't perfect. There is a real crime rate for every crime, but it isn't exactly the reported number. Recording and reporting standards vary from jurisdiction to jurisdiction. Many categories of crime go underreported, and the extent of that underreporting can change over time. Changes in the reported numbers may reflect shifts in policing emphasis as much as or more than changes in the underlying true rate. I believe even the way the FBI collects and categorizes data has changed in the past, so I'm not sure a measurement from 1960 can be meaningfully compared to one from 2020.
Ultimately, "how surprised you should be when you are wrong" needs to take all of these sources of error into account, not just the model's coefficient uncertainty.
You can arbitrarily scale error bars based on real-world feedback, but the underlying purpose of a model is rarely served by such tweaking. Often the point of error bars is less “how surprised you should be when you are wrong” than “how wrong you should be before you’re surprised.”
When trying to detect cheating in online games, you don’t need to predict exact performance, but you do want to detect anomalies quickly. Detecting serial killers, gang wars, etc. isn’t about nailing the number of murders on a given day but about spotting patterns within those cases.
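That "how wrong before you're surprised" framing can be sketched as a simple threshold test: flag a value only when it sits far outside the recent baseline. The data and threshold here are hypothetical.

```python
# Sketch of anomaly detection via a z-score threshold: we don't need an
# accurate forecast, only a baseline and a notion of "too far from it".
def is_anomaly(history, today, threshold=3.0):
    n = len(history)
    mean = sum(history) / n
    std = (sum((x - mean) ** 2 for x in history) / (n - 1)) ** 0.5
    return abs(today - mean) > threshold * std

baseline = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15]  # ordinary days (invented)
print(is_anomaly(baseline, 14))   # → False: a typical day
print(is_anomaly(baseline, 45))   # → True: a clear outlier
```

Note that the threshold is a policy choice about tolerable false alarms, not a statement about predictive accuracy.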
You only really need to take those sources of error into account if you want an absolute measure of error, which, as you explain, seems pretty much impossible.
An error for weather only needs to be relative -- for example, if the error for rain today is higher than yesterday's, the exact amount doesn't matter, only that it's higher. (Not that I know whether this is possible.)
It's like how you can't precisely describe how biased a certain news source is, or exactly how to read a Yelp or Rotten Tomatoes rating -- you just have to read them often enough to get an intuitive sense that a 4.1-star Yelp restaurant with 800 reviews is probably good, while a 4.6-star restaurant with 5 reviews is quite possibly terrible.