
While I don't have an opinion on wide events (AKA spans) replacing logs, there are benefits to metrics that warrant their existence:

1. They're incredibly cheap to store. In Prometheus, it may cost you as little as 1 byte per sample (ignoring series overheads). Because they're cheap, you can keep them for much longer and use them for long-term analysis of traffic, resource use, performance, etc. Most tracing vendors seem to cap storage at 1-3 months, while metric vendors can offer multi-year storage (see the back-of-envelope sketch after this list).

2. They're far more accurate than metrics derived from wide events in higher-throughput scenarios. While wide events are incredibly flexible, their higher storage cost means there's an upper limit on the sample rate. The sampled nature of wide events means that deriving accurate counts is far more difficult; metrics really shine in this role (unless you're operating over datasets with very high cardinality). The problem only gets worse when you add tail sampling into the mix, which biases your data towards errors/slow requests.
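To make point (1) concrete, here's a back-of-envelope comparison. The per-sample and per-event sizes are rough assumptions (Prometheus's TSDB compression typically lands around 1-2 bytes per sample; the wide-event size and request rate below are made up for illustration):

    # Rough, illustrative numbers only -- actual costs depend on compression,
    # field count, and vendor.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    scrape_interval_s = 15          # one metric sample every 15s
    bytes_per_metric_sample = 2     # roughly what Prometheus TSDB compression achieves
    bytes_per_wide_event = 500      # assumed average for a span with a few dozen fields
    events_per_second = 1000        # assumed request rate for one service

    metric_bytes = SECONDS_PER_YEAR / scrape_interval_s * bytes_per_metric_sample
    event_bytes = SECONDS_PER_YEAR * events_per_second * bytes_per_wide_event

    print(f"one metric series, 1 year: ~{metric_bytes / 1e6:.1f} MB")    # ~4.2 MB
    print(f"unsampled wide events, 1 year: ~{event_bytes / 1e12:.1f} TB") # ~15.8 TB

That gap is why multi-year metric retention is routine while multi-year retention of raw events usually isn't.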



For point (2), you can derive accurate counts from sampled data if the sampling rate is captured as metadata on every sampled event. Some tools do support this (I work for Honeycomb, and our sampling proxy + backend work like this; I can't speak for others).
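A minimal sketch of that idea in Python, assuming each stored event carries a sample_rate field meaning "this event stands in for sample_rate original events" (the field name and event shape here are illustrative, not any particular vendor's schema):

    # Each stored event represents `sample_rate` original events, so weight
    # everything by that factor when aggregating.
    events = [
        {"duration_ms": 12.0, "status": 200, "sample_rate": 100},  # kept 1-in-100
        {"duration_ms": 950.0, "status": 500, "sample_rate": 1},   # error kept 1-in-1 (tail sampling)
    ]

    estimated_count = sum(e["sample_rate"] for e in events)

    estimated_avg_duration = (
        sum(e["duration_ms"] * e["sample_rate"] for e in events) / estimated_count
    )

    print(estimated_count)          # ~101 original requests represented
    print(estimated_avg_duration)   # weighted, so the unsampled error doesn't dominate

Counts, sums, and averages reweight cleanly this way; as the reply below notes, distinct counts don't.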

There are still limits to that, though. I can still get a count of events, or an AVG(duration_ms). But if I have a custom tag, I can't get accurate counts of that. And if I want to get distinct counts of values, I'm out of luck; estimating that from a sample is an active research problem.


It's an interesting point. We are actually running a test with Honeycomb's Refinery later this week; I'm slightly skeptical, but curious to see if they can overcome this bias.


You also lose accuracy because of sampling noise.


On top of that, metrics can have exemplars, which give you more (and dynamic) dimensions for buckets without increasing the cardinality of the metric vectors themselves. It's pretty much a wide event, with the sampling rate on this extra information just being the scrape interval you were already using anyway.

Not every library or tool supports exemplars, but they're a big part of the Prometheus & Grafana value proposition that many users entirely overlook.
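For anyone who hasn't used them, here's roughly what that looks like; a minimal sketch assuming the Python prometheus_client's exemplar support (the exemplar keyword argument on observe/inc, which only shows up when metrics are exposed in the OpenMetrics format):

    from prometheus_client import Histogram, CollectorRegistry
    from prometheus_client.openmetrics.exposition import generate_latest

    registry = CollectorRegistry()
    request_duration = Histogram(
        "request_duration_seconds", "Request latency", registry=registry
    )

    # Attach the current trace ID as an exemplar: the bucket counts stay
    # low-cardinality, but this one observation links back to a full trace.
    request_duration.observe(0.23, exemplar={"trace_id": "a1b2c3d4e5f6"})

    # Exemplars are only emitted in the OpenMetrics exposition format, e.g.:
    #   request_duration_seconds_bucket{le="0.25"} 1.0 # {trace_id="a1b2c3d4e5f6"} 0.23 <timestamp>
    print(generate_latest(registry).decode())

This is also what powers the jump-from-a-latency-panel-to-an-example-trace workflow in Grafana.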


This is exactly right. This kind of structured logging is great, but it doesn’t replace metrics. You really want to have both, and simple unsampled metrics are actively better for things like automated alerting, for exactly those reasons. They’re complements more than substitutes.



