
True. I guess what I normally want from metrics is a huge number of them that exist in a way that I can look at when I want, without paying to collect and aggregate them all the time. So in the scenario where they're just events, I need some other control system that can trigger the collection of events that aren't normally emitted.


With metrics, you're always sampling. It's impossible to know the value of the measurement at every point in time.

When you collect any form of metrics, something is choosing that sample rate.


That's not true in all models. For example in the (execrable) `statsd` model there is a bit of information sent every time a metric changes.


It's not just the stats protocol, it's the underlying metric, too. statsd is just a way of recording/transmitting metrics.

If I transmit a statsd metric representing "CPU usage", I am still sampling it. E.g., I might read the CPU usage every second & generate a statsd stat. That's a sample rate of 1Hz. I have to choose some sampling frequency, since the API most OSes expose is "what's the current CPU usage?".

If the metric is "total number of HTTP requests", then I can definitely just transmit that metric every time I get a request. We're not sampling for that metric.
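A minimal sketch of that distinction, using the standard statsd wire format (`|g` for gauges, `|c` for counters); the metric names and values here are made up for illustration:

```python
def gauge_packet(name, value):
    """Sampled metric: emitted on a timer, whatever the current reading is."""
    return f"{name}:{value}|g"

def counter_packet(name, count=1):
    """Event metric: emitted once per event, no sampling involved."""
    return f"{name}:{count}|c"

# Sampled: something must pick a frequency (say, once per second).
cpu_stat = gauge_packet("host.cpu.usage", 42.5)

# Event-driven: one packet per HTTP request, as requests happen.
req_stat = counter_packet("app.http.requests")

print(cpu_stat)  # host.cpu.usage:42.5|g
print(req_stat)  # app.http.requests:1|c
```

The sampling decision lives entirely in whatever loop calls `gauge_packet`; the counter has no such loop.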

The latter is inherently a discrete event, though, and one for which we can know every data point. Things like CPU and memory are either fundamentally continuous, or their implementations are simply sampling them.

I do agree the model matters too; Prom's tendency to just poll /metrics endpoints every n seconds means even things like HTTP events are inherently sampled.


> If I transmit a statsd metric representing "CPU usage", I am still sampling it.

In practice this is how everyone does it, but in theory it should be possible to have a non-sampled view of CPU usage (defined as "time process is scheduled onto a CPU"). With the right kernel introspection, you could represent it as a series of spans covering each time slice where the process is scheduled. Perhaps with a concept of a "currently ongoing span" to account for the current time slice.

Do I think this would be more useful than the typical sampled metric? Probably not, outside some niche performance analysis workflows. But my point is that CPU is not actually continuous, and I struggle to think of any metric which cannot be represented without sampling if you REALLY need it.
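A sketch of that idea, assuming a hypothetical kernel interface that exposed every interval in which the process was scheduled as a (start, end) span; given those spans, CPU time over any window is exact, not sampled (timestamps here are integer milliseconds, purely for illustration):

```python
def cpu_time_in_window(spans, win_start, win_end):
    """Exact CPU time consumed inside [win_start, win_end], given the
    (start, end) scheduling spans for a process. No sampling frequency
    appears anywhere: the spans ARE the complete record."""
    total = 0
    for start, end in spans:
        lo = max(start, win_start)
        hi = min(end, win_end)
        if hi > lo:
            total += hi - lo
    return total

# Hypothetical spans, in milliseconds: scheduled 0-300, 500-900, 1200-2000.
spans = [(0, 300), (500, 900), (1200, 2000)]
print(cpu_time_in_window(spans, 0, 1000))  # 700
```

The usual sampled "CPU %" is just this quantity divided by the window length, computed from two snapshots instead of the full span list.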


I almost put exactly that in a footnote, 'cept about RAM, instead of CPU usage. No OS that I know of exposes such an API, so it's highly theoretical.

As for truly continuous metrics, hmm. How about current battery charge (in Wh)? Host uptime also seems technically continuous (albeit representable by a straight line). (Yes, we track this metric; it makes reboots stand out, since my metrics system doesn't provide a vertical-marker feature.) Clock drift?

(and I'm going to insert the footnote on this comment about something something Planck units.)


It depends on the metric. Some metrics represent discrete events, such as "number of HTTP requests received". It is absolutely possible to record that metric at every point in time, without sampling.

(There are metrics that are continuous, such as CPU usage. Those, yes, you're always sampling.)


Great point. (Y) This feels like a gauge / counter distinction?

You could get pedantic at this point and say that because computers are fundamentally discrete machines, it is technically possible to sample the CPU usage at every tick :p


> this feels like a gauge / counter distinction?

I'm not particularly fond of those terms; I don't find them descriptive. I don't think they're quite the right terms, either. For example, queue length is fundamentally not a continuous metric: it only changes when the length of the queue does, and if you record those events as they happen, you can get the exact graph of the queue length without there being a sampling frequency. But it is a "gauge" in Prom's language.
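A minimal sketch of that point: a gauge-like value (queue length) reconstructed exactly from discrete events, with no sampling frequency anywhere. The event encoding here is invented for illustration: `+1` for an enqueue, `-1` for a dequeue.

```python
def queue_length_series(events):
    """events: list of (timestamp, delta), where delta is +1 (enqueue)
    or -1 (dequeue). Returns the exact step function of queue length
    as a list of (timestamp, length) points."""
    length = 0
    series = []
    for ts, delta in sorted(events):
        length += delta
        series.append((ts, length))
    return series

events = [(1, +1), (2, +1), (5, -1), (7, +1), (9, -1)]
print(queue_length_series(events))
# [(1, 1), (2, 2), (5, 1), (7, 2), (9, 1)]
```

Between any two events the length is constant, so this series IS the complete graph; a sampled gauge can only ever approximate it.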

But yes, a lot of the metrics surrounding event-like data probably do fall into Prom's "counter".

> sample the CPU usage at every tick :p

Linux has been tickless for years. There's still going to be a time at which the scheduler kicks in, of course, but if the core isn't contested, schedulers these days aren't necessarily going to even trigger. The process on that core can simply run until it sleeps. (Assuming no other process transitions to runnable, and there's no other core available for that process.)

As another poster points out, if we had enough insight into the kernel, we could still get the discrete events of when the scheduler schedules and deschedules the process on a core. So, technically we don't have to sample. But the practical APIs we're going to use are sampling ones.


I would like to subscribe to your newsletter.




