All you need is Wide Events, not "Metrics, Logs and Traces" (isburmistrov.substack.com)
267 points by talboren on Feb 27, 2024 | 175 comments


This isn't an unknown idea outside of Meta, it's just really expensive, especially if you're using a vendor and not building your own tooling. Prohibitively so, even with sampling.


Exactly. When they say

> Unlike with prometheus, however, with Wide Events approach we don’t need to worry about cardinality

This is hinting at the hidden reason why not everyone does it. You have to 'worry' about cardinality because Prometheus is pre-aggregating data so you can visualize it fast, and optimizing storage. If you want the same speed on a massive PB-scale data lake, with an infinite amount of unstructured data, and in the cloud instead of your own datacenters, it's gonna cost you a lot, and for most companies it is not a sensible expense.

It does work at smaller scale though, we once had an in-house system like this that worked well. Eventually user events were moved to MixPanel, and everything else to Datadog, metrics/logs/traces + a migration to OpenTel. It took months and added 2-digit monthly bills, and in the end debugging or resolving incidents wasn't much improved over having instant access to events and business metrics. Whoever figures out a system that can do "wide events" in a cost-effective way from startup to unicorn scale will absolutely make a killing.


https://victoriametrics.com/ would definitely recommend anyone having performance issues with Prometheus to give VictoriaMetrics a try.


Once you plug long term storage onto your Prometheus, do you really need the main Prometheus instances anymore?

Here’s an article about this idea: https://datadrivendrivel.com/posts/rmrfprometheus/

You can substitute the Grafana Agents for OTEL collectors as well.


While Grafana Agent uses fewer resources than Prometheus, there is a more optimized Prometheus-compatible scraper and router - vmagent [1]. I'd recommend giving Grafana Agent and vmagent the same workload and comparing their resource usage.

P.S. Prometheus itself can also act as a lightweight agent, which collects metrics and forwards them to the configured remote storage [2].

[1] https://docs.victoriametrics.com/vmagent/

[2] https://prometheus.io/blog/2021/11/16/agent/


That's just kicking the can over to an object storage API instead of managing disks.


And comes with all the downsides of Prometheus as well


Could you please elaborate more on "comes with all the downsides of Prometheus"?


Or, you can use Thanos, the de-facto standard with the biggest OSS community.


The Thanos/Mimir community doesn't help with the configuration burden, or with the even bigger resource consumption of a huge setup.


Bigger resource consumption of what exactly? Leaf Prometheus instances or the Thanos/Mimir stack compared to VictoriaMetrics? Have you seen a large scale migration between the two, with actual numbers?


A few interesting real-world large-scale migrations are highlighted at https://docs.victoriametrics.com/casestudies/


Not everything emits wide events. Maybe you can get the entire application layer like that, but there is also value in logs and metrics emitted from the rest of the infra stack.

To be fair, you could probably store and represent everything as wide events and build visualization tools on top of that which can combine everything together, even if it is sourced from something else.


Wide events seem to be "structured logs with focused schemas" (maybe also published in a special way beyond writing to stdout) but most places I've worked would call that "logging" not "wide events".

The reasons we don't use them for everything are as others in the thread say: it's expensive. Metrics (just the numbers, nothing else) can be compressed and aggregated extremely efficiently, hence cheaply. Logs are more expensive due to their arbitrary contents.

It's all due to expense really.


Columnar storage stores data very efficiently, too - because it compresses data of a similar nature (columns). Check e.g. ClickHouse on this matter: https://clickhouse.com/docs/en/about-us/distinctive-features, https://clickhouse.com/blog/working-with-time-series-data-an...

So I wouldn't say that events are "expensive" while metrics are "cheap" - both depend on the actual implementation, and events can be cheap too.

And so of course if you have to optimise things, you would need to drop some information you pass to the events, but you would need to do the same for metrics (reduce the number of metrics emitted, reduce the prometheus labels,...).


Only if you have small pre-defined sets of events in data structures that compress well. That is not the case for any real system.

> And so of course if you have to optimise things, you would need to drop some information you pass to the events, but you would need to do the same for metrics (reduce the number of metrics emitted, reduce the prometheus labels,...).

Those are entirely different orders of magnitude both when it comes to size and how much usefulness you lose. In modern storage backends like VictoriaMetrics a counter is gonna cost you around a byte per metric per probe. And since you emit them periodically, that is essentially independent of incoming traffic.

Capturing the requests into event/trace/whatever other name they gave to logs this month is many times that and is multiplied by traffic.
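
To put rough numbers on the gap: at ~1 byte per sample and a 15s scrape interval, a counter series costs on the order of 86400 / 15 ≈ 5,760 bytes per day no matter how much traffic you serve, while capturing even a 1 KB wide event per request at 1,000 req/s is roughly 86 GB per day. (The 1 KB and 1,000 req/s figures are purely illustrative.)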


> Those are entirely different orders of magnitude both when it comes to size and how much usefulness you lose. In modern storage backends like VictoriaMetrics a counter is gonna cost you around a byte per metric per probe. And since you emit them periodically, that is essentially independent of incoming traffic.

I thought this argument was about whether wide events can be used for metrics, or whether metrics are a completely different concept. If we want to emulate metrics with events, we would also emit them periodically, independently of the traffic: once in a while, pretty much like how Prometheus scraping works.


Storing telemetry efficiently is only part of what monitoring is supposed to do. The other part is querying: ad-hoc queries, dashboards, alerting queries executed every 15s or so. For querying to work fast, there has to be an efficient index or multiple indexes depending on the query. Since you referred to ClickHouse as efficient columnar storage, please see what makes it different from a time series database - https://altinity.com/wp-content/uploads/2021/11/How-ClickHou...


And yet people use ClickHouse quite effectively for this very problem, see the comment here: https://news.ycombinator.com/item?id=39549218

There are also time-series databases out there that are OK with high cardinality: https://questdb.io/blog/2021/06/16/high-cardinality-time-ser...


> And yet people use ClickHouse quite effectively for this very problem

There is no doubt that ClickHouse is a super-fast database. No one stops you from using it for this very problem. My point is that specialized time series databases will outperform ClickHouse.

> There are also time-series databases out there that are OK with high cardinality

So does this blog post say that "tolerance to high cardinality" means QuestDB indexes only one of the columns in the data generated by this benchmark?

TSDBs like Prometheus, VictoriaMetrics or InfluxDB will perform filtering by any of the labels with equal speed, because this is how their index works. Their users don't need to think about the schema or about which column should be present in the filter.

But in ClickHouse and, apparently, in QuestDB, you need to specify a column or list of columns for indexing (the fewer columns, the better). If the user's query doesn't contain the indexed column in the filter - the query performance will be poor (full scan).

See how this happened in another benchmarketing blog post from QuestDB - https://telegra.ph/No-QuestDB-is-not-Faster-than-ClickHouse-...


I agree that specialised DBs outperform a general-purpose OLAP database. The question is what "outperform" means. In this area queries don't actually need to be ultra-fast, they just need to be reasonably fast to be comfortable. And so missing indexes for some attributes would likely be okay. Looking at https://clickhouse.com/blog/storing-log-data-in-clickhouse-f..., they added just bloom filters for columns. Which makes sense: this is not a full-blown index, but it will likely yield reasonable results. But this is all theoretical; I haven't built such a solution myself (we're working on it now for in-house observability), so I likely miss something that can only be discovered in practice.

Btw we use VictoriaMetrics now at work. It works well, queries are fast. But we're forced to always think about cardinality, otherwise either performance or cost suffers. This is okay for a predefined set of metrics & labels and works well, but it doesn't allow deep explorations.


In QuestDB, only SYMBOL columns can be indexed. However, sometimes, queries can run faster without indexes. This is because, under the hood, QuestDB runs very close to the hardware and only lifts relevant time partitions and columns for a given query. Therefore table scans between given timestamps are then very efficient. This can be faster than using indexes when the scan is performed with SIMD and other hardware-friendly optimizations.

When cardinality is very high, indexes make more sense.


The whole point of wide events is recording an arbitrary set of key value pairs. How do you propose storing that in a columnar datastore?


I can't speak for others, but at Honeycomb that's what we do. There's some details in this blog post that might be interesting: https://www.honeycomb.io/blog/why-observability-requires-dis...


I worked on Scuba, inside and outside of Meta (Interana), and yeah - It was expensive AF. I recommend focusing on metrics first. Use analytics logging sparingly, and understand the statistics of how metrics work, because without understanding those statistics you'll misread your events anyway.

This is not to say that wide events aren't worth it - For many things, something like Scuba or Bigquery are invaluable. There's ways to optimize. But we're talking about "One of AWS's largest machines" vs "A couple cores", and I suggest learning Prometheus first.


> understand the statistics of how metrics work

Haha, since you worked on Scuba I’ll mention IMO this point was by far the biggest flaw of ODS. No one ever performed the metric rollups correctly. Average of averages? And at what granularity? ODS downsampled the older time series data but now perhaps you’re taking a percentile over a “max of maxes”. Except it only sometimes used that method of downsampling automatically.

And I seem to recall the labels “daily”, “weekly”, and “monthly” not being intuitive either, and two of them meant the same thing... that was quite a mess to work with.

A lot of the autoscaling systems were wonky because the ODS metrics they were based upon didn’t represent what people thought they did.


Never in the world would I have expected my post to cause a discussion about ODS flaws :D


I don’t know that’s true. My last two very-not-meta-sized companies have both had systems that were very cost effective and essentially what the article describes. It’s not the simplest thing to put in place, but far from unapproachable.

I think one of the big hills is moving to a culture that values observability (or whatever you choose to call it, I prefer forensic debugging). It's another thing to understand and worry about, and it helps tremendously if there are good, highly visible examples of it.

Edit: Typo.


Could you share some specifics of how it could be approached?


I don't know what that commenter has in mind. My own experience building this up is to start with usable information and not try to instrument everything at once. Those are usually:

- some way to get to errors when they happen

- zeroing in on the key performance indicators for your application, and relating them to infra metrics, particularly resources (because cpu, mem, storage, and bandwidth cost money).

Unless you have both domain and infra knowledge, it will be hard to know ahead of time.

For a stateless web app backed by a db, you're typically starting with:

- request metrics (req/s, latency)

- authenticated user activity

- db metrics (such as what you'd get with pganalyze)

It's when there is resource pressure that things get interesting. Here, you have product-fit, you have user traction and growth, but now your app is falling down because it is popular.

It is tempting to just crank things up horizontally and say you're trying to land-grab users ... but then your team will never develop the discipline to build scalable and reliable software. It's here that you start adding instrumentation to find bottlenecks -- whether that is instrumenting spans, adding metrics, optimizing queries, etc. You also need to craft the dashboard to give actionable intelligence. Here's where Datadog's notebook feature is great -- you explore (and collaborate) with the notebook until you can find the bottleneck, and then export the useful metrics into a dashboard. Then you set up the monitoring, because you have found the key performance indicators.

It's this active search to understand what is going on in _both_ app and infra that shows you the limits of the current architectural designs, guide what you need to do, and validate the architectural and engineering decisions for the future. This active search may involve tools beyond OpenTelemetry or Datadog or Honeycomb -- maybe you have to attach a REPL, or go poking around a memory profiler.

What you _don't_ do is blindly add these things because having the capability somehow makes things better. Rather, you incrementally improve your capability in order to solve your present scalability and reliability problems with your app and its infra.


Maybe this is a dumb question but why wouldn’t it be cost effective to pre-aggregate counts occasionally and sample on the fly?


While I don't have an opinion on wide events (AKA spans) replacing logs, there are benefits to metrics that warrant their existence:

1. They're incredibly cheap to store. In Prometheus, it may cost you as little as 1 byte per sample (ignoring series overheads). Because they're cheap, you can keep them for much longer and use them for long-term analysis of traffic, resource use, performance, etc. Most tracing vendors seem to cap storage at 1-3 months while metric vendors can offer multi-year storage.

2. They're far more accurate than metrics derived from wide events in higher-throughput scenarios. While wide events are incredibly flexible, their higher storage cost means there's an upper limit on the sample rate. The sampled nature of wide events means that deriving accurate counts is far more difficult - metrics really shine in this role (unless you're operating over datasets with very high cardinality). The problem only gets worse when you combine tail sampling into the mix and add bias towards errors / slow requests in your data.


For point (2), you can derive accurate counts from sampled data if the sampling rate is captured as metadata on every sampled event. Some tools do support this (I work for Honeycomb, and our sampling proxy + backend work like this, can't speak for others).

The issue is there are still limits to that, though. I can still get a count of events, or an AVG(duration_ms). But if I have a custom tag I can't get accurate counts of that. And if I want to get distinct counts of values, I'm out of luck. Estimating that is an active machine learning research problem.
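
To make the first paragraph's weighting concrete, here's a minimal sketch in Go (the field names are made up, not Honeycomb's actual schema): each stored event carries the rate it was sampled at, and aggregates weight every event by that rate.

  package main

  import "fmt"

  // Event is a hypothetical sampled wide event. SampleRate means
  // "this stored event stands in for SampleRate original events".
  type Event struct {
      DurationMs float64
      SampleRate float64 // e.g. 10 means 1-in-10 sampling
  }

  func main() {
      stored := []Event{
          {DurationMs: 120, SampleRate: 10},
          {DurationMs: 80, SampleRate: 10},
          {DurationMs: 2300, SampleRate: 1}, // slow request kept unsampled by tail sampling
      }

      var count, durSum float64
      for _, e := range stored {
          count += e.SampleRate // each stored event represents SampleRate real events
          durSum += e.DurationMs * e.SampleRate
      }
      fmt.Printf("estimated COUNT=%.0f, estimated AVG(duration_ms)=%.1f\n", count, durSum/count)
  }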


It's an interesting point. We are actually running a test with Honeycomb's Refinery later this week; I'm slightly skeptical but curious to see if they can overcome this bias.


You also lose accuracy because of sampling noise.


On top of that, metrics can have exemplars, which give you more (and dynamic) dimensions for buckets without increasing the cardinality of the metric vectors themselves. It's pretty much a wide event, with the sampling rate on this extra information just being the scrape interval you were already using anyway.

Not every library or tool supports exemplars, but they're a big part of the Prometheus & Grafana value proposition that many users entirely overlook.
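
For the curious, a rough sketch of attaching an exemplar with Go's client_golang (treat the handler and label names here as assumptions; exemplars only reach the scraper via the OpenMetrics exposition format):

  package main

  import (
      "log"
      "net/http"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promhttp"
  )

  var httpRequests = prometheus.NewCounterVec(
      prometheus.CounterOpts{Name: "http_requests_total", Help: "HTTP requests by status code."},
      []string{"code"},
  )

  func main() {
      prometheus.MustRegister(httpRequests)

      http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
          traceID := r.Header.Get("traceparent") // placeholder for real trace propagation
          c := httpRequests.WithLabelValues("200")
          // Attach the trace ID as an exemplar where supported, so the cheap
          // aggregate links back to one concrete trace / wide event.
          if e, ok := c.(prometheus.ExemplarAdder); ok && traceID != "" {
              e.AddWithExemplar(1, prometheus.Labels{"trace_id": traceID})
          } else {
              c.Add(1)
          }
          w.Write([]byte("ok"))
      })

      // Exemplars are only exposed via the OpenMetrics format, which this handler can negotiate.
      http.Handle("/metrics", promhttp.HandlerFor(prometheus.DefaultGatherer,
          promhttp.HandlerOpts{EnableOpenMetrics: true}))
      log.Fatal(http.ListenAndServe(":8080", nil))
  }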


This is exactly right. This kind of structured logging is great, but it doesn’t replace metrics. You really want to have both, and simple unsampled metrics are actively better for e.g. automated alerting for exactly those reasons. They’re complements more than substitutes.


This is essentially Amazon Coral's service log format, except service logs include cumulative metrics between log events. This surfaces in CloudWatch Logs as metrics extraction and in Logs Insights as structured log queries. Meta's Scuba is like a janky imitation of that tool chain.

People point to Splunk and ELK, but they fail to realize that inverted-index-based solutions algorithmically can't scale to arbitrary sizes. I would rather point people to Grafana Loki and CloudWatch Logs Insights, and the compromises they entail, as the right model for "wide events" or structured-logging-based events and metrics. Their architectures allow you to scale at low cost to PB or even exabyte scale monitoring.


As far as design and ergonomics go, I'd compare servicelogs to a pile of trash that may yet grow massive enough to accrete into a planetoid.

A text based format whose sole virtue is descending from a system that was composed mainly of bugs that had coalesced into perl scripts.

It's not the basis of something you could even give away, let alone have people willingly pay you for their agony. Cloudwatch being rather alike in this regard.


One thing that really gets under my skin when I think about observability data is the abject waste we incur by shipping all this crap around as UTF-8 bytes. This post (from 1996!) puts us all to shame: https://lists.w3.org/Archives/Public/www-logging/1996May/000...

Knowing the type of each field unlocks some interesting possibilities. If we can classify fields as STRING, INTEGER, UUID, FLOAT, TIMESTAMP, IP, etc we could store (and transmit!) them optimally. In particular, knowing whether we can delta-encode is important--if you have a timestamp column, storing the deltas (with varint or vbyte encoding) is way cheaper than storing each and every timestamp. Only store each string once, in a compressed way, and refer to it by ID (with smaller IDs for more frequent strings).
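
As a toy illustration of the timestamp point (not any particular wire format, just delta + varint using Go's encoding/binary):

  package main

  import (
      "encoding/binary"
      "fmt"
  )

  // encodeDeltas varint-encodes a non-decreasing series of millisecond
  // timestamps as the first value followed by deltas. Close-together events
  // then cost one or two bytes each instead of eight.
  func encodeDeltas(ts []uint64) []byte {
      buf := make([]byte, 0, len(ts)*2)
      var prev uint64
      for i, t := range ts {
          d := t
          if i > 0 {
              d = t - prev
          }
          var tmp [binary.MaxVarintLen64]byte
          n := binary.PutUvarint(tmp[:], d)
          buf = append(buf, tmp[:n]...)
          prev = t
      }
      return buf
  }

  func main() {
      ts := []uint64{1709000000000, 1709000000012, 1709000000020, 1709000000450}
      enc := encodeDeltas(ts)
      fmt.Printf("raw: %d bytes, delta+varint: %d bytes\n", len(ts)*8, len(enc))
      // prints: raw: 32 bytes, delta+varint: 10 bytes
  }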

It's sickening to imagine how much could be saved by exploiting redundancy in these data if we could just know the type of each field. You get some of this with formats like protocol buffers, but not enough.

Another thing, as you mention, is optimizing for search. Indexing everything seems like the wrong move. Maybe some partial indexing strategy? Rollups? Just do everything with mapreduce jobs? I don't know what the right answer is but fully indexing data which are mostly write-only is definitely wrong.


In my analysis of some exabyte-scale log analytics systems, the top two compute spend buckets were JSON serde and UTF-8 conversions.


Storing by delta can bite you quite hard in the event of data corruption. Instead of 1 data point being affected, it would cascade down. Selecting specific ranges with a concrete bottom/top, as in "give me everything between 1-2 pm from last Saturday", might also become problematic. I'm sure there's a tradeoff to be had here; weaving data dependencies throughout your file certainly leaves a redundancy hole not everyone is willing to accept.


I think we could limit the blast radius by working in reasonably sized chunks--like O(10-100MB)--and possibly replicating (which becomes much more attractive when the data set gets a lot smaller). But you're right, it's a good point that redundancy can be a feature.


That second part of only storing reference ids sounds like db normalizing but for logs. Seems reasonable though!


Which compromises in CloudWatch Logs Insights make it not the right model for "wide events"?

I have the impression it does a good job providing visibility tools (search, filter, aggregation...) over structured logs.

Ergonomics is bad, though, with the custom query language and low processing speed, depending on the amount of data you're processing during an investigation.


I think you misunderstood. It does.


> This surfaces in cloudwatch logs as metrics extraction and Logs Insights as structured log queries. The meta scuba is like a janky imitation of that tool chain

I don't have any experience with scuba besides this article, but I think you've missed the point. Wide events, based on my understanding, are a combination of traditional logs and something akin to service logs.

This provides two crucial improvements. The first is flexible, arbitrary associations as a first-class feature. As I interpret it, wide events give you the ability to associate a free-form traditional log message with additional dimensions, which is similar to what service logs offer but more flexible. E.g. if you log "caught unhandled FooException, returning ServerException" but only emit a metric for ServerException=1, service logs can't help you.

The other major benefit that you seem to have overlooked is a good UI to explore those events. I think most people would agree that the cloud watch UI is somewhere between bad to mediocre, but the monitor portal UI is nothing short of an unmitigated disaster. And neither give you the ability described in this article, to point and click graph events that match certain criteria. As I read it, it's the equivalent functionality to simple insights queries, except it doesn't require any typing, searching for the right dimension names, or writing stats queries to get graphs.


> inverted index based solutions algorithmically can’t scale to arbitrary sizes

Curious why you think so? An inverted index can be sharded and built/updated/queried in parallel, so it scales linearly.


A few issues come up. First, inverted indices can be sharded, but the index insert patterns aren't uniformly distributed - they follow a Zipf distribution - which means your sharding scales proportionally to the frequency of the most common token in the log. There are patches, but in the end it sort of boils down to this.

Another issue is that indexing up front is crazy expensive vs. doing absolutely nothing but packing and time indexing, maybe some bloom indices. This is really important because the vast majority of log, event, and telemetry data in general is never accessed. Like 99.99% of it or more.

The technique of something like Loki is to batch data into micro-batches, index them within the batches into a columnar store (like Parquet or ORC), and time-index the micro-batches. The query path is highly parallel and fairly expensive, but given the cost savings up front it's a lot cheaper than up-front indexing. You can turn the fan-out knob on queries to any size, and similar to MPP scale-out databases such as Snowflake there's not really much of an upper limit. Effectively everything from ingestion to query scales out linearly without the uneven heat problems you see in a sharded index.
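
A heavily simplified sketch of that shape, just to make the structure concrete (in-memory only; a real system like Loki compresses sealed chunks into object storage and keeps little more than the time range as an index):

  package main

  import (
      "fmt"
      "time"
  )

  type event struct {
      ts   time.Time
      body string
  }

  // chunk is a micro-batch: events are stored unindexed, and only the time
  // range is kept as the "index".
  type chunk struct {
      minTS, maxTS time.Time
      events       []event
  }

  type store struct {
      open   chunk
      sealed []chunk // in a real system: compressed and flushed to object storage
      limit  int
  }

  func (s *store) append(e event) {
      if len(s.open.events) == 0 || e.ts.Before(s.open.minTS) {
          s.open.minTS = e.ts
      }
      if e.ts.After(s.open.maxTS) {
          s.open.maxTS = e.ts
      }
      s.open.events = append(s.open.events, e)
      if len(s.open.events) >= s.limit {
          s.sealed = append(s.sealed, s.open)
          s.open = chunk{}
      }
  }

  // query prunes chunks by the time index, then scans whatever survives.
  func (s *store) query(from, to time.Time, match func(event) bool) []event {
      var out []event
      for _, c := range append(s.sealed, s.open) {
          if c.maxTS.Before(from) || c.minTS.After(to) {
              continue // pruned by the time index, never scanned
          }
          for _, e := range c.events {
              if !e.ts.Before(from) && !e.ts.After(to) && match(e) {
                  out = append(out, e)
              }
          }
      }
      return out
  }

  func main() {
      s := &store{limit: 3}
      base := time.Now()
      for i := 0; i < 7; i++ {
          s.append(event{ts: base.Add(time.Duration(i) * time.Second), body: fmt.Sprintf("msg %d", i)})
      }
      hits := s.query(base.Add(2*time.Second), base.Add(5*time.Second), func(event) bool { return true })
      fmt.Println(len(hits), "events in range") // 4 events in range
  }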


> which means your sharding scales proportional to the frequency of the most common token in the log

The inverted index entry for a frequent token can be sharded itself. You can imagine that Google doesn't store all the page IDs on the internet for the word 'hello' on the same server.

> This is really important because the vast majority of log and event and telemetry in general is never accessed. Like 99.99% of it or more.

For log processing you are likely correct. I was wondering more generally why you think an inverted index doesn't scale.


These sorts of heat balancing sharding schemes are very difficult to implement and very expensive. As you see hot keys you need to split the hash space and rebalance within that space by reshuffling the shard data.

I’d note that also Google doesn’t bother keeping a perfect index because perfect fidelity isn’t necessary, unlike in a lot analytic or similar system where replication of ground truth is important. It’s much more important for Google to maintain high fidelity at the less frequent token side of the distribution and very low fidelity at the high frequency side. Logs can’t do that.


> As you see hot keys you need to split the hash space and rebalance within that space by reshuffling the shard data.

Correct, but I think it is a solvable problem, if such an approach adds value (e.g. one can build and start selling a product like that).

Also, the complexity may not be that bad. Say you are building tiered shards:

- every entry at the beginning is a single shard

- once a shard reaches 10m records, you split it into 10 shards, which is not that difficult a computation, and can be done in XXXms in a single transaction


It's actually quite hard. It starts with being able to detect a hot key at all. It's also not the case that heat is symmetric with size; in fact, in an inverted index single entries can be very hot. Then it's not about simply shuffling data (which isn't as simple as you outline - you need to salt the keys and then shuffle randomly, otherwise you don't get uniformity); then you need to create cumulatively eventually-consistent write replicas to balance write load while answering queries online in a strongly consistent way. Add to this that any dynamic change in the index like this requires consistent online behavior (i.e., ingestion and queries don't stop because you need to rebalance), and the hot keys are necessarily "large" in volume, so back pressure can be enormous and queue draining itself can be expensive. Add to that the need for stateful elastic infrastructure.

There are definitely products that offer these characteristics. S3 and DynamoDB both do, even if you can't see it. But it took many years of very intensive engineering to get it to work, and they have total control over the infrastructure and runtime behind an opaque API. Elasticsearch and Splunk are general-purpose software packages that are installed by customers, and their data models are much more complex than objects or tables.


> It’s also not the case that heat is symmetric with size, in fact in an inverted index single entries can be very hot.

I think you mixed up two orthogonal topics: you first talked about frequent tokens, and now switched to hot keys (tokens which are frequently queried).

As for frequent tokens, I think I described the algorithm well; it looks simple, and I don't see any issues there, as long as your metadata store (where you store info about shards and replicas) allows some kind of transactions (e.g. CockroachDB or similar).

For hot keys/shards, as you pointed out, the solution is to increase the replication factor, but I think if a shard is relatively small (10m IDs as in my example), adding another replica online is also fast, can be done in a single transaction, and may not require all the data movement you described.


The thread is about inverted indices. Frequent tokens are hot keys.


I've seen situations where the cost of indexing all the logs (which were otherwise just written to a hierarchical structure in HDFS and queried with mapreduce jobs) in ES would have been highly significant--think like an uncomfortable fraction of total infrastructure spend. So, sure, you can make it scale linearly by adding enough nodes to keep up with write volume but that doesn't mean it's affordable. And then consider what that's actually accomplishing for those dollars. You're optimizing for quick search queries on data which you'll mostly never query. Worth it?

EDIT: as a user, being able to just run mapreduce jobs over logs is a heck of a lot better experience IMO than trying to torture Kibana into giving me the answers I want.


Loki basically works like a nice UI on mapreducing logs. The rest is exactly my point about ELK.


It could be a shortcoming of ES specifically, since the JVM is not an ideal platform for heavy data crunching.


This thread has a lot of discussion about Wide Events / Structured Logs (same thing) being too big at scale, and you should use metrics instead.

Why does it have to be an either/or thing? Couldn't you hook up a metrics extractor to the event stream and convert your structured logs to compact metrics in-process before expensive serde/encoding? With this your choice doesn't have to affect the code, just write slogs all the time; if you want structured logs then output them, but if you only want metrics then switch to the metrics extractor slog handler.
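
For what it's worth, Go's log/slog makes this pretty natural because handlers are pluggable. A rough sketch (the counter keys and the "status" attribute are made up) that turns wide events into in-process counters before any serialization happens:

  package main

  import (
      "context"
      "fmt"
      "log/slog"
      "sync"
  )

  // MetricsHandler turns structured log records into in-process counters
  // instead of serializing them. Here records are keyed by message plus a
  // "status" attribute; a real extractor would be configurable.
  type MetricsHandler struct {
      mu     sync.Mutex
      counts map[string]int
      next   slog.Handler // optional: also emit the wide event itself
  }

  func NewMetricsHandler(next slog.Handler) *MetricsHandler {
      return &MetricsHandler{counts: make(map[string]int), next: next}
  }

  func (h *MetricsHandler) Enabled(ctx context.Context, l slog.Level) bool { return true }

  func (h *MetricsHandler) Handle(ctx context.Context, r slog.Record) error {
      key := r.Message
      r.Attrs(func(a slog.Attr) bool {
          if a.Key == "status" {
              key += "/" + a.Value.String()
          }
          return true
      })
      h.mu.Lock()
      h.counts[key]++
      h.mu.Unlock()
      if h.next != nil {
          return h.next.Handle(ctx, r) // still ship the full event if desired
      }
      return nil
  }

  // WithAttrs/WithGroup are no-ops here just to keep the sketch short.
  func (h *MetricsHandler) WithAttrs(attrs []slog.Attr) slog.Handler { return h }
  func (h *MetricsHandler) WithGroup(name string) slog.Handler       { return h }

  func main() {
      h := NewMetricsHandler(nil) // metrics only; pass a slog.NewJSONHandler to keep the logs too
      logger := slog.New(h)
      logger.Info("http_request", "status", 200, "path", "/checkout", "duration_ms", 31)
      logger.Info("http_request", "status", 500, "path", "/checkout", "duration_ms", 997)
      fmt.Println(h.counts) // map[http_request/200:1 http_request/500:1]
  }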

Further, has nobody tried writing structured logs to Parquet files and shipping out 1MB blocks at once? Way less serde/encoding overhead, and the column-oriented layout compresses like crazy with built-in dictionary and delta encodings.


The open telemetry collector does just that. https://github.com/open-telemetry/opentelemetry-collector-co...


seems cool but the top of the page says the thing you suggest is now deprecated


The Span Metrics Processor has been replaced with the very similar Span Metrics Connector, which is still supported.


The Span Metrics Processor being replaced by the Span Metrics Connector is very, very OpenTelemetry.


This package implements the parquet idea for Go's slog package: https://github.com/samber/slog-parquet


>too big at scale

Bigger than meta? o_0


The scale isn’t a problem if you also have Meta’s profit margins. But most businesses don’t…


I don't think Meta's margins have anything to do with this. Companies smaller in scale than Meta also have less data!

And yes, Scuba is in-memory, but that's not a requirement. Check out this video on how Honeycomb implemented their columnar storage: https://vimeo.com/331143124


That's what I don't understand about observability services.

They complicate things so much that it's sometimes more expensive to observe than to RUN the infra being observed.


Sounds like an unaligned incentives issue. Or maybe after a certain scale observing becomes cheaper than running?


The isomorphism of traces and logs is clear. You can flatten a trace to a log and you can perfectly reconstruct the trace graph from such a log. I don't see the unifying theme that brings metrics into this framework, though. Metrics feels fundamentally different, as a way to inspect the internal state of your program, not necessarily driven by exogenous events.

But I definitely agree with the theme of the article that leaving a big company can feel like you got your memory erased in a time machine mishap. Inside a FANG you might become normalized to logging hundreds of thousands of informational statements, per second, per core. You might have got used to every endpoint exposing thirty million metric time series. As soon as you walk out the door some guy will chew you out about "cardinality" if you have 100 metrics.


I think all metrics can be reconstructed as “wide events” since they’re just a bunch of arbitrary data? Counts, gauges, and histograms at least seem pretty straightforward to me.

It seems like the main motivation for metrics is that sending + storing + querying wide events for everything is cost prohibitive and/or performance intensive. If you can afford it and it works well, wide events is definitely more flexible. A metric is kinda just a pre-aggregation on the event stream.


If you think of a metric as an event representing the act of measuring (along with the result of that measurement), then it becomes the same as any other event.


True. I guess the thing I normally want from metrics is to have a huge number of them that exist in a way that I can look at them when I want, without having to pay for collecting and aggregating them all the time. So in the scenario where they are just events, I need some other control system that can trigger the collection of events that aren't normally emitted.


With metrics, you're always sampling. It's impossible to know the value of the measurement at every point in time.

When you collect any form of metrics, something is choosing that sample rate.


That's not true in all models. For example in the (execrable) `statsd` model there is a bit of information sent every time a metric changes.


It's not just the stats protocol, it's the underlying metric, too. statsd is just a way of recording/transmitting metrics.

If I transmit a statsd metric representing "CPU usage", I am still sampling it. E.g., I might read the CPU usage every second & generate a statsd stat. That's a sample rate of 1Hz. I have to choose some sampling frequency, since the API most OS's expose is "what's the current CPU usage?".

If the metric is "total number of HTTP requests", then I can definitely just transmit that metric every time I get a request. We're not sampling for that metric.

The latter is inherently a discrete event, which we can know every data point of, though. Things like CPU and memory are either fundamentally continuous, or their implementations are simply sampling them.

I do agree the model matters too; Prom's tendency to just poll /metrics endpoints every n seconds means even things like HTTP events are inherently sampled.


> If I transmit a statsd metric representing "CPU usage", I am still sampling it.

In practice this is how everyone does it, but in theory it should be possible to have a non-sampled view of CPU usage (defined as "time process is scheduled onto a CPU"). With the right kernel introspection, you could represent it as a series of spans covering each time slice where the process is scheduled. Perhaps with a concept of a "currently ongoing span" to account for the current time slice.

Do I think this would be more useful than the typical sampled metric? Probably not, outside some niche performance analysis workflows. But my point is that CPU is not actually continuous, and I struggle to think of any metric which cannot be represented without sampling if you REALLY need it.


I almost put exactly that in a footnote, 'cept about RAM, instead of CPU usage. No OS that I know of exposes such an API, so it's highly theoretical.

As for truly continuous metrics, hmm. How about current battery charge (in Wh)? Host uptime also seems technically continuous (albeit representable by a straight line). (Yes, we track this metric; it makes reboots stand out, since my metrics system doesn't provide a vertical marker feature.) Clock drift?

(and I'm going to insert the footnote on this comment about something something Planck units.)


It depends on the metric. Some metrics represent discrete events, such as "number of HTTP requests received". It is absolutely possible to record that metrics at every point in time, without sampling.

(There are metrics that are continuous, such as CPU usage. Those, yes, you're always sampling.)


Great point. (Y) this feels like a gauge / counter distinction?

You could get pedantic at this point and say that because computers are fundamentally discrete machines, it is technically possible to sample the CPU usage at every tick :p


> this feels like a gauge / counter distinction?

I'm not particularly fond of those terms; I don't find them descriptive. I don't think they're quite the right terms, either. For example, queue length is fundamentally not a continuous metric: it only changes when the length of the queue does, and if you record those events as they happen, you can get the exact graph of the queue length without there being a sampling frequency. But it is a "gauge" in Prom's language.

But yes, a lot of the metrics surrounding event-like data probably do fall into Prom's "counter".

> sample the CPU usage at every tick :p

Linux has been tickless for years. There's still going to be a time at which the scheduler kicks in, of course, but if the core isn't contested, schedulers these days aren't necessarily going to even trigger. The process on that core can simply run until it sleeps. (Assuming no other process transitions to runnable, and there's no other core available for that process.)

As another poster points out, if we had enough insight into the kernel, though, even then we could get the discrete events of when the scheduler deschedules a core. So, technically we don't have to sample. But the practical APIs we're going to use are sampling ones.


I would like to subscribe to your newsletter.


It took the world decades to develop widely accepted standards for working with relational data and SQL. I believe we are at the early stages of doing the same with event data and sequence analytics. It is starting to simultaneously emerge in many different fields:

- eng observability (traces at Datadog, Sumologic, etc)

- operational research (process mining at Celonis)

- product analytics (funnels at Amplitude, Mixpanel)

As with every new field, there are a lot of different and overlapping terms being suggested and explored at the same time.

We are trying to contribute to the field with a deep fundamental approach at Motif Analytics, including a purpose-built set of core sequence operations, rich flow visualizations, a pattern matching query engine, and foundational AI models on event sequences [1].

Fun fact: creators of Scuba turned it into a startup Interana (acquired by Twitter), who we took a lot of inspiration from for Motif's query engine.

[1] https://motifanalytics.com


Thanks for sharing, didn't know about Motif


At the company I work for we send JSON to Kafka and subsequently to Elasticsearch, with great effect. That's basically 'wide events'. The magical thing about hooking up a bunch of pipelines with Kafka is that all of a sudden your observability/metrics system becomes an amazing API for extending systems with additional automations. Want to do something when a router connects to a network? Just subscribe to this Kafka topic here. It doesn't matter that the topic was originally intended just to log some events. We even created an open source library for writing and running these pipelines in Jupyter. Here's a super simple example https://github.com/bitswan-space/BitSwan/blob/master/example...

People tend to think kafka is hard, but as you can see from the example, it can be extremely easy.
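
The consumer side of that pattern really is small. A minimal sketch in Go using the segmentio/kafka-go client (the topic, group, and field names are all made up):

  package main

  import (
      "context"
      "encoding/json"
      "log"

      "github.com/segmentio/kafka-go"
  )

  // RouterEvent is a hypothetical wide event that doubles as an automation trigger.
  type RouterEvent struct {
      EventType string `json:"event_type"`
      RouterID  string `json:"router_id"`
  }

  func main() {
      r := kafka.NewReader(kafka.ReaderConfig{
          Brokers: []string{"localhost:9092"},
          Topic:   "network-events",
          GroupID: "router-autoconfig", // its own consumer group, so observability readers are unaffected
      })
      defer r.Close()

      for {
          msg, err := r.ReadMessage(context.Background())
          if err != nil {
              log.Fatal(err)
          }
          var ev RouterEvent
          if err := json.Unmarshal(msg.Value, &ev); err != nil {
              continue // not every event on the topic has to concern us
          }
          if ev.EventType == "router_connected" {
              log.Printf("provisioning router %s", ev.RouterID) // kick off the automation here
          }
      }
  }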


This works well for a while. But eventually you get big, and have little to no idea of what is in your downstream. Then every single format change in any event you write must be treated like open heart surgery, because tracing your data dependencies is unreliable.

Sometimes it seems that it's fixable by 'just having a list of people listening', and then you look and all that some of them do is mildly transform your data and pass it along. It doesn't take long before people realize that 'just logging some events' is making future promises to other teams you don't know about, and people start being terrified of emitting anything.

This is a story I've seen in at least 4 places in my career. Making data available to other people is not any less scary in kafka than it was back in the days where applications shared a giant database, and you'd see yearlong projects to do some mild changes to a data model, which was originally designed in 5 minutes.

As for Kafka being easy, it's not quite as hard as some people say, but it's both a pub/sub system and a distributed database. When your clusters get large, it definitely isn't easy.


> This works well for a while. But eventually you get big, and have little to no idea of what is in your downstream. Then every single format change in any event you write must be treated like open heart surgery, because tracing your data dependencies is unreliable.

Yeah, I'd always use protobuf or similar rather than JSON for that reason, and if you need a truly breaking change I'd emit a new version of the events to a new topic rather than trying to migrate the existing one in place. It's not actually so costly to keep writing events to an old topic (and if you really want you can move that part into a separate adapter process that reads your new topic and writes to your old one). Or you can do the whole avro/schema-registry stuff if you prefer.

> Making data available to other people is not any less scary in kafka than it was back in the days where applications shared a giant database

It should be significantly less scary: it's impossible to mutate data in-place, foreign key issues are something you go back and fix and reprocess rather than something that takes down your OLTP system, schema changes are better-understood and less big-bang, event streams that are generated by transforming another event stream are completely indistinguishable from "original" event streams as opposed to views being sort-of-like-tables but having all sorts of caveats and gotchas.

> As for kafka being easy, It's not quite as hard as some people say, but it's both a pub sub system and a distributed database. When your clusters get large, it definitely isn't easy.

There are hard parts but also parts that are easier than a traditional database. There's no query planner, no MVCC, no locks, no deadlocks, no isolation levels, indices are not magic, ...


I think you're missing that person's point though. The evolution implied in the thread was:

1. Write "logging" data (observability, whatever) 2. Someone else starts using that to drive behavior 3. Change your logging, because it's just logging right? And stuff breaks.

To state it another way, anything you're emitting, _even internal logging_, is part of your API/contract, and therefore can't be changed carelessly. That problem is the same no matter what technology you use.


I think this is the crux of it: if something works for a while, then actually that's fine; as an industry we over-index on scale and scare new developers towards complexity. The converse is true too: what works at scale doesn't at non-scale - not because of the tech, but because holistically you're asking for a lot: a lot of knowledge, a lot of complex tech to be deployed by a small team.


> This works well for a while. But eventually you get big, and have little to no idea of what is in your downstream.

All you have to do is pass around trace baggage headers, right?


I'm glad that works for you but to me it sounds really expensive. At small scale you can do this any way you want but if you build an observability system with linear cost and a high coefficient it will become an issue if you run into some success.


The only expensive part is the hardware for the Elasticsearch servers. Kafka is cheap to run. We have an on-prem Elasticsearch DB pulling in tens of thousands of events per second. On-prem servers aren't that expensive. It's really just 6 servers with 20TB each and another 40TB for backups. And it's not like you have to store everything forever... Compare that data flow to everyone watching YouTube all the time. It's really nothing...


> We have an on-prem Elasticsearch DB pulling in tens of thousands of events per second.

Many kinds of systems can work great at this scale, which is non-trivial but ultimately not particularly large.

> It's really just 6 servers with 20tb each and another 40tb for backups.

Hopefully you can ingest and store and query O(10k) RPS with an order of magnitude less resources than this?


I can name a single company in my area that runs their own servers, and they've been in the middle of a migration to the cloud for the past five years.


it's really down to company culture: how many MBAs and PMs it normalizes.


Zookeeper in production can be a real pain to maintain…


Check out Redpanda if you don't like Zookeeper. Raft-based, no JVM, lower-cost, and a simpler-to-run Kafka-compatible system.


You haven't needed Zookeeper for Kafka for a while now.


Good to hear; I haven't dealt with Kafka for years. Zookeeper was a pain to deal with, glad to see there are alternatives.


We use wide events at work (or really “structured logging” or really “your log system has fields”) and they are great.

But they aren’t a replacement for metrics because metrics are so god damn cheap.

And while I've never used a log system with traces, every logging setup I've ever used has had request/correlation IDs to generate a trace, because sometimes you just wanna look up a flow and see it without spending time digging through wide events/your log system. If you aren't looking up logs very often, then yeah, browsing through structured logs isn't that bad, but do it often and it's just annoying…


This person is simply misinformed. I worked at meta and used scuba, and it's like 6/10 (which makes it one of meta's best tools).

A tool like splunk can do everything scuba can do and a million things it can't. Sumologic can too.

The reason that splunk/sumologic are so much better than scuba is that they have open-ended query languages rather than this on-rails "only ever do one group-by". Just for example, if you wanted to dynamically extract a field at query-time based on the difference of two values and group by that, that's something you can do trivially in splunk/sumo.

I could write a whole essay on the topic really, but the gist of it is you need a full-scaled open-ended language for advanced querying because 1% of the time you need to do weird stuff like count-by -> a second count-by.

What I will agree with is that traces/metrics do not inherently give you this ability, but traces absolutely could if there was a platform with a powerful enough query language for it (e.g. give me all requests that go through 4 services, have errors on service 3 but not 4, and are associated with userId 123 on service 1)


I didn't say such tools don't exist. Honeycomb, mentioned in the post, is exactly Scuba fwiw. I said that over-focusing on traces / metrics and "logs" (in the classical understanding) hides the true power of wide events, and they're not being used widely.

Also IMO an open-ended query language doesn't help with quick exploration, it's a barrier. UI and ease of use are paramount for adoption. To achieve arbitrary queries one can dump everything into something like ClickHouse and query it with SQL. This would be a nice addition to any observability stack, to cover the small percentage of very deep explorations.


It looks like a "wide-event" is just a structured log, you can send any log containing json to sumo/splunk at it'll parse it out as fields for you. So as I'm understanding it you're advocating for structured logs (which is fine, those are great).

If you want a point-and-click interface to log searching, I agree that some percentage of people like to start there (and I think splunk may even have that too), so I'm not opposed to it existing at all, but I feel very strongly that having the more sophisticated capabilities, for when you want to move beyond point-and-click, is a requirement.


It is exactly a structured log, or a "log" in OpenTelemetry.

To make it easier for myself, I think of spans also as structured logs with a schema that everyone has agreed on, which makes it possible to trace requests across multiple services/clients. It's probably more than that, but I don't need academic precision to see how this is more useful during livesite investigations than simply querying logs with unaligned schemas.


Yes, structured log exactly. The reason I prefer "wide event" as a term is that it has this "wide" component, which serves two purposes:

- it highlights the intention of storing as much context as possible

- it also hints at the implementation of a system that would serve them: one likely needs to use columnar storage to store wide events, there is no way around it

But just a personal preference in the end.


Honeycomb was in fact built to originally be a replacement for scuba, by people who found it so valuable and missed it.


> Just for example, if you wanted to dynamically extract a field at query-time based on the difference of two values and group by that, that's something you can do trivially in splunk/sumo.

You can trivially do that in Scuba too using a derived column, which is supported in the UI. If you need more complex stuff you can write your query in SQL and still use all the supported UI visualizations. And for even more complex stuff, for example if you need joins, the data is usually mirrored to Presto, so you can write arbitrarily complex queries and still visualize the results in the Scuba UI.

I'm not sure if your assessment of Scuba is based on full knowledge of what you can do with it.


I've seen some really nice UI query systems that have plenty of expressive power for 99.9% of queries but also include an escape hatch to allow you to use a query language. Mixpanel has one that I really like and I've seen non-techy business types go to town on it no problem.

The other option I've seen is a guided query language with helpful UI affordances as you type. I've seen it on Datadog as well as Jira's advanced search.


mipsytipsy also worked with scuba, and she specifically designed honeycomb, an observability tool based on wide events, around the motivating idea that a very imperfect tool like scuba could do lots of things better than much more polished and better engineered tools which didn't use the wide events approach.


> if there was a platfom with a powerful enough query-language for it (e.g. give me all requests that go through 4 services, have errors on service 3 but not 4, and are associated with userId 123 on service 1)

Datadog has a new feature for querying the whole trace call graph that they seem to be testing out. Sounds like precisely the thing you're describing.


What's the difference between a Wide Event and a structured log?


Nothing; what the author is calling a wide event is a structured log.

> Structured Log which is pretty much the Wide Event

The OP's point is that you can just capture anything that one might call a metric/span/trace in a structured log.

So, it's a bit more specific, in some senses: a "metric" is a structured log bearing some value. E.g.,

  {
    "@timestamp": "…",
    "query_kind": "list_people_in_org",
    "query_time_s": .051
  }
Is a structured log. It's also a single datapoint along a timeseries. The metric might be something like "query time", with a dimension "query_kind".

ELK stacks let you visualize such structured logs into graphs with only moderate pain.

A "span" is a structured log with start/end timestamps.

A trace is just a tree of spans, built out of many structured logs. In a prior life, we'd have,

  {
    "span_id": "root_span_id/subspan_a/subsub_span_b",
    …
  }
You can pull out whole traces by doing a "span_id startswith 'root_span_id'" query in whatever your query language is. You can pull out sub-traces similarly. It works even across microservice boundaries, if you coordinate the span IDs appropriately.

I think where I depart from the OP: it takes too much space. The OP advocates sampling; I'd rather a specialized data structure for metrics that can store a data point efficiently. I.e., at least in sizeof(timestamp || f32) bytes, ideally better, and then I can just not worry about sampling. (In a sense; if you're watching a continuous value, you're always sampling. But for discrete events that you're recording, I can just record all the events. But I'll only get a number. Sometimes, that's okay. I feel like this is a bit spiritually different than the sampling the OP is discussing.)


Thanks. I don't know why new unpopular names are used for well-known things. Googling for "Wide Events" gives millions of results for "City-wide events" etc. Probably just clickbaiting--what's a Wide Event?


So, this is just my own view, but here's how I see it:

Structured logs are ... well, logs that are structured (probably json) and machines read 'em to offer nice analysis capabilities. We've largely as an industry decided that these are great, and we should use them instead of unstructured logs if we can.

But just because a log is structured it doesn't mean it has what you need to effectively debug an issue! You usually need app-specific data added as key-value pairs to that log so that you can correlate stuff like "stuff is slow on this endpoint" with "and these are the device versions and user_agent strings that correlate the most with that".

And a lot of developers are used to dumping some of that data into random logs, but without making that data a part of a "wider" structured log, it can be really hard to correlate the behavior you don't like with other data that aids with debugging.

Hence, a desire to call them "wide events".

Anyways, I'm not particularly sold on that term either. But I do think there's some need to describe not only structured logs, but structured logs that contain all the rich info you need to debug most stuff in production with general ease.


Structured logs are sensible, because you're not setting yourself up to need to write a billion little parsers down the line to attempt to parse unstructured logs, if they're even still parsable.

JSON, or usually ndjson, is just simple, and widely supported. It's not the only format, nor even the best format. But it is easily produced, and better than no format at all.

> But just because a log is structured it doesn't mean it has what you need to effectively debug an issue! You usually need app-specific data added as key-value pairs to that log so that you can correlate stuff like "stuff is slow on this endpoint" with "and these are the device versions and user_agent strings that correlate the most with that".

Yes, you have to instrument the app. There's no getting around that. Experience will tell you what to log.

> And a lot of developers are used to dumping some of that data into random logs, but without making that data a part of a "wider" structured log, it can be really hard to correlate the behavior you don't like with other data that aids with debugging.

Again, experience. Log correlation is easier with what the article calls a "SpanId" or a "TraceId", some places call it a "RequestId" or a "CorrelationId", just some way to say "all logs forming a particular request" so that you can just read a request, from start to end. A simple thing … but you have to log it, or you won't have it.

Agreement across services about naming it all the same thing in your log store of choice also helps, so that if you need to trace it across services, you can. This really isn't hard, but it depends IME on how well your organization is otherwise functioning. E.g., do engineers talk & plan? Can they say "X would help with Y" and then have people just go "yeah, it would. Done." or is it a long drawn out fight because technical leadership can't fathom basic stuff like what a RequestId does.

> But I do think there's some need to describe not only structured logs, but structured logs that contain all the rich info you need to debug most stuff in production with general ease.

I feel like you're just saying "log the data we actually need", which isn't a terribly actionable thing if you don't already know the answer. If I had to make a recommendation: outcomes & latencies of I/O.

Otherwise, put something into production, & be on-call for it.


Yeah, a lot of what you're saying is what OpenTelemetry tries to solve for, and does so fairly well. If I were at an organization and had the ability and time, I'd move towards tracing and correlating existing logs with those traces, then adding new stuff to spans rather than creating new logs. Tracing as the baseline + common naming conventions can go a long way.


I assumed "wide event" is the internal terminology at FB, given the OP.

I've never personally heard it, but in the context of the OP / metrics/logs/traces, it's a pretty self-descriptive term.


Haha, I think this term actually originated with the Honeycomb team.

Why I prefer "wide event" over "structured log" as a term because it has this "wide" component that serves for 2 purposes: - it highlights the intention of storing as much context as possible - it also hints on the implementation for a system that would serve them. One likely need to use columnar storage to store wide events, there is no way around it

But just a personal preference in the end.


How do you keep metadata around well enough to log in a structured fashion between apps? What if one app is a queue listener, which calls an HTTP service, etc. etc.

I personally think the challenge is in passing around the metadata.


This is solved using span IDs commonly used in distributed tracing.


This is basically a metric with tags. The only difference is that a metric has a main unit it measures.

In the end anything can be represented as a structured log.

A span is NOT what the OP calls a “system wide event”. A span has a begin and end time. What he/she describes doesn’t have that.

In the end, giving the different kinds of instrumentation their own names makes sense, mainly for processing them / rendering them / alerting on them.


FWIW I think X-Ray has everything you need; it's just that AWS tooling doesn't give you much ability to aggregate over X-Ray bundles. I wrote a tool to help bulk-load X-Ray samples into a local, in-browser DuckDB and then slice and dice them in real-time interactive visualisations. It also includes the ability to generate a flamegraph over the selected traces. All this great data is already in an AWS account; we just need better tools to make use of it.

https://observablehq.com/@tomlarkworthy/x-ray-slurper


So, what is the closest thing in the open source world to what the author describes? (Setting aside the question of is it right for you, which, of course, depends.)


Any OLAP database that accepts unstructured data can be used in this manner.

The ELK stack is a popular choice, albeit with a focus on search rather than OLAP.

If SaaS is an option, a simple starting point in AWS might be Data Firehose into S3 with Athena. Snowflake can load and query the data too. All of these tools have multiple frontend options, with cost roughly proportional to user-friendliness.
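
As a hedged sketch of the Athena route (the bucket, table, and column names are made up, and the exact SerDe depends on how your JSON is laid out):

    CREATE EXTERNAL TABLE wide_events (
      ts          string,
      service     string,
      endpoint    string,
      user_agent  string,
      duration_ms double
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-observability-bucket/events/';

    -- which clients are slow on one endpoint?
    SELECT user_agent, approx_percentile(duration_ms, 0.99) AS p99_ms
    FROM wide_events
    WHERE endpoint = '/checkout'
    GROUP BY user_agent
    ORDER BY p99_ms DESC;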

I honestly just do this in PostgreSQL until my project outgrows it. Create a table with a JSONB column and as few indexes as possible to improve write throughput. Cover a timestamp column with a BRIN index to filter by date range.
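
A minimal sketch of that setup (names are illustrative):

    CREATE TABLE wide_events (
      ts    timestamptz NOT NULL DEFAULT now(),
      event jsonb       NOT NULL
    );

    -- BRIN stays tiny and cheap to maintain, and suits append-only, time-ordered data
    CREATE INDEX wide_events_ts_brin ON wide_events USING brin (ts);

    -- e.g. which user agents are slow on an endpoint in the last hour
    SELECT event->>'user_agent' AS user_agent, count(*)
    FROM wide_events
    WHERE ts > now() - interval '1 hour'
      AND event->>'endpoint' = '/checkout'
      AND (event->>'duration_ms')::numeric > 500
    GROUP BY 1
    ORDER BY 2 DESC;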


brin is your friend for logs, for sure.


Where I work we’ve set up OpenTelemetry SDK in the applications to expose traces, logs and metrics.

Grafana agent as OTEL collector on the application hosts, Grafana Tempo as backend for traces, Loki for logs and Prometheus for Metrics.

The cool thing about Tempo is that it generates metrics for ingested spans and their labels (spanmetrics), so this allows us to explore the "unknown unknowns", as the author calls them, in a very cost-efficient way.


Still early days, but Grafana's tracing is getting there. https://grafana.com/docs/grafana/latest/panels-visualization...


What format/library/protocol does your app speak to get the traces into Grafana Traces? Tempo?


https://grafana.com/oss/tempo/

> Tempo can ingest common open source tracing protocols, including Jaeger, Zipkin, and OpenTelemetry.


This seems like event sourcing with a nice tool to inspect, filter and visualize the event stream. The sampling rate idea is a decent tactic I hadn't heard of.


I like logs. Unlike most people selling and using observability platforms, most of the software I write is run by other people. That means it can't send me traces and I can't scrape it for metrics, but I still have to figure out and fix their problems. To me, logs are the answer. Logs are easy to pass around, and you can put whatever you want in there. I have libraries for metrics and traces, and just parse them out of the logs when that sort of presentation would be useful. (Yes, we do sampling as well.)

I keep hearing that this doesn't scale. When I worked at Google, we used this sort of system to monitor our Google Fiber devices. They just uploaded their logs every minute (stored in memory, held in memory after a warm reboot thanks to a custom linux kernel with printk_persist), and then my software processed them into metrics for the "fast query" monitoring systems. The most important metrics fed into alerts, but it didn't take very much time to just re-read all the logs if you wanted to add something new. Amazingly, the first version of this system ran on a single machine... 1 Go program handling 10,000qps of log uploads and analysis. I eventually distributed it to survive machine and datacenter failures, but it ultimately isn't that computationally intensive. The point is, it kind of scales OK. Up to 10s of terabytes a day, it's something you don't even have to think about except for the storage cost.

At some point it does make sense to move things into better databases than logs; you want to be alerted by your monitoring system that 99%-ile latency is high, then look in Jaeger for long-running traces, then take the trace ID and search your logs for it. If you start with logs, you have that capability. If you start with something else, then you just have "the program is broken, good luck" and you have to guess what the problem is whenever you debug. Ideally, your program would just tell you what's broken. That's what logs are.

One place where people get burned with logs is not being careful about what to log. Logs are the primary user interface for operators of your software (i.e. you during your on-call week), and that task deserves the attention that any other user interface task demands. People often start by logging too much, then get tired of "spam", and end up not logging enough. Then a problem occurs and the logs are outright misleading. (My favorite is event failures that are retried, but the retry isn't logged anywhere. You end up seeing "ERROR foobar attempt 1/3 failed" and have no way of knowing that attempt 2/3 succeeded a millisecond after that log line.)

For the gophers around, here's what I do for traces: https://github.com/pachyderm/pachyderm/blob/master/src/inter... and metrics: https://github.com/pachyderm/pachyderm/blob/master/src/inter.... If you have a pipeline for storing and retrieving logs (which is exactly the case for this particular piece of software), now you have metrics and traces. It's great! I just need to write the thing to turn a set of log files into a UI that looks like Jaeger and Prometheus ;) My favorite part is that I don't need to care about the cardinality of metrics; every RPC gets its own set of metrics. So I can write a quick jq program to figure out how much bandwidth the entire system is using, or I can look at how much bandwidth one request is using. (meters logs every X bytes, and log entries have timestamps.)

I think since we've added this capability to our system, incidents are most often resolved with "that's fixed in the next patch release" instead of multiple iterations "can you try this custom build and take another debug dump". Very enjoyable.


I'm also a fan of logs. If you have some more examples of how you typically log things to be most effective, I'd love to see 'em! I'm still finding my sense for when it's too much versus too little. Best way to incorporate runtime data. How to structure log messages to work well with other systems. Hearing from others and seeing battle tested examples would surely help. Or if you're down to chat a bit I can send you an email and continue the conversation. Will check out Pachyderm in the meantime~


You can send me an email.

To me the golden rule is "show your work". Every operation that can start and end should log the start and the end. If your process is using CPU but not logging anything, something has gone wrong. Aim to log something about ongoing requests/operations every second or so. (This is spammy if you're doing 100,000 things concurrently. I use zap and zap's log sampling keys on the message; so if your message is "incoming request" and 100,000 of them are arriving per second, you can have it only write the logs for one of them each second. I hate to sample, but it's a necessity for large instances and hasn't caused me any problems yet.)

I also like to keep log levels simple; DEBUG for things interesting to the dev team, INFO for things interesting to the operations team, ERROR for things that require human intervention. People often ask me "why don't we have a WARN" level, and it's because I think warnings are either to be ignored, or are fatal. Warnings ("your object storage configuration will be deprecated in 2.10 and removed in 2.11, please migrate according to these docs") should appear in the user-facing UI, not in the logs. They do require human action eventually.

Overall, I'm more of a "print" debugger than a "step through the code with breakpoints" debugger. To me, this is an essential skill when you're running code on someone else's infrastructure; you will be 1000 times slower at operating the debugger when you are telling someone via a support ticket which commands to run. (Even if the servers are yours, I don't love sshing into production and mutating it.) So ultimately, the logs need to collect whatever you'd be looking for if you had a reproduction locally and were trying to figure out the problem. It's an art and not a science; you will get it wrong sometimes, and your resolution for the underlying bug will include better observability as part of the fix. This is usually enough to never have a problem with that subsystem again ;)


As described ('dump everything'), this sounds like a privacy nightmare if there aren't guardrails.

Can pretty easily achieve this with structured logging in GCP with their metrics explorer. Pretty cheaply I might add. Sentry can also do a bit of this if you're on something like fly.io (they offer a year free).

I don't think either would completely replace tracing in a complex system for me. At least not in the contexts I've worked in.


The SQL equivalent in the sampling-rate example should sum the inverse of the sampling rate, since each stored row stands in for 1/samplingRate original events:

SELECT SUM(1 / samplingRate) FROM AdImpressions WHERE IsTest = False


Wide events are fine until someone puts personally identifiable information (PII) in them. Then you're in a bit of a mess as you've presumably taken PII out of an environment with one set of access controls, and into a separate, different environment, with access controls that are for a different purpose than required by the data.


Logging and tracing have this problem too.


Wide events as described in this article seem to amount to structured logging, just with a looser dumping ground. So yeah, to an extent it has this problem, just more so.

How does tracing? Are folks adding PII to spans? I suppose you could but I'm not sure why.


Using the ELK stack for almost a decade for somewhat-wide events with no sampling, at nowhere near Meta scale and a few GB/day, has been absolutely affordable and super fast. Unfortunately Kibana was a bit better/easier in the old versions than it is nowadays, but it's still pretty straightforward to get everything out of it.


Great article, here is a Python notebook I created earlier to show you how you can capture such wide events:

https://colab.research.google.com/drive/1Y65qXXogoDgOnXFBDyF...


begging people to recognize that a person who sells a solution is going to view these problems through the lens of being rewarded for applying their solution to your problem, even if it's not appropriate.

> Yet, per my own experience it’s still extremely hard to explain what does Charity meant by “logs are thrash”, let alone the fact that logs and traces are essentially the same things. Why is everyone so confused?

Charity is not confused, Charity is incentivized. What she means by "logs are trash" is "I do not sell a logging product". (and, to be clear, I'm only naming Charity individually here because that's who the author named in their article.)

> When I was working at Meta, I wasn’t aware that I was privileged to be using the best observability system ever.

The observability system that is appropriate for Meta is not necessarily appropriate for your project. Those tools are cool but also require a pretty serious investment to build and operate correctly. It's very easy to wade into a cardinality explosion problem when tagging and indexing everything you can imagine, it's very easy to wade into problems regarding mixed retention policies when some events are important and others are less-important, it's very easy to wade into a latency-sensitivity issue if you're building a log/event collection infra that you don't allow to ever lose data, etc. As it turns out, observability is a large topic.

The idea that there's one "best" way to do observability is a little ridiculous. Like when I worked at Etsy some of the data was literally money, when I worked at Jackbox Games we made fart joke games (Quiplash, Drawful, Fibbage, You Don't Know Jack, etc) and the infrastructure was nothing but pure cost. The observability needs of those two orgs were phenomenally different, because the products were different, the revenue models were different, the needs of the users were different, etc.

Also this notion that "all you need is wide events" is the answer seems ... really shallow. A data point is an unordered set of key-value pairs? That's how ... a LOT of logging, metrics, and tracing infra expresses things at the level of an individual record/event. The difference is in the relationships between the keys and values, the relationships between the individual records, etc.

and "stop sampling" is just a bizarre marketing angle. If you have 1 million records or 10 million records and you get the same squiggly line out of analyzing it, congrats you have inflated the size of the data that nobody ever looks at. There is only one person who this benefits and it's the person who charges you for the pipeline, which is exactly why people who sell a pipeline are incentivized to tell you that sampling is bad: if you are sampling, you are sending and storing and querying fewer data points, so they are charging you less money. They are getting paid to tell you that sampling is bad. Sampling is not good or bad, sampling is sampling. The reality is that in a lot of these systems, the vast majority of the information will never, ever be looked at or used. Whether or not that matters is entirely context dependent.


For this:

> There is only one person who this benefits and it's the person who charges you for the pipeline, which is exactly why people who sell a pipeline are incentivized to tell you that sampling is bad

I largely agree, and I'll say that at least with Honeycomb (since it's mentioned by the author) we make sampling a key component of pretty much any deal before anything gets signed. For small stuff this clearly doesn't matter so much, but it basically boils down to:

- Most of your data is probably uninteresting because it's uniform and represents success cases, so you just need a statistically significant sampling to get a sense of what "okay" means for comparisons

- Whenever there's an error or high latency, there's almost always something interesting in there, and you probably want all of it

And so this typically works out to generating an order of magnitude or two more data than you actually need to get an accurate view of what's going on at any point in time.
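
To make that concrete, here's a hedged sketch (not any vendor's actual implementation) of how a sampled dataset can still answer unsampled questions, assuming each stored event carries the 1-in-N sample rate it was kept at:

    -- sample_rate = N means "this row stands in for roughly N original events"
    SELECT endpoint,
           SUM(sample_rate) AS est_requests,
           SUM(CASE WHEN status_code >= 500 THEN sample_rate ELSE 0 END) AS est_errors
    FROM events
    GROUP BY endpoint;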

And so when you do this, you can (and probably should?) pack your events/logs/traces/whatever-you-call-it with a bunch of data.

There are some cases where you can't do this, though. Some people want to be able to do something like plug in a customer ID and dig up the exact trace that represents something they complained about. Or there's some compliance to adhere to, legal or otherwise, where it's still less money to pay for everything unsampled than it is to deal with the consequences of non-compliance. But for most organizations I'd say what I mentioned above holds true.

...but that's just one of the several ways that some folks will frame up Observability. The term has been kleenexed a bit so it now means whatever any vendor says it means, and they all say varyingly different things.


> and "stop sampling" is just a bizarre marketing angle

Wait, where did I mention stopping sampling? :) The opposite: the article is praising the native sampling Scuba has.


You could also use a unified, OTel native platform like https://www.kloudmate.com instead of setting up Grafana, Prometheus, Loki, separately.


Is this not just structured logging? I’m wondering whether the author has used tracing tools much, or whether they’re truly trying to understand modern observability through OpenTelemetry documentation.


Man, this whole "All you need is XYZ" framing is turning out to be as irritating and overused as "XYZ considered harmful" from back in the day.


Has anyone built an open-source version of this and have a blog post around it? Curious about implementation to see how you keep storage tight and querying still fast.


I think ClickHouse is becoming a default storage for observability nowadays: https://clickhouse.com/use-cases/logging-and-metrics

And there are quite a few solutions on top of it.

A couple of examples that seem interesting (though I haven't used them in real life):

https://coroot.com/ https://qryn.metrico.in/#/
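
If you want to roll it yourself, here's a hedged sketch of what a wide-event table in ClickHouse might look like (the names and columns are illustrative):

    CREATE TABLE wide_events (
      ts         DateTime64(3),
      service    LowCardinality(String),
      trace_id   String,
      attributes Map(String, String)  -- the "wide" part: arbitrary key-value context
    )
    ENGINE = MergeTree
    ORDER BY (service, ts);

    -- e.g. p99 latency by user agent for one service over the last hour
    SELECT attributes['user_agent'] AS ua,
           quantile(0.99)(toFloat64OrZero(attributes['duration_ms'])) AS p99_ms
    FROM wide_events
    WHERE service = 'checkout'
      AND ts > now() - INTERVAL 1 HOUR
    GROUP BY ua
    ORDER BY p99_ms DESC;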


We are currently using Elasticsearch, and can only store a tiny amount of the data because of the volume we generate: 5 days at most.

So ClickHouse would replace Elasticsearch in this context, providing the same data storage but with better compression, and without holding data in memory?

Is that correct?


I think the storage architectures in ClickHouse and Elastic are very different. And I think compression in ClickHouse can be really damn good.

Don't have a good comparison at hand, though, but here are a few random links on the topic:

https://engineeringat.axis.com/schema-changes-clickhouse/ https://clickhouse.com/blog/contentsquare-migration-from-ela... https://clickhouse.com/blog/100x-faster-graphql-hive-migrati... https://blog.zomato.com/building-a-cost-effective-logging-pl...


I am the author of the first link.

We successfully replaced Elasticsearch with ClickHouse within a year. We went from having difficulty managing 3 months of data to storing 5+ years of data. We also had difficulty onboarding users into the Kibana and Elasticsearch world; this was not such a big problem with ClickHouse, since most developers feel comfortable with SQL.

This change, together with Apache Superset for the BI layer, made a huge impact on the number of internal users who could extract value from the data collected. We went from around 150 to 800 internal users.

We have not yet managed to bring everything into a wide event table, but as long as the logs can be joined with the metrics under an interface the developers feel comfortable with, and the solution is cost-effective enough to allow all relevant context to be collected, you will get far, IMO.


Thanks for sharing!

> This change, together with Apache Superset for the BI layer, made a huge impact on the number of internal users who could extract value from the data collected. We went from around 150 to 800 internal users.

Really impressive results.


Thank you for sharing!


New Relic supports this in the form of custom events. I have used it and it works, but it is very expensive. An alternative is to use ClickHouse directly.


Does it support "native sampling" described in the article? This is really important to keep the cost low.


This looks like structured logging and piping those logs to Splunk or am I missing something?


OTel is working on its events spec: https://github.com/open-telemetry/community/issues/1688


Heh, this is an unfortunate consequence of naming.

OTel Events are just OTel logs with a name. An OTel log is a log body with a trace ID, span ID, severity, etc. But that's also an event as per the author's definition.

In OTel, a span is just a structured log, which is also an event (as per the author's definition of event). So is a Span Event, which is a log-like entity that you can produce in the context of a span. And OTel metrics produce what are called "metric events", which are also events.

IMO it's one of those things that's horribly confusing until one day it's not, and then everything starts looking like how the author described.


Observability as a shared concept has followed Agile and DevOps.

Something with a real meaning that enables a step-change in development practices. Adoption is organic initially because the pain it solves is very real.

But as awareness of the idea grows it threatens established institutions and vendors, who must co-opt the concept and redefine it such that they are included.

If they can’t be explicitly included (logs, metrics, traces)[0], then they at least make sure the definition becomes so vague and confused that they are not explicitly excluded[1].

Wide events and a good means to query them covers everything, but not if you as a vendor cannot store and query wide events.

[0] as the article notes, one of these is not like the other. [1] Is Scrum Agile? What do you mean a standup can’t go for an hour? See also DevOps as a role.


Incredible what you can do with infinite money!

For everyone else, more specific data structures, sampling and careful consideration of what to record are essential.


The key is that you pay for the bandwidth, sampling, cardinality, and schema considerations somewhere in the system. Depending on the problem, you may be able to get away with dealing with those issues later vs. earlier, but at some point you start dealing with required fields, aggregation, etc.

I own a system in one of the tech giants where sampling is counterproductive to our problem domain, and where in the pipeline we deal with each of those issues is our bread and butter problem that can sometimes swing costs by $millions.


Hang on, you don't need infinite money if you've got a sampling rate, do you? Drop the sampling rate (mentioned in the article: they do use sampling) from 0.01 to 0.001 and you've reduced the data ingress by a factor of ten.


> just put it there, it might be useful later

> Also note that we have never mentioned anything about cardinality. Because it doesn’t matter - any field can be of any cardinality. Scuba works with raw events and doesn’t pre-aggregate anything, and so cardinality is not an issue.

This is how we end up with very large, very expensive data swamps.


That depends on the sampling rate, no? I would much rather have a rich log record sampled at 1% than more records that don't contain enough info to debug.


It is a tragedy of the current generation of observability systems that they have inculcated the notion that telemetry data should be sampled. Absolute nonsense.


The people feeling the pain of (and paying for) the expensive data swamp are often not the same people who are yolo'ing the sample rate to 100% in their apps, because why wouldn't you want to store every event?

Put another way, you're in charge of a large telemetry event sink. How do you incentivise the correct sampling behaviour by your users?


Don't let the user pick the sampling rate. In Honeycomb land this is called the EMA Dynamic Sampler.

https://docs.honeycomb.io/manage-data-volume/refinery/sampli...


You should never need to sample telemetry data.


A metrics sample rate, yes, but sampling logs? When an end-to-end transaction for a very important task breaks, do I get *some* breadcrumbs to debug it?


I have used that approach before with Sentry. It was a non-issue. It depends on the nature of the project, of course; we had a system that was running every second, so when it failed it generated a lot of data.


I agree. Sampling logs.. sounds dangerous. Obviously every system is different.

At least in GCP you can apply a filter to prevent ingestion and set different expiries on log buckets. This can help control costs without missing important entries.


Sampling can be smart, e.g. based on some field all events have (can be called traceId, haha).
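
A hedged sketch of that idea in PostgreSQL flavor (hypothetical tables; in practice the keep/drop decision usually happens in the app or the pipeline, not in SQL): hash the trace ID and keep a fixed slice of the hash space, so every event belonging to a kept trace survives together, then weight by the inverse rate at query time.

    -- keep roughly 1 in 100 traces, whole traces at a time
    -- (hashtext() is a PostgreSQL built-in; other stores have an equivalent stable hash)
    INSERT INTO sampled_events
    SELECT * FROM incoming_events
    WHERE abs(hashtext(trace_id)) % 100 = 0;

    -- each kept event stands in for ~100 originals
    SELECT endpoint, count(*) * 100 AS estimated_requests
    FROM sampled_events
    GROUP BY endpoint;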


The best part of this post is where they quote a failed SaaS trying to explain why a successful SaaS is wrong. Anything for an edge, even if it's not useful.


Has Honeycomb failed?


It’d be news to me and my paycheck.



