
This is a hilarious read but I think the author is too optimistic about the state of humanity. Marl isn't the "marginal" user; Marl is the "average" user. If the average user actually cared about deep and meaningful content, then any A/B test that throws her under the bus in order to please Marl would show bad data, and the proposed change would be killed.

Yes, the author tries to hand-wave this away as "product is sticky", but I really doubt this is the main reason.

No, the truth is far scarier. The average user doesn't want deep and meaningful content. The average user is Marl. That is why every product, no matter how noble it starts off, eventually degenerates into Marl-fodder. Because that's where the money is. The only way to escape this is to take a huge pay cut and work at a company that doesn't care about growing profits. Go ahead, you first.

Finally, let's be honest. Marl isn't some obnoxious bozo. You and I are both Marl. That's why we're here in the HN comments. You are Marl, I am Marl, the world is Marl, and it's getting Marlier every day.



A/B tests, as they are run by current software companies, are inherently flawed. I have never, in my entire career, ever heard of an A/B test that ran for a year, let alone 3-5 years. That’s where the true power of statistics comes alive, and nobody is financially incentivized to even consider that fact.
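
For a sense of why duration matters, here's a back-of-the-envelope power calculation (plain Python, the standard two-proportion formula, made-up numbers). The subtle effects that only a long test would surface need orders of magnitude more data than the big obvious ones:

    # How many users per arm to detect an absolute lift in a conversion
    # rate, via the standard two-proportion power approximation.
    # Illustrative numbers only.
    from statistics import NormalDist

    def users_per_arm(p_base, lift, alpha=0.05, power=0.8):
        p_treat = p_base + lift
        z_a = NormalDist().inv_cdf(1 - alpha / 2)
        z_b = NormalDist().inv_cdf(power)
        var = p_base * (1 - p_base) + p_treat * (1 - p_treat)
        return (z_a + z_b) ** 2 * var / lift ** 2

    print(round(users_per_arm(0.05, 0.005)))   # 10% relative lift: ~31k users
    print(round(users_per_arm(0.05, 0.0005)))  # 1% relative lift: ~3.0M users

A 10x smaller effect needs roughly 100x the traffic, which for most products means running far longer.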


When I worked on ads at Google we had many A/B tests that had been running that long, generally holdbacks where a feature was almost entirely but not 100% launched.

It was relatively rare that the holdback would show markedly different results than the initial A/B test we used in deciding to launch. If that had happened more often we would have run more long tests and been slower to move to launch.
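
For anyone unfamiliar with the mechanics: a holdback is typically just deterministic bucketing that keeps a small slice of users on the pre-launch behavior indefinitely. A minimal sketch of the general idea (not any company's actual experiment framework):

    # Hash each user id into buckets 0-99; bucket 0 stays on the old
    # behavior as the long-running holdback. Salting with the experiment
    # name keeps bucket assignments independent across experiments.
    import hashlib

    def bucket(user_id: str, experiment: str) -> int:
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def in_holdback(user_id: str) -> bool:
        return bucket(user_id, "feature_x_holdback") < 1  # 1% control slice

Because the hash is deterministic, the same users stay in the holdback for as long as it runs, which is what makes year-plus comparisons possible.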


I've seen one A/B test in the wild run for a full year.

It was a small thing (a test of a product name + description across the site), and the most interesting aspect was that it made only a small but measurable difference. Because of that there was no strong incentive to delete the A/B test (not much harm to the user) nor to make the B side permanent (too small an effect).

In that respect, A/B tests that end early aren't a bad thing IMO: either there's a clear improvement or it's really bad, and the choice is obvious enough to not have to wait much longer.


What I think the grandparent is getting at:

You can measure the direct effect of a change now on something like conversions. But you can't measure the second order effects: things like trust from your users, or the effects on community quality and composition, etc.

This is a good part of why enshittification happens: lots of changes with immediate "good" impact that can be measured quantifiably, but there are also readily foreseeable negative consequences to them.

Of course, just running the test longer doesn't really address this for most possible changes.


Generally you'll have a quarterly holdout to measure total impacts but I agree that yearly would be better.


But even this doesn't work: if you are continually making choices that erode your users' trust in you, there will eventually be an impact. It happens outside of the experiment (e.g. communications between users, general sliding changes in sentiment, etc.). And you can't just read off the time series whether or not you're going too far.


Late to this comment thread, but Amazon actually excels at this type of long-term measurement, through methodologies internally called HVA/DSI and DSE (to name just a couple).

- High Value Action / Downstream Impact == using a "twins" comparison, estimate the 12-month impact of a customer taking a particular action (e.g. sign up for Prime, watch their first Prime Video, etc.), compared to one who doesn't. HVAs are basically those "A's" which turn out to have a high numeric DSI value.

- Downstream Expectation == similar but very different - instead of quantifying the impact of a single action, DSE tries to estimate the combined downstream causal impact of a user taking an initial action... there's a sophisticated methodology there that tries to strip away confounding factors like "rich people who would've shopped more anyways, are also naturally more likely to sign up for Prime", because they truly want to measure the causal benefit of Prime itself, separate from the fact that richer customers generally spend more no matter what

These are both long-term methodologies that were explicitly designed in response to two problems: short-term experiments that didn't capture long-term negative effects, and different parts of Amazon having vastly different methodologies for measuring business impact (e.g. page views vs search impressions vs downloads vs orders vs whatever... no, everyone should optimize for the same customer-level financial metric, which is a flavor of growth-adjusted composite contribution profit (GCCP) that's partly derived from DSE)
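
For intuition, the "twins" idea is essentially matched-pair causal inference: pair each customer who took the action with a statistically similar customer who didn't, and compare their subsequent 12-month value. A toy sketch of that general shape (the matching scheme and variable names are my illustration, not Amazon's actual methodology):

    # Toy "twins"-style estimate of 12-month downstream impact.
    # X_*: (n, k) pre-action covariates (prior spend, tenure, ...), standardized.
    # y_*: spend over the 12 months after the action date.
    import numpy as np

    def twins_impact(X_treated, y_treated, X_control, y_control):
        effects = []
        for x, y in zip(X_treated, y_treated):
            # Nearest untreated neighbor in covariate space = this user's "twin".
            twin = np.argmin(np.linalg.norm(X_control - x, axis=1))
            effects.append(y - y_control[twin])
        return float(np.mean(effects))

Matching on pre-action behavior is what strips out confounders like "rich people who would've shopped more anyway": the twin is chosen to look the same before the action, so the outcome gap is closer to the causal effect of the action itself.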


> Late to this comment thread, but Amazon actually excels at this type of long term measurement

It's funny-- Amazon is the exact case I'm thinking of. I went from spending high five figures to a few hundred, and I'm seeking to eliminate that. The exact impact of these kinds of data-driven management practices has led me to expend a whole lot of work figuring out how to give Costco, Target, and Walmart my business instead.

My complaints:

- Cutting of customer service. I didn't use customer service often, but it was always exemplary. There's something wrong with my account where, if I pay with points from my Amazon Visa for a book, the book gets yoinked out of my account a couple of days later with a "payment failed" message. The points are still deducted. I spent a few hours with customer support twice on this issue, and each time the specific book this happened to was fixed, but the problem remains. It's clearly a backend problem, but Amazon thinks it's a better move to keep a high-value customer on hold while people not empowered to fix anything futz around.

- I (believe that I) was briefly in an experiment with an alternate "buy it now" order flow that would pretty reliably charge me for 2 of whatever item I was seeking to buy. Support wasn't helpful. I have video.

- Overall devolution of the retail marketplace into a flea market full of counterfeit, dubious goods.

- Aggressive attempts to upsell me back into the Prime ecosystem (e.g. the whole "Iliad flow" thing).

I'm sure all of these business decisions and changes looked great on initial measurements, but they're traps later. Worse, they turn people like me, formerly Amazon evangelists, into people who work to help friends use other marketplaces.

Even a year isn't sufficient time: none of these things pissed me off within a year of the change. And they're pretty difficult to capture, because they're hopelessly confounded with other changes in the market and consumer sentiment, and Amazon doesn't roll things out slowly enough to have a truly different contingent experiencing a different business.

Heck, maybe even all the interactions with me look positive on your metrics, depending on how you weigh downstream effects in your model: a previously valuable customer has "payment problems", begins consuming excessive support resources, then leaves.


There's a lot of good feedback to chew through, but I'll refrain from diving in too deep, and just mention that, as important as the HVA/DSI methodology is, there's been comparatively little research done on "negative HVAs". In theory, one can do the same type of analysis to compare "twins" and pick out the NEGATIVE value of having repeat payment problems or repeat unsuccessful customer service interactions. Optimizing for growing the positive HVAs is fundamentally different from optimizing to reduce the negative ones, but Amazon has the tools to get there, or to do both, if it wants/needs to.

And yes, 12 months is arbitrary and doesn't capture everything, and longer windows of analysis are possible, but waiting even longer just throws the signal-to-noise ratio too far in the direction of noise.

FWIW, I'm no longer at Amazon, but I've yet to see a company of significant scale apply this level of econometrics so rigorously in day-to-day business decisions, or treat 12 months as the baseline window (most companies and most A/B tests are much shorter, obviously). I'm sorry you've had bad experiences, and anyway I think it's overall good for society to cultivate strong alternatives to Amazon, but as invisible as it may be to you as a consumer, your data and your lost value as a customer are definitely accounted for within these methodologies, even if no visible changes are happening or they're not winning you back.


I appreciate your comment. I guess what I'm saying is:

I love statistics and econometrics and testing beliefs with data.

But at some point, you do need to think about how to relate to human beings and what is, overall, "good business." That is, data are not replacements for clinical judgment about what is reasonable.

Coming up with ever-more-sophisticated ways to measure what is revenue maximizing but "not quite too abusive" isn't how we keep a good reputation or create a good world to live in.

Of course, completely ignoring indicators and making choices purely based on intuition and values isn't great, either.


Have you heard of long running holdbacks? Even if not, rest assured that for major features, they are very commonly run.


Your point is a good one - what you're describing is what I've heard referred to as "the novelty effect". In fact, a lot in this article reminds me of another essay critical of the short-sightedness of many A/B tests:

https://www.zumsteg.net/2022/07/05/unchecked-ab-testing-dest...


You're certainly correct that software companies should do a lot more year+ A/B tests. You can learn really interesting things from it that a shorter test won't capture. I know of this one: https://medium.com/@AnalyticsAtMeta/notifications-why-less-i...


Gwern runs tests that long.


"Marginal" in the blogpost is used in the economic sense, as in the next incremental user -- not "marginal" as in minority.


That is the way whack is using it. whack is correct that if there is a negative effect on the average user, a test will show that negative effect. That's what "average" means.

To perceive an effect in new users without getting the same effect in existing users, you'd need to show different content to those two groups.


Hmm, I think the author's point is more about attention addiction than about specific average types of people. It's more a matter of setting a low bar to encourage more people to be distracted by your app when they really shouldn't be using it. Basically increasing the number of apps that people check in on, especially when those users are in their marginal time (before bed, while cooking, etc.).


> The only way to escape this is to take a huge pay cut and work at a company that doesn't care about growing profits. Go ahead, you first.

Gladly! Where do I sign up?


What's your background, and what are your skills?

I did this in 2022 [1] and have really liked my new work. There are a lot of nonprofits doing important things, and I think it's likely you could switch to one.

[1] https://www.jefftk.com/p/leaving-google-joining-the-nucleic-...


OT: what's your blog based on?

I looked for a colophon / about site page with no joy.

Source suggests it's simply hand-coded HTML?

Hrm... Apparently, plus some webscripts?

<https://www.jefftk.com/p/designing-low-upkeep-software>

<https://github.com/jeffkaufman/webscripts/blob/master/makers...>


That's right: I draft each post in HTML, then have some scripts which add indexes, headers, footers, and css, and make an RSS feed.
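
For anyone curious about the shape of that kind of setup, the core loop is something like this (a simplified sketch of the general approach, not the actual webscripts code from the repo above):

    # Wrap each hand-written HTML fragment in a shared header/footer
    # and emit an index page. Simplified illustration only.
    import pathlib

    HEADER = '<html><head><link rel="stylesheet" href="/style.css"></head><body>'
    FOOTER = "</body></html>"

    pathlib.Path("out").mkdir(exist_ok=True)
    links = []
    for src in sorted(pathlib.Path("posts").glob("*.html")):
        pathlib.Path("out", src.name).write_text(HEADER + src.read_text() + FOOTER)
        links.append('<a href="/%s">%s</a>' % (src.name, src.stem))
    pathlib.Path("out/index.html").write_text(HEADER + "<br>".join(links) + FOOTER)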


Thanks, pretty cool.

I also like the article preview hack.


Thanks! Implementation details: https://www.jefftk.com/p/preview-on-hover


there are lots of swe jobs with the government.


I would rather not involve myself with the government of my country.


I completely disagree. If you walked up to Marl, built trust with him, and asked him whether he wanted more meaningful content in his life (for a definition of meaningful which made sense to him) I think he would say yes. So it's not really about Marl's preferences but about the way those preferences are collected and Marl's (sadly mostly justified) lack of trust.


If you walked up to me, built trust with me, and asked me if I wanted more exercise in my life I’d say yes. And yet.


And yet if you became friends with someone who encouraged you in the right way, or found a routine that let you get exercise in a way that didn't suck, you'd probably feel really good about yourself and want to keep doing it.

It's not that it's impossible to work with people to raise them to a higher standard. It's harder, sure. But not impossible. And the result is usually worth its weight in gold.


Marl behavior does not convey Marl's actual preferences, and this is where A/B testing zombies with no artistic instinct or creative bone in their body sacrifice the gift of their influence on the world.

To design for humanity, you need to look deeper into what Marl wants without relying on Marl to tell you what that is, because he is incapable of expressing it with words or actions.


You're using a different definition of the word "marginal". Marginal in context doesn't mean rare or unusual. It just means the next user.


I think you may both be right. Assume you start off with a small niche product and keep increasing your userbase.

Then the characteristics of the users at the fringes will change the more you grow. That is to say, the former Marls in the middle are different (and likely not so shallow) from the next-generation Marls on the outside. Eventually, your notion of what is the average user, and OP's notion of what is the Marl that finally kills the UX, will align.


Or you choose a niche that is willing to pay for value. If engagement is a meaningful metric for a business, that’s a red flag. This is why I don’t work on general purpose consumer apps and instead work on utility B2B products, because your job becomes to provide value to a business, not marginal entertainment to Marl.


What does it even mean to ask what a user "wants"? I want to eat 3 cinnamon rolls. I also want to be fitter. Everyone has contradictory wants; it is not a binary choice.

The point is whether the tools feed our best or worst intentions, and which are easier to exploit. To put the onus on the individual is skewed.


> That's why we're here in the HN comments.

I half-agree with you. I think it's possible to cultivate environments where content consumption is more intentional and anti-Marlian. I think HN does a pretty good job of this.


Marl is supposed to have a short attention span. Reading a bunch of comments, and composing your own multi-paragraph comment, is hardly Marl-like.


The penultimate paragraph of the article says almost exactly this.


> That is why every product, no matter how noble it starts off, eventually degenerates into Marl-fodder.

The big reason for that is that people do not want to pay for quality content or services. Ask any HN commenter if you doubt me. So companies instead focus on growth to get the ad cents from the millions of impressions, or trick "whales" by manipulating their addiction in the same manner as casinos.

As long as people call anybody a fool for paying for online services, don't expect things to improve.


Well said.



