Yes, I used to, but No, I fixed it :) Among other things, I am team lead for a p...

aeyes · on Feb 19, 2022

I feel that your problems aren't even remotely related to my problems with large distributed systems.

My problems are all about convincing the company that I need 200 engineers to work on extremely large software projects before we hit a scalability wall. That wall might be 2 years in the future so usually it is next to impossible to convince anyone to take engineers out of product development. Even more so because working on this changes absolutely nothing for the end user, it is usually some internal system related to data storage or processing which can't cope anymore.

Imagine that you are Amazon and for some scalability reason you have to rewrite the storage layer of your product catalog. Immediately you have a million problems like data migration, reporting, data ingestion, making it work with all the related systems like search, recommendations, reviews and so on.

And even if you get the ball rolling you have to work across dozens of different teams which can be hard because naturally people resist change.

Why do large sites like Facebook, Amazon, Twitter and Instagram all essentially look the same after 10 years but some of them now have 10x the amount of engineers? I think they have so much data and so many dependencies between parts of the system that any fundamental change is extremely hard to pull off. They even cut back on features like API access. But I am pretty sure that most of them have rewritten the whole thing at least 3 times.

ratww · on Feb 19, 2022

> Why do large sites like Facebook, Amazon, Twitter and Instagram all essentially look the same after 10 years but some of them now have 10x the amount of engineers? I think they have so much data and so many dependencies between parts of the system that any fundamental change is extremely hard to pull off. They even cut back on features like API access. But I am pretty sure that most of them have rewritten the whole thing at least 3 times.

I used to work on a Unicorn a few years ago, and this hits close to home. From 2016 to 2020, the pages didn't change one single pixel, however there we had 400 more engineers working on the code and three stack iterations: full-stack PHP, PHP backend + React SSR frontend, Java backend + [redacted] SSR frontend (redacted because only two popular companies use this framework). All were rewrites, and those rewrites were justified because none of them was ever stable, the site was constantly going offline. However each rewrite just added more bloat and failure points. At some point the three of them were running in tandem: PHP for legacy customers, another as main and another on an A/B test. (Yeah, it was a dysfunctional environment and I obviously quit).

axiosgunnar · on Feb 19, 2022

> Yeah, it was a dysfunctional environment and I obviously quit

What do you think could management have done better to make it not dysfunctional and have people quitting?

ratww · on Feb 19, 2022

I think just common sense and less bullshit rationalisation would have been enough.

They had a billion dollars in cash to burn, so they hired more than they needed. They should have hired as needed, not as requested by Masayoshi Son.

They shouldn't be so dogmatic. Some teams were too overworked, most were underworked (which means over-engineering will ensue), but no mobility was allowed because "ideally teams have N people".

They shouldn't be so dogmatic pt 2. Services were one-per-team, instead of one-per-subject. So yeah, our internal tool for putting balloons and clowns into images lived together with the authentication micro-service, because it's the same team.

Rewriting everything twice without analysis was wrong. The rewrites were because previous versions were "too complex" and too custom-made but newer ones had an even more complex architecture, but "this time it's right, software sometimes need complexity".

Believing that some things were terrible would have gone a long way. Launching the main node.js server locally would take 10 to 20 minutes to launch, while something of the same complexity would often take about 2 or 3 seconds. Of course it would blow up in production! Maybe try to fix instead of ordering another rewrite.

They were good people, I miss the company and still use the product, but it didn't need to be like this.

webmaven · on Feb 20, 2022

> They shouldn't be so dogmatic pt 2. Services were one-per-team, instead of one-per-subject.

Where the heck did this come from? AIUI, the ideal is supposed to be one-team-per-service, not one-service-per-team.

ratww · on Feb 20, 2022

It comes from a dogmatic reaction against microservices. Microservices were problematic in certain ways, but instead of analysing what went wrong and why, they just went the opposite direction and started doing "big services only". It was a misguided approach, plain and simple.

Interestingly due to internal bureaucracy and understaffing in some teams, there was a lot of "multiple-teams-per-service", which yeah, is another issue in itself.

akkartik · on Feb 19, 2022

Favorited (https://news.ycombinator.com/favorites?id=akkartik&comments=...)

dasil003 · on Feb 19, 2022

I don't know your specifics, but I have worked on some large scale architecture changes, and 200 engineers + 2 year feature freeze is generally not a reasonable ask. In practice you need to find an incremental path with validation and course correction along the way to limit the amount of concurrent change in flight at any moment. If you don't do this run a very high risk of the entire initiative collapsing under its own weight.

Assuming your estimation is more or less correct and it really is a 400 eng-year project, then you also need political capital as well as technical leadership to make it happen. There are lots of companies where a smart engineer can see a potential path out of a local maximum, but the org structure and lack of technical leadership in the highest ranks means that the problem is effectively intractable.

trhway · on Feb 19, 2022

>I need 200 engineers to work on extremely large software projects before we hit a scalability wall. That wall might be 2 years in the future

sounds like a typical massive rewrite project. They almost never succeed, many fail outright and most hardly even reach the functionality/performance/etc. level of the stuff the rewrite was supposed to replace. 2-4 years is typical for such glorious attempt before being closed or folded into something else. Management in general likes such projects, and they declare victory usually around 2 years mark and move on on the wave of the supposed success before reality hits the fan.

>to convince anyone to take engineers out of product development.

that means raiding someone's budget. Not happening :) New glorious effort needs new glorious budget - that is what management likes and not doing much more on the same budget as you're basically suggesting (i.e. i'm sure you'll get much more traction if you restate your proposal as "to hire 200 more engineers ..." because that way you'll be putting serious technical foundation for some mid-managers to grow :). You're approaching this as an engineer and thus failing in what is the management game (or as Sun Tzu was pointing out one has to understand the enemy).

fxtentacle · on Feb 19, 2022

My impression has always been that FAANG need lots of engineers because the 10xers refuse to work there. I've seen plenty of really scalable systems being built by a small core team of people who know what they are doing. FAANG instead seem to be more into chasing trends, inventing new frameworks, rewriting to another more hip language, etc.

I would have no idea how to coordinate 200 engineers. But then again, I have never worked on a project that truly needed 50+ engineers.

"Imagine that you are Amazon and for some scalability reason you have to rewrite the storage layer of your product catalog." Probably that's 4 friends in a basement, similar to the core Android team ;)

danny_taco · on Feb 19, 2022

Your impression comes from the fact that you have not worked at larger teams, as you said so yourself. It's relatively easy to build something scalable from the beginning if you know what you need to build and if you are not already handling large amounts of traffic and data.

It's a whole different ballgame to build on top of an existing complex system already in production that was made to satisfy the needs at the time it was built but it now needs to support other features, bug fixes and supporting existing features but at scale while having 50+ engineers not step on each other and not break each others code in the process. 4 friends in the basement will not achieve more than 50+ engineers in this scenario, even when considering the inefficiencies of the difficulty in communication that come along with so many minds working on the same thing.

ratww · on Feb 19, 2022

GP said they have never work on something that truly needed 50+ engineers. Truly being the keyword here IMO.

I have worked on a 1000+ engineer project and another that was 500+, but I'm on the same boat as GP. Both of those didn't needed 50+, and the presence of the extra 950/450 caused several communication, organisational and architectural issues that became impossible to fix on the long term.

So I can definitely see where they're coming from.

exikyut · on Feb 19, 2022

I've long wondered what I might be able to keep an eye out for during onboarding/transfer that would help me tell overstuffed kitchens apart from optimally-calibrated engineering caves from a distance.

I'm also admittedly extremely curious what (broadly) had 1000 (and 500) engineers dedicated to it, when arguably only 50 were needed. Abstractly speaking that sounds a lot like coordinational/planning micromanagement, where the manglement had final say on how much effort needed to be expended where instead of allowing engineering to own the resource allocation process :/

(Am I describing the patently impossible? Not yet had experience in these types of environments)

ratww · on Feb 19, 2022

> a lot like coordinational/planning micromanagement, where the manglement had final say on how much effort needed to be expended where instead of allowing engineering to own the resource allocation process

Yep, that's a fair assessment!

The 1000+ one was an ERP for mid-large businesses. They had 10 or so flagship products (all acquired) and wanted to consolidate it all into a single one. The failure was more on trying to join the 10 teams together (and including lots of field-only implementation consultants in the bunch), rather than picking a solid foundation that they already owned and handpicking what needed.

The 500+ was an online marketplace. They had that many people because that was a condition imposed by investors. People ended up owning parts of a screen, so something that was a "two-man in a sprint" ended up being a whole team. It was demoralising but I still like the company.

I don't think it's impossible to notice, but it's hard... you can ask during interviews about numbers of employees, what each one does, ask for examples of what each team does on a daily basis. Honestly 100, 500, 1000 people for a company is not really a lot, but 100, 500, 1000 for a single project is definitely a red flag for me now, and anyone trying to pull the "but think of the scale!!!" card is a bullshit artist.

exikyut · on Feb 20, 2022

Yay, I'm learning :D

> trying to join the 10 teams together

oh no

(insert https://webcomicname.com/ here)

> rather than picking a solid foundation that they already owned and handpicking what needed.

Mmmm.

I wonder if a close alternative (notwithstanding lack of context to optimally calibrate ideas off of) might have involved leaving all the engineers alone to compare notes for 6-12 months with the singular top-down goal of "decide what components and teams do what best." That could be interesting... but it leans very heavily on preexisting competence, initiative and proactivity (not to mention conflict resolution >:D), and is probably a bit spherical-cow...

> The 500+ was an online marketplace. They had that many people because that was a condition imposed by investors.

*Constructs getaway vehicle in spare time* AAAAAaaaaaa

Sad engineering face :<

> I don't think it's impossible to notice, but it's hard... you can ask during interviews about numbers of employees, what each one does, ask for examples of what each team does on a daily basis.

Noted. Thanks.

> Honestly 100, 500, 1000 people for a company is not really a lot, but 100, 500, 1000 for a single project is definitely a red flag for me now, and anyone trying to pull the "but think of the scale!!!" card is a bullshit artist.

That makes a lot of sense, and also filed away.

Also, I recently read this which resonates quite strongly with the economy-of-efficiency scale problem (which I totally agree with): https://rachelbythebay.com/w/2022/01/26/swcbbs/, and the update, https://rachelbythebay.com/w/2022/01/27/scale/

ethbr0 · on Feb 19, 2022

> what I might be able to keep an eye out for during onboarding/transfer that would help me tell overstuffed kitchens apart from optimally-calibrated engineering caves from a distance

The biggest thing I've been able to correlate are command styles: imperative vs declarative.

I.e. is management used to telling engineering how to do the work? Or communicating a desired end result and letting engineering figure it out?

I think fundamentally this is correlated with bloat vs lean because the kind of organizations that hire headcount thoughtlessly inevitably attempt to manage the chaos by pulling back more control into the PM role. Which consequently leads to imperative command styles: my boss tells me what to do, I tell you, you do it.

The quintessential quote from a call at a bad job was a manager saying "We definitely don't want to deliver anything they didn't ask for." This after having to cobble together 3/4 of the spec during the project, because so much functionality was missed.

Or in interview question form posed to the interviewer: "Describe how you're told what to build for a new project." and "Describe the process if you identify a new feature during implementation and want to pitch it for inclusion."

exikyut · on Feb 20, 2022

Of course. Wow, I never thought about management like that before. But particularly in software development it makes so much sense for people to jump toward this sort of mindset.

There really is an art to scaling problems to humans so the individual work (across management and engineering) falls within the sweet spot of cognitive saturation. TIL yet another dimension that can go sideways.

The signal to noise ratio is very appreciated.

fragmede · on Feb 19, 2022

Yeah, exactly. There is overhead simply because of the (necessary) cross-communication at that scale, and there's overhead from legacy support, but here's a thought experiment. Imagine that you've built the most perfect system from scratch that you can think of. Fast forward five years, and the business has pivoted so many times that system is doing all sorts of stuff it just wasn't designed for, and it's creaky and old. It just doesn't fit right anymore and even you want to throw it away and build a new one. So you form a tiger team full of the smartest people you know to greenfield build a new one, from scratch, but that's gonna take two years to write. (You think, hey, maybe we could just take this open source thing and adapt it to our purposes. To which I say, where do you think large open source projects come from‽)

How do you bridge the two systems? You build an interim system. But customers want new features, so those features need to be done twice (bridge+new) if you're lucky, three times (existing+interim+new) if not. Could a smaller team of 10x engineers come in and do better? First off, thanks for insulting all of us, as if none of us are 10x-ers. But no. There's simply not enough hours in the day.

We've all heard of large IT projects that failed to land and said "of course". But we don't hear about the huge ones that do. And plenty of them do land, quite succesfully, with these 200+ person teams where I, as an SRE, don't know the code for the system I'm supporting.

None of this is visible from the outside.

aij · on Feb 19, 2022

> I've seen plenty of really scalable systems being built by a small core team of people who know what they are doing.

There is huge difference between building a system that could theoretically be scaled up and actually scaling it up efficiently.

At small scales, it's really easy to build on the work of others and take things for granted without even knowing where the scaling limits are. For example, if I suddenly find I need to double my data storage capacity, I can drive to a store and come back with a trunk full of hard drives the same day. I can only do that because someone already build the hard drives, and someone stocked the nearby stores with them. If a hyperscaler needs to double their capacity, they need to plan it well in advance, allocating a substantial fraction of global hard drive manufacturing capacity. They can't just assume someone would have already built the hardware, much less have it in stock near where it's needed.

danielmarkbruce · on Feb 20, 2022

Which FAANG is rewriting to another hip language and chasing trends (especially when it comes to infra services??)? I don't mean to be rude, but it doesn't sound like you are talking about any of the FAANGs, this sounds completely made up.

fxtentacle · on Feb 20, 2022

https://blog.pragmaticengineer.com/uber-app-rewrite-yolo/

danielmarkbruce · on Feb 21, 2022

FAANG is an acronym for Facebook, Amazon, Apple, Netflix, Google. Uber isn't in the same ballpark as those companies (arguably Netflix isn't really in the same ballpark as the other four either...).

andai · on Feb 19, 2022

Heh, I wish they still looked the same. They added an order of magnitude of HTML and JS bloat while removing functionality.

pojzon · on Feb 19, 2022

Had that issue in my previous job.

Higher management decided to migrate our properitary vendor locked platform from one cloud provider to the other one. Majority of migration fell on a single platform team that was constantly struggling with attricion.

Unfortunately I was not able (neither our architects) to explain the higherups that we need bigger team and overall way more resources to pull that off.

Hope that someone that comes after me will be able to make the miracle happen.

notimetorelax · on Feb 19, 2022

I usually move on to a different project/team/company when it gets to this. E.g. my new team builds a new product that grows like crazy and has its own set of challenges. I prefer to be deliver immediate customer value vs. long term hard to sell and hard to project the value work.

ClumsyPilot · on Feb 19, 2022

"That wall might be 2 years in the future so usually it is next to impossible to convince anyone to take engineers out of product development. Even more so because working on this changes absolutely nothing for the end user"

It seems to be the same story in fiels of Infrastructure maintenance, Aircraft design (boeing Max), and mortgage CDOs (2008). Was it always like this or the new management doesn not care untill something explodes?

imachine1980_ · on Feb 19, 2022

a manufacturing company is designed the ground up to works whit machine but isn't the same whit software, is hard to understand that triple data isn't only triple server but a totaly different software stack, and exponentially more complex is not only put more factories like textile.

fragmede · on Feb 19, 2022

There's still order of magnitude change analogies to real world processes, if people are willing to listen (which is the hard part). Use something that everybody can understand, like making pancakes or waffles or an omelet. Going from making 1 by hand, every 4 minutes at home for your family, to 1,000 pancakes per minute at a factory is obviously going to take a better system. You can scale horizontally, and do the equivalent of putting more VMs behind the load balancer, and hire 4,000+ people to cook, but you still need to have/make that load balancer in the first place for even that to work.

That's the tip of iceberg when going from 1 per 4 minutes to 1,000 per minute though. How do you make and distribute enough batter for that system, and plating and serving that is going to take a pub/sub bus, err, conveyor belt to support the cooks' output. Again though, you still gotta make that kafka queue, err, conveyor belt, plus the maintenance for that is going to a team of people if you need the conveyor belt to operate 24/7/52. If your standards are so high that the system can never go down for more than 52.6 minutes per year or 13.15 minutes per quarter, then that team needs to consist of highly-trained and smart (read: expensive) people to call when the system breaks in the middle of the night.

nostrebored · on Feb 19, 2022

You had problems with management of a cloud based api and executive visibility… so you bought a set of data centers to handle 500mio req per month?

The visibility you will get after the capex when there’s a truly disastrous outage will be interesting.

NavinF · on Feb 19, 2022

Hmm that’s only 190Hz on average, but we don’t know what kind of search engine it is. For example if he’s doing ML inference for every query, it would make perfect sense to get a few cabinets at a data center. I’ve done so for a much smaller project that only needs 4 GPUs and saved a ton of money.

fxtentacle · on Feb 19, 2022

Nah, it's text-only requests returning JSON arrays of which newspaper article URLs mention which influencer or brand name keyword.

The biggest hardware price point is that you need insane amounts of RAM so that you can mmap the bloom hash for the mapping from word_id to document_ids.

winrid · on Feb 19, 2022

You could have used a sharded database like Mongo. Just throw up 10 shards, use "source" (influencer or brand name) as shard key?

fxtentacle · on Feb 19, 2022

Yes, I could have used Mongo, but it would have been 100x to 1000x slower than an mmap-ed look up table.

nostrebored · on Feb 19, 2022

Why ever use mmap instead of sharded inverted indices of word-doc here, a la elasticsearch?

winrid · on Feb 19, 2022

Yeah the question is what level of performance you need I guess... was hoping you could clarify :)

joshuamorton · on Feb 19, 2022

But you don't actually need that level of performance? You've made this system more complex and expensive to achieve a requirement that doesn't matter?

shoo · on Feb 19, 2022

you seem to have a deeper knowledge of the business & organisational context that dictate the true requirements than someone working there. please share these details so we can all learn!

joshuamorton · on Feb 19, 2022

Sure: the network request time of a person making a request over the open internet is going to be an order of magnitude longer than a DB lookup (in the right style, with a reverse-index) on the scale of data this person is describing. So making the lookup 10x faster saves you...1% of the request latency.

And at the qps they've described, it's not a throughput issue either. So I'm pretty confident in saying that this is a case of premature optimization.

And at some point the increase in parallelization of scans dominates mmap speed, unless you're redundantly sharding your mmaped hash table across multiple machines. And there are cases where network bandwidth is the bottleneck before disk bandwidth, though probably not this case. But yeah basically, the answer is something like "if this is the optimal choice, it probably didnt matter that much".

fxtentacle · on Feb 20, 2022

This reads to me as if you have never really used mmap in a dedicated C/C++ application. Just to give you a data point, looking up one word_id in the LUT and reading 20 document_ids from it takes on average 0.0000015 ms.

So if that alternative database takes on average 0.1ms per index read, then it's starting out roughly 65000x slower.

"than a DB lookup (in the right style, with a reverse-index)"

Unless, of course, you're managing petabytes of data ;)

"at the qps they've described, it's not a throughput issue either"

It's mostly a cost thing. If a single request takes 2x the time, that's also a 2x on the hosting bill.

"parallelization of scans dominates mmap speed"

Yes, eventually that might happen. Roughly when you have 100000 servers. But before that your 10gbit/s node-to-node link will saturate. Oops.

joshuamorton · on Feb 21, 2022

> Unless, of course, you're managing petabytes of data ;)

Are...are you saying that you've purchased petabyte(s) of RAM, and that that multi-million dollar investment is somehow cheaper than...well really anything else?

> But before that your 10gbit/s node-to-node link will saturate. Oops.

Only if you're returning dense results, which it sounds like you aren't (and there are ways to address this anyhow), which is why I said the issue of saturating network before disk probably wasn't an issue for you ;)

fxtentacle · on Feb 21, 2022

No, of course I have a tiered architecture. HDDs + SSDs + RAM. By mmap-ing the file, the Linux kernel will make sure that whatever data I access is in RAM and it'll do best-effort pre-reading and caching, which works very well.

BTW, this is precisely how "real databases" also handle their storage IO internally. So all of the performance cost I have to pay here, they have to pay, too.

But the key difference is that with a regular database and indices, the database needs to be able to handle read and write loads, which leads to all sorts of undesirable trade-offs for their indices. I can use a mathematically perfect index if I split dataset generation off of dataset hosting.

It's really quite difficult to explain, so I'll just redirect you to the algorithms. A regular database will typically use a B-tree index, which is O(log(N)). I'm using a direct hash bucket look-up, which is O(1).

For a mental model, you can think of "mmap" as "all the results are already in RAM, you just need to read the correct variable". There is no network connection, no SQL parsing, no query planning, no index scan, no data retrieval. All those steps would just consume unnecessary RAM bandwidth and CPU usage. So where a proper DB needs 1000+ CPU cycles, I might get away with just 1.

winrid · on Feb 21, 2022

No modern DB uses mmap because it's unreliable and hard to tune for performance.

A custom cache manager will always perform better than mmap provided by the kernel.

The problem is you haven't explained how the overhead of a DB is too much. Sure, it sounds like a lot of work for your servers and the DB compared to reading from a hashmap.

Where I work right now we fire around 1.5B queries a day... to Mongo.

nostrebored · on Feb 21, 2022

And have your unreliable, iconsistent, unscalable system. That apparently goes down all the time.

Not using ES here is actually nuts.

winrid · on Feb 20, 2022

Are you managing petabytes of data though?

What kind of servers are you running? What's your max QPS?

The fact is with your mmap impl. you probably use ram + virtual memory, and have more ram than needed to compensate for the fact that you don't keep the most used keys in memory, which a DB will do for you.

Point is if you have petabytes of data and access patterns only mean you access a subset of it, even Mongo might be cheaper to run.

fxtentacle · on Feb 21, 2022

Just FYI, MongoDB storage also uses mmap internally.

So we are comparing here "just mmap" with "mmap + all that connection handling, query parsing, JSON formatting, buffering, indexing, whatever stuff that MongoDb does".

And no, MongoDB is effectively never a cheap solution. They are used because they are super convenient to work with, with all things being JSON documents. But all that conversion to and from JSON comes at a price. It'll eat up 1000s of CPU cycles just to read a single document. With raw mmap, you could read 1000s of documents instead.

jd_mongodb · on Feb 21, 2022

MongoDB uses the Wired Tiger storage engine internally. The MMAP storage engine was removed from MongoDB in V4.2 which was released in March 2020. The MMAP engine was deprecated two years previously.

In MongoDB conversion to and from raw JSON into BSON (Binary JSON) is done on the client (aka driver) so the server cycles are not consumed.

winrid · on Feb 21, 2022

And 2 you're looking past the point. Any DB work work fine for this use case. If you wanted sharding, there's vitess for mysql, for example.

winrid · on Feb 21, 2022

As another already said, Mongo doesn't use mmap anymore.

Mongo doesn't convert to and from JSON. The driver uses a binary protocol.

Damogran6 · on Feb 19, 2022

As a security guy I HATE the loss of visibility in going to the cloud. Can you duplicate it? Sure. Still not as easily as spanning a trunk and you still have to trust what you’re seeing to an extent.

nostrebored · on Feb 19, 2022

The visibility I was mentioning in the parent comment was visibility from executives in your business, but I can see how it would be confusing.

There are tradeoffs — cloud removes much of the physical security risks and gives you tools to help automated incident detection. Things like serverless functions let you build out security scaffolding pretty easily.

But in exchange you do have to give some trust. And I totally understand resistance there.

justinclift · on Feb 19, 2022

> cloud removes much of the physical security risks

Doesn't cloud increase the physical security risks, rather than decrease/remove?

fxtentacle · on Feb 19, 2022

You might be surprised. The performance equivalent of $100k monthly in EC2 spend fits into a 16m2 cage with 52HU racks.

_iziv · on Feb 19, 2022

Which costs you more than $100k monthly to operate with the same level of manageability and reliability.

We don't use AWS, because our use cases don't require that level of reliability and we simply cannot afford it, but if I needed a company to depend on IT that generates enough revenue... I probably wouldn't argue about the AWS bill. So long, prepaid at hetzner + in-house works good enough, but I know what I cannot offer with the click of a button to my user!

Spooky23 · on Feb 19, 2022

This is a religious debate among many. The IT/engineering nerd stuff doesn’t matter at all. Cloud migration decisions are always made by accounting and tax factors.

I run two critical apps, one on-prem and one cloud. There is no difference in people cost, and the cloud service costs about 20% more on the infrastructure side. We went cloud because customer uptake was unknown and making capital investments didn’t make sense.

I’ve had a few scenarios where we’ve moved workloads from cloud to on-prem and reverse. These things are tools and it doesn’t pay to be dogmatic.

sdoering · on Feb 19, 2022

> These things are tools and it doesn’t pay to be dogmatic.

I wish I would hear this line more often.

So many things today are (pseudo-) religious now. The right frsmework/language, cloud or on prem, x vs not x.

Especially bad imho when somebody tries to tell you how you could do better with 'not x' instead of x you are currently using without even trying to understand the context this decision resides in.

[Edit] typo

qorrect · on Feb 19, 2022

> So many things today are (pseudo-) religious now. The right frsmework/language, cloud or on prem, x vs not x.

Might have always been that way? We just have so many more tools to argue over now.

dekhn · on Feb 19, 2022

that cage is a liability, not an asset. How is the networking in that rack? What's its connection to large-scale storage (IE, petabytes, since that's what I work with). What happens if a meteor hits the cage? Etc.

fxtentacle · on Feb 19, 2022

That depends on what contracts you have. You could have multiple of these cages in different locations. Also, 1 PB is only 56 large enterprise HDDs. So you just put storage into the cage, too.

But my point wasn't about how precisely the hardware is managed. My point was that with a large cloud, a mid-sized company has effectively NO SUPPORT. So anything that gives you more control is an improvement.

dekhn · on Feb 19, 2022

"1 pb is only 56 large enterprise hdds".

umm, what happens when one fails?

With large cloud my startup had excellent support. We negotiated a contract. That's how it works.

fxtentacle · on Feb 19, 2022

Typically people use RAID or ZFS to prevent data loss when a few hdds fail.

dekhn · on Feb 19, 2022

OK, so basically you're in a completely different class of expectations about how systems perform under disk loss and heavy load then me. A drive array is very different from large-scale cloud storage.

fxtentacle · on Feb 19, 2022

Hard to say. My impression is:

- A large ZFS pool of SSDs is much faster than any cloud storage.

- Cloud storage failed much more often than the SSDs in our pool.

- "Noisy neighbor" is an issue on the cloud

qorrect · on Feb 19, 2022

This cracked me up. Thanks fxtentacle :D.

dekhn · on Feb 21, 2022

of course, the reason that's wrong is that if one drive fails you don't have a 56pb storage system, you have something smaller because of redundancy.

That redundancy, and the performance that scales due to it, place cloud services in an entirely different class from on prem servers.

ckdarby · on Feb 19, 2022

>I used to feel powerless and stressed out by the complexity and the scale, because whenever stuff broke (and it always does at this scale), I had to start playing politics, asking for favors, or threatening people on the phone to get it fixed. Higher management would hold me accountable for the downtime even when the whole S3 AZ was offline and there was clearly nothing I could do except for hoping that we'll somehow reach one of their support engineers.

If the business can't afford to have downtime then they should be paying for enterprise support. You'll be able to connect to someone in < 10 mins and have dedicated individuals you can reach out to.

jerjerjer · on Feb 19, 2022

You never hosted on AWS, did you?

0x445442 · on Feb 19, 2022

In the two years I worked on serverless AWS I filed four support tickets. Three out of those four I came up with the solution or fix on my own before support could find a solution. The other ticket was still open when I left the company. But the best part is when support wanted to know how I resolved the issues. I always asked how much they were going to pay me for that information.

ckdarby · on Feb 19, 2022

>You never hosted on AWS, did you?

Previously 2k employee company, with the entire advertising back office on AWS.

Currently >$1M YR at AWS, you can get the idea of scale & what is running, here: https://www.youtube.com/playlist?list=PLf-67McbxkT6iduMWoUsh...

phillu · on Feb 19, 2022

Enterprise Support never disappointed me so far. Maybe not <10 minute response time, but we never felt left alone during an outage. But I guess this is also highly region/geo dependent.

FpUser · on Feb 19, 2022

>"they should be paying for enterprise support"

This sounds a bit arrogant. I think they found better and overall cheaper solution.

ckdarby · on Feb 19, 2022

>This sounds a bit arrogant.

The parent thread talks about how the business could not go down even with a triple AZ outage for S3, and I don't think it is arrogant to state they should be paying for enterprise support if that level of expectation is set.

>I think they found better and overall cheaper solution.

Cheaper solution does not just include the cost but also the time. For the time we need to look at the time they spent regardless of department to acquire, migrate off of AWS, modifying the code to work for their multi-private cloud, etc. I'd believe it if they're willing to say they did this, have been running for three years, and compiled the numbers in excel. It is common if you ask internally was it worth it to get a yes because people put their careers on it and want to have a "successful" project.

The math doesn't work out in my experiences with clients in the past. The scenarios that work out are, top 30 in the enitre tech industry, significant GPU training, egress bandwidth (CDN, video, assets), or business that are selling basically the infrastructure (think Dropbox, Backblaze, etc.).

I'm sure someone will throw down some post where their cost, $x is less than $y at AWS, but that is such a tiny portion that if the cost is not >50% it isn't even worth looking at the rest of the math. The absolute total cost of ownership is much harder than most clickbait articles are willing to go into. I have not seen any developers talk about how it changes the income statement & balance sheet which can affect total net income and how much the company will lose just to taxes. One argument assumes that it evens out after the full amortization period in the end.

Here are just a handful of factors that get overlooked, supply chain delays, migration time, access to expertise, retaining staff, churn increase due to pager/call rotation, opportunity cost of to capital being in idle/spare inventory and plenty more.

fxtentacle · on Feb 19, 2022

Back then, it was enough to saturate the S3 metadata node for your bucket and then all AZs would be unable to service GET requests.

And yes, this won't be financially useful in every situation. But if the goal is to gain operational control, it's worthwhile nonetheless. That said, for a high-traffic API, you're paying through the nose for AWS egress bandwidth, so it is one of those cases where it also very much makes financial sense.

ckdarby · on Feb 19, 2022

Same fxtenatcle as CTO of ImageRights? If that is the case my follow up question is did you actually move everything out of AWS? Or did you just take the same Netflix approach like Open Connect for the 95th billing + unmetered & peering with ISPs to reduce.

FpUser · on Feb 19, 2022

So you basically saying that no matter what one should always stick to Amazon. I have my own experience that tells exactly the opposite. To each their own. We do not have to agree.

ckdarby · on Feb 19, 2022

>So you basically saying that no matter what one should always stick to Amazon.

What I am saying is given the list of exceptions I gave the business should run/colocate their gear if they're in the exception list or those components that fall in the exception list should be moved out.

>I have my own experience that tells exactly the opposite.

You begin using AWS for your first day ever and on that day it has a tri AZ outage for S3. In this example the experience with AWS has been terrible. Zooming out though over 5 years it wouldn't look like a terrible experience at all considering outages are limited and honestly not that frequent.

FpUser · on Feb 19, 2022

>"You begin using AWS for your first day ever"

I am not talking about outages here. Bad things can happen. More like a price.

BossingAround · on Feb 19, 2022

I don't read that as arrogant. The full statement is:

> If the business can't afford to have downtime then they should be paying for enterprise support.

It's simply stating that it's either cheaper for business to have downtime, or it's cheaper to pay for premium support. Each business owner evaluates which is it for them.

If you absolutely can't afford downtime, chances are premium support will be cheaper.

jqgatsby · on Feb 19, 2022

@fxtentacle, I was curious which private search engine this is for. Is the system you are describing ImageRights.com?

fxtentacle · on Feb 19, 2022

No, ImageRights is much more requests and mostly images. Also, at ImageRights I don't have management above me that I would need to convince :)

This one is text-only and used by influencers and brands to check which newspapers report about their events. As I said, it's internally used by a few partner companies who buy the API from my client and sell news alerts to their clients.

BTW, I'm hoping to one day build something similar as an open source search engine where people pay for the data generation and then effectively run their own ad-free Google clone, but so far interest has been _very_ low:

https://news.ycombinator.com/item?id=30374611 (1 upvote)

https://news.ycombinator.com/item?id=30361385 (5 upvotes)

EDIT: Out of curiosity I just checked and found my intuition wrong. The ImageRights API averages 316rps = 819mio requests per month. So it's not that much bigger.

mmcnl · on Feb 19, 2022

If you rely on public cloud infrastructure, you should understand both the advantages and disadvantages. Seems like your company forgot about the disadvantages.

briandilley · on Feb 19, 2022

What i read here was "Cloud is hard, so I took on even more responsibility"

fxtentacle · on Feb 19, 2022

What you should read is: At the monthly spend of a mid-sized company, it is impossible to get phone support from any public cloud provider.

ddorian43 · on Feb 19, 2022

What are you using for aws alternatives? Example for S3?

ckdarby · on Feb 19, 2022

>What are you using for aws alternatives? Example for S3?

Not OP but they're probably using Rook/Minio

fxtentacle · on Feb 19, 2022

docker + self-developed image management + CEPH

flyinglizard · on Feb 19, 2022

Care to share uptime metrics on AWS vs your own servers?

fxtentacle · on Feb 19, 2022

That wouldn't be much help because the AWS and Heroku metrics are always green, no matter what. If you can't push updates to production, they count that as a developer-only outage and do not deduct it from their reported uptime.

For me, the most important metric would be time that me and my team spent fixing issues. And that went down significantly. After a year of everyone feeling burned out, now people can take extended vacations again.

One big issue for example was the connectivity between EC2 servers degrading, so that instead of the usual 1gbit/s they would only get 10mbit/s. It's not quite an outage, but it makes things painfully slow and that sluggishness is visible for end users. Getting reliable network speeds is much easier if all the servers are in the same physical room.