
> Nothing guarantees optimal placement, but all the mainstream schedulers attempt an approximation of it. The general assumption in mainstream schedulers is that servers are placement-sensitive, and batch jobs aren't.

I don't know the open source schedulers well, but modern Borg batch scheduling works quite differently from (my understanding of) your description. Batch jobs are still often placement-sensitive (needing to run near a large database, for example, for bandwidth reasons). The big distinction I see is that serving tasks get tight SLOs on scheduling latency, evictions/preemptions, and availability of resources on the machine they're scheduled on, while batch tasks basically don't. They can take a while to schedule, they can get preempted frequently, and they cram into the gap between the serving tasks' current usage and their limit.

E.g., if a serving job says it needs 10 cores but is only using 1, a batch job might use 8 of those and then just kinda suffer if the serving job starts using more than 2, because CPU is "compressible", or get evicted if things get really bad. In the same situation with RAM (mostly "incompressible"), the batch job gets evicted ASAP if the serving job needs the RAM, or the system involves some second-class RAM solution (cross-NUMA node, Optane, zramfs, RDMA, page to SSD, whatever). Batch doesn't get better service in any respect, but it's cheaper.
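To make that usage-vs-limit gap concrete, here's a minimal sketch (hypothetical types and thresholds, not Borg's actual data model) of how a node might expose slack to batch work and pick eviction victims under memory pressure:

```go
package main

import "fmt"

// Hypothetical shapes -- not Borg's actual data model.
type Task struct {
	Name        string
	Serving     bool
	CPULimit    float64 // cores reserved
	CPUUsage    float64 // cores actually in use right now
	RAMLimitGiB float64
	RAMUsageGiB float64
}

type Machine struct {
	CPUCores float64
	RAMGiB   float64
	Tasks    []Task
}

// CPU is "compressible": batch can be admitted against whatever is reserved
// but unused, and simply runs slower if serving ramps back up.
func (m *Machine) batchCPUSlack() float64 {
	used := 0.0
	for _, t := range m.Tasks {
		used += t.CPUUsage
	}
	return m.CPUCores - used
}

// RAM is "incompressible": once actual usage nears capacity, something has
// to be evicted rather than throttled.
func (m *Machine) ramPressure() bool {
	used := 0.0
	for _, t := range m.Tasks {
		used += t.RAMUsageGiB
	}
	return used > 0.9*m.RAMGiB // hypothetical pressure threshold
}

// Batch tasks absorb all the evictions; serving tasks keep their SLOs.
func (m *Machine) evictionCandidates() []string {
	var names []string
	for _, t := range m.Tasks {
		if !t.Serving {
			names = append(names, t.Name)
		}
	}
	return names
}

func main() {
	m := Machine{CPUCores: 16, RAMGiB: 64, Tasks: []Task{
		{Name: "frontend", Serving: true, CPULimit: 10, CPUUsage: 1, RAMLimitGiB: 40, RAMUsageGiB: 30},
		{Name: "mapreduce", Serving: false, CPULimit: 8, CPUUsage: 8, RAMLimitGiB: 30, RAMUsageGiB: 30},
	}}
	fmt.Printf("CPU slack available to batch: %.1f cores\n", m.batchCPUSlack())
	fmt.Printf("RAM pressure? %v -> evict %v first\n", m.ramPressure(), m.evictionCandidates())
}
```

The asymmetry is the whole point: overcommitted CPU degrades gracefully (batch just gets throttled), while overcommitted RAM has to be resolved by evicting the lowest tier.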

> As for cost: we rent out server space. We rack servers to keep up with customer load. The more load we have, the more money we're making. If we're racking a bunch of new servers in FRA, that means FRA is making us more money.

and in the article, you wrote:

> It was designed for a cluster where 0% utilization was better, for power consumption reasons, than < 40% utilization. Makes sense for Google. Not so much for us.

IIUC, you mean that whoever you're renting space from doesn't charge you by power usage, so you have no incentive to prefer fully packing 1 machine before scheduling something on every machine available. Spreading is fine. Makes sense economically (although I'm a little sad to read it environmentally because the power usage difference should still be real).

I think another aspect to consider is avoiding "stranded" resources: situations in which, say, a task that needs most/all of a machine's remaining RAM but very little CPU gets scheduled on a machine with a whole bunch of CPU available, effectively making that CPU unusable until something terminates. You've got headroom, but I presume that's sized from forecasted need, and if the forecast has to grow because some of those resources will still be stranded when the need arrives, the stranding is costing you real money.
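For what it's worth, one common mitigation is a scoring pass that penalizes placements which leave a machine's remaining resources lopsided (kube-scheduler's NodeResourcesBalancedAllocation plugin does something in this spirit). A toy sketch, with made-up types and numbers:

```go
package main

import (
	"fmt"
	"math"
)

// Made-up shapes and numbers, purely illustrative.
type Request struct{ CPU, RAMGiB float64 }

type Node struct {
	Name                string
	FreeCPU, FreeRAMGiB float64
	CapCPU, CapRAMGiB   float64
}

// balanceScore prefers placements that leave CPU and RAM utilization close to
// each other, so neither resource gets stranded behind the other.
// Higher is better; -1 means the task doesn't fit at all.
func balanceScore(n Node, r Request) float64 {
	if r.CPU > n.FreeCPU || r.RAMGiB > n.FreeRAMGiB {
		return -1
	}
	cpuUtil := (n.CapCPU - n.FreeCPU + r.CPU) / n.CapCPU
	ramUtil := (n.CapRAMGiB - n.FreeRAMGiB + r.RAMGiB) / n.CapRAMGiB
	return 1 - math.Abs(cpuUtil-ramUtil) // 1 = perfectly balanced after placement
}

func main() {
	task := Request{CPU: 1, RAMGiB: 30} // RAM-heavy, CPU-light
	nodes := []Node{
		{Name: "mostly-empty", FreeCPU: 30, FreeRAMGiB: 32, CapCPU: 32, CapRAMGiB: 64},
		{Name: "cpu-loaded", FreeCPU: 4, FreeRAMGiB: 32, CapCPU: 32, CapRAMGiB: 64},
	}
	for _, n := range nodes {
		fmt.Printf("%-13s score %.2f\n", n.Name, balanceScore(n, task))
	}
	// The RAM-heavy task scores higher on "cpu-loaded", leaving the idle CPU on
	// "mostly-empty" available for CPU-heavy work instead of stranding it.
}
```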

Maybe this problem is avoided well enough just by spreading things out? Or maybe you don't allow weird task shapes? Or maybe (I'm seeing your final paragraph now about growth) it's just not worth optimizing for yet?



> Makes sense economically (although I'm a little sad to read it environmentally because the power usage difference should still be real).

Does fully loading 4 cores in one server save power over fully loading 2 cores in 2 servers? If you turn off the idle server, probably yes? If not, I'd have to see measurements, but I could imagine it going either way. Lower activity means less heat means lower voltage means less power per unit of work, maybe.

You're likely to get better performance out of the two servers though (which might not be great, because then you have a more variable product).
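To make "it could go either way" concrete, the comparison reduces to a tiny calculation once you pick numbers; the figures below are invented purely for illustration, and the answer flips depending on how idle-plus-fan draw compares to per-core active draw and how much per-core efficiency changes at partial load:

```go
package main

import "fmt"

func main() {
	// All numbers invented for illustration; substitute real measurements.
	const (
		idleWatts    = 45.0 // powered-on but idle server (fans, PSU overhead, etc.)
		perCoreWatts = 12.0 // marginal draw per fully loaded core
	)

	// A: 4 loaded cores on one server, the second server powered off entirely.
	packed := idleWatts + 4*perCoreWatts
	// B: 2 loaded cores on each of two powered-on servers.
	spread := 2 * (idleWatts + 2*perCoreWatts)

	fmt.Printf("packed (2nd box off): %.0f W\n", packed) // 93 W
	fmt.Printf("spread over 2 boxes:  %.0f W\n", spread)  // 138 W
	// With these made-up numbers packing wins; if spreading lets the active
	// cores run in more efficient states (lower effective perCoreWatts) or cuts
	// fan power, the gap narrows -- hence "I'd have to see measurements".
}
```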


With modern CPUs and modern thermal management, it's probable that fully loading two cores in each of two servers is much more efficient than loading four cores on one server, even with the second server powered off: in each machine the primary delta in power draw between idle and max load is thermal management (fans), spreading the load out gets you more passive cooling, and the cores that are in use can run in more energy-efficient modes.

That said, I haven't done the actual math here; I've just seen power draw benchmarks that show idle -> single-core -> all-core draw as a curve that rises much more slowly than the core count, without even accounting for the fact that each core is more performant under single-core workloads.


> Does fully loading 4 cores in one server save power over fully loading 2 cores in 2 servers?

That's the premise, and I have no particular reason to doubt it. There are several levels at which it might be true, from using deeper sleep states (core level? socket level?) to going wild and de-energizing entire PDUs.

> You're likely to get better performance out of the two servers though (which might not be great, because then you have a more variable product).

Yeah, exactly, it's a double-edged sword. The fly.io article says the following...

> With strict bin packing, we end up with Katamari Damacy scheduling, where a couple overworked servers in our fleet suck up all the random jobs they come into contact with. Resource tracking is imperfect and neighbors are noisy, so this is a pretty bad customer experience.

...and I've seen problems along those lines too. State-of-the-art isolation is imperfect. E.g., some workloads gobble up the shared last-level CPU cache and thus cause neighbors' instructions-per-cycle to plummet. (It's not hard to write such an antagonist if you want to see this in action.) Still, ideally you find the limits ahead of time, so you don't think you have more headroom than you really do.
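For anyone who wants to see it in action, such an antagonist really is just a few lines; here's a toy sketch (not any production tool): a dependent, pseudo-random pointer chase over a buffer much larger than the last-level cache, which defeats the prefetcher and keeps evicting co-tenants' cache lines:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	// 32M uint32 slots = 128 MiB, comfortably larger than typical LLCs.
	const n = 32 << 20
	buf := make([]uint32, n)

	// Link the slots into one random cycle so every load depends on the
	// previous one (pointer chasing); the prefetcher can't hide the misses.
	perm := rand.Perm(n)
	for i := 0; i < n; i++ {
		buf[perm[i]] = uint32(perm[(i+1)%n])
	}

	idx := uint32(perm[0])
	const steps = 1 << 25
	start := time.Now()
	for i := 0; i < steps; i++ {
		idx = buf[idx] // ~one LLC miss per iteration, evicting neighbors' lines
	}
	elapsed := time.Since(start)
	fmt.Printf("final index %d, %.1f ns per access\n", idx, float64(elapsed.Nanoseconds())/steps)
}
```

Pin it to a core that shares an LLC with a latency-sensitive neighbor and watch the neighbor's instructions-per-cycle drop in `perf stat`.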


No, it's not that power usage for us is free, it's that the business is growing (like any startup), so there is already a constant expansionary pressure on our regions; for the foreseeable future (years), our regions will tend to have significantly more servers than a scheduler would tell us we needed. Whatever we save in power costs by keeping some of those servers powered off, we lose in technical operational costs by keeping the rest of the servers running artificially hot.



