Hacker News

This has always been in the back of my mind anytime AMD has some new GPUs with nice features. Gamers will say this will be where AMD will win the war. But I fear the war is already won on the compute that counts, and right now that’s CUDA accel on NVIDIA.


This has been the case for a while because AMD never had the resources to do software well. But their market cap is 10x what it was 5 years ago, so now they do. That still takes time, and having resources isn't a guarantee of competent execution, but it's a lot more likely now than it used to be.

On top of that, Intel is making a serious effort to get into this space and they have a better history of making usable libraries. OpenVINO is already pretty good. It's especially good at having implementations in both Python and not-Python, the latter of which is a huge advantage for open source development because it gets you out of Python dependency hell. There's a reason the thing that caught on is llama.cpp and not llama.py.


AMD's problem with software goes well beyond headcount: they can't stick with anything for any significant length of time, and the fundamental design behind ROCm is doomed to fail because it compiles hardware-specific binaries and offers no backward or forward compatibility.

CUDA compiles to hardware-agnostic intermediate binaries which can run on any hardware as long as the target feature level is compatible, and you can target multiple feature levels with a single binary.
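For what it's worth, that multi-target fat binary is just compiler flags on the NVIDIA side. A sketch (the file names and architecture numbers are illustrative, not from the thread):

```shell
# Build one binary with native code for two GPU generations, plus PTX
# (the intermediate representation) that the driver can JIT-compile
# for GPUs that don't exist yet.
nvcc kernel.cu -o app \
  -gencode arch=compute_70,code=sm_70 \
  -gencode arch=compute_86,code=sm_86 \
  -gencode arch=compute_86,code=compute_86
```

The last `-gencode` line is what buys the forward compatibility being described: it embeds PTX in the binary so that future hardware can compile it at load time.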

CUDA code compiled 10 years ago still runs just fine; ROCm requires recompilation every time the framework is updated and every time new hardware is released.


That's all software. There is nothing but resources between here and a release of ROCm that compiles existing code into a stable intermediate representation, if that's something people care about. (It's not clear if it is for anything with published source code; then it matters a lot more if the new version can compile the old code than if the new hardware can run the old binary, since it's not exactly an ordeal to hit the "compile" button once or even ship something that does that automatically.)


It's a must. Published source code or not, it doesn't help.

First, there is no forward-compatibility guarantee for compilation, and based on its track record so far, it always breaks.

Second, even if the code is available, a design that breaks software on other users' machines is stupid and anti-user.

Plenty of projects could import libraries and then themselves be upstream dependencies for other projects, many of which may not be supported.

CUDA is king because people can and still do run 15-year-old compiled CUDA code on a daily basis, and they know that what they produce today is guaranteed to work on all current and future hardware.

With ROCm you have no guarantee that it will work on even the hardware from the same generation, and you pretty much have a guarantee that the next update will break your stuff.

This was a problem with all AMD compilers for GPGPU and ROCm should’ve tried to solve it from day 1 but it still adopted a poor design and that has nothing to do with how many people are working on it.


> Second, even if the code is available, a design that breaks software on other users' machines is stupid and anti-user.

Most things work like this. You can't natively run ARM programs on x86 or POWER or vice versa, but in most languages you can recompile the code. If you have libraries then you recompile the libraries. All it takes is distributing the code instead of just a binary. Not distributing the code is stupid and anti-user.

> This was a problem with all AMD compilers for GPGPU and ROCm should’ve tried to solve it from day 1 but it still adopted a poor design and that has nothing to do with how many people are working on it.

It isn't even a design decision. Compilers will commonly emit machine code that checks for hardware features like AVX and branches to different instructions based on whether the machine it's running on supports them. That feature can be added to a compiler at any time.

The compiler is open source, isn't it? You could add it yourself, absent any resource constraints.


No, most things definitely don't work like this. I don't expect my x86 program to stop working after a software update, or to not work on new x86 CPUs; that's just ridiculous.

Also if you expect anyone to compile anything you probably haven’t shipped anything in your life.

ROCm is a pile of rubbish; until they throw it out and adopt a model that guarantees forward and backward compatibility, it will remain useless for anyone who actually builds software other people use.


> I don't expect my x86 program to stop working after a software update, or to not work on new x86 CPUs; that's just ridiculous.

Your x86 program doesn't work on Apple Silicon without something equivalent to a recompile. Old operating systems very commonly can't run on bare metal new hardware because they don't have drivers for it.

Even the IR isn't actually machine code, it's just a binary format of something that gets compiled into actual machine code right before use.

> Also if you expect anyone to compile anything you probably haven’t shipped anything in your life.

Half the software people run uses JIT compilation of some kind.
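To make that concrete with a toy (Python standing in for the general "ship source, compile on the target" model; nothing here is CUDA- or ROCm-specific):

```python
# The "artifact" you ship is just source text; the user's machine
# compiles it for whatever runtime it happens to have, at load time.
shipped_source = """
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
"""

code = compile(shipped_source, "<shipped>", "exec")  # local, per-machine compile
namespace = {}
exec(code, namespace)

print(namespace["dot"]([1, 2, 3], [4, 5, 6]))  # -> 32
```

Users never see the compile step; the import machinery (or the GPU driver, in the PTX case) runs it for them.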


The only real remaining fronts in the war are consoles and smartphones, and NVIDIA just signed a deal to license GeForce IP to MediaTek, so that nut is being cracked as well; MediaTek gives them mass-market access for CUDA tech, DLSS, and more. Nintendo has essentially a mobile console platform and will soon be doing DLSS too, on an Orin NX 8nm chip (very cheap), using that same smartphone-level DLSS (probably re-optimized for lower resolutions). Samsung 8nm is exactly Nintendo's kind of cheap; it'll happen.

The "NVIDIA they might leave graphics and just do AI in the future!" that people sometimes do is just such a batshit take because it's graphics that opens the door to all these platforms, and it's graphics that a lot of these accelerators center around. What good is DLSS without a graphics platform? Do you sign the Mediatek deal without a graphics platform? Do you give up workstation graphics and OptiX and raysampling and all these other raytracing techs they've spent billions developing, or do you just choose to do all the work of making Quadros and all this graphics tech but then not do gaming drivers and give up that gaming revenue and all the market access that comes with it? It's faux-intellectualism and ayymd wish-casting at its finest, it makes zero sense when you consider the leverage they get from this R&D spend across multiple fields.

CUDA is unshakeable precisely because NVIDIA is absolutely relentless in getting their foot in the door, then using that market access to build a better mousetrap with software that everyone else is constantly rushing to catch up to. Every segment has some pain points, and NVIDIA figures out what they are and where the tech is going and builds something to address them. AMD's approach of trying to surgically tap high-margin segments before they have a platform worth caring about is fundamentally flawed; they're putting the cart before the horse, and that's why they've kept spinning their wheels on GPGPU adoption for the last 15 years. And that's exactly what people are clamoring for NVIDIA to do with this idea of "abandon graphics and just do AI", and it's completely batshit.

Intel gets it, at least. OneAPI is focused on being a viable product, and they'll move on from there. ROCm is designed for supercomputers where people get paid to optimize for it - it's an embedded product, not a platform. You can't even use the binaries you compile on anything except one specific die (not even a generation: "this binary is for Navi 21, you need the Navi 23 binary"). CUDA is an ecosystem that people reach for because there are tons of tools and libraries and support, it works seamlessly, and you can deliver an actual product that consumers can use. ROCm is something your boss tells you you're going to be using because it's cheap, you are paying to engineer it from scratch, you'll be targeting your company's one specific hardware config, and it'll be inside a web service so it'll be invisible to end-users anyway. It's an embedded processor inside some other product, not a product itself. That's what you get from the "surgically tap high-margin segments" strategy.

But the MediaTek deal is big news. When we were discussing the ARM acquisition, people totally scoffed at the idea that NVIDIA would ever license GeForce IP. And when that deal fell through, they went ahead and did it anyway. Because platform access matters; it's the foot in the door. The ARM deal was never about screwing licensees or selling more Tegras - that would instantly destroy the value of their $40b acquisition. It was 100% always about getting GeForce as the base-tier graphics IP for ARM and getting the market access to crack one of the few remaining segments where CUDA acceleration (and other NVIDIA technologies) aren't absolutely dominant.

And graphics is the keystone of all of it. Market access, software, acceleration, all of it falls apart without the graphics. They'd just be ROCm 2.0 and nobody wants that, not even AMD wants to be ROCm. AMD is finally starting to see it and move away from it, it would be wildly myopic for NVIDIA to do that and Jensen is not an idiot.

Not entirely a direct response to you, but I've seen that sentiment a ton now that AI/enterprise revenue has passed graphics, and it drives me nuts. Your comment about "what would it take to get Radeon ahead of CUDA mindshare" kinda nailed it: CUDA is winning so hard that people are fantasizing about "haha but what if NVIDIA got tired of winning and went outside to ride bikes and left AMD to exploit graphics in peace", and it's crazy to think that could ever be a corporate strategy. Why would they do that when Jensen has spent the last 25 years building this graphics empire? Complete wish-casting. "So dominant that people can't even imagine the tech it would take to break their ubiquity" is exactly where Jensen wants to be, and if anything they are still actively pushing to be more ubiquitous. That's why their P/E is insane (probably overhyped even at that, but damn are they good).

If there is a business to be made doing only AI hardware and not a larger platform (and I don't think there is; at that point you're a commodity like dozens of other startups), it certainly looks nothing like the way NVIDIA is set up. These are all interlocking products and segments and software; you can't cut any one of them away without gutting some other segment. And fundamentally the surgical-revenue approach doesn't work; AMD has continuously shown that for the last 15 years.

Being unwilling to catch a falling knife by cutting prices to the bone doesn't mean they don't want to be in graphics. The consumer GPU market is just unavoidably soft right now, almost regardless of actual value (see: the 4070 for $600 with a $100 gift card at Micro Center still falling flat). Even $500 for a 4070 is probably flirting with being unsustainably low (they need to fund R&D for the next gen out of these margins), but if a de-facto $500 price doesn't spark people's interest or produce an increase in sales, they're absolutely not going any lower than that this early in the cycle. They'll focus on margin on the sales they can actually make, rather than chasing the guy who is holding out for the 4070 to hit $329. People don't realize it, but obstinately refusing to buy at any price (even a good deal) paradoxically creates an incentive to just ignore them and chase margins.

It doesn’t mean they don’t want to be in that market but they’re not going to cut their own throat, mis-calibrate consumer expectations, etc.

Just as AMD is finding out with the RX 7600 launch - if you over-cut on one generation, the next generation becomes a much harder sell. Which is the same lesson NVIDIA learned with the 1080 Ti and the 20-series. AMD is having their 20-series moment right now: they over-cut on the old stuff, and the new stuff is struggling to match the value. And the expectation of future cuts is only going to dampen demand further; they're Osborne Effect'ing themselves with price cuts everyone knows are coming. NVIDIA smartened up: if the market is soft and the demand just isn't there, make fewer gaming cards and shift to other markets in the meantime. Doesn't mean they don't want to be in graphics.



