Why the focus on AMD and Nvidia? It really isn't that hard to design a large number of ALU blocks into some silicon IP block and make them work together efficiently.
> It really isn't that hard to design a large number of ALU blocks into some silicon IP block and make them work together efficiently.
It really is that hard, and the fabrication side of the issue is the easy part from Nvidia's perspective - you just pay TSMC a shitload of money. Nvidia's real victory (besides leading on performance-per-watt) is that their software stack doesn't suck. They invested in complex shader units and tensor accelerators that scale with the size of the card rather than being constrained by puny, limited NPUs. CUDA unified this featureset and has been industry-entrenched for almost a decade, which gave it pretty much any feature you could want, be it crypto acceleration or AI/ML primitives.
The ultimate tragedy is that there was a potential future where a Free and Open Source CUDA alternative existed. Apple wrote the OpenCL spec for exactly that purpose and gave it to Khronos, but later abandoned it to focus on... checks clipboard MLX and Metal Performance Shaders. Oh, what could have been if the industry weren't so stingy and shortsighted.
> Nvidia's real victory (besides leading on performance-per-watt) is that their software stack doesn't suck
YES! And it's not just CUDA and CUDA-adjacent tools, but also their cuDNN/cuBLAS/etc. libraries. They invest a massive amount of staffing into squeezing the last drop of performance out of their hardware, identifying areas for improvement and feeding that back to the architects.
> Apple wrote the OpenCL spec for exactly that purpose and gave it to Khronos
Nitpick: Affie Munshi from Apple wrote down a draft and convinced his management to offer it to Khronos, where it was significantly modified over... was it a year or so?... by a number of representatives from a dozen companies or so. A ton of smart people contributed a ton of work into what became the 1.0 version.
And let me tell you that the discussions were often tense, both during the official meetings as well as what happened behind the scenes. The end result was as good as you can expect from a large committee composed of representatives from competing companies.
But, in summary, you get it, unlike so many commenters in HN.
The industry, meaning Google, decided to go with its Renderscript C99 dialect for Android, while Intel and AMD never delivered anything that could match the CUDA ecosystem (note the ecosystem part), and Khronos never understood the value of C++ and Fortran in HPC; they still don't with regard to Fortran.
Intel has actually proven to be more clever than AMD in that regard, as Data Parallel C++ builds on top of SYCL (it isn't only SYCL), and Intel Fortran now also does GPU offloading.
Sure, Apple did the same thing with TSMC's 5nm node. They still lost in performance-per-watt in direct comparison with Nvidia GPUs using Samsung's 8nm node. Money isn't everything, even when you have so much of it that you can deny your competitors access to the tech you use.
Nvidia's lead is not only cemented by dense silicon. Their designs are extremely competitive, perhaps even a generational leap over what their competitors offer.
The practical answer is that all of FAANG will have to pick up the pieces once their supply chain is shattered. Samsung would quickly reach capacity with AMD and potentially Nvidia as priority customers, and Intel will be trying to court Nvidia and Apple as high-margin customers for some low-yield 18A contract. Depending on whether TSMC's Arizona foundry ever reaches operational capacity, they will be balancing orders from Nvidia and Apple in the same way they do today. Given the pitifully low investment, it's not really likely the Arizona facility will make a dent in the supply chain.
Fact is, Nvidia is well positioned to pick up the pieces even if 5nm> processes go away for the next decade. The only question is whether or not people will continue to have demand for CUDA, and the answer has been "yes" since long before crypto and AI were popular. If TSMC was bombed tomorrow, Nvidia would still have demand for their product and they would still have the capacity to sell it. Their competition with AMD would be somewhat normalized and Apple would be blown into the stratosphere upon realizing that they have to contract either Samsung or Intel to stay afloat. The implications for the American economy are a little upsetting but there's nothing particularly world-ending about that scenario. It would be a sad day to be a Geekbench enthusiast but life would go on.
People use CUDA through a limited number of libraries, for example Torch and Tensorflow, so there isn't a really strong dependence on CUDA for many important applications.
It does not matter. AMD is shit when it comes to low-level processing; their algorithms are stuck and going nowhere. Nvidia is killing it. There is a reason why Zuckerberg ordered billions in GPUs from Nvidia and not from AMD.
>AMD said it now expects to make more than $5 billion from sales of its Instinct data center GPUs this year due to high demand from hyperscalers like Meta and Microsoft
Why is AMD shit at low-level processing? What does it mean "their algos are stuck"? Having watched "the industry" for a few decades now, the appeal for NV smells heavily like the old appeal for Xeons, and Big Blue before them. The moat appears (to me, an unknowledgeable outsider) to be just cultural, not necessarily technical.
As someone who worked in the ML infra space: Google, Meta, xAI, Oracle, Microsoft, and Amazon have clusters that perform better than the highest-performing cluster on Top500. They don't submit because there's no reason to, and some want to keep the size of their clusters a secret. They're all running Nvidia. (Except Google, who uses TPUs and Nvidia.)
> El Capitan – we don’t yet know how big of a portion yet as we write this – with 43,808 of AMD’s “Antares-A” Instinct MI300A devices
By comparison, xAI announced that they have 100k H100s. MI300A and H100s have roughly similar performance. Meta says they're training on more than 100k H100s for Llama 4, and have the equivalent of 600k H100s' worth of compute. (Note that compute and networking can be orthogonal.)
Also, Nvidia B200s are rolling out now. They offer 2-3x the performance of H100s.
> Nvidia B200s ... offer 2-3x the performance of H100s
For ML, not for HPC. ML and HPC are two completely different, only loosely related fields.
ML tasks are doing great with low precision: 16- and 8-bit precision is fine, and arguably good results can be achieved even with 4-bit precision [0][1]. That won't do for HPC tasks like predicting global weather, computational biology, etc. -- those need 64-bit (or even 128-bit) precision.
Nvidia needs to decide how to divide the billions of transistors on their new silicon. Greatly oversimplifying, they can choose to make one of the following:
* Card A with *n* FP64 cores, or
* Card B with *2n* FP32 cores, or
* Card C with *4n* FP16 cores, or
* Card D with *8n* FP8 cores, or (theoretically)
* Card E with *16n* FP4 cores (not sure if FP4 is a thing).

Card A would give HPC guys n usable cores, and it would give ML guys n usable cores. On the other end, Card E would give ML guys 16n usable cores (and zero usable cores for HPC guys). It's no wonder that the HPC crowd wants Nvidia to produce Card A, while the ML crowd wants Nvidia to produce Card E. Given that all the hype and the money are currently with the ML guys (and $NVDA reflects that), Nvidia will make a combination of different cores that is much, much closer to Card E than it is to Card A.

Their new offerings are arguably worse than their older offerings for HPC tasks, and the feeling among the HPC crowd is that "Nvidia and AMD are in the process of abandoning this market".
[0] https://papers.nips.cc/paper/2020/file/13b919438259814cd5be8...
[1] https://arxiv.org/abs/2212.09720
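To make the low-precision point concrete, here is a tiny sketch (plain NumPy, nothing from the linked papers) of why ML-style tolerance for error and HPC-style accumulation are different animals: naively accumulating many small increments in FP16 stalls long before FP64 does.

```python
import numpy as np

# Sum 100,000 increments of 1e-4.  Exact answer: 10.0.
inc16 = np.float16(1e-4)
inc64 = np.float64(1e-4)

acc16 = np.float16(0.0)
acc64 = np.float64(0.0)
for _ in range(100_000):
    acc16 = acc16 + inc16   # each partial sum is rounded back to ~11 significant bits
    acc64 = acc64 + inc64

print(acc16)   # stalls around 0.25: once the sum dwarfs the increment, additions round away
print(acc64)   # ~10.0
```

ML training sidesteps this with tricks like loss scaling and higher-precision accumulators; most HPC codes cannot, which is part of what drives the FP64 requirement.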
Doesn't multiplier area scale at O(n^2 * log(n))? (At least, I'm pretty sure the Wallace tree multiplier circuit is somewhere in that order.)
So a 64-bit multiplier is something like 32x more area than a 16-bit multiplier.
But what you say is correct for RAM area or the number of bits you need for register space. So taken holistically, it's difficult to say...
Okay, 64-bit FP is only like 53-bits and 16-bit FP is actually like 11 bits. But you know what I mean. I'm still doing quick napkin math here, nothing formal.
-------
We can ignore adders and subtractor circuits because they are so small. Division is often implemented as reciprocal followed by multiplication circuits for floating point (true division is very expensive).
With the B100 somehow announced to have lower scalar FP64 throughput than the H100 (did they remove the DP tensor cores?), one will have to rely on Ozaki schemes (dgemm via int8 tensor cores), and the recent body of work on mixed-precision linear algebra shows there's a lot of computing power to be harnessed from tensor cores.

One of the problems of HPC now is a level of ossification in some codebases (or the lack of availability of porting/coding/optimizing people). You shouldn't have to rewrite everything every 5 years, but the hardware vendors go where they go and we still haven't found the right level of abstraction to avoid big porting efforts.
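For readers who haven't seen the trick, here is a toy sketch of the slicing idea such mixed-precision schemes build on: split each operand into narrow integer slices, take exact low-precision products, and do only the rescale-and-sum in FP64. This is a deliberate simplification, not the actual Ozaki algorithm (which works blockwise with shared exponents and careful error control); the slice width and counts below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(256)
b = rng.standard_normal(256)

BITS = 11                    # bits kept per slice -- a toy choice
SCALE = float(2 ** BITS)

def to_slices(x, n_slices=4):
    """Split x into fixed-point slices so that x ~= sum_k slices[k] / SCALE**(k+1),
    with each slice an array of narrow integers (stored as float64 here)."""
    slices, rem = [], x.astype(np.float64).copy()
    for k in range(n_slices):
        s = np.round(rem * SCALE ** (k + 1))
        slices.append(s)
        rem = rem - s / SCALE ** (k + 1)
    return slices

sa, sb = to_slices(a), to_slices(b)

# Each pairwise slice product multiplies narrow integers, i.e. something a
# low-precision matrix unit with wide integer accumulation could do exactly;
# only the final rescale-and-sum happens in float64.
dot = 0.0
for i, si in enumerate(sa):
    for j, sj in enumerate(sb):
        dot += (si @ sj) / SCALE ** (i + j + 2)

print(dot, a @ b)   # the two agree to roughly 10 decimal places with this toy slice count
```

The point is that the expensive inner products only need exact small-integer multiplies, which is exactly what int8 tensor cores provide.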
You've heard of SIMD - it's possible to do both, in terms of throughput, with instruction/scheduler/port complexity overhead of course.
Yes, that's a great point that I missed. From anecdotal evidence, it seems more people are using supercomputers for ML use cases, that would have been traditionally done by HPC. (eg training models for weather forecasts)
The Top500 list is useful as a public, standardized baseline that is straightforward, with a predictable cadence for more than 30 years. It is trickier to compare cloud infrastructures due to their heterogeneity, their fast pace and, more importantly, the lack of standardized tests, although MLCommons [1] has been very keen on helping with that.
[1] https://mlcommons.org/datasets/
If I understand your comment correctly, we're taking a stable but not that relevant metric, because the real players of the market are too secretive, fast and far ahead to allow for simple comparisons.
From a distance, it kinda sounds like listening to kids brag about their allowance while the adults don't want to talk about their salary, and try to draw wider conclusions from there.
It seems there was a misunderstanding, as I haven't made any value judgment about LINPACK.
Yes, LINPACK is indeed "old" with a heavy focus on compute power. However, its simplicity serves as a reliable baseline for the types of workflows that supercomputers are designed to handle. Also, at their core, most AI workloads perform essentially the same operations as HPC, albeit with less stability—which, I admit, is a feature, but likely the reason AI-focused systems do not prioritize LINPACK as much.
I am simply saying that any useful metric needs to not only be "stable", but also simple to grasp. Take Green500, probably a significant benchmark for understanding how algorithms consume power, but "too complex" to explain: yet, many cloud providers with their AI supercomputers avoid competing against HPC supercomputers in this domain.
This avoidance isn’t necessarily due to secrecy but rather inefficiencies inherent to cloud systems. Consider PUE (Power Usage Effectiveness)—a highly misleading metric that cloud providers frequently tout. PUE can easily be manipulated, especially with the use of liquid cooling, which is why optimizing for it has become a major factor contributing to water disruptions in several large cities worldwide.
Even the DoE posts top 500 results when they commission a supercomputer.
DoE has absolutely no incentive (nor need, I'd argue) to compare their supercomputers to commercially owned data center operations though.
Comparing their crazy expensive custom-built HPC to massive arrays of consumer-grade hardware doesn't bring them additional funds, nor does it help them more PR-wise than already being the owner of the fastest individual clusters.
Being at the top of some heap is visibly one of their goals:
https://www.energy.gov/science/high-performance-computing
DOE clusters are also massive arrays of consumer-grade hardware. The private cloud can only keep up in low-precision work, and that is why they're still playing with remote memory access over TCP: it's good enough for web and ML.
High precision HPC exists in the private cloud, but you only hear "we don't want to embarrass others" excuses because otherwise you would be able to calculate the cost.
On prem HPC is still very, very much cheaper than hiring out.
B200s have an incremental increase in FP64 and FP32 performance over H100s. Those are the number formats that HPC people care about.
The MI300A can get to 150% the FP64 peak performance that B200 devices can get, although AMD GPUs have historically underperformed their spec more than Nvidia GPUs. It's possible that B200 devices are actually behind for HPC.
Top line comparison numbers for reference: https://www.theregister.com/2024/03/18/nvidia_turns_up_the_a...
It does seem like Nvidia is prioritizing int8 / fp8 performance over FP64, which given the current state of the ML marketplace is a great idea.
The MI300 also has decent FP16 performance (~108 TFLOPS). Not as good as Nvidia, but it's getting there. Does anyone have experience using these with JAX? Support is said to be decent, but I have no idea if it's good enough for research-oriented tasks, i.e. stable enough for training and inference.
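Not an answer on training stability, but a quick sanity check of whether a ROCm JAX install even sees the GPU might look like the sketch below (assuming the ROCm JAX plugin wheels are installed; the shapes and dtype are arbitrary):

```python
import jax
import jax.numpy as jnp

print(jax.devices())          # should list the MI300 device(s) if the ROCm build is picked up

x = jnp.ones((8192, 8192), dtype=jnp.bfloat16)
y = (x @ x).block_until_ready()   # runs the matmul on the accelerator and waits for it
print(y.dtype, y.shape)
```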
A cluster is not a super computer.
The whole point of a supercomputer is that it acts as much like a single machine as possible, while a cluster is a soup of nearly independent machines.
> soup of nearly independent machines
that does a serious disservice to hyperscaler clusters.
Sure but it's closer to the truth than saying they have similar or more raw compute than a super computer.
i wish people wouldn't make stuff up just to sound cool.
like do you have actual experience with gov/edu HPC? i doubt it because you couldn't be more wrong - lab HPC clusters are just very very poorly (relative to FAANG) strewn together nodes. there is absolutely no sense in which they are "one single machine" (nothing is "abstracted over" except NFS).
what you're saying is trivially false because no one ever requests all the machines at once (except when they're running linpack to produce top500 numbers). the rest of the time the workflow is exactly like in any industrial cluster: request some machines (through slurm), get those machines, run your job (hopefully you distributed the job across the nodes correctly), release those machines. if i still had my account i could tell you literally how many different jobs are running right now on polaris.
Actually, LLNL (the site of El Capitan) has a process for requesting Dedicated Application Time (a DAT) where you use up to a whole machine, usually over a weekend. They occur fairly regularly. Mostly it's lots of individual users and jobs, like you said though.
> where you use up to a whole machine
i mean rick stevens et al can grab all of polaris too but even so - it's just a bunch of nodes and you're responsible for distributing your work across those nodes efficiently. there's no sense in which it's a "single computer" in any way, shape or form.
The same way that you're responsible for distributing your single threaded code between cores on your desktop.
No. Threads run typically in the same address space. HPC processes on different nodes typically do not.
Define address space.
Cache is not shared between cores.
HPCs just have more levels of cache.
Lest you ignore the fact that infiniband is pretty much on par with top of the line ddr speeds for the matching generation.
>Lest you ignore the fact that infiniband is pretty much on par with top of the line ddr speeds for the matching generation.
You can't go faster than the speed of light (yet) and traveling a few micrometers will always be much faster than traversing a room (plus routing and switching).
Many HPC tasks nowadays are memory-bound rather than CPU-bound, memory-latency-and-throughput-bound to be more precise. An actual supercomputer would be something like the Cerebras chip, a lot of the performance increase you get is due to having everything on-chip at a given time.
There are four sentences in your comment.
None of them logically relate to another.
One is a question.
And the rest are wrong.
Really? How about: "This pointer is valid, has the same numeric value (address) and points to the same data in all threads". The point is not the latency nor bandwidth. The point is the programming/memory model. Infiniband maybe makes multiprocessing across nodes as fast as multiprocessing on a single node. But it's not multithreading.
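A minimal Python illustration of that memory-model distinction: a write made by a thread is visible to the parent because they share one address space, while a child process mutates its own copy unless you set up explicit sharing or message passing (which is what MPI-style HPC code does across nodes).

```python
import threading
import multiprocessing as mp

data = [0]

def bump(buf):
    buf[0] += 1          # mutate through the reference we were handed

if __name__ == "__main__":
    t = threading.Thread(target=bump, args=(data,))
    t.start(); t.join()
    print("after thread: ", data[0])   # 1 -- threads share one address space, so the write is visible

    p = mp.Process(target=bump, args=(data,))
    p.start(); p.join()
    print("after process:", data[0])   # still 1 -- the child changed its own copy; sharing must be explicit
```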
>Cache is not shared between cores.
I feel sorry for you if you believe this. It's not true physically nor is it true on the level of the cache coherence protocol nor is it true from the perspective of the operating system.
Tell me you've never run a distributed workload without telling me. You realize if what you were saying were true, HPC would be trivial. In fact it takes a whole lot of PhDs to manage the added complexity because it's not just a "single computer".
If you think parallelizing single threaded code is trivial ... well there's nothing else to say really.
Is there like a training program available for learning how to be this obstinate? I would love to attend so that I can win fights with my wife.
Maybe llm_trw is your wife?
Put slurm on it, bam. Supercomputer.
Google is running its own TPU hardware for internal workloads. I believe Nvidia is just resold for cloud customers.
Nvidia GPUs are also used for inference on Google products. It just depends on availability.
Interesting, do you have a source for this? I've not been able to find one.
GCP plans offer access to high-end NVIDIA GPUs, as well as TPUs. I thought Google products use the same pool of resources that is also resold to customers?
Only some Google products. Most still run on internal platforms, not GCP.
OK, interesting, so there is some dogfooding, but it's not complete.
Not true. Apple trained some models on their TPU
Apologies, to be clear: what I meant was that, to my knowledge, Google doesn't use GPUs for its own stuff, but does sell both TPUs and GPUs to others on Cloud.
Also, to be clear, I have no internal info about this, I'm going based on external stuff I've seen.
Huh? https://cloud.google.com/tpu/docs/intro-to-tpu
Generally, HPC compute has lower margins, similar to consoles. It makes sense that AMD would fight harder for that contract than Nvidia, much as IBM stopped competing for this business. It's sort of comparing Apples to Raspberry Pis.
Hey now I compare Apples to Raspberry Pi's regularly :)
China has been absent from TOP500 for years as well.
B200 is very much not rolling out because NVIDIA, after the respin, doesn't have the thermals under control (yet).
Your other points may be valid.
Source?
Reuters!
don't spread FUD please.
Ya exactly - no one cares about top500 outside of academia (literally have never heard it come up at work). So this is like the gold star (participation award) of DCGPU competition.
After skimming the article, I'm confused -- where exactly is the headline being pulled from?
If you look at the table toward the bottom, no matter how you slice it, Nvidia has 50% of the total cores, 50% of the total flops, and 90% of the total systems among the Top 500, while AMD has 26% of the total cores, 27.5% of the total flops, and 7% of the total systems.
Is it a matter of newly-added compute?
> This time around, on the November 2024 Top500 rankings, AMD is the big winner in terms of adding capacity to the HPC base.
Said table is titled for "Accelerated Supers" (i.e. only ones with GPUs) so the numbers can't be applied to the Top500 as a whole like that. Combining numbers from the summaries at the bottom of the table titled for "All Supers", Nvidia is more like 38% of Top500 FLOPS as they don't have any non-accelerated systems in the list.
Knowing all of that it still leaves unexplained whether AMD has the needed ~70% of non-accelerated compute (assuming FLOPS) to clear the bar for the headline. It seems unlikely to me... but the article doesn't actually have enough data to be sure one way or the other.
That's a good point.
That said, I assumed the context of the article was specifically on the topic of AMD GPUs, and not, say, Epyc processors, so if so, it's ultimately irrelevant.
(Also, there are just over 3 exaflops across non-accelerated supers; AMD would need > 2/3 of the remaining share in order to surpass Nvidia on that front as well.)
> AMD GPUs drove 72.1 percent of the new performance added for the November 2024 rankings
Yes, I saw that, but that doesn't justify the title as written. Had it said "AMD Now Has More New Compute" I wouldn't have said anything.
I'm sure there is also a lot not on the Top500. I've got enough AMD MI300x compute for about 140th position, but haven't submitted numbers.
Top500 is weird for lots of reasons. It over-indexes on a few peculiar types of workloads and a few peculiar types of users (mostly gov).
Historically, those workloads and users were leading indicators of certain types of things. I don't think that's true anymore. In fact, I wonder if this is mostly a story of the government agencies not being able to compete with the private sector for NVIDIA gpus.
I think you nailed it on the head.
Companies like CoreWeave have deployed so many giant clusters (and growing), it is insane. Their idle compute is larger than most of the supercomputers out there.
Of course, they aren't on the list either.
There is another widespread common factor among the top machines. A large majority are based on HPE Slingshot networking (7 out of top 10 by my count).
Without blindingly fast interconnects, otherwise blinding numerical performance dims quite a lot. This is why the Cerebras numbers on heavy numerical problems are competitive only up to a pretty severe ceiling: below that point their on-wafer interconnect suffices, above it they cannot scale the necessary data-communication bandwidth.
layperson with no industry knowledge, but it seems like nvidia's CUDA moat will fall in the next 2-5 years. It seems impossible to sustain those margins without competition coming in and getting a decent slice of the pie
But how will AMD or anyone else push in? CUDA is actually a whole virtualization layer on top of the hardware and isn't easily replicable, Nvidia has been at it for 17 years.
You are right, eventually something's gotta give. The path for this next leg isn't yet apparent to me.
P.s. how much is an exaflop or petaflop, and how significant is it? The numbers thrown around in this article don't mean anything to me. Is this new cluster way more powerful than the last top?
The API part isn't thaaat hard. Indeed HIP already works pretty well at getting existing CUDA code to work unmodified on AMD HW. The bigger challenge is that the AMD and Nvidia architectures are so different that the optimization choices for what the kernels would look like are more different between Nvidia and AMD than they would be between Intel and AMD in CPU land even including SIMD.
Only if the only thing one cares about is CUDA C++, and not CUDA C, CUDA C++, CUDA Fortran, CUDA Anything PTX, plus libraries, IDE integration, GPU graphical debugging.
CUDA C works fine with HIP; not sure what you're referring to. As for the other pieces, GPU graphical debugging isn't relevant for CUDA, and I don't know what IDE integration is special/relevant for CUDA, but AMD does have a ROCm debugger which I would imagine is sufficient for simultaneous debugging of CPU & GPU. You won't get developer tools like Nsight Systems, but I'm pretty sure AMD has equivalent tooling.
As for Fortran, that doesn't come up much in modern AI stuff. I haven't observed PTX / GCN assembly within AI codebases but maybe you have extra insight there.
> P.s. how much is an exaflop or petaflop, and how significant is it? The numbers thrown around in this article don't mean anything to me. Is this new cluster way more powerful than the last top?
Nominally, a measurement in "flops" is how many FLoating-point Operations Per Second the hardware is capable of performing (64-bit operations for the Top500's LINPACK runs; lower precisions for many ML figures), so it's an approximate measure of total available computing power.
A high-end consumer-grade CPU can achieve on the order of a few hundred gigaflops (let's say 250, just for a nice round number). https://boinc.bakerlab.org/rosetta/cpu_list.php
A petaflop is therefore about four thousand of those; multiply by another thousand to get an exaflop.
For another point of comparison, a high-end GPU might be on the order of 40-80 teraflops. https://www.tomshardware.com/reviews/gpu-hierarchy,4388-2.ht...
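A hands-on way to calibrate these units: time a dense matrix multiply and count roughly 2·n³ floating-point operations. A rough sketch; the measured number depends heavily on the BLAS build and the machine.

```python
import time
import numpy as np

n = 2048
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a @ b                              # dense matmul: ~2 * n**3 floating-point operations
elapsed = time.perf_counter() - t0

gflops = 2 * n**3 / elapsed / 1e9
print(f"{gflops:.0f} GFLOP/s")
# A desktop CPU typically lands in the tens to hundreds of GFLOP/s here;
# 1 petaflop/s is about 10,000x that, and 1 exaflop/s about 10,000,000x.
```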
How many teraflops in an exaflop? The tera is screwing me up.. Google not helping today, so many cards.
https://en.m.wikipedia.org/wiki/Metric_prefix
Anybody spending tens of billions annually on Nvidia hardware is going to be willing to spend millions to port their software away from CUDA.
First they need to support everything that CUDA is capable of across its programming-language portfolio, tooling, and libraries.
A typical LLM might use about 0.1% of CUDA. That's all that would have to be ported to get that LLM to work.
Which is missing the point why CUDA has won.
Then again, maybe the goal is getting 0.1% of CUDA market share. /s
Nvidia has won because their compute drivers don't crash people's systems when they run e.g. Vulkan Compute.
You are mostly listing irrelevant nice to have things that aren't deal breakers. AMD's consumer GPUs have a long history of being abandoned a year or two after release.
CUDA C++, CUDA Fortran, CUDA Anything PTX, plus libraries, IDE integration, GPU graphical debugging, aren't only nice to have things.
In the words of Gilfoyle-- I'll bite. Why has CUDA won?
CUDA C++, CUDA Fortran, CUDA Anything PTX, plus libraries, IDE integration, GPU graphical debugging.
Coupled with Khronos, Intel, and AMD never delivering anything comparable with OpenCL, Apple losing interest after Khronos didn't take OpenCL in the direction they wanted, and Google never adopting it, favouring their Renderscript dialect.
For the average non-FAANG company, there's nothing to port to yet. We don't all have the luxury of custom TPUs.
To slower hardware? What are they supposed to port to, ASICs?
if the hardware is 30% slower and 2x cheaper, that's a pretty great deal.
Power density tends to be the limiting factor for this stuff, not money. If it's 30 percent slower per watt, it's useless.
The ratio between power usage and GPU cost is very, very different than with CPUs, though. If you could save e.g. 20-30% of the purchase price that might make it worth it.
e.g. you could run an H100 at 100% utilization 24/7 for a year at $0.4 per kWh (so assuming significant overhead for infrastructure etc.) and that would only cost ~10% of the purchase price of the GPU itself.
The cost of power usage isn't mainly the electricity bill; it's the capacity and cooling.
Yes, I know that. Hence I quadrupled the price of electricity. Or are you saying that the cost of capacity and cooling doesn't scale directly with power usage?
We can increase that another 2x and the cost would still be relatively low compared to the price/deprecation of the GPU itself.
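Spelling out the napkin math from the H100 example above (the board power and purchase price are assumptions for illustration):

```python
power_kw = 0.7            # assumed ~700 W board power for an H100 SXM
price_per_kwh = 0.4       # electricity plus infrastructure overhead, as in the comment above
hours = 24 * 365
gpu_price = 30_000        # assumed purchase price; real prices vary a lot

yearly_energy_cost = power_kw * price_per_kwh * hours
print(yearly_energy_cost)               # ~2,450 USD per year
print(yearly_energy_cost / gpu_price)   # ~0.08, i.e. roughly 10% of the GPU's price
```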
CUDA is the assembly to Torch's high-level language; for most, it's a very good intermediary, but an intermediary nonetheless, as it is between the actual code they are interested in, and the hardware that runs it.
Most customers care about cost-effectiveness more than best-in-class raw-performance, a fact that AMD has ruthlessly exploited over the past 8 years. It helps that AMD products are occasionally both.
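Concretely, "CUDA as the intermediary" looks like this from the Torch user's side: the code names a device string and never calls CUDA itself, so the backend underneath is, in principle, swappable (a generic sketch, not tied to any particular model).

```python
import torch

# The same user code runs on whatever backend the wheel was built for: on an
# Nvidia or ROCm build the device string is "cuda", on Apple silicon it is "mps".
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(4096, 4096).to(device)
x = torch.randn(8, 4096, device=device)
y = model(x)                 # dispatches to cuBLAS, rocBLAS, MPS or CPU kernels underneath
print(y.shape, y.device)
```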
CUDA is much more than that, and missing that out is exactly why NVidia keeps winning.
Again, I have AMD hardware and can't use it.
AMD is to blame for where they stand.
Software will bridge the gap. There are simply too many competing platforms out there that are not Nvidia based. Most decent AI libraries and frameworks already need to support more than just Nvidia. There's a reason macs are popular with AI researchers: many of these platforms support Apple's chips already and they perform pretty well. Anything that doesn't support those chips, is a problem waiting to be fixed with plenty of people working on fixing that. If it can be fixed for Apple's chips, it can also be fixed for other people's chips.
And of course there is some serious amount of money sloshing around in this space. Things being hard doesn't mean it's impossible. And there's no shortage of extremely well funded companies working on this stuff. All your favorite trillion $ companies basically. And most of them have their own AI chips too. And probably some reservations about perpetually handing a lot of their cash to Nvidia.
If you want an example of a company that used to have a gigantic moat that is now dealing with a lot of competition, look at Intel. X86 used to be that moat. And that's looking pretty weak lately. One reason that AMD is in the news a lot lately is that they are growing at Intel's expense. Nvidia might be their next target.
A high-end consumer GPU (a 4090) is about 80 teraflops. So rounding up to 100, an exaflop is about 10,000 consumer-grade cards' worth of compute, and a petaflop is about 10.
Which doesn’t help with understanding how much more impressive these are than the last clusters, but does to me at least put the amount of compute these clusters have into focus.
You're off by three orders of magnitude.
My point of reference is that back in undergrad (~10-15 years ago), I recall a class assignment where we had to optimize matrix multiplication on a CPU; typical good parallel implementations achieved about 100-130 gigaflops (on a... Nehalem or Westmere Xeon, I think?).
You are 100% correct, I lost a full prefix of performance there. Edited my message.
Which does make the clusters a fair bit less impressive, but also a lot more sensibly sized.
4090 tensor performance (FP8): 660 teraflops, 1320 "with sparsity" (i.e. max theoretical with zeroes in the right places).
https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid...
But at these levels of compute, the memory/interconnect bandwidth becomes the bottleneck.
According to Wikipedia the previous #1 was from 2022 with a peak petaflops of 2,055. This system is rated at 2,746. So about 33% faster than the old #1.
Also, of the top 10, AMD has 5 systems.
https://en.wikipedia.org/wiki/TOP500
> P.s. how much is an exaflop or petaflop
1 petaflop = 10^15 flops = 1,000,000,000,000,000 flops.
1 exaflop = 10^18 flops = 1,000,000,000,000,000,000 flops.
Note that these are simply powers of 10, not powers of 2, which are used for storage for example.
People have been chipping away at this for a while. HIP allows source-level translation, and libraries like Jax provide a HIP version.
There is ZLUDA to break the lock-in for those who are stuck with it. The rest will use something else.
Isn't porting software to the next generation supercomputer pretty standard for HPC?
It's possible. Just look at Apple's GPUs: they're mostly supported by torch, and what's left are mostly edge cases. Apple should make a datacenter GPU :D that would be insanely funny. It's actually somewhat well positioned because, thanks to the MacBooks, the support is already there. I assume here that most things translate to Linux, as I don't think you can sell macOS in the cloud :D
I know a lot of people developing on Apple silicon and just pushing to clusters for bigger runs. So why not run it on an Apple GPU there?
> Apple should make a datacenter GPU
Aren't their GPUs pretty slow, though? Not even remotely close to Nvidia's consumer GPU with only (significant) upside being the much higher memory capacity.
> what's left are mostly edge-cases.
For everything that isn't machine learning, I frankly feel like it's the other way around. Apple's "solution" to these edge cases is telling people to write compute shaders that you could write in Vulkan or DirectX instead. What sets CUDA apart is an integration with a complex acceleration pipeline that Apple gave up trying to replicate years ago.
When cryptocurrency mining was king-for-a-day, everyone rushed out to buy Nvidia hardware because it supported accelerated crypto well from the start. The same thing happened with the AI and machine learning boom. Apple and AMD were both late to the party and wrongly assumed that NPU hardware would provide a comparable solution. Without a CUDA competitor, Apple would struggle more than AMD to find market fit.
Well, but machine learning is the major reason we use GPUs in the datacenter (not talking about consumer GPUs here). The others are edge cases for datacenter applications! Apple is uniquely positioned exactly because that part is already solved, since a significant share of ML engineers use MacBooks to develop locally.
The code to run these things on Apple's GPUs exists and is used every day! I don't know anyone using AMD GPUs, but pretty often it's Nvidia on the cluster and Apple on the laptop. So if Nvidia is making these juicy profits, I think Apple could seriously consider moving into the cluster if it wants to.
Software developers using MacBooks doesn't mean Apple solved the ML problem. The past 10 years of macOS removing features have somewhat proved that software developers will keep using Macs even when the feature set regresses. Like how Apple used to support OpenCL as a CUDA alternative, but gave up on it altogether to focus on simpler, mobile-friendly GPU designs.
The PyTorch MPS patches are a fun appeasement for developers, but they didn't unthrone Nvidia's demand. They didn't beat Nvidia on performance per watt, they didn't match its price, its scale or CUDA's feature set, and they don't even provide basic server drivers. It's got nothing to do with what brand you prefer and everything to do with what makes actual sense in a datacenter. Apple can't take on Nvidia clusters without copying Nvidia's current architecture; Apple Silicon as it stands is too inefficient to be a serious replacement.
If Apple wanted to have a shot at entering the cluster game, that window of opportunity closed when Apple Silicon converged on simplified GPU designs. The 2 W NPUs and compute shaders aren't going to make Nvidia scared, let alone compete with AMD's market share.
> But how will AMD or anyone else push in? CUDA is actually a whole virtualization layer on top of the hardware and isn't easily replicable, Nvidia has been at it for 17 years.
Nvidia currently has 80-90% gross margins on their LLM GPUs; that's all the incentive another company needs to invest money into a CUDA alternative.
Maybe the DOJ will come in and call it anti-trust shenanigans.
Not that I would want this...
We donated one of our MI300x systems to the SCALE team. The moat-less future is coming more quickly than you think.
https://scale-lang.com/
The CUDA moat is highly overrated for AI in the first place, and it gets sold as the reason for AMD's failure. Almost no one in AI uses CUDA directly; they only use PyTorch or Triton. TPUs didn't face much of a hurdle from CUDA because they were initially better in terms of price to performance and supported PyTorch, TensorFlow and JAX.
The reason AMD is behind is that it is behind in hardware. MI300X is pricier per hour than the H100 in every cloud I can find, and the MFU is an order of magnitude lower than NVIDIA's for transformers, even though transformers are fully supported. And I get the same 40-50% MFU on TPU for the same code. If anyone is investing >10 million dollars in hardware, they sure can invest a million dollars to rewrite everything in whatever language AMD asks them to, if it is cheaper.
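For anyone unfamiliar with the term, MFU (model FLOPs utilization) is just achieved model FLOPs over hardware peak. A rough sketch using the common ~6 FLOPs per parameter per token approximation; the throughput and peak numbers are placeholders, not measurements:

    # Rough MFU estimate for transformer training.
    params = 70e9                   # model parameters
    tokens_per_sec = 7.5e3          # measured training throughput (placeholder)
    peak_flops = 8 * 989e12         # e.g. 8 accelerators at ~989 TFLOPS bf16 each (placeholder)

    achieved_flops = 6 * params * tokens_per_sec   # ~6 FLOPs per parameter per token
    mfu = achieved_flops / peak_flops
    print(f"MFU ~ {mfu:.0%}")       # ~40% with these placeholder numbers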
People most certainly do use CUDA
At this scale CUDA is quite useless.
You need to develop your own in-house solution for distributing workloads.
The difference to regular clusters is that all the memory is globally visible, so machine 0023 can access and modify address 0x0123456789abcdef0123456789abcdef which happens to be on machine 0999.
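For a concrete flavor of "globally visible memory", here is one-sided RMA sketched with mpi4py; it's an illustration of the idea (a rank writing directly into another rank's memory), not how any particular machine's runtime actually implements it:

    # One rank writes directly into another rank's exposed memory window,
    # with no matching receive on the target. Run with e.g.:
    #   mpirun -n 2 python rma_sketch.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    local = np.zeros(4, dtype="d")            # memory this rank exposes
    win = MPI.Win.Create(local, comm=comm)

    win.Fence()
    if rank == 0:
        win.Put(np.full(4, 42.0), target_rank=1)   # write into rank 1's window
    win.Fence()

    if rank == 1:
        print(local)                          # now holds what rank 0 put there
    win.Free()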
CUDA is one part, but another part of Nvidia's lead is their focus on bandwidth both memory and GPU-GPU communication. AMD dramatically falls behind Nvidia in training because of its terrible collective times (AllReduce, AllGather, etc.)
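To put a number on why collectives matter: a ring all-reduce moves roughly 2*(N-1)/N of the gradient buffer through each GPU's links, so the interconnect sets the floor on step time. A toy cost model (the bandwidth figure is a placeholder, not a measurement):

    # Toy ring all-reduce cost model: each GPU sends/receives about
    # 2 * (N - 1) / N of the buffer, so link bandwidth dominates.
    n_gpus = 8
    grad_bytes = 70e9 * 2                  # e.g. 70B parameters of gradients in bf16
    link_bw = 400e9                        # bytes/s per GPU (placeholder)

    bytes_per_gpu = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    print(bytes_per_gpu / link_bw)         # ~0.6 s per full-gradient all-reduce here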
Why the focus on AMD and Nvidia? It really isn't that hard to design a large number of ALU blocks into some silicon IP block and make them work together efficiently.
The real accomplishment is fabricating them.
> It really isn't that hard to design a large number of ALU blocks into some silicon IP block and make them work together efficiently.
It really is that hard, and the fabrication side of the issue is the easy part from Nvidia's perspective - you just pay TSMC a shitload of money. Nvidia's real victory (besides leading on performance-per-watt) is that their software stack doesn't suck. They invested in complex shader units and tensor accelerators that scale with the size of the card rather than being constrained to puny, limited NPUs. CUDA unified this feature set and has been industry-entrenched for almost a decade, which gave it pretty much any feature you could want, be it crypto acceleration or AI/ML primitives.
The ultimate tragedy is that there was a potential future where a Free and Open Source CUDA alternative existed. Apple wrote the OpenCL spec for exactly that purpose and gave it to Khronos, but later abandoned it to focus on... checks clipboard MLX and Metal Performance Shaders. Oh, what could have been if the industry weren't so stingy and shortsighted.
> It really is that hard
YES!! Thank you!
> Nvidia's real victory (besides leading on performance-per-watt) is that their software stack doesn't suck
YES! And it's not just CUDA and CUDA-adjacent tools, but also their cuDNN/cuBLAS/etc. libraries. They invest a massive amount of staffing into squeezing the last drop of performance out of their hardware, identifying areas for improvement and feeding that back to the architects.
> Apple wrote the OpenCL spec for exactly that purpose and gave it to Khronos
Nitpick: Affie Munshi from Apple wrote down a draft and convinced his management to offer it to Khronos, where it was significantly modified over... was it a year or so?... by a number of representatives from a dozen companies or so. A ton of smart people contributed a ton of work into what became the 1.0 version.
And let me tell you that the discussions were often tense, both during the official meetings as well as what happened behind the scenes. The end result was as good as you can expect from a large committee composed of representatives from competing companies.
But, in summary, you get it, unlike so many commenters on HN.
The industry, meaning Google, decided to go with the RenderScript C99 dialect for Android, while Intel and AMD never delivered anything that could match the CUDA ecosystem (note the ecosystem part), and Khronos never understood the value of C++ and Fortran in HPC; they still don't with regard to Fortran.
Intel has actually proven to be more clever than AMD in that regard, as Data Parallel C++ builds on top of SYCL (it isn't only SYCL), and Intel Fortran now also does GPU offloading.
> you just pay TSMC a shitload of money
I guess with money you can win any argument ...
Sure, Apple did the same thing with TSMC's 5nm node. They still lost in performance-per-watt in direct comparison with Nvidia GPUs using Samsung's 8nm node. Money isn't everything, even when you have so much of it that you can deny your competitors access to the tech you use.
Nvidia's lead is not only cemented by dense silicon. Their designs are extremely competitive, perhaps even a generational leap over what their competitors offer.
Let me phrase it differently.
If Nvidia pulls the plug we can still go to AMD and have a reasonable alternative.
If TSMC pulls the plug, however ...
Samsung's fabrication is about as good as TSMC. Or at least it was when I retired a few years ago.
Then so what? It's whataboutism.
The practical answer is that all of FAANG will have to pick up the pieces once their supply chain is shattered. Samsung would quickly reach capacity with AMD and potentially Nvidia as priority customers, and Intel will be trying to court Nvidia and Apple as high-margin customers for some low-yield 18A contract. Depending on whether TSMC's Arizona foundry ever reaches operational capacity, they will be balancing orders from Nvidia and Apple in the same way they do today. Given the pitifully low investment, it's not really likely the Arizona facility will make a dent in the supply chain.
Fact is, Nvidia is well positioned to pick up the pieces even if 5nm-and-better processes go away for the next decade. The only question is whether or not people will continue to have demand for CUDA, and the answer has been "yes" since long before crypto and AI were popular. If TSMC were bombed tomorrow, Nvidia would still have demand for their product and they would still have the capacity to sell it. Their competition with AMD would be somewhat normalized and Apple would be blown into the stratosphere upon realizing that they have to contract either Samsung or Intel to stay afloat. The implications for the American economy are a little upsetting but there's nothing particularly world-ending about that scenario. It would be a sad day to be a Geekbench enthusiast but life would go on.
It could be. But I don't read anything about upcoming AI chip companies.
My prediction is there will be some strong competition for Nvidia in the coming years.
Since most people use CUDA through some other library (like Torch or TF), I think the dependence on CUDA isn't as strong as you make it seem.
What is the reasonable alternative to CUDA Fortran on AMD?
One example out of many I can point out from the CUDA ecosystem.
People use CUDA through a limited number of libraries, for example Torch and Tensorflow, so there isn't a really strong dependence on CUDA for many important applications.
Some people working in machine learning do use CUDA via Torch and Tensorflow.
Yes, most people in ML, and this field is currently on an exponential growth curve.
And a tiny percentage of why CUDA is as big as it is.
AMD ships a Fortran OpenMP compiler with GPU offloading that works pretty well
Made public 6 days ago.
https://www.phoronix.com/news/AMD-Next-Gen-Fortran-Compiler
That's the next-gen one. The older one, based on classic Flang, has been in production for quite a while.
Only if the execution follows.
But not the profits.
It does not matter. AMD is shit when it comes to low-level processing; their algos are stuck and going nowhere. Nvidia is killing it. There is a reason why Zuckerberg ordered billions in GPUs from Nvidia and not from AMD.
AMD GPUs handle all inference for Llama3 at Meta btw.
>AMD said it now expects to make more than $5 billion from sales of its Instinct data center GPUs this year due to high demand from hyperscalers like Meta and Microsoft
It's no Nvidia but Meta has ordered AMD GPUs.
This comment is somewhat more insightful:
https://news.ycombinator.com/item?id=40791010
Why is AMD shit at low-level processing? What does "their algos are stuck" mean? Having watched "the industry" for a few decades now, the appeal of NV smells heavily like the old appeal of Xeons, and Big Blue before them. The moat appears (to me, an unknowledgeable outsider) to be just cultural, not necessarily technical.
This is just silly fanboyism, there are pros and cons to each.