Comments


lostmsu t1_j4lrt8a wrote

The performance/$ characteristic needs an adjustment based on longevity * utilization * electricity cost. Assuming you are going to use the card for 5 years at full load, that's $1000-$1500 in electricity at roughly $1 per year per 1 W of constant use (at 12c/kWh). This would take care of the laughable notion that the Titan Xp is worth anything, and sort cards much closer to their market positioning.
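A minimal sketch of that rule of thumb (the $0.12/kWh rate, 5-year horizon, and example wattage are assumptions from this comment, not from the article):

```python
HOURS_PER_YEAR = 24 * 365  # 8760 h

def yearly_cost_per_watt(price_per_kwh: float = 0.12) -> float:
    # 1 W drawn continuously for a year is 8.76 kWh
    return 1e-3 * HOURS_PER_YEAR * price_per_kwh

def five_year_full_load_cost(tdp_watts: float, price_per_kwh: float = 0.12) -> float:
    return 5 * tdp_watts * yearly_cost_per_watt(price_per_kwh)

print(yearly_cost_per_watt())          # ~$1.05 per year per watt of constant draw
print(five_year_full_load_cost(250))   # ~$1314 over 5 years for a 250 W card at full load
```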

65

timdettmers t1_j4lvr7i wrote

I like this idea! I already factored in the fixed costs of building a desktop computer, but electricity is also an important part of the overall cost, especially if you compare it to cloud options.

I am currently gathering feedback to update the post later. I think it should be quick to create a chart based on this data and post an update later today.

The main problem in estimating cost is getting a good number for the utilization of GPUs for the average user. For PhD students, the number was about 15% utilization (fully using a GPU 15% of the total time). This means, with 60 watts idle and 350 watts at maximum for an RTX 4090: 60 W * 0.85 + 350 W * 0.15 = 103.5 W. That is about 906 kWh per year, or about $210 per year per RTX 4090 (assuming the US average of $0.23 per kWh).
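A minimal sketch of the estimate above (all numbers are the ones from this comment):

```python
HOURS_PER_YEAR = 24 * 365  # 8760 h

def yearly_gpu_electricity_cost(idle_w, max_w, utilization, price_per_kwh):
    # Utilization-weighted average power, then annual energy and cost
    avg_w = idle_w * (1 - utilization) + max_w * utilization
    kwh_per_year = avg_w * HOURS_PER_YEAR / 1000
    return avg_w, kwh_per_year, kwh_per_year * price_per_kwh

avg_w, kwh, usd = yearly_gpu_electricity_cost(60, 350, 0.15, 0.23)
# avg_w ≈ 103.5 W, kwh ≈ 906 kWh/year, usd ≈ $210/year for one RTX 4090
```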

Does that look good to you?


Edit: part of this seems to have gotten lost in editing. Oops! I re-added the missing details.

54

tripple13 t1_j4m7ykq wrote

Well, somehow I expected TD's conclusion to be "Skip the current gen, wait for newer gen"

And yet, here we are.

14

BeatLeJuce t1_j4m9pp6 wrote

Overall nice, but the article also uses some terms without ever explaining them. For example: what is an H100, and what is an A100? Somewhere in the article, it says that H100 = RTX 40 cards; somewhere else it says the A100 is an RTX 40 card. Which is which?

Also, what is TF32? It's an expression that appears in a paragraph without explanation.

0

init__27 OP t1_j4mavhs wrote

Oh wow, great to see you here as well, Tim 🙏

As a Kaggler, my usage varies extensively: if I end up in a deep learning competition, usage is usually around 60-100% for 1-2 months, I would say.

​

I know many top Kagglers who compete year-round; I would vaguely guess their utilization is the highest in percentage terms.

4

init__27 OP t1_j4mb2iy wrote

The author is hanging out and collecting feedback here. I'm sure he'll correct it in an update.

Maybe I'm too deep into the field, but if I were in the author's shoes, I would also assume that the reader knows these terms and cards.

7

init__27 OP t1_j4mb9ga wrote

If it's any help, u/tripple13, I bought two 3090s recently at a "discounted" rate ($1400, compared to the $3500 I paid when they were kings), and I'm really happy with them.

BUT OF COURSE I ALSO WISH I HAD THE LATEST & FASTEST ONES :')

7

timdettmers t1_j4mfjbw wrote

I thought about making this recommendation, but the next generation of GPUs will not be much better. You probably need to wait until about 2027 for a better GPU to come along. I think that for many, waiting 4 years for an upgrade might be too long, so I mostly recommend buying now. I think the RTX 40 cards are a pretty good investment that will last a bit longer than previous generations.

19

timdettmers t1_j4mfra6 wrote

This is good feedback. Wanted to make another pass this morning to clean references like this up, but did not have the time. Will try to be more clear about this in the next update (later today, probably).

18

Freonr2 t1_j4mvhhf wrote

A100 and H100 are data center GPUs. Very expensive, tuned for training large models. They also use on-package HBM memory instead of GDDR on the board for improved memory bandwidth.

A100 is Ampere, the same architecture as the 30xx series, but built for training, with a lot more tensor cores and less focus on CUDA cores. It is most often seen in the SXM form factor in special servers that offer substantially more NVLink bandwidth between GPUs for multi-GPU training (and the special servers the SXM cards go into also have considerable network bandwidth for clustered training). They do make PCIe versions. It does not support FP8. A typical setup is a DGX server with 8x A100. These are a few hundred grand for the whole server, even ignoring the power and network requirements, etc., needed to utilize it.

H100 is Hopper, newer than Ampere; I don't believe it has ever been made into a consumer part, but it is perhaps closer to Ada (40xx) in features than to Ampere (30xx) since it has FP8. It's basically the replacement for the A100, much like the 40xx is the replacement for the 30xx. These again often come in HGX server boxes for several hundred grand. Unsure if there is a PCIe version?

Nvidia removed NVLink from the 40xx series, but it's still technically available on 3090s. They're sort of segmenting the market here.

If they decide to release a 4090 with 48 GB (or Ada Titan or whatever branding they decide on), it could be a monster card if you only need or want a single card, but it may also be $3k+...

12

JustOneAvailableName t1_j4my6kp wrote

Great article!

You say this about sparsity:

> It does not seem so. Since the granularity of the sparse matrix needs to have 2 zero-valued elements, every 4 elements, the sparse matrices need to be quite structured.

Wouldn't a slightly more structured dropout be a perfect fit?
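For reference, a minimal sketch (my own, not from the article; the helper name is made up) of what the 2:4 granularity means in practice:

```python
import torch

def two_to_four_mask(x: torch.Tensor) -> torch.Tensor:
    # In every contiguous group of 4 values along the flattened tensor,
    # keep the 2 largest-magnitude entries and zero the other 2 --
    # the pattern Ampere's sparse tensor cores expect.
    groups = x.reshape(-1, 4)                   # assumes numel is divisible by 4
    keep = groups.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(1, keep, True)
    return mask.reshape(x.shape)

w = torch.randn(8, 16)
sparse_w = w * two_to_four_mask(w)              # 50% sparse, but block-structured
```

Picking 2 random indices per group of 4 instead of the top-2 would give exactly the kind of "slightly more structured dropout" you describe.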

3

chief167 t1_j4n7wdv wrote

A little bit dangerous, because the A100 is a beast when you look at performance/kWh. If you run a lot of heavy workloads, it's your best option, but on this chart it looks like the worst.

TCO != Purchase price

14

learn-deeply t1_j4na276 wrote

Note: the graphs comparing GPUs are not actual benchmarks but theoretical results. Nvidia likes to arbitrarily add restrictions to their non-datacenter GPUs, so it's not clear what the real-world performance is.

8

SearchAtlantis t1_j4ne6jj wrote

For what it's worth, I think 15% seems low. Having just finished an MS with deep learning in my thesis, over the course of a year I used my GPU about 25% of the time: quick, shallow tests for architecture and other changes, then running those changes at full depth for comparison.

2

asdfzzz2 t1_j4nl1tn wrote

> This means, with an average of 60 watt idle and 350 watt max for a RTX 4090

RTX 4090 "idles" (stream at background) at 10-15 watt. 4k144hz monitor might change it, but 60 watt is way too much for GPU only.

8

protocolypse t1_j4o8t8d wrote

This and the Backblaze report are among my favorite articles.

2

serge_mamian t1_j4ods2g wrote

The question is: how the fuck does one get a 4090? I am really at my wits' end; Amazon has a few at double MSRP.

2

anothererrta t1_j4pagpo wrote

If you go to all this trouble, please keep in mind that electricity prices vary a lot across the world. In some places in Europe people pay twice as much as you assumed above.

Making it clear how you arrive at your value calculation in an updated post (or even making it a dynamic calculator where people can enter their cost/kWh) would be very useful.

2

mentatf t1_j4pndf8 wrote

"legendary"..? really..?

0

royalemate357 t1_j4qdfwj wrote

Tbh I don't think it's an especially good name, but I believe the answer to your question is that it actually uses 32 bits to store a TF32 value in memory. It's just that when they pass it into the tensor cores to do matmuls, they temporarily downcast it to this 19-bit precision format.

>Dot product computation, which forms the building block for both matrix multiplies and convolutions, rounds FP32 inputs to TF32, computes the products without loss of precision, then accumulates those products into an FP32 output (Figure 1).

(from https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/)
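This is also why frameworks expose TF32 as a mode rather than a dtype. A sketch of the PyTorch toggles (these flags exist in PyTorch, but TF32 only takes effect on Ampere or newer GPUs):

```python
import torch

# Tensors stay ordinary 32-bit floats in memory; only the tensor-core matmul
# internally rounds the mantissa to 10 bits (the 19-bit TF32 format).
torch.backends.cuda.matmul.allow_tf32 = True   # allow TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True         # allow TF32 for convolutions

a = torch.randn(1024, 1024, device="cuda")     # dtype is still torch.float32
b = torch.randn(1024, 1024, device="cuda")
c = a @ b                                      # products accumulated in FP32
```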

3

numpee t1_j4whr9r wrote

Hi u/timdettmers, I had a great time reading your blog post :) I just wanted to point out something that might be worth mentioning: the issue with the 4090 (and probably the 4080 as well) is that it won't fit in servers, specifically 4U rack-mounted servers. In rack-mounted servers, the PCIe slots are placed at the bottom (facing upwards), so the GPUs stand "vertically" (PCIe connector pointing downwards). The 4090 is too tall for a 4U server, which makes it unusable (plus, 3.5 slots for a single GPU complicates things further).

3