rlvsdlvsml t1_j3n2it2 wrote
Reply to comment by deephugs in [D] Deep Learning Training Server by joossss
If you use Ray you can set up a GPU cluster in less than 30 minutes.
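A minimal sketch of what that looks like, assuming you have a head node and worker nodes you can SSH into (the addresses, port, and num_gpus values are placeholders):

```python
# On the head node (shell):
#   ray start --head --port=6379
# On each worker node (shell):
#   ray start --address=<HEAD_NODE_IP>:6379
# Then from any node in the cluster, submit GPU work:

import ray

# Connect to the running cluster started above.
ray.init(address="auto")

@ray.remote(num_gpus=1)  # schedule this task on a node with a free GPU
def train_shard(shard_id: int) -> str:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return f"shard {shard_id} trained on {device}"

# Fan eight training tasks out across the cluster's GPUs.
print(ray.get([train_shard.remote(i) for i in range(8)]))
```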
rlvsdlvsml t1_j334v18 wrote
Reply to comment by ZaZaMood in [News] AMD Instinct MI300 APU for AI and HPC announced by samobon
ROCm has been failing its users for the past 3 years, though.
rlvsdlvsml t1_j17frtb wrote
Reply to [D] Using "duplicates" during training? by DreamyPen
Ugh, if you have 5-50 test cases you need statistics, not ML, and you absolutely should not be using duplicates. You should probably use a classic statistical model that accounts for groups, like a GLM or GLMM.
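For illustration, a minimal sketch of the grouped-model idea using statsmodels; the file name and the y, x, and subject column names are hypothetical:

```python
# Mixed-effects model: fixed effect for x, random intercept per subject,
# so repeated/duplicate measurements within a subject are not treated
# as independent observations.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("measurements.csv")  # hypothetical data with y, x, subject columns

model = smf.mixedlm("y ~ x", data=df, groups=df["subject"])
result = model.fit()
print(result.summary())
```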
rlvsdlvsml t1_j15vzxs wrote
So if you used ColossalAI you could run it locally on a reasonable budget, around $2-3k. Without that, it depends on how much you are using it. If you use it a lot, you can do a home setup for anywhere from $32-64k. The cheapest 8-GPU NVIDIA home server is around $12k, and you need 4x that to reach ~500 GB of VRAM (roughly $32k new, with 16 GB GPUs at ~$1k each). Buying everything used, maybe you could get down to $8k (this assumes you could do 16 servers with 4 GPUs each, all used). https://alpa.ai/tutorials/opt_serving.html
TL;DR: the largest OPT is going to require some model pipelining to keep costs down.
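To illustrate the pipelining/sharding point (this is not the Alpa setup from the link above), here's a rough sketch using Hugging Face Accelerate's device_map to split an OPT checkpoint across however many GPUs a box has; the model size and per-GPU memory cap are placeholders:

```python
# Shard a large OPT checkpoint across multiple GPUs (pipeline-style layer
# placement) instead of needing one GPU big enough to hold the whole model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-30b"  # placeholder; pick the size your total VRAM allows

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # let accelerate place layers across available GPUs
    torch_dtype=torch.float16,  # halve memory vs fp32
    max_memory={i: "15GiB" for i in range(torch.cuda.device_count())},  # per-GPU cap
)

inputs = tokenizer("Deep learning servers are", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```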
rlvsdlvsml t1_iri172p wrote
Reply to comment by master3243 in [D] AlphaTensor Explained (Video Walkthrough) by ykilcher
It's misleading, though, because it has the same time complexity as current approaches, just optimized more for GPU kernel compilation. Arguably the end result is just using a known algorithm to find a more GPU-compiler-friendly approach, since the solution space started with all the operations used in the current approaches.
rlvsdlvsml t1_j3nd87h wrote
Reply to comment by deephugs in [D] Deep Learning Training Server by joossss
I have always felt like the network/security side and integration with internal IT systems were worse than the physical maintenance. People should expect to invest time in integrating with an on-prem data center environment or in physical maintenance. I think small teams benefit more from a small GPU cluster with a fixed budget than from large cloud GPU training costs. Mid-to-large companies do better with cloud than on-prem because they can have better separation of environments, but it costs more.