MetaAI_Official OP t1_izfet1n wrote
Reply to comment by MetaAI_Official in [D] We're the Meta AI research team behind CICERO, the first AI agent to achieve human-level performance in the game Diplomacy. We’ll be answering your questions on December 8th starting at 10am PT. Ask us anything! by MetaAI_Official
One of our models trained for several days, and at certain times of the day (but not every day) training speeds would drop dramatically and certain machines became unstable. After a lot of investigation, it turned out that the datacenter cooling system was malfunctioning, and around mid-day on particularly hot days, GPU failure rates would skyrocket. For the rest of the model training run, we had a weather forecast bookmarked to look out for especially hot days! -JG
Liorogamer t1_izfn24a wrote
Love this story!! 😂 You have great investigation skills