FeepingCreature t1_jefl3ya wrote on March 31, 2023 at 5:27 PM

Reply to comment by Sure_Cicada_4459 in The only race that matters by Sure_Cicada_4459

> By the way yeah, I think so but we will likely be ultra precise on the first tries because of the stakes.

Have you met people? The internet was trying to hook GPT-4 up to unprotected shells within a day of release.

> it might actually warn me because it is context aware enough to say that this action will yield net negative outcome if I were to assess the future state

Sure, if I have successfully trained it to want to optimize for my sense of negative rather than its proxy for my proxy for my sense of negative. And also if my sense of negative matches my actual dispreference. Keep in mind that failure can look very similar to success at first.

> You can append to general relativity a term that would make the universe collapse into blackhole in exactly 1 trillion years, no way to confirm it either

Right, which is why we need to understand what the models are actually doing, not just train-and-hope.

We're not saying it's unknowable, we're saying what we're currently doing is in no way sufficient to know.

Sure_Cicada_4459 OP t1_jefuu0z wrote on March 31, 2023 at 6:31 PM

-Yes, but GPT-4 wasn't made public until they had done extensive red teaming. They looked at all the worst cases before letting it out. Not that GPT-4 can't cause any damage by itself, just not the kind ppl are freaked out about.

-That is a given with the aforementioned arguments: ASI assumes superhuman ability on any task and metric. I really think that if GPT-5 shows this same trend, that ease of alignment scales with intelligence, people should seriously update their p(doom).

-My argument boils down to this: the standard of sufficiency can only be satisfied to the degree that one can't observe failure modes anymore. You can't satisfy it to arbitrary precision, just like you can't observe anything smaller than the Planck length. There is a finite resolution to this problem, whether it is limited by human cognition or by the infinitely many substructures one could imagine. We obviously need more interpretability research, and there are some recent trends like Reflexion, ILF and so on that will over the long term yield more insight into the behaviour of these systems, since you can work with "thoughts" in text form instead of inscrutable matrices. There will likely be some form of cognitive structures inspired by the human brain, which will look more like our intuitive symbolic computations and allow us to measure these failure modes better. Misalignments at the lower level could still be possible ofc, but that doesn't say anything about the system as a whole; it could be load-bearing in some way, for example. That's why I think the only way one can approach this is empirically, and AI is largely an empirical science, let's be real.