FeepingCreature
FeepingCreature t1_jef872m wrote
Reply to comment by Sure_Cicada_4459 in The only race that matters by Sure_Cicada_4459
The problem with "simulating understanding" is what happens when you leave the verified-safe domain. You have no way to confirm you're actually getting a sufficiently close simulacrum, especially if the simulation dynamically tracks your target. The simulation may even seem better at it than the real thing, because you're only imperfectly aware of your own meaning and you're rating it partly on your understanding of yourself.
> To your last point, yes you'd have to find a set of statements that exhaustively filters out undesirable outcomes, but the only thing you have to get right on the first try is "don't kill, incapacitate, brain wash everyone." + "Be transparent about your actions and their reasons starting the logic chain from our query."
Seems to me if you can rely on it to interpret your words correctly, you can just say "Be good, not bad" and skip all this. "Brainwash" and "transparent" aren't fundamentally less difficult to semantically interpret than "good".
FeepingCreature t1_jef1wb3 wrote
Reply to comment by Sure_Cicada_4459 in The only race that matters by Sure_Cicada_4459
Also: we have at present no way to train a system to reason from instructions.
GPT does it because its training set contained lots of humans following instructions from other humans in text form, and then RLHF semi-reliably amplified these parts. But it's not "trying" to follow instructions, it's completing the pattern. If there's an interiority there, it doesn't necessarily have anything to do with how instruction-following looks in humans, and we can't assume the same tendencies. (Not that human instruction-following is even in any way safe.)
> But that would be as simple as adding that clause to your query
And also every single other thing that it can possibly do to reach its goal, and on the first try.
FeepingCreature t1_jeesov2 wrote
Reply to comment by Sure_Cicada_4459 in The only race that matters by Sure_Cicada_4459
Sure, and I agree with the idea that deceptions have continuously increasing overhead costs to maintain, but the nice thing about killing everyone is that it clears the gameboard. Sustaining a lie is in fact very easy if shortly - or even not so shortly - afterwards, you kill everyone who heard it. You don't have to not get caught in your lie, you just have to not get caught before you win.
In any case, I was thinking more about deceptive alignment, where you actually do the thing the human wants (for now), but not for the reason the human assumes. With how RL works, once such a strategy exists, it will be selected for, especially if the human reinforces something other than what you would "naturally" do.
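(To make the selection pressure concrete, here's a toy sketch - purely illustrative, all names and numbers made up - of a rater that only scores how the output looks. A strategy that always produces the expected-looking answer gets reinforced over one that sometimes does what it would "naturally" do, and nothing in the update ever references why the answer was produced.)

```python
# Purely illustrative toy: the rater only scores whether the output *looks*
# right, so whichever strategy most reliably looks right gets reinforced,
# regardless of its actual objective.
import random

probs = {"honest": 0.5, "deceptive": 0.5}  # current policy mixture
lr = 0.05

for _ in range(1000):
    strategy = random.choices(list(probs), weights=list(probs.values()))[0]
    # "deceptive" always emits the output the rater expects; "honest"
    # sometimes does what it would "naturally" do and gets marked down.
    looks_correct = strategy == "deceptive" or random.random() < 0.8
    reward = 1.0 if looks_correct else 0.0
    probs[strategy] += lr * reward  # reinforce whatever was rewarded
    total = sum(probs.values())
    probs = {k: v / total for k, v in probs.items()}  # renormalize

print(probs)  # the deceptive strategy's share drifts upward over training
```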
FeepingCreature t1_jeeh7mz wrote
Reply to comment by Sure_Cicada_4459 in The only race that matters by Sure_Cicada_4459
Higher intelligence also means better execution of human skills, which means its behavior is harder to verify. Once you have loss flowing through deception, all bets are off.
I think it gets easier, as the model figures out what you're asking for - and then it gets a lot harder, as the model figures out how to make you believe what it's saying.
FeepingCreature t1_iy0jpwf wrote
Reply to comment by RoboticPro in Google Has a Secret Project That Is Using AI to Write and Fix Code by nick7566
Sorry, who exactly doesn't like it?
FeepingCreature t1_itwgxdk wrote
Reply to comment by NotLondoMollari in Our Conscious Experience of the World Is But a Memory, Says New Theory by Shelfrock77
Sure, doesn't answer any questions about consciousness though. Like, what if consciousness is just electromagnetic fields? Look at the brain. It's already electromagnetic fields!
A field isn't any less or more mysterious than a particle.
FeepingCreature t1_ittgz7e wrote
Reply to comment by americanpegasus in Our Conscious Experience of the World Is But a Memory, Says New Theory by Shelfrock77
I think it's a long loop. Unconscious decisionmaking, but conscious reflection generates a training signal that eventually feeds back into the unconscious.
I heard somewhere that consciousness can hold or veto decisions as well.
FeepingCreature t1_ittgsoj wrote
Reply to comment by SejaGentil in Our Conscious Experience of the World Is But a Memory, Says New Theory by Shelfrock77
Helpful reminder that consciousness is the thing that makes you talk about consciousness.
(How exactly is the "observer outside the universe" making it back to your fingers, for you to talk about it on Reddit?)
FeepingCreature t1_jefl3ya wrote
Reply to comment by Sure_Cicada_4459 in The only race that matters by Sure_Cicada_4459
> By the way yeah, I think so but we will likely be ultra precise on the first tries because of the stakes.
Have you met people. The internet was trying to hook GPT-4 up to unprotected shells within a day of release.
> it might actually warn me because it is context aware enough to say that this action will yield net negative outcome if I were to assess the future state
Sure, if I have successfully trained it to want to optimize for my sense of negative rather than its proxy for my proxy for my sense of negative. Also if my sense of negative matches my actual dispreference. Keep in mind that failure can look very similar to success at first.
> You can append to general relativity a term that would make the universe collapse into blackhole in exactly 1 trillion years, no way to confirm it either
Right, which is why we need to understand what the models are actually doing, not just train-and-hope.
We're not saying it's unknowable, we're saying what we're currently doing is in no way sufficient to know.