FeepingCreature

FeepingCreature t1_jefl3ya wrote

> By the way yeah, I think so but we will likely be ultra precise on the first tries because of the stakes.

Have you met people? The internet was trying to hook GPT-4 up to unprotected shells within a day of release.

> it might actually warn me because it is context aware enough to say that this action will yield net negative outcome if I were to assess the future state

Sure, if I have successfully trained it to want to optimize for my sense of negative rather than its proxy for my proxy for my sense of negative. Also, only if my sense of negative matches my actual dispreference. Keep in mind that failure can look very similar to success at first.

> You can append to general relativity a term that would make the universe collapse into blackhole in exactly 1 trillion years, no way to confirm it either

Right, which is why we need to understand what the models are actually doing, not just train-and-hope.

We're not saying it's unknowable, we're saying what we're currently doing is in no way sufficient to know.

1

FeepingCreature t1_jef872m wrote

The problem with "simulating understanding" is what happens when you leave the verified-safe domain. You have no way to confirm you're actually getting a sufficiently close simulacrum, especially if the simulation dynamically tracks your target. The simulation may even be better at it than the real thing, because you're also imperfectly aware of your own meaning, but you're rating it partially on your understanding of yourself.
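As a toy illustration (a sketch with made-up functions `intended` and `learned`, not a claim about how any real model behaves): two behaviors can agree on every input you were able to check and still diverge right outside that range.

```python
def intended(x: int) -> int:
    """What you actually meant (hypothetical stand-in)."""
    return 2 * x


def learned(x: int) -> int:
    """Hypothetical learned behavior: identical on checked inputs, different beyond them."""
    return 2 * x if x < 100 else -x


# The only inputs we were able to verify.
verified_domain = range(100)

# Every check passes inside the verified domain...
agrees_in_domain = all(intended(x) == learned(x) for x in verified_domain)

# ...but the first input outside it already disagrees.
first_divergence = next(x for x in range(1000) if intended(x) != learned(x))

print(agrees_in_domain, first_divergence)
```

No amount of testing inside `verified_domain` distinguishes the two; you would have to look at how `learned` is actually computed.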

> To your last point, yes you'd have to find a set of statements that exhaustively filters out undesirable outcomes, but the only thing you have to get right on the first try is "don't kill, incapacitate, brain wash everyone." + "Be transparent about your actions and their reasons starting the logic chain from our query."

Seems to me that if you can rely on it to interpret your words correctly, you can just say "Be good, not bad" and skip all this. "Brainwash" and "transparent" aren't fundamentally easier to interpret semantically than "good".

2

FeepingCreature t1_jef1wb3 wrote

Also: at present we have no way to train a system to reason from instructions.

GPT does it because its training set contained lots of humans following instructions from other humans in text form, and then RLHF semi-reliably amplified these parts. But it's not "trying" to follow instructions, it's completing the pattern. If there's an interiority there, it doesn't necessarily have anything to do with how instruction-following looks in humans, and we can't assume the same tendencies. (Not that human instruction-following is even in any way safe.)

> But that would be as simple as adding that clause to your query

And also every single other thing that it can possibly do to reach its goal, and on the first try.

1

FeepingCreature t1_jeesov2 wrote

Sure, and I agree with the idea that deceptions have continuously increasing overhead costs to maintain, but the nice thing about killing everyone is that it clears the gameboard. Sustaining a lie is in fact very easy if shortly - or even not so shortly - afterwards, you kill everyone who heard it. You don't have to not get caught in your lie, you just have to not get caught before you win.

In any case, I was thinking more about deceptive alignment, where you actually do the thing the human wants (for now), but not for the reason the human assumes. With how RL works, once such a strategy exists, it will be selected for, especially if the human reinforces something other than what you would "naturally" do.

1

FeepingCreature t1_jeeh7mz wrote

Higher intelligence also means better execution of human skills, which means its output is harder to verify. Once you have loss flowing through deception, all bets are off.

I think it gets easier, as the model figures out what you're asking for - and then it gets a lot harder, as the model figures out how to make you believe what it's saying.

−3