linearmodality t1_iyb8fij wrote on November 30, 2022 at 2:39 AM

Reply to comment by [deleted] in [r] The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable - LessWrong by visarga

Well, intelligence is correlated with willingness to do what people want. This is very straightforward to observe in natural intelligences. The most intelligent beings (adult humans) are the most willing to do what people want. This is also presently true for existing AI agents, if being "willing" even makes sense for such agents: the ones that possess better problem-solving abilities are more "willing" (because they are more able) to do what people want. This is so clearly the case that I suspect you mean something other than "correlated" here.

>It's fucking hard to specify what we want AI systems to do in ways that avoid undesirable side effects. Everyone agrees on this with respect to current AI. The only remaining question is whether we should expect it to become easier or harder to control machine intelligences as they become more sophisticated.

Well, that's the wrong question. Yes, it's hard to specify what we want a system to do in a way that avoids side effects. However, this hardness is a property of the specification, not of the learned model itself. It doesn't get harder or easier as the model becomes more accurate, because it's independent of the model.

>Do you personally, really and honestly, believe that it's so obvious that control will get easier as intelligence gets greater

Certainly it will get easier to produce specifications of what we want an AI system to do in a way that avoids undesirable side effects, because we can get a sufficiently intelligent AI to write the specifications and furnish us with a proof of safety (that is, a proof that the specification guarantees we avoid the undesirable side effects). "Control" is a more general word, though, and you'll have to nail down exactly what it means before we can evaluate whether we should expect it to get easier or harder over time.

>that you'd label people who worry otherwise as cultists?

Oh, LessWrongers aren't cultists because they worry otherwise. There are lots of perfectly reasonable non-cultists who worry otherwise, like Stuart Russell.

[deleted] t1_iyb99qw wrote on November 30, 2022 at 2:46 AM

[deleted]

linearmodality t1_iybbrz3 wrote on November 30, 2022 at 3:05 AM

> There's an extremely obvious restricted range problem here

Then you're not talking about actual correlation over the distribution of actually extant intelligent agents, but rather about something else. In which case: what are you talking about?

>This is literally the Orthogonality Thesis stated in plain English.

Well, no. The orthogonality thesis asserts, roughly, that an AI agent's intelligence and goals are somehow orthogonal. Here, we're talking about an AI agent's intelligence and the difficulty of producing a specification for a given task that avoids undesirable side effects. "Goals" and "the difficulty of producing a specification" are hardly the same thing.

>I don't think that this solution will work.

This sort of approach is already working. On the one side we have tools like prompt engineering that automatically develop specifications of what an AI system should do, for things like zero-shot learning. On the other side we have robust control results which guarantee that undesirable outcomes are avoided, even when a learned agent is used as part of the controller. There's no reason to think that improvements in this space won't continue.

Even if they don't, the problem of producing task specifications does not get worse with AI intelligence (because as we've already seen, the difficulty of producing a specification is independent) which is fundamentally inconsistent with the LessWrongist viewpoint.

[deleted] t1_iybd2qu wrote on November 30, 2022 at 3:16 AM

[deleted]

sdmat t1_iyc8ab6 wrote on November 30, 2022 at 8:58 AM

> The problem of producing task specifications does not get worse with AI intelligence (because as we've already seen, the difficulty of producing a specification is independent) which is fundamentally inconsistent with the LessWrongist viewpoint.

I think the LW viewpoint is that, for the correctness of a task specification to be genuinely independent of the AI, it is necessary to include preferences that cover the effects of all possible ways to execute the task.

The claim is that for our present AIs we don't need to be anywhere near this specific only because they can't do very much: we can accurately predict the general range of possible actions and the kinds of side effects they might cause in executing the task, so we only need to worry about whether we get useful results.

Your view is that this is refuted by the existence of approaches that generate a task specification and check execution against the specification. I don't see how that follows - the LW concern is precisely that this kind of ad-hoc understanding of what we actually mean by the original request is only safe for today's less capable systems.