linearmodality t1_iybbrz3 wrote
Reply to comment by [deleted] in [r] The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable - LessWrong by visarga
> There's an extremely obvious restricted range problem here
Then you're not talking about actual correlation over the distribution of extant intelligent agents, but rather about something else. In which case: what are you talking about?
>This is literally the Orthogonality Thesis stated in plain English.
Well, no. The orthogonality thesis asserts, roughly, that an AI agent's level of intelligence and its goals can vary independently. Here, we're talking about an AI agent's intelligence and the difficulty of producing a specification for a given task that avoids undesirable side effects. "Goals" and "the difficulty of producing a specification" are hardly the same thing.
>I don't think that this solution will work.
This sort of approach is already working. On the one side we have tools like prompt engineering that automatically develop specifications of what an AI system should do, for things like zero-shot learning. On the other side we have robust control results which guarantee that undesirable outcomes are avoided, even when a learned agent is used as part of the controller. There's no reason to think that improvements in this space won't continue.
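The two-sided approach described here (a learned agent proposing actions, with a separately verified controller guaranteeing that undesirable outcomes are avoided) can be sketched as a runtime shield. This is a toy illustration with made-up names (`learned_policy`, `spec_allows`, `safe_fallback`), not any particular library's API:

```python
# Toy sketch of a learned agent wrapped in a verified controller ("shield").
# All names are hypothetical stand-ins, not a real library's API.

def spec_allows(state, action):
    # Hand-written specification: the next state must stay in [0, 10].
    return 0 <= state + action <= 10

def learned_policy(state):
    # Stand-in for an arbitrary (possibly unreliable) learned agent.
    return 3 if state < 5 else 9  # the second branch can violate the spec

def safe_fallback(state):
    # Conservative action known in advance to satisfy the specification.
    return 0

def shielded_controller(state):
    """Let the learned agent act freely, but veto any action the
    specification rejects. The safety guarantee comes from the shield,
    not from the agent, so it holds regardless of how capable the
    agent is."""
    action = learned_policy(state)
    if spec_allows(state, action):
        return action
    return safe_fallback(state)
```

The design point is that the guarantee is enforced by the checker, which is independent of the learned component it wraps.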
Even if they don't, the problem of producing task specifications does not get harder as AI intelligence increases (as noted above, the difficulty of producing a specification is independent of the agent's intelligence), which is fundamentally inconsistent with the LessWrongist viewpoint.
[deleted] t1_iybd2qu wrote
[deleted]
sdmat t1_iyc8ab6 wrote
> The problem of producing task specifications does not get harder as AI intelligence increases (as noted above, the difficulty of producing a specification is independent of the agent's intelligence), which is fundamentally inconsistent with the LessWrongist viewpoint.
I think the LW viewpoint is that for the correctness of a task specification to be genuinely independent of the AI executing it, the specification must include preferences covering the effects of every possible way of executing the task.
The claim is that for our present AIs we don't need to be anywhere near this specific, but only because they can't do very much: we can accurately predict the general range of possible actions and the kinds of side effects they might cause while executing the task, so we only need to worry about whether we get useful results.
Your view is that this is refuted by the existence of approaches that generate a task specification and check execution against it. I don't see how that follows - the LW concern is precisely that this kind of ad hoc understanding of what we actually mean by the original request is only safe for today's less capable systems.