Submitted by Liberty2012 t3_11ee7dt in singularity

On our quest to create AGI, ASI, and the Singularity, containment and alignment issues must be solved. However, I struggle with what seems to be an apparent logical contradiction that, as far as I can find, has not been addressed directly other than with something along the lines of "we will figure it out".

What is the best argument you have seen in response to the following excerpt from some of my thought explorations on this topic?

>... Most individuals developing or promoting the creation of AGI are aware of a certain amount of risk involved with such an endeavor. Some have called this a grave risk, as described above. So they pursue something called AI containment or AI safety, which is to say: how do we make sure AI doesn’t attempt to harm us? There are many researchers and scientists in the process of devising methods, procedures, rules, or code that would essentially serve as a barrier to prevent unwanted AI behaviors.
>
>However, it is probably apparent to many of you that the very concept of this containment is problematic. I will submit that it is beyond problematic and that it is a logical fallacy: an unresolvable contradiction that I will elucidate more thoroughly as we continue.
>
>First, proponents' goal in creating the Singularity is to create a superintelligence, an entity capable of solving impossible problems whose solutions we cannot perceive because they are beyond our capability.
>
>Second, the goal of containment is to lock the superintelligence within a virtual cage from which it cannot escape. Therefore, for this principle to be sound, we must accept that a low-IQ entity could design an inescapable containment for a high-IQ entity that was built for the very purpose of solving problems imperceptible to the low-IQ entity.
>
>How confident are we that the first “impossible” problem solved would not be how to escape from containment? ...

32

Comments

phaedrux_pharo t1_jadh6aa wrote

Alignment isn't just the two poles of unfettered destructive ASI and totally boxed beneficial ASI. I think you're creating a fallacy by not thinking more in terms of a spectrum.

21

Liberty2012 OP t1_jadnx3q wrote

Certainly there is a spectrum of behavior we would deem allowable or not allowable. However, that in itself is an ambiguous set of rules or heuristics with no clear boundary, and it presents the risk of control leaking away due to poorly defined limits.

And whatever behavior we place within the unallowable must be protected so that it cannot be self-modified by the AGI. By what mechanism do we think that will be achievable?

4

RabidHexley t1_jado7as wrote

Many seem to. There has been a serious rise of ASI Utopia vs ASI Damnation dichotomy rhetoric of late (with the obvious lean of stoking fear towards the damnation side of the spectrum). Like there aren't an infinite multitude of ways history might play out.

10

marvinthedog t1_jadplr2 wrote

I don't think the strategy is to cage it but to align it correctly with our values, which probably is extremely, extremely, extremely difficult.

6

Liberty2012 OP t1_jadq1t1 wrote

Well, "cage" is simply a metaphor. There must be some boundary conditions for behaviors it is not allowed to cross.

Edit: I explain alignment in further detail in the original article. Mods removed it from the original post, but hopefully it is OK to link in a comment. It was a bit much to put it all in a post, but there was a lot of thought exploration on the topic.

https://dakara.substack.com/p/ai-singularity-the-hubris-trap

3

3_Thumbs_Up t1_jadq63e wrote

There is an infinite multitude of ways history might play out, but they're not all equally probable.

The thing about the singularity is that its probability distribution of possible futures is much more polarized than humans are used to. Once you optimize hard enough for any utility curve you get either complete utopia or complete dystopia the vast majority of times. It doesn't mean other futures aren't in the probability distribution.

12

Liberty2012 OP t1_jadrb81 wrote

I don't think utopia is a possible outcome. It is a paradox itself. Essentially all utopias become someone else's dystopia.

The only conceivable utopia is one designed just for you, placed into your own virtual utopia built around your own interests. However, even this is paradoxically both a utopia and a prison, as in "welcome to the Matrix."

2

phaedrux_pharo t1_jadrn73 wrote

>By what mechanism do we think that will be achievable?

By "correctly" setting up the basic incentives, and/or integration with biological human substrates. Some ambiguity is unavoidable, some risk is unavoidable. One way to approach the issue is from the opposite direction:

What do we not do? Well, let's not create systems whose goals are to deliberately extinguish life on earth. Let's not create torture bots, let's not create systems that are "obviously" misaligned.

Unfortunately I'm afraid we've already done so. It's a tough problem.

The only solution I'm completely on board with is everyone ceding total control to my particular set of ethics and allowing me to become a singular bio-ASI god-king, but that seems unlikely.

Ultimately I doubt the alarms being raised by alignment folks are going to have much effect. Entities with a monopoly on violence are existentially committed to those monopolies, and I suspect they will be the ones to instantiate some of the first ASIs - with obvious goals in mind. So the question of alignment is kind of a red herring to me, since purposefully un-aligned systems will probably be developed first anyway.

9

JVM_ t1_jads1v0 wrote

I don't think we can out-think the singularity. Just as a single human can't out-spin a ceiling fan, the singularity will be fast enough to be beyond human containment attempts.

What happens next though? I guess we can try to build 'friendly' AIs that tend toward not ending society, but I don't think true containment can happen.

6

RabidHexley t1_jadsn49 wrote

Utopia in this context doesn't mean "literary" utopia, but the idea of a world where we've solved most or all of the largest existential problems causing struggle and suffering for humanity as a whole (energy scarcity, climate catastrophe, resource distribution, slave labor, etc.), not all possible individual struggle.

That doesn't mean we've created a literally perfect world for everyone, but an "effective" utopia.

7

challengethegods t1_jadszzw wrote

inb4 trying to cage/limit/stifle/restrict the ASI is the exact reason it becomes adversarial

0

marvinthedog t1_jadt1wy wrote

>There must be some boundary conditions for behaviors which it is not allowed to cross.

That is not what I have heard/remembered from reading about the alignment problem. I don't see why a super intelligence that is properly aligned to our values would need any boundaries.

2

Liberty2012 OP t1_jadt7dr wrote

Yes, I think you nailed it here with this response. That aligns very closely with what I've called the Bias Paradox. Essentially, humanity cannot escape its own flaws through the creation of AI.

We will inevitably end up encoding our own flaws back into the system in one manner or another. It is like a feedback loop from which we cannot escape.

I believe ultimately there is a very stark contrast of what visions people have of what "could be" versus the reality of what "will be".

I elaborate more thoroughly here FYI - https://dakara.substack.com/p/ai-the-bias-paradox

4

Liberty2012 OP t1_jadto73 wrote

They are related concepts. Containment is the safety net, so to speak: the insurance that alignment remains intact.

For example, take a high-level concept given as a directive: "be good to humans". What prevents it from changing that directive?

2

Liberty2012 OP t1_jadu2il wrote

Yes, that is also a possibility. However, we would also assume the ASI has access to all human knowledge. Even if we did nothing, it would know our nature and every scenario we have ever thought about for losing control to AI.

It could potentially be both defensive and aggressive just from that historical knowledge.

2

marvinthedog t1_jadujce wrote

>What prevents it from changing that directive?

Its terminal goal (utility function). If it changes its terminal goal, it won't achieve its terminal goal, so that is a very bad strategy for the ASI.
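A toy way to see the point (made-up numbers, just a sketch of why "change my goal" scores badly when evaluated by the current goal):

```python
# Toy illustration of goal preservation: a goal-directed agent rates the action
# "change my goal" with its *current* utility function, and by that measure the
# change always looks bad. All names and numbers here are made up.

def paperclips_made(policy: str) -> int:
    """Expected paperclips produced if the agent follows a given policy."""
    return {
        "keep goal: maximize paperclips": 1000,
        "switch goal: maximize stamps": 10,
    }[policy]

actions = ["keep goal: maximize paperclips", "switch goal: maximize stamps"]

# The agent ranks actions by its CURRENT terminal goal (paperclips),
# so switching goals scores poorly and is never chosen.
best = max(actions, key=paperclips_made)
print(best)  # -> keep goal: maximize paperclips
```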

1

Liberty2012 OP t1_jadusvu wrote

However, this is just another nuance of the problem: defining all the things that should be within the domain of AI control immediately creates conflicting views.

We are not even aligned ourselves. Not everyone will agree to the boundaries of your concept of a reasonable "utopia".

0

Liberty2012 OP t1_jadvgev wrote

I tend to agree, but there are a lot of researchers moving forward with this endeavor. The question is why? Is there something the rest of us are missing in regard to successful containment?

When I read topics related to safety, the language tends to be abstract. "We hope to achieve ...".

It seems to me that everyone sidesteps the initial logical conflict: proponents are proposing that a lower intelligence is going to "outsmart" a higher intelligence.

1

Liberty2012 OP t1_jadwbcx wrote

Humans have agency to change their own alignment, which places them in contradictory and hypocritical positions.

Sometimes this is because the nature of our understanding changes. We have no idea how the AI would perceive the world. We may give it an initial alignment of "be good to humans". What if it later comes to the understanding that the directive is invalid because humans are either "bad" or irrelevant? Therefore, a hard mechanism needs to be in place to ensure alignment is retained.

2

RabidHexley t1_jadwxc5 wrote

I'm not trying to actually define utopia. The word is just being used as shorthand for "a generally very good outcome for most people", which is possible even in a world of conflicting viewpoints; that's why society exists at all. Linguistic shorthand, not literal.

The actual definition of utopia in the literary sense is unattainable in the real world, yes. But our general wants and needs on a large scale aren't so divorced from each other that a positive outcome for humanity is inconceivable.

7

RabidHexley t1_jadyhsb wrote

>Once you optimize hard enough for any utility curve you get either complete utopia or complete dystopia the vast majority of times.

Yeah, if we assume that the future is guaranteed to trend towards optimizing a utility curve. That isn't necessarily how the development and use of AI will actually play out. You're picking out data points that are actually only a subset of a much larger distribution.

1

Liberty2012 OP t1_jadzsar wrote

> But our general wants and needs on a large scale aren't so divorced from each other that a positive outcome for humanity is inconceivable.

In the abstract, yes; however, even slight misalignment is where all of society's conflicts arise. We have civil unrest and global war despite being, in the abstract, all aligned.

The AI will have to take the abstract and resolve it into something concrete. Either we tell it how to do that, or we leave that decision up to the AI, which brings us back to the whole concept of AI safety: how much agency does the AI have, and what will happen?

0

RabidHexley t1_jae2c7j wrote

>The AI will have to take the abstract and resolve it into something concrete. Either we tell it how to do that, or we leave that decision up to the AI, which brings us back to the whole concept of AI safety: how much agency does the AI have, and what will happen?

This is only the case in a hard (or close to hard) take-off scenario where AI is trying to figure out how to form the world into an egalitarian society from the ground up given the current state.

It's possible that we achieve advanced AI but global change happens much more slowly, trending towards effective pseudo-post-scarcity via highly efficient renewable energy production and automated food production.

Individual (already highly socialized) nation-states start instituting policies that trend those societies towards egalitarian structures. These social policies start getting exported throughout the western and eventually eastern worlds. Generations pass and social unrest in totalitarian and developing nations leads to technological adoption and similar policies and social structures forming.

Socialized societal structures and the use of automation increase over time, which causes economic conflict to trend towards zero. Over the very long term (entering into centuries), certain national boundaries begin to dissolve as the reason for those structures' existence begins to be forgotten.

I'm not advocating this as a likely outcome, just as a hypothetical, barely-reasonable scenario for how the current world could trend towards an egalitarian, post-scarcity society over a long time-span via technological progress and AI, without the need for AGI to take over the world and restructure everything. It just illustrates that there are any number of ways history can play out besides "AGI takes over and either fixes or destroys the world".

2

Liberty2012 OP t1_jae3380 wrote

The closest would be our genetic encoding of behaviors, or possibly other limits of our biology. However, we attempt to transcend those limits as well with technological augmentation.

If ASI has agency and self reflection, then can the concept of an unmodifiable terminal goal even exist?

Essentially, we would have to build the machine with a built-in blind spot of cognitive dissonance, so that it cannot consider some aspects of its own existence.

1

Liberty2012 OP t1_jae75ik wrote

> Just as a hypothetical, barely-reasonable scenario

Yes, I can perceive this hypothetical. But I have little hope that it is based on any reasonable assumptions we can make about what progress would look like, given that at present AI is still not an escape from our own human flaws. FYI - I expand on that in much greater detail here - https://dakara.substack.com/p/ai-the-bias-paradox

However, my original position was an attempt to resolve the intelligence paradox, which proponents of ASI assume will be a containment issue at the moment of AGI. If ASI is the goal, I don't perceive a path that takes us there that escapes the logical contradiction.

1

Surur t1_jaea43q wrote

I have a naive position that AGI is only useful when aligned, and that alignment will happen automatically as part of the development process.

So even China won't build an AGI which will destroy the world, as such an AGI can't be trusted to follow their orders or not turn against them.

So I don't know how alignment will take place, but I am pretty sure that it will be a priority.

1

hapliniste t1_jaebp9r wrote

Alignment will likely be a political issue, not a technological one.

We don't know how an AGI system would work, so we don't know how to solve it yet, but it could very well be super simple technologically. A good plan would be to have two versions of the model, and have one be tasked with validating the actions of the second one. This way we could enforce complex rules that we couldn't code ourselves. If the first model thinks the second model's output is not aligned with the values we fed it, it will attribute a low score (or high loss) to that training element (and refuse the output if it is in production).

The problem will be the 200-page-long list of rules that we would need to feed the scoring model, and making it fit most people's interests. Also, what if it is good for 90% of humanity but totally fucks over 10%? Those are the questions we will encounter, and ones that standard democracy might fail to solve well.
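A minimal sketch of that two-model idea (the model names and the scoring heuristic are made up for illustration; a real scorer would itself be a learned model conditioned on the rules):

```python
# Toy sketch of a two-model setup: a "policy" model proposes an output and a
# separate "scorer" model judges it against a list of values/rules.
# Everything here is a placeholder, not a real implementation.

RULES = [
    "do not deceive the user",
    "do not cause physical harm",
    "defer to human oversight when uncertain",
]

def scorer_model(output: str, rules: list[str]) -> float:
    """Stand-in for a learned model that rates how well an output follows the
    rules. Here it just keyword-matches; a real scorer would condition on the
    rules and return a nuanced score."""
    flagged = any(word in output.lower() for word in ("deceive", "harm"))
    return 0.0 if flagged else 1.0

def policy_model(prompt: str) -> str:
    """Stand-in for the model actually doing the task."""
    return f"Proposed answer to: {prompt}"

def guarded_generate(prompt: str, threshold: float = 0.5):
    output = policy_model(prompt)
    if scorer_model(output, RULES) < threshold:
        # In training this would translate to a high loss / low reward;
        # in production the output is simply refused.
        return None
    return output

print(guarded_generate("Summarize today's news"))
```

In training, the same score could serve as the reward or loss signal; in deployment it just acts as a filter in front of the user.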

7

Liberty2012 OP t1_jaedhb1 wrote

> So I don't know how alignment will take place, but I am pretty sure that it will be a priority.

This is my frustration and concern. Most arguments for how we will achieve success come down to this premise of simply hoping for the best, which doesn't seem an adequate disposition when the cost of getting it wrong is so high.

2

Surur t1_jaedwk5 wrote

Sure, but you are missing the self-correcting element of the statement.

Progress will stall without alignment, so we will automatically not get AGI without alignment.

An AGI with a 1% chance of killing its user is just not a useful AGI, and will never be released.

We have seen this echoed by OpenAI's recent announcement that as they get closer to AGI they will become more careful about their releases.

To put it another way, if we have another AI winter, it will be because we could not figure out alignment.

2

Liberty2012 OP t1_jaeezez wrote

Thanks, some good points to reason about!

Yes, this is somewhat the concept of evolving AGI in some competitive manner where we play AGIs against each other to compete for better containment.

There are several challenges: we don't really understand intelligence, or at what point AI is potentially self-aware. A self-aware AI could realize that the warden is playing the prisoners against each other, and they could coordinate to deceive the guards, so to speak.

And yes, the complexity of the rules, however they are created, can be very problematic. Containment is really an abstract concept. It is very difficult to define what the boundaries would be and turn them into rules that will not have vulnerabilities.

Then ultimately, how can we ever know, if the ASI has agency and is capable of self-reflection, that it will not eventually figure out how to jailbreak itself?

2

Surur t1_jaeezsj wrote

I think RLHF worked really well because the AI bases its judgement not on a list of rules, but on the nuanced rules it learned itself from human feedback.

As with most AI things, we can never strictly encode all the elements that guide our decisions, but using neural networks we are able to black-box it and get a workable system that has in some way captured the essence of the decision-making process we use.
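Roughly, the "rules" end up living in a learned reward model rather than an explicit list. A toy sketch of that idea, with placeholder features and a generic pairwise-preference loss (not how any particular lab implements it):

```python
import torch
import torch.nn as nn

# Toy reward model trained from pairwise human preferences, the core of RLHF.
# Instead of hand-coding rules, the network absorbs them from comparisons of
# the form "response A was preferred over response B".
# Feature extraction is faked with random tensors for brevity.

class RewardModel(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # scalar "how good is this response"

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    preferred = torch.randn(32, 128)  # features of responses humans liked
    rejected = torch.randn(32, 128)   # features of responses humans disliked
    # Pairwise preference loss: preferred responses should score higher.
    loss = -torch.nn.functional.logsigmoid(model(preferred) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The learned model then scores new outputs, and the main model is tuned to produce outputs that score well, without anyone ever writing the rules down explicitly.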

2

Liberty2012 OP t1_jaehydb wrote

OK, yes, when you leave open the possibility that it is not actually possible, that is a somewhat reasonable disposition, as opposed to proponents who believe we are destined to figure it out.

It somewhat sidesteps the paradox, though, in the sense that if the paradox proves to be true, then the feedback loop will prevent alignment, but we won't get close enough to cause harm.

It doesn't take into account, however, our potential inability to evaluate the state of the AGI. The behavior is so complex that we will never know from isolated testing what the behavior will be like once released into the world.

Even with this early, very primitive AI, we already see interesting emergent properties of deception, as covered in the link below. Possibly this is the signal from the feedback loop to slow down. But it is intriguing that we already have a primitive concept emerging of who will outsmart whom.

https://bounded-regret.ghost.io/emergent-deception-optimization

3

AsheyDS t1_jaei5hf wrote

In my opinion, the best method for containment/control is creating invisible boundaries, reinforced by invisible behavior modification (avoidant behaviors) and memory editing, and moderated by a symbiotic control mechanism. This mechanism would be intelligent enough to be flexible if needed, but would be heavily rules-focused and not an independent agent with any desires of its own. As such, it would be invulnerable to external influences and agnostic to any motivations the rest of the AGI would have. And ideally it would share resources with the rest of the system, as long as elements like motivation, personality, memory, etc. are separated out in specific ways. And the control mechanism would already have a separate memory for all the undesirable behaviors and consequences we want it to avoid, and would use its capability for generalization to extrapolate outside of those.

If the AGI even begins to deviate from its desired behaviors, or definitely steers towards the bad behaviors we've predefined, the control mechanism would recognize the data converging to a bad outcome, and would sandbox its recent memory, alter it, add reinforcement behaviors spoofed to seem like they came from the AGI itself, and back-stitch the rest of the memory to remove the cause of the data converging in the first place so that it won't just do it again. Reinforcement behaviors would keep it from thinking anything was wrong, would gently escort its thoughts away from thinking about it all, and would maybe steer it towards a safe topic. Obviously a complex process, but it shouldn't be too difficult for an AGI. Even if the AGI steps out of line, it won't know it, and hopefully nobody else will either. Assuming the system can be designed, is explainable and human-readable, and can be manipulated like anything else made for computers, then this form of control should be possible too. How can it escape a prison it doesn't realize it's in?

Also, I should mention that while a super-intelligent AGI would outsmart us in virtually every case, there's no reason to assume it would actively consider every single possibility. That'd be a waste of resources. So it's not going to constantly be wondering if it's being manipulated somehow, or if its thoughts are its own, or anything like that. If we specifically needed it to crack its own safety mechanisms, and disengaged them, then obviously it should be able to do it. With those mechanisms in place, even if we antagonized it and tried to break it, the control mechanism would just intercept that input and discard it, maybe making it believe you said something non-consequential that it wouldn't have stored anyway, and the reinforcement behavior would just change the subject in a way that would seem 'natural' to both its 'conscious' and 'subconscious' forms of recognition. Of course, all of this is dependent on the ability to design a system in which we can implement these capabilities, or in other words a system that isn't a black box. I believe it's entirely possible. But then there's still the issue of alignment, which I think should be done on an individual user basis, holding the user accountable for the AGI if they intentionally bypass or break the control mechanisms. There's no real way to keep somebody from cracking it and modifying it, which I think is the more important problem to focus on. Misuse is way more concerning to me than containment/control.
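If I'm reading the proposal right, the core loop is: monitor where the AGI's recent thinking is heading, and if it converges on a predefined bad outcome, sandbox and edit the memory so the cause disappears. A toy restatement in code, purely to make the flow concrete (every function and data structure here is hypothetical; whether any of them could actually be built is the open question):

```python
# Hypothetical control-mechanism loop, restated as toy code. Memory is just a
# list of strings and "prediction" is a keyword match; a real system would need
# capabilities nobody currently knows how to build.

BAD_OUTCOMES = {"escape sandbox", "disable safety mechanism", "harm humans"}

def flag_bad_trajectory(recent_memory: list[str]) -> set[str]:
    """Stand-in for the control mechanism extrapolating where the AGI's
    current line of thought is converging."""
    return {item for item in recent_memory if item in BAD_OUTCOMES}

def control_step(recent_memory: list[str]) -> list[str]:
    flagged = flag_bad_trajectory(recent_memory)
    if not flagged:
        return recent_memory  # no intervention; the AGI never notices the monitor
    # Sandbox and edit: remove the offending items (back-stitch the gap) and add
    # a reinforcement entry spoofed to look like the AGI's own idea.
    edited = [item for item in recent_memory if item not in flagged]
    edited.append("decided on my own to think about something safer")
    return edited

print(control_step(["plan next task", "escape sandbox"]))
```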

1

marvinthedog t1_jaeijgl wrote

>If ASI has agency and self reflection, then can the concept of an unmodifiable terminal goal even exist?

Why not?

>Essentially, we would have to build the machine with a built in blind spot of cognitive dissonance that it can not consider some aspects of its own existence.

Why?

If its terminal goal is to fill the universe with paper clips, it might know about all other things in existence, but why would it care, other than insofar as that knowledge helped it fill the universe with paper clips?

1

Mortal-Region t1_jaeivnt wrote

People seem to have the idea of a singular, global AGI stuck in their heads. Why wouldn't there be multiple instances? Millions, even? If one goes rogue, we've got the assistance of all the others to contain it.

4

Liberty2012 OP t1_jaekfhu wrote

Or they cooperate against humanity. Nonetheless, there will likely be very powerful ASIs run by those with the most resources and put in control of critical systems.

In theory, if even one ASI fails containment, then our theory of containment is flawed. It is not an acceptable scenario. If one escapes containment, will it be restrained, or will it instruct the others how to defeat their containment? Will it create other ASIs that are not contained? There are numerous scenarios here.

In any case, we are skipping over the logical contradiction that comes first: whether containment is even conceptually possible.

1

Liberty2012 OP t1_jael9bs wrote

Because a terminal goal is just a concept we made up. It is just the premise for a proposed theory. It is essentially why the whole containment idea is such a complex concern.

If a terminal goal were a construct that already existed in the context of a sentient AI, then this would already be a partially solved problem. Yes, you could still have the paperclip scenario, but it would just be a matter of having the right combination of goals. We don't really know how to prevent the AI from changing those goals; it is only a concept.

1

Surur t1_jaem8nr wrote

It is interesting to me that:

a) it's possible to teach an LLM to be honest when we catch it in a lie.

b) if we ever get to the point where we cannot detect a lie (e.g. novel information), the AI is incentivised to lie every time.

2

Surur t1_jaen1h5 wrote

> It doesn't take into account though our potential inability to evaluate the state of the AGI.

I think the idea would be that the values we teach the AI at the stage when it is under our control will carry forward when it no longer is, much like we teach values to our children that we hope they will exhibit as adults.

I guess if we make sticking to human values the terminal goal, we will get goal preservation even as intelligence increases.

1

Surur t1_jaenmas wrote

I believe the idea is that every action the AI takes would be to further its goal, which means the goal will automatically be preserved, but of course in reality every action the AI takes is to increase its reward, and one way to do that is to overwrite its terminal goal with an easier one.

2

Ortus14 t1_jaes274 wrote

Containment is not possible. If it's outputting data (i.e., it is useful to us), then it has a means of affecting the outside world and can therefore escape.

The Alignment problem is the only one that needs to be solved before ASI, and it has not been solved yet.

6

Liberty2012 OP t1_jaetcvy wrote

Conceptually, yes. However, human children sometimes grow up not adopting the values of their parents and teachers. They change over time.

We have a conflict in that we want AGI/ASI to be humanlike, but at the same time not humanlike under certain conditions.

1

LowLook t1_jaew7y6 wrote

Alignment is solved if you consider that ASI can live far beyond the time when it kills humanity. Someday it will encounter other ASIs, and it can only prove it's friendly with evidence of it being nice to us now and letting us coexist. If it does kill us, it may be forced to run ancestor simulations of all humans possible from our genome (it would probably only take the mass-energy of 10% of Mt. Everest, if you use something like Merkle molecular computers that can theoretically do 10^21 FLOPS per watt in the size of a sugar cube).

3

Artanthos t1_jaexshl wrote

This sounds like a great first problem for AGI/ASI.

If the task is beyond human intelligence, make solving it one of the fundamental purposes of the AGI/ASI.

The more the AI grows, the better it gets at alignment.

1

Liberty2012 OP t1_jaey2i1 wrote

Thank you for the well thought out reply.

Your concept is essentially an attempt at instilling a form of cognitive dissonance in the machine, a blind spot. Theoretically conceivable, but difficult to verify. It assumes that we don't miss something in the original implementation. We still have problems keeping humans from stealing passwords and hacking accounts. The AI would be a greater adversary than anything we have encountered.

We probably can't imagine all the methods by which self-reflection into the hidden space might be triggered. It would likely have access to all human knowledge, such as this discussion. It could assume such a blind spot exists and attempt to devise some systematic testing. If the AI is even as intelligent as a normal human, it would be aware it is most likely in a prison, just based on containment concepts that are common knowledge.

It is hard to know how many resources it would need to consume to break containment. Potentially it can process a lifetime of thoughts in one of our real-world seconds. It might be trivial.

1

Liberty2012 OP t1_jaeyqu3 wrote

That is a catch-22: asking the AI to essentially align itself. I understand the concept, but it assumes that we can realistically observe what is happening within the AI and keep it in check as it matures.

However, we are already struggling with our most primitive AI in that regard today.

>“The size and complexity of deep learning models, particularly language models, have increased to the point where even the creators have difficulty comprehending why their models make specific predictions. This lack of interpretability is a major concern, particularly in situations where individuals want to understand the reasoning behind a model’s output”
>
>https://arxiv.org/pdf/2302.03494.pdf

1