ReasonableObjection
ReasonableObjection t1_jedzrt7 wrote
Reply to comment by dansmonrer in [D] AI Explainability and Alignment through Natural Language Internal Interfaces by jackfaker
Thank you for your detailed response. So, to be clear, you are saying that things like emergent goals in independent agents, or those agents having convergent instrumental goals, are made up or not a problem? Do you have any resources that describe intelligence or solving the alignment problem in ways that are not dangerous? I'm aware of some research that looks promising, but I'm curious if you have others.
ReasonableObjection t1_jedg8ki wrote
Reply to comment by grotundeek_apocolyps in [D] AI Explainability and Alignment through Natural Language Internal Interfaces by jackfaker
Could you point me to some fundamental research around intelligence in general (forget the math or the computers)? Because we have already demonstrated that it does not matter whether intelligence emerges biologically or artificially, or whether it is coded in silicon or some disgusting wetware... what matters are the emergent behaviors that result from it.
That's the part I don't think people understand: you can remove the computers, remove the humans, remove the coding, and just think about how an intelligent agent would behave in any environment, and you arrive at the same dangerous conclusions. We don't currently know how to solve them.
You observe these behaviors in any intelligence; after all, anybody can argue that not everything humans do is beneficial to what Mother Nature programmed us to do (make more babies).
Again, this is not about math or coding... we just haven't answered some basic questions...
For example, can you give me a definition of intelligence under which an inferior generally intelligent agent (biological or not) would be able to control a superior one over a long enough timeline? Because all of our current definitions lead us to conclude the answer is no.
Also, if you have done any looking into this, you would realize that even if we could solve these problems, we currently lack the ability to code the solutions into the models to make sure they are safe.
I'm not trying to overwhelm you with doomer arguments... I'm genuinely curious and searching for some opposing views that are actually researched and thought out, versus some hand-waving about how we will fix it in prod, or "haha, you're so dumb because you think Skynet is coming" (this tech will be able to kill us long before it is as cool as Skynet). I'm asking for evidence because the people who have actually thought about this for 30+ years still haven't been able to solve these very basic problems, which are not math or computer-coding problems.
Any serious researcher who wants to continue despite the danger assumes we will solve these problems before we run out of runway, not that we have already solved any of them...
Based on the history of human ingenuity, I would absolutely bet on it solving these problems given enough time... the danger is that we will run out of time, not that we can't solve the problem... and unfortunately, with this particular problem, we don't get a do-over like we do with others...
ReasonableObjection t1_jed8z42 wrote
Reply to comment by grotundeek_apocolyps in [D] AI Explainability and Alignment through Natural Language Internal Interfaces by jackfaker
The one part where you are wrong is about an "AI agent harming humans despite having been designed not to do so." We don't currently know how to build a model that does not devolve into harming humans, even when we are trying to design it not to do so... Keep in mind that a lot of the academic and theoretical concerns that have been discussed over the last 30 years, and that sounded like science fiction, have become very real in the last 6 months (ahead of most serious researchers' timelines and assumed degree of difficulty) and are currently being demonstrated by existing models like ChatGPT.
I don't understand how it is a made-up concern when all the ways we have to train these models lead to these negative end-states and we don't currently have a solution to the problem... and none of this is a surprise, considering this is exactly what we observe happening in the real world when it comes to emergent intelligent behavior (artificial or biological).
I also don't think a lot of people understand that these are not "coding" problems... we cannot solve the very basic problems that arise from intelligence, let alone code solutions into a model.
Even if we could solve these problems (and I'm not one to bet against human ingenuity over a long enough timeline), it is important to understand that we can't currently code the solutions into our models. We don't code these models, we train them, and we can only observe their external output and alignment and infer that everything is fine. We have no solution for verifying even that... and we haven't even gotten to the bigger problem: we have no idea what is going on inside to make them spit out those outputs and, even more importantly, no way to probe their inner alignment (a whole other problem).
I agree with you that until we develop an AGI that is superior to a human agent, none of that matters, but the danger is that we don't understand how these models work any better than we understand how a human brain works, and by definition we won't know when the point of no return has been crossed because of how we design these things...
The danger is running out of runway before somebody accidentally crosses the threshold... If that happens, the whole "bad actor uses AI to do bad things" will be the least of our worries. I would argue we already live in that world if you look at things like social media, the Facebook newsfeed, etc...
It is really important for people to understand that it can get a lot worse... way worse than they can imagine, because by definition we humans would not be able to imagine it (we would not be as intelligent).
Edit - one more note on the only part where I would argue you are wrong... I absolutely agree with you that less-than-AGI capabilities can, and unfortunately will, do huge amounts of damage before an AGI ever becomes a threat... hell, if we're lucky we will use those to kill ourselves before an AGI does, because it can be worse if an AGI does it...
ReasonableObjection t1_jed3b66 wrote
Reply to comment by grotundeek_apocolyps in [D] AI Explainability and Alignment through Natural Language Internal Interfaces by jackfaker
What do you mean?
ReasonableObjection t1_jefpe2n wrote
Reply to comment by grotundeek_apocolyps in [D] AI Explainability and Alignment through Natural Language Internal Interfaces by jackfaker
Thank you so much for the thoughtful reply!
Will read into these and may reach out to you with other questions.
Edit - as far as how I'm feeling... at the moment, just curious; I've been asking lots of questions about this over the last few days and reading any resources people are kind enough to share :-)