Search

8 results for www.alignmentforum.org:

currentscurrents OP t1_j2hdsvv wrote

Someone else posted [this example](https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking), which is kind of what I was interested in. They trained a neural network to do a toy problem, addition mod 113, and then were able

6

currentscurrents t1_j2uwlrh wrote

think interpretability will help us build better models too. For example, in [this paper](https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking) they deeply analyzed a model trained to do a toy problem - addition `mod 113`. They found that

34
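For reference, the toy problem both comments point to is learning `(a + b) mod 113` from examples. The linked paper trains a one-layer transformer on this task; the snippet below is only a minimal sketch of the same dataset with a small MLP (model size, train fraction, and optimizer settings here are illustrative assumptions, not the paper's exact setup).

```python
# Minimal sketch of the "addition mod 113" toy task (not the paper's exact setup,
# which uses a one-layer transformer): build the full dataset and train a small MLP.
import torch
import torch.nn as nn

P = 113  # modulus used in the grokking toy problem

# All (a, b) pairs and their labels (a + b) mod P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))  # shape (P*P, 2)
labels = (pairs[:, 0] + pairs[:, 1]) % P

# One-hot encode both operands and concatenate them as the input vector
inputs = torch.cat(
    [nn.functional.one_hot(pairs[:, 0], P), nn.functional.one_hot(pairs[:, 1], P)],
    dim=1,
).float()

# Random train/test split; a small train fraction plus weight decay is the
# regime where grokking (delayed generalization) is typically observed
perm = torch.randperm(P * P)
cut = int(0.3 * P * P)
train_idx, test_idx = perm[:cut], perm[cut:]

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(inputs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            acc = (model(inputs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(f"step {step}: train loss {loss.item():.3f}, test acc {acc:.3f}")
```

The point of the analysis in the linked post is then to reverse-engineer what such a trained network computes internally, not just to fit the task.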

DukkyDrake t1_j8fvyr5 wrote

mostly in hand. Here are some more informed comments regarding alignment concerns and [CAIS](https://www.alignmentforum.org/posts/HvNAmkXPTSoA4dvzv/comments-on-cais), which is what I think we'll end up with by default at the turn of the decade

3

SchmidhuberDidIt OP t1_j9rqdje wrote

Thanks, I actually read [this](https://www.alignmentforum.org/posts/CoZhXrhpQxpy9xw9y/where-i-agree-and-disagree-with-eliezer) today. He and Richard Ngo are the names I've come across for researchers who've deeply thought about alignment and hold views grounded in the literature

9

mano-vijnana t1_j9s5zl4 wrote

they don't see doom as inevitable. This is the sort of scenario Christiano worries about: [https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like](https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like) And this is Ngo's overview of the topic: [https://arxiv.org/abs/2209.00626](https://arxiv.org/abs/2209.00626)

15