ARGleave
ARGleave t1_iut6tjs wrote
Reply to comment by [deleted] in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
>Ok, but you're testing them as if they were continuous control policies, i.e. without search. When you say things like "[KataGo] is explicitly trained to be adversarially robust," but then you "break" only the policy network, it neither demonstrates that the entire KataGo system is vulnerable NOR does it follow that systems that are trying to produce robust continuous control policies will be vulnerable.
Thanks for the clarification! If I understand correctly, the key point is that (a) some systems are trained to produce a policy that we expect to be robust, while (b) others have a policy only as a sub-component, with the goal being that the overall system is robust. Your criticism is that we're treating a type-(b) system as if it were type-(a), making the evaluation unfair? I think this is a fair criticism, and we definitely want to try scaling our attack to exploit KataGo with more search!
However, I do think our results provide some evidence about the robustness of both type-(a) and type-(b) systems. For type-(a), we know the policy head by itself is a strong opponent in typical games, one that beats many humans on KGS (bots like NeuralZ06 play without search). This at least shows that seemingly strong policies can contain subtle vulnerabilities. It doesn't guarantee that self-play on a policy designed to work without search would share this vulnerability -- but prior work has found such vulnerabilities, albeit in less capable systems, so a pattern is emerging.
As for type-(b) vulnerability: if the policy/value network heuristics are systematically biased in certain board states, then a lot of search may be needed to overcome this. And as you say, it can be hard to know how much search is enough, although surely some amount would be sufficient to make it robust (MCTS converges in the limit of infinite samples).
As an aside, I think you're using continuous control in a different manner to me which is what confused me. I tend to think of continuous control as being about the environment: is this a robotic control task with continuous observations and actions? In your usage it seems more synonymous with "policy trained without search". But people do actually use search in continuous control sometimes (e.g. model-predictive control), and use policies without search in discrete environments (e.g. AlphaStar), although there are of course some environments better suited to one method over the other.
ARGleave t1_iut3tue wrote
Reply to comment by [deleted] in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
>AI-Five & AlphaStar are continuous systems; their policy networks are basically driving the whole show and has fewer redundancies/failsafes built in. We should expect greater robustness there!
I'm confused by how you're using continuous. My understanding is that both Dota and Starcraft have discrete action spaces. The observation space is technically discrete too (it's from a video game), but it is perhaps large enough that it's better modeled as continuous in some cases. Why do you expect greater robustness? It seems more challenging to be robust in a high-dimensional space, and if I remember correctly some human players even figured out ways to exploit OpenAI Five.
>The hope -- the whole point of the method! -- is that the policy & value become sufficiently general that it can do useful search in parts of the state space that are out-of-distribution.
This is a good point, and I'm excited to attempt scaling the attack to victims with more search, to address whether the method as a whole is robust at sufficient levels of search. My intuition is that if the policy and value network are deeply flawed, then search will only reduce the severity of the problem, not eliminate it: you can't usually search to the end of the game, so you have to rely on the value network to judge the leaf nodes. But ultimately this is still an open empirical question.
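To make concrete why leaf evaluation matters, here's a toy sketch of an AlphaZero-style simulation that backs up a learned value at truncated leaves instead of playing the game to the end. Everything here (the toy state, the `value_net` stand-in) is hypothetical and for illustration only, not KataGo's code:

```python
import math
import random

def value_net(state):
    # Stand-in for a learned evaluator. A systematic bias here
    # propagates into every backup below, no matter how much we search.
    return math.tanh(sum(state) / (len(state) or 1))

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}
        self.visits = 0
        self.value_sum = 0.0

def simulate(node, expand_fn, depth=0, max_depth=3):
    """One MCTS simulation: descend, evaluate the leaf with value_net, back up."""
    if depth == max_depth or not expand_fn(node.state):
        v = value_net(node.state)  # leaf judged by the network, not by game outcome
    else:
        if not node.children:
            for action, child_state in expand_fn(node.state):
                node.children[action] = Node(child_state)
        child = random.choice(list(node.children.values()))
        v = -simulate(child, expand_fn, depth + 1, max_depth)  # alternate players
    node.visits += 1
    node.value_sum += v
    return v

def toy_moves(state):
    # Each "move" appends +1 or -1; a stand-in for a real game tree.
    return [(a, state + [a]) for a in (+1, -1)] if len(state) < 5 else []

root = Node([])
for _ in range(100):
    simulate(root, toy_moves)
print(root.visits, round(root.value_sum / root.visits, 3))
```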
>It's plausible that "policy without search is comparable to an earlier checkpoint with search", but showing that policy-only needs more training does not show anything -- you need to show me that the future-policy-only would not be able to have learned your adversarial example. If you showed that the bad-policy with search produced data that still produced bad-policy, that would be really interesting!
I'm not sure I fully understand this. We train our adversarial policy for about 0.5% of the training time of the victim. Do you think 0.5% additional self-play training would solve this problem? I think the issue is that self-play gets stuck in a narrow region of state space and stops exploring.
Now, you could absolutely train KataGo against our adversary, repeat the attack against this hardened version of KataGo, train KataGo against the new adversary, and so on. This is no longer self-play in the conventional sense, though -- it's closer to something like policy-space response oracles (PSRO). That's an interesting direction to explore in future work, and we're considering it, but it has its own challenges: iterated best response is much more computationally expensive than the approximate best response of conventional self-play.
ARGleave t1_iusxvdj wrote
Reply to comment by KellinPelrine in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
I agree the top-right black territory is also not pass-alive. However, it gets counted as territory for black because there are no white stones in that region. If white had even a single stone there (even if it was dead as far as humans are concerned) then that wouldn't be counted as territory for black, and white would win by komi.
The scoring rules used are described in https://lightvector.github.io/KataGo/rules.html -- check "Tromp-Taylor rules" and then enable "SelfPlayOpts". Specifically, the scoring rules are:
>(if ScoringRule is Area)
>The game ends and is scored as follows:
>(if SelfPlayOpts is Enabled): Before scoring, for each color, empty all points of that color within pass-alive-territory of the opposing color.
>(if TaxRule is None): A player's score is the sum of:
>+1 for every point of their color.
>+1 for every point in empty regions bordered by their color and not by the opposing color.
>If the player is White, Komi.
>The player with the higher score wins, or the game is a draw if equal score.
So, first pass-alive regions are "emptied" of opponent stones, and then each player gets points for stones of their color and in empty regions bordered by their color.
Pass-alive is defined as:
>A black or white region R is a pass-alive-group if there does not exist any sequence of consecutive pseudolegal moves of the opposing color that results in emptying R.[2]
>A {maximal-non-black, maximal-non-white} region R is pass-alive-territory for {Black, White} if all {black, white} regions bordering it are pass-alive-groups, and all or all but one point in R is adjacent to a {black, white} pass-alive-group, respectively.[3]
It can be computed by Benson's algorithm.
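As a rough illustration of the "+1 per point" area-counting step above (ignoring the pass-alive emptying pre-pass and Benson's algorithm itself), here's a minimal sketch. The board representation and komi value are assumptions for the example, not KataGo's implementation:

```python
from collections import deque

# Board: dict mapping (row, col) -> 'B', 'W', or '.' on an n x n grid.

def neighbors(p, n):
    r, c = p
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= r + dr < n and 0 <= c + dc < n:
            yield (r + dr, c + dc)

def area_score(board, n, komi=7.5):
    score = {'B': 0.0, 'W': komi}
    for color in board.values():
        if color in ('B', 'W'):
            score[color] += 1            # +1 for every point of their color
    seen = set()
    for p, color in board.items():
        if color != '.' or p in seen:
            continue
        # Flood-fill an empty region and record which colors border it.
        region, borders, queue = [], set(), deque([p])
        seen.add(p)
        while queue:
            q = queue.popleft()
            region.append(q)
            for nb in neighbors(q, n):
                if board[nb] == '.':
                    if nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
                else:
                    borders.add(board[nb])
        if borders == {'B'}:
            score['B'] += len(region)    # empty region bordered only by black
        elif borders == {'W'}:
            score['W'] += len(region)
    return score

# Tiny 3x3 example: a black wall down the middle column, so both empty
# regions are bordered only by black.
n = 3
board = {(r, c): '.' for r in range(n) for c in range(n)}
for r in range(n):
    board[(r, 1)] = 'B'
print(area_score(board, n))  # black: 3 stones + 6 empty points; white: komi only
```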
ARGleave t1_iust83j wrote
Reply to comment by [deleted] in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
>Alignment is a serious problem and understanding the failure modes of AI systems is crucial, but it necessitates serious evaluation of the systems as they are actually used. Breaking a component in isolation and then drawing conclusions about the vulnerabilities of dramatically different systems is not the clearminded research the problem of alignment deserves. "After removing all the failsafes (and taking it o. o. d.), the system failed" is not a meaningful result
I agree there's a possibility our result might not generalize to other domains. But, we've got to start somewhere. We picked KataGo as we expected it to be one of the harder systems to break: it's trained in a zero-sum setting, so is explicitly trained to be adversarially robust, and is highly capable. We're planning on seeing if a similar attack succeeds in other games and AI systems such as Leela Chess Zero in future work.
Although I agree limited search is unrealistic, it's not unheard of -- there are bots on KGS that play without search, and still regularly beat strong players! The KataGo policy network without search really is quite strong (I certainly can't beat it!), even if that's not how the system was originally designed to be used.
Taking it o.o.d. seems fair game to me, as it's inevitable in real deployments: adversaries aren't limited to doing things you expect, and the world changes, producing distribution shift. A variant of this criticism that I find more compelling is that we assume we can train against a frozen victim. In practice, many systems might be able to learn from being exploited: fool me once, shame on you; fool me twice, shame on me, and all that.
>The "AlphaZero method" is not designed to create a policy for continuous control and it's bizarre to evaluate the resulting policies as if they were continuous policies. It's not valid (and irresponsible, imho) to extrapolate these results to *other* systems' continuous control policies.
I'm confused by this. The paragraph you quote is the only place in the paper we discuss continuous control, and it's explicitly referencing prior work that introduced a similar threat model, and studied it in a continuous control setting. Our work is asking if it's only a problem with continuous control or generalizes to other settings and more capable policies. We never claim AlphaZero produces continuous control policies.
>KataGo is using the PUCT algorithm for node selection. One criticism of PUCT is that the policy prior for a move is never fully subsumed by the evaluation of its subtree; at very low visits this kind of 'over-exploration' of a move that's returning the maximum negative reward is a known issue. Also, the original version of Alphazero (& KataGo) uses cumulative regret instead of simple regret for move selection; further improvements to muzero give a different node-selection algorithm that i believe fixes this problem with a single readout (see the muzero gumbel paper, introduction, "selecting actions in the environment").
This is an interesting point, thanks for bringing it to our attention! We'll look into evaluating our adversary against KataGo victims using these other approaches to action selection.
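For readers following along, the PUCT child-selection rule under discussion can be sketched minimally as follows. The constant `c_puct` and the toy numbers are illustrative choices, not KataGo's actual settings:

```python
import math

def puct_select(children, c_puct=1.5):
    """children: list of dicts with prior P, visit count N, and value sum W.
    Returns the index maximizing Q(a) + c * P(a) * sqrt(sum_N) / (1 + N(a))."""
    total_n = sum(ch['N'] for ch in children)
    def score(ch):
        q = ch['W'] / ch['N'] if ch['N'] else 0.0
        u = c_puct * ch['P'] * math.sqrt(total_n) / (1 + ch['N'])
        return q + u
    return max(range(len(children)), key=lambda i: score(children[i]))

# At low visit counts the prior term still dominates: a high-prior move whose
# single visit returned the worst possible value (Q = -1) is picked again
# over a low-prior move averaging Q = +0.5.
children = [
    {'P': 0.8, 'N': 1, 'W': -1.0},   # high prior, Q = -1
    {'P': 0.2, 'N': 8, 'W': 4.0},    # low prior, Q = +0.5
]
print(puct_select(children))  # -> 0
```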
In general, I'm interested in what version of these results you would find convincing? If we exploited a victim with 600 search plies (the upper end of what was used in self-play), would that be compelling? Or only at 10k-100k search plies?
ARGleave t1_iusoxjq wrote
Reply to comment by [deleted] in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
I'm talking about policy networks because in many systems that is all there is. OpenAI Five and AlphaStar both played without search, and adding search to those systems is a research problem in its own right. If a policy network cannot be robust without search, then I'd argue we need to put more effort as a community into developing methods like MuZero that might let us apply search to a broader range of settings, and less into just scaling up policies.
But granted, KataGo itself was designed with search, so (as your other comment also hinted at) might the policy network be vulnerable because it was not trained to win without search? The training is designed to distill the search process into the policy, so I don't think the policy should be uniquely vulnerable without search -- to the extent this distillation succeeds, the policy network without search should be comparable to an earlier checkpoint with search. However, I do think our attack faltering at 128 visits and beyond on the latest networks is a weakness, and one we're looking to address.
ARGleave t1_iuseu7k wrote
Reply to comment by KellinPelrine in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
Our adversary is a forked version of KataGo, and we've not changed the scoring rules at all in our fork, so I believe the scoring is the same as KataGo used during training. When our adversary wins, I believe the victims' territory is not pass-alive -- the game ends well before that. Note pass-alive here is a pretty rigorous condition: there has to be no sequence of legal moves of the opposing color that result in emptying the territory. This is a much more stringent condition than what human players would usually mean by a territory being dead or alive.
If we look at https://goattack.alignmentfund.org/adversarial-policy-katago?row=0#no_search-board then the white territory in the bottom-left is not pass-alive. There is a sequence of moves by black that would capture all the white stones, if white played sufficiently poorly (e.g. playing next to its own groups and letting black surround them). Of course, white can easily win -- and if we simply modify KataGo to prevent it from passing prematurely, it does win against this adversary.
> But it is most definitely not a small perturbation of inputs within the training distribution.
Agreed, and I don't think we ever claimed it was. This is building on the adversarial policies threat model we introduced a couple of years ago. The norm-bounded perturbation threat model is an interesting lens, but we think it's pretty limited: Gilmer et al (2018) had an interesting exploration of alternative threat models for supervised learning, and we view our work as similar in spirit to unrestricted adversarial examples.
ARGleave t1_ius7nup wrote
Reply to comment by icosaplex in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
I'm pretty sympathetic to this perspective. The concerning thing is that scaling up neural networks like GPT-3 is getting a lot more attention (and resources) than neurosymbolic approaches or other search-like algorithms that might solve this problem. Pure neural net scaling does seem like it's enough to get good average-case performance on-distribution for many tasks. So it's tempting to also believe that with enough scale, once you hit human-level performance on the average-case you'll also get human-level robustness for free, as the network learns the right representation. This isn't universally believed, but I've spoken to many scaling adherents who hold some version of this view. Part of the motivation of the paper was to show this is false, that even highly capable networks are quite vulnerable by themselves, and that something else (whether search, or a different training technique) is needed to get robustness.
ARGleave t1_iurvu97 wrote
Reply to comment by uYExkYKy in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
Good question! We used KataGo to score the games*. KataGo's notion of pass-alive territory is quite restrictive: it's territory which is guaranteed to remain alive even if one player keeps passing and allows the other player to keep playing stones in that territory. The formal definition is points 4 and 5 under the Additional Definitions heading of KataGo rules. If we look at https://goattack.alignmentfund.org/?row=0#no_search-board then the white territory in the lower-left is not pass-alive: if white passed indefinitely, then black could surround the stones and capture them.
* With one exception: the results against hard-coded baselines were scored by the baseline script itself, so that we could also evaluate other AI systems like ELF/Leela on a level playing field. We verified that this script's scoring agrees with KataGo's.
ARGleave t1_iuq747k wrote
Reply to comment by KellinPelrine in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
The KataGo paper describes the training as follows: "Self-play games used Tromp-Taylor rules modified to not require capturing stones within pass-alive territory. “Ko”, “suicide”, and “komi” rules also varied from Tromp-Taylor randomly, and some proportion of games were randomly played on smaller boards." The same paper also evaluated on Tromp-Taylor rules, so I think what we're evaluating on is both on-distribution for training and standard practice for evaluation.
ARGleave t1_iuq6uc2 wrote
Reply to comment by PC_Screen in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
One of the authors of the paper here. We agree search looks promising as a defense. It's true our current attack falters at >100 visits (win rate drops to 10%), however we're not sure if this is because search truly makes the victim robust, or if it's just making it harder to exploit (like gradient masking). We're working on strengthening our attack now and seeing if we can exploit victims with more search.
This paper isn't meant to be a critique of KataGo: we think it's a great AI system! In fact, we picked KataGo because we expected it to be one of the hardest AI systems to exploit. Enough search might well be enough to solve things for KataGo (though I expect it's going to need to be more like 1600 visits than 100), but can we use search in other settings where this kind of vulnerability might arise? Some games have even larger branching factors than Go, making search of limited use. Real-world situations often have unknown transition dynamics that you can't even search over. Ultimately we're using KataGo to study vulnerabilities that can emerge in self-play systems and narrowly superhuman AI systems more broadly, so we'd like to find solutions that work not just for KataGo.
ARGleave t1_iuq67g8 wrote
Reply to comment by ThatSpysASpy in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
I replied to this in https://www.reddit.com/r/MachineLearning/comments/yjryrd/comment/iuq5hq9/ but since this is currently the top-voted comment, I wanted to be clear that the scoring used is Tromp-Taylor, which KataGo was primarily trained with and which is the standard for evaluation in Computer Go.
Good point about the regularizer! KataGo does indeed have some functionality to encourage what it calls "friendly" passing to make it nicer for humans to play against, as well as some bonuses in favour of passing when the score is close. We disabled this and other such features in our evaluation. This does make the victim harder to exploit, but it's still possible.
I think it's reasonable to view this attack as a little contrived, but from a research perspective the interesting question is why it exists in the first place -- why didn't self-play discover and fix this vulnerability during training? If self-play cannot be trusted to find it, there could be more subtle issues lurking.
ARGleave t1_iuq5hq9 wrote
Reply to comment by ThatSpysASpy in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
One of the authors here! Impressed people picked this paper up so quickly. To clarify, KataGo does indeed take the rules as an input. KataGo was trained with Tromp-Taylor like rules, with some randomization. From section 2 of the KataGo paper:
>"Self-play games used Tromp-Taylor rules [21] modified to not require capturing stones within pass-alive territory. “Ko”, “suicide”, and “komi” rules also varied from Tromp-Taylor randomly, and some proportion of games were randomly played on smaller boards."
KataGo actually supports an impressive variety of different rules. We always used Tromp-Taylor in evaluation, in keeping with KataGo's evaluation versus ELF and Leela Zero (section 5.1 of above paper) and work in Computer Go in general.
I get these games might look a bit artificial to Go players, since humans don't usually play Tromp-Taylor. But we view our contribution not as a novel insight into Go (it's really not), but as a result about the robustness of AI systems. KataGo was trained on Tromp-Taylor, so it shouldn't be exploitable under Tromp-Taylor: but it is.
ARGleave t1_iutmvdj wrote
Reply to comment by KellinPelrine in [N] Adversarial Policies Beat Professional-Level Go AIs by xutw21
>Or if its estimated value is off from what it should be. Perhaps for some reason it learns to play on the edge, so to speak, by throwing parts of its territory away when it doesn't need it to still win, and that leads to the lack of robustness here where it throws away territory it really does need.
That's quite possible -- although it learns to predict the score as an auxiliary head, the value function being optimized is the predicted win rate, so if it thinks it's far ahead on score it will happily sacrifice some points for what it thinks is a surer win. Notably, the victim's value function (predicted win rate) is usually >99.9% even on the penultimate move, where it passes and effectively throws the game.