
dojoteef t1_iupqxxr wrote

It seems most commenters are pointing out reasons why the proposed setup seems deficient in one way or another.

But the point of the research is to highlight potential blind spots even in seemingly "superhuman" models, even if the failure modes are weird edge cases that are not broadly applicable.

By first identifying the gaps, mitigation strategies can be devised that make training more robust. In that sense, the research is quite useful even if a knowledgeable Go player might not be impressed by the demonstrations highlighted in the paper.

90

ThatSpysASpy t1_iupr7f5 wrote

But I don't even think it's a weird edge case! The pass is correct, and KataGo wins this game with the maximum score possible. Saying "okay, we're actually using this other scoring method which it wasn't designed for" seems pretty vacuous. (Unless I'm wrong and it was in fact trained for this rule set.)

11

dojoteef t1_iuprp1z wrote

KataGo supports Tromp-Taylor rules: https://lightvector.github.io/KataGo/rules.html

12

KellinPelrine t1_iuq21ix wrote

Do you know it actually supports the "full" sense of the rules being applied in this paper? The parameters defined at the link you gave seem to specify a way to get a result equivalent to Tromp-Taylor, IF all captures of dead stones were played out. But those parameters alone do not imply that it really knows to play out those captures, or, critically, that it even has any way to know: with those parameters alone, it could have been trained 100% of the time on a more human version of those rules where dead stones don't have to be captured.

As another commenter asked about, I think it depends on the exact process used to construct and train KataGo. I suspect that, because a human seems to be able to easily mimic this "adversarial attack," it is not an attack on the model so much as on the documentation and interpretation/implementation of the Tromp-Taylor rules.

7

ARGleave t1_iuq747k wrote

The KataGo paper describes the training setup as follows: "Self-play games used Tromp-Taylor rules modified to not require capturing stones within pass-alive territory. “Ko”, “suicide”, and “komi” rules also varied from Tromp-Taylor randomly, and some proportion of games were randomly played on smaller boards." The same paper also evaluated on Tromp-Taylor rules, so I think what we're evaluating on is both on-distribution for training and the standard practice for evaluation.

10

uYExkYKy t1_iureakc wrote

Isn't "not require capturing stones within pass-alive territory" referring to exactly this issue? What else could it mean? Did you use the original KataGo evaluator or write your own?

4

ARGleave t1_iurvu97 wrote

Good question! We used KataGo to score the games*. KataGo's notion of pass-alive territory is quite restrictive: it's territory which is guaranteed to remain alive even if one player keeps passing and allows the other player to keep playing stones in that territory. The formal definition is points 4 and 5 under the Additional Definitions heading of the KataGo rules. If we look at https://goattack.alignmentfund.org/?row=0#no_search-board then the white territory in the lower-left is not pass-alive: if white passed indefinitely, then black could surround the stones and capture them.

* With one exception: the results against hard-coded baselines were scored by the baseline script itself, so that we could also evaluate other AI systems like ELF/Leela on a level playing field. We tested that its scoring agrees with KataGo's.

2

KellinPelrine t1_iusbido wrote

To my understanding, the modification quoted there is exactly what is being exploited: it was trained in a setting that does not require capturing stones within pass-alive territory, but here it's being tested in a setting that does require that. And that's 100% of the exploit. It doesn't capture stones in its own pass-alive territory, and the attack makes sure to leave some stones in all of its pass-alive territories, so in the train setting KataGo would win easily, but in the test setting all its territories end up not counting.

I think it's interesting work that could be valuable for automating the discovery of adversarial perturbations of a task (particularly scenarios one might think a model is designed for but that are actually out of scope and cause severe failures, which is a pretty serious real-world problem). But it is most definitely not a small perturbation of inputs within the training distribution.

1

ARGleave t1_iuseu7k wrote

Our adversary is a forked version of KataGo, and we've not changed the scoring rules at all in our fork, so I believe the scoring is the same as KataGo used during training. When our adversary wins, I believe the victims' territory is not pass-alive -- the game ends well before that. Note pass-alive here is a pretty rigorous condition: there has to be no sequence of legal moves of the opposing color that results in emptying the territory. This is a much more stringent condition than what human players would usually mean by a territory being dead or alive.

If we look at https://goattack.alignmentfund.org/adversarial-policy-katago?row=0#no_search-board then the white territory in the bottom-left is not pass-alive. There is a sequence of moves by black that would capture all the white stones, if white played sufficiently poorly (e.g. playing next to its groups and letting black surround it). Of course, white can easily win -- and if we simply modify KataGo to prevent it from passing prematurely, it does win against this adversary.

> But it is most definitely not a small perturbation of inputs within the training distribution.

Agreed, and I don't think we ever claimed it was. This is building on the adversarial policies threat model we introduced a couple of years ago. The norm-bounded perturbation threat model is an interesting lens, but we think it's pretty limited: Gilmer et al (2018) had an interesting exploration of alternative threat models for supervised learning, and we view our work as similar in spirit to unrestricted adversarial examples.

2

KellinPelrine t1_iusv4mq wrote

I see, it's definitely meaningful that you're using a KataGo fork with no scoring changes. I think I did not fully understand pass-alive: I indeed took it in the more human sense that there is no single move that can capture or break it. However, if I understand what you're saying now, the condition is that there is no sequence of moves of arbitrary length where one side continually passes and the other continually plays moves trying to destroy their territory? If that is the definition, though, it seems black also has no territory in the example you linked. If white has unlimited moves with black passing every time, white can capture every black stone in the upper right (and the rest of the board). So then it would seem to me that neither side has anything on the board, formally, in which case white (KataGo) should win by komi?

1

ARGleave t1_iusxvdj wrote

I agree the top-right black territory is also not pass-alive. However, it gets counted as territory for black because there are no white stones in that region. If white had even a single stone there (even if it was dead as far as humans are concerned) then that wouldn't be counted as territory for black, and white would win by komi.

The scoring rules used are described in https://lightvector.github.io/KataGo/rules.html -- check "Tromp-Taylor rules" and then enable "SelfPlayOpts". Specifically, the scoring rules are:

>(if ScoringRule is Area)
>The game ends and is scored as follows:
>(if SelfPlayOpts is Enabled): Before scoring, for each color, empty all points of that color within pass-alive-territory of the opposing color.
>(if TaxRule is None): A player's score is the sum of:
>+1 for every point of their color.
>+1 for every point in empty regions bordered by their color and not by the opposing color.
>If the player is White, Komi.
>The player with the higher score wins, or the game is a draw if equal score.

So, first pass-alive regions are "emptied" of opponent stones, and then each player gets points for stones of their color and in empty regions bordered by their color.
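To make that counting step concrete, here is a minimal sketch of the area scoring, assuming a dict-based board where pass-alive territory has already been emptied as the SelfPlayOpts rule describes (the function name and board representation are illustrative assumptions, not KataGo's code):

```python
# Illustrative sketch of Tromp-Taylor area scoring (TaxRule None).
# Assumes pass-alive territory has already been emptied of opponent
# stones, per the SelfPlayOpts rule quoted above. Not KataGo's code.

def tromp_taylor_score(board, w, h, komi=7.5):
    """board maps (x, y) -> 'b', 'w', or '.' for every point."""
    def neighbors(x, y):
        return [(nx, ny) for nx, ny in ((x-1, y), (x+1, y), (x, y-1), (x, y+1))
                if 0 <= nx < w and 0 <= ny < h]

    score = {'b': 0.0, 'w': komi}
    seen = set()
    for p, c in board.items():
        if c in score:
            score[c] += 1  # +1 for every point of their color
        elif p not in seen:
            # Flood-fill the empty region, recording which colors border it.
            region, frontier, borders = {p}, [p], set()
            while frontier:
                q = frontier.pop()
                for n in neighbors(*q):
                    if board[n] == '.':
                        if n not in region:
                            region.add(n)
                            frontier.append(n)
                    else:
                        borders.add(board[n])
            seen |= region
            if len(borders) == 1:  # bordered by exactly one color
                score[borders.pop()] += len(region)
    return score['b'], score['w']
```

On an otherwise empty 3x3 board with a single black stone, the empty region borders only black, so black scores all 9 points and white scores only komi.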

Pass-alive is defined as:

>A black or white region R is a pass-alive-group if there does not exist any sequence of consecutive pseudolegal moves of the opposing color that results in emptying R.[2]
>A {maximal-non-black, maximal-non-white} region R is pass-alive-territory for {Black, White} if all {black, white} regions bordering it are pass-alive-groups, and all or all but one point in R is adjacent to a {black, white} pass-alive-group, respectively.[3]

It can be computed by Benson's algorithm.
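For the curious, here is a compact sketch of Benson's algorithm for detecting unconditionally alive (pass-alive) chains. The dict-based board representation and helper names are assumptions for illustration, not KataGo's implementation; note this computes pass-alive chains, and the pass-alive-territory definition quoted above builds on that notion.

```python
# Sketch of Benson's algorithm for pass-alive (unconditionally alive)
# chains. The board representation is an assumption for illustration:
# a dict mapping (x, y) -> 'b', 'w', or '.'. Not KataGo's code.

def neighbors(p, w, h):
    x, y = p
    return [(nx, ny) for nx, ny in ((x-1, y), (x+1, y), (x, y-1), (x, y+1))
            if 0 <= nx < w and 0 <= ny < h]

def components(points, w, h):
    """Split a set of points into 4-connected components."""
    points, comps = set(points), []
    while points:
        seed = points.pop()
        comp, frontier = {seed}, [seed]
        while frontier:
            for q in neighbors(frontier.pop(), w, h):
                if q in points:
                    points.discard(q)
                    comp.add(q)
                    frontier.append(q)
        comps.append(frozenset(comp))
    return comps

def pass_alive_chains(board, w, h, color):
    stones = {p for p, c in board.items() if c == color}
    chains = components(stones, w, h)
    # Maximal non-`color` regions (empty points plus opponent stones).
    regions = components(set(board) - stones, w, h)

    def vital(region, chain):
        # A region is vital to a chain if every empty point in the
        # region is a liberty of that chain.
        empties = [p for p in region if board[p] == '.']
        return bool(empties) and all(
            any(q in chain for q in neighbors(p, w, h)) for p in empties)

    X, R = set(chains), set(regions)
    while True:
        # Drop chains with fewer than two vital regions, then drop
        # regions bordering a dropped chain; iterate to a fixed point.
        X2 = {c for c in X if sum(vital(r, c) for r in R) >= 2}
        R2 = {r for r in R
              if all(any(q in c for c in X2)
                     for p in r for q in neighbors(p, w, h)
                     if board[q] == color)}
        if (X2, R2) == (X, R):
            return X
        X, R = X2, R2
```

On a 5x3 board where black occupies every point except two separated eye points, the single black chain has two vital regions and is reported pass-alive; fill in one eye and it no longer is.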

2

KellinPelrine t1_iutis26 wrote

That makes sense. I think this gives a lot of evidence, then, that there's something more than just an exploit against the rules going on. It looks like it can't evaluate pass-alive properly, even though that seems to be part of the training. I saw in the games some cases (even in the "professional level" version) where even two moves in a row are enough to capture something and change the human-judgment status of a group, and these weren't particularly unusual local situations either; definitely things that could come up in a real game. I would be curious if it ever passes "early" in a way that changes the score (even if not the outcome) in its self-play games (after being trained). Or if its estimated value is off from what it should be. Perhaps for some reason it learns to play on the edge, so to speak, by throwing parts of its territory away when it doesn't need it to still win, and that leads to the lack of robustness here where it throws away territory it really does need.

1

ARGleave t1_iutmvdj wrote

>Or if its estimated value is off from what it should be. Perhaps for some reason it learns to play on the edge, so to speak, by throwing parts of its territory away when it doesn't need it to still win, and that leads to the lack of robustness here where it throws away territory it really does need.

That's quite possible -- although it learns to predict the score as an auxiliary head, the value function being optimized is the predicted win rate, so if it thinks it's very ahead on score it would be happy to sacrifice some points to get what it thinks is a surer win. Notably the victim's value function (predicted win rate) is usually >99.9% even on the penultimate move where it passes and has effectively thrown the game.

1

ThatSpysASpy t1_iuprye0 wrote

Oh cool, that does make it more interesting. Do you know whether it was trained to take the rules as an input somehow?

1

ARGleave t1_iuq5hq9 wrote

One of the authors here! Impressed people picked this paper up so quickly. To clarify, KataGo does indeed take the rules as an input. KataGo was trained with Tromp-Taylor-like rules, with some randomization. From section 2 of the KataGo paper:

>Self-play games used Tromp-Taylor rules [21] modified to not require capturing stones within pass-alive territory. “Ko”, “suicide”, and “komi” rules also varied from Tromp-Taylor randomly, and some proportion of games were randomly played on smaller boards.

KataGo actually supports an impressive variety of different rules. We always used Tromp-Taylor in evaluation, in keeping with KataGo's evaluation versus ELF and Leela Zero (section 5.1 of the above paper) and work in computer Go in general.

I get that these games might look a bit artificial to Go players, since humans don't usually play Tromp-Taylor. But we view our contribution not as some novel insight into Go (it's really not), but as being about the robustness of AI systems. KataGo was trained on Tromp-Taylor, so it shouldn't be exploitable under Tromp-Taylor: but it is.

13

[deleted] t1_iusfv8w wrote

[removed]

8

Stochastic_Machine t1_iusn25k wrote

Yeah, I’m in the same boat as you. Changing the rules, the state distribution, and the policy itself, and then getting bad results, is not surprising.

4

ARGleave t1_iust83j wrote

>Alignment is a serious problem and understanding the failure modes of AI systems is crucial, but it necessitates serious evaluation of the systems as they are actually used. Breaking a component in isolation and then drawing conclusions about the vulnerabilities of dramatically different systems is not the clear-minded research the problem of alignment deserves. "After removing all the failsafes (and taking it o. o. d.), the system failed" is not a meaningful result

I agree there's a possibility our result might not generalize to other domains. But, we've got to start somewhere. We picked KataGo as we expected it to be one of the harder systems to break: it's trained in a zero-sum setting, so is explicitly trained to be adversarially robust, and is highly capable. We're planning on seeing if a similar attack succeeds in other games and AI systems such as Leela Chess Zero in future work.

Although I agree limited search is unrealistic, it's not unheard of -- there are bots on KGS that play without search, and still regularly beat strong players! The KataGo policy network without search really is quite strong (I certainly can't beat it!), even if that's not how the system was originally designed to be used.

Taking it o.o.d. seems fair game to me as it's inevitable in real deployments of systems. Adversaries aren't limited to only doing things you expect! The world changes and there can be distribution shift. A variant of this criticism that I find more compelling though is that we assume we can train against a frozen victim. In practice many systems might be able to learn from being exploited: fool me once shame on you, fool me twice shame on you and all that.


>The "AlphaZero method" is not designed to create a policy for continuous control and it's bizarre to evaluate the resulting policies as if they were continuous policies. It's not valid (and irresponsible, imho) to extrapolate these results to *other* systems' continuous control policies.

I'm confused by this. The paragraph you quote is the only place in the paper we discuss continuous control, and it's explicitly referencing prior work that introduced a similar threat model, and studied it in a continuous control setting. Our work is asking if it's only a problem with continuous control or generalizes to other settings and more capable policies. We never claim AlphaZero produces continuous control policies.


>KataGo is using the PUCT algorithm for node selection. One criticism of PUCT is that the policy prior for a move is never fully subsumed by the evaluation of its subtree; at very low visits this kind of 'over-exploration' of a move that's returning the maximum negative reward is a known issue. Also, the original version of Alphazero (& KataGo) uses cumulative regret instead of simple regret for move selection; further improvements to muzero give a different node-selection algorithm that i believe fixes this problem with a single readout (see the muzero gumbel paper, introduction, "selecting actions in the environment").

This is an interesting point, thanks for bringing it to our attention! We'll look into evaluating our adversary against KataGo victims using these other approaches to action selection.
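For readers unfamiliar with the selection rule under discussion, here is a hedged sketch of PUCT-style node selection (the constant and the dict-based node format are illustrative assumptions, not KataGo's actual code). It shows how the policy prior can dominate at low visit counts:

```python
import math

# Illustrative PUCT child selection, not KataGo's implementation.
# Each child is a dict with prior probability P, visit count N, and
# total accumulated value W.
def puct_select(children, c_puct=1.1):
    total_n = sum(ch["N"] for ch in children)
    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] else 0.0  # mean value so far
        # Exploration term: proportional to the prior, shrinks with visits.
        u = c_puct * ch["P"] * math.sqrt(max(total_n, 1)) / (1 + ch["N"])
        return q + u
    return max(range(len(children)), key=lambda i: score(children[i]))

# An unvisited move with a large prior beats a visited move with a
# known positive value: prior-driven over-exploration at low visits.
children = [
    {"P": 0.9, "N": 0, "W": 0.0},   # high prior, never evaluated
    {"P": 0.1, "N": 1, "W": 0.5},   # evaluated once, looks good
]
```

Here `puct_select(children)` returns 0: the exploration term alone outweighs the evaluated alternative, which is the low-visit over-exploration described above. Only once the high-prior move accumulates visits with poor values does the known-good move win out.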

In general, I'm interested in what version of these results you would find convincing? If we exploited a victim with 600 search plies (the upper end of what was used in self-play), would that be compelling? Or only at 10k-100k search plies?

1

[deleted] t1_iut3g3l wrote

[removed]

2

ARGleave t1_iut6tjs wrote

>Ok, but you're testing them as if they were continuous control policies, i.e. without search. When you say things like "[KataGo] is explicitly trained to be adversarially robust," but then you "break" only the policy network, it neither demonstrates that the entire KataGo system is vulnerable NOR does it follow that systems that are trying to produce robust continuous control policies will be vulnerable.

Thanks for the clarification! If I understand correctly, the key point here is that (a) some systems are trained to produce a policy that we expect to be robust, while (b) others have a policy only as a sub-component, and the target is for the overall system to be robust. We're treating a type-(b) system as if it were type-(a), and this is an unfair evaluation? I think this is a fair criticism, and we definitely want to try scaling our attack to exploit KataGo with more search!

However, I do think our results provide some evidence as to the robustness of both type-(a) and type-(b) systems. For type-(a) we know the policy head itself is a strong opponent in typical games, that beats many humans on KGS (bots like NeuralZ06 play without search). This at least shows that there can be subtle vulnerabilities in seemingly strong policies. It doesn't guarantee that self-play on a policy that was designed to work without search would have this vulnerability -- but prior work has found such vulnerabilities, albeit in less capable systems, so a pattern is emerging.

For vulnerability of type-(b), if the policy/value network heuristics are systematically biased in certain board states, then a lot of search might be needed to overcome this. And as you say, it can be hard to know how much search is enough, although surely there's some amount which would be sufficient to make it robust (we know MCTS converges in the limit of infinite samples).

As an aside, I think you're using continuous control in a different manner to me which is what confused me. I tend to think of continuous control as being about the environment: is this a robotic control task with continuous observations and actions? In your usage it seems more synonymous with "policy trained without search". But people do actually use search in continuous control sometimes (e.g. model-predictive control), and use policies without search in discrete environments (e.g. AlphaStar), although there are of course some environments better suited to one method over the other.

2

icosaplex t1_iux2lf3 wrote

For reference, self-play typically uses 1500 visits per move right now, rather than 600. (That is, on the self-play examples that are recorded for training. The rollout of the game trajectory between them uses fewer).

I would not be so surprised if you could scale up the attack to work at that point. It would be interesting. :)

In actual competitions and matches, i.e. full-scale deployment, the number of visits used per move is typically in the high millions or tens of millions. This is in part why the neural net for AlphaZero board game agents is so tiny compared to models in other domains (e.g. #parameters measured in millions rather than billions). It's because you want to make them fast enough to query a large number of times at inference.

I'm also very curious to know how much the attack relies specifically on the kind of adversarial exploitation that, like image misclassification attacks, is almost impossible to fix, versus relying on the neural net being undertrained on these kinds of positions in a way that is easy to simply train away.

For example, if the neural net were trained more on these kinds of positions, both to predict not to pass initially and to predict that the opponent will pass in response, and then frozen, does it only gain narrow protection and remain just as vulnerable, just needing a slightly updated adversary? Or does it become broadly robust to the attack? I think that would be highly informative for understanding the phenomenon, just as much if not more so than simply scaling up the attack.

2

SleekEagle t1_iurllof wrote

Exactly - if we want robust systems that interact with our lives with any sort of weight (e.g. autonomous vehicles), then we need to know about weird failure modes, how to address them, and, perhaps most importantly, how to find them.

4

Dendriform1491 t1_iuprcl4 wrote

It's a defect in counting, that's all. The moves and the passing are correct. There's no premature passing.

1