
ARGleave t1_iut3tue wrote

>OpenAI Five & AlphaStar are continuous systems; their policy networks are basically driving the whole show and have fewer redundancies/failsafes built in. We should expect greater robustness there!

I'm confused by how you're using "continuous". My understanding is that both Dota and StarCraft have discrete action spaces. The observation space is technically discrete too (it's from a video game), though it may be large enough that it's better modeled as continuous in some cases. Why do you expect greater robustness? It seems more challenging to be robust in a high-dimensional space, and if I remember correctly, some human players even figured out ways to exploit OpenAI Five.

>The hope -- the whole point of the method! -- is that the policy & value become sufficiently general that it can do useful search in parts of the state space that are out-of-distribution.

This is a good point, and I'm excited to try scaling the attack to victims that use more search, to test whether the method as a whole is robust at sufficient levels of search. My intuition is that if the policy and value networks are deeply flawed, then search will only reduce the severity of the problem, not eliminate it: you can't search to the end of the game most of the time, so you have to rely on the value network to judge the leaf nodes. But ultimately this is still an open empirical question.
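To make that last point concrete, here's a rough sketch of how leaf evaluation works in an AlphaZero/KataGo-style search. Names like `policy_value_net`, `is_terminal`, and `play` are just placeholders, not KataGo's actual API:

```python
class Node:
    """One node in an AlphaZero-style MCTS tree (illustrative only)."""
    def __init__(self, state, prior):
        self.state = state
        self.prior = prior        # P(s, a) from the policy head
        self.children = {}        # move -> Node
        self.visit_count = 0
        self.value_sum = 0.0

    def q(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def evaluate_leaf(node, policy_value_net):
    """Expand a leaf and return a value estimate for backup.

    Only terminal positions give ground-truth outcomes; everywhere else the
    search has to trust the value head, so a systematically wrong value
    network biases every backup, no matter how many playouts sit above it.
    """
    if node.state.is_terminal():
        return node.state.outcome()            # true result (rare in mid-game)
    priors, value = policy_value_net(node.state)
    for move, p in priors.items():
        node.children[move] = Node(node.state.play(move), p)
    return value                               # the network's belief

def backup(path, value):
    """Propagate the leaf estimate up the visited path, flipping sign per player."""
    for node in reversed(path):
        node.visit_count += 1
        node.value_sum += value
        value = -value
```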

>It's plausible that "policy without search is comparable to an earlier checkpoint with search", but showing that policy-only needs more training does not show anything -- you need to show me that the future-policy-only would not be able to have learned your adversarial example. If you showed that the bad-policy with search produced data that still produced bad-policy, that would be really interesting!

I'm not sure I fully understand this. We train our adversarial policy for about 0.5% of the training time of the victim. Do you think 0.5% additional self-play training would solve this problem? I think the issue is that self-play gets stuck in a narrow region of state space and stops exploring.
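A toy sketch of why that happens: every training position comes from games the current policy plays against itself, so states it never chooses to visit contribute no training signal. Names like `env` and `policy.move_probabilities` are placeholders, not the actual KataGo pipeline:

```python
import random

def self_play_game(policy, env, temperature=1.0):
    """Play one game with the current policy on both sides; record visited states."""
    state, trajectory = env.reset(), []
    while not env.is_terminal(state):
        probs = policy.move_probabilities(state, temperature)   # dict: move -> prob
        move = random.choices(list(probs), weights=list(probs.values()))[0]
        trajectory.append((state, move))
        state = env.play(state, move)
    return trajectory, env.outcome(state)

def generate_training_data(policy, env, num_games):
    data = []
    for _ in range(num_games):
        trajectory, outcome = self_play_game(policy, env)
        # Only positions the current policy actually reaches get labeled here;
        # off-distribution states (e.g. ones an adversary steers the game into)
        # never show up, however long training runs.
        data.extend((state, move, outcome) for state, move in trajectory)
    return data
```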

Now you could absolutely train KataGo against our adversary, repeat the attack against this hardened version of KataGo, train KataGo against the new adversary, etc. This is no longer self-play in the conventional sense, though -- it's closer to something like policy-space response oracles (PSRO). That's an interesting direction to explore in future work, and we're considering it, but it has its own challenges -- doing iterated best response is much more computationally expensive than the approximate best response in conventional self-play.
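Roughly, that loop would look like the sketch below, where `train_best_response` and `mix` stand in for an entire RL training run and an opponent mixture -- which is exactly why each round is so much more expensive than one continuous self-play run:

```python
def iterated_hardening(initial_victim, num_rounds, train_best_response, mix):
    """PSRO-flavoured sketch: alternate adversary training and victim hardening.

    `train_best_response(opponent)` and `mix(policies)` are placeholders for a
    full (approximate) best-response training run and an opponent mixture;
    this is not the actual pipeline from the paper.
    """
    victims, adversaries = [initial_victim], []
    victim = initial_victim
    for _ in range(num_rounds):
        # 1. Train an adversarial policy against the frozen current victim.
        adversary = train_best_response(opponent=victim)
        adversaries.append(adversary)

        # 2. Harden the victim against the adversaries found so far.
        victim = train_best_response(opponent=mix(adversaries))
        victims.append(victim)

        # Conventional self-play updates one shared network continuously;
        # here every round pays for two separate best-response training runs.
    return victims, adversaries
```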
