Feb. 2026
| Speaker: | Julien Grand-Clément |
| Institution: | HEC |
| Time: | 14:00 - 15:00 |
| Location: | 3L15 |
Self-play via online learning is a leading paradigm for solving large-scale games and has enabled recent superhuman performance (e.g., Go, Poker). This work clarifies that different convergence notions in self-play (last iterate, best iterate, and a randomly sampled iterate) can behave fundamentally differently. For a broad class of learning dynamics, including Optimistic Multiplicative Weights Update (OMWU), we prove a separation: even in two-player zero-sum games, last-iterate convergence can be arbitrarily slow, random-iterate convergence can be slower than any polynomial, while best-iterate convergence is polynomial. This departs from much prior theory where these notions align, and we attribute the gap to OMWU’s insufficient “forgetfulness,” linking it to empirical behavior in practical game solving.
Based on https://arxiv.org/pdf/2503.02825 and https://arxiv.org/pdf/2406.10631.
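To make the distinction between the three convergence notions concrete, the following is a minimal, illustrative sketch (not the implementation from the papers above) of OMWU self-play on a small two-player zero-sum matrix game. The game, step size, horizon, seed, and all function names are assumptions chosen for illustration; the script tracks the duality gap of the last iterate, the best iterate, and a uniformly sampled random iterate.

```python
import numpy as np

def duality_gap(A, x, y):
    """Exploitability of (x, y) in the zero-sum game min_x max_y x^T A y."""
    return np.max(x @ A) - np.min(A @ y)

def omwu_selfplay(A, eta=0.1, T=5000, seed=0):
    """Run OMWU self-play and compare last-, best-, and random-iterate gaps."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.ones(m) / m          # row player (minimizer), uniform start
    y = np.ones(n) / n          # column player (maximizer), uniform start
    gx_prev = A @ y             # previous loss vector for x
    gy_prev = -(x @ A)          # previous loss vector for y (maximizer's loss is -A^T x)
    gaps = []
    for _ in range(T):
        # Simultaneous play: both gradients come from the current joint iterate.
        gx, gy = A @ y, -(x @ A)
        # Optimistic MWU step: use 2*current - previous gradient as the prediction.
        x = x * np.exp(-eta * (2 * gx - gx_prev))
        x /= x.sum()
        y = y * np.exp(-eta * (2 * gy - gy_prev))
        y /= y.sum()
        gx_prev, gy_prev = gx, gy
        gaps.append(duality_gap(A, x, y))
    gaps = np.array(gaps)
    return {
        "last_iterate_gap": gaps[-1],
        "best_iterate_gap": gaps.min(),
        "random_iterate_gap": gaps[rng.integers(T)],
    }

if __name__ == "__main__":
    # Matching pennies: unique equilibrium at ((1/2, 1/2), (1/2, 1/2)), value 0.
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])
    print(omwu_selfplay(A))
```

In this sketch the best-iterate gap simply takes the minimum over the trajectory, whereas the last- and random-iterate gaps depend on where the dynamics happen to sit at a fixed or sampled time, which is exactly the distinction the talk's separation result concerns.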