Authors:
(1) Avrim Blum, Toyota Technological Institute at Chicago, IL, USA;
(2) Melissa Dutz, Toyota Technological Institute at Chicago, IL, USA.
Table of Links
Abstract and 1 Introduction
2 Setting and 2.1 Models of behaviorally-biased opponents
3 Preliminaries and Intuition
4.1 Myopic Best Responder and 4.2 Gambler’s Fallacy Opponent
4.3 Win-Stay, Lose-Shift Opponent
4.4 Follow-the-Leader Opponent and 4.5 Highest Average Payoff Opponent
5 Generalizing
5.1 Other Behaviorally-Biased Strategies
5.2 Exploiting an Unknown Strategy from a Known Set of Strategies
6 Future Work and References
A Appendix
A.1 Win-Stay Lose-Shift Variant: Tie-Stay
A.2 Follow-the-Leader Variant: Limited History
A.3 Ellipsoid Mistake Bounds
A.4 Highest Average Payoff Opponent
5.2 Exploiting an Unknown Strategy from a Known Set of Strategies
Another natural extension is to consider the scenario where we know the opponent uses some strategy from a known set of biased strategies B, but we don’t know which one. Again, we assume strategies in B are deterministic when parameterized with one of t possible deterministic tie-breaking mechanisms.
Predicting the Opponent’s Actions
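One natural way to predict in this setting, offered here as an illustrative sketch rather than the paper's exact procedure: since each strategy in B, once paired with one of the t tie-breaking mechanisms, is deterministic, we can simulate all |B|t candidates, predict with any candidate that is still consistent with the observed history, and discard every candidate whose prediction turns out to be wrong. The true candidate is never discarded, so we mispredict at most |B|t - 1 times. The interface below (a predict(history) method per candidate, an opponent.play(x) call that plays one simultaneous round in which we choose x and returns the opponent's action, and a best_response_to map from predicted opponent actions to our replies) is an assumption for illustration only.

```python
def play_with_predictions(candidates, best_response_to, opponent, rounds):
    """Each round: predict with any surviving candidate, best-respond to that
    prediction, play the round, then discard every candidate whose prediction
    was wrong. Assuming the opponent's true strategy is among the candidates,
    the list never becomes empty and we mispredict at most
    len(candidates) - 1 times in total."""
    history = []                                   # list of (our_action, their_action)
    for _ in range(rounds):
        predictions = {c: c.predict(history) for c in candidates}
        our_action = best_response_to(predictions[candidates[0]])
        their_action = opponent.play(our_action)
        # Keep only the candidates that predicted this round correctly.
        candidates = [c for c in candidates if predictions[c] == their_action]
        history.append((our_action, their_action))
    return history
```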
Learning Best Responses
The challenge, again, is to learn best responses. Any time the same algorithm learns best responses against every strategy in B, this extension is trivial. For example, if all strategies in B have the natural property discussed above (in Section 5.1) that after we play an action “enough” times in a row, the opponent eventually plays a best response to it, we can use the same algorithm from Section 5.1 to learn best responses, taking as “enough” the maximum value required by any strategy in B.
▶ Observation 9. If every strategy in B has the property that the opponent plays a best response to an action once it has been played “enough” times in a row, and we know an upper bound on “enough” that holds for all strategies in B, then we can learn best responses as described in Section 5.1.
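To make Observation 9 concrete, here is a minimal sketch of this procedure, assuming a symmetric win/lose/tie game and the same illustrative opponent.play interface as in the sketch above; the names are assumptions, not from the paper.

```python
def learn_best_responses(our_actions, opponent, enough):
    """For each action x, play x `enough` times in a row. By the assumed
    property, the opponent ends up best-responding to x, i.e. playing some
    action y that beats x; since the game is symmetric, y is also the action
    we should play whenever we predict the opponent will play x.
    `enough` is taken to be the maximum threshold over all strategies in B."""
    response_to = {}
    for x in our_actions:
        settled = None
        # Play x enough times, plus one extra round to read off the settled reply.
        for _ in range(enough + 1):
            settled = opponent.play(x)
        response_to[x] = settled
    return response_to
```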
If it is not possible to use the same algorithm to learn best responses for every strategy in B, it may instead be possible to first distinguish between strategies that need different best-response algorithms, and then learn best responses. As a very simple example, suppose we know the opponent uses either the Tie-Stay variant or the Tie-Shift variant of the Win-Stay, Lose-Shift strategy. We cannot use the approach above, because the Tie-Stay variant does not have the property in question. However, we can distinguish between the two strategies by checking whether the opponent “stays” after we play the same action it does (since a permissible game is symmetric, such a round must be a tie). We can perform this check within n rounds by playing the same action a until either the opponent switches to playing a, or settles on some other action b; in the latter case we then play b ourselves to observe its tie behavior. Then, knowing our opponent’s strategy, we can use the corresponding algorithm from Section 4.3 to learn best responses (and, in this case, to predict efficiently as well).
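A minimal sketch of this distinguishing procedure, under the same assumed opponent.play interface as in the earlier sketches (illustrative names, not the paper's code):

```python
def distinguish_tie_behavior(a, opponent, n):
    """Play action a repeatedly (at most n rounds, where n is the number of
    actions). If the opponent ever plays a, that round is a tie and its next
    action reveals the tie rule. If it instead repeats some other action b
    (a win for it, so Win-Stay keeps it on b), we switch to b to force a tie
    and again read its next action."""
    prev = None
    for _ in range(n):
        current = opponent.play(a)
        if current == a:                       # tie round: both players chose a
            nxt = opponent.play(a)
            return "tie-stay" if nxt == a else "tie-shift"
        if current == prev:                    # opponent is staying on b after a win
            b = current
            opponent.play(b)                   # forced tie: opponent stays on b
            nxt = opponent.play(b)
            return "tie-stay" if nxt == b else "tie-shift"
        prev = current
    raise RuntimeError("behavior inconsistent with Win-Stay, Lose-Shift")
```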
Note, however, that it is not always possible to distinguish between deterministic strategies. As a simple example, consider playing a 3-action game against either the Myopic Best Responder, who plays a best response to our last action, or another opponent, call it the Myopic Worst Responder, who plays a worst response to our last action. No matter how we play, the Myopic Best Responder with respect to rock-paper-scissors behaves identically to the Myopic Worst Responder with respect to a reversed version of rock-paper-scissors in which the wins and losses are swapped. Since both games are possible, we cannot tell whether the opponent is playing Myopic Best Response or Myopic Worst Response. Moreover, best responses with respect to one scenario are worst responses with respect to the other. Either of these opponents, if known, would be easy to exploit. In this case, however, we cannot guarantee an average payoff greater than zero: if we play a best response with respect to either opponent in a given round, that action is a loss with respect to the other, so our expected payoff for that round is 0.
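A small self-contained check of this indistinguishability argument (illustrative code, not from the paper): for every action we might have played last round, the myopic best responder for rock-paper-scissors picks exactly the same action as the myopic worst responder for the reversed game.

```python
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value

def rps_payoff(a, b):
    """Row player's payoff in rock-paper-scissors: win = 1, tie = 0, loss = -1."""
    return 0 if a == b else (1 if BEATS[a] == b else -1)

def reverse_rps_payoff(a, b):
    """The reversed game: wins and losses are swapped."""
    return -rps_payoff(a, b)

def myopic_best_response(our_last, payoff):
    return max(ACTIONS, key=lambda x: payoff(x, our_last))

def myopic_worst_response(our_last, payoff):
    return min(ACTIONS, key=lambda x: payoff(x, our_last))

# The two opponents produce identical actions after every possible move of ours.
for our_last in ACTIONS:
    assert (myopic_best_response(our_last, rps_payoff)
            == myopic_worst_response(our_last, reverse_rps_payoff))
```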
▶ Observation 10. There exist collections of game-matrix and strategy pairs such that it is not possible to guarantee a positive average payoff, even though each individual pair, if known, could be exploited to win nearly every round.