Course:CPSC522/Improving Human Behavior Prediction in Simultaneous-Move Games
Title
Improving Human Behavior Prediction in Simultaneous-Move Games
Principal Author: Adnan Reza
Hypothesis: In the previous assignment, I studied literature that analyzed and evaluated behavioral game theoretic models' performance in predicting human behavior. I want to study the existing behavioral game theory literature to identify ways in which the best performing models can be modified to improve their predictive performance.
This page is based on work by James Wright and Kevin Leyton-Brown^{[1]}, which shows how to modify the best-performing behavioral models to improve their predictive performance.
Abstract
In multiagent settings, we often assume that agents will adopt Nash equilibrium strategies. However, studies in experimental economics demonstrate that Nash equilibrium is a poor indicator of human players’ initial behavior in normal-form games. The previous page showed how several behavioral game theory models performed on publicly available experimental data from the literature. This page suggests improvements to the best-performing behavioral models and tests their performance relative to the original models. Finally, we look at two issues that may affect the models' predictive performance: how the models perform on unseen games, and whether payoff scaling matters in the context of the analysis presented.
Builds on
This page builds on Predicting Human Behavior in Normal-Form Games^{[2]} by (1) suggesting modifications to existing models and observing their relative predictive performance and (2) testing hypotheses related to behavioral models.
Related Pages
This page assumes that readers are familiar with Game Theory^{[3]} and Maximum Likelihood Estimation.
Content
Background Knowledge^{[2]}
Game theory is a mathematical system for analyzing and predicting how idealized rational agents behave in strategic situations.
Behavioral game theory aims to extend game theory to modelling human agents.
In game theory, the normal-form representation of a game includes all conceivable strategies, and their corresponding payoffs, for each player.
In a normal-form game:
 Each agent simultaneously chooses an action from a finite action set.
 Each combination of actions yields a known utility to each agent.
 The agents may choose actions either deterministically or stochastically.
In a Nash equilibrium, each agent best responds to the others. An agent best responds to other agents’ actions by choosing a strategy that maximizes utility, conditional on the other agents’ strategies.
A Worked Example: Game of Chicken^{[2]}
For “simple” games, a convenient way to represent a normal-form game is in bimatrix form. We can look at a simple game called "the chicken game" to understand how a game is represented in normal form and how its Nash equilibria are determined. Figure 1a shows that the game has two players (you can think of them as two cars facing each other). Figure 1b shows the bimatrix representation for the game of chicken:
 There are two players: Player 1 (Row Player) and Player 2 (Column Player).
 Each player can either “Dare” or “Chicken out”.
 If both players “Dare”, they collide and receive payoffs
 If player 1 “Dares” & player 2 “Chickens out”, then player 1 receives a payoff of and player 2 receives only .
 If player 1 “Chickens Out” & player 2 “Dares”, then player 1 receives a payoff of only and player 2 receives .
 If players 1 and 2 both “Chicken out”, then each receives a payoff of .
 The pure-strategy Nash equilibria for this game are (Dare, Chicken out) and (Chicken out, Dare), where neither player has an incentive to deviate. These pairs of strategies are "self-enforcing": each player's strategy is an optimal (best) response to the other player's strategy.
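As a rough illustration of how pure-strategy Nash equilibria can be found in a bimatrix game, the sketch below checks every action pair for a profitable unilateral deviation. The payoff numbers are hypothetical placeholders chosen only to have the structure of a chicken game; they are not taken from Figure 1b, and the name `pure_nash` is illustrative.

```python
from itertools import product

DARE, CHICKEN = 0, 1
ACTIONS = ["Dare", "Chicken out"]

# Hypothetical payoffs (an assumption, NOT the values from Figure 1b):
# (row action, column action) -> (row payoff, column payoff)
PAYOFFS = {
    (DARE, DARE): (0, 0),
    (DARE, CHICKEN): (7, 2),
    (CHICKEN, DARE): (2, 7),
    (CHICKEN, CHICKEN): (6, 6),
}

def pure_nash(payoffs, n_row, n_col):
    """Return every action pair from which neither player gains by deviating alone."""
    equilibria = []
    for r, c in product(range(n_row), range(n_col)):
        u_row, u_col = payoffs[(r, c)]
        # Row player cannot do better by switching rows, holding the column fixed.
        row_ok = all(payoffs[(r2, c)][0] <= u_row for r2 in range(n_row))
        # Column player cannot do better by switching columns, holding the row fixed.
        col_ok = all(payoffs[(r, c2)][1] <= u_col for c2 in range(n_col))
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria

equilibria = pure_nash(PAYOFFS, 2, 2)
print([(ACTIONS[r], ACTIONS[c]) for r, c in equilibria])
```

With these placeholder payoffs, the only pairs that survive the check are the two in which exactly one player dares.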
Poor Predictive Performance of Nash equilibrium^{[2]}
However, even though Nash equilibrium is an intuitive and appealing notion of equilibrium, it is often a poor predictor of human behavior. The vast majority of human players choose in the game of chicken. Modifications to a game that do not change the Nash equilibrium predictions at all can cause large changes in how human subjects play the game. In the game of chicken, for example, people play much closer to Nash equilibrium when the penalty is large. Clearly, Nash equilibrium is not the whole story. Behavioral game theory proposes a number of models to better explain human behavior.
Related Literature
The central question the previous page tried to answer was, “What model is best for predicting human behavior in simultaneous-move games?” Here, we look at ways of improving predictive performance by modifying existing models and comparing their performance. Before doing so, we surveyed the existing literature to determine the extent to which this question had already been answered. The large majority of the existing literature appears to be more concerned with explaining behavior than with predicting it, so comparisons of out-of-sample prediction performance were rare. A few notable exceptions are:
 Camerer et al. (2011)^{[4]} evaluated the performance of QRE and cognitive hierarchy variants on one experimental treatment using parameters estimated on two separate experimental treatments;
 Crawford and Iriberri (2007)^{[5]} compared the performance of two models by training each model on each of the n games in their dataset individually, and then evaluating the performance of each of these trained models on each of the n − 1 other individual games;
 Camerer et al. (2004)^{[6]} and Chong et al. (2005)^{[7]} computed likelihoods on each individual game in their datasets after using models fit to the n − 1 remaining games;
 Morgan and Sefton (2002)^{[8]} and Hahn et al. (2010)^{[9]} evaluated prediction performance using held-out test data;
 Stahl and Wilson (1995)^{[10]} evaluated prediction performance on 3 games using parameters fit from the other games.
Behavioral Game Theory Models
Themes:
 Quantal response: Agents best respond with high probability, rather than best responding deterministically.
 Iterative strategic reasoning: Agents can only perform limited steps of strategic “lookahead”.
One model is based on quantal response, two models are based on iterative strategic reasoning, and one model incorporates both.
Quantal Response Equilibrium (QRE)
One prominent behavioral theory asserts that agents become more likely to make errors as those errors become less costly. We refer to this property as cost-proportional errors. This can be modeled by assuming that agents best respond quantally, rather than via strict maximization.
QRE model ^{[11]}
 Agents quantally best respond to each other.
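 A (logit) quantal best response $QBR_i^G(s_{-i}; \lambda)$ by agent $i$ to a strategy profile $s_{-i}$ in game $G$ is a mixed strategy $s_i$ such that
$$ s_i(a_i) = \frac{\exp[\lambda\, u_i(a_i, s_{-i})]}{\sum_{a_i' \in A_i} \exp[\lambda\, u_i(a_i', s_{-i})]}, $$
where λ (the precision parameter) indicates how sensitive agents are to utility differences. Note that unlike regular best response, which is a set-valued function, quantal best response returns a single mixed strategy.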
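The logit quantal best response is a softmax over expected utilities scaled by the precision. The following is a minimal sketch under stated assumptions: a two-player game, a payoff table `PAYOFF` whose values are hypothetical, and function names chosen here for illustration.

```python
import math

# Hypothetical own-payoff table (an assumption): PAYOFF[a][b] is the utility of
# playing own action a when the opponent plays action b.
PAYOFF = [[0.0, 7.0],
          [2.0, 6.0]]

def expected_utilities(payoff, opponent_strategy):
    """Expected utility of each own action against a mixed opponent strategy."""
    return [sum(p * u for p, u in zip(opponent_strategy, row)) for row in payoff]

def quantal_best_response(payoff, opponent_strategy, lam):
    """Logit response: probability of each action is proportional to exp(lam * EU)."""
    eu = expected_utilities(payoff, opponent_strategy)
    top = max(eu)  # subtracting the max leaves the ratios unchanged but avoids overflow
    weights = [math.exp(lam * (u - top)) for u in eu]
    total = sum(weights)
    return [w / total for w in weights]

uniform_play = quantal_best_response(PAYOFF, [0.5, 0.5], lam=0.0)
sharp_play = quantal_best_response(PAYOFF, [0.5, 0.5], lam=50.0)
print(uniform_play, sharp_play)
```

At a precision of zero every action is equally likely, and as the precision grows the response concentrates on the action with the highest expected utility, which is the sense in which errors are cost-proportional.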
Level-k
Another key idea from behavioral game theory is that humans can perform only a bounded number of iterations of strategic reasoning. The level-k model^{[12]} captures this idea by associating each agent i with a level k_i ∈ {0, 1, 2, …}, corresponding to the number of iterations of reasoning the agent is able to perform.
 A level-0 agent plays randomly, choosing uniformly at random from his possible actions.
 A level-k agent, for k ≥ 1, best responds to the strategy played by level-(k − 1) agents. If a level-k agent has more than one best response, he mixes uniformly over them.
Here we consider a particular level-k model, dubbed Lk, which assumes that all agents belong to levels 0, 1, and 2. Each agent with level k > 0 has an associated probability ε_k of making an “error”, i.e., of playing an action that is not a best response to the level-(k − 1) strategy. However, the agents do not account for these errors when forming their beliefs about how lower-level agents will act.
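The iterated reasoning described above can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes a symmetric two-player game with hypothetical payoffs, and it assumes that an erring agent spreads its error probability uniformly over the actions that are not best responses, which the text does not specify.

```python
# Hypothetical symmetric payoff table (an assumption): PAYOFF[a][b] is the
# utility of playing own action a against opponent action b.
PAYOFF = [[0.0, 7.0],
          [2.0, 6.0]]

def best_response_set(payoff, opponent_strategy, tol=1e-12):
    """Indices of all actions that maximize expected utility."""
    eu = [sum(p * u for p, u in zip(opponent_strategy, row)) for row in payoff]
    top = max(eu)
    return [a for a, u in enumerate(eu) if top - u <= tol]

def uniform_over(actions, n):
    """Mixed strategy that is uniform on `actions` and zero elsewhere."""
    return [1.0 / len(actions) if a in actions else 0.0 for a in range(n)]

def lk_predictions(payoff, eps, max_level=2):
    """Return the strategy actually played at each level 0..max_level.

    eps[k] is the error probability of level k (eps[0] is unused).
    """
    n = len(payoff)
    believed = [uniform_over(range(n), n)]  # error-free strategies that higher levels assume
    played = [believed[0]]
    for k in range(1, max_level + 1):
        best = best_response_set(payoff, believed[k - 1])
        believed.append(uniform_over(best, n))
        others = [a for a in range(n) if a not in best]
        if others:
            played.append([(1 - eps[k]) / len(best) if a in best else eps[k] / len(others)
                           for a in range(n)])
        else:
            played.append(believed[k])  # every action is a best response, so no error is possible
    return played

levels = lk_predictions(PAYOFF, eps=[0.0, 0.1, 0.2])
print(levels)
```

Note that `believed` never includes the error terms, mirroring the statement that agents ignore errors when forming beliefs about lower levels.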
Poisson Cognitive Hierarchy
The cognitive hierarchy model ^{[6]}, like QLk, aims to model agents with heterogeneous bounds on iterated reasoning. It differs from the QLk model in two ways:
 First, agents use standard best response, rather than quantal response.
 Second, agents best respond to the full distribution of lower-level types, rather than only to the strategy one level below.
More formally, every agent again has an associated level m ∈ {0, 1, 2, …}. Let F be the cumulative distribution of the levels in the population. Level-0 agents play (typically uniformly) at random. Level-m agents (m ≥ 1) best respond to the strategies that would be played in a population described by the cumulative distribution F(j | j < m). Camerer, Ho, and Chong (2004)^{[6]} advocate a single-parameter restriction of the cognitive hierarchy model called Poisson-CH, in which the levels of agents in the population are distributed according to a Poisson distribution.
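To make the truncated beliefs concrete, here is a small sketch of a Poisson cognitive hierarchy for a symmetric two-player game. The payoffs, the value of `tau`, and the cutoff `max_level` (used to approximate the infinite Poisson support) are all assumptions made for illustration.

```python
import math

# Hypothetical symmetric payoff table (an assumption): PAYOFF[a][b] is the
# utility of playing own action a against opponent action b.
PAYOFF = [[0.0, 7.0],
          [2.0, 6.0]]

def poisson_pmf(m, tau):
    return math.exp(-tau) * tau ** m / math.factorial(m)

def poisson_ch(payoff, tau, max_level):
    """Return (per-level strategies, population-level prediction)."""
    n = len(payoff)
    weights = [poisson_pmf(m, tau) for m in range(max_level + 1)]
    strategies = [[1.0 / n] * n]  # level 0 randomizes uniformly
    for m in range(1, max_level + 1):
        # Belief: levels 0..m-1, reweighted so that their proportions sum to one.
        mass = sum(weights[:m])
        belief = [sum(weights[l] * strategies[l][a] for l in range(m)) / mass
                  for a in range(n)]
        eu = [sum(p * u for p, u in zip(belief, row)) for row in payoff]
        top = max(eu)
        best = [a for a, u in enumerate(eu) if top - u <= 1e-12]
        strategies.append([1.0 / len(best) if a in best else 0.0 for a in range(n)])
    total = sum(weights)  # renormalize because the support is cut off at max_level
    prediction = [sum(weights[m] * strategies[m][a] for m in range(max_level + 1)) / total
                  for a in range(n)]
    return strategies, prediction

strategies, prediction = poisson_ch(PAYOFF, tau=1.5, max_level=4)
print(strategies[1], strategies[2], prediction)
```

Unlike a level-k agent, the level-2 agent here responds to a blend of level-0 and level-1 play, weighted by their relative Poisson probabilities.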
Quantal Level-k
The quantal level-k (QLk) model combines elements of the QRE and level-k models. In QLk, agents have one of three levels, as in Lk. Each agent responds to its beliefs quantally, as in QRE. As in Lk, agents believe that the rest of the population has the next-lower type. The main difference between QLk and Lk is in the error structure. In Lk, higher-level agents believe that all lower-level agents best respond perfectly, although in fact every agent has some probability of making an error. In contrast, in QLk, agents are aware of the quantal nature of the lower-level agents’ responses, and have a (possibly incorrect) belief about the lower-level agents’ precision.
Quantal Cognitive Hierarchy (QCH)
QCH is a hybrid behavioral model that combines the best elements of the QLk and cognitive hierarchy models.
In the quantal cognitive hierarchy (QCH) model, agents at every level:
 Respond quantally (as in QLk).
 Respond to the truncated, true distribution of lower levels (as in cognitive hierarchy).
 Have the same precision λ.
 Are aware of the true precision of lower levels.
The results from the previous assignment indicate that QCH outperforms both QLk and the Cognitive Hierarchy models.
Model Variations and Analysis
Data
We use large-scale, publicly available sets of human-subject experimental data from six experimental studies, plus a combined dataset. Each observation of an action by an experimental subject is represented as a pair (a_i, G), where a_i is the action that the subject took when playing as player i in game G. All games were two-player, so each single play of a game generated two observations. The datasets are:
 SW94: 400 observations from [Stahl & Wilson 1994]^{[13]}
 SW95: 576 observations from [Stahl & Wilson 1995]^{[10]}
 CGCB98: 1296 observations from [Costa-Gomes et al. 1998]^{[14]}
 GH01: 500 observations from [Goeree & Holt 2001]^{[15]}
 CVH03: 2992 observations from [Cooper & Van Huyck 2003]^{[16]}
 RPC09: 1210 observations from [Rogers et al. 2009]^{[17]}
 ALL6: All 6974 observations
Subjects played 2-player normal-form games once each. Each action by an individual player is a single observation.
To see how the original models (before modification) performed against each other, refer to Predicting Human Behavior in Normal-Form Games.
Model Variations
We investigate the properties of the QLk model by evaluating the predictive power of a family of systematic variations of this model. In the end, we identify a simpler model that dominates QLk on our data. In other words, we seek to improve the predictive performance of the QLk and QCH models by varying their properties.
Specifically, we construct a broad family of models by modifying the QLk model along four different axes.
 First, QLk assumes a maximum level of 2; we evaluated variants with maximum levels of 3 to 7. Prediction performance increases by around 12% when we choose k=3 (other things being equal) and by up to 20% with k=4. With k>4, there is no significant further increase in performance.
 Second, QLk assumes inhomogeneous precisions, in that it allows each level to have a different precision; we also considered homogeneous-precision models. Inhomogeneous precisions lead to a noticeable increase in performance, at the cost of a more complex model (i.e., more parameters).
 Third, QLk allows general precision beliefs that can differ from lower-level agents’ true precisions; we also constructed models that make the simplifying assumption that all agents have accurate precision beliefs about lower-level agents. Assuming accurate precision beliefs leads to about a 10% fall in performance across most of the model variants.
 Fourth, QLk assumes that each agent believes the rest of the population is exactly one level below it (Lk population beliefs); we also constructed models in which agents respond to the distribution of all lower levels (cognitive hierarchy population beliefs).
Figure 2 shows model variations with prediction performance on the combined dataset. The models with max level of ∗ used a Poisson distribution. Models are named according to precision beliefs, precision homogeneity, population beliefs, and type of level distribution. E.g., ahQCH3 is the model with accurate precision beliefs, homogeneous precisions, cognitive hierarchy population beliefs, and a discrete distribution over levels 0–3.
Evaluating the Model Variants
We evaluate the predictive performance of each model variant on the combined dataset using 10 rounds of 10-fold cross-validation. Specifically, for each round, each dataset was randomly divided into 10 parts. For each of the 10 ways of selecting 9 parts from the 10, the maximum likelihood estimate of the model’s parameters was computed based on those 9 parts, using the Nelder-Mead simplex algorithm.^{[18]}^{[19]} Then the log likelihood of the remaining part was computed given the fitted model's predictions. We call the average of this quantity across all 10 parts the cross-validated log likelihood.
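The evaluation loop can be sketched as below. To keep the example dependency-free, a coarse grid search stands in for the Nelder-Mead simplex algorithm, and the one-parameter logit model, its utilities, and the synthetic observations are all assumptions introduced purely for illustration.

```python
import math
import random

UTILITIES = [1.0, 2.0]  # hypothetical utilities of two actions (an assumption)

def action_probs(lam):
    weights = [math.exp(lam * u) for u in UTILITIES]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(lam, observations):
    probs = action_probs(lam)
    return sum(math.log(probs[a]) for a in observations)

def fit(observations):
    # Stand-in for Nelder-Mead: pick the best precision on a coarse grid in [0, 5].
    grid = [step * 0.05 for step in range(101)]
    return max(grid, key=lambda lam: log_likelihood(lam, observations))

def cross_validated_log_likelihood(data, folds=10, rounds=10, seed=0):
    rng = random.Random(seed)
    round_scores = []
    for _ in range(rounds):
        shuffled = list(data)
        rng.shuffle(shuffled)
        parts = [shuffled[i::folds] for i in range(folds)]
        held_out = []
        for i, test_part in enumerate(parts):
            # Train on the other nine parts, then score the held-out part.
            train = [x for j, part in enumerate(parts) if j != i for x in part]
            held_out.append(log_likelihood(fit(train), test_part))
        round_scores.append(sum(held_out) / folds)
    return sum(round_scores) / rounds  # average over the repeated rounds

observations = [1] * 70 + [0] * 30  # synthetic data, not from the cited studies
score = cross_validated_log_likelihood(observations)
print(round(score, 3))
```

Repeating the random split over several rounds reduces the dependence of the reported score on any single partition of the data.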
Simplicity Versus Predictive Performance
Other things being equal, a model with higher performance is more desirable, as is a model with fewer parameters. We can plot an “efficient frontier” of those models that achieved the best performance for a given number of parameters or fewer; see Figure 3.
The original QLk model (giQLk2) is not efficient in this sense; it is dominated by ahQCH3^{[1]}, which has significantly better predictive performance and fewer parameters (because it restricts agents to homogeneous precisions and accurate beliefs). We could argue that the flexibility added by inhomogeneous precisions and general precision beliefs is less important than the number of levels and the choice of population belief. On the other hand, the poor performance of the Poisson variants relative to ahQCH3 may indicate that flexibility in describing the level distribution is more important than the total number of levels modeled.
There is a noticeable pattern in the models along the efficient frontier: this set consists exclusively of models with accurate precision beliefs, homogeneous precisions, and cognitive hierarchy beliefs. This suggests that the most parsimonious way to model human behavior in normal-form games is to use a model of this form, with the tradeoff between simplicity (i.e., number of parameters) and predictive power determined solely by the number of levels modeled. For the combined dataset, adding additional levels yielded small increases in predictive power until level 5, after which it yielded no further statistically significant improvements.^{[1]}
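An efficient frontier of this kind can be computed with a single pass over the models sorted by parameter count. The sketch below uses placeholder model names and made-up scores; they are not the values plotted in Figure 3.

```python
def efficient_frontier(models):
    """models maps name -> (number of parameters, score); a higher score is better.

    A model is on the frontier if no model with the same number of parameters
    or fewer achieves an equal or better score.
    """
    frontier = []
    best_so_far = float("-inf")
    # Sort by parameter count, breaking ties by putting the best score first.
    ordered = sorted(models.items(), key=lambda item: (item[1][0], -item[1][1]))
    for name, (_, score) in ordered:
        if score > best_so_far:
            frontier.append(name)
            best_so_far = score
    return frontier

# Placeholder models with made-up (parameters, score) pairs for illustration only.
candidates = {
    "A": (1, -100.0),
    "B": (2, -90.0),
    "C": (3, -95.0),
    "D": (3, -85.0),
    "E": (4, -85.0),
}
frontier = efficient_frontier(candidates)
print(frontier)
```

In this toy input, model C is dominated because D scores better with the same number of parameters, and E is dominated because it needs more parameters to match D.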
Spike-Poisson Variant
A Poisson distribution might be better able to fit our data if the proportion of level-0 agents were specified separately. The intuition behind the Spike-Poisson variant is to allow bimodal distributions instead of unimodal ones (by specifying level-0 agents separately) while still offering the advantage of representing higher-level agents without needing a separate parameter for each level.^{[1]} In this section, we evaluate an ahQCH model that uses just such a distribution: a mixture of a deterministic distribution of level-0 agents and a standard Poisson distribution. We refer to this mixture as a “spike-Poisson” distribution. We define the full Spike-Poisson QCH model here.
Spike-Poisson Quantal Cognitive Hierarchy (QCH) model^{[1]}. Let $\pi^{SP}_{i,m} \in \Pi(A_i)$ be the distribution over actions predicted for an agent i with level m by the Spike-Poisson QCH model. Let
$$ f(m) = \begin{cases} \epsilon + (1-\epsilon)\,\mathrm{Poisson}(m; \tau) & \text{if } m = 0, \\ (1-\epsilon)\,\mathrm{Poisson}(m; \tau) & \text{otherwise.} \end{cases} $$
Let $QBR_i^G(s_{-i}; \lambda)$ denote i’s quantal best response in game G to the strategy profile $s_{-i}$ (of the n − 1 other agents), given precision parameter λ. Let
$$ \pi^{SP}_{i,0:m} = \frac{\sum_{\ell=0}^{m} f(\ell)\, \pi^{SP}_{i,\ell}}{\sum_{\ell'=0}^{m} f(\ell')} $$
be the “truncated” distribution over actions predicted for an agent conditional on that agent’s having level 0 ≤ ℓ ≤ m. Then $\pi^{SP}$ is defined as
$$ \pi^{SP}_{i,0}(a_i) = |A_i|^{-1}, \qquad \pi^{SP}_{i,m}(a_i) = QBR_i^G(\pi^{SP}_{-i,0:m-1}; \lambda)(a_i) \quad (m \ge 1). $$
The overall predicted distribution of actions is a weighted sum of the distributions for each level:
$$ \Pr(a_i \mid G, \tau, \epsilon, \lambda) = \sum_{\ell=0}^{\infty} f(\ell)\, \pi^{SP}_{i,\ell}(a_i). $$
The model thus has three parameters: the mean of the Poisson distribution τ, the spike probability ε, and the precision λ.
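A compact sketch of the model for a symmetric two-player game follows. It assumes that the level distribution puts an extra point mass `eps` on level 0 and scales a Poisson distribution by `1 - eps`, and it cuts the level distribution off at `max_level` to approximate the infinite sum; the payoffs and parameter values are hypothetical.

```python
import math

# Hypothetical symmetric payoff table (an assumption): PAYOFF[a][b] is the
# utility of playing own action a against opponent action b.
PAYOFF = [[0.0, 7.0],
          [2.0, 6.0]]

def spike_poisson_weight(level, tau, eps):
    """Mixture of a point mass at level 0 (weight eps) and a Poisson(tau)."""
    poisson = math.exp(-tau) * tau ** level / math.factorial(level)
    spike = eps if level == 0 else 0.0
    return spike + (1.0 - eps) * poisson

def quantal_best_response(payoff, belief, lam):
    eu = [sum(p * u for p, u in zip(belief, row)) for row in payoff]
    top = max(eu)  # shift by the max for numerical stability
    weights = [math.exp(lam * (u - top)) for u in eu]
    total = sum(weights)
    return [w / total for w in weights]

def spike_poisson_qch(payoff, tau, eps, lam, max_level=10):
    """Predicted distribution over actions, with levels cut off at max_level."""
    n = len(payoff)
    f = [spike_poisson_weight(m, tau, eps) for m in range(max_level + 1)]
    level_strategies = [[1.0 / n] * n]  # level 0 randomizes uniformly
    for m in range(1, max_level + 1):
        # Respond quantally to the truncated distribution of levels 0..m-1.
        mass = sum(f[:m])
        belief = [sum(f[l] * level_strategies[l][a] for l in range(m)) / mass
                  for a in range(n)]
        level_strategies.append(quantal_best_response(payoff, belief, lam))
    total = sum(f)  # renormalize for the cutoff
    return [sum(f[m] * level_strategies[m][a] for m in range(max_level + 1)) / total
            for a in range(n)]

prediction = spike_poisson_qch(PAYOFF, tau=1.5, eps=0.3, lam=0.5)
all_level_zero = spike_poisson_qch(PAYOFF, tau=1.5, eps=1.0, lam=0.5)
print(prediction, all_level_zero)
```

Setting `eps` to one makes every agent level 0, so the prediction collapses to uniform play; setting it to zero recovers an ordinary Poisson level distribution.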
Figure 4 compares the performance of the Spike-Poisson variant (ahQCHsp) to the ahQCH model variants discussed above. The three-parameter ahQCHsp model outperformed every model except for ahQCH5. In particular, it outperformed both ahQCH3 and ahQCH4, despite having fewer parameters than either of them.
The modeling of very high-level agents (level 5 and higher) seems to determine the performance differences between ahQCHsp and the other ahQCH models. Hence ahQCHsp, which includes level-5 agents, outperforms models that do not; but precisely tuning the proportions of levels 5 and below is more important for prediction performance than including levels 6 and above. This is surprising, as agents higher than level 3 are not generally believed to exist in any substantial numbers; for instance, Arad and Rubinstein (2012)^{[20]} found no evidence for beliefs of fourth order or higher. One possible explanation for the influence of very high-level responses is that they may be correcting for our overly simple specification of level-0 behavior as uniform randomization. An interesting direction for future work would be to investigate richer specifications of level-0 behavior. Overall, the results recommend the use of the Spike-Poisson QCH variant for predicting human play in unrepeated normal-form games. It was the best-performing model variant except for ahQCH5, and that model required twice as many parameters to achieve slightly better predictive performance.
How would the models perform on Unseen Games?^{[1]}
In the performance comparisons made thus far, we have used cross-validation at the level of individual data points (a_i, G). This means that, with very high probability, every game in every testing fold also appeared in the corresponding training set. In other words, we never evaluated a model’s predictions on an entirely unseen game, which raises the question of how well the models predict behavior in games they were not trained on.
We address this with an alternative analysis that compares the performance of the original models to some of the variants, using a modified cross-validation procedure.^{[1]} In this procedure, we divide our combined dataset into folds containing equal numbers of games, with all of the data points for a given game belonging to a single fold. Hence, we evaluate each model entirely using games that were absent from the training set. We report the average of 10 splits into folds to reduce variance in our estimates. Figure 5 shows the performance of the QRE, Poisson-CH, Lk, QLk, ahQCHp, ahQCHsp, and ahQCH5 models on the combined dataset under both cross-validation procedures. Overall, performance was virtually identical under the two procedures, suggesting that the models we studied generalize well to unseen games. For the efficient-frontier models we observed small but statistically significant degradations in performance on unseen games; the other models had indistinguishable performance^{[1]}.
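The game-level split can be sketched as follows: observations are grouped by game, and whole games (rather than individual data points) are dealt into folds. The observation format, the game labels, and the function name are assumptions made for this example.

```python
import random
from collections import defaultdict

def game_level_folds(observations, folds, seed=0):
    """Split (action, game) observations so that each game lands in exactly one fold."""
    by_game = defaultdict(list)
    for action, game in observations:
        by_game[game].append((action, game))
    games = sorted(by_game)              # fixed order so the shuffle is reproducible
    random.Random(seed).shuffle(games)
    result = [[] for _ in range(folds)]
    for index, game in enumerate(games):
        # Deal games round-robin so folds hold (nearly) equal numbers of games.
        result[index % folds].extend(by_game[game])
    return result

# Synthetic observations: six hypothetical games, three observed actions each.
observations = [(action, game) for game in "ABCDEF" for action in (0, 1, 1)]
folds = game_level_folds(observations, folds=3)
games_per_fold = [{game for _, game in fold} for fold in folds]
print(games_per_fold)
```

Training on all folds but one and testing on the remaining fold then guarantees that every test game is absent from the training set.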
Does Payoff Scaling Matter?^{[21]}
It is important to note that the datasets do not all use the same payoff scale. Were the payoffs in the different games in the dataset expressed in appropriate units? Unlike the level-k and Poisson-CH models, both QRE and quantal level-k depend on the units used to represent payoffs in a game. When considering a single setting this is not a concern, because the precision parameter can scale a game to appropriately sized units. However, when data is combined from multiple studies in which payoffs are expressed on different scales (as is the case with the combined dataset), a single precision parameter may be insufficient to compensate for QRE’s scale dependence.^{[21]}
To investigate this issue, we propose two hypotheses. First, were the subjects concerned only with the relative scales of payoff differences within individual games? To test this hypothesis, we constructed a model (NQRE)^{[21]} that normalizes a game’s payoffs to lie in the interval [0, 1], and then predicts based on the QRE of the normalized game. Second, were the subjects concerned only with the expected monetary value of their payoffs? To test this hypothesis, we constructed a model (CNQRE)^{[21]} that normalizes payoffs so that they are denominated in expected cents. Figure 7 reports the likelihood ratio between the modified QRE models and QRE. Both NQRE and CNQRE performed worse than the original unnormalized QRE on every disaggregated dataset except for SW94 and SW95, where the improvements were very small (although significant). We conclude that subjects responded to the raw payoff numbers, not to the actual values behind those payoff numbers, and not solely to the relative size of the payoff differences. This is plausible given the “money illusion” effect (Shafir, Diamond, and Tversky 1997)^{[22]}, in which people focus on nominal rather than real monetary values.
However, on the aggregated ALL6 dataset, the situation was quite different, with NQRE performing well and CNQRE performing very poorly. This suggests that normalization can yield a better-performing QRE estimate for aggregated experimental data, but that expected monetary value is not a helpful normalization to use^{[21]}.
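The scale dependence of a logit response, and the effect of rescaling payoffs to [0, 1], can be seen in a few lines. This is only a sketch: it applies min-max normalization to a single player's payoff table (the source does not say whether normalization is done per player or jointly), and the payoffs are hypothetical.

```python
import math

def normalize_payoffs(payoff):
    """Min-max rescale all entries of a payoff table to the interval [0, 1]."""
    flat = [u for row in payoff for u in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0 for _ in row] for row in payoff]
    return [[(u - low) / (high - low) for u in row] for row in payoff]

def logit_response(payoff, belief, lam):
    eu = [sum(p * u for p, u in zip(belief, row)) for row in payoff]
    top = max(eu)  # shift by the max for numerical stability
    weights = [math.exp(lam * (u - top)) for u in eu]
    total = sum(weights)
    return [w / total for w in weights]

# The same hypothetical game expressed in two different units (e.g. dollars vs cents).
small_units = [[0, 7], [2, 6]]
large_units = [[100 * u for u in row] for row in small_units]
belief = [0.5, 0.5]

raw_small = logit_response(small_units, belief, lam=0.5)
raw_large = logit_response(large_units, belief, lam=0.5)
norm_small = logit_response(normalize_payoffs(small_units), belief, lam=0.5)
norm_large = logit_response(normalize_payoffs(large_units), belief, lam=0.5)
print(raw_small, raw_large, norm_small, norm_large)
```

With a single shared precision, the raw responses differ sharply between the two unit systems, while the normalized responses coincide; this is the scale dependence that a single λ cannot absorb when datasets with different units are pooled.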
Future Directions
So far we have only looked at datasets consisting of simultaneous-move games. We still do not know how these behavioral models would perform on richer game types, such as sequential games (like chess, where agents choose actions sequentially rather than simultaneously) or games that involve repeated interactions or learning. Investigating richer specifications of level-0 behavior would also be an interesting direction.
Annotated Bibliography
 1. Predicting Human Behavior in Unrepeated, Simultaneous-Move Games. James R. Wright and Kevin Leyton-Brown. [arXiv preprint] Games and Economic Behavior, Revisions requested.
 2. Predicting Human Behavior in Normal-Form Games wiki page from Course:CPSC522/Predicting_Human_Behavior_in_NormalForm_Games
 3. Game Theory CPSC522 wiki from https://en.wikipedia.org/wiki/Game_Theory
 4. Camerer, C., Nunnari, S., and Palfrey, T. R. (2011). Quantal response and nonequilibrium beliefs explain overbidding in maximum-value auctions. Working paper, California Institute of Technology.
 5. Crawford, V. and Iriberri, N. (2007b). Level-k auctions: Can a nonequilibrium model of strategic thinking explain the winner’s curse and overbidding in private-value auctions? Econometrica, 75(6):1721–1770.
 6. Camerer, C.; Ho, T.; and Chong, J. 2004. A cognitive hierarchy model of games. QJE 119(3):861–898.
 7. Camerer, C., Ho, T., and Chong, J. (2004). A cognitive hierarchy model of games. Quarterly Journal of Economics, 119(3):861–898.
 8. Morgan, J. and Sefton, M. (2002). An experimental investigation of unprofitable games. Games and Economic Behavior, 40(1):123–146.
 9. Hahn, P. R., Lum, K., and Mela, C. (2010). A semiparametric model for assessing cognitive hierarchy theories of beauty contest games. Working paper, Duke University.
 10. Stahl, D., and Wilson, P. 1995. On players’ models of other players: Theory and experimental evidence. GEB 10(1):218–254.
 11. McKelvey, R., and Palfrey, T. 1995. Quantal response equilibria for normal form games. GEB 10(1):6–38.
 12. Costa-Gomes, M.; Crawford, V.; and Broseta, B. 2001. Cognition and behavior in normal-form games: An experimental study. Econometrica 69(5):1193–1235.
 13. Stahl, D., and Wilson, P. 1994. Experimental evidence on players’ models of other players. JEBO 25(3):309–327.
 14. Costa-Gomes, M.; Crawford, V.; and Broseta, B. 1998. Cognition and behavior in normal form games: an experimental study. Discussion paper 9822, UCSD
 15. Goeree, J. K., and Holt, C. A. 2001. Ten little treasures of game theory and ten intuitive contradictions. AER 91(5):1402–1422.
 16. Cooper, D., and Van Huyck, J. 2003. Evidence on the equivalence of the strategic and extensive form representation of games. JET 110(2):290–308.
 17. Rogers, B. W.; Palfrey, T. R.; and Camerer, C. F. 2009. Heterogeneous quantal response equilibrium and cognitive hierarchies. JET 144(4):1440–1467.
 18. Nelder, J. A., and Mead, R. 1965. A simplex method for function minimization. Computer Journal 7(4):308–313.
 19. Nelder Mead Simplex Algorithm Code in python from https://github.com/adnanreza/neldermead
 20. Arad, A. and Rubinstein, A. (2012). The 11-20 money request game: A level-k reasoning study. American Economic Review, 102(7):3561–3573.
 21. Beyond Equilibrium: Predicting Human Behavior in Normal Form Games. J. Wright, K. Leyton-Brown. Conference of the Association for the Advancement of Artificial Intelligence (AAAI-10), 2010., from http://www.cs.ubc.ca/~kevinlb/pub.php?u=2010AAAIBeyondEquilibrium.pdf
 22. Shafir, E.; Diamond, P.; and Tversky, A. 1997. Money
