Course:CPSC522/Improving Human Behavior Prediction in Simultaneous-Move Games



Improving Human Behavior Prediction in Simultaneous-Move Games

Principal Author: Adnan Reza
Hypothesis: In the previous assignment, I studied literature that analyzed and evaluated behavioral game theoretic models' performance in predicting human behavior. I want to study the existing behavioral game theory literature to identify ways in which the best performing models can be modified to improve their predictive performance.

This page is based on work by James Wright and Kevin Leyton-Brown[1], showing how the best-performing behavioral models can be modified to improve their predictive performance.


In multi-agent settings, we often assume that agents will adopt Nash equilibrium strategies. However, studies in experimental economics demonstrate that Nash equilibrium is a poor predictor of human players’ initial behavior in normal-form games. The previous page compared how several behavioral game theory models performed on publicly available experimental data from the literature. This page suggests improvements to the best-performing behavioral models and tests their performance relative to the original models. Finally, we look at two issues that may affect the models' predictive performance: how the models perform on unseen games, and whether payoff scaling matters in the context of the analysis presented.

Builds on

This page builds on Predicting Human Behavior in Normal-Form Games[2] by (1) suggesting modifications to existing models and observing their relative predictive performance and (2) testing hypotheses related to behavioral models.

Related Pages

This page assumes that readers are familiar with Game Theory[3] and Maximum Likelihood Estimation.


Background Knowledge[2]

Game theory is a mathematical system for analyzing and predicting how idealized rational agents behave in strategic situations.
Behavioral game theory aims to extend game theory to modelling human agents.

Figure 1a: Chicken Game

In game theory, the normal-form representation of a game includes all conceivable strategies, and their corresponding payoffs, for each player.
In a normal form game:

  • Each agent simultaneously chooses an action from a finite action set.
  • Each combination of actions yields a known utility to each agent.
  • The agents may choose actions either deterministically or stochastically.

In a Nash equilibrium, each agent best responds to the others. An agent best responds to other agents’ actions by choosing a strategy that maximizes utility, conditional on the other agents’ strategies.

Figure 1b: The Game of Chicken: Bi-Matrix

A Worked Example: Game of Chicken[2]

For “simple” games, a convenient way to represent a normal-form game is in bi-matrix form. We can look at a simple game called "the chicken game" to understand how a game is represented in normal form and how its Nash equilibria are determined. Figure 1a shows that the game has two players (you can think of them as two cars driving toward each other). Figure 1b shows the bi-matrix representation for the game of chicken:

  • There are two players: Player 1 (Row Player) and Player 2 (Column Player).
  • Each player can either “dare” or “chicken out” .
  • If both players “Dare”, they collide and each receives the lowest payoff in the game (see Figure 1b for the values).
  • If player 1 “Dares” and player 2 “Chickens Out”, then player 1 receives the higher payoff and player 2 receives only the lower one.
  • If player 1 “Chickens Out” and player 2 “Dares”, then player 1 receives only the lower payoff and player 2 receives the higher one.
  • If players 1 and 2 both “Chicken Out”, then each receives the same intermediate payoff.
  • The pure-strategy Nash equilibria for this game are (Dare, Chicken Out) and (Chicken Out, Dare), where neither player has an incentive to deviate. These strategy pairs are "self-enforcing": each player's strategy is an optimal (best) response to the other player's strategy.
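The equilibrium check described in the list above can be sketched in a few lines. This is a minimal illustration, not code from the cited work: the payoff numbers in `PAYOFFS` are hypothetical stand-ins chosen to have the usual chicken-game ordering (the page's own values are in Figure 1b), and `pure_nash_equilibria` is a name introduced here.

```python
from itertools import product

ACTIONS = ["Dare", "Chicken Out"]

# Hypothetical payoffs (row player, column player); assumed for illustration.
PAYOFFS = {
    ("Dare", "Dare"): (0, 0),
    ("Dare", "Chicken Out"): (7, 2),
    ("Chicken Out", "Dare"): (2, 7),
    ("Chicken Out", "Chicken Out"): (6, 6),
}


def pure_nash_equilibria(actions, payoffs):
    """Return the action profiles where neither player gains by deviating alone."""
    equilibria = []
    for row, col in product(actions, repeat=2):
        u_row, u_col = payoffs[(row, col)]
        # Row player: no alternative row action does strictly better against col.
        row_ok = all(payoffs[(alt, col)][0] <= u_row for alt in actions)
        # Column player: no alternative column action does strictly better against row.
        col_ok = all(payoffs[(row, alt)][1] <= u_col for alt in actions)
        if row_ok and col_ok:
            equilibria.append((row, col))
    return equilibria


equilibria = pure_nash_equilibria(ACTIONS, PAYOFFS)
print(equilibria)
```

With these assumed payoffs, the only profiles that survive the check are the two in which exactly one player dares.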

Poor Predictive Performance of Nash equilibrium[2]

However, even though Nash equilibrium is an intuitive and appealing solution concept, it is often a poor predictor of human behavior. The vast majority of human players choose in the game of chicken. Modifications to a game that don’t change Nash equilibrium predictions at all can cause large changes in how human subjects play the game. In the game of chicken, for example, people play much closer to Nash equilibrium when the penalty is large. Clearly, Nash equilibrium is not the whole story. Behavioral game theory proposes a number of models to better explain human behavior.

Related Literature

The central question the previous page tried to answer was, “What model is best for predicting human behavior in simultaneous-move games?”. Now, we look at ways of improving predictive performance by modifying existing models and then evaluating and comparing their performance. Before doing so, we survey the existing literature to determine the extent to which this question has already been answered. The significant majority of the existing literature appears to be more concerned with explaining behavior than with predicting it, so comparisons of out-of-sample prediction performance are rare. A few notable exceptions are:

  • Camerer et al. (2011)[4] evaluated the performance of QRE and cognitive hierarchy variants on one experimental treatment using parameters estimated on two separate experimental treatments.
  • Crawford and Iriberri (2007)[5] compared the performance of two models by training each model on each game in their dataset individually, and then evaluating the performance of each of these trained models on each of the n − 1 other individual games.
  • Camerer et al. (2004)[6] and Chong et al. (2005)[7] computed likelihoods on each individual game in their datasets after using models fit to the n − 1 remaining games.
  • Morgan and Sefton (2002)[8] and Hahn et al. (2010)[9] evaluated prediction performance using held-out test data.
  • Stahl and Wilson (1995)[10] evaluated prediction performance on 3 games using parameters fit from the other games.

Behavioral Game Theory Models


The models considered on this page build on two key ideas:

  1. Quantal response: Agents best-respond with high probability rather than deterministically best responding.
  2. Iterative strategic reasoning: Agents can only perform limited steps of strategic “look-ahead”.

One model is based on quantal response, two models are based on iterative strategic reasoning, and one model incorporates both.

Quantal Response Equilibrium (QRE)

One prominent behavioral theory asserts that agents become more likely to make errors as those errors become less costly. We refer to this property as cost-proportional errors. This can be modeled by assuming that agents best respond quantally, rather than via strict maximization.

QRE model [11]

  • Agents quantally best respond to each other.
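  • A (logit) quantal best response $QBR_i(s_{-i}; \lambda)$ by agent $i$ to a strategy profile $s_{-i}$ is a mixed strategy $s_i$ such that

$$s_i(a_i) = \frac{\exp[\lambda \, u_i(a_i, s_{-i})]}{\sum_{a_i' \in A_i} \exp[\lambda \, u_i(a_i', s_{-i})]},$$

where λ (the precision parameter) indicates how sensitive agents are to utility differences. Note that unlike regular best response, which is a set-valued function, quantal best response returns a single mixed strategy.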
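To make the role of the precision parameter concrete, here is a minimal sketch of a logit quantal response over a vector of expected utilities. The function name `quantal_best_response` and the utility values are assumptions made for illustration; the formula is the standard logit (softmax) form.

```python
import math


def quantal_best_response(expected_utilities, precision):
    """Logit response: P(a) is proportional to exp(precision * u(a))."""
    # Subtracting the maximum first avoids overflow and leaves the result unchanged.
    top = max(expected_utilities)
    weights = [math.exp(precision * (u - top)) for u in expected_utilities]
    total = sum(weights)
    return [w / total for w in weights]


utilities = [1.0, 2.0, 4.0]  # hypothetical expected utilities of three actions
uniform = quantal_best_response(utilities, 0.0)   # precision 0: utilities ignored
noisy = quantal_best_response(utilities, 0.5)     # moderate precision
sharp = quantal_best_response(utilities, 20.0)    # close to strict best response
print(uniform, noisy, sharp)
```

At precision 0 the agent randomizes uniformly; as the precision grows, almost all probability moves to the utility-maximizing action, while cheaper mistakes stay more likely than costly ones.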


Level-k

Another key idea from behavioral game theory is that humans can perform only a bounded number of iterations of strategic reasoning. The level-k model [12] captures this idea by associating each agent $i$ with a level $k_i$, corresponding to the number of iterations of reasoning the agent is able to perform.

  • A level-0 agent plays randomly, choosing uniformly at random from his possible actions.
  • A level-k agent, for $k \geq 1$, best responds to the strategy played by level-$(k-1)$ agents. If a level-k agent has more than one best response, he mixes uniformly over them.

Here we consider a particular level-k model, dubbed Lk, which assumes that all agents belong to levels 0, 1, and 2. Each agent with level $k > 0$ has an associated probability $\epsilon_k$ of making an “error”, i.e., of playing an action that is not a best response to the level-$(k-1)$ strategy. However, the agents do not account for these errors when forming their beliefs about how lower-level agents will act.
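The level-k recursion with errors can be sketched as follows for a two-player game. This is an illustrative sketch under stated assumptions: the payoff matrix is hypothetical, the error probabilities are made up, and spreading the error mass uniformly over the non-best-response actions is an assumption of this example rather than something stated on this page.

```python
def expected_utilities(payoff, opponent_strategy):
    """payoff[a][b] is the agent's utility for own action a against opponent action b."""
    return [sum(p * u for p, u in zip(opponent_strategy, row)) for row in payoff]


def best_response(payoff, opponent_strategy, tol=1e-9):
    """Uniform mix over all utility-maximizing actions."""
    eu = expected_utilities(payoff, opponent_strategy)
    best = [i for i, u in enumerate(eu) if u >= max(eu) - tol]
    return [1.0 / len(best) if i in best else 0.0 for i in range(len(eu))]


def add_errors(ideal, error_prob):
    """Assumption: error mass is spread uniformly over non-best-response actions."""
    wrong = [i for i, p in enumerate(ideal) if p == 0.0]
    if not wrong:
        return list(ideal)
    return [error_prob / len(wrong) if p == 0.0 else (1.0 - error_prob) * p
            for p in ideal]


def level_k_predictions(own_payoff, opp_payoff, errors=(0.0, 0.1, 0.1)):
    """Predicted strategies of one player for levels 0, 1, and 2."""
    n_own, n_opp = len(own_payoff), len(opp_payoff)
    own_ideal = [1.0 / n_own] * n_own   # level 0 plays uniformly at random
    opp_ideal = [1.0 / n_opp] * n_opp
    predictions = [own_ideal]
    for k in (1, 2):
        # Beliefs use the error-free lower-level strategies: agents do not
        # account for errors made by lower levels.
        own_ideal, opp_ideal = (best_response(own_payoff, opp_ideal),
                                best_response(opp_payoff, own_ideal))
        predictions.append(add_errors(own_ideal, errors[k]))
    return predictions


# Hypothetical symmetric chicken game: action 0 = Dare, action 1 = Chicken Out.
chicken = [[0, 7], [2, 6]]
predictions = level_k_predictions(chicken, chicken)
print(predictions)
```

With these assumed payoffs, a level-1 agent mostly chickens out (the best response to uniform play), and a level-2 agent mostly dares (the best response to an opponent who chickens out).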

Poisson Cognitive Hierarchy

The cognitive hierarchy model [6], like QLk (described in the next section), aims to model agents with heterogeneous bounds on iterated reasoning. It differs from the QLk model in two ways:

  • First, agents use standard best response, rather than quantal response.
  • Second, agents best respond to the full distribution of lower-level types, rather than only to the strategy one level below.

More formally, every agent again has an associated level $m \in \{0, 1, 2, \ldots\}$. Let $F$ be the cumulative distribution of the levels in the population. Level-0 agents play (typically uniformly) at random. Level-m agents ($m \geq 1$) best respond to the strategies that would be played in a population described by the cumulative distribution $F(j \mid j < m)$. Camerer, Ho, and Chong (2004)[6] advocate a single-parameter restriction of the cognitive hierarchy model called Poisson-CH, in which the levels of agents in the population are distributed according to a Poisson distribution.
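The Poisson-CH recursion can be sketched for a symmetric two-player game, where a level-m agent best responds to the renormalized Poisson mix of levels 0 to m − 1. The payoff matrix, the value of `tau`, and the `max_level` cut-off are assumptions of this sketch; the model itself places no upper bound on levels.

```python
import math


def poisson_pmf(m, tau):
    """Probability that a Poisson(tau) variable equals m."""
    return math.exp(-tau) * tau ** m / math.factorial(m)


def best_response(payoff, opponent_strategy, tol=1e-9):
    """Uniform mix over all utility-maximizing actions."""
    eu = [sum(p * u for p, u in zip(opponent_strategy, row)) for row in payoff]
    best = [i for i, u in enumerate(eu) if u >= max(eu) - tol]
    return [1.0 / len(best) if i in best else 0.0 for i in range(len(eu))]


def poisson_ch(payoff, tau, max_level=10):
    """Symmetric two-player game; payoff[a][b] is the utility of a against b.

    Returns (per-level strategies, overall predicted distribution).
    max_level truncates the infinite hierarchy (an assumption of this sketch).
    """
    n = len(payoff)
    weights = [poisson_pmf(m, tau) for m in range(max_level + 1)]
    levels = [[1.0 / n] * n]  # level 0: uniform
    for m in range(1, max_level + 1):
        # Belief about the opponent: levels 0..m-1 with renormalized weights.
        z = sum(weights[:m])
        belief = [sum(weights[l] * levels[l][a] for l in range(m)) / z
                  for a in range(n)]
        levels.append(best_response(payoff, belief))
    total = sum(weights)
    overall = [sum(w * s[a] for w, s in zip(weights, levels)) / total
               for a in range(n)]
    return levels, overall


# Hypothetical symmetric chicken game: action 0 = Dare, action 1 = Chicken Out.
chicken = [[0, 7], [2, 6]]
levels, overall = poisson_ch(chicken, tau=1.5)
print(levels[:3], overall)
```

Because the game is symmetric, the opponent's level-l strategy is the same as the agent's own level-l strategy, which keeps the sketch short; an asymmetric game would need one recursion per player.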

Poisson Cognitive Hierarchy[1]

Quantal Level-k

The quantal level-k (QLk) model combines elements of the QRE and level-k models. In QLk, agents have one of three levels, as in Lk. Each agent responds to its beliefs quantally, as in QRE. As in Lk, agents believe that the rest of the population has the next-lower type. The main difference between QLk and Lk is in the error structure. In Lk, higher-level agents believe that all lower-level agents best respond perfectly, although in fact every agent has some probability of making an error. In contrast, in QLk, agents are aware of the quantal nature of the lower-level agents’ responses, and have a (possibly incorrect) belief about the lower-level agents’ precision.

Quantal Level-k[1]

Quantal Cognitive Hierarchy (QCH)

QCH is a hybrid behavioral model that combines the best elements of the QLk and cognitive hierarchy models.
In the quantal cognitive hierarchy (QCH) model, agents of all levels:

  • Respond quantally (as in QLk).
  • Respond to the truncated, true distribution of lower levels (as in cognitive hierarchy).
  • Have the same precision λ.
  • Are aware of the true precision of lower levels.
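The four properties above translate into a short recursion. The sketch below assumes a symmetric two-player game, a discrete level distribution `level_probs` with positive mass on level 0, and hypothetical payoffs; none of the names or numbers come from the cited work.

```python
import math


def quantal_best_response(payoff, opponent_strategy, precision):
    """Logit response to the expected utilities against opponent_strategy."""
    eu = [sum(p * u for p, u in zip(opponent_strategy, row)) for row in payoff]
    top = max(eu)
    weights = [math.exp(precision * (u - top)) for u in eu]
    total = sum(weights)
    return [w / total for w in weights]


def qch(payoff, level_probs, precision):
    """Quantal cognitive hierarchy for a symmetric two-player game.

    level_probs[m] is the population share of level m (level_probs[0] > 0).
    Every level uses the same precision and knows the true lower-level play.
    """
    n = len(payoff)
    levels = [[1.0 / n] * n]  # level 0: uniform
    for m in range(1, len(level_probs)):
        # Truncated, renormalized true distribution over levels 0..m-1.
        z = sum(level_probs[:m])
        belief = [sum(level_probs[l] * levels[l][a] for l in range(m)) / z
                  for a in range(n)]
        levels.append(quantal_best_response(payoff, belief, precision))
    overall = [sum(p * s[a] for p, s in zip(level_probs, levels))
               for a in range(n)]
    return levels, overall


# Hypothetical symmetric chicken game: action 0 = Dare, action 1 = Chicken Out.
chicken = [[0, 7], [2, 6]]
qch_levels, qch_overall = qch(chicken, level_probs=[0.3, 0.4, 0.3], precision=1.0)
flat_levels, flat_overall = qch(chicken, level_probs=[0.3, 0.4, 0.3], precision=0.0)
print(qch_levels, qch_overall)
```

Compared with a plain cognitive hierarchy, the only change is that the strict best response is replaced by a quantal one, so higher levels hedge instead of jumping to a single action.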

The results from the previous assignment indicate that QCH outperforms both QLk and the Cognitive Hierarchy models.

Model Variations and Analysis


We use six large-scale, publicly available sets of human-subject experimental data, plus a combined dataset. Each observation of an action by an experimental subject is represented as a pair $(a_i, G)$, where $a_i$ is the action that the subject took when playing as player $i$ in game $G$. All games were two-player, so each single play of a game generated two observations. The datasets are:

  • SW94: 400 observations from [Stahl & Wilson 1994][13]
  • SW95: 576 observations from [Stahl & Wilson 1995][10]
  • CGCB98: 1296 observations from [Costa-Gomes et al. 1998][14]
  • GH01: 500 observations from [Goeree & Holt 2001][15]
  • CVH03: 2992 observations from [Cooper & Van Huyck 2003][16]
  • RPC09: 1210 observations from [Rogers et al. 2009][17]
  • ALL6: All 6974 observations

Subjects played 2-player normal form games once each. Each action by an individual player is a single observation.

To see how the original models (before modification) performed against each other, refer to Predicting Human Behavior in Normal-Form Games.

Model Variations

We investigate the properties of the QLk model by evaluating the predictive power of a family of systematic variations of this model. In other words, we seek to improve the predictive performance of the QLk and QCH models by varying their properties. In the end, we identify a simpler model that dominates QLk on our data.

Specifically, we construct a broad family of models by modifying the QLk model along four different axes.

  1. First, QLk assumes a maximum level of 2; we evaluated QLk with maximum levels of 3 to 7. Prediction performance increases by around 12% with a maximum level of 3 (other things being equal) and by up to 20% with a maximum level of 4. Beyond a maximum level of 4, there is no significant further increase in performance.
  2. Second, QLk assumes inhomogeneous precisions in that it allows each level to have a different precision; we varied this by also considering homogeneous precision models. Inhomogeneous precisions lead to a noticeable increase in performance, at the cost of a more complex model (i.e., more parameters).
  3. Third, QLk allows general precision beliefs that can differ from lower-level agents’ true precisions; we also constructed models that make the simplifying assumption that all agents have accurate precision beliefs about lower-level agents. Assuming accurate precision beliefs leads to about a 10% fall in performance across most of the model variants.
  4. Fourth, QLk uses level-k population beliefs, in which each agent believes that all other agents are exactly one level below its own; we also constructed models with cognitive hierarchy population beliefs, in which agents respond to the truncated distribution of all lower levels.

Figure 2 shows model variations with prediction performance on the combined dataset. The models with max level of ∗ used a Poisson distribution. Models are named according to precision beliefs, precision homogeneity, population beliefs, and type of level distribution. E.g., ah-QCH3 is the model with accurate precision beliefs, homogeneous precisions, cognitive hierarchy population beliefs, and a discrete distribution over levels 0–3.

Figure 2: Model Variations[1]

Evaluating the Model Variants

We evaluate the predictive performance of each model variant on the combined dataset using 10 rounds of 10-fold cross-validation. Specifically, for each round, each dataset was randomly divided into 10 parts. For each of the 10 ways of selecting 9 parts from the 10, the maximum likelihood estimate of the model’s parameters was computed based on those 9 parts, using the Nelder-Mead simplex algorithm.[18][19] Then the log likelihood of the remaining part was computed given the prediction. We call the average of this quantity across all 10 parts the cross-validated log likelihood.
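The procedure can be sketched end to end on a toy problem. Everything specific here is an assumption for illustration: the "model" is a one-parameter logit choice rule over fixed hypothetical utilities, the data are synthetic, and a grid search stands in for the Nelder-Mead simplex algorithm so that the sketch needs only the standard library.

```python
import math
import random


def choice_probs(utilities, precision):
    """Logit choice probabilities for fixed utilities."""
    top = max(utilities)
    weights = [math.exp(precision * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(observations, utilities, precision):
    """Sum of log probabilities of the observed action indices."""
    probs = choice_probs(utilities, precision)
    return sum(math.log(probs[a]) for a in observations)


def fit_precision(observations, utilities, grid):
    """Maximum likelihood over a grid (a stand-in for Nelder-Mead)."""
    return max(grid, key=lambda lam: log_likelihood(observations, utilities, lam))


def cross_validated_ll(observations, utilities, k=10, rounds=10, seed=0):
    """Average held-out log likelihood over k folds, averaged over rounds."""
    rng = random.Random(seed)
    grid = [i / 10 for i in range(51)]  # candidate precisions 0.0 .. 5.0
    round_scores = []
    for _ in range(rounds):
        shuffled = observations[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        fold_lls = []
        for i, test in enumerate(folds):
            # Train on the other k-1 folds, then score the held-out fold.
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            lam = fit_precision(train, utilities, grid)
            fold_lls.append(log_likelihood(test, utilities, lam))
        round_scores.append(sum(fold_lls) / k)
    return sum(round_scores) / rounds


utilities = [1.0, 2.0, 3.0]           # hypothetical utilities of three actions
data = [2] * 60 + [1] * 30 + [0] * 10  # synthetic observed action indices
score = cross_validated_ll(data, utilities)
print(score)
```

The key point is that the parameter is always fit on data that excludes the fold being scored, so the reported log likelihood measures prediction rather than fit.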

Simplicity Versus Predictive Performance

Figure 3: Model simplicity vs. prediction performance.[1]

Other things being equal, a model with higher performance is more desirable, as is a model with fewer parameters. We can plot an “efficient frontier” of those models that achieved the best performance for a given number of parameters or fewer; see Figure 3. The original QLk model (gi-QLk2) is not efficient in this sense; it is dominated by ah-QCH3[1], which has significantly better predictive performance and fewer parameters (because it restricts agents to homogeneous precisions and accurate beliefs). We could argue that the flexibility added by inhomogeneous precisions and general precision beliefs is less important than the number of levels and the choice of population belief. On the other hand, the poor performance of the Poisson variants relative to ah-QCH3 may indicate that flexibility in describing the level distribution is more important than the total number of levels modeled.
There is a noticeable pattern in the models along the efficient frontier: this set consists exclusively of models with accurate precision beliefs, homogeneous precisions, and cognitive hierarchy beliefs. This suggests that the most parsimonious way to model human behavior in normal-form games is to use a model of this form, with the trade-off between simplicity (i.e., number of parameters) and predictive power determined solely by the number of levels modeled. For the combined dataset, adding levels yielded small increases in predictive power until level 5, after which it yielded no further statistically significant improvements.[1]
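The "efficient frontier" idea is easy to state in code: keep a model only if it scores better than every model with the same or fewer parameters. The model names and scores below are hypothetical placeholders, not results from Figure 3.

```python
def efficient_frontier(models):
    """models: list of (name, n_params, score), where a higher score is better.

    Returns the names of models that beat every model with the same or
    fewer parameters.
    """
    frontier = []
    best_so_far = float("-inf")
    # Visit models from fewest to most parameters; within a tie, best score first.
    for name, n_params, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_so_far:
            frontier.append(name)
            best_so_far = score
    return frontier


# Hypothetical (name, number of parameters, cross-validated log likelihood).
candidates = [
    ("model_a", 1, -100.0),
    ("model_b", 3, -90.0),
    ("model_c", 4, -95.0),   # dominated by model_b: more parameters, worse score
    ("model_d", 5, -85.0),
    ("model_e", 5, -88.0),   # dominated by model_d: same parameters, worse score
]
frontier = efficient_frontier(candidates)
print(frontier)
```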

Spike-Poisson Variant

A Poisson distribution might be better able to fit our data if the proportion of level-0 agents were specified separately. The intuition behind the Spike-Poisson variant is to allow bimodal level distributions instead of unimodal ones (by specifying level-0 agents separately) while still offering the advantage of representing higher-level agents without needing a separate parameter for each level.[1] In this section, we evaluate an ah-QCH model that uses just such a distribution: a mixture of a deterministic distribution of level-0 agents and a standard Poisson distribution. We refer to this mixture as a “spike-Poisson” distribution. We define the full Spike-Poisson QCH model here.

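Spike-Poisson Quantal Cognitive Hierarchy (QCH) model[1]. Let $\pi^{SP}_{i,m} \in \Pi(A_i)$ be the distribution over actions predicted for an agent $i$ with level $m$ by the Spike-Poisson QCH model. Let

$$f(m) = \begin{cases} \epsilon + (1 - \epsilon)\,\mathrm{Poisson}(m; \tau) & \text{if } m = 0, \\ (1 - \epsilon)\,\mathrm{Poisson}(m; \tau) & \text{otherwise.} \end{cases}$$

Let $QBR^G_i(s_{-i}; \lambda)$ denote $i$’s quantal best response in game $G$ to the strategy profile $s_{-i}$ (of the $n-1$ other agents), given precision parameter λ. Let

$$\pi^{SP}_{i,0:m} = \sum_{\ell=0}^{m} \frac{f(\ell)}{\sum_{\ell'=0}^{m} f(\ell')}\, \pi^{SP}_{i,\ell}$$

be the “truncated” distribution over actions predicted for an agent conditional on that agent’s having level $0 \leq \ell \leq m$. Then $\pi^{SP}$ is defined as

$$\pi^{SP}_{i,0}(a_i) = |A_i|^{-1}, \qquad \pi^{SP}_{i,m}(a_i) = QBR^G_i\!\left(\pi^{SP}_{-i,0:m-1}; \lambda\right)(a_i) \quad \text{for } m \geq 1.$$

The overall predicted distribution of actions is a weighted sum of the distributions for each level:

$$\Pr(a_i \mid G, \tau, \epsilon, \lambda) = \sum_{\ell=0}^{\infty} f(\ell)\, \pi^{SP}_{i,\ell}(a_i).$$

The model thus has three parameters: the mean of the Poisson distribution τ, the spike probability ε, and the precision λ.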
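The level distribution itself, a point mass on level 0 mixed with a Poisson, can be sketched independently of any game. The function names and the parameter values below are assumptions for illustration; `spike` plays the role of the spike probability and `tau` the Poisson mean.

```python
import math


def poisson_pmf(m, tau):
    """Probability that a Poisson(tau) variable equals m."""
    return math.exp(-tau) * tau ** m / math.factorial(m)


def spike_poisson_pmf(m, tau, spike):
    """Extra mass `spike` on level 0; the remaining mass follows Poisson(tau)."""
    base = (1.0 - spike) * poisson_pmf(m, tau)
    return spike + base if m == 0 else base


def truncated_weights(m, tau, spike):
    """Normalized weights over levels 0..m, as used in a level-(m+1) agent's beliefs."""
    raw = [spike_poisson_pmf(level, tau, spike) for level in range(m + 1)]
    total = sum(raw)
    return [r / total for r in raw]


tau, spike = 2.0, 0.3  # hypothetical parameter values
total_mass = sum(spike_poisson_pmf(m, tau, spike) for m in range(61))
beliefs_of_level_3 = truncated_weights(2, tau, spike)
print(total_mass, beliefs_of_level_3)
```

With a positive spike and a Poisson mean above 1, the distribution has two modes (one at level 0, one near the Poisson mean), which a plain Poisson cannot represent.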

Figure 4: ah-QCH vs QLk vs ah-QCH-sp.[1]

Figure 4 compares the performance of the spike-Poisson variant (ah-QCH-sp) to the ah-QCH model variants discussed above. The three-parameter ah-QCH-sp model outperformed every model except for ah-QCH5. In particular, it outperformed both ah-QCH3 and ah-QCH4, despite having fewer parameters than either of them.
The modeling of very high-level agents (level-5 and higher) seems to determine the performance differences between ah-QCH-sp and the other ah-QCH models. Hence ah-QCH-sp, which includes level-5 agents, outperforms models that do not; but precisely tuning the proportions of levels 5 and below is more important for prediction performance than including levels 6 and above. This is surprising, as agents higher than level-3 are not generally believed to exist in any substantial numbers. For instance, Arad and Rubinstein (2012)[20] found no evidence for beliefs of fourth order or higher. One possible explanation for the influence of very high-level responses is that they may be correcting for our overly simple specification of level-0 behavior as uniform randomization. An interesting direction for future work would be to investigate richer specifications of level-0 behavior. Overall, the results recommend the use of the Spike-Poisson QCH variant for predicting human play in unrepeated normal-form games. It was the best-performing model variant except for ah-QCH5, and that model required twice as many parameters to achieve slightly better predictive performance.

How would the models perform on Unseen Games?[1]

Figure 5: Performance on seen vs unseen data[1]

In the performance comparisons made thus far, we have used cross-validation at the level of individual observations. This means that, with very high probability, every game in every testing fold also appeared in the corresponding training set, so we never evaluated a model’s predictions on an entirely unseen game. This raises the question of how well the models predict behavior in completely unseen games.
To address this, we performed an alternative analysis comparing the original models to some of the variants under a modified cross-validation procedure.[1] In this procedure, we divide our combined dataset into folds containing equal numbers of games, with all of the data points for a given game belonging to a single fold. Hence, we evaluate each model entirely on games that were absent from the training set. We report the average of 10 splits into folds to reduce variance in our estimates. Figure 5 shows the performance of the QRE, Poisson-CH, Lk, QLk, ah-QCHp, ah-QCH-sp, and ah-QCH5 models on the combined dataset under both cross-validation procedures. Overall, performance was virtually identical under the two procedures, suggesting that the models we studied generalize well to unseen games. For the efficient-frontier models we observed small but statistically significant degradations in performance on unseen games; the other models had indistinguishable performance[1].
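The difference from ordinary cross-validation lies only in how folds are built: games, not observations, are the unit that gets shuffled and assigned. A minimal sketch, with hypothetical game identifiers and a helper name introduced here:

```python
import random
from collections import defaultdict


def game_level_folds(observations, k, seed=0):
    """Split (game_id, action) observations into k folds by game.

    All observations of a given game land in the same fold, so a model
    tested on a fold has never seen any of that fold's games in training.
    """
    by_game = defaultdict(list)
    for game_id, action in observations:
        by_game[game_id].append((game_id, action))
    games = sorted(by_game)
    random.Random(seed).shuffle(games)
    folds = [[] for _ in range(k)]
    # Deal games round-robin so folds contain (nearly) equal numbers of games.
    for index, game_id in enumerate(games):
        folds[index % k].extend(by_game[game_id])
    return folds


# Synthetic data: 20 hypothetical games with three observations each.
observations = [(game, action) for game in range(20) for action in (0, 1, 1)]
folds = game_level_folds(observations, k=5)
games_per_fold = [{game for game, _ in fold} for fold in folds]
print([len(fold) for fold in folds])
```

Repeating the split with different seeds and averaging, as the text describes, reduces the variance that comes from any one assignment of games to folds.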

Figure 7: Does Payoff Scaling Matter?[21]

Does Payoff Scaling Matter?[21]

The datasets do not all express payoffs on the same scale, which raises the question of whether the payoffs in the different games are in appropriate units. Unlike the level-k and Poisson-CH models, both QRE and quantal level-k depend on the units used to represent payoffs in a game. Within a single setting this is not a concern, because the precision parameter can scale a game to appropriately sized units. However, when data is combined from multiple studies in which payoffs are expressed on different scales (as is the case with the combined dataset), a single precision parameter may be insufficient to compensate for QRE’s scale dependence.[21]
To investigate this issue, we test two hypotheses. The first is that subjects were concerned only with the relative scale of payoff differences within individual games. To test it, we constructed a model (NQRE)[21] that normalizes a game’s payoffs to lie in the interval [0, 1], and then predicts based on the QRE of the normalized game. The second is that subjects were concerned only with the expected monetary value of their payoffs. To test it, we constructed a model (CNQRE)[21] that normalizes payoffs so that they are denominated in expected cents. Figure 7 reports the likelihood ratio between the modified QRE models and QRE. Both NQRE and CNQRE performed worse than the original unnormalized QRE on every disaggregated dataset except for SW94 and SW95, where the improvements were very small (although significant). We conclude that subjects responded to the raw payoff numbers, not to the actual values behind those payoff numbers, and not solely to the relative size of the payoff differences. This is plausible given the “money illusion” effect (Shafir, Diamond, and Tversky 1997)[22], in which people focus on nominal rather than real monetary values. However, on the aggregated ALL6 dataset, the situation was quite different, with NQRE performing well and CNQRE performing very poorly. This suggests that normalization can yield a better-performing QRE estimate for aggregated experimental data, but that expected monetary value is not a helpful normalization to use[21].
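The scale dependence of quantal response, and how a [0, 1] normalization removes it, can be seen in a small sketch. This is a simplification: it works on a single vector of hypothetical utilities rather than on a full game, and the function names are introduced here for illustration only.

```python
import math


def logit_response(utilities, precision):
    """Logit choice probabilities for the given utilities."""
    top = max(utilities)
    weights = [math.exp(precision * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def normalize_payoffs(utilities):
    """Rescale payoffs to lie in [0, 1] (the idea behind NQRE)."""
    low, high = min(utilities), max(utilities)
    if high == low:
        return [0.0 for _ in utilities]
    return [(u - low) / (high - low) for u in utilities]


def max_gap(p, q):
    """Largest absolute difference between two distributions."""
    return max(abs(a - b) for a, b in zip(p, q))


dollars = [1.0, 2.0, 4.0]             # hypothetical payoffs in dollars
cents = [100.0 * u for u in dollars]  # the same payoffs expressed in cents

# Same precision, different units: the predictions differ.
same_precision_gap = max_gap(logit_response(dollars, 1.0),
                             logit_response(cents, 1.0))
# Rescaling the precision by the unit factor restores the prediction.
rescaled_gap = max_gap(logit_response(dollars, 1.0),
                       logit_response(cents, 0.01))
# After normalization, the units no longer matter.
normalized_gap = max_gap(logit_response(normalize_payoffs(dollars), 1.0),
                         logit_response(normalize_payoffs(cents), 1.0))
print(same_precision_gap, rescaled_gap, normalized_gap)
```

Within one study a single precision can absorb the unit factor, which is why the issue only arises when datasets with different payoff scales are pooled under one precision parameter.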

Future Directions

So far we have only looked at datasets consisting of simultaneous-move games. We still don’t know how these behavioral models would perform on richer game types. These include sequential games (like chess, where agents choose actions sequentially rather than simultaneously) and games that involve repeated interactions or learning. Investigating richer specifications of level-0 behavior would also be an interesting direction.

Annotated Bibliography

  1. Predicting Human Behavior in Unrepeated, Simultaneous-Move Games. James R. Wright and Kevin Leyton-Brown. [arXiv preprint] Games and Economic Behavior, Revisions requested.
  2. Predicting Human Behavior in Normal-Form Games wiki page from Course:CPSC522/Predicting_Human_Behavior_in_Normal-Form_Games
  3. Game Theory CPSC522 wiki from
  4. Camerer, C., Nunnari, S., and Palfrey, T. R. (2011). Quantal response and nonequilibrium beliefs explain overbidding in maximum-value auctions. Working paper, California Institute of Technology.
  5. Crawford, V. and Iriberri, N. (2007b). Level-k auctions: Can a nonequilibrium model of strategic thinking explain the winner’s curse and overbidding in private-value auctions? Econometrica, 75(6):1721–1770.
  6. Camerer, C.; Ho, T.; and Chong, J. 2004. A cognitive hierarchy model of games. QJE 119(3):861–898.
  7. Camerer, C., Ho, T., and Chong, J. (2004). A cognitive hierarchy model of games. Quarterly Journal of Economics, 119(3):861–898.
  8. Morgan, J. and Sefton, M. (2002). An experimental investigation of unprofitable games. Games and Economic Behavior, 40(1):123–146.
  9. Hahn, P. R., Lum, K., and Mela, C. (2010). A semiparametric model for assessing cognitive hierarchy theories of beauty contest games. Working paper, Duke University.
  10. Stahl, D., and Wilson, P. 1995. On players’ models of other players: Theory and experimental evidence. GEB 10(1):218–254.
  11. McKelvey, R., and Palfrey, T. 1995. Quantal response equilibria for normal form games. GEB 10(1):6–38.
  12. Costa-Gomes, M.; Crawford, V.; and Broseta, B. 2001. Cognition and behavior in normal-form games: An experimental study. Econometrica 69(5):1193–1235.
  13. Stahl, D., and Wilson, P. 1994. Experimental evidence on players’ models of other players. JEBO 25(3):309–327.
  14. Costa-Gomes, M.; Crawford, V.; and Broseta, B. 1998. Cognition and behavior in normal form games: an experimental study. Discussion paper 98-22, UCSD
  15. Goeree, J. K., and Holt, C. A. 2001. Ten little treasures of game theory and ten intuitive contradictions. AER 91(5):1402–1422.
  16. Cooper, D., and Van Huyck, J. 2003. Evidence on the equivalence of the strategic and extensive form representation of games. JET 110(2):290–308.
  17. Rogers, B. W.; Palfrey, T. R.; and Camerer, C. F. 2009. Heterogeneous quantal response equilibrium and cognitive hierarchies. JET 144(4):1440–1467.
  18. Nelder, J. A., and Mead, R. 1965. A simplex method for function minimization. Computer Journal 7(4):308–313.
  19. Nelder Mead Simplex Algorithm Code in python from
  20. Arad, A. and Rubinstein, A. (2012). The 11-20 money request game: A level-k reasoning study. American Economic Review, 102(7):3561–3573.
  21. Beyond Equilibrium: Predicting Human Behavior in Normal Form Games. J. Wright, K. Leyton-Brown. Conference of the Association for the Advancement of Artificial Intelligence (AAAI-10), 2010., from
  22. Shafir, E.; Diamond, P.; and Tversky, A. 1997. Money


Some rights reserved
Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0