In the final year of my degree in Computer Games Programming (BSc Hons) at the University of Derby I undertook an independent dissertation project under the supervision of Dr. Tommy Thompson. Here I’ll attempt a brief summary of the project (bearing in mind the paper was approaching one hundred pages long).
The full paper resulting from my research, titled ‘Investigating Approaches to AI for Trust-based, Multi-agent Board Games with Imperfect Information: With Don Eskridge’s “The Resistance”‘ can be found published in the university’s online journal, here, and the associated Python scripts can be found in my branch of the GitHub project, here.
This work dealt with the challenge of developing game-playing artificial intelligence for modern board games, specifically Don Eskridge’s The Resistance – a hidden identities game for five-to-ten players. Modern board games are a loose category of board and card games developed in relatively recent years, which usually differ from traditional board games by supporting more players, asymmetrical mechanics, and challenges such as non-determinism, imperfect information, or variable initial setup.
While the primary objective of this project was to help direct and guide future development (by providing a general introduction to important concepts in the game, related games, and AI techniques, and by examining the current state of work on The Resistance), some new tools and bots were also created to aid in analysing bots’ performance. This was built upon a foundation of Python scripts resulting from Dagstuhl Seminar 12191, “Artificial and Computational Intelligence in Games” (3.8 “AI for Modern Board Games”), further developed by AiGameDev.com. A number of bots developed by other individuals for a competition hosted by AiGameDev.com were also used in this research.
The full paper includes analysis of the AI techniques used by many of these bots, as well as a breakdown of the rules for the 5-player games focussed on, and a few preliminary considerations from the game theoretic perspective. Here these are omitted for brevity.
Burst Competition Model
The first of the tools created for this project was a new competition model based upon suggestions from Alex Champandard of AiGameDev.com.
The original competition model pits permutations of five players from the given set against one another, presenting statistics on their performance after a large number of games. The bots and number of games are entered as command line arguments. By selecting the right number of games you can ensure that a complete, comprehensive set of possible permutations of games is played several times, ensuring that the statistics given are representative of the general quality of the competitors (or at least of their capability of dealing with those opponents). However, such a model allows bots which conduct individual opponent modelling to build an understanding of other bots’ behaviours over a large number of games (thousands!) – a tactic which is unlikely to work well against human opponents, who will not participate in enough games for such a model to be effective.
The burst competition model I created modifies the original competition model by assigning pseudonyms to bots. After a given number of games are played (specified as a command line argument) these pseudonyms are regenerated (guaranteed unique), so that opponent modelling bots can no longer recognise them. In this way we are still able to run a large number of games to ensure statistical significance, but the inability to model individual opponents as effectively should encourage the development of rapidly adapting behaviours or strong, generalized rule-sets. I believe that the results should therefore give a stronger indication of which bots are most suited for human opponents.
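The core of the idea can be sketched in a few lines. This is not the actual competition code; it’s a minimal illustration (all names hypothetical) of assigning each bot a fresh, never-reused alias every `burst` games:

```python
import random
import string

class PseudonymPool:
    """Sketch of the burst model: each bot plays under an alias that is
    regenerated (guaranteed unique) every `burst` games, so opponent
    modelling bots can no longer recognise it."""

    def __init__(self, bot_names, burst):
        self.bot_names = bot_names
        self.burst = burst
        self.games_played = 0
        self.used = set()      # every alias ever issued, to guarantee uniqueness
        self.aliases = {}
        self._scramble()

    def _scramble(self):
        # Draw a fresh alias for each bot, skipping any alias used before.
        for name in self.bot_names:
            while True:
                alias = "".join(random.choices(string.ascii_uppercase, k=8))
                if alias not in self.used:
                    self.used.add(alias)
                    self.aliases[name] = alias
                    break

    def alias_of(self, name):
        return self.aliases[name]

    def record_game(self):
        self.games_played += 1
        if self.games_played % self.burst == 0:
            self._scramble()
```

Because old aliases stay in the `used` set, a regenerated alias can never collide with a previous one, so a modelling bot’s accumulated statistics for an alias become dead weight after each scramble.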
The second analysis tool created was based upon a behaviour I found in a bot named Trusty. Like a number of other bots, Trusty maintains suspicion scores for each opponent over the course of the missions in each individual game, and also conducts opponent modelling – collecting behavioural statistics for each player over the course of a competition. However, Trusty also tracks how accurate its suspicion scores are when identities are revealed at the end of each game, and uses these suspicion accuracy scores to adjust how much suspicion scores in future games are based upon the opponent modelling. This would allow Trusty to, for example, ignore opponent modelling data for pure random players after a number of incorrect assessments.
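The weighting mechanism can be sketched as follows. This is my own hedged reconstruction of the idea, not Trusty’s actual code; the function names and the linear blend are assumptions:

```python
def om_weight(accuracy_history):
    """How much to rely on the opponent model for one opponent, based on
    how accurate past end-of-game suspicion assessments were (scores in
    [0, 1], where 1 is a perfectly accurate assessment)."""
    if not accuracy_history:
        return 1.0  # no evidence yet: trust the model fully
    return sum(accuracy_history) / len(accuracy_history)

def combined_suspicion(in_game_score, om_score, accuracy_history):
    # Blend in-game deduction with the opponent model; an opponent whose
    # model has repeatedly proved wrong (e.g. a pure random player)
    # contributes almost nothing via the model term.
    w = om_weight(accuracy_history)
    return (1 - w) * in_game_score + w * om_score
```

With a history of all-zero accuracy (the random-player case), `combined_suspicion` falls back entirely to the in-game score, which matches the behaviour described above.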
I created a simple API through which any bot which maintains suspicion scores in some form can record suspicion inaccuracy scores for each opponent, in each game, over the course of a competition. Each mission in each game is recorded separately. This allows us to examine the effectiveness of techniques by observing the way that suspicion inaccuracies change over the course of each game’s missions and, for opponent modelling bots, the way that they change over the course of a competition. This would also, in theory, serve to show the impact of the burst competition model on opponent modelling bots.
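In spirit, the API amounts to something like the following sketch (names and the exact inaccuracy measure are assumptions; the paper describes the real definitions):

```python
class InaccuracyLog:
    """Sketch of the logging API: a bot that maintains suspicion scores
    records, per opponent, per mission, per game, how far its scores were
    from the truth once roles are revealed."""

    def __init__(self):
        self.records = []  # (game, mission, opponent, inaccuracy)

    def record(self, game, mission, suspicions, true_spies):
        # suspicions: {opponent: score in [0, 1]}; true_spies: set of names.
        # Inaccuracy is the absolute distance from the revealed truth.
        for opponent, score in suspicions.items():
            truth = 1.0 if opponent in true_spies else 0.0
            self.records.append((game, mission, opponent, abs(score - truth)))

    def mean_inaccuracy(self, game, mission):
        vals = [i for g, m, _, i in self.records
                if g == game and m == mission]
        return sum(vals) / len(vals)
```

Recording each mission separately is what later makes it possible to plot how inaccuracy falls (or fails to fall) across a single game, as well as across a whole competition.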
Because browsing over 12,000 games’ worth of numerical output is probably the most tedious, least intuitive, and least efficient way to examine these logs, I created another module which can read in these logs and generate graphs, using the matplotlib library. This module is fairly flexible, allowing for command line arguments to set the number of games to average over for each point, groups of bots to average over, whether to display spy and resistance plots together, or separately, etc. This flexibility turned out to be vital in understanding the results of further experimentation. Averaging over numerous games was absolutely essential due to statistical noise resulting from non-determinism in some bots’ logic, the random order in which permutations are played coupled with variable initial setup, and the instability of opponent models. Graphs produced resemble the following:
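The essential step in that module is chunked averaging before plotting. A minimal sketch (my own simplification; the real module takes its parameters from the command line):

```python
def average_in_chunks(values, chunk):
    """Average consecutive `chunk`-sized groups of per-game inaccuracies,
    smoothing out statistical noise from non-determinism, permutation
    order, and opponent model instability."""
    return [sum(values[i:i + chunk]) / len(values[i:i + chunk])
            for i in range(0, len(values), chunk)]

def plot_inaccuracies(values, chunk=100):
    # matplotlib is imported lazily so the averaging above remains usable
    # without it installed.
    import matplotlib.pyplot as plt
    plt.plot(average_in_chunks(values, chunk))
    plt.xlabel("game group")
    plt.ylabel("mean suspicion inaccuracy")
    plt.show()
```

A trailing partial chunk is averaged over its actual length rather than dropped, so short competitions still produce a final data point.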
Bots were created with two main goals in mind:
Firstly, I wanted to investigate the effect of so-called ‘expert rules’ on bot performance. These are hard-coded conditions found in many bots – certain circumstances under which a simple decision can be made, often without any reference to suspicion scores, past plays, etc. For example, in a five-player game, a resistance member should not approve any team of three not featuring himself, as it clearly contains a spy. Similarly, a spy player might decide never to approve a team with no spies on it. For this purpose, and to get used to the bot API, I wrote Ruru, a simple bot which makes all of its decisions based only upon such rules. A list of the rules used can be found in the full paper.
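The two rules mentioned above translate almost directly into code. This is an illustrative sketch, not Ruru’s actual implementation (function names are mine):

```python
def resistance_vote(me, team):
    """Expert rule for a five-player resistance member: reject any
    three-man team not featuring yourself, since with only three
    resistance members such a team must contain a spy."""
    if len(team) == 3 and me not in team:
        return False  # certainly contains a spy
    return True  # no rule fires: approve (or defer to other logic)

def spy_vote(me, team, fellow_spies):
    # Expert rule for a spy: never approve a team with no spies on it,
    # since it cannot be sabotaged.
    spies = fellow_spies | {me}
    if not any(player in spies for player in team):
        return False
    return True
```

Rules like these are cheap, deterministic, and need no suspicion scores at all, which is precisely what made their contribution worth isolating in a rules-only bot.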
Secondly, I wanted to investigate the effect of opponent modelling techniques on bot performance. A number of the more successful existing bots (Clymily, Sceptic, Magi…) use opponent modelling, but each also employs expert rules, configuration tracking, and deductive logic specific to The Resistance (in fact, specific to the simplest, five-player variant), so it’s unclear how much OM contributes to their strength.
In the end it was easier to build my own OM bot from the ground up than to try and strip all of the non-OM components from the unfamiliar, complex code of the existing implementations. I therefore created Stalker, a bot capable of reporting individual suspicion scores, per opponent, based only on how well they fit models built from their previous behaviour as a spy or resistance member. The problem of acting upon reported suspicion scores is isolated somewhat from that of calculating them. I make no attempt to investigate the latter – Stalker makes decisions based upon the same code as Ruru – but the OM in place is enough to examine via suspicion inaccuracy tracking. On top of Stalker I implemented another OM bot, Nosey, which maintains only one spy and one resistance opponent model, and fits all players against them. This was explored as a possible method of speeding up adaptation to opponent strategies. Details of how models are maintained and fitted against can be found in the full paper.
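The model-and-fit loop at the heart of Stalker can be sketched roughly as below. This is a deliberately simplified reconstruction (the real behaviours, priors, and fitting method are in the full paper):

```python
class OpponentModel:
    """Sketch of Stalker-style opponent modelling: record the frequency of
    each tracked behaviour under each role across past games, then score
    how much a player's current-game behaviour fits the spy model better
    than the resistance model."""

    def __init__(self, behaviours):
        # counts[role][behaviour] = [occurrences, opportunities]
        self.counts = {role: {b: [0, 0] for b in behaviours}
                       for role in ("spy", "resistance")}

    def observe(self, role, behaviour, occurred):
        pair = self.counts[role][behaviour]
        pair[0] += int(occurred)
        pair[1] += 1

    def frequency(self, role, behaviour):
        occ, opp = self.counts[role][behaviour]
        return occ / opp if opp else 0.5  # no data: uninformative prior

    def suspicion(self, observed):
        # observed: {behaviour: frequency seen this game}. Positive means
        # a closer fit to the spy model than to the resistance model.
        score = 0.0
        for b, f in observed.items():
            spy_fit = 1 - abs(f - self.frequency("spy", b))
            res_fit = 1 - abs(f - self.frequency("resistance", b))
            score += spy_fit - res_fit
        return score / len(observed)
```

Nosey differs only in that a single pair of spy/resistance models is shared across all opponents, so every observation updates the same counts; the fitting step is unchanged.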
Stalker was implemented in a modular fashion so that it was trivial to test a number of different configurations (that is, groups of behaviours modelled and evaluated against). During investigation of behaviours tracked by existing bots and the implementation of this, several distinct categories of behaviour became apparent, which were a key consideration in the selection of configurations for testing, and the analysis of my results. These are:
- Category 1, Perfect Information Behaviours: Often related to expert rules; behaviours we can fully observe and model with certainty, where the subject has full knowledge of the parameters considered, and full control over the outcome of his action. Eg. The frequency with which an agent votes positively for three-man teams not including itself.
- Category 2, Competence Based (“Sketchy”) Behaviours: Where, if it is a resistance member, the subject must make a decision without knowing the relevant parameters. This is often related to the success or failure of missions. Eg. The frequency with which an agent votes for a team which then fails a mission. Fitting to these behaviours may be unproductive at first, but the high frequency of occurrence in the resistance model should counteract over-suspicion of poor players.
- Category 3, Partially Observable Action Behaviours: Where we cannot fully observe the behaviour during the course of a game, but can, in retrospect, once roles are revealed. Specifically, this refers to the act of sabotaging or not sabotaging a mission. While it is easy to construct accurate models of this behaviour, comparison to behaviour in the current game is complicated by uncertainty.
- Category 4, Twice Partially Observable Action Behaviours: This describes the case where there are two spies and a resistance member on a mission, and one spy sabotages. Unless we are one of those spies, we cannot, even in retrospect with roles revealed, know for certain which spy sabotaged the mission.
Further discussion of these categories and specific behaviours tracked by Stalker and Nosey can be found in the full paper. Most behaviours investigated fall into categories 1 and 2, with one contentious category 2 behaviour included as an attempt to compensate for neglecting categories 3 and 4: teamOnIsUnsuccessful tracks the frequency with which teams featuring the subject are unsuccessful. The idea here is that the OM agent may “learn” the correct amount of suspicion to allocate for presence on failed missions. Overall, I expected category 1 behaviours to be most powerful, while category 2 and teamOnIsUnsuccessful could each go either way.
Most of the experiments conducted involved running Ruru, Stalker, Nosey, or one of the existing opponent modelling agents against two opponent groupings: Beginners, consisting of four of the simplest existing agents (Jammer, Deciever, Hippie, and Paranoid), and Advanced, consisting of four of the strongest competitors from the Vienna competition (PandSBot, GarboA, Rebounder, Bot5Players), deliberately avoiding OM agents.
My expert rules bot, Ruru, was assessed simply in terms of win percentage, compared to that of a much simpler, rule-based bot (RuleFollower) and a simple bot which combines expert rules and some logical deduction (LogicalBot). Against the Beginners grouping, Ruru was outperformed by a small margin, with LogicalBot coming out on top, likely due to logical deduction enabling better resistance play. Against the Advanced grouping, all three bots were severely outclassed. Interestingly, however, Ruru was able to outperform RuleFollower by making up for poor resistance play with spy play significantly stronger than even LogicalBot’s. As a spy, Ruru was selected and approved for a far greater number of missions, likely due to expert rules designed to avoid suspicion.
Three existing OM agents (Clymily, ScepticBot, and Magi) were modified to log suspicion inaccuracies and tested against both opponent groupings, as were my own OM agents, Stalker and Nosey. For Stalker, a number of configurations were tested, and these informed the configurations later used in Nosey’s experiments. Results from Magi’s and ScepticBot’s experiments were disappointing, showing no clear impact of opponent modelling, but Clymily’s suspicion inaccuracy graphs showed definite signs of improvement over the course of the 12,000 games played, with a curve apparent near the start:
A further test with Clymily informed by the opponent model of a previous experiment produced a graph which sits roughly level with the final inaccuracies of the graph above, providing further evidence of the opponent model’s refinement over the course of the original experiment:
Experiments with my own bots showed pure opponent modelling to be completely ineffective against the Beginners grouping, but the strong results of Clymily against this grouping suggest that this is due either to my implementation or to the configurations of tracked behaviours tested. Tests against the Advanced grouping, however, give a strong indication of opponent modelling’s effectiveness in identifying spies without the support of other techniques.
What’s interesting, though, is the change in results between configurations: despite my near-certainty of strong results from category 1, modelling based only on these perfect information behaviours falls almost as flat on its face as any configuration did against the Beginners grouping. It is surprising to find that the category 2, competence based behaviours indicate the strongest performance when we regard graphs of spy and resistance inaccuracies averaged together, and under the same conditions the second most effective configuration appears to be the one using only teamOnIsUnsuccessful. Regarding the same results with spy and resistance inaccuracies graphed separately, however, it is clear that this configuration’s low spy inaccuracies are hindered in the early missions by dangerously high resistance inaccuracies.
Nosey’s performance against Advanced and Beginners groupings was indistinguishable from that of Stalker, which seems to support the idea of grouping opponents into one model, though further experimentation is needed.
Finally, experiments were conducted which assessed the impact of the burst competition model on Clymily and on the results of the Vienna competition. The graph of suspicion inaccuracies for Clymily (below) was averaged such that the peak in inaccuracies after each name scramble can clearly be seen. The impact on bots’ performance, however, was largely unpredictable and inconclusive, probably due to the influence of other techniques.
For a full analysis of these results, including considerations of the noise found in suspicion inaccuracy tracking, please refer to the full paper.
From research of previous works, experiments conducted, and analysis of results carried out, much of which is omitted here, the paper draws a number of conclusions, mostly focussed on the subject of assessing bots’ performance rather than identifying specific, effective AI techniques.
My investigation into expert rules was not thorough, but the results suggest that while such rules can be effective against simple opponents, the presence of certain rules may interfere with an agent’s ability to exploit them. For example, some rules which pass up opportunities to sabotage in order to avoid suspicion are detrimental when the opponents faced do not pay attention to mission results. However, Ruru‘s results indicate that such rules are effective against more advanced opponents. Competitions against simple opponents should not, therefore, be considered an effective method of assessing the value of expert rules, nor competence of an agent in general.
Implementation considerations and experimental results demonstrate clearly the challenge of allocating trust or suspicion amongst agents whose own knowledge of the game state may not be complete, and whose actions are not fully observable. This is one of the biggest challenges in The Resistance. While opponent modelling of such “competence based” behaviours seems effective at a glance, further analysis of the suspicion inaccuracy logs reveals a tendency to allocate too much suspicion to resistance members. Further investigation into the effectiveness of individual opponent modelling behaviours is necessary, as well as how to respond to suspicion scores, and the idea of using some behaviours to exploit opponents, rather than contribute to suspicion scores. For example, identification of bots which do not conduct any logical deduction or suspicion allocation seems plausible, and would allow less cautious, more effective play.
I believe that examination of suspicion inaccuracy graphs allows for interesting insights into agent performance, though interpretation is limited by one’s own understanding of the game, and by the ever-present noise. While I do not think the presence of noise invalidates analyses of graphs averaged over large intervals, there is much room for further investigation into the interpretation of these graphs. Especially considering the noise, it may be useful to consider the mean and standard deviation of each group of games averaged over. Future work should also seek to determine whether this noise is uniquely a product of opponent model instability, or simply an inherent component of the game. Development of methods which reduce it may lead to better overall performance.
Finally, the burst competition model has not proved especially useful in this work; the impact on Clymily and the Vienna competition results is unclear. It may simply be that the usefulness of OM is not a strong factor in the performance of existing agents. If so, then burst competitions may yet prove useful in future work, possibly in assessing agents’ capacity for rapid adaptation.