Sample Envy
An Empirical Bayes approach to Kill Team event statistics.
November 25, 2024
This is the first article in a three-part series on Empirical Bayes inference for Kill Team statistics.
There was a time when Games Workshop would share Kill Team win rate data in their Metawatch articles.
That era ended quickly, no doubt for many reasons.
Right after the stats went inconspicuously missing, I remember the lead rules developer Eliot saying something along the lines of:
Kill Team has a variety of factions to choose from, but unfortunately, its player base is a bunch of sweaty meta chasers ruining everything.
At least that’s what I remember hearing. It’s also possible he said Kill Team has a lot of variability between faction sample data.
Tomato, tomahto.
The biggest complaint I hear from people in the community over the stats is that there are too few samples. This isn’t a baseless concern. If you compare Kill Team to its big brother, Warhammer 40k, the numbers aren’t even close. 40k cranks out around 5 to 7 times the number of games from GT+ events that Kill Team does. Even with its 26+ factions, there are plenty of games to go around for everyone.
Bighammer is big; water is wet. But suppose we set aside our sample envy for a moment and face this issue objectively:
- How many samples do we really need to do serious analysis?
- Does every faction in our dataset need a lot of samples?
- How do we properly handle our uncertainty?
For the rest of this article, along with the next two, I'm going to attempt to answer these questions statistically.
Not a Professional, Just Pretentious
I’ll be honest with you all and make it clear that I’m no statistician. But while doing some research on this topic (my life is incredibly exciting), I stumbled upon an excellent and very approachable tutorial on Empirical Bayes inference by a data scientist named David Robinson. He wrote a series of blog posts on the subject, and even converted that effort into a book.
Bayesian statistics approaches data and hypotheses differently than traditional statistical methods (often called frequentist methods). I’m probably not the best source on the internet to explain Bayes’ Theorem, so feel free to google it if you want. But my interest in Empirical Bayes was piqued by the fact that it is well suited for datasets with disproportionate sampling among subgroups within that set.
Sound appropriate? Let's see how it works.
Grim Dark Bayesians
Imagine we collect some Kill Team data about halfway through the quarter. We’re excited to know how the meta is shaping up, so we calculate win rates for each faction and check out the best performers:
Faction | Games | Win Rate |
---|---|---|
A | 35 | 69.7% |
B | 68 | 58.6% |
C | 403 | 57.4% |
Awesome. Faction A is the best, Faction B is second, and Faction C is third. Our work here is done.
Ehm. Obviously not...
Faction C is clearly better than Faction B, and although it is possible Faction A is better than Faction C, we’re not so sure about that. In fact, we know Faction A’s win rate will likely go down as more games are played; the only question is by how much.
How do we know all this? Because we have prior assumptions about how win rates ought to behave. The more games a faction plays, the more reliable the win rate becomes. The more reliable the win rate becomes, the more that win rate moves towards the average.
We know that when a faction has a healthy number of games, its win rate is likely to fall between 40% and 60% (new editions notwithstanding). We also know that the expected value of every faction’s win rate is 50%, with the majority of factions sitting within the 45% to 55% range.
Bayesian statistics is a way to model these assumptions in a mathematical, probabilistic way. We then use that model to improve the accuracy of our inference.
Samples are Power, Guard them Well
Let’s pull in some historic data to show this. I collected a dataset of nearly one and a half years’ worth of GT+ Kill Team data: all of Season 3 and most of Season 2 from the last edition. Six quarters total. I threw out Q3 of Season 2 for being too short due to an emergency dataslate. That’s 35,541 games and 8,354 picks (or, you know, a single quarter of 40k games).
Anyway, I calculated the win rate of each faction in each quarter, such that each faction within each quarter is treated as its own data point (e.g., Kommandos in 2024 Q1 is a separate observation from Kommandos in 2024 Q2).
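For the pandas-inclined, the bookkeeping looks something like this (a minimal sketch; the column names are hypothetical, not my dataset's actual schema):

```python
import pandas as pd

# Illustrative game-level records; the schema here is made up.
games = pd.DataFrame({
    "quarter": ["2024Q1", "2024Q1", "2024Q1", "2024Q2"],
    "faction": ["Kommandos", "Kommandos", "Legionary", "Kommandos"],
    "won":     [1, 0, 1, 1],
})

# One observation per (faction, quarter): Kommandos in 2024 Q1 is a
# separate data point from Kommandos in 2024 Q2.
obs = (
    games.groupby(["faction", "quarter"])
         .agg(games=("won", "size"), wins=("won", "sum"))
         .assign(win_rate=lambda d: d["wins"] / d["games"])
         .reset_index()
)
print(obs)
```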
Now, check out the histogram of this dataset:
As you can see, Kill Team historic data produces a distribution that resembles the description I gave earlier. But instead of just relying on my own subjective assumptions, this distribution is fit by the data itself. It’s data-driven, that’s the Empirical part of Empirical Bayes.
What you're looking at is a Beta distribution. I’ll produce a detailed write-up of all the technical gore¹ at some point. If you're really champing at the bit to see how all this is done, check out David Robinson’s introduction on the subject. But for now, the simple explanation is that this beta distribution mathematically represents our prior beliefs and assumptions concerning win rates.
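If you're impatient, here's a rough sketch of one way such a fit can be done in Python. The win rates below are made up stand-ins for the histogram data, and the real fit also weights each observation by its game count (see footnote 1):

```python
import numpy as np
from scipy import stats

# Made-up per-faction, per-quarter win rates standing in for the real dataset.
win_rates = np.array([0.44, 0.47, 0.50, 0.52, 0.48, 0.55, 0.51, 0.46, 0.53, 0.49])

# Method-of-moments fit: choose alpha and beta so the Beta distribution's
# mean and variance match the sample's.
m, v = win_rates.mean(), win_rates.var()
common = m * (1 - m) / v - 1
alpha0, beta0 = m * common, (1 - m) * common

# Sanity-check the prior against our earlier assumptions: mean near 50%,
# most of the probability mass between 45% and 55%.
prior = stats.beta(alpha0, beta0)
print(f"alpha={alpha0:.1f}, beta={beta0:.1f}, mean={prior.mean():.3f}")
print(f"P(45% < rate < 55%) = {prior.cdf(0.55) - prior.cdf(0.45):.3f}")
```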
For each faction in the quarterly stats, we will take this beta distribution (or one like it), add in the faction’s individual data, then update to what’s called a Posterior Distribution. Each faction's posterior (hehe...) distribution is a probability distribution that represents the variable we are trying to infer: that faction’s true win rate.
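In code, the update is pleasantly small. Here's a sketch using the hyperparameters footnote 1 reports for the historic fit (α = β = 29.4479); plugging in a faction's raw numbers lines up with the Empirical Bayes table in the next section:

```python
from scipy import stats

ALPHA0 = BETA0 = 29.4479  # prior hyperparameters reported in footnote 1

def posterior(wins: float, games: float):
    """Fold a faction's record into the Beta prior to get its posterior."""
    return stats.beta(ALPHA0 + wins, BETA0 + (games - wins))

# Novitiates: 50 games at a 67.00% raw rate. Fractional wins appear because
# the raw rates imply half-wins (presumably draws; that's my assumption).
wins = 0.67 * 50
post = posterior(wins, 50)
print(f"EB estimate: {post.mean():.2%}")  # -> 57.81%
```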
Don’t feel too intimidated if most of this statistical jargon is over your head. Let’s just take a look at the process in action.
Please Just Give Me Tiers
Check out these insane win rates of the highest performing factions in the current meta (start of the New Edition up to 2024-11-25; including the Warhammer World Championship):
Faction | Games | Win Rate |
---|---|---|
Plague Marines | 42 | 67.86% |
Novitiates | 50 | 67.00% |
Hierotek | 97 | 63.92% |
Void-Dancers | 53 | 63.21% |
Warpcoven | 166 | 62.95% |
Inquisition | 93 | 60.22% |
Legionary | 258 | 59.30% |
It's worth noting that these GT+² samples are pretty small even by Kill Team standards (a lot of players seem to be avoiding competitive GT+ events at the moment; can't imagine why...). Still, if we try to treat these raw averages as a kind of faction ranking, the result looks really off. Plague Marines and Novitiates are flying high, but each has 50 or fewer games. Do we seriously believe they are better than Warpcoven and Legionary?
Now, let’s check out these same stats but crunched through Empirical Bayes:
Faction | Games | EB Win Rate |
---|---|---|
Warpcoven | 166 | 59.56% |
Hierotek | 97 | 58.66% |
Novitiates | 50 | 57.81% |
Legionary | 258 | 57.57% |
Plague Marines | 42 | 57.43% |
Void-Dancers | 53 | 56.26% |
Inquisition | 93 | 56.25% |
Well, that's an improvement. Warpcoven rises to its rightful spot as overlord of the new edition. Meanwhile, Plague Marines and Novitiates drop a whopping 10% and become comparable to Legionary. A much more reasonable spot for them. Finally, in spite of our adjustments, all these factions remain well above the 55% tolerance for faction balance; this suggests that even when we control for sample size, these factions are still overperforming.
As you may have noticed, all factions moved towards the global mean of 50%. However, the low-sample factions moved much more than the data-rich ones. Why? Because the less evidence we have on a faction, the more we rely on our prior assumptions. The more evidence a faction has, the more we can trust that evidence.
To paraphrase Mr. Robinson:
- If we have very little evidence (30 games), we move it a lot;
- If we have a lot of evidence (500 games), we move it only a little.
This is called shrinkage, or regression to the mean. It presumes:
Extraordinary outliers require extraordinary evidence.
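To see the asymmetry in action, here's a small sketch (same footnote-1 prior as before) comparing how far a 60% raw win rate gets pulled at 30 games versus 500:

```python
ALPHA0 = BETA0 = 29.4479  # same prior hyperparameters as before

def eb_win_rate(raw_rate: float, games: int) -> float:
    """Posterior mean: (prior pseudo-wins + wins) / (prior pseudo-games + games)."""
    wins = raw_rate * games
    return (ALPHA0 + wins) / (ALPHA0 + BETA0 + games)

for games in (30, 500):
    print(f"{games:>3} games at 60.00% raw -> {eb_win_rate(0.60, games):.2%}")
# 30 games: ~53.4% (shrunk hard toward the mean)
# 500 games: ~58.9% (barely moved)
```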
If you’re like me, you might feel this approach sounds unfair. By moving faction averages to the global mean, aren’t we just making them appear more balanced than they really are?
It can seem that way at first, but the truth is this process is much more fair and robust than naive raw averages. Should we really consider a faction with 100 games to be directly comparable to a faction with 600? How is that fair? If a faction has a meager 30 games but a 60% win rate, how seriously should we treat that result?
Empirical Bayes solves this problem for us. It provides us with a fairer way to compare subgroups within a dataset that have radically different sample sizes. Best of all, it’s data-driven and based on sound principles.
Next Time
That’s not to say these new estimates handle all uncertainty for us. Consider the handful of terrible factions that have been completely abandoned by the player base; won't they move heavily to the middle on account of shrinkage? That doesn't seem accurate; how do we handle that?
Fortunately, Empirical Bayes provides us with far more tools than just point estimates (which are simply a "best guess").
As I mentioned earlier, thanks to Bayesian witchcraft, every faction receives a posterior distribution (a distribution representing the full probability of a faction’s true win rate).
For my next couple of articles, I’m going to talk more about these distributions and how we can use them to:
- Directly measure the probability that a faction’s true win rate is above or below the 55% or 45% benchmarks.
- Build an algorithm that can sort our factions into tiers using posterior probability.
But we've covered enough for now. I hope you found this interesting; if not, feel free to email me at smilesliesgunfire1999@gmail.com and tell me I'm a pretentious hack. After all, you made it through this article; you've certainly earned the right!
Thanks for the read! 🧐🥃
Footnotes
- Okay, a few gory details (for Khorne's sake of course). This histogram contains all quarter/factions with 60 or more games in that quarter. However, the fitted beta distribution contains all quarter/factions weighted by their number of games (hence why the expected value is perfectly 50%: when one faction wins, another must lose). Both the α and β hyperparameters are 29.4479. ↩
- We like to focus our analysis on data from GT+ events (events with 16+ players and 4+ rounds). Although we can get many more samples if we include smaller events, such samples might not reflect a serious competitive environment; large sample sizes are useless if we follow poor sampling methods. ↩