How do Pokémon Fit in a Tiny Ball?

Using Science to Answer Pokémon Metagame Questions

Hey, people of SixPrizes! I am back with a brief special report. My son’s first year in Seniors is going pretty well! He has earned Championship Points at every single Regional we have attended this year, but we have attended so few that it hasn’t mattered much: he has his invite, but we aren’t really pushing for Top 16 this year. That means a bit of a challenge ahead; while he made Top 16 the last two years, I found making it to Day 2 via the grinder (succeeding on Day 1) in earlier years to be super-stressful for a Poké-parent. So, not too much to mention here. We will be in Toronto and Virginia to round out our season. Make sure you say hi!

Liam at Charlotte Regionals!

So while I don’t have a lot to report from a tournament report perspective, I do have stuff to talk about! RK9 Labs and the inimitable Carlos Pero have hooked me up with some data from the Portland Regional Championship to help inform your “strategerizing” about decks.

Hypothesis

Looking at a large sample of games played by archetype can tell us which deck is the BDIF. Further, can we determine whether decks with good results over-performed because they hit favorable pairings relative to the meta, or whether they are, in fact, simply well-positioned against the meta?

Data Set

SixPrizes’ own Christopher Schemanske and the RK9 team went through and categorized every deck played at the Portland Regional Championship tournament by archetype. Then, I was provided with the win/loss data for every matchup for Day One of the tournament in the form (Round X – Archetype One v Archetype Two – Outcome).

Representative Data Sample…

Analysis

In the interest of expediency, I began by deleting all the ties. (Because of the way the data set was constructed, I would probably have wanted to add 0.5 wins and 0.5 losses per tie, and I didn’t want to code that in my first pass.)
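To make that cleaning step concrete, here is a minimal sketch of what dropping the ties might look like. The file name, column names, and outcome codes are my own assumptions for illustration, not the actual RK9 export format.

```python
# Sketch only: one hypothetical row per game (round, deck_a, deck_b, outcome),
# where outcome is "W" (deck_a won), "L" (deck_a lost), or "T" (tie).
import pandas as pd

games = pd.read_csv("portland_day1.csv")
games = games[games["outcome"] != "T"].copy()  # first pass: drop ties entirely
# (The alternative noted above: count each tie as 0.5 wins and 0.5 losses.)
```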

From there, I built a model of each deck’s win rate against every archetype in the data. This rolls up to an average win rate for each deck. Then, to determine how each deck performed against the meta, I looked at the total incidence of each archetype and turned that into a meta projection. Finally, I ran each deck’s matchup win rates against the modeled meta and summed the weighted results to get its win rate versus the meta.
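Here is a rough sketch of that computation, continuing from the hypothetical `games` frame above (column names are still my assumptions):

```python
# Sketch only: build the matchup win-rate matrix and the meta projection.
import pandas as pd

# Look at every game from both decks' perspectives.
a_side = games.rename(columns={"deck_a": "deck", "deck_b": "opponent"})
a_side["win"] = (a_side["outcome"] == "W").astype(int)
b_side = games.rename(columns={"deck_b": "deck", "deck_a": "opponent"})
b_side["win"] = (b_side["outcome"] == "L").astype(int)
long = pd.concat([a_side, b_side], ignore_index=True)

# Raw average win rate per deck, and its win rate vs. each opponent archetype
# (NaN where a matchup was never played; those gaps are handled below).
overall_win_rate = long.groupby("deck")["win"].mean()
matchup_win_rate = long.pivot_table(index="deck", columns="opponent",
                                    values="win", aggfunc="mean")

# Meta projection: each archetype's share of all deck appearances.
meta_share = long["deck"].value_counts(normalize=True)
```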

The challenging situation here is gaps in the model: Espeon/Garbodor, for example, never recorded a win or loss against an Empoleon deck. A bigger sample size (we could combine this with Toronto, say, which offers an identical format but a different meta mix) would fill in these slots, but in the absence of that, we have holes in the model. Rather than speculate on what the win percentage might be, I somewhat arbitrarily set it to 50%.

The advantage of this is that it naturally compensates for the fact that I eliminated ties (which would otherwise populate some pockets of the model with 50/50 outcomes). It also puts pressure on the model in an interesting way. If we set the win rate for unplayed matchups to the deck’s historical win rate, the combined rate across played and unplayed matchups would simply land back at the played win rate, and we would learn nothing. By contrast, setting unplayed matchups to 50/50 naturally lowers the win rate of the top-performing decks, the most interesting decks in the model, and it lowers the win rate of decks that played few matchups more than decks that played a broad cross-section of matchups and still performed well.
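Continuing the sketch, the gap-filling and meta-weighting step might look like this, assuming the `matchup_win_rate` matrix and `meta_share` series built above:

```python
# Sketch only: fill matchups that were never played at a flat 50%, then
# weight each deck's matchup win rates by the projected meta and sum.
filled = matchup_win_rate.reindex(columns=meta_share.index).fillna(0.5)
meta_weighted_win_rate = (filled * meta_share).sum(axis=1)

print(meta_weighted_win_rate.sort_values(ascending=False).head(10))
```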

Finally, I am sure many of you will recognize that we are missing three elements:

1) We treat “the meta” as a single thing, when it isn’t. It might be more interesting to model how the top-table meta evolves. It is not as important that a deck perform well against the entire cross-section of decks at the tournament as it is that it perform very well against, say, the decks with winning records. Then you merely need to navigate one round of random outcomes, after which you may face a very different meta mix. This analysis was not possible because the data set did not maintain state from round to round, so for any given matchup we could not determine the records of the players at that table after Round 1.

2) We don’t control for player skill. Obviously, decks like Sylveon and Bulu were played by many of the top players in the game. Mapping Championship Points to players to control for skill (see my earlier SixPrizes article) would give us insight into how much a deck’s outcomes are driven by the fact that top players chose it. Anecdotally, watching the performance of players like Alex Hill, Jimmy Pendarvis, and Pablo Meza, there can be little doubt that Pokémon is a fairly skill-based game: these players can pick a deck and make it perform well almost regardless of the deck’s quality.

3) We don’t control for techs. I would love to enhance this model a bit by looking at things that cause a bifurcation of results. How did Greninja do against decks that ran the Giratina promo versus decks that didn’t? How did decks that ran Oranguru UPR fare against mill decks compared to decks that didn’t? There are a number of techs you could single out to determine their impact on outcomes.

Results

Actual Portland Day 1 Win Rate, excluding ties…


When you look at straight outcomes, the mill decks were the most successful decks at Portland. Sylveon and Hoopa were the only decks that won more than 65% of their games. Bulu, Espeon/Garbodor, Lucario/Lycanroc, and Buzzwole were the other well-known decks that won more than 57% of their games. But there was one deck that won 60% of its games and was not on the radar: Solgaleo!

Another interesting aspect of this data is how poorly, net/net, even the strong decks did. Think about it: every article you read here at SixPrizes features matchup analysis that characterizes matchups as 60/40, 50/50, 40/60, or thereabouts. Looking at this data, if you found a deck that was 60/40 against the most popular decks in the meta, you would have a deck that is clearly among the best four or five decks in the format. It is also interesting how far from normal the distribution is.

A big part of the bias here is the drop rate for decks: decks with below-50% win rates played ~700 games, while decks with win rates above 50% played more than 1,100 games. Another interesting aspect of the distribution is that the decks below 50% generally weren’t just slightly below it; they performed quite poorly.

A universal truth that is truly driven home by the data: while there are very few very good decks, there are lots of bad decks. Only 4 decks had a win rate of 60% or above, while 12 decks had win rates of 40% or below.

Win Rate Controlling for Matchups

When you control for matchups, the effect of the way the model is built is that decks that played fewer games tend to revert to the mean: there is simply less evidence of how good or bad they are, and it could be that they just hit bad matchups. More widely played decks have more stable outcomes and are influenced less by this effect.
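A toy example (with made-up numbers) of why that happens under the 50% fill:

```python
# Purely illustrative: suppose a deck won 75% of its played matchups, but
# those matchups cover only 20% of the projected meta. The other 80% of
# the meta gets filled in at 50%.
played_rate, played_meta_share = 0.75, 0.20

meta_weighted = played_meta_share * played_rate + (1 - played_meta_share) * 0.50
print(meta_weighted)  # 0.55 -- a 75% deck gets pulled most of the way back to 50%
```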

So, once the control is implemented, interestingly, Lucario/Zoroark becomes a top performer as statistically less significant decks get pulled out of the top of the standings. Solgaleo, for example, having played only 8 games in our model, almost completely regresses to the mean. Espeon/Garbodor becomes a much worse deck, but possibly the most interesting result is that “Garbodor Variants” becomes a top-tier deck. The only way a deck with a below-50% win percentage in the real data could vault to an above-50% win rate in our model is exactly the effect we were looking for: it lost because it hit bad matchups. The more you weight its matchups to reflect the meta, the better it performs. Garbodor is a strong card.

Conversely, Golisopod/Zoroark had a below-50% win rate, and when controlled for the meta, its win rate got even worse. For a deck that many consider a conventional 50/50 against much of the meta, it is rare to see one look like a worse and worse play the longer you examine it, despite being the third most commonly played deck at the tournament. While many discussed how poorly it performed in Portland, this really reveals the depths of its sorrow.

On the bottom end of the model, “Other” decks drift to the bottom, as the identified decks that performed more poorly than “Other” played so few games that they drifted toward the mean. The interesting implication here is that, at Portland, people who tried to go really rogue (with the exception of Solgaleo) were punished.

Finally, I thought I would mention Greninja for a moment. How likely are Greninja hands? Pretty likely, I guess. Looking at the matchups, you would think Greninja is in an okay spot in the meta. Despite that, it had a win rate of only 53%, and even though it was one of the 10 most popular decks in Portland, its win rate falls to ~50% when projected against the meta. While some of that is the regression to the mean that the model is prone to, the distribution of games Greninja actually played is also reflected in the outcome. Don’t play Greninja.

Conclusion

Science shows: the best way to test is “coffee shop testing”

People will obviously take the need to counter mill decks more seriously for the rest of this format. I also think this demonstrates that the more transparency there is in publishing tournament outcomes, the better understood the nature of Pokémon will become. This model has many flaws, but it points to the opportunity to delve deeply into the analytics: to understand the value proposition of each archetype, the importance of skill versus deck selection, and other meaningful tidbits of knowledge.

I want to thank Carlos and the whole RK9 team for the chance to dig into this data, and for the contribution that RK9 and Regional organizers have made to increasing transparency in Pokémon.

See everyone in Toronto!
