Note: this post is about how we compute the aggregated scores (out of 100) we show for each game on our ratings pages. Our written reviews also have ratings (out of 10) that reflect our personal evaluation of each game; we’ll discuss our criteria for those ratings in a future post. (We’ll also work on making that distinction clearer on the website.)
Before we get into the nitty-gritty, one might ask: why are we going to the trouble of calculating our own scores?
These games already have player ratings in the app stores, and many are rated on BoardGameGeek as well. Individual critics review these games and often provide their own numerical ratings. There are even a few other aggregator sites that try to combine player ratings with critic reviews to come up with overall scores. We aren’t data scientists with some fancy scoring algorithm, and don’t (in theory) have access to any information that these other sites don’t have. What do we have to offer?
Two things, we think:
Using more sources: Mobile board games are still a niche category, which means that many of the critic reviews are written by smaller bloggers who focus specifically on these games. These reviews often get overlooked by the large aggregators, whose bread and butter is large categories like movies or more mainstream video games, for which there are many critics associated with larger publications.
Adjusting for “grade inflation”: Some critics are harsher than others. As far as we can tell, other aggregators do not adjust for this fact. In large categories with a high volume of critic reviews, where everyone reviews everything, that doesn’t really matter. But in a niche category like mobile board games, it means that a game’s score depends too heavily on which critics happened to write about it.
Selfishly, we were struggling with which mobile games to try next in our games group spread across the US, and we wanted to find a better way to discover good games (especially since they often cost a fair bit upfront just to try them out). We didn’t feel that any of the sites out there gave us a solid understanding of the consensus view on how good a particular mobile board game was, and we wanted to change that. And that’s how we found Roll for the Galaxy, which hadn’t been on our radar but is currently the #2 highest-rated game on our site (after Through the Ages, which we’d already played and loved).
We hope you find these scores useful as well!
The goal of our “Loodo Score” — the headline number on each game’s rating page (example below) — is to combine all the ratings we’ve been able to find for each game into a single number. If you’re familiar with sites like Rotten Tomatoes or Metacritic then you get the basic idea.
Update Feb 9, 2022: We've rebranded from Lowkey to Loodo! You can still see our old branding in the screenshot below and some references to Lowkey in some of the other visuals on this page.
More specifically, we aim for Loodo Scores to be comparable across games and intuitive for readers. Games with higher scores should generally be preferred to games with lower scores, and a score that “sounds high” should indicate a good game.
This might seem straightforward: one could simply take all of the ratings for each game, convert them to some common total (we show scores out of 100), and then average the scores together.
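For readers who think in code, here is a rough sketch of that naive approach. The function names and the sample ratings are made up for illustration; this is not our production code.

```python
def to_100_point_scale(rating, scale_max):
    """Convert a rating on a 0..scale_max scale to a score out of 100."""
    return rating * 100 / scale_max


def naive_score(ratings):
    """Average a list of (rating, scale_max) pairs on a common 100-point scale."""
    converted = [to_100_point_scale(r, m) for r, m in ratings]
    return sum(converted) / len(converted)


# Hypothetical ratings for one game: 4 of 5 stars, 7.5 of 10, and 85 of 100.
example_ratings = [(4, 5), (7.5, 10), (85, 100)]
print(naive_score(example_ratings))  # 80.0
```

Every rating counts equally here, regardless of how generous its source is or how few ratings sit behind it.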
Unfortunately, that strategy does not lead to comparable scores. There are a couple of key problems to solve:
Games with few ratings: Games with very few ratings may have unusually high or low ratings essentially by chance.
Different ratings scales: 4 out of 5 stars means something very different depending on the source.
The Short Answers
For games with few ratings, the concept is pretty straightforward: if a game doesn’t have very many ratings from players and critics, we adjust its score towards the average. The goal is to avoid having games with very few ratings end up at the high or low extremes of our ratings distribution — in particular, we don’t want to give a game a top score unless we have solid evidence that it’s worth your time.
It’s a bit harder to concisely explain how we handle different ratings scales. Essentially, we measure the distribution of each source’s ratings (including how high and low they go, but also how often they give different scores) and then “grade on a curve” so that each source has about the same average score and about the same percentage of high, medium, and low ratings as every other source. This doesn’t mean a particular game’s score is adjusted to be the same everywhere — the variance in opinions is exactly what we’re trying to capture — but if one website routinely awards perfect scores and another rarely goes above 80 out of 100, we adjust both of those so that a “top score” is considered something like a 90 out of 100.
The Long Answers
For games with few ratings, we adjust towards the average
We take what’s called a Bayesian approach.
The basic intuition is to start with a “default belief” and then adjust that belief based on new information. You update your belief based on how much new information you receive and how much that information differs from your default.
In this situation, our default belief is that each game has an average score, and we adjust that based on the new information we get from the ratings we find for each game — the more above-average ratings we get for a particular game, the more strongly we will believe that it is in fact an above-average game.
The trickiest part of Bayesian approaches is usually trying to figure out how quickly you should change your beliefs in response to new information. While there are methods to estimate those parameters, we’ve taken a simpler heuristic approach for now (again, we are definitely not data scientists).
For each source, we consider games to have “not very many” ratings if they have fewer ratings than three-quarters of the other games we looked at from that source (put another way, games that are at the 25th percentile or lower in terms of number of ratings). For the Apple App Store, for example, we make adjustments to the games in our database that have 16 user ratings or fewer.
Similarly, we determine the average score for each source independently (we adjust for different ratings scales in different sources in the next section). The games we looked at had an average rating of 4.1 stars in the App Store.
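As a sketch of how those two per-source numbers could be derived, the snippet below takes the 25th percentile of the rating counts and a simple mean of the per-game averages. The sample data is invented, and both the exact percentile method and the unweighted mean are assumptions on our part for this illustration.

```python
from statistics import mean, quantiles

# Hypothetical (number_of_ratings, average_stars) pairs for games from one source.
games = [
    (4, 4.8), (16, 3.9), (60, 4.2), (250, 4.0),
    (900, 4.4), (3000, 3.7), (12000, 4.5),
]


def source_parameters(games):
    """Return (few_ratings_threshold, source_average) for one source."""
    counts = [count for count, _ in games]
    # First quartile cut point, i.e. the 25th percentile of rating counts.
    threshold = quantiles(counts, n=4)[0]
    # Unweighted mean of each game's average rating (an assumption).
    source_average = mean(stars for _, stars in games)
    return threshold, source_average


threshold, source_average = source_parameters(games)
print(threshold, round(source_average, 2))
```

Games at or below the threshold would be the ones treated as having “not very many” ratings for that source.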
Putting those together: we adjust scores for games with few ratings by averaging in some average-level ratings. If a game has “n” ratings in the App Store (where n ≤ 16) and its average rating is “x”, we adjust its rating to be:
[x * n + 4.1 * (16 – n)]/16
Let's assume a game has a 4.5 rating but only 8 players have rated it so far; x = 4.5 and n = 8. Plugging the numbers into the formula above yields:
[4.5 * 8 + 4.1 * (16 – 8)]/16 = 4.3
Connecting this back to the intuition above: we started with a default belief that this game would have an average rating of 4.1. Its actual rating so far has been 4.5, but since we only have half as many ratings as our threshold, we only adjust our default belief halfway to 4.5, landing at 4.3.
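The same calculation as a small Python sketch, using the App Store numbers from this section (a source average of 4.1 and a threshold of 16). The function name is ours, purely for illustration.

```python
def adjust_for_few_ratings(x, n, source_average=4.1, threshold=16):
    """Pull a rating toward the source average when a game has few ratings.

    x is the game's average rating and n is how many ratings it has.
    Each missing rating (up to the threshold) is filled in with the
    source average, then everything is averaged together.
    """
    if n >= threshold:
        return x  # enough ratings: leave the score alone
    return (x * n + source_average * (threshold - n)) / threshold


# The worked example above: a 4.5 rating from only 8 players.
print(round(adjust_for_few_ratings(4.5, 8), 2))  # 4.3
```

A game with no ratings at all would land exactly on the source average, and a game with 16 or more ratings keeps its own rating unchanged.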
For the more visual learners out there, the diagram below may be useful (if this already makes sense you can skip ahead to the last paragraph in this section). We start on the left with 16 “default” boxes (gray) that are each worth 4.1. Each player rating (colored) we get replaces one of the default boxes, so that after eight player ratings we’ve replaced half of the default boxes with real ratings. The combined average of player ratings and remaining default boxes is what we use.
After making these adjustments for each source, we also make a similar adjustment for games with one critic review or none (so a game with a high app store rating but no reviews will be somewhat penalized). We see some evidence in the data that critic reviews are a positive signal (which also makes sense intuitively), so in our scoring the absence of critic reviews can only bring a game’s score down toward the average; it never pulls a below-average game up.
We adjust ratings from different sources to account for “easy” or “harsh” graders
Different sources have vastly different ratings scales.
After converting all scores to a 100-point scale (so, 4 out of 5 stars becomes 80 out of 100) and adjusting games with few ratings, we see that more than 6 in 10 games in the App Store receive at least an 80; fewer than 1 in 10 (digital) games rated by players on BoardGameGeek (BGG) get a score that high. Furthermore, a quarter of games in the App Store score at least a 90 — while none do on BGG. Many critics are even more generous than the App Store.
This yields vastly different interpretations of a score of “80” depending on where that score comes from — on BGG, that makes a game one of the very best, while on the App Store that’s actually below average. That means that if we took the “raw” average of scores we’d end up rewarding or penalizing games simply based on where they were reviewed.
To fix that, we adjust scores from each source to have similar overall distributions. The dashed lines below are BGG and the App Store before adjustments, and the solid lines are after:
Notice that we don’t just move the scores for a particular source up or down by a fixed amount. The original distribution of scores for BGG is not just lower than the App Store, but also narrower. We both raise the BGG scores and “spread them out” in order for the distribution to be similar to the App Store. The final distributions are closer to the App Store than to BGG because BGG is a bit of an outlier — as you can see on the column chart above, other sources are closer to the App Store than to BGG.
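To make “raise and spread out” concrete, here is one simple way such an adjustment could work: shift and stretch a source’s scores so they hit a chosen mean and standard deviation. This is an illustrative stand-in and not necessarily the exact method behind our charts; the target values and sample scores below are placeholders.

```python
from statistics import mean, pstdev


def rescale_to_target(scores, target_mean, target_std):
    """Linearly shift and stretch scores to a target mean and spread.

    The ordering of games within the source is preserved; only the
    center and width of the distribution change. Results are clamped
    to the 0..100 range.
    """
    source_mean = mean(scores)
    source_std = pstdev(scores)
    if source_std == 0:
        # Every score is identical, so there is no spread to stretch.
        return [float(target_mean) for _ in scores]
    return [
        min(100.0, max(0.0, target_mean + (s - source_mean) / source_std * target_std))
        for s in scores
    ]


# Hypothetical scores from a low, narrow source (all values invented).
harsh_source = [62, 66, 70, 74, 78]
adjusted = rescale_to_target(harsh_source, target_mean=80, target_std=10)
print([round(s, 1) for s in adjusted])
```

A plain shift would only change the mean. Dividing by the source’s own spread and multiplying by the target spread is what widens a narrow distribution.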
The average game’s Loodo Score ends up being 74, a decidedly average-sounding number in our view. Loodo Scores roughly form a bell curve:
Thanks for reading!