This year’s NCAA basketball tournament has just ended, which means a lot of people are currently cursing their bracket picks. That also means it’s a great time for a retrospective: How could one go about picking a better bracket? There are all kinds of considerations here, but in this post we’re going to focus on just one feature: team seedings.
In the NCAA tournament all 64 teams are placed in one of four regional tournaments, and the teams in each regional tournament are given seeds numbered 1 through 16. The regional tournaments are set up so that #1 seeds play #16 seeds, #2 seeds play #15 seeds, and so forth, for the first round. In the second round, the winner of 1 vs. 16 plays the winner of 8 vs. 9, the winner of 2 vs. 15 plays the winner of 7 vs. 10, and so on, continuing through four rounds. After the fourth round, fifteen teams from each region have been defeated, and the remaining team is the winner of the region and goes to the Final Four to play against the other regional winners.
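The pairing rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `first_round_pairings` is our own, and the order in which the matchups are listed may differ from the official bracket layout, although the same seeds meet in each round.

```python
def first_round_pairings(n_seeds=16):
    """Return first-round matchups for one region as (seed, seed) pairs.

    Neighboring pairs in the returned list feed into the same
    second-round game, neighboring groups of four into the same
    third-round game, and so on.
    """
    order = [1]
    # Repeatedly double the field: each seed s already placed is paired
    # with the seed that "mirrors" it in the larger field.
    while len(order) < n_seeds:
        size = 2 * len(order)
        order = [s for seed in order for s in (seed, size + 1 - seed)]
    return [(order[i], order[i + 1]) for i in range(0, n_seeds, 2)]


pairs = first_round_pairings()
print(pairs)
```

The first two pairs are (1, 16) and (8, 9), so their winners meet in the second round; likewise (2, 15) sits next to (7, 10).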
The NCAA men’s tournament expanded to 64 teams in 1985. This gives us 40 years’ worth of information on how teams actually perform relative to their seeds. In this series of blog posts, we will use this information to investigate how you can improve your chances of winning your tournament bracket pool using information from the team seeds.
(For the purposes of this project, we are ignoring the play-in teams in the men’s tournament. Also, we will not include information from the NCAA women’s tournament, since the dynamics appear to be different: The top seeds and a small number of programs tend to dominate in a way that does not happen with the men’s tournament.)
Historical Winning Percentages
The first thing to do is figure out the historical win rate for the various seed matchups. Kaggle’s Machine Learning Mania 2026 contains all the data we need (much more than we need, in fact). We extract the data and keep only the yearly seed information from 1985 on, including which seeds won which matchups.
After all the data extraction, cleaning, and computation involved, we end up with a 16 x 16 matrix for seed vs. seed win/loss records from 1985 to 2025. The full matrix is given in the file below. A few details are worth pointing out.
First, the #1 seed has a winning record of 98.8% against the #16 seed. From 1985 to 2025, inclusive, there have been 40 NCAA tournaments (41 years less one for Covid-19), which means 160 games that have pitted a #1 seed against a #16. We know that a #16 seed has only won twice (UMBC over Virginia in 2018 and Fairleigh Dickinson over Purdue in 2023), giving a 158/160, or 98.75%, winning percentage for the #1 seed here. The 0.988 that appears in the table serves as a sanity check on our calculations.
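As a rough sketch of how such a matrix could be tallied, the snippet below counts wins from a list of (winning seed, losing seed) pairs and converts the counts to win rates. The input format and the function names `tally_wins` and `win_rates` are assumptions made for illustration, not the layout of the Kaggle files; the sample data is just the #1 vs. #16 record quoted above.

```python
def tally_wins(games, n_seeds=16):
    """Count wins: wins[i][j] = times seed i+1 beat seed j+1."""
    wins = [[0] * n_seeds for _ in range(n_seeds)]
    for winner, loser in games:
        wins[winner - 1][loser - 1] += 1
    return wins


def win_rates(wins):
    """Convert win counts to win rates; None marks matchups never played.

    Same-seed matchups (the diagonal) are left as None in this sketch.
    """
    n = len(wins)
    rates = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            played = wins[i][j] + wins[j][i]
            if i != j and played > 0:
                rates[i][j] = wins[i][j] / played
    return rates


# The #1 vs. #16 record from the text: 158 wins for the #1 seed, 2 losses.
games = [(1, 16)] * 158 + [(16, 1)] * 2
rates = win_rates(tally_wins(games))
print(rates[0][15])  # 0.9875
```

Matchups that never occurred stay as `None` rather than 0, which keeps "never played" distinct from "never won".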
In addition, several higher-numbered seeds have winning records against lower-numbered seeds (noted in orange). Many of these are based on only one or a few games, but one in particular is not. With 160 matchups, the #9 seeds actually have a winning record against the #8 seeds: 51.9%. There are also many, many seed vs. seed matchups that have never occurred in the tournament (noted in light yellow). Finally, several seeds have 100% winning records against particular other seeds (noted in light blue if not already colored orange). While historically accurate, these are based on only a few games and are suspicious from a modeling/analysis standpoint.
[File download: win_rate_matrix]
Smoothing Winning Percentages with Bradley-Terry
The sparsity of the historical win rate matrix and the nonzero entries based on only a few matchups are problems for us if we’re trying to do a full seed-vs-seed analysis. All is not lost, though: We can smooth the matrix in some fashion, using the historical information we do have to estimate the information we do not.
There are a few options available to us here, but perhaps the simplest robust choice is to use the Bradley-Terry model. In our terms, this model says that the probability that seed i defeats seed j in the tournament is
$$P(i \text{ defeats } j) = \frac{p_i}{p_i + p_j},$$
where $p_i$ and $p_j$ are the overall strengths (in some sense) of teams i and j, respectively. And we have a good way to estimate the $p_i$ value for each seed i: Use maximum likelihood estimation and the historical seed vs. seed win data so that the ratio formula above best explains all observed matchups simultaneously. (This process is described in detail in the Bradley-Terry model page.)
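To make the estimation step concrete, here is a minimal sketch of one standard way to compute the maximum likelihood strengths: the iterative update usually attributed to Zermelo (also known as the MM algorithm for Bradley-Terry). The post does not say which fitting routine was actually used, so treat this as an assumption. The function names and the `wins` input (a square table where `wins[i][j]` is the number of times seed i+1 beat seed j+1) are ours, and the model is the usual one in which seed i beats seed j with probability equal to its strength divided by the sum of the two strengths.

```python
def fit_bradley_terry(wins, n_iter=1000, tol=1e-10):
    """Estimate Bradley-Terry strengths from a square table of win counts.

    Returns strengths normalized to sum to 1. Same-seed games on the
    diagonal are ignored, since they carry no information about
    relative strength.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            # A seed with no games keeps its current value.
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


def win_probability(p, i, j):
    """Model probability that the seed at index i beats the seed at index j."""
    return p[i] / (p[i] + p[j])


# Toy example with two "seeds": the first won 3 of 4 meetings.
p = fit_bradley_terry([[0, 3], [1, 0]])
print(win_probability(p, 0, 1))
```

Two caveats about this sketch: a seed that has never won a game would be driven to a strength of zero, and the estimates are only well defined when every seed is connected to every other through some chain of wins and losses. Once the strengths are fitted, `win_probability` fills in every cell of the matrix, including matchups that have never been played.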
Applying the Bradley-Terry model gives us the updated win-rate matrix below. This matrix looks much more reasonable: Every matchup now has a win probability, and none of the values looks out of line (the #1 vs. #16 win rate of 98.8% is even preserved). The oddly high percentages below the diagonal have been smoothed out, leaving only three win rates in orange, all of which seem plausible.
[File download: bt_win_rate_matrix]
Now, how do we use this model to pick a good bracket? We will continue the discussion in the next post, where we discuss how to maximize your expected bracket score.
