How Did Split Ticket’s Models Perform in 2024?

At Split Ticket, we try to let data drive all of our decisions and forecasts. That’s why, in 2024, we transitioned from holistic ratings, like those used at the Crystal Ball and the Cook Political Report, to a strictly quantitative model in the vein of FiveThirtyEight’s.

Now that the cycle is complete, we owe it to readers to conduct a full postmortem of our performance. Are these models and ratings trustworthy, and was our decision to switch methods validated?

Let’s take a look.

There were 525 federal elections in the United States this year. Of those, our forecast will have called all but 12 correctly, for a 97.7% accuracy rate. We missed 8 of 435 House races, 1 of 34 Senate races, and 3 of the 56 presidential contests. This is significantly better than our accuracy rate in 2022, when we missed 20 House races, and it places us second in the industry this cycle, behind only FiveThirtyEight.
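The topline figure is just arithmetic over those three sets of races; a quick check in Python, using the counts above, reproduces it:

```python
# Reproducing the topline call accuracy from the race counts above.
races  = {"House": 435, "Senate": 34, "President": 56}
misses = {"House": 8,   "Senate": 1,  "President": 3}

total_races  = sum(races.values())    # 525
total_misses = sum(misses.values())   # 12
accuracy = (total_races - total_misses) / total_races

print(f"{total_races - total_misses}/{total_races} correct = {accuracy:.1%}")  # 513/525 correct = 97.7%
```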

Of course, call accuracy rates can be misleading. For instance, let’s say that there are two models rating a race. One rates the race at Tilt Republican, giving the GOP a 51% chance of winning the election. The second rates the race at Safe Democratic, giving the Democrats a 98% chance of winning. In the end, the Democrat wins, but by only 0.1%. Who was more correct?

A blind call accuracy metric would say that the model rating it at Safe Democratic was correct, but we’re not quite sure we agree. Getting the winner right is obviously critical, but falsely conveying certainty should be discouraged. That’s why Ethan Chen’s new Bucket Score metric is a useful tool for evaluating how well calibrated our forecasts are to the actual results, and how they compare in performance to other forecasts and models.

By this metric, our quantitative model has an error score of 89. This puts us behind only the Crystal Ball, the Cook Political Report, and Inside Elections, and it marks ours as the best-performing quantitative model in the industry this cycle, among those that made both congressional and presidential predictions.

Even by other metrics, the model grades out fairly well. For instance, Jack Kersting scores models using the Brier score, and in each of the three categories (House, Senate, and presidency), Split Ticket’s models scored above the industry average.
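For readers unfamiliar with it, the Brier score is simply the mean squared difference between a forecast’s win probabilities and the binary outcomes, so lower is better. A minimal sketch with made-up probabilities, not our actual forecasts:

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

# Hypothetical probabilities that the Democrat wins each of four races,
# paired with the actual results (1 = Democrat won, 0 = Republican won).
dem_win_probs = [0.49, 0.98, 0.70, 0.15]
dem_won       = [1,    1,    1,    0]

print(f"Brier score: {brier_score(dem_win_probs, dem_won):.3f}")  # ≈ 0.093
```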

Broadly, we feel that our model was a success this time, and we’ll be sticking with a fully quantitative approach wherever possible for the foreseeable future. But that doesn’t mean we can’t improve, and we owe it to you to discuss some of these points in detail.

Areas for Improvement

In general, we did emphasize that both the presidency and the House were tossups, with the Senate leaning Republican. But while these topline forecasts were fine, we’re quite unsatisfied with the model’s performance in safe states and heavily Hispanic areas; though it remained accurate in predicting the winners (mostly because it doesn’t take a genius to figure out that New York and New Jersey were going blue, or that Florida was going red), it missed the margin badly in several cases.

Florida was among the most egregious areas of error here. Our model grossly overrated Democratic chances in the Sunshine State, giving them a 19% chance of winning the state presidentially and a 22% chance of winning the Senate election. We also thought that Florida’s 13th, with Whitney Fox and Anna Paulina Luna, was a tossup, and that Florida’s 4th was on the edge of the board.

In all cases, the forecast was driven by polling, and in all cases, it missed by a non-negligible amount; none of these races turned out to be especially close, and certainly not to the degree the model suggested. We wouldn’t normally beat ourselves up about this, but Florida polling now has a consistent and rather predictable pattern of massively overestimating Democrats; this is the fifth cycle in a row it has happened. In fact, had we stuck with qualitative ratings, we would have done better in Florida specifically.

We’re not quite sure what to do about this, but it’s worth exploring how to handle states whose polling consistently overrates one side; that type of error isn’t necessarily random, and we wonder whether it’s worth building it into future forecasts. Unskewing is a fool’s errand with an awful track record, and we won’t be doing that, but we may explore and test things like weighting polling by historical accuracy at the state level (so polls in Georgia would carry more weight than polls in Florida).
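As a rough illustration of the kind of adjustment we might test (a hypothetical sketch, not our production model, with placeholder error figures), a state’s polling average could be shrunk toward a non-polling prior in proportion to how badly that state’s polls have missed historically:

```python
# Hypothetical sketch of state-level poll weighting; the historical error
# figures and the blending scheme are illustrative assumptions.
HISTORICAL_ABS_ERROR = {"Georgia": 2.0, "Florida": 6.0}  # placeholder average poll misses, in points
BASELINE_ERROR = 3.0                                     # assumed typical statewide miss

def polling_weight(state: str) -> float:
    """Polls from historically accurate states keep most of their weight;
    chronically off-target states get shrunk toward the non-polling prior."""
    error = HISTORICAL_ABS_ERROR.get(state, BASELINE_ERROR)
    return BASELINE_ERROR / (BASELINE_ERROR + error)

def blended_margin(state: str, poll_average: float, fundamentals_prior: float) -> float:
    """Blend a state's polling average with a fundamentals-style prior margin."""
    w = polling_weight(state)
    return w * poll_average + (1 - w) * fundamentals_prior

# Under these placeholders, Georgia's polls drive 60% of its blended margin,
# while Florida's polls drive only a third of its own.
print(f"Georgia blended margin: {blended_margin('Georgia', 1.0, -2.0):+.2f}")  # -0.20
print(f"Florida blended margin: {blended_margin('Florida', 1.0, -6.0):+.2f}")  # -3.67
```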

Another major area where we feel we could have improved relates to coverage, specifically around the popular vote. While our model predicted only a Harris +2.1 popular vote margin (and internally gave her a 70% chance to win it), we never seriously discussed the probability of a Trump popular vote win. This is because presidents aren’t elected by popular vote, and so we felt it didn’t matter much. As a result, we didn’t even display that probability on our site.

In hindsight, while it’s true that the national popular vote is not definitionally important for victory, a lot of the post-election narrative has been dominated by shock over Trump becoming the first Republican to win the popular vote in 20 years. If we could redo our coverage, we would have talked extensively about the non-trivial probability of Trump winning the popular vote, the paths through which it would happen, and the implications it could carry.

Ultimately, forecasters exist to inform the public of the range of outcomes, and while we feel like we generally did a good job of that this cycle, this specific point is one where we think we fell short.

President

Our final forecast pegged the race as a tossup, giving Harris a 53% chance and Trump a 47% chance. We also correctly modeled that the modal outcome for the election was one candidate sweeping all seven swing states, due to the correlation inherent in polling error, and this time that broke in Trump’s favor. We called four of the seven swing states correctly and marginally overestimated Harris in the Rust Belt.
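To see why a sweep was the single most likely map, consider a toy simulation (hypothetical margins and error parameters, not our actual model): when most of the polling error is shared across states, the seven swing states move together, and “one candidate wins all seven” comes up far more often than any particular split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: hypothetical predicted margins (Harris minus Trump, in points)
# for the seven swing states, plus polling error that is mostly shared.
predicted_margins = np.array([1.0, 0.8, 1.0, -0.5, -1.0, 0.3, -0.3])
N_SIMS = 100_000
shared_error = rng.normal(0.0, 3.0, size=(N_SIMS, 1))  # correlated, national-ish error
state_error  = rng.normal(0.0, 2.0, size=(N_SIMS, 7))  # state-specific noise
correlated   = predicted_margins + shared_error + state_error

# For contrast: independent errors with the same total variance per state.
independent = predicted_margins + rng.normal(0.0, np.hypot(3.0, 2.0), size=(N_SIMS, 7))

def sweep_rate(margins: np.ndarray) -> float:
    """Share of simulations in which one candidate carries all seven states."""
    return ((margins > 0).all(axis=1) | (margins < 0).all(axis=1)).mean()

print(f"Sweep probability, correlated errors:  {sweep_rate(correlated):.0%}")
print(f"Sweep probability, independent errors: {sweep_rate(independent):.0%}")
```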

There isn’t too much to say about the presidential forecast that hasn’t been said already. In general, our model said something virtually indistinguishable from every other good presidential model (FiveThirtyEight, Silver Bulletin, etc.), and every quantitative model got the election broadly “correct.” This is because we all rely heavily on polling, which was fairly solid this year, and make sure to model state and district correlations, with everyone having learned from the dangers of 2016.

The House and Senate are generally where quantitative models disagree the most.

Senate

This cycle, our model correctly predicted that the GOP would retake control of the Senate. On top of getting the outcome right, the forecast predicted the Senate composition almost perfectly, with one miss. The model had Republicans favored to flip Montana, Ohio, and West Virginia — enough to take a 52–48 Senate majority.

In the end, the GOP secured 53 Senate seats, which is just one more than our model projected. The “upset” came in Pennsylvania, where businessman Dave McCormick narrowly defeated Democratic incumbent Bob Casey. McCormick won by just 16,000 votes out of nearly 7 million cast, a margin of 0.2 percentage points. 

President-elect Trump won a majority of the vote in the Keystone State, while McCormick finished just shy of 49%. Though Casey outperformed Harris in much of the state, he didn’t do so by enough to win, and he received fewer raw votes than the Democratic presidential nominee.

While Casey’s loss may have come as a surprise to lay observers, our model showed the possibility of a Republican victory throughout the cycle. Pennsylvania was rated Leans Democratic, with McCormick having a one-in-three chance of upsetting Casey in our final forecast. 

Republicans Mike Rogers and Eric Hovde also did better than the polls suggested in the Michigan and Wisconsin races, respectively. Democrat Elissa Slotkin won by just 0.3 percentage points in Michigan (only 20,000 votes). In Wisconsin, incumbent Democrat Tammy Baldwin won by a slightly more comfortable 30,000 votes. Our model correctly predicted that these two races would be a bit closer than polling suggested, and consistently gave Republicans a decent chance at pulling upsets that almost materialized.

Arizona and Nevada, two states where the Republican Senate nominees struggled substantially in the polls, ended up being closer than expected. Democratic Senator Jacky Rosen won reelection by just under 2 points in Nevada, and Congressman Ruben Gallego defeated Republican Kari Lake by about the same margin in Arizona. The model correctly forecast the delta between the senatorial and presidential races, as Trump ended up carrying both states.

In safer states, it was more of a mixed bag. The model nailed Nebraska, Minnesota, Virginia, and New Mexico in both margin and outcome. But while it got the outcome correct in Texas, New York, New Jersey, and Florida, we underestimated the Republican performances there. The polling in these states was broadly quite poor, but we still wonder whether we could have done a bit better in modeling the margins there; we’ll be investigating how to do that for future cycles.

House

Going into the election, we predicted a tossup House, with our final model narrowly favoring Democrats to retake the chamber with a razor-thin 219–216 majority. In the end, we weren’t too far off the mark: Republicans are currently projected to control 220 House seats, while Democrats have won 215.

We “missed” just eight seats out of 435, for a call accuracy of 98.2%: PA-07, PA-08, NE-02, WA-03, AK-AL, CA-22, CO-08, and CA-45. Importantly, the model correctly considered all of these seats competitive throughout the cycle; in fact, all but PA-07 and PA-08 were tossups going into Election Day.

As we said the entire cycle, there is not much of a probabilistic difference between a 51% chance for Republicans and a 51% chance for Democrats, and a focus on binary outcomes can mask what the prediction is actually saying. For instance, in CA-45, we predicted Steel would win by less than a tenth of a percentage point; she ended up losing by two-tenths of a point, making the prediction exceptionally accurate (within 0.3 points on margin) even though we fell fractionally on the wrong side of 50%.
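To make that point concrete (using an assumed error spread for illustration, not our actual model’s parameters): with even a few points of uncertainty around a projected margin, a sub-0.1-point edge is barely better than a coin flip.

```python
from statistics import NormalDist

def leader_win_probability(predicted_margin: float, error_sd: float = 3.0) -> float:
    """P(the projected leader actually wins), assuming normally distributed
    error around the margin. error_sd is an illustrative assumption."""
    return 1 - NormalDist(mu=abs(predicted_margin), sigma=error_sd).cdf(0.0)

print(f"Projected +0.1 margin: {leader_win_probability(0.1):.1%} chance to win")  # ≈ 51%
print(f"Projected +2.0 margin: {leader_win_probability(2.0):.1%} chance to win")  # ≈ 75%
```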

Two of the seats we missed (CA-22 and AK-AL) probably would have been called correctly if we had used “expert ratings,” as some other modelers did. Doing so would entail factoring in assessments from more qualitative ratings outlets, such as the Cook Political Report, Sabato’s Crystal Ball, and Inside Elections. At the moment, we choose to ignore these ratings, which makes our model most similar to the FiveThirtyEight “Classic” model under Nate Silver.

There is nothing wrong with using expert ratings; in fact, FiveThirtyEight incorporated them under both Nate Silver and G. Elliott Morris, and doing so has empirically improved model accuracy in many prior cycles. But while we’ll explore this option in the future, we remain hesitant, for two main reasons: first, many ratings outlets look at models when rating races, creating a recursive feedback loop of sorts; and second, what makes our forecast unique is that it stands on its own, working from raw data and independent of inputs from other experts.

A Final Word

Since Split Ticket’s creation, we’ve strived to bring you a unique, data-centric, and straightforward approach to forecasting and covering elections, one that properly emphasizes how strange elections can get and highlights all their possible quirks. This cycle marked many firsts for us with respect to modeling and coverage. We were fortunate to engage and collaborate with outlets like Politico, MSNBC, and The New York Times, and we were honored to be featured by the BBC, CNN, and many other partners.

None of this could have happened without the support of our readers. Thank you for consistently reading and holding us accountable for typos, errors, and blind spots. If you’ve ever read, interacted with, or shared any of our articles, please know that it is appreciated, and that it’s a big reason we do what we do. We hope we’ve helped you become a little bit more informed about elections, free from beltway punditry and spin — just as you’ve kept us grounded and cognizant of the dangers of false confidence.

See you all soon.

I’m a computer scientist who has an interest in machine learning, politics, and electoral data. I’m a cofounder and partner at Split Ticket and make many kinds of election models. I graduated from UC Berkeley and work as a software & AI engineer. You can contact me at lakshya@splitticket.org

My name is Harrison Lavelle and I am a co-founder and partner at Split Ticket. I write about a variety of electoral topics and handle our Datawrapper visuals.

Contact me at @HWLavelleMaps or harrison@splitticket.org

I am an analyst specializing in elections and demography, as well as a student studying political science, sociology, and data science at Vanderbilt University. I use election data to make maps and graphics. In my spare time, you can usually find me somewhere on the Chesapeake Bay. You can find me at @maxtmcc on Twitter.

I make election maps! If you’re reading a Split Ticket article, then odds are you’ve seen one of them. I’m an engineering student at UCLA and electoral politics are a great way for me to exercise creativity away from schoolwork. I also run and love the outdoors!

You can contact me @politicsmaps on Twitter.
