Tuesday, July 9, 2013

The quick and easy (and effective) turnout model

After the polling debacle in the British Columbia election earlier this year, many pollsters identified turnout as a major source of the error. The problem was not the number of people who turned out, but rather who turned out. I looked at this issue in an article for The Globe and Mail shortly after the B.C. election, and a new article by Angus Reid for Maclean's identifies the same problem. In that article, he mentions that if Angus-Reid had used a turnout model based on who went to the polls in 2009, they would have given the NDP a three-point lead instead of the nine-point lead they actually awarded the party.

While that still would have been off, it would have been a lot closer and would have suggested that the race was getting too close to call. The egg on the faces of the pollsters (and those who base their work on polls, such as myself) would have been a little thinner.

Ever since the federal and provincial elections in 2011, I have played around with trying to guess at how the polls were going to miss turnout. I did this by adjusting the projection based on how the polls had been wrong in similar situations before. This helped me get very close to the results for the B.C. Greens and Conservatives in the election, and got me a little closer to the truth in Alberta (I probably would have had a Wildrose majority instead of a minority without it).

But the results haven't been satisfactory, and the B.C. election showed how necessary a good turnout adjustment is (Alberta had some other dynamics going on, not related to turnout). The 2011 federal election was also influenced by turnout, as it turned a Conservative minority government into a majority one.

I could try to design some complicated model for estimating turnout, but in most cases the simpler a forecasting model can be the better it is. One of the criticisms of Nate Silver's model is that it is overly complicated for only a slightly better performance than simpler alternatives. He has alluded to that himself, and discusses the benefits of simplicity in The Signal and the Noise (required reading).

I've determined that a very simple turnout model can, in fact, be quite effective. That model has to do entirely with age. Younger Canadians do not go to the polls in as large numbers as middle-aged Canadians, who do not vote in as large proportions as seniors. And, it seems, the younger Canadians who do vote tend to vote in similar ways to those who are older than them.

In order to reflect this, the simple turnout model I will be using in future elections (when the data is available, and as a separate calculation from the polling aggregation) employs the following formula: discard the results of those under the age of 35 and double the results from respondents 55 and older. That's it.
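For anyone who wants to try this at home, here is a rough Python sketch of the rule. It is an illustration, not the actual calculation behind the site. The age-group labels, the function name, and the poll numbers are all made up for the example. It also assumes that the two remaining age groups are otherwise treated as equal in size, so the adjusted number is simply (35-54 result + 2 x 55+ result) / 3.

```python
# Weights implied by the rule: drop under-35s, count 35-54 once, count 55+ twice.
# The age-group labels are assumptions; pollsters' brackets vary.
AGE_WEIGHTS = {"18-34": 0.0, "35-54": 1.0, "55+": 2.0}


def apply_turnout_model(results_by_age, weights=AGE_WEIGHTS):
    """Return turnout-adjusted support per party.

    results_by_age maps an age-group label to a {party: percent} dict.
    Assumes the age groups are otherwise treated as equal in size, so the
    adjusted figure is a plain weighted average of the group results.
    """
    total_weight = sum(weights[group] for group in results_by_age)
    if total_weight == 0:
        raise ValueError("no age group with a non-zero weight was supplied")

    parties = sorted({p for group in results_by_age.values() for p in group})
    adjusted = {}
    for party in parties:
        weighted_sum = sum(
            weights[group] * support.get(party, 0.0)
            for group, support in results_by_age.items()
        )
        adjusted[party] = weighted_sum / total_weight
    return adjusted


# Hypothetical poll, invented for illustration only.
example_poll = {
    "18-34": {"Party A": 50.0, "Party B": 30.0},
    "35-54": {"Party A": 44.0, "Party B": 38.0},
    "55+": {"Party A": 38.0, "Party B": 47.0},
}
adjusted = apply_turnout_model(example_poll)
```

In this invented poll, Party A's strength among the youngest respondents disappears from the adjusted numbers, and Party B's edge among those 55 and older counts twice. A pollster's own sample sizes per age group could be used in place of the equal-size assumption where they are reported.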

This is easy to do as most pollsters tend to report these age groups. When they use other definitions, a little tweaking needs to be employed. When they do not report these results, I will have to estimate them based on what other polls are showing or ignore them entirely. The results of this method are surprisingly good.

The chart below shows the difference between the final projection and what that final projection would have been if using this simple turnout model (ignoring polls without age breakdowns) in four recent elections. A few notes: in Alberta, I have compared the turnout projection to the unadjusted final projection - I was employing turnout adjustments for every party in that election already. And in Quebec, since Léger and CROP do not report age breakdowns, I have only applied the turnout model to Forum Research's final poll.

As you can see, in every case the turnout model performs better. In some cases, the difference is dramatic: total error would have been reduced by about a third in British Columbia and almost entirely erased in Ontario. The results in that province are particularly striking - no party would have been missed by more than 0.3 percentage points!
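A quick sketch of how a total error figure like this can be computed, assuming it means the sum across parties of the absolute gap between the projected and actual vote share. That definition is an assumption for the purposes of the example, and the party names and numbers below are invented, not taken from any of the four elections.

```python
def total_error(projection, result):
    """Sum of absolute differences, in percentage points, over the parties in result."""
    return sum(
        abs(projection.get(party, 0.0) - share)
        for party, share in result.items()
    )


# Hypothetical figures, invented for illustration only.
actual = {"Party A": 44.0, "Party B": 40.0, "Party C": 8.0}
baseline = {"Party A": 39.0, "Party B": 46.0, "Party C": 9.0}
with_turnout = {"Party A": 42.0, "Party B": 42.0, "Party C": 9.0}

baseline_error = total_error(baseline, actual)      # 5 + 6 + 1 points
turnout_error = total_error(with_turnout, actual)   # 2 + 2 + 1 points
```

Comparing the two totals for the same election is what allows a statement such as "reduced by about a third" to be made.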

For British Columbia, the difference is the most consequential since it would have changed the narrative from an easy NDP victory to a lead of only 2.3 points. In the case of Angus-Reid, if the firm had used my simple turnout model instead of a more complicated one based on 2009's turnout figures, they would have had the NDP's lead at only two points (42% to 40%) instead of the three reported in the Maclean's article. Ipsos-Reid would also have had a two-point lead (44% to 42%) instead of the seven-point gap they actually had, while Forum Research would have given the Liberals a two-point edge (43% to 41%) instead of the reverse.

In Quebec, Forum's adjusted numbers would have tied Léger's for the best performance, while in Alberta their final poll would have given Wildrose a one-point advantage instead of two points. The final polls of Angus-Reid and Abacus Data would have been made worse, however. Turnout was not the issue in Alberta, it seems.

UPDATE: I had forgotten that Léger did indeed release age breakdowns with their final Quebec poll (they normally do not release this information). As you can see in the discussion in the comments, applying the turnout model to the Léger and Forum polls and averaging them gives a very good result: 33% PQ, 30.5% PLQ, 27% CAQ, and 6% QS for a total error of only 1.5 points!

For Ontario, the results would have been simply astounding. The polls by EKOS Research, Angus-Reid, Abacus Data, and Ipsos-Reid would have all been closer with this very simple turnout model.

This demonstrates how Canadian election polling needs to rely on more dramatic turnout models in order to get closer to election results. This is problematic, however, since it implies that the pollster with the best model, and not the best methodology, would get the plaudits. In order to be acclaimed for having the best methods, pollsters should use a turnout model based on the questions in the poll itself, like Ipsos-Reid did in their most recent federal survey. If a pollster can estimate turnout correctly, as well as the results of the vote, they will have proven themselves exceedingly competent. This requires full reporting, however, as otherwise the public will not be able to determine if the model or the methodology won the day.

For ThreeHundredEight, this simple turnout model should provide readers with some decent expectations of results. At the very least, it should help prevent some surprises. But because the model is so simple, and verges more on a gimmick than anything based on voting behaviour research, I will only be including it as an extra piece of information. The site will abandon turnout modelling for the projection itself, and instead focus on polling averages and ranges based on how polls have been wrong before. Hopefully, this will give readers the best possible understanding of the dynamics of an ongoing campaign.