Friday, October 12, 2012

Bridging the gap between polls and results

The two elections in Alberta and Quebec in 2012 provided a very good demonstration of the limitations of polling, particularly in the current media landscape. While the polls themselves have had their issues, media outlets have difficulty affording the kind of costly, in-depth polling that is needed to get a clearer and more accurate picture of what is going on.

It can be frustrating for poll-watchers, especially when trying to track the trends during a campaign and forecast an outcome. The greatest utility of a site like ThreeHundredEight.com is not prediction, but rather a glimpse of what is going on during an election campaign: a good idea of what might happen and, afterwards, some clues as to what actually happened.

Nevertheless, in addition to tracking what the polls are saying, the site has always tried to forecast what the result of an election will be according to those polls. When the polling is good (as in the 2008 Quebec and 2011 Manitoba elections), the forecast is good. But forecasting becomes exceedingly difficult, and nearly impossible, when the polls do not reflect the final outcome.

Bridging the gap between what the polls say and what voters actually do is a challenge that needs to be tackled.

In both the Alberta and Quebec campaigns, I adjusted the polling results in order to try to bridge that gap. The system used was one that would work in the majority of cases but not every time, meaning the model was making the safest bet but was still gambling. In the case of the Quebec election, betting that the polls would under-estimate the Liberals and over-estimate Québec Solidaire was correct. Betting that they would under-estimate the Parti Québécois was not. In Alberta, the adjustments for the Progressive Conservatives and Wildrose were correct, but they needed to have been amped up to ridiculous (and, at the time, inexplicable) degrees.

Rather than continue with this sort of uniform adjustment, I will be using a little more subtlety and only making the kind of bets that will prove correct in the vast majority of cases. For example, in 16 of the 17 provincial and federal elections I have analyzed, a party without a seat in the legislature was greatly over-estimated by the polls. That is the kind of safe bet I am willing to make. Other adjustments might be made as well, especially when a jurisdiction's history shows a stronger skew than the national average. Any adjustments made for future projections will be clearly laid out from the outset.

In addition to this, there are other kinds of adjustments that simply must be made. For example, fringe parties are very predictable: their share of the vote in the ridings where they run candidates does not change by more than a couple of tenths of a percentage point (in Alberta and Quebec, the projected vote for "other" parties was off by 0.1 point). It is thus very easy to estimate how much of the vote fringe parties and independents will take province- or nation-wide based on the number of candidates they have on the ballot. When the polls give the "Others" 2% of the vote instead of the 1% that their number of candidates might dictate, an adjustment is automatically made to distribute that extra percentage point proportionately among the other parties.
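As a rough illustration of that redistribution step, here is a minimal sketch in Python. The function name, party labels, and numbers are hypothetical, not the site's actual code; it simply shaves the excess off the "Others" share and hands it back to the remaining parties in proportion to their own support.

```python
def redistribute_excess_others(poll, others_polled, others_expected):
    """Hand the excess 'Others' support back to the listed parties proportionately.

    poll            -- dict of party -> polled share in points, excluding Others
    others_polled   -- the 'Others' share reported by the polls (e.g. 2.0)
    others_expected -- the share the fringe candidates on the ballot can plausibly win (e.g. 1.0)
    """
    excess = max(others_polled - others_expected, 0.0)
    total = sum(poll.values())
    # Each party gains a slice of the excess proportional to its own support.
    return {party: share + excess * share / total for party, share in poll.items()}

# Hypothetical example: the polls give the Others 2% when the number of
# candidates suggests roughly 1%, so one point is redistributed.
poll = {"Party A": 36.0, "Party B": 33.0, "Party C": 24.0, "Party D": 5.0}
adjusted = redistribute_excess_others(poll, others_polled=2.0, others_expected=1.0)
print(adjusted)  # Party A rises to about 36.4; the adjusted shares plus 1% for Others sum to 100
```

The same mechanic could, in principle, serve the seatless-party correction described above: shave an assumed amount off that party's polled share and redistribute it in the same proportional way.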

In the future, as was the case during the Quebec election, the poll average will be reported alongside the projection. But the poll average will not diverge from the projection to a significant degree, as the major parties are only likely to gain a few tenths of a percentage point from the corrections for fringe parties and larger parties without a seat in the legislature.

But there needs to be more than just this - a recognition of what we know (what the polls are saying) and what we don't know (how opinion will change, and what kind of error is creeping into the polling).

As my longtime readers know, I have always admired the work done by Nate Silver at FiveThirtyEight, and his methods have been a big inspiration for what I do. But our electoral systems are very different, so forecasting has to be done quite differently in our respective countries (538 struggled when it ventured into the 2010 British election, which uses the same sort of system as here in Canada). For one, US elections are much more alike from one cycle to the next: they take place at the same time every four years, there are (almost) always just two major candidates from two parties with relatively stable voting blocs, and there are economic issues that influence an incumbent's or challenger's ability to win. Because of this, Silver has always focused on forecasting a future event, whereas this site has always been about translating what the polls say would happen if an election took place "today".

One new feature that Silver has implemented is a "now-cast": an estimate of the outcome that could be expected if the US election were held today rather than on November 6. I like the idea of producing a "now-cast" as well as a forecast, so that is what this site will be doing too.

But how to forecast a future event with polls that are often a few days old as soon as they are published? Unlike in the United States, where voting intentions can only go one way or the other and past elections provide a guide on how to forecast that, in Canada there are multiple parties and votes can go any which way. Whereas the US convention schedule at the end of the summer gives forecasters a good idea of what kind of surges to expect in August, Canadian elections can take place at any time of the year. What changes can we expect three months prior to an election in February compared to another held in September? And what about the rise of new parties and the disappearance of old ones?

All of this means that to forecast for a future event in Canada a different approach needs to be taken. The question of how to bridge the gap between the polls today and the election weeks or months from now remains.

I first investigated using the number of undecideds in a poll as a basis for forecasting. Taking the undecideds into account might give a clue as to what kind of swing is still possible between any given day and the election.

But the examples of the Quebec and Alberta elections suggest that this sort of approach is not feasible. To begin with, different polling methodologies result in widely divergent numbers of undecideds. The final IVR Forum poll in the Quebec campaign had only 3% undecided, while Léger's online poll had 9% and CROP's telephone poll had 20%. Determining the true number of undecideds is, thus, quite difficult.

And what can be done with those numbers? There is no way to arrive at the election's result by redistributing the mere 3% of undecideds in Forum's poll. With CROP, 50% of the undecideds need to be given to the Liberals to reach their actual result, while that share rises to 70% in the Léger poll.

In Alberta, the same sort of discrepancy between the numbers of undecideds occurred. One poll would require a 16-point swing on 9% undecided to get to the true result, while the final Léger poll (live-callers) would need 71% of its undecideds to go to the Tories.
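To make the arithmetic behind those figures explicit, here is a minimal sketch. It assumes the headline poll numbers are shares of decided voters only, and the inputs are illustrative rather than the actual CROP, Léger, or Forum numbers.

```python
def required_undecided_share(decided_share, undecided_fraction, actual_result):
    """Fraction of undecideds a party would need in order to match its actual result.

    decided_share      -- party's share among decided voters (e.g. 0.28)
    undecided_fraction -- undecided share of the whole sample (e.g. 0.20)
    actual_result      -- party's actual share of the vote (e.g. 0.31)
    """
    # actual = decided_share * (1 - undecided_fraction) + s * undecided_fraction,
    # solved for s, the share of undecideds breaking to the party.
    return (actual_result - decided_share * (1 - undecided_fraction)) / undecided_fraction

# With 20% undecided, a party polling at 28% needs a bit under half of the
# undecideds to reach 31%; with only 3% undecided, no split of the undecideds
# can close the same gap.
print(required_undecided_share(0.28, 0.20, 0.31))  # ~0.43
print(required_undecided_share(0.28, 0.03, 0.31))  # ~1.28, i.e. more than all of them
```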

Clearly, undecideds are not the source of all of the error between final polls and final results. Using them as the basis for forecasts will thus not be fruitful.

That brings me to a different approach for making forecasts, one based on the simple (and intuitive) idea that polls can be expected to swing as much in the future as they have in the past. That is to say, if the polls show that a party's support can swing by 10 points over two months, we can assume that the party's support could swing by 10 points over the next two months. This method is not dissimilar to the one I have used to calculate the projection ranges: it uses past volatility to predict future volatility.

Forecasting the Quebec election
This method would have had good results in the Quebec election. The forecast range for the PQ from June through to the September election would always have included the party's eventual result. The same would have been true for the CAQ throughout almost all of the campaign, while the PLQ would have dipped only slightly below its forecast range in the final days. But that kind of polling error is the sort that simply cannot be corrected without making baseless guesses.

As you can see on the chart, the range produced by this sort of forecasting method generally narrows as the election approaches, but widens if things start to become more volatile. I think this is a good representation both of what can be expected in the future and of the kind of uncertainty that exists during a campaign. It might be a very wide bar, but it is a recognition of what we don't know and it should give a very good indication of the likely outcomes.

Forecasting the Alberta election
The polling failures of the Alberta campaign would still have put the forecast off, but the error would have been far smaller than it was at the time. Even so, it has to be accepted that no amount of reasonable adjustment or correction could have called the Alberta election correctly with the information available.

The forecast range will be determined using the greatest difference between two polls going back as far as the amount of time remaining before the election. In other words, if the election is two weeks away, the forecast range will be calculated using the largest discrepancy between two polls taken over the preceding two weeks. A simple mechanism to ensure that the results aren't too outlandish (or that they dip below 0%) will be to establish a floor and a ceiling: either 50% above or below the projected result, or the highest or lowest poll in the allotted time period, whichever is higher (for the ceiling) or lower (for the floor).

This will prevent the forecast from swinging extraordinarily widely based on two outlier polls, but will also allow the forecast to move accordingly when other polls back up a new state of affairs.
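A minimal sketch of that rule, with my own assumptions filled in: polls are simple (date, share) pairs, the biggest gap between polls in the window is applied symmetrically around the projected share, and the floor and ceiling are read as min(50% below the projection, lowest poll) and max(50% above the projection, highest poll). None of this is the site's actual code.

```python
from datetime import date, timedelta

def forecast_range(polls, projection, election_day, today):
    """Return a (low, high) forecast range for one party's support, in points.

    polls        -- list of (poll_date, share) tuples for the party
    projection   -- the current projected share for the party
    election_day -- date of the vote
    today        -- date the forecast is made
    """
    days_remaining = (election_day - today).days
    window_start = today - timedelta(days=days_remaining)

    # Only polls from a window as long as the time left before the vote count.
    recent = [share for poll_date, share in polls if poll_date >= window_start]
    if not recent:
        return projection, projection

    # Greatest difference between any two polls in the window.
    spread = max(recent) - min(recent)

    # Floor and ceiling: 50% below/above the projection, or the lowest/highest
    # poll in the window, whichever is lower (floor) or higher (ceiling).
    floor = min(projection * 0.5, min(recent))
    ceiling = max(projection * 1.5, max(recent))

    low = max(projection - spread, floor, 0.0)
    high = min(projection + spread, ceiling)
    return low, high

# Hypothetical usage: four weeks before the vote, with polls over the
# preceding month putting the party between 30% and 38%.
polls = [(date(2012, 9, 20), 30.0), (date(2012, 10, 1), 34.0), (date(2012, 10, 10), 38.0)]
print(forecast_range(polls, 35.0, election_day=date(2012, 11, 7), today=date(2012, 10, 10)))  # (27.0, 43.0)
```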

In addition to this method for making a forecast, a "now-cast" will be calculated using ThreeHundredEight.com's vote projection methodology that has been in place for some time. A "now-cast" range will be calculated using the margin of error associated with the average sample size of the polls in the projection. This will provide two useful measures: one, an indication of how much might be expected to change before the vote takes place; the other, a measurement of what the polls say will occur, taking into account standard sampling error. How these fluctuate in the run-up to a campaign should provide a good picture of how things are unfolding.
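For the now-cast range, here is a minimal sketch under the assumption that "margin of error of the average sample size" means the textbook 95% margin of error for a proportion; the exact formula is not spelled out here and the numbers below are illustrative.

```python
from math import sqrt

def now_cast_range(projected_share, average_sample_size, z=1.96):
    """Return a (low, high) now-cast range from simple sampling error.

    projected_share     -- party's projected share as a fraction (e.g. 0.32)
    average_sample_size -- average sample size of the polls in the projection
    z                   -- z-score for the confidence level (1.96 for 95%)
    """
    moe = z * sqrt(projected_share * (1 - projected_share) / average_sample_size)
    return projected_share - moe, projected_share + moe

# A party projected at 32% on an average sample of 1,000 respondents gets a
# now-cast range of roughly 29% to 35%.
print(now_cast_range(0.32, 1000))
```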

The projection model for the upcoming election in British Columbia will be launched some time next month, so in the interim I welcome questions, suggestions, and input before it is finalized for the launch. In the next few months, I will also be presenting some other changes to how polls are handled and how the projection will be calculated.