Can we trust the polls?

Isn't that the question? People in BC most likely remember the 2013 election, where literally every poll had the BC NDP ahead (sometimes by a large margin) and the BC Liberals ultimately won by almost 5 points. Along with the Alberta election of 2012, this remains one of the biggest polling mistakes in Canadian history. More recently, there was the Brexit referendum (where the polls were mostly predicting "Remain" to win) or, of course, the election of Trump (though there the average error was much smaller than for Alberta or BC). With that said, the polls in France last Sunday were spot on, which resulted in my projections also being spot on (yay!).

But how accurate are polls on average? In this post, I'm looking at exactly that. The idea is to find what I call the real margins of error: the ones accounting not only for sampling variation - the plus or minus 3%, 19 times out of 20, as reported by the polls - but also for estimation bias and the other sources of uncertainty.

Before we start, let's notice that the margins of error reported by the polls are pretty much useless. First of all, with a lot of polls being conducted online, the samples aren't truly random and the basic theory of statistics doesn't really apply (note: this doesn't mean online polls don't work; the French polls were all done online, for instance). Secondly, the margins reported are the theoretical ones for a party at 50% support. Yes, that's right, the margins actually vary with the level of support, so a party at 40% will have larger margins than a party at 5%. But literally no pollster in this country would ever tell you that. Finally, as mentioned above, they measure only the sampling error. I'll argue that we don't really care much about that, not when we have many polls that we can aggregate. Fact is, if the only source of error were indeed the sampling one, poll aggregators like me should be able to nail every single election. But we clearly don't (sorry...). Why? Because measuring voting intentions has other sources of variation. People can change their mind, they can lie, they can refuse to answer, etc. All of these create a potential bias.
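To see how the theoretical sampling margin shrinks away from 50%, here is a minimal sketch using the standard formula for a proportion (the 1,000-respondent sample size is just an assumption for illustration):

```python
import math

def moe(p, n, z=1.96):
    """Theoretical 95% sampling margin of error for a share p with sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical sample of 1,000 respondents (assumption for illustration)
for share in (0.50, 0.40, 0.05):
    print(f"{share:.0%}: +/- {100 * moe(share, 1000):.1f} pts")
```

A party at 5% gets a margin of about 1.4 points with this sample size, less than half the roughly 3.1 points that applies at 50% - which is why reporting a single "plus or minus 3%" for every party overstates the sampling uncertainty for the small ones.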

So what I did was take the latest polls from a few recent elections in this country (the last 3 federal elections, Alberta 2012, BC 2013, Ontario 2014, Quebec 2014, Alberta 2015). It's not a complete sample - I could have added the 2012 Quebec election, where the polls were also off, and I'm missing some elections in smaller provinces. But hey, it's already a good source of data, and it should be enough to give us a good idea of the average accuracy of Canadian polls.

For each election, I calculated the poll average for polls conducted during the last week of the campaign (without any adjustment from me) and compared it to the actual results. Then I calculated the Mean Square Error, a statistical measure of the average error (it's technically the variance for an unbiased estimator). Taking the square root and multiplying it by 1.96 gives us our effective margins of error at 95%. Note that I only looked at the error for the top parties (usually the top 4 or 5 parties in each province - in other words, the ones included in the polls). I also calculated the average absolute error (if the polls had a party at 40% but it got 42%, the error is 2 points).
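In code, the procedure amounts to something like this (the error values below are made up for illustration; the real inputs are the gaps between the final poll average and the result for each party in each election):

```python
import math

# Hypothetical poll-average-minus-result gaps, in percentage points,
# one per party/election (made-up numbers for illustration)
errors = [2.0, -1.5, 3.0, -0.5, 2.5, -2.0]

mse = sum(e ** 2 for e in errors) / len(errors)             # Mean Square Error
effective_moe = 1.96 * math.sqrt(mse)                        # effective 95% margin
mean_abs_error = sum(abs(e) for e in errors) / len(errors)   # average absolute error

print(f"MSE: {mse:.2f}, effective MoE: {effective_moe:.2f} pts, "
      f"mean absolute error: {mean_abs_error:.2f} pts")
```

Note that squaring before averaging weighs the big misses (Alberta 2012, BC 2013) more heavily than the average absolute error does, which is part of why the effective margin comes out noticeably wider.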

Results are in the table below.

Source: own calculations based on the polls and elections results in Alberta 2012, BC 2013, Ontario and Quebec 2014 as well as the last three federal elections.

As you can see, the actual margins of error of a typical Canadian poll are relatively large - much larger than what the standard margins of error would predict (again, if there were only sampling error, taking the average of 6-7 polls should give us almost perfect estimates). Even if we exclude the two obvious mistakes (Alberta 2012, BC 2013), polls aren't that accurate - although they appear more accurate than French polls (which aren't done through random sampling).

What the margins above mean is that even after aggregating the polls, your estimates are still likely off. Another way to put it is that on average, polls are off by 2.09 points for each party. Make no mistake, that's a good level of accuracy. But when we translate these polls into seats, if the gap between two parties is over- or underestimated by 4 points (2 x 2 pts), this can make a huge difference. Each point can represent as many as 5-6 seats if a party is in the "paying zone" (above 25%, usually).

I wrote an article a couple of years ago using data from Abacus (whose CEO, David Coletto, was kind enough to give me access to their raw sampling data). I showed that if we accounted for the fact that people could change their mind, the actual margins of error should be closer to 7.5%. This lines up relatively well with my findings here, and it is consistent with the findings in the US.

So why are polls sometimes off? Well, if only we knew for sure! However, David Coletto and I looked into it and found that polls had a tendency to be more wrong when there was a big change in turnout between elections. Neither the polling method nor the sample sizes had a significant impact on overall polling accuracy - just the change in turnout.

So will it happen again this year in BC? We can't tell. I personally don't think this campaign will generate an increase in turnout the way the last federal election did - I'm not even sure turnout will go up at all; there just isn't the same enthusiasm. But we'll know better once we get the turnout for the advance voting that begins this weekend. In any case, this article should show why using simulations is so important and why uncertainty will always be present when forecasting an election - not only for the vote shares, but even more so for the seat projections.

By the way, for the projections and simulations, I go with margins of error of 4.8 pts, slightly less than the estimates here. I do it mostly for two reasons. The first is that my own polling average is typically better than the raw, pure polling average used here. The second is that the numbers in this post should be seen as an upper bound. In particular, I calculated the Mean Square Error pooled across all the years and then took the square root of this average. I could instead have taken the square root for each year and then averaged those. As you may know, the average of the square roots is less than the square root of the average. Doing so would yield margins of error of 4.96% (or 2.98% without Alberta and BC).
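That pooled-versus-per-year distinction can be checked directly: with any per-election MSEs (the values below are made up), the average of the square roots never exceeds the square root of the average, because the square root is concave (Jensen's inequality):

```python
import math

mses = [9.0, 1.0, 4.0]  # hypothetical per-election Mean Square Errors

pooled = 1.96 * math.sqrt(sum(mses) / len(mses))               # sqrt of the average
per_year = 1.96 * sum(math.sqrt(m) for m in mses) / len(mses)  # average of the sqrts

assert per_year <= pooled
print(f"pooled: {pooled:.2f} pts, per-year: {per_year:.2f} pts")
```

Pooling first is the more conservative choice, which is why the 4.96% figure above sits above the per-year alternative.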