I just uploaded version 2.0 of the 2021 simulator. The big change is the addition of demographic variables. I have also made some modifications for situations such as Sloan running as an independent or JWR not running again. But those are details, really.

Why add demographic variables?

Some of you might actually be surprised that the previous model wasn't using any demographic variables. That's understandable. But you need to realize that adding such variables isn't trivial. Yes, it makes sense on paper to want to calibrate the model based on the gender gap or the evolution of voting intentions among 18-34 year olds, but it's quite a lot more complicated in practice. And most models out there don't use any (308/CBC) or don't disclose much of what they do (338/Lean Tossup).

The main challenge is that we don't have results by demographics, only by riding. We don't actually know how women voted in 2019, or how the 18-34 voted. Yes, we have polls, but polls are not perfectly accurate. Also, most polls only include gender and age and nothing else. Very few pollsters include information about education or income. Literally none will include a breakdown by home ownership even though it is an important predictor of the vote.

Given that my model (like every single one out there, I believe) uses the past election results as the base and applies the swing observed in the polls since then, we really need to find a way to get the results broken down by demographics. Thankfully, the Canadian Election Study (CES) can help us there. This massive survey has a post-election component where they ask people how they voted. It's still not perfect, as people can lie or misremember, but it's an improvement. More importantly, I can get the results by income, education or any other variable I want. I ultimately limited myself to age, gender, education and income, as we get this info in some polls. There is no point in adding variables that are never included in any poll; they wouldn't be useful for projections.

Because even such a survey isn't 100% accurate, I decided to set up my demographic adjustments as differentials. I'm not using the actual levels for (say) the Liberals among (say) men and women, I'm using the differential. For 2019, Canada-wide, the Liberals got 34% of the vote but were at 35% among women and 33% among men, so differentials of +1 and -1. Doing so allows me to use polls and the CES even though they aren't 100% accurate (in the post-election CES, the Liberals were overestimated). Surprisingly, I discovered that polls were pretty good at estimating such differentials. I guess that while polls can make mistakes (over or underestimating a party, probably due to turnout), they are better at estimating the relative levels. That is a very important piece of good news, as it means polls can reliably be used to feed such variables into predictions.
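To make the idea of differentials concrete, here is a minimal Python sketch using the 2019 Canada-wide Liberal numbers quoted above. The function name and the 3-point survey bias in the second call are made up for illustration; this is not the simulator's actual code.

```python
def differentials(overall, by_group):
    """Return each group's gap to the overall share, in percentage points."""
    return {group: share - overall for group, share in by_group.items()}

# 2019 Canada-wide Liberal figures quoted in the text.
lpc_2019 = differentials(34, {"women": 35, "men": 33})
print(lpc_2019)  # {'women': 1, 'men': -1}

# Hypothetical survey that overestimates the Liberals by 3 points in every group:
# the levels are wrong, but the differentials are unchanged.
lpc_biased = differentials(37, {"women": 38, "men": 36})
print(lpc_biased)  # {'women': 1, 'men': -1}
```

A uniform error in the levels cancels out when you subtract, which is one way to see why differentials can be more robust than levels.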

I also use census data, which gives detailed breakdowns for each demographic (for instance, what percentage of voters in Rosemont are aged 18-34). Notice, however, that the census doesn't give us joint distributions, only marginal ones. It's a potentially serious limitation, but there isn't much that I can do about it.

How is it done?

A very quick summary would be:

1) The province-wide percentages that you enter will determine how many votes a party will get

If you believe the Liberals are up in one province by (say) 4 points, then input them at 46% in Ontario.

2) The demographic adjustments will influence who and where those voters are.

If you believe this swing is mostly concentrated among women and educated voters, then increase their differentials.

I decided to take a two-step approach. The first step is the same as it has always been: we use polls (or a guess) to set the province-wide percentages for each party. Let's say, for instance, that the Liberals are up 5 points in Ontario since 2019. Once we input this, the average swing has to add up to 5 points. It doesn't matter if the swing is coming from the Liberals doing better among women or among older people, it has to be 5 points. That's what we told the model to do!

The demographic variables are only used to allocate those 5 points across ridings. If you believe the Liberals are up by 5 mostly thanks to an increase among women voters, then the model will need to increase the Liberals more in ridings with more women. That makes sense, right?

The model first applies the province-wide swings (currently uniformly, but regional adjustments could come later). It then applies the demographic adjustments, one by one (since I don't have the joint distribution). If the adjustments don't sum to zero, the model takes the necessary steps to ensure the final, net average swing is consistent with the provincial one. See the following example to understand the last part.

Let's imagine that you believe that the Liberals are up 4 points among women. That's the only change. In every other demographic, they are stable compared to 2019. If you only enter '+4' in the simulator for LPC-women, the model will increase the Liberals by "share of voters that are women * 4 points" in every riding. On average this is obviously 2 points (ridings don't vary that much in terms of the men to women ratio and the average share is almost 50%; let's assume it's 50% for the sake of illustration). But, and this is important, if you didn't increase the Liberals by 2 points province-wide (i.e. overall, they are still at 42%), then the model will apply a uniform swing of minus 2 everywhere. That is logical. The model has to do this to keep the province-wide swing consistent with the province-wide numbers you entered. So, overall the Liberals remain at 42% but their vote is now skewing towards women, meaning the riding numbers have changed (some have increased, some have decreased).

What you need to do, if you really believe the one change is Liberals +4 among women and everything else is the same, is to enter the Liberals at +2 province-wide (so Liberals at 44% in Ontario) and +4 among women. This will work and achieve your objectives. [Note: before some smartass on Reddit notices the error and declares me illiterate with numbers, yes, the truly correct way would be the following: increase the Liberals by 2 points in the province, then put LPC-women at +2 and LPC-men at -2. This is consistent with the Liberals being at 42% among men, 46% among women, thus 44 overall. But it's getting complicated and, for all intents and purposes, simply putting women at +4 will do the exact same job.]
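The example above can be sketched in a few lines of Python. This is only an illustration of the logic described in the text, not the simulator's code: the function name, the three riding results and the shares of women are all made up, and ridings are assumed to be equal in size so that a plain mean stands in for the province-wide figure.

```python
def apply_swings(base, share_women, provincial_swing, women_adj):
    """Uniform swing, then a women adjustment, then a uniform correction.

    base: 2019 party share per riding, in points (hypothetical values).
    share_women: share of voters who are women in each riding.
    """
    # Step 1: uniform provincial swing in every riding.
    after_uniform = [b + provincial_swing for b in base]
    # Step 2: demographic adjustment, scaled by each riding's share of women.
    after_demo = [v + s * women_adj for v, s in zip(after_uniform, share_women)]
    # Step 3: uniform correction so the average net swing equals the provincial swing.
    net = sum(a - b for a, b in zip(after_demo, base)) / len(base)
    correction = provincial_swing - net
    return [v + correction for v in after_demo]

base = [40.0, 42.0, 44.0]         # made-up ridings averaging 42%
share_women = [0.48, 0.50, 0.52]  # made-up shares averaging 50%

# +4 among women, province-wide number left at 42%:
# the average stays at 42 but the ridings move apart.
unchanged = apply_swings(base, share_women, provincial_swing=0.0, women_adj=4.0)

# +4 among women and +2 province-wide: the average rises to 44.
raised = apply_swings(base, share_women, provincial_swing=2.0, women_adj=4.0)

print([round(v, 2) for v in unchanged])
print([round(v, 2) for v in raised])
```

In the first call the riding with the fewest women ends up slightly below its 2019 level and the one with the most women slightly above, which is the "some have increased, some have decreased" effect.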

The most important numbers remain the province-wide percentages, not the demographic ones. I spent a lot of time thinking about it and it is the best solution in my opinion. Why? Well because the key numbers you need to get right in order to be accurate with your seat projections are the province-wide percentages. In 2019, if you knew that the Liberals would be at 42% in Ontario and the NDP only at 17% (so a much bigger gap than what the polls showed), I can guarantee you that any model would have given you mostly the right results. You wouldn't have needed to look at whether the Liberals were doing better or worse among the 18-34 or if they increased their lead among the university educated. No, get the province-wide percentages right and you are 95% there. This is just a fact. You can spin your model as being super sophisticated and what not but the sad truth is: whoever guesses the province-wide percentages best will have the best seat projections. Yes your model would have been even more accurate if you took into account that the Liberals were gaining more among the 55+ and the university educated, but the improvements are very marginal compared to having the correct province-wide percentages. Not even close.

Keeping that in mind, I wanted my model to remain mostly dependent on those province-wide percentages. The demographic variables provide adjustments but nothing else. When you are using the simulator, I'd suggest that you first enter the percentages by province (based on polls or what you believe will happen). Then scroll down and make adjustments based on demographics (i.e. you think one party is doing better or worse among some demo). Or just play with the demographic adjustments and see if you can make a party's vote more or less efficient (if you are a Conservative looking to make gains in the GTA, I suspect that improving your score with the university educated might help).

What if I don't do the math right?

What happens if you enter the Liberals at 38% (drop of 4 points province-wide) but your only change in the demographic adjustments is LPC-women at +4? Well, this is clearly inconsistent but I also don't expect you to do a ton of math before using the simulator. That kind of defeats the purpose. If the LPC is up by 4 among women, this party should be up by 2 province-wide, not down by 4. That's an inconsistency gap of 6 points (+2 vs -4).

The model will first apply -4 in every riding. Then it'll apply +4 * share of women in every riding, which adds +2 on average and leaves the net average swing at -2 instead of the -4 you entered. The model will thus apply another uniform swing of -2 everywhere. At the end, the objective is 1) to have the Liberals down by 4 overall and 2) to have the LPC electorate now skewing more towards women. As long as both objectives are reached, I'm happy. All of this is done in a simulator that can be used without a PhD in stats.
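Here is a small Python sketch of that inconsistent case, written so that several demographic adjustments could be passed at once and added one by one. Everything in it is an assumption for illustration (function name, riding shares, equal-sized ridings); it only mirrors the arithmetic described above.

```python
def net_riding_swings(provincial_swing, adjustments, riding_shares):
    """Return (net swing per riding, uniform correction applied).

    adjustments: {group: points}, e.g. {"women": 4.0}.
    riding_shares: one {group: share of voters} dict per riding (hypothetical data).
    """
    # Demographic effect in each riding: adjustments are simply added one by one.
    demo = [sum(points * shares[group] for group, points in adjustments.items())
            for shares in riding_shares]
    # The uniform correction cancels the average demographic effect,
    # so the average net swing is exactly the provincial swing entered.
    correction = -sum(demo) / len(demo)
    swings = [provincial_swing + d + correction for d in demo]
    return swings, correction

ridings = [{"women": 0.48}, {"women": 0.50}, {"women": 0.52}]  # made-up shares

# Liberals entered at -4 province-wide, but +4 among women.
swings, correction = net_riding_swings(-4.0, {"women": 4.0}, ridings)

print(round(correction, 2))            # the extra uniform swing
print([round(s, 2) for s in swings])   # net swing in each riding
```

With these inputs the correction comes out at -2 and the riding swings average -4, while ridings with more women lose slightly less than ridings with fewer.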

What does it mean? It means you don't have to do math before using the simulator. The model allows you to be inconsistent and will take care of it. Just remember that the actual, final provincial swing (and therefore average net swing) is given by the province-wide numbers you enter. As it should be.

At the end of the day, enter the province-wide numbers and then play with the demographic adjustments to see if gaining among the 18-34 would (for instance) make the NDP vote more or less efficient. And don't worry if your demographic adjustments don't sum to zero or don't average out to the provincial swing, it's all fine. If you enter numbers based on a polling average, those inconsistencies should not be present anyway (in theory). Either way, the model will take care of this.

What if I leave all the adjustments at zero?

In this case, the model assumes the composition of each party's electorate hasn't changed since 2019. The CPC, for instance, was at 33% in Ontario but at 37% among men and 29% among women (so +4 and -4 differentials). If you enter the CPC at 35% province-wide and don't change anything else, then the CPC is still at +4 among men, so 39%. The demographic adjustments are really just that, adjustments. And they are relative. In this case, the model increases the CPC by 2 points everywhere, with no special adjustment for ridings with more women, for instance. Remember, the first step decides how many votes a party receives; the demographic adjustments change who those voters are. And everything is done with 2019 as the base. So if you leave the adjustments at zero, you are assuming the CPC voters will continue to skew older and more male, with fewer university degrees (but higher income!), while the NDP vote will keep skewing young and female. Leaving the numbers at zero doesn't mean you believe there is no gender (or education, etc.) gap, just that the gap for this party is the same as it was in 2019.
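A tiny Python illustration of the zero-adjustment case, using the CPC numbers from the paragraph above. The function name is hypothetical and the sketch only covers the case where every demographic adjustment is left at zero.

```python
def implied_group_levels(province_share, base_differentials):
    """Group levels implied by a province-wide share when adjustments are all zero."""
    return {group: province_share + diff for group, diff in base_differentials.items()}

# 2019 Ontario CPC differentials quoted in the text.
cpc_2019_diff = {"men": 4, "women": -4}

print(implied_group_levels(33, cpc_2019_diff))  # {'men': 37, 'women': 29}
print(implied_group_levels(35, cpc_2019_diff))  # {'men': 39, 'women': 31}
```

The 8-point gender gap is the same in both lines; only the level moves.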

What else?

The model currently uses a simple uniform swing model. I'll likely add regional effects soon but I was more focused on the demographic variables. Adding regional coefficients doesn't make a big difference anyway.

I'll also refine the demographic estimates from 2019 over the next few weeks. I'd like, in particular, to have different ones for Quebec and the ROC. But polls rarely provide such breakdowns.

The 2019 federal election wasn't particularly exciting, with very little vision offered. That was particularly the case with the two main parties, which had quite boring platforms. As a result, the campaign was mostly marked by scandals and tangentially related topics such as SNC-Lavalin or Doug Ford.

Polls don't usually allow us to measure the impact of one variable or event on the voting intentions. They can show us that, for instance, people didn't like Doug Ford, but they almost never provide any measure of how much it impacted the voting intentions.

Using the CES (Canadian Election Study) of 2019, I decided to try to answer some of those questions. This is the first post in a series. Today's topic: did Doug Ford cost Andrew Scheer the job of PM? And did SNC-Lavalin cost Trudeau his majority? I use the online sample, which contained all the information I needed. I estimated the regressions below. My two variables of interest were, naturally, "premiergood", which is equal to 1 if the respondent expressed satisfaction with their provincial government (either "very satisfied" or "satisfied"), and "sncgood", which measured the handling of the SNC-Lavalin story (again, either "very well" or "well").

I naturally needed to control for many other variables susceptible to influence voting intentions. I included the usual (age, gender, having children, education, income), the satisfaction with the federal government as well as the ratings given to the LPC, CPC, NDP and Green and their respective leaders. There is also one variable measuring the left-right orientation of the respondent (self identified). I believe this is, overall, a pretty standard and robust specification. Would it pass peer-review for publications? Likely not. But as a simple test for a blog post? Sure, that'll do.

I estimated the regression for Ontario only, here is the table in Stata.

As you can see, for the Liberals, a few variables mattered. Income and university education are both significantly correlated with voting Liberal. Being satisfied with the federal government has by far the strongest impact (a whopping 18.5 percentage points! Yes, I'm aware I should be using a Logit or Probit here given my dependent variable, but I want easily interpretable coefficients), followed by giving a high rating to the Liberal party.

Being satisfied with the government of Doug Ford is associated with a 6 points drop in voting intentions for the Liberals. On the other hand, people who thought Trudeau handled the SNC-Lavalin story well were voting for him 6 points more.

Estimating the same regression for the Conservatives gave the following results:

Being satisfied with the Ford government increased voting intentions for the CPC by 12 points! The SNC-Lavalin affair had the same impact as for the Liberals (just in the opposite direction, obviously).

The variable cps19_lr_scale_bef measures the left-right orientation. So right-wing people were more likely to vote CPC. Interestingly, this is not a significant variable for the Liberals or NDP (note: I tweeted yesterday that it was significant for the NDP and not the other two. That was indeed the case with a different, simpler regression specification).

Finally, here is for the NDP:

Interesting to see the negative and significant impact of being university educated. You'd think the NDP would do better among this demographic but it really seems that Trudeau is too strong among this group. As for being satisfied with the Ford government, the effect was -6 points and -5 points for SNC (the impact of this variable is very consistent across parties).

So, both stories mattered. What would have happened if the Ford government hadn't been that unpopular? Looking across the provinces, the percentage of people satisfied in Ontario was only 27%, compared to an average of 54% in the other provinces. Had the Ford government been as popular as the other provincial governments, the coefficients above indicate that the results in Ontario could have been (actual results in brackets):

LPC: 40% (41.3)

CPC: 36.5% (33.1)

NDP: 15.2%

Essentially, the Conservatives would have been 3.4 pts higher, taken from the Liberals and NDP. What impact in terms of seats? Hard to say, but a quick look with my model would indicate that it could have cost the Liberals between 5 and 10 seats. Not enough to win the GTA or for Scheer to become Prime Minister, but not insignificant either. The impact could be higher if the people not satisfied with Ford were concentrated in the GTA. The CES dataset doesn't allow me to look into that. But it does partially explain why the Tories increased everywhere in Canada except in Quebec (unique situation) and Ontario, where they literally did worse, in percentages, than in 2015.
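For readers who want to see where numbers of this size come from, here is a rough Python sketch of the most direct reading of that counterfactual: multiply each party's "premiergood" coefficient by the change in the share of satisfied respondents. It uses the rounded coefficients quoted in the text (-6, +12, -6), so it lands close to, but not exactly on, the figures above; the function and variable names are made up, and the exact calculation behind the published numbers may differ.

```python
# Rounded "premiergood" coefficients quoted in the text, in percentage points.
COEF_PREMIER_GOOD = {"LPC": -6.0, "CPC": 12.0, "NDP": -6.0}

SATISFIED_ONTARIO = 0.27    # share satisfied with the Ford government
SATISFIED_ELSEWHERE = 0.54  # average share satisfied in the other provinces

def counterfactual_shift(coef, actual_rate, counterfactual_rate):
    """Change in vote share (points) if satisfaction moved to the counterfactual rate."""
    return coef * (counterfactual_rate - actual_rate)

shifts = {
    party: counterfactual_shift(coef, SATISFIED_ONTARIO, SATISFIED_ELSEWHERE)
    for party, coef in COEF_PREMIER_GOOD.items()
}

for party, shift in shifts.items():
    print(f"{party}: {shift:+.2f} pts")
```

With the rounded coefficients the Conservative gain is about 3.2 points and is exactly offset by the Liberal and NDP losses, in line with "taken from the Liberals and NDP".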

On the other hand, only 15% thought that Trudeau handled SNC-Lavalin well. That likely cost him as much as 5 points! It is quite massive and likely cost him a majority.

Both those estimates are very, very rough and should be taken with a large grain of salt. But I believe they show that both had a significant impact on the vote in Ontario. Does it mean that Trudeau can now get a majority if people have forgotten SNC-Lavalin? Or that O'Toole can expect gains in Ontario now that Ford is, surprisingly, popular? I wouldn't say so; this is one causal link too far for such a basic analysis.