Canada 2021 simulator, version 2.0 with demographic variables

I just uploaded version 2.0 of the 2021 simulator. The big change is the addition of demographic variables. I have also made some modifications for situations such as Sloan running as an independent or JWR not running again. But those are details, really.


Why add demographic variables?

Some of you might actually be surprised that the previous model wasn't using any demographic variables. That's understandable. But you need to realize that adding such variables isn't trivial. Yes, it makes sense on paper to want to calibrate the model based on the gender gap or the evolution of the voting intentions of 18-34 year olds, but it's quite a lot more complicated in practice. And most models out there don't use any (308/CBC) or don't disclose much of what they do (338/Lean Tossup).


The main challenge is that we don't have results by demographics, only by riding. We don't actually know how women voted in 2019, or how the 18-34 voted. Yes, we have polls, but polls are not perfectly accurate. Also, most polls only include gender and age and nothing else. Very few pollsters include information about education or income. Literally none will include a breakdown by home ownership even though it is an important predictor of the vote.


Given that my model (like every single one out there, I believe) uses the past election results as the base and applies the swing observed in the polls since then, we really need to find a way to get the results broken down by demographics. Thankfully, the Canadian Election Study (CES) can help us there. This massive survey has a post-election component where they ask people how they voted. It's still not perfect, as people can lie or misremember, but it's an improvement. More importantly, I can get the results by income, education or any other variable I want. I ultimately limited myself to age, gender, education and income, as we get this info in at least some polls. There is no point in adding variables that are never included in any poll, since they wouldn't be useful for projections.


Because even such a survey isn't 100% accurate, I decided to set up my demographic adjustments as differentials. I'm not using the actual levels for (say) the Liberals among (say) men and women, I'm using the differential. For 2019, Canada-wide, the Liberals got 34% of the vote but were at 35% among women and 33% among men, so differentials of +1 and -1. Doing so allows me to use polls and the CES even though they aren't 100% accurate (in the post-election CES, the Liberals were overestimated). Surprisingly, I discovered that polls were pretty good at estimating such differentials. I guess that while polls can make mistakes on the levels (over or underestimating a party, probably due to turnout), they are better at estimating the relative levels. That is a very important piece of good news, as it means polls can be relied on when using such variables for predictions.
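To make the idea of differentials concrete, here is a minimal sketch in Python. The function name and the "biased poll" numbers are made up for illustration; only the 34/35/33 figures come from the paragraph above. It shows that a survey that overestimates a party by the same amount in every group still gives the same differentials.

```python
def differentials(overall, by_group):
    """Return each group's share minus the overall share, in points."""
    return {group: share - overall for group, share in by_group.items()}


# 2019 Liberal numbers quoted above (Canada-wide, in percent).
lpc_2019 = differentials(34, {"women": 35, "men": 33})
print(lpc_2019)  # {'women': 1, 'men': -1}

# Hypothetical survey that overestimates the Liberals by 3 points everywhere:
# the levels are wrong, but the differentials are unchanged.
biased = differentials(37, {"women": 38, "men": 36})
print(biased == lpc_2019)  # True
```

This is of course the best-case scenario (a perfectly uniform error), but it is the reason differentials are more robust than levels.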


I also use data from the census, where we get detailed breakdowns for each demographic (for instance, what percentage of voters in Rosemont are aged 18-34). Notice, however, that the census doesn't give us joint distributions, only marginal ones. It's a potentially serious limitation but there isn't much that I can do about it.


How is it done?

A very quick summary would be:

1) The province-wide percentages that you enter will determine how many votes a party will get

If you believe the Liberals are up in one province by (say) 4 points, then input them at 46% in Ontario.

2) The demographic adjustments will influence who and where those voters are.

If you believe this swing is mostly concentrated among women and educated voters, then increase their differentials.


I decided to take a two-step approach. The first step is the same as it has always been: we use polls (or our best guess) to set the province-wide percentages for each party. Let's say, for instance, that the Liberals are up 5 points in Ontario since 2019. Once we input this, the average swing has to add up to 5 points. It doesn't matter if the swing is coming from the Liberals doing better among women or among older people, it has to be 5 points. That's what we told the model to do!


The demographic variables are only used to allocate those 5 points across ridings. If you believe the Liberals are up by 5 mostly thanks to an increase among women voters, then the model will need to increase the Liberals more in ridings with more women. That makes sense, right?


The model first applies the province-wide swings (currently uniformly, but regional adjustments could come later). It then applies the demographic adjustments, one by one (since I don't have the joint distribution). If the adjustments don't average out to zero, the model takes the necessary steps to ensure the final, net average swing is consistent with the provincial one. See the following example to understand the last part.


Let's imagine that you believe that the Liberals are up 4 points among women. That's the only change. In every other demographic, they are stable compared to 2019. If you only enter '+4' in the simulator for LPC-women, the model will increase the Liberals by "share of voters that are women * 4 points" in every riding. On average this is about 2 points (ridings don't vary that much in terms of the men-to-women ratio and the average share is almost 50%; let's assume it's exactly 50% for the sake of illustration). But, and this is important, if you didn't increase the Liberals by 2 points province-wide (i.e. overall, they are still at 42%), then the model will apply a uniform swing of minus 2 everywhere. That is logical. The model has to do this to keep the average swing consistent with the province-wide numbers you entered. So, overall the Liberals remain at 42% but their vote is now skewing towards women, meaning the riding numbers have changed (some have increased, some have decreased).
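For those who prefer code, the logic of this example can be sketched roughly as follows. This is not the simulator's actual code: the function name, the three ridings and their numbers are invented, and I'm assuming ridings of equal size so that a simple average can be used.

```python
def project_ridings(base, women_share, province_swing, women_adj):
    """Hypothetical two-step projection for one party in one province.

    base: 2019 share per riding (in points); women_share: share of voters
    who are women in each riding. Ridings are assumed to be of equal size.
    """
    # Step 1: the uniform provincial swing sets how many votes the party gets.
    uniform = [b + province_swing for b in base]
    # Step 2: the demographic adjustment, scaled by each riding's share of women.
    demo = [women_adj * w for w in women_share]
    # Correction: remove the average demographic effect so that the net
    # average swing still equals the provincial swing entered in step 1.
    avg_demo = sum(demo) / len(demo)
    return [u + d - avg_demo for u, d in zip(uniform, demo)]


# Three made-up ridings averaging 42%, with LPC-women +4 as the only input.
result = project_ridings([50.0, 42.0, 34.0], [0.52, 0.50, 0.48], 0.0, 4.0)
print([round(r, 2) for r in result])  # [50.08, 42.0, 33.92]
```

The average stays at 42%, but the riding with more women goes up a little and the one with fewer women goes down. The effect is small here precisely because the share of women barely varies across ridings.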


If you really believe the one change is Liberals +4 among women and everything else is the same, what you need to do is enter the Liberals at +2 province-wide (so Liberals at 44% in Ontario) and +4 among women. This will work and achieve your objectives. [Note: before some smartass on Reddit notices the error and declares me illiterate with numbers, yes, the truly correct way would be the following: increase the Liberals by 2 points in the province, then put LPC-women at +2 and LPC-men at -2. This is consistent with the Liberals being at 42% among men, 46% among women, thus 44% overall. But it's getting complicated and, for all intents and purposes, simply putting women at +4 will do the exact same job.]
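The claim in the note can be checked with a few lines of Python. Again, this is only a sketch under my own assumptions (hypothetical function name, made-up ridings of equal size, men and women shares summing to 1 in each riding), not the simulator itself.

```python
def net_swings(province_swing, adjustments, shares):
    """Net swing per riding after demographic adjustments and recentering.

    adjustments: {group: points}; shares: one {group: share} dict per riding.
    Ridings are assumed to be of equal size (simple average).
    """
    # Raw demographic effect in each riding, one group at a time.
    raw = [sum(adjustments.get(g, 0.0) * s for g, s in riding.items())
           for riding in shares]
    # Recentre so the average net swing equals the provincial swing.
    avg = sum(raw) / len(raw)
    return [province_swing + r - avg for r in raw]


ridings = [{"women": 0.52, "men": 0.48},
           {"women": 0.50, "men": 0.50},
           {"women": 0.48, "men": 0.52}]

shortcut = net_swings(2.0, {"women": 4.0}, ridings)
proper = net_swings(2.0, {"women": 2.0, "men": -2.0}, ridings)
same = all(abs(a - b) < 1e-9 for a, b in zip(shortcut, proper))
print(same)  # True
```

Because the shares of men and women add up to 1 in every riding, "+2 women, -2 men" is just "+4 women" minus a constant 2 points, and the recentering step removes any constant.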


The most important numbers remain the province-wide percentages, not the demographic ones. I spent a lot of time thinking about it and it is the best solution in my opinion. Why? Well because the key numbers you need to get right in order to be accurate with your seat projections are the province-wide percentages. In 2019, if you knew that the Liberals would be at 42% in Ontario and the NDP only at 17% (so a much bigger gap than what the polls showed), I can guarantee you that any model would have given you mostly the right results. You wouldn't have needed to look at whether the Liberals were doing better or worse among the 18-34 or if they increased their lead among the university educated. No, get the province-wide percentages right and you are 95% there. This is just a fact. You can spin your model as being super sophisticated and what not but the sad truth is: whoever guesses the province-wide percentages best will have the best seat projections. Yes your model would have been even more accurate if you took into account that the Liberals were gaining more among the 55+ and the university educated, but the improvements are very marginal compared to having the correct province-wide percentages. Not even close.


Keeping that in mind, I wanted my model to remain mostly dependent on those province-wide percentages. The demographic variables provide adjustments but nothing else. When you are using the simulator, I'd suggest that you first enter the percentages by province (based on polls or what you believe will happen). Then you scroll down and make adjustments based on demographics (i.e. you think one party is doing better or worse among some demographic). Or just play with the demographic adjustments and see if you can make a party's vote more or less efficient (if you are a Conservative looking to make gains in the GTA, I suspect that improving your score with the university educated might help).


What if I don't do the math right?

What happens if you enter the Liberals at 38% (drop of 4 points province-wide) but your only change in the demographic adjustments is LPC-women at +4? Well, this is clearly inconsistent but I also don't expect you to do a ton of math before using the simulator. That kind of defeats the purpose. If the LPC is up by 4 among women, this party should be up by 2 province-wide, not down by 4. That's an inconsistency gap of 6 points (+2 vs -4).


The model will first apply -4 in every riding. Then it'll apply +4 * share of women in every riding, which is an average adjustment of +2, leaving the net average swing at only -2 at that point. The model will thus apply another uniform swing of -2 everywhere to get back to -4. At the end, the objective is 1) to have the Liberals down by 4 overall and 2) to have the LPC electorate now skewing more towards women. As long as both objectives are reached, I'm happy. All of this is done in a simulator that can be used without a PhD in stats.
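Here are the three stages of this inconsistent example written out as plain arithmetic in Python. The three ridings and their shares of women are made up, and I'm assuming equal-sized ridings so that a simple average works.

```python
province_swing = -4.0              # Liberals entered at 38% instead of 42%
women_adj = 4.0                    # LPC-women at +4
women_share = [0.52, 0.50, 0.48]   # made-up ridings, 50% women on average

# Stage 1: uniform provincial swing in every riding.
uniform = [province_swing for _ in women_share]
# Stage 2: demographic adjustment, scaled by the share of women.
demo = [women_adj * w for w in women_share]
avg_demo = sum(demo) / len(demo)   # about +2 on average
# Stage 3: extra uniform swing that cancels the average of stage 2.
correction = -avg_demo             # about -2 everywhere

final = [u + d + correction for u, d in zip(uniform, demo)]
avg_final = sum(final) / len(final)

print([round(f, 2) for f in final])  # [-3.92, -4.0, -4.08]
print(round(avg_final, 2))           # -4.0
```

Both objectives are met: the average swing is -4, and the drop is slightly smaller in the riding with more women.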


What does it mean? It means you don't have to do math before using the simulator. The model allows you to be inconsistent and will take care of it. Just remember that the actual, final provincial swing (and therefore average net swing) is given by the province-wide numbers you enter. As it should be.


At the end of the day, enter the province-wide numbers and then play with the demographic adjustments to see if gaining among the 18-34 would (for instance) make the NDP vote more or less efficient. And don't worry if your demographic adjustments don't sum to zero or don't average out to the provincial swing, it's all fine. If you enter numbers based on a polling average, those inconsistencies should not be present anyway (in theory). Either way, the model will take care of this.


What if I leave all the adjustments at zero?

In this case, the model is assuming the composition of the electorate for each party won't have changed since 2019. The CPC, for instance, was at 33% in Ontario but at 37% among men and 29% among women (so +4 and -4 differentials). If you enter the CPC at 35% province-wide and don't change anything else, then the CPC is still at +4 among men, so 39%. The demographic adjustments are really just that, adjustments. And they are relative. In this case, the model increases the CPC by 2 points everywhere (35% vs 33%), with no special adjustment for ridings with more women, for instance. Remember, the first step decides how many votes a party receives; the demographic adjustments change who those voters are. And everything is done with 2019 as the base. So if you leave the adjustments at zero, you are assuming the CPC voters will continue to skew older and more male, with fewer university degrees (but higher income!), while the NDP vote will keep skewing young and female. Leaving the numbers at zero doesn't mean you believe there is no gender (or education, etc.) gap, just that the gap for this party is the same as it was in 2019.
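As a quick illustration of what "zero adjustments" implies, here is a tiny Python sketch using the CPC numbers above. The function name is hypothetical; the only logic is "new level in a group = new province-wide level + 2019 differential".

```python
def implied_group_levels(new_overall, base_differentials):
    """Group levels implied by a new overall share when adjustments are zero."""
    return {group: new_overall + diff
            for group, diff in base_differentials.items()}


# 2019 CPC differentials in Ontario quoted above: 37% men, 29% women, 33% overall.
cpc_2019_diff = {"men": 37 - 33, "women": 29 - 33}

levels = implied_group_levels(35, cpc_2019_diff)
print(levels)  # {'men': 39, 'women': 31}
```

The gender gap (8 points) is untouched; only the overall level moved.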


What else?

The model currently uses a simple uniform swing model. I'll likely add regional effects soon but I was more focused on the demographic variables. Adding regional coefficients doesn't make a big difference anyway.


I'll also refine the demographic estimates from 2019 over the next few weeks. I'd like, in particular, to have different ones for Quebec and the ROC. But polls rarely provide such breakdowns.