We already know that the Liberal Party won the 2019 Canadian federal election. I am interested in predicting which party would have won if citizens of all ages in Canada had voted. Models are built from the online survey results of the 2019 Canadian Election Study. Each model is then applied to the 2016 census data to find which party would receive the most votes. Finally, the predicted result is compared with the actual 2019 election result to see whether the two differ.
In the model, "age", "province", "education" and "gender" are the independent variables, and "vote choice" is the only response variable. The outcome variable includes eight vote choices, whose distribution is shown in Figure 1. The figure shows that the largest share of respondents chose "Liberal Party"; "Conservative Party" was the second most popular choice; and "People's Party" received the fewest votes. These variables are used to fit logistic regression models, which are then applied to the 2016 census data to predict what the election result would have been if citizens of all ages had voted.
The survey data consists largely of multiple-choice questions, so we convert the multi-class outcome into a set of binary problems using the one-vs-rest method. That is, if the outcome has k levels, we obtain k binary outcomes, one for each vote choice. In this project, the first binary outcome is whether the respondent chose "Liberal Party" or not, and the other vote choices are handled in the same way.
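The one-vs-rest recoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the responses in `vote_choice` and the helper `one_vs_rest` are hypothetical and serve only to show how k levels become k binary outcomes.

```python
# Hypothetical survey responses; illustrative only, not real survey data.
vote_choice = ["Liberal Party", "Conservative Party", "Liberal Party", "People's Party"]


def one_vs_rest(labels):
    """Return {level: [0/1, ...]} with one binary outcome per distinct level."""
    levels = sorted(set(labels))
    return {level: [1 if y == level else 0 for y in labels] for level in levels}


binary_outcomes = one_vs_rest(vote_choice)

# Three distinct levels give three binary outcome vectors.
print(len(binary_outcomes))              # 3
print(binary_outcomes["Liberal Party"])  # [1, 0, 1, 0]
```

Each binary vector can then serve as the response of its own logistic regression.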
For each binary outcome, a logistic regression model is fitted using the equation below, where p is the probability of choosing the given party and x_1, ..., x_m are the predictors.

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m$$

Each categorical variable with k levels enters the model as k − 1 dummy variables. One logistic regression model is trained per vote choice and then applied to the census data to obtain the predicted probability of choosing the corresponding party.
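To make the dummy coding and the prediction step concrete, here is a small Python sketch. It assumes a single categorical predictor ("gender") and uses made-up coefficients; `dummy_code`, `predict_probability`, `beta0` and `beta` are hypothetical names and values, not output from the fitted models.

```python
import math


def dummy_code(value, levels, reference):
    """Encode one categorical value as k - 1 indicators, dropping the reference level."""
    return {lvl: (1 if value == lvl else 0) for lvl in levels if lvl != reference}


def predict_probability(intercept, coefficients, features):
    """Inverse logit of the linear predictor: p = 1 / (1 + exp(-eta))."""
    eta = intercept + sum(coefficients[name] * x for name, x in features.items())
    return 1.0 / (1.0 + math.exp(-eta))


gender_levels = ["Female", "Male", "Others"]           # k = 3 levels
x = dummy_code("Male", gender_levels, reference="Female")  # k - 1 = 2 dummies

# Hypothetical coefficients, NOT taken from the model results in this report.
beta0 = -1.0
beta = {"Male": -0.3, "Others": 0.1}

p = predict_probability(beta0, beta, x)
print(x)  # {'Male': 1, 'Others': 0}
```

A respondent at the reference level has all dummies equal to 0, so the prediction depends on the intercept alone.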
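After that, each combination of feature values has six predicted probabilities, one per party. Because these come from separately fitted models, they need not sum to 1, so we apply the softmax method to rescale them into values in [0, 1] that sum to 1. The softmax formula is shown below, where p_i is the predicted probability for party i and K is the number of parties.

$$\mathrm{softmax}(p_i) = \frac{\exp(p_i)}{\sum_{j=1}^{K} \exp(p_j)}$$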
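The softmax step can be sketched as follows. The six input values in `raw` are hypothetical one-vs-rest probabilities chosen for illustration, and `softmax` is a plain implementation of the standard formula rather than the code used in the study.

```python
import math


def softmax(values):
    """Rescale a list of numbers into positive weights that sum to 1."""
    m = max(values)  # subtracting the maximum avoids overflow; the result is unchanged
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical predicted probabilities for six parties (they do not sum to 1).
raw = [0.40, 0.35, 0.15, 0.10, 0.08, 0.05]
normalized = softmax(raw)

print(sum(normalized))  # 1.0 up to floating-point rounding
```

Softmax is monotone, so the party with the largest input probability also has the largest normalized probability; only the scale changes, not the ranking.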
One of the model results is shown below. All features are significant at the 0.1 significance level, but at the 0.05 level the variable "education" is no longer significant. To interpret the coefficients, take "gender" as an example. It has three levels (Male, Female, Others); the Female level is treated as the reference, and the other two are compared with it. The coefficient of "genderMale" is −0.34731, which corresponds to an odds ratio of exp(−0.34731) = 0.7066: holding the other variables fixed, the odds that a male respondent chooses the party in question are about 0.71 times those of a female respondent.
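The odds-ratio arithmetic can be checked directly. The coefficient is the one quoted in the text; the variable names are only for illustration.

```python
import math

coef_gender_male = -0.34731  # coefficient of "genderMale" quoted in the text

# Exponentiating a logistic regression coefficient gives the odds ratio
# relative to the reference level (here, Female).
odds_ratio = math.exp(coef_gender_male)
percent_change_in_odds = (odds_ratio - 1.0) * 100.0

print(round(odds_ratio, 4))  # 0.7066
```

An odds ratio below 1 means lower odds than the reference level; here the odds are roughly 29% lower for the Male level.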
As the plot above shows, "Liberal Party" has the highest predicted probability, approximately 0.26, and "Conservative Party" has the second highest. We therefore suggest that the Liberal Party would still have won if citizens of all ages in Canada had voted in the 2019 Canadian federal election. Our predicted winner matches the actual result of the 2019 Canadian federal election.
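One way to go from cell-level predictions to a single overall figure per party is to weight each census cell by its population count. The text does not state how the aggregation was done, so the sketch below is an assumption: the cells, counts, party names ("Party A" and so on) and the helper `weighted_shares` are all hypothetical.

```python
# Hypothetical census cells: (number of people, {party: normalized probability}).
cells = [
    (1200, {"Party A": 0.50, "Party B": 0.30, "Party C": 0.20}),
    (800, {"Party A": 0.25, "Party B": 0.45, "Party C": 0.30}),
]


def weighted_shares(cells):
    """Average each party's probability over cells, weighted by cell population."""
    total = sum(count for count, _ in cells)
    shares = {}
    for count, probs in cells:
        for party, p in probs.items():
            shares[party] = shares.get(party, 0.0) + count * p / total
    return shares


shares = weighted_shares(cells)
winner = max(shares, key=shares.get)  # party with the highest weighted share
print(winner)  # Party A
```

Weighting by cell size means that large demographic groups influence the overall prediction more than small ones.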