Problem Definition
The objective of this project is to predict voter preference (Labour vs Conservative) using demographic, economic perception, political leadership ratings, and political awareness variables.
This is a binary classification problem, where the target variable is:
vote_Labour (1 = Labour, 0 = Conservative)
The analysis aims to:
Understand data structure and distributions
Identify relationships between predictors and voting behavior
Build and compare multiple classification models
Select the best model based on performance metrics
Dataset Overview
Rows: 1,525 voters
Columns: 9 features + 1 target
Data Types:
Numerical: Age, economic conditions, leader ratings, political knowledge
Categorical: Vote, Gender
Missing Values: None
Duplicates: 8 rows (about 0.5% of the data, so not materially impactful)
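The missing-value and duplicate checks above can be sketched with pandas as follows. This is a minimal illustration on a tiny made-up frame, not the project's actual data; the column names and values are assumptions. Dropping the duplicates at the end is optional and is shown only for completeness.

```python
import pandas as pd

# Tiny synthetic frame standing in for the survey data (values are made up).
df = pd.DataFrame({
    "age": [43, 36, 35, 43, 61],
    "gender": ["female", "male", "male", "female", "male"],
    "vote": ["Labour", "Labour", "Conservative", "Labour", "Conservative"],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", n_duplicates)

# Optional: keep only the first occurrence of each duplicated row.
df_clean = df.drop_duplicates().reset_index(drop=True)
```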
Target Variable Distribution
Labour voters: ~70%
Conservative voters: ~30%
➡️ Dataset is moderately imbalanced, which makes recall and AUC important evaluation metrics in addition to accuracy.
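To see why accuracy alone can mislead at a 70/30 split, consider a trivial baseline that always predicts Labour. The sketch below uses synthetic labels with the class proportions reported above; it does not use the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels mirroring the ~70/30 split (1 = Labour, 0 = Conservative).
y_true = np.array([1] * 70 + [0] * 30)

# A "model" that ignores its inputs and always predicts Labour.
y_pred = np.ones_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
labour_recall = recall_score(y_true, y_pred, pos_label=1)
conservative_recall = recall_score(y_true, y_pred, pos_label=0)

print(f"accuracy:            {baseline_accuracy:.2f}")
print(f"recall, Labour:       {labour_recall:.2f}")
print(f"recall, Conservative: {conservative_recall:.2f}")
```

The baseline looks respectable on accuracy while never identifying a single Conservative voter, which is the gap that recall and AUC expose.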
Univariate Analysis – Key Observations
Age
Minimum age: 24
Average age: ~54
Maximum age: 93
Distribution is slightly right-skewed: most voters are middle-aged, with a tail extending toward senior voters.
Economic Conditions
Most voters rate national and household economic conditions at 3 or 4.
Suggests generally neutral to positive economic sentiment.
Leadership Ratings
Blair ratings skew higher than Hague, indicating stronger preference for Blair.
Leadership perception shows potential influence on voting behavior.
Political Knowledge
Majority fall in medium knowledge categories (1–2).
Very few voters have extremely high political awareness.
Gender
Slightly more females (53%) than males (47%).
Gender alone does not strongly separate voting behavior.
Multivariate & Bivariate Analysis
Correlation Analysis
Moderate correlations observed between:
Household and national economic conditions
Leadership ratings and vote choice
No strong multicollinearity detected.
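A correlation check of this kind might look like the sketch below. The data is synthetic, the column names are hypothetical, and the 0.8 cutoff for flagging a pair is an assumed rule of thumb rather than a threshold stated in the analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic 1-5 ratings: household sentiment loosely follows national sentiment.
national = rng.integers(1, 6, size=n)
household = np.clip(national + rng.integers(-2, 3, size=n), 1, 5)

df = pd.DataFrame({
    "econ_national": national,
    "econ_household": household,
    "age": rng.integers(24, 94, size=n),
})

# Pearson correlation matrix of the numeric columns.
corr = df.corr()

# Flag feature pairs whose absolute correlation exceeds the assumed cutoff.
threshold = 0.8
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print("pairs above threshold:", high_pairs)
```

An empty `high_pairs` list would correspond to the "no strong multicollinearity" reading; the same matrix can be passed to a Seaborn heatmap for the visual version.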
Vote-Wise Distribution Insights
Labour voters generally:
Rate Blair higher
Show slightly better economic sentiment
Have marginally higher political knowledge
Visual Techniques Used
Violin plots (vote vs numeric variables)
Pair plots for interaction patterns
Heatmaps for correlation
Strip plots to capture density and overlap
➡️ These patterns indicate that economic perception and leadership ratings are key predictors.
Data Preprocessing
Dropped irrelevant ID column
Converted categorical variables using one-hot encoding
Applied Min-Max scaling to numerical variables
Train-test split: 70% training / 30% testing
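The preprocessing steps above can be sketched as follows. The frame is synthetic and the column names (`id`, `age`, `blair`, `gender`, `vote`) are assumptions. Two choices here are also assumptions not stated in the write-up: the split is stratified on the target, and the scaler is fitted on the training rows only so that test data does not influence the scaling.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in for the survey data (hypothetical column names).
df = pd.DataFrame({
    "id": np.arange(n),
    "age": rng.integers(24, 94, size=n),
    "blair": rng.integers(1, 6, size=n),
    "gender": rng.choice(["female", "male"], size=n),
    "vote": rng.choice(["Labour", "Conservative"], size=n, p=[0.7, 0.3]),
})

# 1. Drop the identifier column.
df = df.drop(columns=["id"])

# 2. One-hot encode; drop_first leaves vote_Labour and gender_male.
encoded = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True, dtype=int)

numeric_cols = ["age", "blair"]
encoded[numeric_cols] = encoded[numeric_cols].astype(float)

X = encoded.drop(columns=["vote_Labour"])
y = encoded["vote_Labour"]

# 3. 70/30 split (stratification is an assumption of this sketch).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# 4. Min-Max scaling, fitted on the training rows only.
scaler = MinMaxScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train.shape, X_test.shape)
```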
Model Building & Evaluation
Evaluation Metrics Chosen
Accuracy: Overall correctness
Recall: Important due to class imbalance
F1-Score: Balance between precision and recall
ROC-AUC: Measures discrimination capability
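For reference, the four metrics can be computed with scikit-learn as in the sketch below. The labels and predicted probabilities are made-up values for ten voters, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

# Made-up example: true labels (1 = Labour) and predicted Labour probabilities.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
y_prob = np.array([0.90, 0.80, 0.75, 0.70, 0.60, 0.55, 0.40, 0.65, 0.30, 0.20])

# Hard predictions at an assumed 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
recall = recall_score(y_true, y_pred)      # share of Labour voters found
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)        # uses probabilities, not hard labels

print(f"accuracy={accuracy:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```

Note that ROC-AUC is computed from the predicted probabilities, so it does not depend on the chosen threshold, while the other three metrics do.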
Model Performance Summary (Test Data)
| Model | Accuracy | AUC |
|---|---|---|
| Naive Bayes | 0.83 | 0.885 |
| Logistic Regression | 0.82 | 0.883 |
| KNN | 0.82 | 0.864 |
| AdaBoost | 0.82 | 0.879 |
| Bagging | 0.80 | 0.878 |
| Random Forest | 0.82 | 0.888 |
| Gradient Boosting | 0.83 | 0.904 |
| Decision Tree | 0.76 | 0.732 |
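A comparison table like the one above can be produced with a simple loop over scikit-learn estimators. The sketch below runs on synthetic data from `make_classification`, so its numbers will not match the table; default hyperparameters and the `GaussianNB` variant of Naive Bayes are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 8 features, roughly 70% positives.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           weights=[0.3, 0.7], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = (accuracy_score(y_test, y_pred), roc_auc_score(y_test, y_prob))

# Print models sorted by test AUC, best first.
for name, (acc, auc) in sorted(results.items(), key=lambda kv: kv[1][1], reverse=True):
    print(f"{name:<20} accuracy={acc:.2f}  AUC={auc:.3f}")
```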
🏆 Best Model: Gradient Boosting
Why Gradient Boosting?
Highest ROC-AUC (0.904) on test data
Strong balance between bias and variance
Good generalization (no sign of overfitting)
Consistent recall for Labour voters
Interpretation
Gradient Boosting effectively captures non-linear interactions between:
Economic sentiment
Leadership perception
Political awareness
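One way to check which inputs a fitted Gradient Boosting model relies on is its impurity-based feature importances. The sketch below is illustrative only: the data is synthetic, the feature names are hypothetical, and the label is driven by `leader_rating` by construction, so the ranking it prints says nothing about the real survey. Importances also do not reveal interactions directly; they only show how much each feature contributes to the splits.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(7)
n = 500

# Synthetic features with hypothetical names.
X = pd.DataFrame({
    "leader_rating": rng.normal(size=n),
    "econ_sentiment": rng.normal(size=n),
    "political_knowledge": rng.normal(size=n),
})

# By construction, only leader_rating (plus noise) determines the label.
y = (X["leader_rating"] + 0.3 * rng.normal(size=n) > 0).astype(int)

model = GradientBoostingClassifier(random_state=7).fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```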
Model Improvement (Bagging & Boosting)
Hyperparameter tuning applied to Bagging:
Tree depth
Minimum samples per leaf
Feature sampling
Result:
Slight improvement in recall
No significant test accuracy gain
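On the test set, the boosting models matched or beat Bagging: AdaBoost and Gradient Boosting reached 0.82 and 0.83 accuracy against 0.80 for Bagging, and Gradient Boosting had the highest AUC (0.904 vs. 0.878).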
➡️ Boosting proved more suitable for this dataset.
🧠 Key Business & Analytical Insights
Leadership perception (Blair vs Hague) strongly influences vote choice
Economic outlook at household level matters more than national perception
Political knowledge improves prediction confidence
Ensemble models clearly outperform the single decision tree, although Naive Bayes and Logistic Regression remain competitive with them
Accuracy alone is insufficient; AUC and recall are critical
🛠️ Tools & Skills Demonstrated
Languages & Libraries
Python (Pandas, NumPy, Scikit-learn)
Matplotlib, Seaborn
Techniques
EDA & visualization
Feature scaling & encoding
Classification modeling
ROC-AUC analysis
Ensemble learning (Bagging, Boosting)
Final Recommendation
For real-world voter prediction systems:
Use Gradient Boosting for deployment
Monitor AUC and recall, not just accuracy
Regularly retrain as political sentiment shifts