Election Data Classification Project – End-to-End Analysis


Problem Definition

The objective of this project is to predict voter preference (Labour vs Conservative) using demographic, economic perception, political leadership ratings, and political awareness variables.

This is a binary classification problem, where the target variable is:

  • vote_Labour (1 = Labour, 0 = Conservative)

The analysis aims to:

  • Understand data structure and distributions

  • Identify relationships between predictors and voting behavior

  • Build and compare multiple classification models

  • Select the best model based on performance metrics

Dataset Overview

  • Rows: 1,525 voters

  • Columns: 9 features + 1 target

  • Data Types:

    • Numerical: Age, economic conditions, leader ratings, political knowledge

    • Categorical: Vote, Gender

  • Missing Values: None

  • Duplicates: 8 (not materially impactful)
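The structural checks listed above (row and column counts, data types, missing values, duplicates) can be sketched with pandas. This is a minimal illustration on synthetic data, not the original notebook: the column names and the `summarize` helper are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the voter table; column names are assumptions.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(24, 94, size=n),
    "economic_cond_national": rng.integers(1, 6, size=n),
    "Blair": rng.integers(1, 6, size=n),
    "gender": rng.choice(["female", "male"], size=n),
    "vote": rng.choice(["Labour", "Conservative"], size=n, p=[0.7, 0.3]),
})
# Append three repeated rows so the duplicate check has something to find.
df = pd.concat([df, df.iloc[:3]], ignore_index=True)


def summarize(frame: pd.DataFrame) -> dict:
    """Collect the basic structural facts reported in a dataset overview."""
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "numeric": frame.select_dtypes(include="number").columns.tolist(),
        "categorical": frame.select_dtypes(exclude="number").columns.tolist(),
        "missing": int(frame.isna().sum().sum()),
        "duplicates": int(frame.duplicated().sum()),
    }


overview = summarize(df)
print(overview)
```

On the real data, the same helper would be called on the loaded DataFrame instead of the synthetic one.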

Target Variable Distribution

  • Labour voters: ~70%

  • Conservative voters: ~30%

➡️ Dataset is moderately imbalanced, which makes recall and AUC important evaluation metrics in addition to accuracy.
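To see why accuracy alone can mislead at a 70/30 split, consider a trivial baseline that predicts Labour for every voter. The sketch below uses synthetic labels that only mirror the reported class proportions; the baseline itself is an illustration, not one of the models in this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Synthetic labels mirroring the ~70/30 split: 1 = Labour, 0 = Conservative.
y_true = np.array([1] * 70 + [0] * 30)

# A trivial baseline: predict Labour for everyone, with a constant score.
y_pred = np.ones_like(y_true)
y_score = np.full(y_true.shape, 0.7)

baseline = {
    "accuracy": accuracy_score(y_true, y_pred),
    "recall_labour": recall_score(y_true, y_pred, pos_label=1),
    "recall_conservative": recall_score(y_true, y_pred, pos_label=0),
    "roc_auc": roc_auc_score(y_true, y_score),
}
print(baseline)
```

The baseline reaches 70% accuracy while identifying no Conservative voters at all and scoring an AUC of 0.5, which is why recall and AUC are tracked alongside accuracy.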


Univariate Analysis – Key Observations

Age

  • Minimum age: 24

  • Average age: ~54

  • Maximum age: 93

  • The distribution is slightly right-skewed: voters are concentrated in the middle-aged and senior range, with a thin tail toward the oldest ages.

Economic Conditions

  • Most voters rate both national and household economic conditions at 3 or 4.

  • Suggests generally neutral to positive economic sentiment.

Leadership Ratings

  • Blair ratings skew higher than Hague, indicating stronger preference for Blair.

  • Leadership perception shows potential influence on voting behavior.

Political Knowledge

  • Majority fall in medium knowledge categories (1–2).

  • Very few voters have extremely high political awareness.

Gender

  • Slightly more females (53%) than males (47%).

  • Gender alone does not strongly separate voting behavior.


Multivariate & Bivariate Analysis

Correlation Analysis

  • Moderate correlations observed between:

    • Household and national economic conditions

    • Leadership ratings and vote choice

  • No strong multicollinearity detected.
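A simple way to screen for multicollinearity is to scan the correlation matrix for feature pairs whose absolute correlation exceeds a cutoff. The sketch below runs on synthetic data; the column names, the `high_corr_pairs` helper, and the 0.8 threshold are all assumptions for illustration, not values taken from the analysis.

```python
import numpy as np
import pandas as pd

# Synthetic features: household sentiment is moderately tied to national sentiment.
rng = np.random.default_rng(42)
n = 500
national = rng.normal(size=n)
df = pd.DataFrame({
    "economic_cond_national": national,
    "economic_cond_household": 0.4 * national + rng.normal(size=n),
    "Blair": rng.normal(size=n),
})


def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return (col_a, col_b, |r|) for every pair above the threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if r > threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs


print(df.corr().round(2))
print("Pairs above threshold:", high_corr_pairs(df))
```

An empty list is consistent with "moderate correlations, no strong multicollinearity"; a near-duplicate column would show up as a flagged pair.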

Vote-Wise Distribution Insights

  • Labour voters generally:

    • Rate Blair higher

    • Show slightly better economic sentiment

    • Have marginally higher political knowledge

Visual Techniques Used

  • Violin plots (vote vs numeric variables)

  • Pair plots for interaction patterns

  • Heatmaps for correlation

  • Strip plots to capture density and overlap

➡️ These patterns indicate that economic perception and leadership ratings are key predictors.


Data Preprocessing

  • Dropped irrelevant ID column

  • Converted categorical variables using one-hot encoding

  • Applied Min-Max scaling to numerical variables

  • Train-test split: 70% training / 30% testing
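The four preprocessing steps above could look roughly like this in pandas and scikit-learn. The data is synthetic and the column names (`id`, `age`, `Blair`, `gender`, `vote`) are assumptions. Stratifying the split and fitting the scaler on the training portion only are choices made in this sketch to avoid leakage; the original write-up does not say how either was handled.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; column names are assumptions.
rng = np.random.default_rng(1)
n = 100
raw = pd.DataFrame({
    "id": np.arange(1, n + 1),
    "age": rng.integers(24, 94, size=n).astype(float),
    "Blair": rng.integers(1, 6, size=n).astype(float),
    "gender": rng.choice(["female", "male"], size=n),
    "vote": rng.choice(["Labour", "Conservative"], size=n, p=[0.7, 0.3]),
})

# 1. Drop the ID column, which carries no predictive information.
data = raw.drop(columns="id")

# 2. One-hot encode; drop_first yields a single 0/1 column per binary category,
#    e.g. vote -> vote_Labour (1 = Labour, 0 = Conservative).
encoded = pd.get_dummies(data, columns=["gender", "vote"], drop_first=True, dtype=int)
X = encoded.drop(columns="vote_Labour")
y = encoded["vote_Labour"]

# 3. 70/30 split, stratified so both parts keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# 4. Min-Max scale numeric columns, learning min/max from the training set only.
numeric_cols = ["age", "Blair"]
scaler = MinMaxScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train.head())
```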


Model Building & Evaluation

Evaluation Metrics Chosen

  • Accuracy: Overall correctness

  • Recall: Important due to class imbalance

  • F1-Score: Balance between precision and recall

  • ROC-AUC: Measures discrimination capability
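As a worked example of how these metrics are computed, the sketch below derives accuracy, recall, precision, and F1 from a hypothetical confusion matrix and checks them against scikit-learn. ROC-AUC is shown on a tiny made-up score vector as the fraction of (positive, negative) pairs ranked correctly. All counts and scores here are invented for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical confusion-matrix counts (positive class = Labour).
tp, fn, fp, tn = 90, 10, 20, 30
y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_pred = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

accuracy = (tp + tn) / (tp + fn + fp + tn)           # 120 / 150 = 0.8
recall = tp / (tp + fn)                              # 90 / 100 = 0.9
precision = tp / (tp + fp)                           # 90 / 110
f1 = 2 * precision * recall / (precision + recall)   # 180 / 210

# The hand-computed values agree with scikit-learn.
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))

# ROC-AUC: share of (positive, negative) pairs where the positive scores higher.
labels = np.array([1, 1, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.5, 0.2])
pos, neg = scores[labels == 1], scores[labels == 0]
auc_by_pairs = np.mean([p > q for p in pos for q in neg])  # 5 of 6 pairs
auc = roc_auc_score(labels, scores)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```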

Model Performance Summary (Test Data)

Model                 Accuracy   AUC
Naive Bayes           0.83       0.885
Logistic Regression   0.82       0.883
KNN                   0.82       0.864
AdaBoost              0.82       0.879
Bagging               0.80       0.878
Random Forest         0.82       0.888
Gradient Boosting     0.83       0.904
Decision Tree         0.76       0.732
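
A comparison table like the one above can be produced by fitting each classifier on the same split and scoring it on the held-out data. The sketch below uses `make_classification` as a synthetic stand-in with default hyperparameters, so its numbers will not match the reported results; it only illustrates the loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, roughly 70/30 imbalanced data (an assumption, not the election data).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           weights=[0.3, 0.7], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC needs a continuous score, so use the positive-class probability.
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auc": roc_auc_score(y_test, proba),
    }

# Print models ranked by test AUC, highest first.
for name, s in sorted(results.items(), key=lambda kv: kv[1]["auc"], reverse=True):
    print(f"{name:<20} accuracy={s['accuracy']:.3f}  auc={s['auc']:.3f}")
```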

🏆 Best Model: Gradient Boosting

Why Gradient Boosting?

  • Highest ROC-AUC (0.904) on test data

  • Strong balance between bias and variance

  • Good generalization (no overfitting)

  • Consistent recall for Labour voters

Interpretation

Gradient Boosting effectively captures non-linear interactions between:

  • Economic sentiment

  • Leadership perception

  • Political awareness
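One common way to check which inputs a fitted Gradient Boosting model relies on is its impurity-based feature importances. The sketch below is not taken from the original analysis: it uses synthetic data in which the target is deliberately driven mostly by one column, and the column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic features; names are assumptions for illustration.
rng = np.random.default_rng(7)
n = 500
X = pd.DataFrame({
    "Blair": rng.normal(size=n),
    "economic_cond_household": rng.normal(size=n),
    "unrelated_noise": rng.normal(size=n),
})
# Synthetic target driven mainly by the first column, weakly by the second.
signal = 1.0 * X["Blair"] + 0.3 * X["economic_cond_household"]
y = (signal + 0.1 * rng.normal(size=n) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Importances are normalized to sum to 1; larger means more split gain.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances indicate which features the trees split on most, not the direction of an effect, so they are best read alongside the vote-wise distributions from the exploratory analysis.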


Model Improvement (Bagging & Boosting)

  • Hyperparameter tuning applied to Bagging:

    • Tree depth

    • Minimum samples per leaf

    • Feature sampling

  • Result:

    • Slight improvement in recall

    • No significant test accuracy gain

  • Boosting models showed better learning efficiency than Bagging.

➡️ Boosting proved more suitable for this dataset.
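
The Bagging tuning described above can be sketched with `GridSearchCV`. The grid values, the recall scoring, the 3-fold CV, and the synthetic data are all assumptions for illustration. "Feature sampling" is interpreted here as the Bagging `max_features` fraction, which may differ from what the original analysis tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, roughly 70/30 imbalanced data (not the election data).
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           weights=[0.3, 0.7], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

bag = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                        n_estimators=25, random_state=0)

# The nested-parameter prefix is "estimator" in recent scikit-learn releases
# and "base_estimator" in older ones, so detect it rather than hard-code it.
prefix = "estimator" if "estimator" in bag.get_params() else "base_estimator"

param_grid = {
    f"{prefix}__max_depth": [3, 5, None],     # tree depth
    f"{prefix}__min_samples_leaf": [1, 5],    # minimum samples per leaf
    "max_features": [0.5, 1.0],               # fraction of features per tree
}

search = GridSearchCV(bag, param_grid, scoring="recall", cv=3)
search.fit(X_train, y_train)

test_recall = recall_score(y_test, search.predict(X_test))
print("Best parameters:", search.best_params_)
print(f"CV recall={search.best_score_:.3f}  test recall={test_recall:.3f}")
```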


🧠 Key Business & Analytical Insights

  1. Leadership perception (Blair vs Hague) strongly influences vote choice

  2. Economic outlook at household level matters more than national perception

  3. Political knowledge improves prediction confidence

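  4. Ensemble models clearly outperform a single Decision Tree, with Gradient Boosting reaching the highest AUC, although Naive Bayes and Logistic Regression remain competitive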

  5. Accuracy alone is insufficient; AUC and recall are critical


🛠️ Tools & Skills Demonstrated

Languages & Libraries

  • Python (Pandas, NumPy, Scikit-learn)

  • Matplotlib, Seaborn

Techniques

  • EDA & visualization

  • Feature scaling & encoding

  • Classification modeling

  • ROC-AUC analysis

  • Ensemble learning (Bagging, Boosting)

Final Recommendation

For real-world voter prediction systems:

  • Use Gradient Boosting for deployment

  • Monitor AUC and recall, not just accuracy

  • Regularly retrain as political sentiment shifts


















