
Text Analytics on U.S. Presidential Inaugural Speeches

Project Overview

In this project, I performed text analytics and natural language processing (NLP) on three historic U.S. Presidential inaugural speeches to understand their linguistic structure, vocabulary usage, and dominant themes.

Speeches Analyzed

  • Franklin D. Roosevelt – 1941

  • John F. Kennedy – 1961

  • Richard Nixon – 1973

The goal was not political analysis, but language analysis using Python and NLP libraries.

Problem Definition

The objectives of this analysis were:

  1. Compute text statistics for each speech:

    • Number of characters

    • Number of words

    • Number of sentences

    • Average word length

  2. Perform text preprocessing:

    • Lowercasing

    • Removing punctuation, numbers, and special characters

    • Stopword removal

    • Stemming

  3. Identify the most frequently used words across all three speeches

  4. Visualize dominant themes using a Word Cloud

Data Source

The speeches were sourced from the NLTK Inaugural Corpus, which contains official U.S. presidential inaugural addresses dating back to 1789.

from nltk.corpus import inaugural

An additional Excel file was used to organize the speeches into a structured tabular format.
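As a rough sketch of how the three speeches could be pulled into the Name/Speech structure, the snippet below reads them by file ID. The helper names `nltk_reader` and `load_speeches` are illustrative assumptions, not the code used in the project. The NLTK import is deferred so the block loads even where the corpus has not been downloaded yet.

```python
from typing import Callable, Dict, List

# File IDs of the three speeches in the NLTK Inaugural Corpus.
FILE_IDS: Dict[str, str] = {
    "Roosevelt": "1941-Roosevelt.txt",
    "Kennedy": "1961-Kennedy.txt",
    "Nixon": "1973-Nixon.txt",
}


def nltk_reader(file_id: str) -> str:
    """Read one speech from NLTK (run nltk.download('inaugural') once first)."""
    from nltk.corpus import inaugural  # lazy import: only needed when called

    return inaugural.raw(file_id)


def load_speeches(reader: Callable[[str], str] = nltk_reader) -> List[Dict[str, str]]:
    """Return one {'Name', 'Speech'} record per president (hypothetical helper)."""
    return [{"Name": name, "Speech": reader(fid)} for name, fid in FILE_IDS.items()]
```

Calling `load_speeches()` and passing the result to `pandas.DataFrame` should give a table with 3 rows and 2 columns.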

Exploratory Data Analysis (EDA)

Dataset Structure

Column | Description
Name   | President name
Speech | Full speech text

  • Rows: 3

  • Columns: 2

  • Data Type: Text (object)

No missing values or duplicates were found.


Text Statistics – Key Findings

1️⃣ Character Count

  • Nixon’s speech was the longest, exceeding 10,000 characters

  • Roosevelt and Kennedy speeches were similar in length (~7,600 characters)

2️⃣ Word Count

  • Nixon: 1,769 words

  • Kennedy: 1,364 words

  • Roosevelt: 1,323 words

➡️ Among these three speeches the most recent is the longest, though three addresses are too few to establish a trend over time.

3️⃣ Average Word Length

All speeches had similar average word lengths:

  • Roosevelt: 4.78

  • Kennedy: 4.62

  • Nixon: 4.71

➡️ Suggests similar vocabulary complexity across the three speeches, using average word length as a rough proxy.

4️⃣ Sentence Count

Sentence tokenization revealed:

  • Roughly 60–70 sentences per speech

  • Balanced sentence structures with rhetorical emphasis
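The four statistics above can be approximated with the standard library alone. This is a minimal sketch that uses regular expressions in place of NLTK's tokenizers, so its counts will not exactly match the figures reported here. The function name `text_stats` is an assumption made for illustration.

```python
import re
from typing import Dict


def text_stats(text: str) -> Dict[str, float]:
    """Approximate character, word, and sentence counts plus average word length."""
    # A word is a run of letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    # A sentence is any non-empty chunk between runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "avg_word_length": round(avg_len, 2),
    }
```

Applying `text_stats` to each speech gives one row of statistics per president. Swapping in `nltk.word_tokenize` and `nltk.sent_tokenize` would handle abbreviations and punctuation more carefully.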


Text Preprocessing Pipeline

To prepare the text for analysis, the following steps were applied:

✔ Lowercasing

Ensures uniformity (America → america)

✔ Special Character & Number Removal

Removed:

  • Punctuation

  • Line breaks

  • Digits

  • Symbols

✔ Stopword Removal

  • Removed common English stopwords (e.g., the, is, and)

  • Extended stopword list to remove context-specific words like “mr”

✔ Stemming

Applied the Porter stemmer to reduce words to a common stem:

  • running → run

  • freedom → freedom

This improved frequency analysis consistency.
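The pipeline described above could be sketched roughly as follows. The stopword set is a tiny illustrative stand-in for the full English list, and the stemmer is passed in as a function so the sketch runs without NLTK. Both choices, and the name `preprocess`, are assumptions rather than the project's actual code.

```python
import re
from typing import Callable, List, Set

# Illustrative subset only; a real run would use a full English stopword list.
STOPWORDS: Set[str] = {"the", "is", "and", "a", "an", "of", "to", "in", "mr"}


def preprocess(
    text: str,
    stopwords: Set[str] = STOPWORDS,
    stem: Callable[[str], str] = lambda word: word,  # identity unless a stemmer is given
) -> List[str]:
    """Lowercase, strip non-letters, drop stopwords, then stem each token."""
    text = text.lower()
    # Replace punctuation, digits, and symbols with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also absorbs line breaks and repeated spaces.
    tokens = text.split()
    return [stem(token) for token in tokens if token not in stopwords]
```

To apply Porter stemming, pass `stem=PorterStemmer().stem` after `from nltk.stem import PorterStemmer`. Stopwords are removed before stemming so that the lookup runs against the unstemmed forms in the list.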

Frequency Analysis – Most Common Words

After preprocessing, the most frequently occurring words across all speeches were:

Rank | Word
1    | new
2    | world
3    | america
4    | peace
5    | nation
6    | freedom
7    | people

Interpretation

  • “America” & “nation” reflect national identity focus

  • “Peace” & “freedom” are prominent, consistent with wartime (1941) and Cold War–era (1961, 1973) rhetoric

  • “New” & “world” indicate optimism and global outlook

Rare words were intentionally retained to preserve contextual meaning.
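A ranking like the one above can be produced with `collections.Counter`. This is a small sketch that assumes each speech has already been turned into a list of cleaned tokens. The function name `top_words` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_words(token_lists: Iterable[List[str]], n: int = 7) -> List[Tuple[str, int]]:
    """Count tokens across all speeches and return the n most frequent."""
    counts: Counter = Counter()
    for tokens in token_lists:
        counts.update(tokens)  # accumulate counts speech by speech
    return counts.most_common(n)
```

No minimum-frequency cutoff is applied, so rare words stay in the counts and simply fall outside the top `n`.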


☁️ Word Cloud Visualization

A Word Cloud was generated to visually represent dominant themes across all speeches.

Insights from Word Cloud

  • Large prominence of peace, freedom, democracy, nation

  • Strong emphasis on global responsibility and unity

  • Consistent ideological messaging across different administrations

➡️ Visual analysis complements numerical frequency counts and improves interpretability.
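One way such a cloud could be generated is sketched below, using the `wordcloud` package's `WordCloud.generate_from_frequencies` and `to_file`. The image size, background colour, output path, and both function names are assumptions chosen for illustration. The `wordcloud` import is deferred so the frequency helper works without it.

```python
from collections import Counter
from typing import Dict, Iterable


def cloud_frequencies(tokens: Iterable[str]) -> Dict[str, int]:
    """Map each cleaned token to its count; this is what sizes words in the cloud."""
    return dict(Counter(tokens))


def save_word_cloud(frequencies: Dict[str, int], path: str = "wordcloud.png") -> None:
    """Render the frequencies as a word cloud image (requires the wordcloud package)."""
    from wordcloud import WordCloud  # lazy import: only needed when rendering

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)
```

Feeding the cloud the same counts used for the frequency table keeps the picture and the ranking consistent with each other.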


🧠 Key Learnings & Insights

  1. Presidential speeches maintain consistent linguistic complexity

  2. Themes of freedom, peace, and national responsibility dominate across eras

  3. Text preprocessing, stopword removal in particular, keeps function words from dominating the frequency counts

  4. Word clouds are effective for quick thematic exploration

  5. NLP techniques can extract meaningful insights from unstructured text


🛠️ Skills & Tools Demonstrated

Technical Skills

  • Natural Language Processing (NLP)

  • Text preprocessing & cleaning

  • Tokenization & stemming

  • Frequency analysis

  • Data visualization

Tools & Libraries

  • Python

  • Pandas

  • NLTK

  • Matplotlib

  • WordCloud


Final Recommendation

This project can be extended further by:

  • Sentiment analysis across speeches

  • TF-IDF based keyword extraction

  • Topic modeling (LDA)

  • Speech comparison by political era
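As a starting point for the TF-IDF extension, here is a minimal pure-Python sketch of the plain, unsmoothed formulation: term frequency times the log of inverse document frequency. It was not part of this project, and the function name `tf_idf` is an assumption. Libraries such as scikit-learn apply smoothing and normalisation by default, so their numbers will differ.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Score each term per document as (count / doc length) * log(N / doc frequency)."""
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores
```

Words that appear in every speech score zero under this variant, which would push terms like "nation" down and surface vocabulary distinctive to each president.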






