Text Analytics on U.S. Presidential Inaugural Speeches

Project Overview

In this project, I performed text analytics and natural language processing (NLP) on three historic U.S. Presidential inaugural speeches to understand their linguistic structure, vocabulary usage, and dominant themes.

Speeches Analyzed

Franklin D. Roosevelt – 1941
John F. Kennedy – 1961
Richard Nixon – 1973

The goal was not political analysis, but language analysis using Python and NLP libraries.


Problem Definition

The objectives of this analysis were:

Compute text statistics for each speech:
- Number of characters
- Number of words
- Number of sentences
- Average word length
Perform text preprocessing:
- Lowercasing
- Removing punctuation, numbers, and special characters
- Stopword removal
- Stemming
Identify the most frequently used words across all three speeches
Visualize dominant themes using a Word Cloud

Data Source

The speeches were sourced from the NLTK Inaugural Corpus, which contains official U.S. presidential inaugural addresses dating back to 1789.

from nltk.corpus import inaugural

An additional Excel file was used to organize the speeches into a structured tabular format.

Exploratory Data Analysis (EDA)

Dataset Structure

Column	Description
Name	President name
Speech	Full speech text

Rows: 3
Columns: 2
Data Type: Text (object)

No missing values or duplicates were found.
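
The two-column table and the checks above could be reproduced roughly as follows. This is a minimal sketch, not the original notebook: it builds the table straight from the NLTK corpus instead of the Excel file, and the names `SPEECH_FILES`, `load_speeches`, and `quality_report` are hypothetical helpers introduced for illustration.

```python
# File IDs of the three speeches inside NLTK's inaugural corpus.
SPEECH_FILES = {
    "Roosevelt": "1941-Roosevelt.txt",
    "Kennedy": "1961-Kennedy.txt",
    "Nixon": "1973-Nixon.txt",
}


def load_speeches():
    """Return a DataFrame with one row per speech (columns: Name, Speech)."""
    # Imported here so the corpus download only happens when the function runs.
    import nltk
    import pandas as pd
    from nltk.corpus import inaugural

    nltk.download("inaugural", quiet=True)  # does nothing if already present
    rows = [
        {"Name": name, "Speech": inaugural.raw(file_id)}
        for name, file_id in SPEECH_FILES.items()
    ]
    return pd.DataFrame(rows, columns=["Name", "Speech"])


def quality_report(df):
    """Summarise shape, column dtypes, missing values, and duplicate rows."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": int(df.isnull().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }
```

Calling `quality_report(load_speeches())` returns the row and column counts together with the number of missing cells and duplicated rows, which is the information summarised in this section.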

Text Statistics – Key Findings

1️⃣ Character Count

Nixon’s speech was the longest, exceeding 10,000 characters
Roosevelt’s and Kennedy’s speeches were similar in length (~7,600 characters each)

2️⃣ Word Count

Nixon: 1,769 words
Kennedy: 1,364 words
Roosevelt: 1,323 words

➡️ The most recent of the three speeches is also the longest, which hints at a move toward longer addresses, although three speeches are too few to confirm a trend.

3️⃣ Average Word Length

All speeches had similar average word lengths:

Roosevelt: 4.78
Kennedy: 4.62
Nixon: 4.71

➡️ Average word length barely changes across the three decades, which suggests a similar level of vocabulary complexity.

4️⃣ Sentence Count

Sentence tokenization revealed:

An average of 60–70 sentences per speech
Balanced sentence structures with rhetorical emphasis
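
The four statistics in this section can be computed with a few lines of Python. The sketch below is illustrative only: `text_stats` is a hypothetical helper, words are split on whitespace, and a simple regular expression stands in for NLTK's sentence tokenizer, so its counts may differ slightly from the figures reported above.

```python
import re


def text_stats(text):
    """Return character, word, and sentence counts plus average word length."""
    words = text.split()  # whitespace tokens, punctuation still attached
    # Strip surrounding punctuation before measuring word length.
    bare_words = [w.strip(".,;:!?\"'()-") for w in words]
    bare_words = [w for w in bare_words if w]
    # Naive splitter: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    avg_len = sum(len(w) for w in bare_words) / len(bare_words) if bare_words else 0.0
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "avg_word_length": round(avg_len, 2),
    }


# Made-up sample text, not a quotation from any of the speeches.
sample = "We seek peace. We cherish freedom! Do we not?"
print(text_stats(sample))
# -> {'characters': 45, 'words': 9, 'sentences': 3, 'avg_word_length': 3.78}
```

Applying the same function to each speech text gives one row of statistics per president.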

Text Preprocessing Pipeline

To prepare the text for analysis, the following steps were applied:

✔ Lowercasing

Ensures uniformity (America → america)

✔ Special Character & Number Removal

Removed:

Punctuation
Line breaks
Digits
Symbols
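
Lowercasing and character removal can be combined in one small function. This is a sketch under the assumption that only the letters a–z should survive; `clean_text` is a hypothetical name, not a function from the original project.

```python
import re


def clean_text(text):
    """Lowercase the text and keep only letters separated by single spaces."""
    text = text.lower()
    # Replace anything that is not a-z or whitespace (punctuation, digits, symbols).
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse line breaks and repeated spaces into one space.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("In 1961, America\nchose: peace & freedom!"))
# -> in america chose peace freedom
```

One side effect of this choice is that apostrophes split words, so "nation's" becomes "nation s"; the stray "s" is typically dropped later because it is on most stopword lists.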

✔ Stopword Removal

Removed common English stopwords (e.g., the, is, and)
Extended stopword list to remove context-specific words like “mr”
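
A sketch of stopword removal with an extended list is shown below. The small `BASE_STOPWORDS` set is an assumption made so the example runs without downloads; `nltk_stopwords` shows how the full NLTK English list could be loaded and extended with "mr". All function names are hypothetical.

```python
# Tiny illustrative stopword set; NLTK's English list is much longer.
BASE_STOPWORDS = {"the", "is", "and", "of", "to", "a", "in", "we"}
# Context-specific additions, as described in the post.
EXTRA_STOPWORDS = {"mr"}


def remove_stopwords(tokens, stopwords=frozenset(BASE_STOPWORDS | EXTRA_STOPWORDS)):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def nltk_stopwords():
    """Build the full set from NLTK (downloads the list on first use)."""
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    return set(stopwords.words("english")) | EXTRA_STOPWORDS


tokens = "mr chief justice we seek the peace of the world".split()
print(remove_stopwords(tokens))
# -> ['chief', 'justice', 'seek', 'peace', 'world']
```

With NLTK available, `remove_stopwords(tokens, nltk_stopwords())` applies the full list instead of the illustrative one.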

✔ Stemming

Applied the Porter stemmer to reduce words to their root form:

running → run
freedom → freedom

This made frequency counts more consistent, because inflected forms of the same word are counted together.
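
Stemming a token list with NLTK might look like the sketch below. `stem_tokens` is a hypothetical helper; it uses NLTK's `PorterStemmer` by default but accepts any function that maps a word to its stem.

```python
def stem_tokens(tokens, stem=None):
    """Apply a stemming function to each token (Porter stemmer by default)."""
    if stem is None:
        # Imported lazily; PorterStemmer needs no extra corpus download.
        from nltk.stem import PorterStemmer

        stem = PorterStemmer().stem
    return [stem(t) for t in tokens]
```

With the default stemmer, `stem_tokens(["running", "freedom"])` gives `["run", "freedom"]`. Porter stems are not always dictionary words: "peace", for example, is reduced to "peac".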

Frequency Analysis – Most Common Words

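After preprocessing, the most frequently occurring words across all speeches were as follows (shown in readable form rather than as Porter stems, which would be “peac” and “peopl” for “peace” and “people”):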

Rank	Word
1	new
2	world
3	america
4	peace
5	nation
6	freedom
7	people
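
A ranking like the one above can be produced with `collections.Counter`. The sketch below uses made-up token lists rather than the real speeches, and `top_words` is a hypothetical helper name.

```python
from collections import Counter


def top_words(token_lists, n=7):
    """Count tokens across all speeches and return the n most common."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)


# Toy data standing in for three preprocessed speeches.
docs = [
    ["new", "world", "peace", "new"],
    ["world", "freedom", "new"],
    ["nation", "peace", "world"],
]
print(top_words(docs, n=3))
# -> [('new', 3), ('world', 3), ('peace', 2)]
```

Words with equal counts are returned in the order they were first seen, so ties in the ranking depend on the order of the speeches.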

Interpretation

“America” & “nation” reflect national identity focus
“Peace” & “freedom” reflect the wartime and Cold War–era rhetoric of 1941, 1961, and 1973
“New” & “world” indicate optimism and global outlook

Rare words were intentionally retained to preserve contextual meaning.

☁️ Word Cloud Visualization

A Word Cloud was generated to visually represent dominant themes across all speeches.

Insights from Word Cloud

Peace, freedom, democracy, and nation appear most prominently
Strong emphasis on global responsibility and unity
Consistent ideological messaging across different administrations

➡️ Visual analysis complements numerical frequency counts and improves interpretability.
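
A word cloud of this kind can be drawn with the `wordcloud` package and Matplotlib. The sketch below is an assumption about how the figure might be produced, not the original plotting code; the helper names and the size and colour settings are illustrative choices.

```python
def build_corpus_text(token_lists):
    """Join the cleaned tokens of every speech into one space-separated string."""
    return " ".join(" ".join(tokens) for tokens in token_lists)


def show_word_cloud(text, max_words=100):
    """Render a word cloud with the wordcloud package and Matplotlib."""
    # Third-party imports are kept local to the plotting function.
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(
        width=800,
        height=400,
        background_color="white",
        max_words=max_words,
    ).generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")  # hide the axes, only the cloud is of interest
    plt.show()
```

Passing the output of `build_corpus_text` to `show_word_cloud` draws one cloud for all speeches combined; calling it once per speech would allow a side-by-side comparison.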

🧠 Key Learnings & Insights

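The three speeches have similar average word lengths, which points to comparable linguistic complexity
Themes of freedom, peace, and national responsibility dominate across eras
Text preprocessing removes noise such as stopwords and punctuation, so thematic words stand out in the frequency counts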
Word clouds are effective for quick thematic exploration
NLP techniques can extract meaningful insights from unstructured text

🛠️ Skills & Tools Demonstrated

Technical Skills

Natural Language Processing (NLP)
Text preprocessing & cleaning
Tokenization & stemming
Frequency analysis
Data visualization

Tools & Libraries

Python
Pandas
NLTK
Matplotlib
WordCloud

Final Recommendation

This project could be extended with:

Sentiment analysis across speeches
TF-IDF based keyword extraction
Topic modeling (LDA)
Speech comparison by political era

AI Councel Lab
