Project Overview
In this project, I performed text analytics and natural language processing (NLP) on three historic U.S. Presidential inaugural speeches to understand their linguistic structure, vocabulary usage, and dominant themes.
Speeches Analyzed
Franklin D. Roosevelt – 1941
John F. Kennedy – 1961
Richard Nixon – 1973
The goal was not political analysis, but language analysis using Python and NLP libraries.
Problem Definition
The objectives of this analysis were:
Compute text statistics for each speech:
Number of characters
Number of words
Number of sentences
Average word length
Perform text preprocessing:
Lowercasing
Removing punctuation, numbers, and special characters
Stopword removal
Stemming
Identify the most frequently used words across all three speeches
Visualize dominant themes using a Word Cloud
Data Source
The speeches were sourced from the NLTK Inaugural Corpus, which contains official U.S. presidential inaugural addresses dating back to 1789.
from nltk.corpus import inaugural
An additional Excel file was used to organize the speeches into a structured tabular format.
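A minimal sketch of how the three speeches could be pulled from the corpus and shaped into Name/Speech rows. The file IDs follow NLTK's naming for the inaugural corpus; the helper names `load_speeches` and `to_rows` are illustrative and not taken from the original notebook.

```python
# File IDs as they are named in NLTK's inaugural corpus.
SPEECH_FILES = {
    "Roosevelt": "1941-Roosevelt.txt",
    "Kennedy": "1961-Kennedy.txt",
    "Nixon": "1973-Nixon.txt",
}


def load_speeches(file_ids=SPEECH_FILES):
    """Return {president: raw speech text}.

    Requires a one-time nltk.download("inaugural").
    """
    # Imported lazily so this module loads even before the corpus is downloaded.
    from nltk.corpus import inaugural

    return {name: inaugural.raw(fid) for name, fid in file_ids.items()}


def to_rows(speeches):
    """Shape the speeches as Name/Speech records (hypothetical helper)."""
    return [{"Name": name, "Speech": text} for name, text in speeches.items()]
```

Passing `to_rows(load_speeches())` to `pandas.DataFrame` would give a two-column table like the one described in the next section.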
Exploratory Data Analysis (EDA)
Dataset Structure
| Column | Description |
|---|---|
| Name | President name |
| Speech | Full speech text |
Rows: 3
Columns: 2
Data Type: Text (object)
No missing values or duplicates were found.
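The structure checks above could be reproduced roughly as follows. This is a sketch that assumes the data sits in a pandas DataFrame with `Name` and `Speech` columns; the placeholder strings stand in for the real speech text, and `describe_frame` is a hypothetical helper.

```python
import pandas as pd


def describe_frame(df):
    """Summarize shape, missing values, and duplicate rows."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }


# Placeholder text only; the real frame would hold the full speeches.
df = pd.DataFrame(
    {
        "Name": ["Roosevelt", "Kennedy", "Nixon"],
        "Speech": ["speech text 1", "speech text 2", "speech text 3"],
    }
)
summary = describe_frame(df)
```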
Text Statistics – Key Findings
1️⃣ Character Count
Nixon’s speech was the longest, exceeding 10,000 characters
Roosevelt's and Kennedy's speeches were similar in length (~7,600 characters each)
2️⃣ Word Count
Nixon: 1,769 words
Kennedy: 1,364 words
Roosevelt: 1,323 words
➡️ Within this sample, the most recent speech is the longest, though three speeches are too few to establish a trend over time.
3️⃣ Average Word Length
All speeches had similar average word lengths:
Roosevelt: 4.78
Kennedy: 4.62
Nixon: 4.71
➡️ Suggests similar vocabulary complexity across the three speeches, at least by this coarse measure.
4️⃣ Sentence Count
Sentence tokenization revealed:
An average of 60–70 sentences per speech
Balanced sentence structures with rhetorical emphasis
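The four statistics above can be sketched with the standard library alone. This is an illustration, not the original code: it splits words on whitespace and sentences with a simple regular expression, whereas an NLTK tokenizer such as `sent_tokenize` would likely give slightly different counts from the ones reported here.

```python
import re

# Punctuation stripped from word edges before measuring word length.
EDGE_PUNCT = ".,;:!?\"'()-"


def text_stats(text):
    """Return character, word, and sentence counts plus average word length."""
    words = text.split()
    # Strip surrounding punctuation so "nation," counts as 6 letters, not 7.
    cleaned = [w.strip(EDGE_PUNCT) for w in words]
    cleaned = [w for w in cleaned if w]
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    avg_len = sum(len(w) for w in cleaned) / len(cleaned) if cleaned else 0.0
    return {
        "chars": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "avg_word_len": round(avg_len, 2),
    }
```

Applying `text_stats` to each speech would produce one row of statistics per president.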
Text Preprocessing Pipeline
To prepare the text for analysis, the following steps were applied:
✔ Lowercasing
Ensures uniformity (America → america)
✔ Special Character & Number Removal
Removed:
Punctuation
Line breaks
Digits
Symbols
✔ Stopword Removal
Removed common English stopwords (e.g., the, is, and)
Extended stopword list to remove context-specific words like “mr”
✔ Stemming
Applied the Porter stemmer to reduce words to their stems:
running → run
freedom → freedom
Stemming merges inflected forms (e.g., nation and nations) into a single token, which makes frequency counts more consistent.
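The pipeline steps above can be sketched as one function. The exact regular expression and the function names are assumptions for illustration; the stopword set and stemmer are passed in so the cleaning logic can be tried without NLTK data, and `nltk_defaults` shows how NLTK's English stopwords (extended with "mr") and its Porter stemmer could be plugged in.

```python
import re


def preprocess(text, stop_words, stem=lambda word: word):
    """Lowercase, strip non-letters, drop stopwords, then stem each token."""
    text = text.lower()
    # Replace punctuation, digits, and symbols with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() also collapses line breaks and repeated spaces.
    tokens = text.split()
    return [stem(token) for token in tokens if token not in stop_words]


def nltk_defaults():
    """Return (stop_words, stem) built from NLTK.

    Requires a one-time nltk.download("stopwords").
    """
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stop_words = set(stopwords.words("english")) | {"mr"}
    return stop_words, PorterStemmer().stem
```

With NLTK installed, `preprocess(speech, *nltk_defaults())` would run the full pipeline on one speech.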
Frequency Analysis – Most Common Words
After preprocessing, the most frequently occurring words across all speeches were:
| Rank | Word |
|---|---|
| 1 | new |
| 2 | world |
| 3 | america |
| 4 | peace |
| 5 | nation |
| 6 | freedom |
| 7 | people |
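A ranking like the one above could be produced by pooling the cleaned tokens from all three speeches and counting them. This sketch uses `collections.Counter`; `top_words` is a hypothetical helper, and the input is assumed to be one token list per speech.

```python
from collections import Counter


def top_words(token_lists, n=7):
    """Count tokens across all speeches and return the n most common."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)
```

Counting across the pooled tokens, rather than per speech, means a long speech contributes more to the ranking than a short one.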
Interpretation
“America” & “nation” reflect national identity focus
“Peace” & “freedom” reflect wartime and Cold War–era rhetoric
“New” & “world” indicate optimism and global outlook
Rare words were intentionally retained to preserve contextual meaning.
☁️ Word Cloud Visualization
A Word Cloud was generated to visually represent dominant themes across all speeches.
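One way such a cloud could be drawn, assuming word counts are already available as a `{word: count}` dictionary. The image size, background color, and output file name are illustrative choices, not settings from the original project.

```python
def render_word_cloud(frequencies, out_path="wordcloud.png"):
    """Draw a word cloud from {word: count} and save it as an image file."""
    if not frequencies:
        raise ValueError("frequencies must contain at least one word")

    # Imported lazily so the input check above runs without the plotting stack.
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.imshow(cloud.to_array(), interpolation="bilinear")
    ax.axis("off")  # hide axes; only the words matter
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    return out_path
```

Word size in the image scales with each word's count, so the largest words correspond to the top of the frequency ranking.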
Insights from Word Cloud
Large prominence of peace, freedom, democracy, nation
Strong emphasis on global responsibility and unity
Consistent ideological messaging across different administrations
➡️ Visual analysis complements numerical frequency counts and improves interpretability.
🧠 Key Learnings & Insights
Presidential speeches maintain consistent linguistic complexity
Themes of freedom, peace, and national responsibility dominate across eras
Text preprocessing (removing stopwords, punctuation, and digits) makes content words stand out in frequency counts
Word clouds are effective for quick thematic exploration
NLP techniques can extract meaningful insights from unstructured text
🛠️ Skills & Tools Demonstrated
Technical Skills
Natural Language Processing (NLP)
Text preprocessing & cleaning
Tokenization & stemming
Frequency analysis
Data visualization
Tools & Libraries
Python
Pandas
NLTK
Matplotlib
WordCloud
Final Recommendation
This project could be extended with:
Sentiment analysis across speeches
TF-IDF based keyword extraction
Topic modeling (LDA)
Speech comparison by political era