
Text Analytics on U.S. Presidential Inaugural Speeches

Project Overview

In this project, I performed text analytics and natural language processing (NLP) on three historic U.S. Presidential inaugural speeches to understand their linguistic structure, vocabulary usage, and dominant themes.

Speeches Analyzed

  • Franklin D. Roosevelt – 1941

  • John F. Kennedy – 1961

  • Richard Nixon – 1973

The goal was not political analysis, but language analysis using Python and NLP libraries.

Git Link


Problem Definition

The objectives of this analysis were:

  1. Compute text statistics for each speech:

    • Number of characters

    • Number of words

    • Number of sentences

    • Average word length

  2. Perform text preprocessing:

    • Lowercasing

    • Removing punctuation, numbers, and special characters

    • Stopword removal

    • Stemming

  3. Identify the most frequently used words across all three speeches

  4. Visualize dominant themes using a Word Cloud

Data Source

The speeches were sourced from the NLTK Inaugural Corpus, which contains official U.S. presidential inaugural addresses dating back to 1789.

from nltk.corpus import inaugural

An additional Excel file was used to organize the speeches into a structured tabular format.

Exploratory Data Analysis (EDA)

Dataset Structure

  • Name: President name

  • Speech: Full speech text

  • Rows: 3

  • Columns: 2

  • Data Type: Text (object)

No missing values or duplicates were found.


Text Statistics – Key Findings

1️⃣ Character Count

  • Nixon’s speech was the longest, exceeding 10,000 characters

  • Roosevelt and Kennedy speeches were similar in length (~7,600 characters)

2️⃣ Word Count

  • Nixon: 1,769 words

  • Kennedy: 1,364 words

  • Roosevelt: 1,323 words

➡️ Suggests a shift toward longer, more detailed addresses over time, though three speeches are too small a sample to generalize.

3️⃣ Average Word Length

All speeches had similar average word lengths:

  • Roosevelt: 4.78

  • Kennedy: 4.62

  • Nixon: 4.71

➡️ Suggests consistent linguistic complexity across decades.

4️⃣ Sentence Count

Sentence tokenization revealed:

  • An average of 60–70 sentences per speech

  • Balanced sentence structures with rhetorical emphasis


Text Preprocessing Pipeline

To prepare the text for analysis, the following steps were applied:

✔ Lowercasing

Ensures uniformity (America → america)

✔ Special Character & Number Removal

Removed:

  • Punctuation

  • Line breaks

  • Digits

  • Symbols

✔ Stopword Removal

  • Removed common English stopwords (e.g., the, is, and)

  • Extended stopword list to remove context-specific words like “mr”

✔ Stemming

Applied Porter Stemmer to reduce words to root form:

  • running → run

  • freedom → freedom (unchanged, since the Porter Stemmer only strips recognized suffixes)

This made the frequency counts more consistent, since inflected forms of a word collapse into a single root.

Frequency Analysis – Most Common Words

After preprocessing, the most frequently occurring words across all speeches were:

  1. new

  2. world

  3. america

  4. peace

  5. nation

  6. freedom

  7. people

Interpretation

  • “America” & “nation” reflect national identity focus

  • “Peace” & “freedom” dominate Cold War–era rhetoric

  • “New” & “world” indicate optimism and global outlook

Rare words were intentionally retained to preserve contextual meaning.
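The ranking itself is a straightforward count over the cleaned tokens; a sketch with a hand-made token list standing in for the preprocessed speeches:

```python
from collections import Counter

# Illustrative tokens, as they might look after preprocessing
tokens = ["new", "world", "new", "peac", "nation", "america", "new", "world"]

freq = Counter(tokens)
for word, count in freq.most_common(3):
    print(word, count)
```

Running `Counter` over the concatenated token lists of all three speeches produces the ranked table above.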


☁️ Word Cloud Visualization

A Word Cloud was generated to visually represent dominant themes across all speeches.

Insights from Word Cloud

  • Large prominence of peace, freedom, democracy, nation

  • Strong emphasis on global responsibility and unity

  • Consistent ideological messaging across different administrations

➡️ Visual analysis complements numerical frequency counts and improves interpretability.


🧠 Key Learnings & Insights

  1. Presidential speeches maintain consistent linguistic complexity

  2. Themes of freedom, peace, and national responsibility dominate across eras

  3. Text preprocessing dramatically improves signal clarity

  4. Word clouds are effective for quick thematic exploration

  5. NLP techniques can extract meaningful insights from unstructured text


🛠️ Skills & Tools Demonstrated

Technical Skills

  • Natural Language Processing (NLP)

  • Text preprocessing & cleaning

  • Tokenization & stemming

  • Frequency analysis

  • Data visualization

Tools & Libraries

  • Python

  • Pandas

  • NLTK

  • Matplotlib

  • WordCloud


Final Recommendation

This project can be extended further by:

  • Sentiment analysis across speeches

  • TF-IDF based keyword extraction

  • Topic modeling (LDA)

  • Speech comparison by political era
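As a pointer for the TF-IDF extension, a toy computation over three illustrative mini-documents (not the real speeches) shows how terms that are distinctive to one speech outscore terms shared across all of them:

```python
import math

# Three tiny stand-in "speeches", already tokenized
docs = [
    "peace freedom nation".split(),
    "peace world nation".split(),
    "new world america".split(),
]

def tf_idf(term: str, doc: list, docs: list) -> float:
    tf = doc.count(term) / len(doc)               # term frequency in one document
    df = sum(1 for d in docs if term in d)        # number of documents containing the term
    idf = math.log(len(docs) / df)                # rarer across documents -> higher weight
    return tf * idf

print(round(tf_idf("freedom", docs[0], docs), 3))  # distinctive to one speech
print(round(tf_idf("nation", docs[0], docs), 3))   # shared term scores lower
```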






