
Using NLP for Text Analytics with HTML Links, Stop Words, and Sentiment Analysis in Python

 

In the world of data science, text analytics plays a crucial role in deriving insights from large volumes of unstructured text data. Whether you're analyzing customer feedback, social media posts, or web articles, natural language processing (NLP) can help you extract meaningful information. One interesting challenge in text analysis involves handling HTML content, extracting meaningful text, and performing sentiment analysis based on predefined positive and negative word lists. In this blog post, we will dive into how to use Python and NLP techniques to analyze text data from HTML links, filter out stop words, and calculate various metrics such as positive/negative ratings, article length, and average sentence length.

Prerequisites

To follow along with the examples in this article, you need to have the following Python packages installed:

  • requests (to fetch HTML content)
  • beautifulsoup4 (for parsing HTML)
  • nltk (for natural language processing tasks)
  • re (for regular expressions; this module ships with Python's standard library, so it does not need to be installed)

You can install these dependencies using pip:

pip install requests beautifulsoup4 nltk

1. Fetching and Parsing HTML Content

The first step is to extract raw text from a webpage. We will use the requests library to fetch HTML content from the web, and BeautifulSoup from the beautifulsoup4 library to parse and extract text.

import requests
from bs4 import BeautifulSoup

def fetch_html_content(url):
    # Request the page and return the raw HTML, or None if the request fails.
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        return None

def extract_text_from_html(html_content):
    # Parse the HTML and join the text of all <p> tags into a single string.
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])
    return text

url = "https://example.com/article"
html_content = fetch_html_content(url)
text = extract_text_from_html(html_content)
print(text[:500])  # Print the first 500 characters of the extracted text

In this code:

  • The fetch_html_content() function retrieves the HTML from the provided URL.
  • extract_text_from_html() uses BeautifulSoup to parse the HTML and extract text from all <p> tags (typically where article text is located).
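
Text extracted from HTML frequently contains stray line breaks and long runs of whitespace. The snippet below is an optional cleanup step using the re module listed in the prerequisites; clean_text is just a helper name introduced here for illustration.

import re

def clean_text(raw_text):
    # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces
    # and strip any leading/trailing whitespace left over from the HTML markup.
    return re.sub(r'\s+', ' ', raw_text).strip()

text = clean_text(text)
print(text[:200])  # Preview the cleaned text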

2. Removing Stop Words

In text analytics, stop words are commonly used words (such as "the", "is", "and", etc.) that don't provide significant value in understanding the text. We'll use the nltk library to remove stop words from the text.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    words = nltk.word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words and word.isalpha()]
    return ' '.join(filtered_words)

filtered_text = remove_stop_words(text)
print(filtered_text[:500])  # Print the first 500 characters of the filtered text

In this function:

  • We use nltk.word_tokenize() to break the text into words.
  • We filter out stop words by checking each word against the NLTK stop word list, and we keep only alphabetic tokens, so punctuation and numbers are dropped as well.
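
The built-in NLTK list covers general English; for a specific corpus you may also want to drop words that appear constantly but carry little meaning in your domain. Here is a minimal sketch, where the custom words are purely hypothetical examples:

# Hypothetical domain-specific additions; replace them with words common in your own corpus.
custom_stop_words = {"also", "however", "via"}
stop_words = set(stopwords.words('english')) | custom_stop_words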

3. Sentiment Analysis with Positive and Negative Word Lists

Sentiment analysis involves determining the emotional tone behind a piece of text. A common approach is to use predefined lists of positive and negative words. We'll use these lists to calculate the positive and negative ratings for the text.

positive_words = ["happy", "good", "great", "excellent", "positive", "love"]
negative_words = ["bad", "hate", "poor", "sad", "terrible", "negative"]

def calculate_sentiment(text, positive_words, negative_words):
    words = nltk.word_tokenize(text.lower())
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)
    
    positive_rating = positive_count / len(words) if len(words) > 0 else 0
    negative_rating = negative_count / len(words) if len(words) > 0 else 0
    
    return positive_rating, negative_rating

positive_rating, negative_rating = calculate_sentiment(filtered_text, positive_words, negative_words)
print(f"Positive Rating: {positive_rating:.2f}")
print(f"Negative Rating: {negative_rating:.2f}")

4. Calculating Article Length and Average Sentence Length

Analyzing the article's structure is another important step in text analytics. We can calculate:

  • Article Length: The number of words in the article.
  • Average Sentence Length: The average number of words per sentence.

Here’s how you can calculate these metrics:

def calculate_text_metrics(text):
    # Sentence splitting relies on punctuation, so pass the original text here,
    # not the stop-word-filtered version.
    sentences = nltk.sent_tokenize(text)
    words = [word for word in nltk.word_tokenize(text) if word.isalpha()]
    word_count = len(words)
    sentence_count = len(sentences)
    avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0
    return word_count, avg_sentence_length

article_length, avg_sentence_length = calculate_text_metrics(text)
print(f"Article Length (Word Count): {article_length}")
print(f"Average Sentence Length: {avg_sentence_length:.2f}")

In this code:

  • We use nltk.sent_tokenize() to split the text into sentences. This is why we pass the original text rather than the filtered version: sentence splitting relies on punctuation, which the stop-word filtering step strips out.
  • We use nltk.word_tokenize() and keep only the alphabetic tokens to count the words in the article.
  • The average sentence length is simply the word count divided by the number of sentences.
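
As a quick sanity check, you can run the function on a small hard-coded string; this is just an illustrative snippet, separate from the article pipeline above.

sample = "NLP is useful. It helps us analyze text. Short sentences are easy to read."
count, avg_len = calculate_text_metrics(sample)
print(count)    # 14 alphabetic tokens across 3 sentences
print(avg_len)  # roughly 4.67 words per sentence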

5. Bringing It All Together

Let’s combine everything into a single Python script that analyzes a webpage's content. Here's the complete code:

import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
positive_words = ["happy", "good", "great", "excellent", "positive", "love"]
negative_words = ["bad", "hate", "poor", "sad", "terrible", "negative"]

def fetch_html_content(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        return None

def extract_text_from_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])
    return text

def remove_stop_words(text):
    words = nltk.word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words and word.isalpha()]
    return ' '.join(filtered_words)

def calculate_sentiment(text, positive_words, negative_words):
    words = nltk.word_tokenize(text.lower())
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)
    
    positive_rating = positive_count / len(words) if len(words) > 0 else 0
    negative_rating = negative_count / len(words) if len(words) > 0 else 0
    
    return positive_rating, negative_rating

def calculate_text_metrics(text):
    # Sentence splitting relies on punctuation, so pass the original text here,
    # not the stop-word-filtered version.
    sentences = nltk.sent_tokenize(text)
    words = [word for word in nltk.word_tokenize(text) if word.isalpha()]
    word_count = len(words)
    sentence_count = len(sentences)
    avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0
    return word_count, avg_sentence_length

# Main analysis function
def analyze_article(url):
    html_content = fetch_html_content(url)
    if not html_content:
        print("Error fetching the article.")
        return

    text = extract_text_from_html(html_content)
    filtered_text = remove_stop_words(text)

    positive_rating, negative_rating = calculate_sentiment(filtered_text, positive_words, negative_words)
    article_length, avg_sentence_length = calculate_text_metrics(text)

    print(f"Article Length (Word Count): {article_length}")
    print(f"Average Sentence Length: {avg_sentence_length:.2f}")
    print(f"Positive Rating: {positive_rating:.2f}")
    print(f"Negative Rating: {negative_rating:.2f}")

# Example URL
url = "https://example.com/article"
analyze_article(url)

Conclusion

In this article, we’ve walked through how to:

  1. Fetch and parse HTML content using requests and BeautifulSoup.
  2. Clean and filter text by removing stop words using NLTK.
  3. Calculate sentiment ratings (positive/negative) based on predefined lists.
  4. Calculate article length and average sentence length to understand the structure of the text.

This basic framework can be easily extended to analyze a wide range of web content, and with more advanced NLP techniques, you can perform deeper analyses such as topic modeling, entity recognition, and more. NLP is a powerful tool for text analytics, and Python provides a rich ecosystem of libraries to make it easier than ever to unlock insights from textual data.
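
As one concrete next step beyond the simple word-list approach, NLTK also bundles the VADER sentiment analyzer, which comes with its own tuned lexicon. Below is a minimal sketch, assuming you can download the vader_lexicon resource and that text holds the raw article text extracted earlier:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores(text)  # dict with 'neg', 'neu', 'pos', and 'compound' scores
print(scores)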
