In the world of data science, text analytics plays a crucial role in deriving insights from large volumes of unstructured text data. Whether you're analyzing customer feedback, social media posts, or web articles, natural language processing (NLP) can help you extract meaningful information. One interesting challenge in text analysis involves handling HTML content, extracting meaningful text, and performing sentiment analysis based on predefined positive and negative word lists. In this blog post, we will dive into how to use Python and NLP techniques to analyze text data from HTML links, filter out stop words, and calculate various metrics such as positive/negative ratings, article length, and average sentence length.
Prerequisites
To follow along with the examples in this article, you need the following Python packages installed:
- requests (to fetch HTML content)
- beautifulsoup4 (for parsing HTML)
- nltk (for natural language processing tasks)
The re module (for regular expressions) ships with Python's standard library, so it needs no installation.
You can install the third-party dependencies using pip:
pip install requests beautifulsoup4 nltk
1. Fetching and Parsing HTML Content
The first step is to extract raw text from a webpage. We will use the requests library to fetch HTML content from the web, and the BeautifulSoup class from the beautifulsoup4 library to parse the HTML and extract text.
import requests
from bs4 import BeautifulSoup

def fetch_html_content(url):
    # Fetch the raw HTML for a URL; return None on a non-200 response
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        return response.text
    else:
        return None

def extract_text_from_html(html_content):
    # Parse the HTML and join the text of all <p> tags
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])
    return text

url = "https://example.com/article"
html_content = fetch_html_content(url)
text = extract_text_from_html(html_content)
print(text[:500])  # Print the first 500 characters of the extracted text
In this code:
- The fetch_html_content() function retrieves the HTML from the provided URL.
- The extract_text_from_html() function uses BeautifulSoup to parse the HTML and extract text from all <p> tags (typically where article text is located); a fallback for pages that keep their text elsewhere is sketched below.
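Not every page keeps its article body inside <p> tags. The helper below is a minimal sketch of such a fallback; the function name and the choice to strip <script> and <style> tags are our own assumptions, not part of any fixed API.

def extract_text_with_fallback(html_content):
    # Hypothetical helper: prefer <p> text, otherwise fall back to the whole page text
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')
    if paragraphs:
        return ' '.join(para.get_text() for para in paragraphs)
    # Remove script and style content before grabbing everything else
    for tag in soup(['script', 'style']):
        tag.decompose()
    return soup.get_text(separator=' ', strip=True)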
2. Removing Stop Words
In text analytics, stop words are commonly used words (such as "the", "is", and "and") that don't provide significant value in understanding the text. We'll use the nltk library to remove them from the text.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    # Tokenize, then keep only alphabetic tokens that are not stop words
    words = nltk.word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words and word.isalpha()]
    return ' '.join(filtered_words)

filtered_text = remove_stop_words(text)
print(filtered_text[:500])  # Print the first 500 characters of the filtered text
In this function:
- We use nltk.word_tokenize() to break the text into words.
- We filter out stop words by checking whether each word appears in the NLTK stop words list, and we keep only alphabetic tokens, which also drops punctuation and numbers (see the quick check below).
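As a quick sanity check, you can run the function on a short sample sentence (the sentence is purely illustrative):

sample = "The weather is great and we are happy about it."
print(remove_stop_words(sample))
# Expected to print something like: weather great happy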
3. Sentiment Analysis with Positive and Negative Word Lists
Sentiment analysis involves determining the emotional tone behind a piece of text. A common approach is to use predefined lists of positive and negative words. We'll use these lists to calculate the positive and negative ratings for the text.
positive_words = ["happy", "good", "great", "excellent", "positive", "love"]
negative_words = ["bad", "hate", "poor", "sad", "terrible", "negative"]

def calculate_sentiment(text, positive_words, negative_words):
    # Ratings are the share of tokens that appear in each word list
    words = nltk.word_tokenize(text.lower())
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)
    positive_rating = positive_count / len(words) if len(words) > 0 else 0
    negative_rating = negative_count / len(words) if len(words) > 0 else 0
    return positive_rating, negative_rating

positive_rating, negative_rating = calculate_sentiment(filtered_text, positive_words, negative_words)
print(f"Positive Rating: {positive_rating:.2f}")
print(f"Negative Rating: {negative_rating:.2f}")
4. Calculating Article Length and Average Sentence Length
Analyzing the article's structure is another important step in text analytics. We can calculate:
- Article Length: The number of words in the article.
- Average Sentence Length: The average number of words per sentence.
Here’s how you can calculate these metrics:
def calculate_text_metrics(text):
    # Split into sentences and count word tokens to derive the averages
    sentences = nltk.sent_tokenize(text)
    word_count = len(nltk.word_tokenize(text))
    sentence_count = len(sentences)
    avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0
    return word_count, avg_sentence_length

# Use the original text here: remove_stop_words() strips punctuation,
# so sentence boundaries would be lost in the filtered version
article_length, avg_sentence_length = calculate_text_metrics(text)
print(f"Article Length (Word Count): {article_length}")
print(f"Average Sentence Length: {avg_sentence_length:.2f}")
In this code:
- We use nltk.sent_tokenize() to split the text into sentences.
- We use nltk.word_tokenize() to count the total number of words in the article.
- The average sentence length is simply the word count divided by the number of sentences.
- Note that the metrics run on the original text rather than the filtered one, because remove_stop_words() also strips the punctuation that sent_tokenize() relies on; a small worked check follows below.
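A quick check with a toy string (the string and the numbers are illustrative only) shows how the metrics behave; note that nltk.word_tokenize() also counts punctuation marks as tokens, so the word count is a slight overestimate:

sample = "Good food arrived quickly. The staff was friendly and attentive. We will come back."
word_count, avg_len = calculate_text_metrics(sample)
# 3 sentences and roughly 17 tokens (periods included), so avg_len comes out around 5.7
print(word_count, avg_len)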
5. Bringing It All Together
Let’s combine everything into a single Python script that analyzes a webpage's content. Here's the complete code:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
positive_words = ["happy", "good", "great", "excellent", "positive", "love"]
negative_words = ["bad", "hate", "poor", "sad", "terrible", "negative"]

def fetch_html_content(url):
    # Fetch the raw HTML for a URL; return None on a non-200 response
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        return response.text
    else:
        return None

def extract_text_from_html(html_content):
    # Parse the HTML and join the text of all <p> tags
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])
    return text

def remove_stop_words(text):
    # Keep only alphabetic tokens that are not stop words
    words = nltk.word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words and word.isalpha()]
    return ' '.join(filtered_words)

def calculate_sentiment(text, positive_words, negative_words):
    # Ratings are the share of tokens that appear in each word list
    words = nltk.word_tokenize(text.lower())
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)
    positive_rating = positive_count / len(words) if len(words) > 0 else 0
    negative_rating = negative_count / len(words) if len(words) > 0 else 0
    return positive_rating, negative_rating

def calculate_text_metrics(text):
    # Split into sentences and count word tokens to derive the averages
    sentences = nltk.sent_tokenize(text)
    word_count = len(nltk.word_tokenize(text))
    sentence_count = len(sentences)
    avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0
    return word_count, avg_sentence_length

# Main analysis function
def analyze_article(url):
    html_content = fetch_html_content(url)
    if not html_content:
        print("Error fetching the article.")
        return
    text = extract_text_from_html(html_content)
    filtered_text = remove_stop_words(text)
    positive_rating, negative_rating = calculate_sentiment(filtered_text, positive_words, negative_words)
    # Structural metrics use the original text so sentence boundaries survive
    article_length, avg_sentence_length = calculate_text_metrics(text)
    print(f"Article Length (Word Count): {article_length}")
    print(f"Average Sentence Length: {avg_sentence_length:.2f}")
    print(f"Positive Rating: {positive_rating:.2f}")
    print(f"Negative Rating: {negative_rating:.2f}")

# Example URL
url = "https://example.com/article"
analyze_article(url)
Conclusion
In this article, we’ve walked through how to:
- Fetch and parse HTML content using requests and BeautifulSoup.
- Clean and filter text by removing stop words using NLTK.
- Calculate sentiment ratings (positive/negative) based on predefined word lists.
- Calculate article length and average sentence length to understand the structure of the text.
This basic framework can be easily extended to analyze a wide range of web content, and with more advanced NLP techniques, you can perform deeper analyses such as topic modeling, entity recognition, and more. NLP is a powerful tool for text analytics, and Python provides a rich ecosystem of libraries to make it easier than ever to unlock insights from textual data.
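As one concrete next step, NLTK ships the VADER sentiment model, which scores text without a hand-built word list. A minimal sketch, assuming a text variable holding extracted article text as produced in section 1:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores(text)  # dict with 'neg', 'neu', 'pos' and 'compound' scores
print(scores)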