Using NLP for Text Analytics with HTML Links, Stop Words, and Sentiment Analysis in Python

 

In the world of data science, text analytics plays a crucial role in deriving insights from large volumes of unstructured text data. Whether you're analyzing customer feedback, social media posts, or web articles, natural language processing (NLP) can help you extract meaningful information. One interesting challenge in text analysis involves handling HTML content, extracting meaningful text, and performing sentiment analysis based on predefined positive and negative word lists. In this blog post, we will dive into how to use Python and NLP techniques to analyze text data from HTML links, filter out stop words, and calculate various metrics such as positive/negative ratings, article length, and average sentence length.

Prerequisites

To follow along with the examples in this article, you need to have the following Python packages installed:

  • requests (to fetch HTML content)
  • beautifulsoup4 (for parsing HTML)
  • nltk (for natural language processing tasks)

You can install these dependencies using pip:

pip install requests beautifulsoup4 nltk

1. Fetching and Parsing HTML Content

The first step is to extract raw text from a webpage. We will use the requests library to fetch HTML content from the web, and BeautifulSoup from the beautifulsoup4 library to parse and extract text.

import requests
from bs4 import BeautifulSoup

def fetch_html_content(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        return None

def extract_text_from_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])
    return text

url = "https://example.com/article"
html_content = fetch_html_content(url)
text = extract_text_from_html(html_content)
print(text[:500])  # Print the first 500 characters of the extracted text

In this code:

  • The fetch_html_content() function retrieves the HTML from the provided URL.
  • extract_text_from_html() uses BeautifulSoup to parse the HTML and extract text from all <p> tags (typically where article text is located); a slightly more defensive variant is sketched below.
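
Not every page keeps its article text in <p> tags, and some pages embed <script> and <style> blocks whose contents would pollute the analysis. The variant below is a minimal, optional sketch (not part of the original pipeline): it strips those non-content tags and falls back to the full page text when no paragraphs are found.

def extract_text_fallback(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove tags whose text is never part of the article body
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()
    paragraphs = soup.find_all('p')
    if paragraphs:
        return ' '.join(p.get_text(strip=True) for p in paragraphs)
    # Fall back to the whole document text if the article isn't in <p> tags
    return soup.get_text(separator=' ', strip=True)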

2. Removing Stop Words

In text analytics, stop words are commonly used words (such as "the", "is", "and", etc.) that don't provide significant value in understanding the text. We'll use the nltk library to remove stop words from the text.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    words = nltk.word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words and word.isalpha()]
    return ' '.join(filtered_words)

filtered_text = remove_stop_words(text)
print(filtered_text[:500])  # Print the first 500 characters of the filtered text

In this function:

  • We use nltk.word_tokenize() to break the text into words.
  • We keep only alphabetic tokens whose lowercase form is not in the NLTK stop-word list, which drops both stop words and punctuation in one pass; a quick check on a sample sentence is shown below.
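
Here is a quick, hand-checkable example on a made-up sentence (the exact output depends on the NLTK stop-word list, but this is roughly what you should see):

sample = "The article is about data science and it is very interesting."
print(remove_stop_words(sample))
# Expected output (approximately): article data science interesting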

3. Sentiment Analysis with Positive and Negative Word Lists

Sentiment analysis involves determining the emotional tone behind a piece of text. A common approach is to use predefined lists of positive and negative words. We'll use these lists to calculate the positive and negative ratings for the text.

positive_words = ["happy", "good", "great", "excellent", "positive", "love"]
negative_words = ["bad", "hate", "poor", "sad", "terrible", "negative"]

def calculate_sentiment(text, positive_words, negative_words):
    words = nltk.word_tokenize(text.lower())
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)
    
    positive_rating = positive_count / len(words) if len(words) > 0 else 0
    negative_rating = negative_count / len(words) if len(words) > 0 else 0
    
    return positive_rating, negative_rating

positive_rating, negative_rating = calculate_sentiment(filtered_text, positive_words, negative_words)
print(f"Positive Rating: {positive_rating:.2f}")
print(f"Negative Rating: {negative_rating:.2f}")

4. Calculating Article Length and Average Sentence Length

Analyzing the article's structure is another important step in text analytics. We can calculate:

  • Article Length: The number of words in the article.
  • Average Sentence Length: The average number of words per sentence.

Here’s how you can calculate these metrics:

def calculate_text_metrics(text):
    sentences = nltk.sent_tokenize(text)
    word_count = len(nltk.word_tokenize(text))
    sentence_count = len(sentences)
    avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0
    return word_count, avg_sentence_length

# Use the original text here: stop-word removal also strips the punctuation
# that sent_tokenize relies on to find sentence boundaries
article_length, avg_sentence_length = calculate_text_metrics(text)
print(f"Article Length (Word Count): {article_length}")
print(f"Average Sentence Length: {avg_sentence_length:.2f}")

In this code:

  • We use nltk.sent_tokenize() to split the text into sentences.
  • We use nltk.word_tokenize() to count the total number of words in the article.
  • The average sentence length is simply the word count divided by the number of sentences; a tiny worked example is shown below.
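
As a sanity check, here is the function applied to a tiny made-up passage:

sample = "NLP is useful. It turns raw text into numbers."
word_count, avg_len = calculate_text_metrics(sample)
print(word_count, avg_len)
# sent_tokenize finds 2 sentences and word_tokenize finds 11 tokens
# (punctuation counts as tokens), so the output is 11 and 5.5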

5. Bringing It All Together

Let’s combine everything into a single Python script that analyzes a webpage's content. Here's the complete code:

import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
positive_words = ["happy", "good", "great", "excellent", "positive", "love"]
negative_words = ["bad", "hate", "poor", "sad", "terrible", "negative"]

def fetch_html_content(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        return None

def extract_text_from_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])
    return text

def remove_stop_words(text):
    words = nltk.word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words and word.isalpha()]
    return ' '.join(filtered_words)

def calculate_sentiment(text, positive_words, negative_words):
    words = nltk.word_tokenize(text.lower())
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)
    
    positive_rating = positive_count / len(words) if len(words) > 0 else 0
    negative_rating = negative_count / len(words) if len(words) > 0 else 0
    
    return positive_rating, negative_rating

def calculate_text_metrics(text):
    sentences = nltk.sent_tokenize(text)
    word_count = len(nltk.word_tokenize(text))
    sentence_count = len(sentences)
    avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0
    return word_count, avg_sentence_length

# Main analysis function
def analyze_article(url):
    html_content = fetch_html_content(url)
    if not html_content:
        print("Error fetching the article.")
        return

    text = extract_text_from_html(html_content)
    filtered_text = remove_stop_words(text)

    positive_rating, negative_rating = calculate_sentiment(filtered_text, positive_words, negative_words)
    # Sentence metrics need the original punctuation, so use text rather than filtered_text
    article_length, avg_sentence_length = calculate_text_metrics(text)

    print(f"Article Length (Word Count): {article_length}")
    print(f"Average Sentence Length: {avg_sentence_length:.2f}")
    print(f"Positive Rating: {positive_rating:.2f}")
    print(f"Negative Rating: {negative_rating:.2f}")

# Example URL
url = "https://example.com/article"
analyze_article(url)
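
To analyze several pages in one run, the same function can be called in a loop (the URLs below are placeholders):

urls = [
    "https://example.com/article-1",
    "https://example.com/article-2",
]
for article_url in urls:
    print(f"\n--- {article_url} ---")
    analyze_article(article_url)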

Conclusion

In this article, we’ve walked through how to:

  1. Fetch and parse HTML content using requests and BeautifulSoup.
  2. Clean and filter text by removing stop words using NLTK.
  3. Calculate sentiment ratings (positive/negative) based on predefined lists.
  4. Calculate article length and average sentence length to understand the structure of the text.

This basic framework can be easily extended to analyze a wide range of web content, and with more advanced NLP techniques, you can perform deeper analyses such as topic modeling, entity recognition, and more. NLP is a powerful tool for text analytics, and Python provides a rich ecosystem of libraries to make it easier than ever to unlock insights from textual data.
