In the world of data science, text analytics plays a crucial role in deriving insights from large volumes of unstructured text data. Whether you're analyzing customer feedback, social media posts, or web articles, natural language processing (NLP) can help you extract meaningful information. One interesting challenge in text analysis involves handling HTML content, extracting meaningful text, and performing sentiment analysis based on predefined positive and negative word lists. In this blog post, we will dive into how to use Python and NLP techniques to analyze text data from HTML links, filter out stop words, and calculate various metrics such as positive/negative ratings, article length, and average sentence length.
Prerequisites
To follow along with the examples in this article, you need the following Python packages installed:
- requests (to fetch HTML content)
- beautifulsoup4 (for parsing HTML)
- nltk (for natural language processing tasks)
The re module (for regular expressions) ships with Python's standard library, so it needs no installation.
You can install the third-party dependencies using pip:
pip install requests beautifulsoup4 nltk
1. Fetching and Parsing HTML Content
The first step is to extract raw text from a webpage. We will use the requests library to fetch HTML content from the web, and the BeautifulSoup class from the beautifulsoup4 library to parse the HTML and extract text.
import requests
from bs4 import BeautifulSoup

def fetch_html_content(url):
    # Fetch the raw HTML for a URL; return None on a non-200 response
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        return response.text
    else:
        return None

def extract_text_from_html(html_content):
    # Parse the HTML and join the text of all <p> tags
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])
    return text

url = "https://example.com/article"
html_content = fetch_html_content(url)
text = extract_text_from_html(html_content)
print(text[:500])  # Print the first 500 characters of the extracted text
In this code:
- The fetch_html_content() function retrieves the HTML from the provided URL.
- The extract_text_from_html() function uses BeautifulSoup to parse the HTML and extract text from all <p> tags (typically where article text is located); a fallback for pages that keep their text elsewhere is sketched below.
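Not every page keeps its article body inside <p> tags. The helper below is a minimal sketch of such a fallback; the function name and the choice to strip <script> and <style> tags are our own assumptions, not part of any fixed API.

def extract_text_with_fallback(html_content):
    # Hypothetical helper: prefer <p> text, otherwise fall back to the whole page text
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')
    if paragraphs:
        return ' '.join(para.get_text() for para in paragraphs)
    # Remove script and style content before grabbing everything else
    for tag in soup(['script', 'style']):
        tag.decompose()
    return soup.get_text(separator=' ', strip=True)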
2. Removing Stop Words
In text analytics, stop words are commonly used words (such as "the", "is", and "and") that don't provide significant value in understanding the text. We'll use the nltk library to remove them from the text.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    # Tokenize, then keep only alphabetic tokens that are not stop words
    words = nltk.word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words and word.isalpha()]
    return ' '.join(filtered_words)

filtered_text = remove_stop_words(text)
print(filtered_text[:500])  # Print the first 500 characters of the filtered text
In this function:
- We use nltk.word_tokenize() to break the text into words.
- We filter out stop words by checking whether each word appears in the NLTK stop words list, and we keep only alphabetic tokens, which also drops punctuation and numbers (see the quick check below).
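As a quick sanity check, you can run the function on a short sample sentence (the sentence is purely illustrative):

sample = "The weather is great and we are happy about it."
print(remove_stop_words(sample))
# Expected to print something like: weather great happy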
3. Sentiment Analysis with Positive and Negative Word Lists
Sentiment analysis involves determining the emotional tone behind a piece of text. A common approach is to use predefined lists of positive and negative words. We'll use these lists to calculate the positive and negative ratings for the text.
positive_words = ["happy", "good", "great", "excellent", "positive", "love"]
negative_words = ["bad", "hate", "poor", "sad", "terrible", "negative"]

def calculate_sentiment(text, positive_words, negative_words):
    # Ratings are the share of tokens that appear in each word list
    words = nltk.word_tokenize(text.lower())
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)
    positive_rating = positive_count / len(words) if len(words) > 0 else 0
    negative_rating = negative_count / len(words) if len(words) > 0 else 0
    return positive_rating, negative_rating

positive_rating, negative_rating = calculate_sentiment(filtered_text, positive_words, negative_words)
print(f"Positive Rating: {positive_rating:.2f}")
print(f"Negative Rating: {negative_rating:.2f}")
4. Calculating Article Length and Average Sentence Length
Analyzing the article's structure is another important step in text analytics. We can calculate:
- Article Length: The number of words in the article.
- Average Sentence Length: The average number of words per sentence.
Here’s how you can calculate these metrics:
def calculate_text_metrics(text):
    # Split into sentences and count word tokens to derive the averages
    sentences = nltk.sent_tokenize(text)
    word_count = len(nltk.word_tokenize(text))
    sentence_count = len(sentences)
    avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0
    return word_count, avg_sentence_length

# Use the original text here: remove_stop_words() strips punctuation,
# so sentence boundaries would be lost in the filtered version
article_length, avg_sentence_length = calculate_text_metrics(text)
print(f"Article Length (Word Count): {article_length}")
print(f"Average Sentence Length: {avg_sentence_length:.2f}")
In this code:
- We use nltk.sent_tokenize() to split the text into sentences.
- We use nltk.word_tokenize() to count the total number of words in the article.
- The average sentence length is simply the word count divided by the number of sentences.
- Note that the metrics run on the original text rather than the filtered one, because remove_stop_words() also strips the punctuation that sent_tokenize() relies on; a small worked check follows below.
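A quick check with a toy string (the string and the numbers are illustrative only) shows how the metrics behave; note that nltk.word_tokenize() also counts punctuation marks as tokens, so the word count is a slight overestimate:

sample = "Good food arrived quickly. The staff was friendly and attentive. We will come back."
word_count, avg_len = calculate_text_metrics(sample)
# 3 sentences and roughly 17 tokens (periods included), so avg_len comes out around 5.7
print(word_count, avg_len)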
5. Bringing It All Together
Let’s combine everything into a single Python script that analyzes a webpage's content. Here's the complete code:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
positive_words = ["happy", "good", "great", "excellent", "positive", "love"]
negative_words = ["bad", "hate", "poor", "sad", "terrible", "negative"]

def fetch_html_content(url):
    # Fetch the raw HTML for a URL; return None on a non-200 response
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        return response.text
    else:
        return None

def extract_text_from_html(html_content):
    # Parse the HTML and join the text of all <p> tags
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])
    return text

def remove_stop_words(text):
    # Keep only alphabetic tokens that are not stop words
    words = nltk.word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words and word.isalpha()]
    return ' '.join(filtered_words)

def calculate_sentiment(text, positive_words, negative_words):
    # Ratings are the share of tokens that appear in each word list
    words = nltk.word_tokenize(text.lower())
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)
    positive_rating = positive_count / len(words) if len(words) > 0 else 0
    negative_rating = negative_count / len(words) if len(words) > 0 else 0
    return positive_rating, negative_rating

def calculate_text_metrics(text):
    # Split into sentences and count word tokens to derive the averages
    sentences = nltk.sent_tokenize(text)
    word_count = len(nltk.word_tokenize(text))
    sentence_count = len(sentences)
    avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0
    return word_count, avg_sentence_length

# Main analysis function
def analyze_article(url):
    html_content = fetch_html_content(url)
    if not html_content:
        print("Error fetching the article.")
        return
    text = extract_text_from_html(html_content)
    filtered_text = remove_stop_words(text)
    positive_rating, negative_rating = calculate_sentiment(filtered_text, positive_words, negative_words)
    # Structural metrics use the original text so sentence boundaries survive
    article_length, avg_sentence_length = calculate_text_metrics(text)
    print(f"Article Length (Word Count): {article_length}")
    print(f"Average Sentence Length: {avg_sentence_length:.2f}")
    print(f"Positive Rating: {positive_rating:.2f}")
    print(f"Negative Rating: {negative_rating:.2f}")

# Example URL
url = "https://example.com/article"
analyze_article(url)
Conclusion
In this article, we’ve walked through how to:
- Fetch and parse HTML content using requests and BeautifulSoup.
- Clean and filter text by removing stop words using NLTK.
- Calculate sentiment ratings (positive/negative) based on predefined word lists.
- Calculate article length and average sentence length to understand the structure of the text.
This basic framework can be easily extended to analyze a wide range of web content, and with more advanced NLP techniques, you can perform deeper analyses such as topic modeling, entity recognition, and more. NLP is a powerful tool for text analytics, and Python provides a rich ecosystem of libraries to make it easier than ever to unlock insights from textual data.
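As one concrete next step, NLTK ships the VADER sentiment model, which scores text without a hand-built word list. A minimal sketch, assuming a text variable holding extracted article text as produced in section 1:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores(text)  # dict with 'neg', 'neu', 'pos' and 'compound' scores
print(scores)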