Topic-Based Sentiment Analysis, also known as fine-grained opinion mining, focuses on identifying the topics discussed in a collection of documents and then computing a sentiment score for each topic from the subjective terms associated with it. Thus, topic modeling keeps track of specific topics across all the documents in the dataset, while sentiment analysis aggregates the subjective feedback associated with these topics in the text.
In practice, the topics produced by topic modeling often represent aspects of the given domain; e.g., for a hotel reviews dataset the topics may represent cleanliness, service, food quality, etc. Therefore, analyzing the sentiment of the topics is often called Aspect-Based Sentiment Analysis (ABSA).
Conventional sentiment analysis typically assigns an overall sentiment label (e.g., positive, negative, or neutral) to an entire text. While this is sufficient for many applications, it lacks the granularity needed in scenarios where sentiment varies across different aspects. For instance, in a restaurant review, a customer may rate the restaurant positively overall but criticize the service. In such cases, ABSA helps capture the sentiment towards individual aspects, such as identifying that the sentiment toward “service” is negative, despite the overall positive review.
Example: Hotel reviews
Hotel aspects: cleanliness, staff behavior, food quality, location, service and amenities
Aspect sentiment analysis: sentiment scores aggregated for each aspect
- cleanliness_tables_area: ⭐⭐⭐⭐⭐
- staff_behavior_serving_greeting: ⭐⭐⭐
- food_menu_taste_cuisines: ⭐⭐⭐⭐⭐
- location_by_main_road_close_to_city: ⭐⭐
- service_waiter_table_booking: ⭐⭐
Aspect-Based Sentiment Analysis (ABSA) is particularly useful in analyzing hotel reviews, where customers express opinions on multiple aspects of their stay, such as cleanliness, staff behavior, room quality, location, and amenities. Traditional sentiment analysis may label a review as positive or negative as a whole, but ABSA allows for a more nuanced understanding by identifying sentiment tied to specific aspects. For example, a guest might praise the hotel’s location and service but complain about the room’s cleanliness. By applying ABSA, hotel management can gain detailed insights into what aspects need improvement while maintaining strengths. Additionally, potential customers can make informed decisions based on sentiments about aspects that matter most to them. This fine-grained analysis helps hotels enhance customer experience and tailor their services to meet guest expectations more effectively.
Aspect-Based Sentiment Analysis (ABSA) is a challenging task as it involves both identifying relevant “aspects” within a text and assigning sentiment labels to them. Various approaches exist for ABSA, but a common strategy involves first detecting aspects in the text and then applying an ABSA model to determine the sentiment associated with each aspect.
Aspect identification can be performed using different techniques, including rule-based methods such as dictionary-based approaches. For instance, terms like “iPhone X” or “MacBook Pro” might be predefined as aspects.
After identifying aspects, an ABSA classifier is trained to assess sentiment in relation to the context of a sentence. For example, in the sentence, “We had a great experience at the restaurant, the food was delicious, but the service was kinda bad,” the classifier would determine that the sentiment towards “service” is negative, despite the overall positive tone of the review.
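To make this two-step strategy concrete, below is a hypothetical minimal sketch of dictionary-based aspect detection combined with a toy sentiment lexicon; the aspect set, the lexicon, the context-window heuristic, and the helper name are all illustrative assumptions, not the method used later in this tutorial.

# Hypothetical sketch: dictionary-based aspect detection with a toy lexicon
# ASPECTS, LEXICON and the context-window heuristic are illustrative only
ASPECTS = {"food", "service", "location"}
LEXICON = {"great": 1, "delicious": 1, "friendly": 1, "bad": -1, "dirty": -1}

def aspect_sentiments(text):
    tokens = text.lower().replace(",", "").split()
    scores = {}
    for i, token in enumerate(tokens):
        if token in ASPECTS:
            # Naive heuristic: sum lexicon scores of words near the aspect term
            window = tokens[max(0, i - 2): i + 4]
            scores[token] = sum(LEXICON.get(w, 0) for w in window)
    return scores

print(aspect_sentiments("We had a great experience at the restaurant, the food was delicious, but the service was kinda bad"))
# {'food': 1, 'service': -1}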
Topic Modeling
Topic modeling is an unsupervised machine learning technique used to identify hidden thematic structures in a large collection of text data. It helps discover topics that frequently occur in a dataset without requiring prior labeling or annotation. One of the most widely used topic modeling methods is Latent Dirichlet Allocation (LDA), which represents documents as mixtures of topics, with each topic consisting of a set of words with varying probabilities. Topic modeling is commonly applied in text mining, information retrieval, document classification, and content recommendation systems. It enables researchers and businesses to analyze vast amounts of textual data, uncover trends, and gain insights into discussions, making it a valuable tool in areas such as social media analysis, academic research, and customer feedback categorization.
Sentiment Analysis
Sentiment analysis, also known as opinion mining, is a natural language processing (NLP) technique used to determine the sentiment or emotional tone expressed in a piece of text. It involves classifying text into categories such as positive, negative, or neutral, enabling businesses and researchers to analyze opinions, feedback, and trends. Sentiment analysis is widely applied in various domains, including social media monitoring, customer feedback analysis, brand reputation management, and market research. Advanced sentiment analysis techniques, such as deep learning and transformer-based models, enhance accuracy by capturing contextual nuances, sarcasm, and complex emotions within text data.
Tutorial Content
- Data preparation / preprocessing
- Integer encoding
- Topic modeling (Latent Dirichlet Allocation with collapsed Gibbs sampling)
- Performing sentiment analysis (using SentiStrength)
- Separating neutral, i.e., topic (aspect) presenting, words from subjective words
- Aggregating scores of the subjective words against each topic
- Preparing output
# It's a vanilla implementation of topic modeling that only uses basic tools:
# json - to read from and write to files in JSON format
# numpy - for faster matrix operations
# pandas - to read CSV data
# string - to keep only English letters, removing punctuation and other characters
# random - to generate random numbers for initializing the Markov chain Monte Carlo
# state, and during the algorithm's run to avoid local optima
import json
import pandas as pd
1. Data preparation
from data_preparation import *
from lda import LDA
1.1. Read, clean and tokenize textual data
- The method reads textual data from a CSV file having one column with the input texts.
- It cleans the text by removing punctuation and stopwords (LDA can generally cope with these, but they degrade its performance).
- It tokenizes the text into words (a sketch of these helpers follows the list).
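The read_data and clean_tokenize helpers come from the tutorial's data_preparation module, which is not reproduced here; the following is a minimal sketch of what they might look like, assuming a small inline stopword list. The tutorial's actual module may differ; for instance, the sample output later suggests it preserves case.

# Sketch of the data_preparation helpers; the stopword list is illustrative
import string
import pandas as pd

STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "was", "were",
             "in", "on", "at", "of", "to", "for", "it", "this", "that"}

def read_data(csv_path):
    # Read a one-column CSV and return its texts as a list of strings
    return pd.read_csv(csv_path).iloc[:, 0].astype(str).tolist()

def clean_tokenize(text):
    # Remove punctuation, split on whitespace, drop stopwords and 1-letter tokens
    text = text.translate(str.maketrans('', '', string.punctuation))
    return [t for t in text.split() if t.lower() not in STOPWORDS and len(t) > 1]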
# Read input data from data/input_dataset.csv containing BBC news titles
# (for the complete dataset visit https://github.com/vahadruya/Capstone-Project-Unsupervised-ML-Topic-Modelling/blob/main/Input_Data/input.csv)
with open('config.json', 'r') as file:
    config = json.load(file)
dataset = read_data(config["text-doc-path"])
# Clean and tokenize
tokenized_documents = []
for document in dataset:
    document = clean_tokenize(document)
    if len(document) > 2:  # Keep only documents with more than two tokens
        tokenized_documents.append(document)
len(tokenized_documents)
605
1.2. Generate integer encoding
It preserves both frequency- and position-related information. The process assigns each unique token a dedicated integer id, preserving the mapping in a dictionary for later retrieval, while rewriting the documents by replacing words with their integer ids.
This makes the subsequent operations much faster, as integers are cheaper to read, store, and compare than strings.
The integer ids will be replaced with their original words at the end, using the stored dictionary files.
# Create a dictionary of unique tokens and assign integers
# The two dictionaries are also created to maintain the generated mappings i.e., dictionary (word -> int) and revdictionary (int -> word)
encoded_documents, dictionary, revdictionary = integer_encode_dataset(tokenized_documents)
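integer_encode_dataset also lives in the data_preparation module; a minimal sketch of the encoding logic described above, assuming this call signature, could look like this:

# Sketch: assign each unique token an integer id and rewrite the documents
def integer_encode_dataset(documents):
    dictionary = {}      # word -> int
    revdictionary = {}   # int -> word
    encoded_documents = []
    for doc in documents:
        encoded_doc = []
        for token in doc:
            if token not in dictionary:
                new_id = len(dictionary)
                dictionary[token] = new_id
                revdictionary[new_id] = token
            encoded_doc.append(dictionary[token])
        encoded_documents.append(encoded_doc)
    return encoded_documents, dictionary, revdictionary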
1.3. Storing intermediate data
The integer-encoded documents are stored in a file; the word-to-id and id-to-word dictionaries are also stored.
This avoids repeating the previous steps each time topic modeling is performed under different settings.
write_encoded_dataset(encoded_documents, config['integer-encoded-doc-path'])
write_dictionaries(dictionary, revdictionary, config['word_integer-dict_path'], config['integer-word-dict_path'])
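In spirit, the two writer helpers are plain JSON dumps; a sketch, assuming they simply serialize their arguments to the configured paths:

# Sketch of the storage helpers: plain JSON serialization
import json

def write_encoded_dataset(encoded_documents, path):
    with open(path, 'w') as f:
        json.dump(encoded_documents, f)

def write_dictionaries(dictionary, revdictionary, dict_path, revdict_path):
    # Note: json stringifies revdictionary's integer keys on disk
    with open(dict_path, 'w') as f:
        json.dump(dictionary, f)
    with open(revdict_path, 'w') as f:
        json.dump(revdictionary, f)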
2. Topic Modeling (Latent Dirichlet Allocation)
Settings (in config.json)
numTopics: 10 - How much can we stretch the data? With manual exploration or domain knowledge, a modest number of topics beyond the high-level separation tends to give meaningful topics. Requesting many more topics can surface more specific ones, but an increasing share of them will be incoherent and impossible to interpret.
numAlpha (\(\alpha\)): 1.0 - We want a natural representation of topics in documents. \(\alpha\) is a hyper-parameter where a higher value (above 1) adds external bias towards every topic within a document, while a lower value concentrates each document on its few most dominant topics. In the extreme case (a value of 1000 or above, for example), all topics become almost equally represented within each document.
numBeta (\(\beta\)): 0.01 - We want few words to represent a topic; therefore, a value of 0.01 (below 1) is used. Given the vocabulary size, a lower value pushes the low-probability words of a topic further down, so the topic is represented by a handful of prominent words. Pushing this value further down increases the probabilities of the prominent words while further shrinking the probabilities of the background words of the topic.
Further, we set the number of iterations numGIterations: 1000, giving the sampler enough time to settle, starting from a randomly initialized state.
There are some other performance-related parameters, which are left at their default values.
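For reference, a config.json consistent with the keys used throughout this tutorial might look like the following; the path values are illustrative, while the numeric values are the ones discussed above.

{
    "text-doc-path": "data/input_dataset.csv",
    "integer-encoded-doc-path": "data/encoded_documents.json",
    "word_integer-dict_path": "data/word_integer_dict.json",
    "integer-word-dict_path": "data/integer_word_dict.json",
    "output_file_path": "data/output.tsv",
    "numTopics": 10,
    "numAlpha": 1.0,
    "numBeta": 0.01,
    "numGIterations": 1000,
    "wordsPerTopic": 10
}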
The LDA class's main functions are:
1. Markov chain Monte Carlo initialization: gives the model a random initial state, from which it is expected to converge over a sufficient number of iterations.
2. Collapsed Gibbs sampling inference (sketched below); in each iteration it:
   2.1. iterates through all documents and all tokens/words in each document;
   2.2. for each token, computes its most suitable topic given the current state of the model;
   2.3. updates the token's topic if it differs from the current one, along with the associated counts, and thereby the model state.
3. Estimating the document-topic distribution from the final state of the model.
4. Estimating the topic-word distribution (organized in decreasing order of probability) from the final state of the model.
5. Other utility functions.
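The LDA class itself ships in the accompanying lda.py and is not reproduced here. As a reference for step 2, this is a sketch of the standard collapsed Gibbs update for a single token; the count-array names and the helper signature are assumptions, not the class's actual internals.

import numpy as np

def gibbs_update(w, d, z_old, ndk, nkw, nk, alpha, beta, V):
    # ndk[d, k]: topic counts per document; nkw[k, w]: word counts per topic;
    # nk[k]: total tokens per topic; V: vocabulary size
    # Remove the token's current topic assignment from the counts
    ndk[d, z_old] -= 1
    nkw[z_old, w] -= 1
    nk[z_old] -= 1
    # Full conditional: p(z=k | rest) ~ (ndk + alpha) * (nkw + beta) / (nk + V*beta)
    p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
    z_new = np.random.choice(len(p), p=p / p.sum())
    # Add the token back under the newly sampled topic
    ndk[d, z_new] += 1
    nkw[z_new, w] += 1
    nk[z_new] += 1
    return z_new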
Running the model
if __name__ == "__main__":
    lda = LDA(config)
    lda.getData(config["integer-encoded-doc-path"])
    lda.randomMarkovChainInitialization()
    lda.gibbsSampling()
Results: Getting Topics
with open(config["integer-word-dict_path"], 'r') as file:
    revdictionary = json.load(file)
topic_words = lda.getWordsPerTopic(revdictionary)
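Assuming getWordsPerTopic returns, for each topic, its words in decreasing order of probability (as (word, probability) pairs, which is how they are indexed with word[0] below), the discovered topics can be inspected like this:

# Print the top 10 words of each discovered topic
for topic, words in topic_words.items():
    print(topic, [w[0] for w in words[:10]])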
3. Performing sentiment analysis (using SentiStrength)
- We are using SentiStrength in this tutorial to compute sentiment scores, with the scale option, which returns a single score in the range [-4, 4] (the sum of a positive strength in [1, 5] and a negative strength in [-5, -1])
4. Separating neutral, i.e., topic (aspect) presenting, words from subjective words
- Words with a senti-score of 0 are considered neutral or objective; words with a score below 0 are negatively subjective, while words with a score above 0 are positively subjective
- This gives us a split between neutral words, i.e., those presenting the topic or aspect, and subjective (positive and negative together) words
5. Aggregating scores of the subjective words against each topic
- Aggregate the senti-scores of all subjective terms in a topic (using the mean)
from sentistrength import PySentiStr
senti = PySentiStr()
senti.setSentiStrengthPath('util_SentiStrength/jar_datei/SentiStrength.jar') # Note: Provide absolute path instead of relative path
senti.setSentiStrengthLanguageFolderPath('util_SentiStrength/SentiStrengthData/') # Note: Provide absolute path instead of relative path
topic_presenting_words = {}
topic_senti_words = {}
topic_senti_score = {}
for i in range(config['numTopics']):
    topic_presenting_words[i] = []
    topic_senti_words[i] = []
    topic_senti_score[i] = 0
for topic, wordslist in topic_words.items():
    for word in wordslist:
        score = senti.getSentiment(word[0], score='scale')[0]
        if score == 0:
            # Neutral words present the topic (aspect)
            topic_presenting_words[topic].append(word[0])
        else:
            # Subjective words contribute to the topic's sentiment score
            topic_senti_words[topic].append(word[0])
            topic_senti_score[topic] += score
    # Average over this topic's subjective words (not the dictionary length),
    # guarding against topics with no subjective words
    if topic_senti_words[topic]:
        topic_senti_score[topic] /= len(topic_senti_words[topic])
6. Preparing output
- Prepare an understandable topic description by listing the topic's top presenting words, and summarize its top 5 subjective words as remarks
dict_output = {'topic' : [], 'topic_words' : [], 'senti_score' : [], 'top_remarks' : []}
for i in range(config['numTopics']):
    dict_output['topic'].append(i+1)
    dict_output['topic_words'].append(topic_presenting_words[i][:config['wordsPerTopic']])
    dict_output['senti_score'].append(topic_senti_score[i])
    dict_output['top_remarks'].append(' '.join(topic_senti_words[i][:5]))
df_output = pd.DataFrame(dict_output)
df_output.to_csv(config['output_file_path'], sep='\t', index = False)
df_output.head()
|   | topic | topic_words | senti_score | top_remarks |
|---|---|---|---|---|
| 0 | 1 | [Areas, feel, every, really, made, Seixo, plac... | 0.3 | thank truly kind |
| 1 | 2 | [one, time, everything, service, make, You, si... | 0.3 | friendly perfect |
| 2 | 3 | [The, staff, restaurant, beach, definitely, ro... | 1.0 | nice best lovely fantastic enjoy |
| 3 | 4 | [also, The, food, team, kitchen, menu, And, Le... | 0.2 | good |
| 4 | 5 | [hotel, would, always, little, big, night, wel... | 0.4 | like care relaxing |
Commentary on Output
Topic 1 captures the overall guest experience of the hotel, getting a score of 0.3 with top_remarks thank truly kind. Topic 2 is along the lines of service and has got a score of 0.3 with top_remarks friendly perfect, while Topic 3 (staff, restaurant, beach, rooms) scores highest at 1.0.