Guide To Sentiment Analysis Using BERT
Sentiment Analysis (SA) is an application of Text Classification in Natural Language Processing, through which we can analyze a piece of text and determine its sentiment. Let’s break this into two parts, namely Sentiment and Analysis. Sentiment, in layman’s terms, means feelings, or you may say opinions, emotions and so on. So basically, we are trying to find out the subjective impression of the statement or text in focus, not facts.
Analysis is simply the technique of extracting that feeling, or sentiment in our case. First, we need to characterize the sentiment content of a text unit. This is sometimes also referred to as opinion mining, with an emphasis on the extraction part.
Let’s see some examples of the questions a Sentiment Analysis tool can answer:
- Does the customer email express satisfaction or dissatisfaction?
- Is this product review good or bad?
- Based on some tweets, how are people reacting to the product?
Sentiment Analysis has various applications in Business Intelligence, Sociology, Politics, Psychology and so on. All these require us to get the essence of the text.
In this article, we’ll be scraping some app reviews from the Google Play store. Then, after some text pre-processing of the data, we will leverage a pre-trained BERT model from the HuggingFace library. BERT is a transformer-based model, simply a stack of encoders one on top of another; encoders are what we need here, since the task is understanding text. We’ll use three labels, namely Positive, Neutral and Negative.
The first task is to get feedback for the apps. Both negative and positive reviews are useful.
Google Play has plenty of apps, reviews, and scores. We scrape app information and reviews using the google-play-scraper package. Ideally, we would want to collect every possible review and work with that; in the real world, however, data is often limited.
We want apps that have been around for some time, so that opinion has been collected organically and the effect of advertising strategies is mitigated as much as possible. Apps are constantly being updated, so the time of a review is an important factor. We could choose plenty of apps to analyze, but note that different app categories attract different audiences.
This article will have two notebooks:
- The first one is for extracting the data.
- The second one is for pre-processing and model training.
Let’s have a look at the implementation.
Text Extraction
Installing Dependencies
We will be leveraging the google-play-scraper package, alongside JSON utilities for parsing the text.
We also use a basic boilerplate for styling the plots, whose settings are shown in the snippet below. We have to pass the application package names from the Play Store, as these are the arguments the scraper requires.
# Google provides APIs to easily crawl the Play Store from Python
# without any external dependencies
!pip install -qq google-play-scraper

import json
import pandas as pd
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt

# Pygments is a syntax highlighting package suitable for code hosting, forums
from pygments import highlight
# JsonLexer for parsing JSON files
from pygments.lexers import JsonLexer
from pygments.formatters import TerminalFormatter

from google_play_scraper import Sort, reviews, app

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

app_packages = [
    'com.anydo',
    'com.todoist',
    'com.ticktick.task',
    'com.habitrpg.android.habitica',
    'cc.forestapp',
    'com.oristats.habitbull',
    'com.levor.liferpgtasks',
    'com.habitnow',
    'com.microsoft.todos',
    'prox.lab.calclock',
    'com.gmail.jmartindev.timetune',
    'com.artfulagenda.app',
    'com.tasks.android',
    'com.appgenix.bizcal',
    'com.appxy.planner'
]
Usage of GooglePlayScraper
Here is an example of how the scraper crawls the Play Store for a required application. First, we get all the data in dictionary format, out of which we need the comments.
from google_play_scraper import app

result = app(
    'com.todoist',
    lang='en',      # default is 'en'
    country='us'    # default is 'us'
)
result
Scraping App information
In the snippet below, we slice the comments out of the JSON format in which our data is present by default.
len(app_packages)

app_infos = []
for ap in tqdm(app_packages):
    info = app(ap, lang='en', country='us')
    del info['comments']  # Delete the comment information from the JSON obtained.
    app_infos.append(info)

def print_json(json_object):
    json_str = json.dumps(
        json_object,
        indent=2,
        sort_keys=True,
        default=str  # Datetimes and other non-serializable values are converted to strings while printing.
    )
    print(highlight(json_str, JsonLexer(), TerminalFormatter()))  # TerminalFormatter highlights the different objects.

print_json(app_infos[0])  # You can also spot some bool values.

def format_title(title):
    sep_index = title.find(':') if title.find(':') != -1 else title.find('-')  # Index of ':' or '-', else -1.
    if sep_index != -1:
        title = title[:sep_index]  # Strip the unnecessary characters after the ':' or '-'.
    return title[:10]  # Keep at most the first 10 characters of the title.

fig, axs = plt.subplots(2, len(app_infos) // 2, figsize=(14, 5))
for i, ax in enumerate(axs.flat):
    ai = app_infos[i]
    img = plt.imread(ai['icon'])
    ax.imshow(img)
    ax.set_title(format_title(ai['title']))  # format_title fits the title and removes junk characters.
    ax.axis('off')
JSON objects to Pandas DF and Store them
We will convert the JSON-formatted text to a Pandas data frame for easy data manipulation, and store these records in a CSV file to be downloaded later.
app_infos_df = pd.DataFrame(app_infos)  # Convert the list of JSON objects into a dataframe.
app_infos_df.to_csv('apps.csv', index=None, header=True)
app_infos_df.head()

app_reviews = []
for ap in tqdm(app_packages):
    for score in list(range(1, 6)):  # Iterate over ratings 1 to 5 separately.
        for sort_order in [Sort.MOST_RELEVANT, Sort.NEWEST]:  # Sort by most relevant and newest.
            rvs, _ = reviews(
                ap,
                lang='en',
                country='us',
                sort=sort_order,
                count=200 if score == 3 else 100,  # Get 200 3-star reviews and 100 of each other rating.
                filter_score_with=score
            )
            for r in rvs:
                r['sortOrder'] = 'most_relevant' if sort_order == Sort.MOST_RELEVANT else 'newest'  # Store the sort order.
                r['appId'] = ap  # Store the app id.
            app_reviews.extend(rvs)  # Extend the reviews to contain the sort order and app id information.

# Sample review
print_json(app_reviews[0])  # Contains replies from the creators, which are not very useful.

app_reviews_df = pd.DataFrame(app_reviews)
app_reviews_df.to_csv('reviews.csv', index=None, header=True)
len(app_reviews_df)
At the end of this notebook, you’ll have two CSVs: one containing the application reviews and one containing the app information, including the icons. So let’s move on to the next notebook; make sure to upload the reviews.csv file there.
Installing Dependencies
Here we’ll install the HuggingFace transformers library for our BERT model and tokenizer. We also reuse the same boilerplate as before for styling our plots.
!pip install transformers

import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch

import gc
gc.collect()
torch.cuda.empty_cache()

import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap

from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)
HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]
sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))
rcParams['figure.figsize'] = 12, 8

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device
Reading the Data
Once the file has been uploaded, we can check the metadata of our file. Then, by plotting a graph, we check how the data is spread across the scores in our dataset.
df = pd.read_csv("reviews.csv")
df.head()
df.shape
df.info()

sns.countplot(df.score)
plt.xlabel('review score');

def group_sentiment(rating):
    #rating = int(rating)
    if rating <= 2:
        return 0  # Negative sentiment
    elif rating == 3:
        return 1  # Neutral sentiment
    else:
        return 2  # Positive sentiment

df['sentiment'] = df.score.apply(group_sentiment)
class_names = ['negative', 'neutral', 'positive']

ax = sns.countplot(df.sentiment)
plt.xlabel('review sentiment')
ax.set_xticklabels(class_names);
Data Pre-processing
Let’s import the BERT tokenizer and see, on a sample, how the text is read and shaped for the data loader.
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

sample_txt = 'Best place that I have visited? Iceland was the most beautiful and I consider myself lucky to have visited Iceland at such an early age.'
#sample_txt = 'When was I last outside? I am stuck at home for 2 weeks.'

tokens = tokenizer.tokenize(sample_txt)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f' Sentence: {sample_txt}')
print(f'\n Tokens: {tokens}')
print(f'\n Token IDs: {token_ids}')  # Each token has a unique ID through which the model refers to it.

len(tokens)
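As a side note, BERT’s WordPiece tokenizer splits rare words into known subword pieces (marked with ‘##’), so no word is ever fully out-of-vocabulary. A quick illustration (the exact split depends on the 'bert-base-cased' vocabulary, so treat the outputs as indicative):

# WordPiece splits rare words into subword pieces from the vocabulary.
# The exact pieces depend on the pre-trained vocabulary in use.
print(tokenizer.tokenize('tokenization'))  # e.g. ['token', '##ization']
print(tokenizer.tokenize('Iceland'))       # common proper nouns usually stay whole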
Tokenization
Here we will initialise the special tokens required at the start and end of each sequence. We also need to add padding so that every sequence has the same length. By plotting the distribution of token counts, we can check how long the data points in our dataset become once converted to tokens.
tokenizer.sep_token, tokenizer.sep_token_id
tokenizer.cls_token, tokenizer.cls_token_id
tokenizer.pad_token, tokenizer.pad_token_id
tokenizer.unk_token, tokenizer.unk_token_id

encoding_test = tokenizer.encode_plus(
    sample_txt,
    max_length=32,                # sequence length
    add_special_tokens=True,      # Add '[CLS]' and '[SEP]'
    return_token_type_ids=False,
    pad_to_max_length=True,
    return_attention_mask=True,
    return_tensors='pt',          # Return PyTorch tensors ('tf' for TensorFlow and Keras)
)
encoding_test.keys()

print(' length of the first sequence is : ', len(encoding_test['input_ids'][0]))
print('\n The input id\'s are : \n', encoding_test['input_ids'][0])
print('\n The attention mask generated is : ', encoding_test['attention_mask'][0])

tokenizer.convert_ids_to_tokens(encoding_test['input_ids'].flatten())

df.loc[df.content.isnull()]
df = df[df['content'].notna()]
df.head()

token_lens = []
for text in df.content:
    tokens_df = tokenizer.encode(text, max_length=512)  # Max possible sequence length for the BERT model.
    token_lens.append(len(tokens_df))

sns.distplot(token_lens)
plt.xlim([0, 256]);
plt.xlabel('Token count');

MAX_LEN = 160

class GPReviewDataset(Dataset):
    def __init__(self, reviews, targets, tokenizer, max_len):
        self.reviews = reviews      # The content column.
        self.targets = targets      # The sentiment column.
        self.tokenizer = tokenizer  # The BERT tokenizer.
        self.max_len = max_len      # Maximum length of each sequence.

    def __len__(self):
        return len(self.reviews)    # Number of reviews.

    def __getitem__(self, item):
        review = str(self.reviews[item])  # The review string at index 'item'.
        target = self.targets[item]       # The target at index 'item'.
        encoding = self.tokenizer.encode_plus(
            review,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        # A dictionary containing all the features is returned.
        return {
            'review_text': review,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long)
        }

df_train, df_test = train_test_split(df, test_size=0.2, random_state=RANDOM_SEED)
df_val, df_test = train_test_split(df_test, test_size=0.5, random_state=RANDOM_SEED)
df_train.shape, df_val.shape, df_test.shape
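As a quick sanity check (an addition to the original notebook), we can instantiate the dataset on the training split and inspect a single item to confirm the tensor shapes match what the model expects:

# A minimal sanity check: build the dataset on the training split
# and inspect one encoded example.
sample_ds = GPReviewDataset(
    reviews=df_train.content.to_numpy(),
    targets=df_train.sentiment.to_numpy(),
    tokenizer=tokenizer,
    max_len=MAX_LEN
)
item = sample_ds[0]
print(item['review_text'])
print(item['input_ids'].shape)       # torch.Size([160])
print(item['attention_mask'].shape)  # torch.Size([160])
print(item['targets'])               # the sentiment label as a tensor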
Create DataLoaders
We create a data loader so that our dataset is sliced into batches and put into the format needed by our BERT model. This is standard practice for PyTorch modules, and our BERT model requires this batch structure.
def create_data_loader(df, tokenizer, max_len, batch_size):
    ds = GPReviewDataset(
        reviews=df.content.values,
        targets=df.sentiment.values,
        tokenizer=tokenizer,
        max_len=max_len
    )  # The dataset is created and used to build and return a data loader.
    return DataLoader(
        ds,
        batch_size=batch_size,
        #num_workers=4
    )

BATCH_SIZE = 8
train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, tokenizer, MAX_LEN, BATCH_SIZE)

data = next(iter(train_data_loader))
data.keys()
print(data['input_ids'].shape)
print(data['attention_mask'].shape)
print(data['targets'].shape)
Fine Tune the model
Next, we fine-tune the model to our application’s needs. We set the hyperparameters here: the learning rate, the optimizer and the number of epochs.
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)

model_test = bert_model(
    input_ids=encoding_test['input_ids'],
    attention_mask=encoding_test['attention_mask']
)
model_test.keys()

last_hidden_state = model_test['last_hidden_state']
pooled_output = model_test['pooler_output']
last_hidden_state.shape
bert_model.config.hidden_size
pooled_output.shape

class SentimentClassifier(nn.Module):
    def __init__(self, n_classes):
        super(SentimentClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.drop = nn.Dropout(p=0.3)  # Regularization with dropout probability 0.3.
        # Append an output fully connected layer with one unit per class.
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        returned = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        pooled_output = returned["pooler_output"]
        output = self.drop(pooled_output)
        return self.out(output)

model = SentimentClassifier(len(class_names))
model = model.to(device)

input_ids = data['input_ids'].to(device)
attention_mask = data['attention_mask'].to(device)
print(input_ids.shape)       # batch size x seq length
print(attention_mask.shape)  # batch size x seq length

F.softmax(model(input_ids, attention_mask), dim=1)

EPOCHS = 10
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS  # Number of batches * epochs (required by the scheduler).
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,  # As recommended in the BERT paper.
    num_training_steps=total_steps
)
loss_fn = nn.CrossEntropyLoss().to(device)
Helper Function for epoch
Each epoch, i.e. each iteration over the training data made by the model, requires a function whose arguments correspond to the hyperparameters mentioned above.
def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
    model = model.train()  # Make sure that dropout and normalization are enabled during training.
    losses = []
    correct_predictions = 0
    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["targets"].to(device)
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        # Returns two tensors: the maximum logit and the corresponding predicted label.
        max_prob, preds = torch.max(outputs, dim=1)
        loss = loss_fn(outputs, targets)
        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())
        loss.backward()  # Back-propagation.
        # The BERT paper recommends clipping the gradients to avoid exploding gradients.
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    # Return the ratio of correct predictions and the mean loss.
    return correct_predictions.double() / n_examples, np.mean(losses)
Helper Function to evaluate model
To evaluate the model, we also need a helper that computes the loss and accuracy on held-out data. These are the metrics we track, and later plot, across the epochs.
def eval_model(model, data_loader, loss_fn, device, n_examples):
    model = model.eval()  # Make sure that dropout and normalization are disabled during evaluation.
    losses = []
    correct_predictions = 0
    with torch.no_grad():  # Back-propagation is not required, so torch runs faster.
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            max_prob, preds = torch.max(outputs, dim=1)
            loss = loss_fn(outputs, targets)
            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())
    return correct_predictions.double() / n_examples, np.mean(losses)

%%time
history = defaultdict(list)  # Saves the history, similar to the Keras API.
best_accuracy = 0
for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)
    train_acc, train_loss = train_epoch(
        model, train_data_loader, loss_fn, optimizer, device, scheduler, len(df_train)
    )
    print(f'Train loss {train_loss} accuracy {train_acc}')
    val_acc, val_loss = eval_model(
        model, val_data_loader, loss_fn, device, len(df_val)
    )
    print(f'Val loss {val_loss} accuracy {val_acc}')
    print()
    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc)
    history['val_loss'].append(val_loss)
    if val_acc > best_accuracy:
        torch.save(model.state_dict(), 'best_model_state.bin')
        best_accuracy = val_acc

plt.plot(history['train_acc'], label='train accuracy')
plt.plot(history['val_acc'], label='validation accuracy')
plt.title('Training history')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.ylim([0, 1]);
Evaluation
Checking how well the model is performing
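Since the training loop saved the best-performing weights to best_model_state.bin, it is worth restoring that checkpoint before testing (a small addition; without it, the model is evaluated in its final-epoch state):

# Restore the checkpoint with the best validation accuracy
# saved by the training loop above.
model.load_state_dict(torch.load('best_model_state.bin'))
model = model.to(device)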
test_acc, _ = eval_model(
    model, test_data_loader, loss_fn, device, len(df_test)
)
test_acc.item()

def get_predictions(model, data_loader):
    model = model.eval()
    review_texts = []
    predictions = []
    prediction_probs = []
    real_values = []
    with torch.no_grad():
        for d in data_loader:
            texts = d["review_text"]
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            _, preds = torch.max(outputs, dim=1)
            probs = F.softmax(outputs, dim=1)
            review_texts.extend(texts)
            predictions.extend(preds)
            prediction_probs.extend(probs)
            real_values.extend(targets)
    predictions = torch.stack(predictions).cpu()
    prediction_probs = torch.stack(prediction_probs).cpu()
    real_values = torch.stack(real_values).cpu()
    return review_texts, predictions, prediction_probs, real_values

y_review_texts, y_pred, y_pred_probs, y_test = get_predictions(
    model, test_data_loader
)
print(classification_report(y_test, y_pred, target_names=class_names))

def show_confusion_matrix(confusion_matrix):
    hmap = sns.heatmap(confusion_matrix, annot=True, fmt="d", cmap="Blues")
    hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
    hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
    plt.ylabel('True sentiment')
    plt.xlabel('Predicted sentiment');

cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index=class_names, columns=class_names)
show_confusion_matrix(df_cm)

idx = 2
review_text = y_review_texts[idx]
true_sentiment = y_test[idx]
pred_df = pd.DataFrame({
    'class_names': class_names,
    'values': y_pred_probs[idx]
})
print("\n".join(wrap(review_text)))
print()
print(f'True sentiment: {class_names[true_sentiment]}')

sns.barplot(x='values', y='class_names', data=pred_df, orient='h')
plt.ylabel('sentiment')
plt.xlabel('probability')
plt.xlim([0, 1]);
Predicting on Raw Text
review_text = "I love completing my todos! Best app ever!!!" encoded_review = tokenizer.encode_plus( review_text, max_length=MAX_LEN, add_special_tokens=True, return_token_type_ids=False, pad_to_max_length=True, return_attention_mask=True, return_tensors='pt', ) input_ids = encoded_review['input_ids'].to(device) attention_mask = encoded_review['attention_mask'].to(device) output = model(input_ids, attention_mask) _, prediction = torch.max(output, dim=1) print(f'Review text: {review_text}') print(f'Sentiment : {class_names[prediction]}')
EndNotes
As we can see, the model has correctly predicted the sentiment of the review. Sentiment Analysis has a wide range of applications, which means a whole array of different datasets exists for this task. The same approach can also be extended to a point-based system, where we rank the sentiment from best to worst, as sketched below.
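As an illustration of that idea (a hypothetical sketch, not part of the notebooks above), one could weight each class index by its predicted probability and rescale the result to a 1-to-5 range:

# A hypothetical point-based score: weight each class index (0, 1, 2)
# by its predicted probability, then rescale to a 1-5 star-like range.
def sentiment_score(probs):
    expected = sum(i * p for i, p in enumerate(probs))  # value in [0, 2]
    return 1 + expected * 2                             # value in [1, 5]

print(sentiment_score([0.1, 0.2, 0.7]))  # -> 4.2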
References
- Official Github Repository
- Research Paper
- Official Source Code
- Colab Implementation Text Crawling
- Colab Implementation Text Preprocessing and Model