Guide To Sentiment Analysis Using BERT
Sentiment Analysis (SA) is an application of Text Classification in Natural Language Processing, through which we can analyze a piece of text and determine its sentiment. Let’s break this into two parts, namely Sentiment and Analysis. Sentiment, in layman’s terms, means feelings, or you may say opinions, emotions and so on. So basically, we are trying to find out the subjective impression of the statement or text in focus, not facts.
Analysis is simply the technique of extracting that feeling, or sentiment in our case. First, we need to characterize the sentiment content of a text unit. This is sometimes also referred to as opinion mining, with an emphasis on the extraction part.
Let’s see some examples of the questions a Sentiment Analysis tool can answer:
- Does the customer email express satisfaction or dissatisfaction?
- Is this product review good or bad?
- Based on some tweets, how are people reacting to the product?
Sentiment Analysis has various applications in Business Intelligence, Sociology, Politics, Psychology and so on. All these require us to get the essence of the text.
In this article, we’ll be scraping some app reviews from the Google Play store. Then, after some text pre-processing of the data, we will leverage a pre-trained BERT model from the HuggingFace library. BERT is a transformer-based model, simply a stack of encoders one on top of another; encoders are what we need here, since the task is understanding text. We’ll use three labels, namely Positive, Neutral and Negative.
The first task is to get feedback for the apps. Both negative and positive reviews are useful.
Google Play has plenty of apps, reviews, and scores. We scrape app information and reviews using the google-play-scraper package. Ideally, we would want to collect every possible review and work with that; in the real world, however, data is often limited.
We want apps that have been around for some time, so that opinion has been collected organically and the effect of advertising strategies is mitigated as much as possible. Apps are constantly being updated, so the time of a review is an important factor. We could choose plenty of apps to analyze, but note that different app categories attract different audiences.
This article will have two notebooks:
- The first one is for extracting the data.
- The second one is for pre-processing and model training.
Let’s have a look at the implementation.
Text Extraction
Installing Dependencies
We will be leveraging the google-play-scraper package, alongside JSON utilities for parsing the text.
We also use a basic boilerplate for styling the plots, whose settings are shown in the snippet below. We have to pass the application package names from the Play Store, as these are the arguments the scraper requires.
# Google provides APIs to easily crawl the Play Store from Python
# without any external dependencies
!pip install -qq google-play-scraper

import json
import pandas as pd
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt

# Pygments is a syntax highlighting package suitable for code hosting, forums
from pygments import highlight
# JsonLexer for parsing JSON files
from pygments.lexers import JsonLexer
from pygments.formatters import TerminalFormatter

from google_play_scraper import Sort, reviews, app

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

app_packages = [
    'com.anydo',
    'com.todoist',
    'com.ticktick.task',
    'com.habitrpg.android.habitica',
    'cc.forestapp',
    'com.oristats.habitbull',
    'com.levor.liferpgtasks',
    'com.habitnow',
    'com.microsoft.todos',
    'prox.lab.calclock',
    'com.gmail.jmartindev.timetune',
    'com.artfulagenda.app',
    'com.tasks.android',
    'com.appgenix.bizcal',
    'com.appxy.planner'
]
Usage of GooglePlayScraper
Here is an example of how the scraper crawls the Play Store for a required application. First, we get all the data in dictionary format, out of which we need the comments.
from google_play_scraper import app

result = app(
    'com.todoist',
    lang='en',      # default is 'en'
    country='us'    # default is 'us'
)
result
Scraping App information
In the snippet below, we slice the comments out of the JSON format in which our data is present by default.
len(app_packages)

app_infos = []
for ap in tqdm(app_packages):
    info = app(ap, lang='en', country='us')
    del info['comments']  # Delete the comment information from the JSON obtained.
    app_infos.append(info)

def print_json(json_object):
    json_str = json.dumps(
        json_object,
        indent=2,
        sort_keys=True,
        default=str  # Datetimes and other non-serializable values are converted to strings while printing.
    )
    print(highlight(json_str, JsonLexer(), TerminalFormatter()))  # TerminalFormatter highlights the different objects.

print_json(app_infos[0])  # You can also spot some bool values.

def format_title(title):
    sep_index = title.find(':') if title.find(':') != -1 else title.find('-')  # Index of ':' or '-', else -1.
    if sep_index != -1:
        title = title[:sep_index]  # Strip the unnecessary characters after the ':' or '-'.
    return title[:10]  # Keep at most the first 10 characters of the title.

fig, axs = plt.subplots(2, len(app_infos) // 2, figsize=(14, 5))
for i, ax in enumerate(axs.flat):
    ai = app_infos[i]
    img = plt.imread(ai['icon'])
    ax.imshow(img)
    ax.set_title(format_title(ai['title']))  # format_title fits the title and removes junk characters.
    ax.axis('off')
JSON objects to Pandas DF and Store them
We will convert the JSON-formatted text to a Pandas data frame for easy data manipulation, and store these records in a CSV file to be downloaded later.
app_infos_df = pd.DataFrame(app_infos)  # Convert the list of JSON objects into a dataframe.
app_infos_df.to_csv('apps.csv', index=None, header=True)
app_infos_df.head()

app_reviews = []
for ap in tqdm(app_packages):
    for score in list(range(1, 6)):  # Iterate over ratings 1 to 5 separately.
        for sort_order in [Sort.MOST_RELEVANT, Sort.NEWEST]:  # Sort by most relevant and newest.
            rvs, _ = reviews(
                ap,
                lang='en',
                country='us',
                sort=sort_order,
                count=200 if score == 3 else 100,  # Get 200 3-star reviews and 100 of each other rating.
                filter_score_with=score
            )
            for r in rvs:
                r['sortOrder'] = 'most_relevant' if sort_order == Sort.MOST_RELEVANT else 'newest'  # Store the sort order.
                r['appId'] = ap  # Store the app id.
            app_reviews.extend(rvs)  # Extend the reviews to contain the sort order and app id information.

# Sample review
print_json(app_reviews[0])  # Contains replies from the creators, which are not very useful.

app_reviews_df = pd.DataFrame(app_reviews)
app_reviews_df.to_csv('reviews.csv', index=None, header=True)
len(app_reviews_df)
At the end of this notebook, you’ll have two CSVs: one containing the application reviews and one containing the app information, including the icons. So let’s move on to the next notebook; make sure to upload the reviews.csv file there.
Installing Dependencies
Here we’ll install the HuggingFace transformers library for our BERT model and tokenizer. We also reuse the same boilerplate as before for styling our plots.
!pip install transformers

import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch

import gc
gc.collect()
torch.cuda.empty_cache()

import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap

from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)
HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]
sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))
rcParams['figure.figsize'] = 12, 8

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device
Reading the Data
Once the file has been uploaded, we can check the metadata of our file. Then, by plotting a graph, we check how the data is spread across the scores in our dataset.
df = pd.read_csv("reviews.csv")
df.head()
df.shape
df.info()

sns.countplot(df.score)
plt.xlabel('review score');

def group_sentiment(rating):
    #rating = int(rating)
    if rating <= 2:
        return 0  # Negative sentiment
    elif rating == 3:
        return 1  # Neutral sentiment
    else:
        return 2  # Positive sentiment

df['sentiment'] = df.score.apply(group_sentiment)
class_names = ['negative', 'neutral', 'positive']

ax = sns.countplot(df.sentiment)
plt.xlabel('review sentiment')
ax.set_xticklabels(class_names);
Data Pre-processing
Let’s import the BERT tokenizer and see, on a sample, how the text is read and shaped for the data loader.
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

sample_txt = 'Best place that I have visited? Iceland was the most beautiful and I consider myself lucky to have visited Iceland at such an early age.'
#sample_txt = 'When was I last outside? I am stuck at home for 2 weeks.'

tokens = tokenizer.tokenize(sample_txt)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f' Sentence: {sample_txt}')
print(f'\n Tokens: {tokens}')
print(f'\n Token IDs: {token_ids}')  # Each token has a unique ID through which the model refers to it.

len(tokens)
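As a side note, BERT’s WordPiece tokenizer splits rare words into known subword pieces (marked with ‘##’), so no word is ever fully out-of-vocabulary. A quick illustration (the exact split depends on the 'bert-base-cased' vocabulary, so treat the outputs as indicative):

# WordPiece splits rare words into subword pieces from the vocabulary.
# The exact pieces depend on the pre-trained vocabulary in use.
print(tokenizer.tokenize('tokenization'))  # e.g. ['token', '##ization']
print(tokenizer.tokenize('Iceland'))       # common proper nouns usually stay whole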
Tokenization
Here we will initialise the special tokens required at the start and end of each sequence. We also need to add padding so that every sequence has the same length. By plotting the distribution of token counts, we can check how long the data points in our dataset become once converted to tokens.
tokenizer.sep_token, tokenizer.sep_token_id
tokenizer.cls_token, tokenizer.cls_token_id
tokenizer.pad_token, tokenizer.pad_token_id
tokenizer.unk_token, tokenizer.unk_token_id

encoding_test = tokenizer.encode_plus(
    sample_txt,
    max_length=32,                # sequence length
    add_special_tokens=True,      # Add '[CLS]' and '[SEP]'
    return_token_type_ids=False,
    pad_to_max_length=True,
    return_attention_mask=True,
    return_tensors='pt',          # Return PyTorch tensors ('tf' for TensorFlow and Keras)
)
encoding_test.keys()

print(' length of the first sequence is : ', len(encoding_test['input_ids'][0]))
print('\n The input id\'s are : \n', encoding_test['input_ids'][0])
print('\n The attention mask generated is : ', encoding_test['attention_mask'][0])

tokenizer.convert_ids_to_tokens(encoding_test['input_ids'].flatten())

df.loc[df.content.isnull()]
df = df[df['content'].notna()]
df.head()

token_lens = []
for text in df.content:
    tokens_df = tokenizer.encode(text, max_length=512)  # Max possible sequence length for the BERT model.
    token_lens.append(len(tokens_df))

sns.distplot(token_lens)
plt.xlim([0, 256]);
plt.xlabel('Token count');

MAX_LEN = 160

class GPReviewDataset(Dataset):
    def __init__(self, reviews, targets, tokenizer, max_len):
        self.reviews = reviews      # The content column.
        self.targets = targets      # The sentiment column.
        self.tokenizer = tokenizer  # The BERT tokenizer.
        self.max_len = max_len      # Maximum length of each sequence.

    def __len__(self):
        return len(self.reviews)    # Number of reviews.

    def __getitem__(self, item):
        review = str(self.reviews[item])  # The review string at index 'item'.
        target = self.targets[item]       # The target at index 'item'.
        encoding = self.tokenizer.encode_plus(
            review,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        # A dictionary containing all the features is returned.
        return {
            'review_text': review,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long)
        }

df_train, df_test = train_test_split(df, test_size=0.2, random_state=RANDOM_SEED)
df_val, df_test = train_test_split(df_test, test_size=0.5, random_state=RANDOM_SEED)
df_train.shape, df_val.shape, df_test.shape
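As a quick sanity check (an addition to the original notebook), we can instantiate the dataset on the training split and inspect a single item to confirm the tensor shapes match what the model expects:

# A minimal sanity check: build the dataset on the training split
# and inspect one encoded example.
sample_ds = GPReviewDataset(
    reviews=df_train.content.to_numpy(),
    targets=df_train.sentiment.to_numpy(),
    tokenizer=tokenizer,
    max_len=MAX_LEN
)
item = sample_ds[0]
print(item['review_text'])
print(item['input_ids'].shape)       # torch.Size([160])
print(item['attention_mask'].shape)  # torch.Size([160])
print(item['targets'])               # the sentiment label as a tensor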
Create DataLoaders
We create a data loader so that our dataset is sliced into batches and put into the format needed by our BERT model. This is standard practice for PyTorch modules, and our BERT model requires this batch structure.
def create_data_loader(df, tokenizer, max_len, batch_size):
    ds = GPReviewDataset(
        reviews=df.content.values,
        targets=df.sentiment.values,
        tokenizer=tokenizer,
        max_len=max_len
    )  # The dataset is created and used to build and return a data loader.
    return DataLoader(
        ds,
        batch_size=batch_size,
        #num_workers=4
    )

BATCH_SIZE = 8
train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, tokenizer, MAX_LEN, BATCH_SIZE)

data = next(iter(train_data_loader))
data.keys()
print(data['input_ids'].shape)
print(data['attention_mask'].shape)
print(data['targets'].shape)
Fine Tune the model
Next, we fine-tune the model to our application’s needs. We set the hyperparameters here: the learning rate, the optimizer and the number of epochs.
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)

model_test = bert_model(
    input_ids=encoding_test['input_ids'],
    attention_mask=encoding_test['attention_mask']
)
model_test.keys()

last_hidden_state = model_test['last_hidden_state']
pooled_output = model_test['pooler_output']
last_hidden_state.shape
bert_model.config.hidden_size
pooled_output.shape

class SentimentClassifier(nn.Module):
    def __init__(self, n_classes):
        super(SentimentClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.drop = nn.Dropout(p=0.3)  # Regularization with dropout probability 0.3.
        # Append an output fully connected layer with one unit per class.
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        returned = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        pooled_output = returned["pooler_output"]
        output = self.drop(pooled_output)
        return self.out(output)

model = SentimentClassifier(len(class_names))
model = model.to(device)

input_ids = data['input_ids'].to(device)
attention_mask = data['attention_mask'].to(device)
print(input_ids.shape)       # batch size x seq length
print(attention_mask.shape)  # batch size x seq length

F.softmax(model(input_ids, attention_mask), dim=1)

EPOCHS = 10
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS  # Number of batches * epochs (required by the scheduler).
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,  # As recommended in the BERT paper.
    num_training_steps=total_steps
)
loss_fn = nn.CrossEntropyLoss().to(device)
Helper Function for epoch
Each epoch, i.e. each iteration over the training data made by the model, requires a function whose arguments correspond to the hyperparameters mentioned above.
def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
    model = model.train()  # Make sure that dropout and normalization are enabled during training.
    losses = []
    correct_predictions = 0
    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["targets"].to(device)
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        # Returns two tensors: the maximum logit and the corresponding predicted label.
        max_prob, preds = torch.max(outputs, dim=1)
        loss = loss_fn(outputs, targets)
        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())
        loss.backward()  # Back-propagation.
        # The BERT paper recommends clipping the gradients to avoid exploding gradients.
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    # Return the ratio of correct predictions and the mean loss.
    return correct_predictions.double() / n_examples, np.mean(losses)
Helper Function to evaluate model
To evaluate the model, we also need a helper that computes the loss and accuracy on held-out data. These are the metrics we track, and later plot, across the epochs.
def eval_model(model, data_loader, loss_fn, device, n_examples):
    model = model.eval()  # Make sure that dropout and normalization are disabled during evaluation.
    losses = []
    correct_predictions = 0
    with torch.no_grad():  # Back-propagation is not required, so torch runs faster.
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            max_prob, preds = torch.max(outputs, dim=1)
            loss = loss_fn(outputs, targets)
            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())
    return correct_predictions.double() / n_examples, np.mean(losses)

%%time
history = defaultdict(list)  # Saves the history, similar to the Keras API.
best_accuracy = 0
for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)
    train_acc, train_loss = train_epoch(
        model, train_data_loader, loss_fn, optimizer, device, scheduler, len(df_train)
    )
    print(f'Train loss {train_loss} accuracy {train_acc}')
    val_acc, val_loss = eval_model(
        model, val_data_loader, loss_fn, device, len(df_val)
    )
    print(f'Val loss {val_loss} accuracy {val_acc}')
    print()
    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc)
    history['val_loss'].append(val_loss)
    if val_acc > best_accuracy:
        torch.save(model.state_dict(), 'best_model_state.bin')
        best_accuracy = val_acc

plt.plot(history['train_acc'], label='train accuracy')
plt.plot(history['val_acc'], label='validation accuracy')
plt.title('Training history')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.ylim([0, 1]);
Evaluation
Checking how well the model is performing
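Since the training loop saved the best-performing weights to best_model_state.bin, it is worth restoring that checkpoint before testing (a small addition; without it, the model is evaluated in its final-epoch state):

# Restore the checkpoint with the best validation accuracy
# saved by the training loop above.
model.load_state_dict(torch.load('best_model_state.bin'))
model = model.to(device)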
test_acc, _ = eval_model(
    model, test_data_loader, loss_fn, device, len(df_test)
)
test_acc.item()

def get_predictions(model, data_loader):
    model = model.eval()
    review_texts = []
    predictions = []
    prediction_probs = []
    real_values = []
    with torch.no_grad():
        for d in data_loader:
            texts = d["review_text"]
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            _, preds = torch.max(outputs, dim=1)
            probs = F.softmax(outputs, dim=1)
            review_texts.extend(texts)
            predictions.extend(preds)
            prediction_probs.extend(probs)
            real_values.extend(targets)
    predictions = torch.stack(predictions).cpu()
    prediction_probs = torch.stack(prediction_probs).cpu()
    real_values = torch.stack(real_values).cpu()
    return review_texts, predictions, prediction_probs, real_values

y_review_texts, y_pred, y_pred_probs, y_test = get_predictions(
    model, test_data_loader
)
print(classification_report(y_test, y_pred, target_names=class_names))

def show_confusion_matrix(confusion_matrix):
    hmap = sns.heatmap(confusion_matrix, annot=True, fmt="d", cmap="Blues")
    hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
    hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
    plt.ylabel('True sentiment')
    plt.xlabel('Predicted sentiment');

cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index=class_names, columns=class_names)
show_confusion_matrix(df_cm)

idx = 2
review_text = y_review_texts[idx]
true_sentiment = y_test[idx]
pred_df = pd.DataFrame({
    'class_names': class_names,
    'values': y_pred_probs[idx]
})
print("\n".join(wrap(review_text)))
print()
print(f'True sentiment: {class_names[true_sentiment]}')

sns.barplot(x='values', y='class_names', data=pred_df, orient='h')
plt.ylabel('sentiment')
plt.xlabel('probability')
plt.xlim([0, 1]);
Predicting on Raw Text
review_text = "I love completing my todos! Best app ever!!!" encoded_review = tokenizer.encode_plus( review_text, max_length=MAX_LEN, add_special_tokens=True, return_token_type_ids=False, pad_to_max_length=True, return_attention_mask=True, return_tensors='pt', ) input_ids = encoded_review['input_ids'].to(device) attention_mask = encoded_review['attention_mask'].to(device) output = model(input_ids, attention_mask) _, prediction = torch.max(output, dim=1) print(f'Review text: {review_text}') print(f'Sentiment : {class_names[prediction]}')
EndNotes
As we can see, the model has correctly predicted the sentiment of the review. Sentiment Analysis has a wide range of applications, which means a whole array of different datasets exists for this task. The same approach can also be extended to a point-based system, where we rank the sentiment from best to worst, as sketched below.
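As an illustration of that idea (a hypothetical sketch, not part of the notebooks above), one could weight each class index by its predicted probability and rescale the result to a 1-to-5 range:

# A hypothetical point-based score: weight each class index (0, 1, 2)
# by its predicted probability, then rescale to a 1-5 star-like range.
def sentiment_score(probs):
    expected = sum(i * p for i, p in enumerate(probs))  # value in [0, 2]
    return 1 + expected * 2                             # value in [1, 5]

print(sentiment_score([0.1, 0.2, 0.7]))  # -> 4.2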
References
- Official Github Repository
- Research Paper
- Official Source Code
- Colab Implementation Text Crawling
- Colab Implementation Text Preprocessing and Model