Vocabulary Builder

A vocabulary helps in pre-processing corpus text: it acts both as a reference for classification and as a storage location for the processed corpus text. Once a text has been processed, any relevant metadata can be collected and stored alongside it.

In this article, we will discuss a Python implementation of a vocabulary builder that stores processed text data for use in future NLP tasks.

About the dataset

The dataset is downloaded from the following link. It contains tweets from YouTube users, which we will use for sentiment analysis.

Code Implementation

First, import all the libraries required for this project and load the dataset.

import pandas as pd
import numpy as np

# Load the dataset and inspect its size and first few rows
data = pd.read_csv("youtube.csv")
data.shape
data.head()

Normalize the text

The next step is to normalize our text data by converting it to lower case.

data["comment_text"] = data["comment_text"].str.lower()

Making a dictionary for expanding English contractions

For data pre-processing, expand contractions using a contraction dictionary and an expansion function. For example, a word like ‘can’t’ is expanded to ‘cannot’.

contractions_dict = { 
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

Contraction function for expanding English contractions

import re

# Build a single regex that matches any contraction in the dictionary
contractions_re = re.compile('(%s)' % '|'.join(re.escape(k) for k in contractions_dict.keys()))

def expand_contractions(s, contractions_dict=contractions_dict):
    # Replace each matched contraction with its expansion
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)

data["comment_text"] = [expand_contractions(i) for i in data["comment_text"]]
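As a quick sanity check, the function can be tried on a standalone sentence. The sketch below is self-contained and uses a reduced contraction dictionary (an illustrative subset of the full dictionary above):

```python
import re

# Illustrative subset of the full contractions_dict above
contractions_dict = {"can't": "cannot", "don't": "do not", "I'm": "I am"}
contractions_re = re.compile('(%s)' % '|'.join(re.escape(k) for k in contractions_dict))

def expand_contractions(s, contractions_dict=contractions_dict):
    # Replace each matched contraction with its expansion
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)

print(expand_contractions("I'm sure you can't see it"))  # I am sure you cannot see it
```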

Remove patterns using regex (keep a-zA-Z0-9)

We want to keep the letters, digits, commas, and spaces in the text data and remove all other characters, such as punctuation. This can be achieved with a regular expression.

import re
data["comment_text"].replace({r'[^a-zA-Z0-9, ]': ''}, inplace=True, regex=True)
data["comment_text"]

# Import NLTK and download the tokenizer models used by word_tokenize
import nltk
nltk.download('punkt')
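The same pattern can be checked on a single string with the standard `re` module; everything except letters, digits, commas, and spaces is stripped (the sample string below is made up for illustration):

```python
import re

# Hypothetical sample comment for illustration
sample = "Great video!!! #1 fan, @channel :-)"

# Remove every character that is not a letter, digit, comma, or space
cleaned = re.sub(r'[^a-zA-Z0-9, ]', '', sample)
print(cleaned)  # 'Great video 1 fan, channel '
```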

Tokenize words

The main implementation starts by tokenizing the words present in the text data.

from nltk.tokenize import word_tokenize
tokenized_sents = [word_tokenize(i) for i in data["comment_text"]]
print(tokenized_sents)

Add words to the list

Let’s create a list and add all the tokenized words to the list.

flattened = []
for sublist in tokenized_sents:
    for val in sublist:
        flattened.append(val)
print(flattened)
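The nested loop above is equivalent to flattening with `itertools.chain.from_iterable`, which avoids the repeated `append` calls; a minimal sketch with a toy token list standing in for `tokenized_sents`:

```python
from itertools import chain

# Toy stand-in for tokenized_sents (illustrative)
tokenized_sents = [["great", "video"], ["so", "great"]]

# Flatten the list of token lists into a single list of tokens
flattened = list(chain.from_iterable(tokenized_sents))
print(flattened)  # ['great', 'video', 'so', 'great']
```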

Add words which are not already in the list

The resulting list contains many duplicate words, so a second list is used to store only the words that are not already present. This list of unique words is our vocabulary.

Vocab=[]
for item in flattened:
    if not item in Vocab:
        Vocab.append(item)
print(Vocab)
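Checking `item in Vocab` on a list is linear per lookup, so the loop above is quadratic in the number of tokens. A common alternative is `dict.fromkeys`, which removes duplicates in linear time while preserving first-seen order; a sketch with a toy token list standing in for `flattened`:

```python
# Toy stand-in for the flattened token list (illustrative)
flattened = ["great", "video", "so", "great", "video"]

# dict keys are unique and preserve insertion order (Python 3.7+)
Vocab = list(dict.fromkeys(flattened))
print(Vocab)  # ['great', 'video', 'so']
```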

Conclusion

In this article, we have learned how to implement a vocabulary builder that can be used for NLP tasks. Further, we can extend this work by removing the words which are not present in the English dictionary. Hopefully, this article will be useful to you. The complete code of the above implementation is available at the AIM’s GitHub repository. Please visit this link to find the code and this link to find the dataset.
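The extension mentioned above, dropping tokens that are not English dictionary words, can be sketched by filtering the vocabulary against a word list. Here a tiny hard-coded set stands in for a real dictionary (for example, NLTK’s `words` corpus); both the set and the sample vocabulary are assumptions for illustration:

```python
# Tiny stand-in for a real English word list (illustrative assumption)
english_words = {"great", "video", "channel"}

# Hypothetical vocabulary containing some non-dictionary tokens
Vocab = ["great", "video", "xqzt", "channel", "lol123"]

# Keep only tokens found in the word list
filtered_vocab = [w for w in Vocab if w.lower() in english_words]
print(filtered_vocab)  # ['great', 'video', 'channel']
```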

The post How To Create A Vocabulary Builder For NLP Tasks? appeared first on Analytics India Magazine.