How To Create A Vocabulary Builder For NLP Tasks?
A vocabulary helps in pre-processing corpus text: it acts as both a lookup structure and a storage location for the processed corpus text. Once a text has been processed, any relevant metadata can be collected and stored alongside it.
In this article, we will discuss the implementation of a vocabulary builder in Python for storing processed text data that can be reused later for NLP tasks.
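Before walking through the full pipeline, here is a minimal sketch of the end product: a list of the unique words seen in a corpus. The example sentences below are made up purely for illustration and are not from the dataset.

# Minimal sketch of a vocabulary builder (made-up example sentences)
example_texts = ["i love this video", "this video is great"]

vocab = []
for sentence in example_texts:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

print(vocab)  # ['i', 'love', 'this', 'video', 'is', 'great']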
About the dataset
The dataset can be downloaded from the following link. It contains comments posted by YouTube users, which are used here for sentiment analysis.
Code Implementation
First, import the libraries required for this project and load the dataset to inspect its shape and first few rows.
import pandas as pd
import numpy as np

# Load the dataset and take a quick look at its size and first few rows
data = pd.read_csv("youtube.csv")
data.shape
data.head()
Normalize the text
The next step is to normalize the text data by converting it to lower case and assigning the result back to the column.
data["comment_text"].str.lower()
Making a dictionary for expanding English contractions
As part of pre-processing, contractions in the text are expanded using a contraction dictionary and an expansion function. For example, a word like ‘can’t’ is expanded to ‘cannot’.
contractions_dict = { "ain't": "am not / are not / is not / has not / have not", "aren't": "are not / am not", "can't": "cannot", "can't've": "cannot have", "'cause": "because", "could've": "could have", "couldn't": "could not", "couldn't've": "could not have", "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hadn't've": "had not have", "hasn't": "has not", "haven't": "have not", "he'd": "he had / he would", "he'd've": "he would have", "he'll": "he shall / he will", "he'll've": "he shall have / he will have", "he's": "he has / he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how has / how is / how does", "I'd": "I had / I would", "I'd've": "I would have", "I'll": "I shall / I will", "I'll've": "I shall have / I will have", "I'm": "I am", "I've": "I have", "isn't": "is not", "it'd": "it had / it would", "it'd've": "it would have", "it'll": "it shall / it will", "it'll've": "it shall have / it will have", "it's": "it has / it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have", "mightn't": "might not", "mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have", "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she had / she would", "she'd've": "she would have", "she'll": "she shall / she will", "she'll've": "she shall have / she will have", "she's": "she has / she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have", "so's": "so as / so is", "that'd": "that would / that had", "that'd've": "that would have", "that's": "that has / that is", "there'd": "there had / there would", "there'd've": "there would have", "there's": "there has / there is", "they'd": "they had / they would", "they'd've": "they would have", "they'll": "they shall / they will", "they'll've": "they shall have / they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we had / we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what shall / what will", "what'll've": "what shall have / what will have", "what're": "what are", "what's": "what has / what is", "what've": "what have", "when's": "when has / when is", "when've": "when have", "where'd": "where did", "where's": "where has / where is", "where've": "where have", "who'll": "who shall / who will", "who'll've": "who shall have / who will have", "who's": "who has / who is", "who've": "who have", "why's": "why has / why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would", "y'all'd've": "you all would have", "y'all're": "you all are", "y'all've": "you all have", "you'd": "you had / you would", "you'd've": "you would have", "you'll": "you shall / you will", "you'll've": "you shall have / you will have", "you're": "you are", "you've": "you have" }
Contraction function for expanding the text
import regex as re

# Build a single regex that matches any contraction key
contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))

def expand_contractions(s, contractions_dict=contractions_dict):
    # Replace each matched contraction with its expanded form
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)

data["comment_text"] = [expand_contractions(i) for i in data["comment_text"]]
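To sanity-check the function, we can run it on a made-up comment; the sample string below is only for illustration and is not from the dataset.

# Quick check on an illustrative sample string
sample = "I can't believe it, they won't stop"
print(expand_contractions(sample))
# -> "I cannot believe it, they will not stop"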
Remove unwanted characters using regex (keep a-zA-Z0-9)
We keep only letters, digits, commas and spaces in the text data and remove every other character, such as punctuation and symbols. This can be achieved with a regex replacement.
import re

# Keep letters, digits, commas and spaces; drop everything else
data["comment_text"] = data["comment_text"].replace({r'[^a-zA-Z0-9, ]': ''}, regex=True)
data["comment_text"]

# Import the libraries needed for tokenization
import nltk
nltk.download('punkt')
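The effect of the pattern is easier to see on a single string; the sample below is made up for illustration.

# Illustration of the pattern on a made-up string
sample = "Great video!!! 10/10, would watch again :)"
print(re.sub(r'[^a-zA-Z0-9, ]', '', sample))
# -> "Great video 1010, would watch again "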
Tokenize words
The main part of the implementation starts by tokenizing the words present in the text data.
from nltk.tokenize import word_tokenize

tokenized_sents = [word_tokenize(i) for i in data["comment_text"]]
print(tokenized_sents)
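On a single made-up comment, word_tokenize behaves as follows; note that it also splits off punctuation such as commas as separate tokens.

# word_tokenize on an illustrative sentence (not from the dataset)
print(word_tokenize("this video is great, thanks for sharing"))
# -> ['this', 'video', 'is', 'great', ',', 'thanks', 'for', 'sharing']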
Add words to the list
Let’s create a list and add all the tokenized words to it, flattening the per-comment token lists into a single list of words.
flattened = []
for sublist in tokenized_sents:
    for val in sublist:
        flattened.append(val)
print(flattened)
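The same flattening can also be written more compactly; the snippet below is an equivalent alternative, not part of the original code.

# Equivalent, more compact ways to flatten the token lists
from itertools import chain

flattened_alt = [word for sent in tokenized_sents for word in sent]
flattened_chain = list(chain.from_iterable(tokenized_sents))
# Both produce the same list of tokens as the explicit loops above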
Add words which are not in the list
The flattened result contains a lot of duplicate words, so another list is used to store only the words that are not already in it. This list of unique words is our vocabulary.
Vocab = []
for item in flattened:
    if item not in Vocab:
        Vocab.append(item)
print(Vocab)
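For large corpora, checking membership in a growing list becomes slow. A faster, order-preserving alternative (a sketch, not part of the original code) uses dict.fromkeys.

# Faster alternative: dict.fromkeys de-duplicates while keeping first-seen order
vocab_fast = list(dict.fromkeys(flattened))
# vocab_fast holds the same unique tokens, in the same order, as Vocab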
Conclusion
In this article, we have learned how to implement a vocabulary builder that can be used for NLP tasks. Hopefully, this article will be useful to you. The complete code of the above implementation is available at AIM's GitHub repository; please visit this link to find the code and this link to find the dataset. Further, we can extend this work by removing the words that are not present in an English dictionary, as sketched below.
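As a rough sketch of that extension, assuming the NLTK 'words' corpus as the reference dictionary (one possible choice among many):

# Sketch: keep only vocabulary entries found in the NLTK 'words' corpus
import nltk
nltk.download('words')
from nltk.corpus import words

english_words = set(w.lower() for w in words.words())
vocab_english = [w for w in Vocab if w.lower() in english_words]
print(vocab_english)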