Most Benchmarked Datasets for Question Answering in NLP with implementation in PyTorch, Keras, and TensorFlow
Question answering is a subfield of natural language processing concerned with building systems that automatically answer questions posed by humans in natural language. The ability to read a passage of text and then answer questions about it is a challenging task for machines, one that requires knowledge about the world. Existing question answering datasets have two main weaknesses: those with high-quality, human-authored questions are too small to train modern data-hungry models, while the large ones do not share the same characteristics as genuine reading comprehension questions.
To address the need for large, high-quality question answering datasets, this article walks through some of the most popular ones and shows how to load them using TensorFlow, PyTorch, and Keras. We also note the benchmark models that currently achieve the best results on each dataset.
SQuAD
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage (the v2.0 release used below additionally contains over 50,000 unanswerable questions). The dataset was introduced by Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang of Stanford University.
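Each article in the release is a nested JSON object: paragraphs carry the context text, and each question stores its answers as character-offset spans into that context. A simplified sketch of the v2.0 layout (field names follow the official files; the text here is invented for illustration):

article = {
    'title': 'Example_Article',
    'paragraphs': [{
        'context': 'The example corpus was released in 2016.',
        'qas': [{
            'id': '56be4db0acb8001400a502ec',           # invented ID
            'question': 'When was the example corpus released?',
            'answers': [{'text': '2016', 'answer_start': 35}],
            'is_impossible': False,                     # v2.0-only flag
        }],
    }],
}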
Loading the dataset using PyTorch
import json
import os

from torchnlp.download import download_file_maybe_extract


def squad_dataset(directory='data/',
                  train=False,
                  dev=False,
                  train_filename='train-v2.0.json',
                  dev_filename='dev-v2.0.json',
                  check_files_train=['train-v2.0.json'],
                  check_files_dev=['dev-v2.0.json'],
                  url_train='https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json',
                  url_dev='https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json'):
    # Download the dev and train splits if they are not already cached.
    download_file_maybe_extract(url=url_dev, directory=directory, check_files=check_files_dev)
    download_file_maybe_extract(url=url_train, directory=directory, check_files=check_files_train)
    ret = []
    splits = [(train, train_filename), (dev, dev_filename)]
    splits = [f for (requested, f) in splits if requested]
    for filename in splits:
        full_path = os.path.join(directory, filename)
        with open(full_path, 'r') as temp:
            ret.append(json.load(temp)['data'])
    if len(ret) == 1:
        return ret[0]
    else:
        return tuple(ret)
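A quick usage sketch of the loader above; the nested field access follows the SQuAD v2.0 schema shown earlier:

train = squad_dataset(train=True)
first_qa = train[0]['paragraphs'][0]['qas'][0]
print(first_qa['question'])
print(first_qa['answers'])  # empty list for unanswerable v2.0 questions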
Loading the dataset using TensorFlow
import tensorflow as tf


def squad(url):
    # tf.data cannot stream over HTTP, so download and cache the file first.
    path = tf.keras.utils.get_file(url.split('/')[-1], origin=url)
    data = tf.data.TextLineDataset(path)

    def content_filter(source):
        # Drop heading-style lines of the form ' = Heading = '.
        return tf.logical_not(tf.strings.regex_full_match(
            source, '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))

    data = data.filter(content_filter)
    data = data.map(lambda x: tf.strings.split(x, ' . '))
    data = data.unbatch()
    return data


train = squad('https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json')
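Because the SQuAD download is one large JSON document rather than one example per line, a more direct route is to parse the JSON and build the tf.data pipeline from the extracted question-context pairs. A minimal sketch, assuming the v2.0 schema described earlier:

import json

import tensorflow as tf

path = tf.keras.utils.get_file(
    'train-v2.0.json',
    origin='https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json')
with open(path) as f:
    articles = json.load(f)['data']

questions, contexts = [], []
for article in articles:
    for paragraph in article['paragraphs']:
        for qa in paragraph['qas']:
            questions.append(qa['question'])
            contexts.append(paragraph['context'])

# Batches of (question, context) string pairs, ready for tokenization.
dataset = tf.data.Dataset.from_tensor_slices((questions, contexts)).batch(32)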
State of the Art
The current state of the art on the SQuAD 2.0 leaderboard is SA-Net on ALBERT, which achieved an F1 score of 93.011.
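SA-Net on ALBERT itself has not been released publicly, but a fine-tuned ALBERT checkpoint gives a feel for this family of models. A minimal sketch using the Hugging Face transformers pipeline; the checkpoint name is an assumption, and any ALBERT model fine-tuned on SQuAD 2.0 can be substituted:

from transformers import pipeline

# 'twmkn9/albert-base-v2-squad2' is an assumed hub checkpoint: an ALBERT
# base model fine-tuned on SQuAD 2.0, not the actual SA-Net system.
qa = pipeline('question-answering', model='twmkn9/albert-base-v2-squad2')
result = qa(question='What does the answer to each SQuAD question consist of?',
            context='SQuAD is a reading comprehension dataset in which the answer '
                    'to every question is a segment of text from the passage.')
print(result['answer'], result['score'])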
bAbI
bAbI is a dataset for question answering and text understanding introduced by Facebook AI Research. It is composed of a set of contexts (short stories), with multiple question-answer pairs available for each context, and it contains both English and Hindi content. The "ContentElements" field holds the training and testing data; the first two entries give access to data formatted for the standard training tasks, retrieved from the English 10k (en-10k) variant.
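The raw task files are simple to inspect: sentence IDs restart at 1 for every new story, and question lines carry a tab-separated answer plus the IDs of the supporting facts. A small sketch (story content invented for illustration):

# Illustrative excerpt of the raw bAbI task format.
sample = ("1 Mary moved to the bathroom.\n"
          "2 John went to the hallway.\n"
          "3 Where is Mary?\tbathroom\t1")

for line in sample.split('\n'):
    tid, text = line.split(' ', 1)
    if '\t' in text:  # a question line
        query, answer, supporting = text.split('\t')
        print('Q:', query, '-> A:', answer, '(supporting fact:', supporting + ')')
    else:             # a plain story sentence
        print('story sentence', tid + ':', text)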
Loading the dataset using PyTorch
import os
from io import open

import torch
from torchtext.data import Dataset, Field, Example, Iterator


class BABI20Field(Field):
    def __init__(self, memory_size, **kwargs):
        super(BABI20Field, self).__init__(**kwargs)
        self.memory_size = memory_size
        self.unk_token = None
        self.batch_first = True

    def preprocess(self, x):
        if isinstance(x, list):
            return [super(BABI20Field, self).preprocess(s) for s in x]
        else:
            return super(BABI20Field, self).preprocess(x)

    def pad(self, minibatch):
        if isinstance(minibatch[0][0], list):
            self.fix_length = max(max(len(x) for x in ex) for ex in minibatch)
            padded = []
            for ex in minibatch:
                # Sentences are indexed in reverse order and truncated to memory_size.
                nex = ex[::-1][:self.memory_size]
                padded.append(
                    super(BABI20Field, self).pad(nex)
                    + [[self.pad_token] * self.fix_length] * (self.memory_size - len(nex)))
            self.fix_length = None
            return padded
        else:
            return super(BABI20Field, self).pad(minibatch)

    def numericalize(self, arr, device=None):
        if isinstance(arr[0][0], list):
            tmp = [
                super(BABI20Field, self).numericalize(x, device=device).data
                for x in arr
            ]
            arr = torch.stack(tmp)
            if self.sequential:
                arr = arr.contiguous()
            return arr
        else:
            return super(BABI20Field, self).numericalize(arr, device=device)


class BABI20(Dataset):
    urls = ['http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz']
    name = ''
    dirname = ''

    def __init__(self, path, text_field, only_supporting=False, **kwargs):
        fields = [('story', text_field), ('query', text_field), ('answer', text_field)]
        self.sort_key = lambda x: len(x.query)
        with open(path, 'r', encoding="utf-8") as f:
            triplets = self._parse(f, only_supporting)
            examples = [Example.fromlist(triplet, fields) for triplet in triplets]
        super(BABI20, self).__init__(examples, fields, **kwargs)

    @staticmethod
    def _parse(file, only_supporting):
        data, story = [], []
        for line in file:
            tid, text = line.rstrip('\n').split(' ', 1)
            # A sentence ID of 1 marks the start of a new story.
            if tid == '1':
                story = []
            if text.endswith('.'):  # a plain story sentence
                story.append(text[:-1])
            else:  # a question line: question \t answer \t supporting fact IDs
                query, answer, supporting = (x.strip() for x in text.split('\t'))
                if only_supporting:
                    substory = [story[int(i) - 1] for i in supporting.split()]
                else:
                    substory = [x for x in story if x]
                data.append((substory, query[:-1], answer))  # strip the trailing '?'
                story.append("")
        return data

    @classmethod
    def iters(cls, batch_size=32, root='.data', memory_size=50, task=1, joint=False,
              tenK=False, only_supporting=False, sort=False, shuffle=False,
              device=None, **kwargs):
        text = BABI20Field(memory_size)
        # BABI20.splits downloads the archive and builds the train/val/test
        # datasets; it is part of the full torchtext implementation this
        # snippet is adapted from and is not shown here.
        train, val, test = BABI20.splits(text, root=root, task=task, joint=joint,
                                         tenK=tenK, only_supporting=only_supporting,
                                         **kwargs)
        text.build_vocab(train)
        return Iterator.splits((train, val, test), batch_size=batch_size, sort=sort,
                               shuffle=shuffle, device=device)
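Assuming the full torchtext implementation (including the BABI20.splits method that downloads the archive) is available, iterators for a single task can then be built in one call:

# Train/validation/test iterators for task 1 on the 10k English split.
train_iter, val_iter, test_iter = BABI20.iters(batch_size=32, task=1, tenK=True)
batch = next(iter(train_iter))
print(batch.story.shape, batch.query.shape, batch.answer.shape)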
Loading the dataset using Keras
import re
import tarfile
from functools import reduce

import numpy as np
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences

try:
    path_new = get_file('babi-tasks-v1-2.tar.gz',
                        origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
except Exception:
    print('Error downloading dataset, please download it manually:\n'
          '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
          '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
    raise
readfile = tarfile.open(path_new)
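From here, individual task files can be read straight out of the archive. A short sketch; the member path is an assumption based on the archive's usual layout (tasks_1-20_v1-2/<variant>/<task>_{train,test}.txt):

challenge = 'tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_{}.txt'
train_file = readfile.extractfile(challenge.format('train'))
test_file = readfile.extractfile(challenge.format('test'))
print(train_file.readline().decode('utf-8'))  # first line of the training split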
State of the Art
The current state of the art on the bAbI dataset is STM, which achieved an accuracy of 99.85%.
Natural Questions
Natural Questions contains 307,373 questions for training, 7,830 for development, and 7,842 for testing, alongside human-annotated answers drawn from Wikipedia pages, for use in training question answering systems. The dataset is the first to replicate the end-to-end process by which people find answers to questions: each question is a real, anonymized query issued to the Google search engine, paired with a Wikipedia page from the top search results. It was developed by researchers at Google.
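Each line of the release is one JSON record pairing a question with a full Wikipedia page and its annotations. A simplified sketch of the record layout (field names follow the official release; the values here are invented for illustration):

example = {
    'example_id': 1234567890,                # invented ID
    'question_text': 'who founded the example project',
    'document_url': 'https://en.wikipedia.org/wiki/...',
    'document_tokens': ['...'],              # the tokenized Wikipedia page
    'long_answer_candidates': ['...'],       # candidate passages as token spans
    'annotations': [{
        'long_answer': {'start_token': 100, 'end_token': 180, 'candidate_index': 3},
        'short_answers': [{'start_token': 120, 'end_token': 123}],
        'yes_no_answer': 'NONE',
    }],
}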
Loading the dataset using TensorFlow
import glob

import jsonlines
import tensorflow as tf
import tensorflow_hub as hub
from bert import tokenization  # from the bert-tensorflow package

# BERT checkpoint used for downstream fine-tuning on Natural Questions.
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

tf.enable_eager_execution()  # TensorFlow 1.x; eager is the default in 2.x

# Path pattern for the training shards; adjust to the local download location.
_train_file_path = '/Users/deniz/natural_questions/data/nq-train-*.jsonl'
train_files = glob.glob(_train_file_path)

examples = []
for _train_file in train_files:
    print(_train_file)
    with jsonlines.open(_train_file) as reader:
        for i, example in enumerate(reader):
            # Drop the raw HTML to keep the in-memory examples small.
            del example['document_html']
            examples.append(example)
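A quick sanity check on the loaded records, using the field names from the simplified schema above:

print(len(examples), 'training examples loaded')
first = examples[0]
print(first['question_text'])
print(len(first['long_answer_candidates']), 'long-answer candidates')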
State of the Art
The current state of the art on the open-domain Natural Questions benchmark is GPT-3 175B, which achieved an accuracy of 29.9% in the few-shot setting.
Conclusion
In this article, we covered some of the high-quality datasets used for question answering and loaded each corpus using different Python libraries. These datasets feature a diverse range of question and answer types. From the results above, we can see that the STM model performed exceptionally well on the bAbI dataset, with accuracy above 99%.