Guide to Robustness Gym: Unifying the NLP Evaluation Landscape
Once an AI/ML model is built, researchers spend considerable time deciding on the criteria against which it should be evaluated, and evaluation methods tend to be problem-specific. Recently, Stanford University, together with Salesforce Research and UNC-Chapel Hill, proposed a system for evaluating NLP pipelines, called Robustness Gym. The framework was first described in the research paper "Robustness Gym: Unifying the NLP Evaluation Landscape", submitted to arXiv on January 13, 2021, by Karan Goel, Nazneen Rajani, Jesse Vig, Samson Tan, Jason Wu, Stephan Zheng, Caiming Xiong, Mohit Bansal, and Christopher Ré.
Robustness Gym is a simple Python toolkit for systematically evaluating NLP systems. It works across multiple evaluation idioms and deals with data errors, distribution shift, bias, and more. The framework focuses on the following problems:
- The paradox of choice: Given a particular problem and its specification, deciding which type of evaluation to run (bias, generalization, etc.).
- Idiomatic lock-in: Once an evaluation is chosen, idiomatic lock-in refers to the choice of tool used to execute it. Existing toolkits are built around four distinct evaluation idioms: subpopulations, transformations, adversarial attacks, and evaluation sets.
- Workflow fragmentation: Keeping track of progress across tools, saving all the generated data, and producing reports.
Robustness Gym addresses these challenges with a Contemplate -> Create -> Consolidate evaluation loop, where:
- Contemplate helps choose which evaluation to run (the paradox of choice) by giving guidance on the key decision variables.
- Create builds slices of data (collections of examples) using the evaluation idioms.
- Consolidate arranges all the slices from Create into a TestBench and generates reports.
Conventionally, the evaluation procedure involves three steps:
- Load the data.
- Generate predictions using the built model.
- Compute the metrics.
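The conventional three-step loop can be sketched in a few lines of plain Python; the tiny dataset and the stand-in "model" below are made up purely for illustration and are not part of Robustness Gym:

```python
# Minimal sketch of the conventional three-step evaluation loop.
# The dataset and model are illustrative stand-ins only.

# 1. Load the data: (question, label) pairs
data = [
    ("is the sky blue", True),
    ("is fire cold", False),
    ("do birds fly", True),
]

# 2. Generate predictions with the built model (here: a trivial rule)
def model(question):
    # Stand-in "model": predicts True unless the question mentions "cold"
    return "cold" not in question

predictions = [model(q) for q, _ in data]

# 3. Compute the metric (accuracy over the whole dataset)
accuracy = sum(p == y for p, (_, y) in zip(predictions, data)) / len(data)
print(accuracy)  # 1.0 on this toy data
```

The key limitation this article addresses is visible here: the metric is computed once over the whole dataset, with no notion of slices, transformations, or attacks.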
In Robustness Gym, this same procedure is divided into six steps, which make up the workflow described in the sections below.
- Requirements
Python >= 3.8, < 4.0
- Installation
Install the Robustness Gym toolkit via pip (installation might take some time). The latest release at the time of writing is 0.0.3. First create a conda environment, activate it, install Robustness Gym and its dependencies, and then add the environment to your Jupyter notebook. Run these commands one by one in your terminal.
conda create --name robustnessgym python=3.8 -y
source activate robustnessgym
pip install robustnessgym==0.0.3
python -m spacy download en_core_web_sm
conda install -c anaconda ipykernel
python -m ipykernel install --user --name=robustnessgym
- Robustness Gym Workflow
As discussed above, in contrast to the three traditional evaluation steps, Robustness Gym follows six steps, which we discuss in detail below.
3.1 Load the data
Robustness Gym supports Hugging Face datasets, which are very easy to use. Here is an example that loads the BoolQ question-answering dataset.
import robustnessgym as rg
# Load the first 10 training examples of the BoolQ dataset
dataset = rg.Dataset.load_dataset('boolq', split='train[:10]')
3.2 Compute and cache side information
In this step, we pre-process the data and compute side information for each example, which can later be retrieved for analysis. A CachedOperation is quite similar to calling .map() on your dataset, except that it lets you retrieve any information you cached earlier. An example is shown below.
# Get a dataset
from robustnessgym import Dataset
dataset = Dataset.load_dataset('boolq')["train"]
print(len(dataset))
dataset[0]
# Run the Spacy pipeline
from robustnessgym import Spacy
spacy = Spacy()
# .. on the 'question' column of the dataset
dataset = spacy(batch_or_dataset=dataset,
                columns=['question'])
# Run the Stanza pipeline
from robustnessgym import Stanza
stanza = Stanza()
# .. on both the question and passage columns of a batch
dataset = stanza(batch_or_dataset=dataset[:32],
                 columns=['question', 'passage'])
# .. use any of the other built-in operations in Robustness Gym!
# Or, create your own CachedOperation
from robustnessgym import CachedOperation, Identifier
from robustnessgym.core.decorators import singlecolumn
# Write a silly function that operates on a single column of a batch
# @singlecolumn
def silly_fn(batch, columns):
    """
    Capitalize text in the specified column of the batch.
    """
    column_name = columns[0]
    return [text.capitalize() for text in batch[column_name]]
# Wrap the silly function in a CachedOperation
silly_op = CachedOperation(apply_fn=silly_fn,
                           identifier=Identifier(_name='SillyOp'))
# Apply it to a dataset
dataset = silly_op(batch_or_dataset=dataset,
                   columns=['question'])
Retrieve the cached information:
from robustnessgym import Spacy, Stanza, CachedOperation, Dataset
# Take a batch of data
batch = dataset
# Retrieve the (cached) results of the Spacy CachedOperation
spacy_information = Spacy.retrieve(batch, columns=['question'])
# Retrieve the tokens returned by the Spacy CachedOperation
tokens = Spacy.retrieve(batch, columns=['question'], proc_fns=Spacy.tokens)
# Retrieve the entities found by the Stanza CachedOperation
entities = Stanza.retrieve(batch, columns=['passage'], proc_fns=Stanza.entities)
# Retrieve the capitalized output of the silly_op
capitalizations = CachedOperation.retrieve(batch,
                                           columns=['question'],
                                           identifier=silly_op.identifier)
# Retrieve it directly using the silly_op
capitalizations = silly_op.retrieve(batch, columns=['question'])
# Retrieve the capitalized output and lower-case it during retrieval
capitalizations = silly_op.retrieve(
    batch,
    columns=['question'],
    proc_fns=lambda decoded_batch: [x.lower() for x in decoded_batch]
)
3.3 Build slices
Using the cached information, slices of data are constructed. A slice is simply a collection of examples for evaluation, with a method for retrieving the cached information. Robustness Gym uses the SliceBuilder class for this. Currently, Robustness Gym supports four types of slices.
- Evaluation Sets: slice constructed from a pre-existing dataset
from robustnessgym import Dataset, Slice
# Evaluation Sets: direct construction of a slice
boolq_slice = Slice(Dataset.load_dataset('boolq')["train"])
- Subpopulations: slice constructed by filtering a larger dataset
from robustnessgym import Spacy, ScoreSubpopulation, Identifier, Dataset
from robustnessgym.core.decorators import prerequisites
dataset = Dataset.load_dataset('boolq', split='validation')
# `datasets` has made some updates, temporary workaround to set the dataset identifier that we'll fix in v0.0.4
dataset._identifier = dataset.identifier.without('version')(version=dataset.info.version.version_str)
spacy = Spacy()
dataset = spacy(dataset, ['question'])
def length(batch, columns):
    """
    Length using cached Spacy tokenization.
    """
    column_name = columns[0]
    # Take advantage of previously cached Spacy information
    tokens = Spacy.retrieve(batch, columns, proc_fns=Spacy.tokens)
    return [len(tokens_) for tokens_ in tokens]
# Create a subpopulation that buckets examples based on length
# `prerequisites` is a temporary workaround to specify that `length` requires Spacy to be cached
# this will not be required in v0.0.4
length_subpopulation = prerequisites(Spacy)(ScoreSubpopulation)(
    identifiers=[Identifier('0-10'), Identifier('10-20')],
    intervals=[(0, 10), (10, 20)],
    score_fn=length,
)
# v0.0.3 no longer modifies `dataset`
sls, mat = length_subpopulation(dataset, columns=['question'])
- Transformations: slice constructed by transforming a dataset.
- Attacks: slice constructed by attacking a dataset adversarially
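To make the transformation idiom concrete, here is a sketch in plain Python (not the library's Transformation SliceBuilder API) of building a slice by applying a deterministic perturbation to every example; the contraction table and batch are made up for illustration:

```python
# Illustrative transformation: expand common contractions in each question.
# This mimics the transformation idiom only; Robustness Gym's own
# SliceBuilders wrap such functions with caching and identifiers.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "isn't": "is not"}

def expand_contractions(text):
    # Replace each token that appears in the contraction table
    return " ".join(CONTRACTIONS.get(tok, tok) for tok in text.split())

batch = {"question": ["don't birds fly", "isn't fire hot"]}

# The transformed slice keeps the same schema, with perturbed text
transformed_slice = {
    "question": [expand_contractions(q) for q in batch["question"]]
}
print(transformed_slice["question"])
# ['do not birds fly', 'is not fire hot']
```

An attack slice follows the same pattern, except that the perturbation is chosen adversarially (e.g., by an attack library) rather than by a fixed rule.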
3.4 Evaluate slices
In this step, we simply compute standard metrics on each slice to perform the evaluation.
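For instance, computing the same metric (accuracy here) independently on each slice might look like this sketch; the slice names, labels, and predictions are hypothetical:

```python
# Sketch: evaluate the same metric on each slice separately.
# Slices, labels, and predictions below are made up for illustration.
slices = {
    "short questions": ([True, False, True], [True, False, False]),
    "long questions":  ([False, False],      [False, True]),
}

results = {}
for name, (labels, preds) in slices.items():
    # Per-slice accuracy: fraction of predictions matching the labels
    results[name] = sum(l == p for l, p in zip(labels, preds)) / len(labels)
    print(f"{name}: accuracy = {results[name]:.2f}")
# short questions: accuracy = 0.67
# long questions: accuracy = 0.50
```

Comparing per-slice scores like these is what surfaces weaknesses (e.g., on long inputs) that a single aggregate metric would hide.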
3.5. Report and share findings
The dashboard feature will soon be publicly available.
3.6. Iterate
Conclusion
In this article, we have discussed the NLP evaluation toolkit Robustness Gym and examined all the steps of this framework in comparison to traditional methods. The library is under active development, and more pipelines and functionalities will be integrated soon. The currently available version is 0.0.3, so some things may not yet work in the current framework.
Official code, docs & tutorial are available at:
The post Guide to Robustness Gym: Unifying the NLP Evaluation Landscape appeared first on Analytics India Magazine.