Guide to Robustness Gym: Unifying the NLP Evaluation Landscape
Once an AI/ML model is built, researchers spend considerable time deciding on the criteria against which it should be evaluated, and evaluation methods tend to be problem-specific. Recently, Stanford University, together with Salesforce Research and UNC-Chapel Hill, proposed a system for evaluating NLP pipelines, called Robustness Gym. The framework was first described in the research paper "Robustness Gym: Unifying the NLP Evaluation Landscape", submitted to arXiv on January 13, 2021, by Karan Goel, Nazneen Rajani, Jesse Vig, Samson Tan, Jason Wu, Stephan Zheng, Caiming Xiong, Mohit Bansal, and Christopher Ré.
Robustness Gym is a simple Python toolkit for systematically evaluating NLP systems. It works across multiple evaluation idioms and deals with data errors, distribution shift, bias, and more. The framework focuses on the following problems:
- The paradox of choice: Given a particular problem and its specification, deciding which type of evaluation to run (bias, generalization, etc.).
- Idiomatic lock-in: Once an evaluation is chosen, idiomatic lock-in refers to the choice of tool used to execute it. Existing toolkits are built around four distinct evaluation idioms: subpopulations, transformations, adversarial attacks, and evaluation sets.
- Workflow fragmentation: Keeping track of progress across tools, saving all the generated data, and producing reports.
Robustness Gym addresses these challenges with a Contemplate -> Create -> Consolidate evaluation loop, where:
- Contemplate helps choose which evaluation to run (the paradox of choice) by giving guidance on the key decision variables.
- Create builds slices of data (collections of examples) using the evaluation idioms.
- Consolidate arranges all the slices from Create into a TestBench and generates reports.
Conventionally, the evaluation procedure involves three steps:
- Load the data.
- Generate predictions using the built model.
- Compute the metrics.
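The conventional three-step loop can be sketched in a few lines of plain Python; the tiny dataset and the stand-in "model" below are made up purely for illustration and are not part of Robustness Gym:

```python
# Minimal sketch of the conventional three-step evaluation loop.
# The dataset and model are illustrative stand-ins only.

# 1. Load the data: (question, label) pairs
data = [
    ("is the sky blue", True),
    ("is fire cold", False),
    ("do birds fly", True),
]

# 2. Generate predictions with the built model (here: a trivial rule)
def model(question):
    # Stand-in "model": predicts True unless the question mentions "cold"
    return "cold" not in question

predictions = [model(q) for q, _ in data]

# 3. Compute the metric (accuracy over the whole dataset)
accuracy = sum(p == y for p, (_, y) in zip(predictions, data)) / len(data)
print(accuracy)  # 1.0 on this toy data
```

The key limitation this article addresses is visible here: the metric is computed once over the whole dataset, with no notion of slices, transformations, or attacks.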
In Robustness Gym, this same procedure is divided into six steps, which make up the workflow described in the sections below.
- Requirements
Python >= 3.8, < 4.0
- Installation
Install the Robustness Gym toolkit via pip (installation might take some time). The latest release at the time of writing is 0.0.3. First create a conda environment, activate it, install Robustness Gym and its dependencies, and then add the environment to your Jupyter notebook. Run these commands one by one in your terminal.
conda create --name robustnessgym python=3.8 -y
source activate robustnessgym
pip install robustnessgym==0.0.3
python -m spacy download en_core_web_sm
conda install -c anaconda ipykernel
python -m ipykernel install --user --name=robustnessgym
- Robustness Gym Workflow
As discussed above, in contrast to the three traditional evaluation steps, Robustness Gym follows six steps, which we discuss in detail below.
3.1 Load the data
Robustness Gym supports Hugging Face datasets, which are very easy to use. Here is an example that loads the BoolQ question-answering dataset.
import robustnessgym as rg
# Load the first 10 training examples of the BoolQ dataset
dataset = rg.Dataset.load_dataset('boolq', split='train[:10]')
3.2 Compute and cache side information
In this step, we pre-process the data and compute side information for each example, which can later be retrieved for analysis. A CachedOperation is quite similar to calling .map() on your dataset, except that it lets you retrieve any information you cached earlier. An example is shown below.
# Get a dataset
from robustnessgym import Dataset
dataset = Dataset.load_dataset('boolq')["train"]
print(len(dataset))
dataset[0]
# Run the Spacy pipeline
from robustnessgym import Spacy
spacy = Spacy()
# .. on the 'question' column of the dataset
dataset = spacy(batch_or_dataset=dataset,
                columns=['question'])
# Run the Stanza pipeline
from robustnessgym import Stanza
stanza = Stanza()
# .. on both the question and passage columns of a batch
dataset = stanza(batch_or_dataset=dataset[:32],
                 columns=['question', 'passage'])
# .. use any of the other built-in operations in Robustness Gym!
# Or, create your own CachedOperation
from robustnessgym import CachedOperation, Identifier
from robustnessgym.core.decorators import singlecolumn
# Write a silly function that operates on a single column of a batch
# @singlecolumn
def silly_fn(batch, columns):
    """
    Capitalize text in the specified column of the batch.
    """
    column_name = columns[0]
    return [text.capitalize() for text in batch[column_name]]
# Wrap the silly function in a CachedOperation
silly_op = CachedOperation(apply_fn=silly_fn,
                           identifier=Identifier(_name='SillyOp'))
# Apply it to a dataset
dataset = silly_op(batch_or_dataset=dataset,
                   columns=['question'])
Retrieve the cached information:
from robustnessgym import Spacy, Stanza, CachedOperation, Dataset
# Take a batch of data
batch = dataset
# Retrieve the (cached) results of the Spacy CachedOperation
spacy_information = Spacy.retrieve(batch, columns=['question'])
# Retrieve the tokens returned by the Spacy CachedOperation
tokens = Spacy.retrieve(batch, columns=['question'], proc_fns=Spacy.tokens)
# Retrieve the entities found by the Stanza CachedOperation
entities = Stanza.retrieve(batch, columns=['passage'], proc_fns=Stanza.entities)
# Retrieve the capitalized output of the silly_op
capitalizations = CachedOperation.retrieve(batch,
                                           columns=['question'],
                                           identifier=silly_op.identifier)
# Retrieve it directly using the silly_op
capitalizations = silly_op.retrieve(batch, columns=['question'])
# Retrieve the capitalized output and lower-case it during retrieval
capitalizations = silly_op.retrieve(
    batch,
    columns=['question'],
    proc_fns=lambda decoded_batch: [x.lower() for x in decoded_batch]
)
3.3 Build slices
Using the cached information, slices of data are constructed. A slice is simply a collection of examples for evaluation, with a method for retrieving the cached information. Robustness Gym uses the SliceBuilder class for this. Currently, Robustness Gym supports four types of slices.
- Evaluation Sets: slice constructed from a pre-existing dataset
from robustnessgym import Dataset, Slice
# Evaluation Sets: direct construction of a slice
boolq_slice = Slice(Dataset.load_dataset('boolq')["train"])
- Subpopulations: slice constructed by filtering a larger dataset
from robustnessgym import Spacy, ScoreSubpopulation, Identifier, Dataset
from robustnessgym.core.decorators import prerequisites
dataset = Dataset.load_dataset('boolq', split='validation')
# `datasets` has made some updates, temporary workaround to set the dataset identifier that we'll fix in v0.0.4
dataset._identifier = dataset.identifier.without('version')(version=dataset.info.version.version_str)
spacy = Spacy()
dataset = spacy(dataset, ['question'])
def length(batch, columns):
    """
    Length using cached Spacy tokenization.
    """
    column_name = columns[0]
    # Take advantage of previously cached Spacy information
    tokens = Spacy.retrieve(batch, columns, proc_fns=Spacy.tokens)
    return [len(tokens_) for tokens_ in tokens]
# Create a subpopulation that buckets examples based on length
# `prerequisites` is a temporary workaround to specify that `length` requires Spacy to be cached
# this will not be required in v0.0.4
length_subpopulation = prerequisites(Spacy)(ScoreSubpopulation)(
    identifiers=[Identifier('0-10'), Identifier('10-20')],
    intervals=[(0, 10), (10, 20)],
    score_fn=length,
)
# v0.0.3 no longer modifies `dataset`
sls, mat = length_subpopulation(dataset, columns=['question'])
- Transformations: slice constructed by transforming a dataset.
- Attacks: slice constructed by attacking a dataset adversarially
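To make the transformation idiom concrete, here is a sketch in plain Python (not the library's Transformation SliceBuilder API) of building a slice by applying a deterministic perturbation to every example; the contraction table and batch are made up for illustration:

```python
# Illustrative transformation: expand common contractions in each question.
# This mimics the transformation idiom only; Robustness Gym's own
# SliceBuilders wrap such functions with caching and identifiers.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "isn't": "is not"}

def expand_contractions(text):
    # Replace each token that appears in the contraction table
    return " ".join(CONTRACTIONS.get(tok, tok) for tok in text.split())

batch = {"question": ["don't birds fly", "isn't fire hot"]}

# The transformed slice keeps the same schema, with perturbed text
transformed_slice = {
    "question": [expand_contractions(q) for q in batch["question"]]
}
print(transformed_slice["question"])
# ['do not birds fly', 'is not fire hot']
```

An attack slice follows the same pattern, except that the perturbation is chosen adversarially (e.g., by an attack library) rather than by a fixed rule.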
3.4 Evaluate slices
In this step, we simply compute standard metrics on each slice to perform the evaluation.
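For instance, computing the same metric (accuracy here) independently on each slice might look like this sketch; the slice names, labels, and predictions are hypothetical:

```python
# Sketch: evaluate the same metric on each slice separately.
# Slices, labels, and predictions below are made up for illustration.
slices = {
    "short questions": ([True, False, True], [True, False, False]),
    "long questions":  ([False, False],      [False, True]),
}

results = {}
for name, (labels, preds) in slices.items():
    # Per-slice accuracy: fraction of predictions matching the labels
    results[name] = sum(l == p for l, p in zip(labels, preds)) / len(labels)
    print(f"{name}: accuracy = {results[name]:.2f}")
# short questions: accuracy = 0.67
# long questions: accuracy = 0.50
```

Comparing per-slice scores like these is what surfaces weaknesses (e.g., on long inputs) that a single aggregate metric would hide.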
3.5. Report and share findings
The dashboard feature will soon be publicly available.
3.6. Iterate
Conclusion
In this article, we have discussed the NLP evaluation toolkit Robustness Gym and examined all the steps of this framework in comparison to traditional methods. The library is under active development, and more pipelines and functionalities will be integrated soon. The currently available version is 0.0.3, so some things may not yet work in the current framework.
Official code, docs & tutorial are available at:
The post Guide to Robustness Gym: Unifying the NLP Evaluation Landscape appeared first on Analytics India Magazine.