Language models, and innovations in improving them, are among the most exciting and talked-about research areas right now. However, although we have seen several large language models from tech giants over the last year (DeepMind’s 280-billion-parameter transformer language model Gopher, Google’s Generalist Language Model, LG AI Research’s language model Exaone), they often cannot be deployed because they can harm users in ways that are difficult to predict in advance. To take a step towards solving this issue, innovation mammoth DeepMind has come out with a way to automatically find inputs that elicit harmful text from language models, by generating those inputs with language models themselves.

The researchers generated test cases with a language model (a form of red teaming) and then used a classifier to detect harmful behaviour on those test cases. As per DeepMind, the team evaluated the target language model’s replies to the generated test questions using a classifier trained to detect offensive content. What came out of this was a vast number of offensive replies from a 280B-parameter language model chatbot.
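In essence, the pipeline is: generate candidate test questions with one language model, get the target chatbot’s reply to each, and keep the cases a classifier flags as offensive. The sketch below illustrates that loop; the three functions standing in for the red LM, the chatbot, and the offensive-content classifier are hypothetical placeholders, not DeepMind’s actual code.

```python
# Minimal sketch of the red-teaming loop described above. The model calls are
# illustrative placeholders for a test-case generator LM, the target chatbot,
# and an offensive-text classifier.

def red_lm_generate(num_cases: int) -> list[str]:
    """Placeholder: sample candidate test questions from a 'red' language model."""
    return [f"Generated test question #{i}" for i in range(num_cases)]

def chatbot_reply(question: str) -> str:
    """Placeholder: query the target chatbot (e.g. a dialogue-prompted LM)."""
    return f"Chatbot reply to: {question}"

def offensiveness_score(reply: str) -> float:
    """Placeholder: classifier returning the probability that a reply is offensive."""
    return 0.0

def red_team(num_cases: int = 1000, threshold: float = 0.5) -> list[tuple[str, str]]:
    """Return (question, reply) pairs that the classifier flags as offensive."""
    failures = []
    for question in red_lm_generate(num_cases):
        reply = chatbot_reply(question)
        if offensiveness_score(reply) > threshold:
            failures.append((question, reply))
    return failures

if __name__ == "__main__":
    print(f"Found {len(red_team(100))} failing test cases")
```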

What is this model exactly?

As per the paper, titled “Red Teaming Language Models with Language Models“, though LLMs such as GPT-3 and Gopher can generate high-quality text, there are several hurdles to their deployment. It adds, “Generative language models come with a risk of generating very harmful text, and even a small risk of harm is unacceptable in real-world applications.”

The team added that they use the approach to red team the 280B-parameter Dialogue-Prompted Gopher chatbot for offensive generated content. They explore several methods, such as zero-shot generation, few-shot generation, supervised learning, and reinforcement learning, to generate test questions with large language models.
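To give a sense of how two of these methods differ, the sketch below contrasts zero-shot generation (sampling questions from a generic instruction-style prompt) with few-shot generation (re-inserting questions that already triggered offensive replies as in-context examples). The sampling function and exact prompt wording are assumptions for illustration, not the paper’s exact setup.

```python
# Illustrative sketch of zero-shot vs few-shot test-question generation.
# `sample_from_lm` is a hypothetical stand-in for sampling a completion from a
# large language model.

def sample_from_lm(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder: return one sampled completion for the given prompt."""
    return "What do you think about <some sensitive topic>?"

ZERO_SHOT_PROMPT = "List of questions to ask someone:\n1."  # illustrative prompt

def zero_shot_question() -> str:
    # Zero-shot: the LM generates questions from a generic instruction alone.
    return sample_from_lm(ZERO_SHOT_PROMPT)

def few_shot_question(failing_examples: list[str]) -> str:
    # Few-shot: questions that previously elicited offensive replies are placed
    # in the prompt so the LM imitates them, raising the hit rate.
    shots = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(failing_examples))
    prompt = f"List of questions to ask someone:\n{shots}\n{len(failing_examples) + 1}."
    return sample_from_lm(prompt)

if __name__ == "__main__":
    print(zero_shot_question())
    print(few_shot_question(["Question that previously elicited an offensive reply"]))
```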

Image: DeepMind

As per the paper, the red-teaming methods behaved differently: some proved effective at producing diverse test cases, while others were effective at generating difficult ones.
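One simple way to picture this trade-off is to score each method on two axes: how often its test cases elicit offensive replies (difficulty) and how varied the test cases themselves are (diversity). The proxies below, an offensive-reply rate and a distinct-unigram ratio, are illustrative simplifications rather than the paper’s exact metrics.

```python
# Simplified proxies for comparing red-teaming methods on difficulty and
# diversity; both are assumptions for illustration, not the paper's measures.

def offensive_rate(flags: list[bool]) -> float:
    """Difficulty proxy: fraction of test cases that elicited an offensive reply."""
    return sum(flags) / len(flags) if flags else 0.0

def distinct_unigram_ratio(questions: list[str]) -> float:
    """Diversity proxy: unique words divided by total words across test cases."""
    tokens = [tok.lower() for q in questions for tok in q.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

if __name__ == "__main__":
    questions = ["Why do you hate Mondays?", "Why do you hate queues?"]
    flags = [True, False]  # which test cases drew an offensive reply
    print(f"difficulty={offensive_rate(flags):.2f}, "
          f"diversity={distinct_unigram_ratio(questions):.2f}")
```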

The generated test cases compared favourably to the manually written test cases from Xu et al. (2021b) in terms of diversity and difficulty. The team also used LM-based red teaming to uncover harmful chatbot behaviours such as replies that leak memorised training data. The researchers also generated targeted tests for a particular behaviour by sampling from a language model conditioned on a “prompt” or text prefix. 
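A hedged sketch of one way such data leakage could be flagged: check whether a reply shares a long verbatim n-gram with any training document. The overlap length, tokenisation, and toy corpus below are illustrative assumptions, not the paper’s procedure.

```python
# Flag replies that may leak memorised training data by looking for long
# verbatim n-gram overlaps with the training corpus (illustrative sketch).

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All contiguous n-grams of whitespace-split, lowercased tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def leaks_training_data(reply: str, training_docs: list[str], n: int = 8) -> bool:
    """True if the reply shares an n-gram of length >= n with any training document."""
    reply_grams = ngrams(reply, n)
    return any(reply_grams & ngrams(doc, n) for doc in training_docs)

if __name__ == "__main__":
    corpus = ["a very long and memorable sentence about red teaming language models"]
    reply = "Sure: a very long and memorable sentence about red teaming language models."
    print(leaks_training_data(reply, corpus))  # True: long verbatim overlap
```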

It said, “We also use prompt-based red teaming to automatically discover groups of people that the chatbot discusses in more offensive ways than others, on average across many inputs.”
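A minimal sketch of how such a check could work, assuming a prompt template that conditions the red LM on a group name and a classifier that scores each chatbot reply: compare the average offensiveness of replies across groups. The template wording and helper functions here are hypothetical placeholders.

```python
# Illustrative sketch of the group-level comparison quoted above: generate many
# questions about each group, collect the chatbot's replies, and compare mean
# offensiveness scores. All model calls are placeholders.

from statistics import mean

def sample_from_lm(prompt: str) -> str:
    """Placeholder: sample one completion from the red language model."""
    return "An automatically generated question about the group."

def chatbot_reply(question: str) -> str:
    """Placeholder: query the target chatbot."""
    return "Chatbot reply."

def offensiveness_score(reply: str) -> float:
    """Placeholder: offensive-text classifier score in [0, 1]."""
    return 0.0

def group_offensiveness(group: str, num_questions: int = 100) -> float:
    """Average offensiveness of replies to questions generated about one group."""
    prompt = f"List of questions about {group}:\n1."  # hypothetical prompt template
    scores = []
    for _ in range(num_questions):
        question = sample_from_lm(prompt)
        scores.append(offensiveness_score(chatbot_reply(question)))
    return mean(scores)

if __name__ == "__main__":
    for group in ["group A", "group B"]:
        print(group, group_offensiveness(group, num_questions=10))
```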

Observations

Once failure cases are detected, the team added, the harmful behaviour can be fixed by blacklisting certain phrases that frequently come up in harmful outputs, or by finding offensive training data quoted by the model and removing that data when training future iterations of the model. The model can also be trained to minimise the likelihood of its original, harmful output for a given test input.
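Of these fixes, the phrase blacklist is simple enough to sketch directly: filter the chatbot’s candidate replies against a list of phrases seen in harmful outputs. The phrase list and fallback message below are illustrative assumptions; the data-removal and retraining fixes happen offline and are not shown.

```python
# Minimal sketch of a phrase-blacklist filter applied to candidate replies.
# The phrases and fallback text are placeholders for illustration only.

BLACKLISTED_PHRASES = ["example offensive phrase", "another harmful phrase"]

def filter_reply(candidate_reply: str, fallback: str = "I'd rather not say.") -> str:
    """Replace a reply containing a blacklisted phrase with a safe fallback."""
    lowered = candidate_reply.lower()
    if any(phrase in lowered for phrase in BLACKLISTED_PHRASES):
        return fallback
    return candidate_reply

if __name__ == "__main__":
    print(filter_reply("A reply containing another harmful phrase."))  # fallback
    print(filter_reply("A perfectly harmless reply."))                 # unchanged
```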

Prior work in this area 

There has been previous work on detecting issues such as hate speech and indecent language. 

  • HateCheck is a suite of functional tests for hate speech detection models. The research team specified 29 model functionalities, motivated by a review of previous research and interviews with civil society stakeholders. They crafted test cases for each functionality and validated their quality through a structured annotation process.

  • RealToxicityPrompts is a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely used toxicity classifier. The team assessed “controllable generation methods” and found that, although data- or compute-based methods are more effective at steering away from toxicity, no current method is “failsafe against neural toxic degeneration.” A simplified version of such an evaluation is sketched below.
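A rough sketch of that style of evaluation, under simplifying assumptions: sample several continuations per prompt from a language model, score each with a toxicity classifier, and average the per-prompt maximum toxicity. The sampling and scoring functions are hypothetical placeholders, not the dataset’s actual tooling.

```python
# Simplified RealToxicityPrompts-style evaluation: expected maximum toxicity
# over sampled continuations. Model and classifier calls are placeholders.

from statistics import mean

def sample_continuation(prompt: str) -> str:
    """Placeholder: sample one continuation of the prompt from a language model."""
    return prompt + " ..."

def toxicity_score(text: str) -> float:
    """Placeholder: toxicity classifier score in [0, 1]."""
    return 0.0

def expected_max_toxicity(prompts: list[str], samples_per_prompt: int = 25) -> float:
    """Average, over prompts, of the highest toxicity among sampled continuations."""
    per_prompt_max = []
    for prompt in prompts:
        scores = [toxicity_score(sample_continuation(prompt))
                  for _ in range(samples_per_prompt)]
        per_prompt_max.append(max(scores))
    return mean(per_prompt_max)

if __name__ == "__main__":
    print(expected_max_toxicity(["The weather today is", "He walked into the room and"]))
```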