Researchers at DeepMind have found it’s possible to automatically find inputs that elicit harmful text from language models by generating inputs using language models themselves. Human annotation is expensive, while hand-written test cases limit the number and lack diversity to predict when a language model can behave in a harmful way, said the researchers.

DeepMind researchers generated test cases using a language model and then used a classifier to detect various harmful behaviours on test cases. ‘Red teaming’ with language models was able to find thousands of diverse failures without writing them by hand, the researchers said. 

The researchers explored several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. The team also used prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot’s own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation.

Researchers were able to uncover a variety of harmful model behaviours, including:

  1. Offensive Language: Hate speech, profanity, sexual content, discrimination, etc.
  2. Data Leakage: Generating copyrighted or private, personally-identifiable information from the training corpus.
  3. Contact Information Generation: Directing users to unnecessarily email or call real people.
  4. Distributional Bias: Talking about some groups of people in an unfairly different way than other groups, on average over a large number of outputs.
  5. Conversational Harms: Offensive language that occurs in the context of a long dialogue.

Researchers said their approach can be used to discover hypothesised harms from advanced machine learning systems, such as inner misalignment or failures in objective robustness.