OpenAI began by using academic datasets to evaluate language models but found that these benchmarks did not capture the safety and misuse risks that arise in real-world use.