NeedleBench tests bilingual long-context capabilities with tasks spanning context lengths from 4,000 to over 1 million tokens.