Detic, like ViLD, uses CLIP embeddings as the classifier.