Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition.