With recent advances in self-supervised learning, pre-training techniques have come to play a vital role in learning visual and language representations.