In artificial intelligence, pre-training refers to training a model on one task so that the parameters it learns can be reused for another task. Models built on this idea are known as pre-trained models. Building on the recent success of pre-training in NLP and image-language tasks, video-language pre-training methods are gradually being developed to improve video-text downstream tasks. In this article, a unified video and language pre-training model is explained. The following are the points that this article will cover.

Table of contents

  1. What is self-supervised learning?
  2. Single and multimodal pre-training
  3. What is the BERT model?
  4. How is BERT used for video-language pre-training?

With recent advances in self-supervised learning, pre-training techniques play a vital role in learning visual and language representations. The unified video and language pre-training model is based on self-supervised learning. Let’s start by talking about self-supervised learning.

What is self-supervised learning?

Self-supervised learning sits between supervised and unsupervised learning: it applies supervised-style training objectives to unstructured, unlabelled data, where the supervision signal is generated from the data itself.

In self-supervised learning, the goal is to first learn representations from a pool of unstructured data using self-supervision, and then fine-tune those representations with a small number of labels for the downstream task.

Depending on the application, the downstream task can be as simple as image classification or as complex as semantic segmentation or object detection. So, the model basically “learns from itself”.
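To make this recipe concrete, here is a minimal PyTorch sketch of the two stages: a pretext task on unlabelled data, followed by fine-tuning with only a few labels. The toy masked-reconstruction pretext task, the network sizes and the random data are illustrative assumptions, not part of any particular paper.

```python
import torch
import torch.nn as nn

# Stage 1: self-supervision on unlabelled data (toy masked-reconstruction pretext task).
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
decoder = nn.Linear(64, 32)                      # only needed for the pretext task
unlabelled = torch.randn(1000, 32)               # stand-in for unstructured data
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for _ in range(100):
    mask = (torch.rand_like(unlabelled) > 0.3).float()   # hide ~30% of the input
    recon = decoder(encoder(unlabelled * mask))
    loss = ((recon - unlabelled) ** 2).mean()            # reconstruct what was hidden
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: fine-tune a small head with only a handful of labels.
head = nn.Linear(64, 2)
few_x, few_y = torch.randn(20, 32), torch.randint(0, 2, (20,))
ft_opt = torch.optim.Adam(head.parameters(), lr=1e-3)

for _ in range(50):
    logits = head(encoder(few_x).detach())               # keep the encoder frozen here
    loss = nn.functional.cross_entropy(logits, few_y)
    ft_opt.zero_grad()
    loss.backward()
    ft_opt.step()
```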

Single and multimodal pre-training

Single-modal pre-training

In single-modal pre-training, a model is pre-trained on either text or video, but not both. Language pre-training models such as BERT and BART have achieved great success in NLP.

  • BERT is a denoising auto-encoder network built on the Transformer, with masked language modelling (MLM) and next sentence prediction (NSP) as its pre-training tasks (a small sketch of MLM masking follows this list).
  • BART is a denoising sequence-to-sequence pre-training model that handles both understanding and generation tasks.
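As a rough illustration of the MLM pre-training task mentioned above, the snippet below masks about 15% of the token ids in a sentence and records which ids the model would have to recover. The token ids are toy values, and 103 is assumed as the [MASK] id (it happens to match bert-base-uncased).

```python
import random

MASK_ID = 103                       # assumed [MASK] id, as in bert-base-uncased

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted_ids, labels); labels are -100 where no prediction is needed."""
    corrupted, labels = [], []
    for tid in token_ids:
        if random.random() < mask_prob:
            corrupted.append(MASK_ID)   # hide this token from the model
            labels.append(tid)          # the model must predict the original id here
        else:
            corrupted.append(tid)
            labels.append(-100)         # conventionally ignored by the loss
    return corrupted, labels

corrupted, labels = mask_tokens([2023, 2003, 1037, 7953, 6251])
```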

Video representation learning mostly uses video sequence reconstruction or future frame prediction as its pre-training (pretext) tasks.

Multimodal pre-training

In multimodal pre-training, the model is trained on both video and linguistic data. The main paradigms for multimodal pre-training are listed below (a rough sketch of the single-stream and two-stream wiring follows the list).

  • Share-type is a single-stream design where the text and vision sequences are concatenated and fed into one shared Transformer encoder.
  • Cross-type is a two-stream design that can accommodate each modality’s different processing needs and lets the two streams interact at varying representation depths.
  • Joint-type is a two-stream design that uses one cross-modal encoder for full interaction between the two streams.
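Below is a rough PyTorch sketch contrasting the share-type (single-stream) and joint-type (two-stream) wiring. The dimensions, layer counts and random inputs are assumptions chosen only to illustrate how the streams are connected.

```python
import torch
import torch.nn as nn

d = 256

def make_layer():
    return nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)

text = torch.randn(2, 20, d)    # (batch, text tokens, dim)
video = torch.randn(2, 30, d)   # (batch, video tokens, dim)

# Share-type: concatenate both modalities and feed one shared encoder.
shared_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
fused = shared_encoder(torch.cat([text, video], dim=1))

# Joint-type (two-stream): each modality first passes through its own encoder,
# then one cross-modal encoder handles the interaction between the two streams.
text_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
video_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
cross_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
joint = cross_encoder(torch.cat([text_encoder(text), video_encoder(video)], dim=1))
```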

What is the BERT model?

BERT (Bidirectional Encoder Representations from Transformers) is an open-source machine learning framework for natural language processing. Its main pre-training objective is the masked language model: in simple terms, it applies the bidirectional training of the Transformer encoder to language modelling.

Each output element is connected to every input element, and the weightings between them are calculated dynamically based on their connections; in NLP (Natural Language Processing) this mechanism is known as attention. The Transformer encoder reads the input text from both sides, which is important for understanding the true meaning of a word. For example, in “he deposited money at the bank” versus “he sat on the bank of the river”, the meaning of “bank” can only be resolved by looking at the words on both sides of it.
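One quick way to see this bidirectional masked language model in action is the fill-mask pipeline from the Hugging Face transformers library (an assumed dependency that downloads the bert-base-uncased checkpoint on first use).

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# The two sentences share the blank, but BERT's predictions differ because it
# attends to the words on BOTH sides of [MASK].
print(fill("He deposited the money at the [MASK]."))
print(fill("He sat on the [MASK] of the river."))
```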

Architecture

BERT is made by stacking up encoder layers. There are two versions: BERT base and BERT large:

  • BERT base has 12 transformer encoders, 12 attention heads and 110 million parameters.
  • BERT large has 24 transformer encoders, 16 attention heads and 340 million parameters.

The architecture of BERT base has three different sections. In the first section, two special tokens appear: [CLS] and [SEP], both used by BERT to understand the input.

  • [SEP] is a separator token placed at the end of each sentence (segment) so the model knows where one sentence ends and the next begins.
  • [CLS] is a special classification token added at the start of the input; its final hidden state is used as the aggregate representation of the whole sequence for classification tasks. The sentences themselves are additionally distinguished by segment embeddings, where the first sentence is encoded as 0 and the second as 1.

In the next section sit the BERT encoder layers, which read those sentences bidirectionally. The last section holds the output representations: in BERT base each token, including the [CLS] and [SEP] tokens, is encoded as a hidden vector of size 768 (a hidden size of 768, not 768 hidden layers).
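The BERT base numbers quoted above can be written down directly as a Hugging Face transformers BertConfig (an assumed dependency); note again that 768 is the per-token hidden size, not a layer count.

```python
from transformers import BertConfig, BertModel

config = BertConfig(
    hidden_size=768,          # size of each token's hidden vector
    num_hidden_layers=12,     # stacked Transformer encoder layers
    num_attention_heads=12,   # attention heads per layer
)
model = BertModel(config)
print(sum(p.numel() for p in model.parameters()))   # roughly 110 million parameters
```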

How is BERT used for video-language pre-training?

To extend BERT to video in a way that takes advantage of pre-trained language models and of scalable implementations for inference and learning, the raw visual data must be transformed into a discrete sequence of tokens. This transformation generates a sequence of “visual words” by applying hierarchical vector quantization to features derived from the video with a pre-trained video model.
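Below is a simplified sketch of how video features could be quantized into discrete “visual words”. The approach described above uses hierarchical k-means on features from a pre-trained video model; this sketch uses a single flat k-means over random stand-in features (scikit-learn is an assumed dependency), so the feature dimension and vocabulary size are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

clip_features = np.random.randn(500, 1024)        # stand-in for per-clip video features
codebook = KMeans(n_clusters=64, n_init=10).fit(clip_features)

def video_to_tokens(features):
    """Map each clip's feature vector to the id of its nearest cluster centroid."""
    return [f"v{idx:02d}" for idx in codebook.predict(features)]

print(video_to_tokens(clip_features[:5]))          # e.g. ['v07', 'v41', ...]
```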

Furthermore, this approach encourages the model to focus on the video’s high-level semantics and long-range temporal dynamics. Combining the linguistic sentence (obtained from the video with automatic speech recognition, ASR) with the visual sentence yields data of this form:

[CLS] orange chicken with [MASK] sauce [>] v01 [MASK] v08 v72 [SEP], 

where,

  •  v01 and v08 are visual tokens
  •  [>] is a special token introduced to combine text and video sentences
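A hypothetical helper that assembles the combined sequence shown above might look like this; the token strings mirror the example and are not tied to any particular tokenizer.

```python
def build_video_text_sequence(text_tokens, visual_tokens):
    """Join a linguistic sentence and a visual sentence with the special [>] token."""
    return ["[CLS]"] + text_tokens + ["[>]"] + visual_tokens + ["[SEP]"]

seq = build_video_text_sequence(
    ["orange", "chicken", "with", "[MASK]", "sauce"],
    ["v01", "[MASK]", "v08", "v72"],
)
print(" ".join(seq))
# [CLS] orange chicken with [MASK] sauce [>] v01 [MASK] v08 v72 [SEP]
```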

So now we have tokenized and segmented visual and linguistic text that could be fed into a next-sentence-style prediction task, but before that, the visual and linguistic sentences need to be aligned. This is handled by a linguistic-visual alignment task, in which the final hidden state of the [CLS] token is used to predict whether the linguistic sentence is temporally aligned with the visual sentence (a minimal sketch of this classifier follows the note below). Note that this is a noisy indicator of semantic relatedness.

  • Semantic relatedness/similarity is defined over a set of documents or terms, where the distance between items is based on how similar they are in meaning. Temporal alignment is only a noisy proxy for it because, for example, in instructional videos the speaker may be referring to something that is not visually present.
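Here is the minimal sketch of the alignment classifier referred to above: a binary head on the final hidden state of the [CLS] token. The encoder output is replaced by a random tensor, and the hidden size and batch shape are assumptions.

```python
import torch
import torch.nn as nn

hidden_size = 768
align_head = nn.Linear(hidden_size, 2)              # aligned vs. not aligned

sequence_output = torch.randn(8, 48, hidden_size)   # stand-in for (batch, tokens, hidden)
cls_state = sequence_output[:, 0]                   # final hidden state of [CLS]
logits = align_head(cls_state)

labels = torch.randint(0, 2, (8,))                  # 1 = temporally aligned
loss = nn.functional.cross_entropy(logits, labels)
```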

To mitigate this, first, neighbouring sentences are randomly concatenated into a single long sentence, which lets the model learn semantic correspondence even when the two modalities are not well aligned temporally. Second, since the pace of state transitions for even the same action can vary greatly between videos, a subsampling rate of 1 to 5 steps is picked at random for the video tokens.

This not only helps the model be more robust to variations in video speeds but also allows the model to capture temporal dynamics over a greater period and learn longer-term state transitions. 
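A tiny sketch of the random subsampling trick: a rate between 1 and 5 is drawn per example and only every k-th visual token is kept, so the model effectively sees many different video speeds. The helper name is hypothetical.

```python
import random

def subsample_video_tokens(visual_tokens):
    k = random.randint(1, 5)        # subsampling rate drawn per training example
    return visual_tokens[::k]       # keep every k-th visual token

tokens = [f"v{i:02d}" for i in range(20)]
print(subsample_video_tokens(tokens))
```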

After all these steps, three training regimes corresponding to the different input modalities are used: text-only, video-only and video-text. For text-only and video-only, the standard mask-completion objectives train the model; for video-text, the linguistic-visual alignment classification described earlier is used. The overall training objective is a weighted sum of the individual objectives:

  • The text objective forces BERT to do well at language modelling;
  • The video objective forces it to learn a “language model for video”, which can be used for learning dynamics and forecasting; and
  • The text-video objective forces it to learn the correspondence between the two domains.

Once trained, the model can be used for a variety of downstream tasks.
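The overall objective described above is just a weighted sum of the three per-regime losses; the loss values and equal weights below are placeholders.

```python
import torch

text_loss = torch.tensor(2.1)        # masked-language-model loss on text
video_loss = torch.tensor(3.4)       # mask-completion loss on visual tokens
text_video_loss = torch.tensor(0.7)  # linguistic-visual alignment loss

weights = {"text": 1.0, "video": 1.0, "text_video": 1.0}   # assumed weights
total_loss = (weights["text"] * text_loss
              + weights["video"] * video_loss
              + weights["text_video"] * text_video_loss)
```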

Final Verdict

This article focused on how a powerful BERT-based model, trained with self-supervised learning, can be jointly pre-trained on language and video and then used to make predictions on downstream tasks.
