What Is Vokenization And Its Significance For NLP Applications
Image captions are an explicit form of grounded language, but grounded language differs considerably from most other kinds of natural language. Existing language pre-training frameworks are driven by contextual learning, which takes only the surrounding language context as self-supervision.
Although self-supervised frameworks like BERT and GPT have set new standards in natural language understanding, they do not borrow grounding information from the external visual world. To address this gap, researchers from UNC-Chapel Hill introduced a technique that maps language tokens to contextually relevant images. More about this technique in the next section.
About Vokenization
The authors state that there is a huge divide between visually-grounded language datasets and pure-language corpora. So they developed a technique named “vokenization” that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to related images, hence the name vokens (visualised tokens). Based on these vokens, the authors propose a new pre-training task for language: voken classification. The “vokenizer” is trained on relatively small image-captioning datasets and then applied to generate vokens for large language corpora.
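As a rough illustration of what voken classification could look like in code: a classification head over a finite voken vocabulary sits on top of the language model’s hidden states and is trained with cross-entropy against the retrieved voken indices. This is a minimal sketch, assuming hypothetical hidden and vocabulary sizes rather than the paper’s exact configuration:

```python
import torch
import torch.nn as nn

class VokenClassificationHead(nn.Module):
    """Predicts a voken id for every token position (illustrative sketch).

    hidden_dim and num_vokens are assumptions for this example,
    not the paper's actual configuration.
    """
    def __init__(self, hidden_dim: int = 768, num_vokens: int = 50000):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_vokens)

    def forward(self, hidden_states: torch.Tensor, voken_ids: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_dim) from the language model
        # voken_ids:     (batch, seq_len) long tensor of retrieved voken indices
        logits = self.classifier(hidden_states)        # (batch, seq_len, num_vokens)
        loss = nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),          # flatten token positions
            voken_ids.view(-1),
        )
        return loss
```

During pre-training, this voken-classification objective supplements the standard masked-language-modelling loss.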
The vokenization procedure, step by step:
- Assign each token in a sentence a relevant image.
- Retrieve that image from a set of candidate images using a token-image-relevance scoring function.
- This scoring function, parameterised by θ, measures the relevance between a token in the sentence and an image.
- The voken for a token is realised as the image that maximises this relevance score, as shown in the sketch after this list.
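A minimal sketch of this retrieval step, assuming both encoders have already produced L2-normalised embeddings so that a dot product serves as the relevance score (the function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def assign_vokens(token_embs: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """For each contextual token embedding, pick the most relevant image.

    token_embs: (num_tokens, dim) contextual embeddings of a sentence's tokens
    image_embs: (num_images, dim) embeddings of the candidate image set
    Both are assumed L2-normalised, so dot product = cosine relevance.
    Returns the index of the argmax image (the voken) for every token.
    """
    relevance = token_embs @ image_embs.T   # (num_tokens, num_images)
    return relevance.argmax(axis=1)         # one voken id per token
```

For a large image set, a practical implementation would replace this brute-force matrix product with an approximate nearest-neighbour index.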
Contextual language representation learning is driven by self-supervision without considering explicit connections (grounding) to the external world. As described above, the researchers instead visually supervise the language model with token-related images. These images are called vokens (visualised tokens), and vokenization generates them contextually.
At the core of the vokenization process lies a contextual token-image matching model. The model takes as input a sentence, composed of a sequence of tokens, together with an image; the output is a relevance score between each token and the image, computed while considering the whole sentence as context.
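To make this concrete, here is a minimal sketch of how such a matching model could be trained. The hinge-style ranking loss, the margin value, and the feature shapes are illustrative assumptions; a language encoder is assumed to produce contextual token features and an image encoder to produce visual features:

```python
import torch.nn.functional as F

def matching_loss(token_feats, pos_img_feats, neg_img_feats, margin: float = 0.5):
    """Contextual token-image matching with a ranking loss (illustrative sketch).

    token_feats:   (B, L, D) contextual token features for each sentence
    pos_img_feats: (B, D)    features of the image paired with each sentence
    neg_img_feats: (B, D)    features of a mismatched (negative) image
    The margin value is an assumption, not the paper's setting.
    """
    token_feats = F.normalize(token_feats, dim=-1)
    pos = F.normalize(pos_img_feats, dim=-1).unsqueeze(1)   # (B, 1, D)
    neg = F.normalize(neg_img_feats, dim=-1).unsqueeze(1)   # (B, 1, D)

    # Relevance of every token against the positive / negative image.
    pos_score = (token_feats * pos).sum(-1)                 # (B, L)
    neg_score = (token_feats * neg).sum(-1)                 # (B, L)

    # Push each token to score higher with its own sentence's image.
    return F.relu(margin - pos_score + neg_score).mean()
```

Normalising both feature sets makes the dot product behave like a cosine relevance score, keeping scores comparable across tokens and images.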
To evaluate the technique, the researchers pre-trained their model on English Wikipedia and its subset Wiki103, with the vokenizer generating vokens for both corpora. The pre-trained models were then fine-tuned on GLUE and other benchmarks. The GLUE benchmark is formatted to be model-agnostic: any system capable of processing sentences and sentence pairs and producing corresponding predictions is eligible to participate. The benchmark tasks are selected so as to favour models that share information across tasks through parameter sharing or other transfer-learning techniques.
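For orientation, the sentence / sentence-pair structure of GLUE tasks can be inspected with the Hugging Face datasets library (this tooling is not part of the paper; it is shown only to illustrate the task format):

```python
from datasets import load_dataset

# MRPC is a sentence-pair task: two sentences plus a paraphrase label.
mrpc = load_dataset("glue", "mrpc", split="train")
print(mrpc[0]["sentence1"], mrpc[0]["sentence2"], mrpc[0]["label"])

# SST-2 is a single-sentence task: one sentence plus a sentiment label.
sst2 = load_dataset("glue", "sst2", split="train")
print(sst2[0]["sentence"], sst2[0]["label"])
```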
[Table from the paper: results of vision-and-language pre-trained models on GLUE tasks.]
The authors note that even when a unique related image is hard to define, the vokenizer aims to ground non-concrete tokens (e.g., “by”/“and”/“the”) to relevant images. This related visual information helps the model understand the language and leads to the observed improvements.
Key Takeaways
- This work on vokenization explores the possibility of providing visual supervision to language encoders.
- Experimental results show significant improvements over purely self-supervised language models on multiple language tasks.
While this work showed strong improvements for vision-and-language applications, the authors admit that misalignments remain. These misalignments, they concluded, are possibly caused by the limitations of weak sentence-image supervision in the training data, since strong token-image annotations are not available.
Check the original paper here.