Inside HugsVision, An Open-Source Hugging Face Wrapper For Computer Vision
A researcher from Avignon University recently released HugsVision, an open-source, easy-to-use wrapper around Hugging Face for computer vision in healthcare. The new toolkit can be used to develop state-of-the-art computer vision systems for tasks such as image classification, semantic segmentation, object detection, image generation and denoising.
The source code for HugsVision is available on GitHub.
HugsVision can be installed from PyPI. It supports both CPU and GPU computation; however, for most recipes, a GPU is necessary during training, and CUDA needs to be installed to use GPUs.
All the model checkpoints provided by Hugging Face Transformers that are compatible with the supported tasks can be integrated seamlessly from the Hugging Face model hub, where they are uploaded directly by users and organisations.
Currently, Hugging Face Transformers provides the following architectures for computer vision:
- ViT (from Google Research, Brain Team): ‘An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale’
- DeiT (from Facebook AI and Sorbonne University): ‘Training data-efficient image transformers & distillation through attention’
- DETR (from Facebook AI): ‘End-to-End Object Detection with Transformers’
- BEiT (from Microsoft Research): ‘BEiT: BERT Pre-Training of Image Transformers’
Example of HugsVision
In the article ‘How to train a custom vision transformer (ViT) image classifier to help endoscopists in less than five minutes,’ the creator of HugsVision, Yanis Labrak, showed how to train an image classifier model based on transformer architecture to help endoscopists automate the detection of various anatomical landmarks, pathological findings, or endoscopic procedures in the gastrointestinal tract.
Here are the steps to follow when building an image-classification model:
- Install HugsVision
- Download the Kvasir V2 dataset (~2.3 GB) and load it
- Choose an image classifier model on Hugging Face
- Set up the Trainer and start the fine-tuning
- Evaluate the performance of the model
- Use Hugging Face to run inference on images
Install HugsVision
To begin with, set up the Anaconda environment. The author said Anaconda is a good way to reduce compatibility issues between package versions for all your projects by providing you with an isolated Python environment.
After this, install HugsVision from PyPI. Doing this provides a fast way to install the toolkit without worrying about dependency conflicts, said Labrak.
Download Kvasir V2 dataset & load it
For this study, the researcher used the Kvasir Dataset v2, which weighs ~2.3 GB. The dataset comprises eight classes of 1,000 images each, for a total of 8,000 images. The JPEG images are stored in separate folders according to the class they belong to. Each class shows anatomical landmarks, pathological findings, or endoscopic procedures in the gastrointestinal tract.
Once the dataset is ready, the next step is to load the data. The loader takes the path to the dataset folder as its first parameter, followed by the percentage of data to set aside for the test set, an option to balance the number of images in each class of the training set, and an option to enable data augmentation, which randomly changes the contrast of the images.
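The snippet below is a minimal sketch of this loading step, based on the VisionDataset.fromImageFolder helper shown in the HugsVision examples; the dataset path and exact parameter values are illustrative.

```python
from hugsvision.dataio.VisionDataset import VisionDataset

# Path to the extracted Kvasir V2 folder (one sub-folder per class) -- adjust to your setup.
train, test, id2label, label2id = VisionDataset.fromImageFolder(
    "./kvasir-dataset-v2/",
    test_ratio=0.15,    # share of the images reserved for the test set
    balanced=True,      # balance the number of images per class in the training set
    augmentation=True,  # enable data augmentation (e.g. random contrast changes)
)
```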
Choose an image classification model
The researcher used the Hugging Face Transformers package, which provides access to the Hugging Face Hub, a collection of pretrained models and pipelines for various tasks in domains such as NLP, computer vision (CV) and automatic speech recognition (ASR).
Once the base model is selected, you can fine-tune it to fit your needs. Fine-tuning continues the training of a generic model that was pre-trained on a related task (here, image classification) with a much larger amount of data. In many tasks, this approach has shown better results than training a model from scratch on the target data alone.
Advantages of using a pre-trained model:
- The training process is faster, since only the classification layer is trained while the other layers are frozen
- The model is more effective, since it reuses embeddings that have already been trained
To be compatible with HugsVision, the model needs to be available in PyTorch and compatible with the image classification task. Check out the models matching these criteria here.
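As a rough illustration, a ViT checkpoint can be pulled from the Hub with the standard Transformers classes; google/vit-base-patch16-224-in21k is a commonly used example, not necessarily the exact checkpoint chosen by the author.

```python
from transformers import ViTFeatureExtractor, ViTForImageClassification

# Example checkpoint from the Hugging Face Hub (illustrative choice).
huggingface_model = "google/vit-base-patch16-224-in21k"

# The label mappings come from the dataset loaded earlier.
model = ViTForImageClassification.from_pretrained(
    huggingface_model,
    num_labels=len(label2id),
    label2id=label2id,
    id2label=id2label,
)
feature_extractor = ViTFeatureExtractor.from_pretrained(huggingface_model)
```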
Set up the Trainer and start the fine-tuning
Once the model is selected, you can build the Trainer and launch the fine-tuning, as sketched below.
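This is a minimal sketch of the trainer setup, assuming the VisionClassifierTrainer class from the HugsVision repository; the run name and hyperparameters are placeholders.

```python
from hugsvision.nnet.VisionClassifierTrainer import VisionClassifierTrainer

trainer = VisionClassifierTrainer(
    model_name="KvasirV2Classifier",  # hypothetical run name
    train=train,
    test=test,
    output_dir="./out/",
    max_epochs=1,                     # placeholder hyperparameters
    batch_size=32,
    model=model,                      # model and feature extractor loaded above
    feature_extractor=feature_extractor,
)
```

In the HugsVision examples, constructing the trainer launches the fine-tuning and writes the resulting model under the output directory.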
Evaluate the performance of the model
The researcher used the F1-Score metric to better represent the predictions across all labels and to spot anomalies on a specific label. Here is how the F1-Score is calculated:
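$$\mathrm{F1} = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

that is, the harmonic mean of precision and recall, computed for each class from the predictions on the test set.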
The author notes that the F1-Score is a nice way to get an overview of the results, but it is not enough to deeply understand the reasons for the errors, which can be caused by an imbalanced dataset, a lack of data, or even a high proximity between classes.
So, to understand the model's decisions or to fix it, drawing the confusion matrix to see which classes get confused with each other can help, as in the sketch below.
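As a sketch of this diagnostic step, the reference and predicted labels on the test set can be compared with scikit-learn; evaluate_f1_score is the evaluation helper shown in the HugsVision examples, and its exact name may differ across versions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Reference and predicted labels on the test set
# (evaluation helper shown in the HugsVision examples; exact API is an assumption).
ref, hyp = trainer.evaluate_f1_score()

print(classification_report(ref, hyp))  # per-class precision, recall and F1-Score
print(confusion_matrix(ref, hyp))       # rows = true classes, columns = predicted classes
```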
Use Hugging Face to run inference on images
Here, you will have to rename the ‘./out/MODEL_PATH/config.json’ file present in the model output to ‘./out/MODEL_PATH/preprocessor_config.json’.
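Below is a hedged sketch of the inference step, assuming the VisionClassifierInference class from the HugsVision repository; the model path and image path are placeholders.

```python
from transformers import ViTFeatureExtractor, ViTForImageClassification
from hugsvision.inference.VisionClassifierInference import VisionClassifierInference

path = "./out/MODEL_PATH/"  # placeholder: folder containing the renamed preprocessor_config.json

classifier = VisionClassifierInference(
    feature_extractor=ViTFeatureExtractor.from_pretrained(path),
    model=ViTForImageClassification.from_pretrained(path),
)

# Predict the class of a single endoscopy image (placeholder path).
label = classifier.predict(img_path="./my_image.jpg")
print("Predicted class:", label)
```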
Wrapping up
HugsVision is still in the early stages of development and is evolving; new features, tutorials and documentation are expected to be released soon. Check out the complete code for training your custom vision transformer (ViT) image classifier here. Also, find more tutorials about using HugsVision on GitHub.