Top initiatives taken to overcome Wikipedia challenges
Wikipedia completed its 20th anniversary last year and is today the seventh-most popular website globally. It hosts over 55 million articles and receives 15 billion visits every month. Unlike most platforms that struggle to be inclusive in language, Wikipedia is available in 309 languages.
That said, the platform faces several challenges, including inaccuracies, a lack of inclusivity and a gender gap. The Wikimedia Foundation, the nonprofit behind Wikipedia, is working towards overcoming these issues, and third-party companies are also coming up with innovative solutions. We list some of these tools here.
Meta’s AI-generated biographies of women
Only about 20 percent of Wikipedia biographies are about women, and the share is even smaller for women from intersectional groups. Meta’s latest initiative leverages AI to address this imbalance: the system can research and write first drafts of Wikipedia-style biographical entries. The open-sourced tool is an end-to-end AI model that automatically creates high-quality biographical articles about important real-world public figures. The model searches websites for relevant information to draft an entry about a person, complete with citations. The FAIR team is also releasing a novel dataset that evaluates model performance on 1,527 biographies of women from marginalised groups.
The model uses a retrieval-augmented generation architecture built on large-scale pretraining. Given the subject of the biography, a retrieval module first identifies relevant information from the web. Next, the generation module creates the text, followed by the citation module, which builds the bibliography linking back to the sources. The biography is produced section by section, with the process repeating for each section and a caching mechanism supplying context from earlier sections.
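To make the retrieve-generate-cite loop concrete, here is a minimal Python sketch of how such a pipeline can be structured. It is not Meta’s released code: the section headings and the retrieve, generate_section and cite functions are hypothetical placeholders standing in for the model’s actual modules.

```python
# Illustrative sketch of a retrieval-augmented biography pipeline.
# All function bodies below are hypothetical stand-ins, not Meta's released model.

from typing import List, Tuple

SECTIONS = ["Early life", "Education", "Career"]  # assumed section headings


def retrieve(subject: str, section: str) -> List[str]:
    """Fetch web passages relevant to `subject` for this section (placeholder)."""
    return [f"Passage about {subject} relevant to {section.lower()}."]


def generate_section(subject: str, section: str,
                     passages: List[str], context: str) -> str:
    """Write the section text conditioned on retrieved passages and prior context."""
    return f"{section}: draft text for {subject} based on {len(passages)} passages."


def cite(passages: List[str]) -> List[str]:
    """Build bibliography entries linking back to the source passages (placeholder)."""
    return [f"[{i + 1}] {p}" for i, p in enumerate(passages)]


def write_biography(subject: str) -> Tuple[str, List[str]]:
    cache = ""                        # running context cached from earlier sections
    bibliography: List[str] = []
    article_parts: List[str] = []
    for section in SECTIONS:
        passages = retrieve(subject, section)                         # retrieval module
        text = generate_section(subject, section, passages, cache)    # generation module
        bibliography += cite(passages)                                # citation module
        cache += " " + text
        article_parts.append(text)
    return "\n\n".join(article_parts), bibliography


if __name__ == "__main__":
    draft, refs = write_biography("Example Person")
    print(draft)
    print("\n".join(refs))
```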
Wikimedia Image/Caption Matching Competition on Kaggle
Wikimedia is a global movement whose mission is to make educational content freely available. However, as Wikimedia explains, “Wikipedia articles are missing images, and Wikipedia images are missing captions”. To mitigate this problem, the Wikipedia Image/Caption Matching Competition was launched on Kaggle at the start of the year. It aims to develop systems that can automatically associate images with their corresponding captions and article titles.
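At its core, this is a cross-modal ranking problem: embed images and candidate captions in a shared space and rank captions by similarity. The sketch below shows only that ranking step; the embeddings are random stand-ins, where a real entry would use a pretrained vision-language encoder (for example, a CLIP-style dual encoder) to produce them.

```python
# Sketch of image-caption matching via embedding similarity.
# Embeddings are random placeholders; a real system would compute them
# with a pretrained vision-language encoder.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings: 5 article images and 5 candidate captions, 512-dim each.
image_embeddings = rng.normal(size=(5, 512))
caption_embeddings = rng.normal(size=(5, 512))


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Cosine similarity between every image and every candidate caption.
similarity = l2_normalize(image_embeddings) @ l2_normalize(caption_embeddings).T

# For each image, rank captions from most to least similar -- essentially
# what the Kaggle task asks submitted systems to do.
ranked = np.argsort(-similarity, axis=1)
for img_idx, caption_order in enumerate(ranked):
    print(f"image {img_idx}: best caption candidates -> {caption_order.tolist()}")
```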
Google’s WIT dataset
In September 2021, Google released the Wikipedia-Based Image Text (WIT) dataset in partnership with the Wikimedia Foundation to address Wikipedia’s knowledge gap. WIT is a large multimodal dataset created by extracting various text selections associated with images from Wikipedia articles and Wikimedia image links. Development involved rigorous filtering to retain only high-quality image-text sets. The aim was to create a high-quality, large, multilingual dataset with varied content. Compared with previous datasets, it offers greater language coverage and size, resulting in a curated set of 37.5 million entity-rich image-text examples and 11.5 million unique images across 108 languages. Google believes the dataset will help researchers build better multimodal multilingual models and identify better representation techniques, improving machine learning models on real-world tasks over visio-linguistic data.
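WIT is distributed as tab-separated files, so exploring it takes only a few lines of pandas. In the sketch below, the shard filename and column names are assumptions based on the public release at https://github.com/google-research-datasets/wit and should be checked against the files actually downloaded.

```python
# Minimal sketch of loading and filtering one WIT shard with pandas.
# Filename and column names are assumptions to verify against the release.

import pandas as pd

# Hypothetical local path to one of the released tab-separated shards.
SHARD = "wit_v1.train.all-00000-of-00010.tsv.gz"

df = pd.read_csv(SHARD, sep="\t", compression="gzip")

# Keep only rows that have a human-written reference caption for the image.
captioned = df[df["caption_reference_description"].notna()]

# Example: count image-text pairs per language to inspect multilingual coverage.
print(captioned["language"].value_counts().head(10))
```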
IIT Madras’s Hidden Voices
In March 2022, IIT Madras’s research wing, the Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), collaborated with business consultancy firm SuperBloom Studios to launch ‘Hidden Voices’, an initiative to reduce the gender data gap in digital sources, starting with Wikipedia. The goal is to auto-generate biographies of notable women, accounting for factors such as editors’ gender and interests as well as contributions from external sources. Through information-theoretic approaches, ML-assisted auto-identification and validation of external sources, and textual analysis methods, Hidden Voices aims to auto-generate the first draft of a Wikipedia-style biography. Next, it will employ this approach to generate Wikipedia articles for notable women in STEMM (Science, Technology, Engineering, Medicine and Management).
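One textual-analysis step in such a pipeline is deciding which external sources are actually about the subject. The sketch below is not the Hidden Voices codebase; it is a generic illustration, under assumed example data, of ranking candidate sources by TF-IDF similarity to a short subject description.

```python
# Illustrative sketch of ranking candidate external sources by textual
# similarity to a subject description. Generic example, not project code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

subject_description = "Indian physicist known for work on semiconductor devices"

# Hypothetical snippets pulled from candidate external sources.
candidate_sources = [
    "Profile of a physicist researching semiconductor devices in India.",
    "Recipe blog post about regional cuisine.",
    "University page listing faculty in the physics department.",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([subject_description] + candidate_sources)

# Similarity of each candidate source to the subject description.
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
for source, score in sorted(zip(candidate_sources, scores),
                            key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {source}")
```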
Wiki-reliability
Another Wikimedia initiative, Wiki-Reliability is a large-scale dataset for content reliability on Wikipedia. Despite the platform’s reach, its content quality is still maintained by a community of volunteer editors. Wiki-Reliability is the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues. It was built on Wikipedia templates, the tags expert Wikipedia editors use to flag content issues. The approach labels almost 1 million Wikipedia article revisions as positive or negative examples of a given reliability issue, and the authors describe the downstream tasks such data enables. The dataset can be used to train large-scale models for content reliability prediction.
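Because each revision comes with a positive or negative label for a reliability issue, a baseline reliability predictor reduces to ordinary text classification. The sketch below assumes a hypothetical filename and column names (revision_text, label) that should be checked against the released files; it only illustrates the kind of downstream model the dataset enables.

```python
# Sketch of training a simple reliability classifier on one Wiki-Reliability
# template file. Filename and column names are assumptions to verify.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("wiki_reliability_unreferenced.csv")  # hypothetical filename

texts = df["revision_text"].astype(str)   # assumed text column
labels = df["label"]                      # assumed 0/1 reliability label

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

# TF-IDF features feeding a logistic-regression classifier.
model = make_pipeline(TfidfVectorizer(max_features=50_000),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```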




