Igeekphone, February 14 news: tech media marktechpost published a blog post yesterday (February 13) reporting that the Google DeepMind team has released the WebLI-100B dataset, containing 100 billion image-text pairs. The dataset aims to enhance cultural diversity and multilingualism and to reduce performance gaps between subgroups, improving inclusiveness.
Current challenge
Note: Machines learn to connect images and text from large datasets; the more data available, the better a model becomes at recognizing patterns and improving accuracy. Visual language models (VLMs) rely on these datasets to perform tasks such as image captioning and visual question answering.
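To make the idea of "connecting images and text" concrete, here is a minimal sketch of a CLIP-style contrastive objective, a common way of aligning paired image and text embeddings. The encoders, embedding size, and batch are placeholders for illustration, not the actual architecture or training recipe used by DeepMind.

```python
# Minimal CLIP-style contrastive objective: given a batch of paired image and
# text embeddings, pull matching pairs together and push mismatched pairs apart.
# The embeddings below are random stand-ins; real VLMs produce them with large
# vision and text encoders.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Normalize so similarity is the cosine between embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.t() / temperature

    # The correct caption for image i sits at index i (and vice versa).
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy usage: a batch of 8 image/text embedding pairs of dimension 512.
images = torch.randn(8, 512)
captions = torch.randn(8, 512)
print(contrastive_loss(images, captions))
```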
Visual language models currently rely on large datasets such as Conceptual Captions and LAION, which contain millions to billions of image-text pairs. These datasets support zero-shot classification and image caption generation, but their growth has largely stalled at around 10 billion pairs.
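For intuition, zero-shot classification with such a model typically amounts to comparing an image embedding against text embeddings of candidate label prompts and picking the closest one. The toy sketch below uses random stand-in embeddings rather than a real encoder.

```python
# Zero-shot classification sketch: embed the image and a text prompt for each
# candidate label, then pick the label whose embedding is closest to the image.
# Embeddings here are random stand-ins, not outputs of a trained model.
import torch
import torch.nn.functional as F

labels = ["a photo of a cat", "a photo of a dog", "a photo of a bicycle"]

image_emb = F.normalize(torch.randn(1, 512), dim=-1)
text_embs = F.normalize(torch.randn(len(labels), 512), dim=-1)

# Cosine similarity between the image and every label prompt.
scores = (image_emb @ text_embs.t()).squeeze(0)
print(labels[scores.argmax().item()])
```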
This plateau limits further gains in model accuracy, inclusiveness, and multilingual understanding. Existing datasets, built from web-crawled data, also suffer from low sample quality, linguistic bias, and insufficient multicultural representation.
The WebLI-100B dataset
To mitigate the limitations of visual language models in terms of cultural diversity and multilingualism, researchers at Google DeepMind built the WebLI-100B dataset, which contains 100 billion image-text pairs, ten times more than previous datasets.
This dataset captures rare cultural concepts and improves model performance in less explored areas such as low-resource languages and diverse representations. Unlike previous datasets, WebLI-100B does not rely on strict filtering, which often removes important cultural details; instead, it focuses on expanding the data.
The framework involves pre-training models on subsets of the WebLI-100B dataset of different sizes (1B, 10B, and 100B pairs) to analyze the effects of data scaling, as sketched below.
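A rough, schematic picture of such a fixed-compute comparison: identical models are trained on nested subsets while the total number of examples processed stays constant, so smaller subsets are simply revisited more often. The subset sizes are scaled down here so the sketch actually runs, and every function is a placeholder, not the actual training pipeline.

```python
# Schematic data-scaling experiment: train a fresh model on each nested subset
# while holding the compute budget (total examples seen) fixed.
# Real subset sizes are 1B / 10B / 100B image-text pairs; tiny stand-ins are used here.
import itertools

SUBSET_SIZES = [1_000, 10_000, 100_000]   # stand-ins for 1B, 10B, 100B
COMPUTE_BUDGET = 100_000                  # total examples processed per run (fixed)
BATCH_SIZE = 100

def stream_pairs(subset_size):
    """Endless stream of (image, caption) pairs drawn from a subset of the data."""
    i = 0
    while True:
        yield (f"img_{i % subset_size}", f"caption_{i % subset_size}")
        i += 1

def train_step(model_state, batch):
    """Placeholder for one optimization step (forward pass, loss, update)."""
    model_state["examples_seen"] = model_state.get("examples_seen", 0) + len(batch)
    return model_state

for subset_size in SUBSET_SIZES:
    model_state = {}                      # fresh model for every run
    pairs = stream_pairs(subset_size)
    for _ in range(COMPUTE_BUDGET // BATCH_SIZE):
        batch = list(itertools.islice(pairs, BATCH_SIZE))
        model_state = train_step(model_state, batch)
    # Each trained model would then be evaluated on Western-centric,
    # cultural-diversity, and multilingual benchmarks for comparison.
    print(subset_size, model_state["examples_seen"])
```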
Models trained on the full dataset outperform models trained on smaller subsets on cultural and multilingual tasks, even when using the same computational resources. Because the dataset is not aggressively filtered, it retains a broad representation of linguistic and cultural elements, making it more inclusive.
The results show that increasing the dataset size from 10B to 100B pairs has little impact on Western-centric benchmarks, but brings improvements on cultural diversity tasks and low-resource language retrieval.