Current language-vision models have outperformed traditional approaches and overcome numerous limitations, mostly thanks to training methods based on large-scale datasets from online sources. Unfortunately, this research has mainly been conducted by industry labs, and the datasets have remained private until now. This award-winning work by JSC and LAION e.V. provides open datasets, built from public internet resources, that can be used to train state-of-the-art language-vision models and are accessible to all research labs around the world. Toolsets for dataset composition and pre-trained openCLIP models are also open-sourced as results of this work.
Open-vocabulary language-vision models are trained algorithms that can match images with text, for example identifying objects in a picture or finding photos that contain a specific item, given a text query with the corresponding keywords. Traditionally, developing a vision model demanded great effort from scientists: the model needed tailored, supervised training for each task before it could complete that task successfully. Even then, models worked only for that one task and could lose performance under dataset distribution shifts. Lately, however, researchers have created new language-vision approaches like CLIP or Imagen that not only overcome these limitations but also outperform traditional methods and open new doors in the field.
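The matching of images with text described above can be illustrated with a minimal sketch. It assumes that an image and several candidate captions have already been mapped to vectors in a shared embedding space, which is how CLIP-style models operate; the three-dimensional vectors, the captions and the function names below are made up for illustration and do not come from any real model or library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_caption(image_vec, caption_vecs):
    """Return the caption whose embedding is closest to the image embedding."""
    scores = {text: cosine_similarity(image_vec, vec)
              for text, vec in caption_vecs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy embeddings (assumption: a real model would produce
# vectors with hundreds of dimensions from its text and image encoders).
caption_vecs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_vec = [0.8, 0.2, 0.1]  # hypothetical embedding of a dog picture

label, scores = best_caption(image_vec, caption_vecs)
print(label)
```

Because the candidate captions are free text, the set of categories can be changed at query time without retraining, which is what "open-vocabulary" refers to. The same similarity scores can be used in the other direction, ranking many images against a single text query.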
These approaches use self-supervised learning, which does not need human curation of data before training and can use data obtained from the public internet in very large quantities instead of relying on costly human-made labels. Models trained in this way work successfully even on scenarios outside of their training data (strong zero-shot transfer to novel settings) and show remarkable robustness when the data changes (data distribution shifts), unmatched by previous deep learning methods. However, most research in this field has been performed by large industry labs on closed, private datasets, and is therefore often not accessible to the broad research community. That is where the international collaboration of LAION e.V. and Jülich Supercomputing Centre (JSC) researchers steps in.
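How image-text pairs can serve as their own training signal, without human-made labels, can be sketched with the symmetric contrastive objective used by CLIP-style models. The sketch below is a simplified illustration, not the training code used in this work: it takes a ready-made matrix of image-text similarities for a small batch, and the temperature value of 0.07 and all names are assumptions chosen for the example.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of numbers."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(sim, temperature=0.07):
    """Symmetric cross-entropy over a square similarity matrix.

    sim[i][j] is the similarity between image i and text j. The only
    supervision is that image i and text i were found together, so the
    diagonal entries are the positives and everything else is a negative.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image-to-text direction: each row should peak on its diagonal entry.
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same, applied to the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Hypothetical similarities for a batch of three image-text pairs.
aligned = [[1.0, 0.1, 0.0],
           [0.1, 1.0, 0.2],
           [0.0, 0.2, 1.0]]
# Same batch, but the first two images prefer each other's captions.
mismatched = [[0.1, 1.0, 0.0],
              [1.0, 0.1, 0.2],
              [0.0, 0.2, 1.0]]

print(contrastive_loss(aligned) < contrastive_loss(mismatched))
```

Minimising this loss pushes each image towards its own caption and away from the other captions in the batch, so the pairing found on the web is the only annotation required.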
Together, LAION e.V. and JSC have created LAION-5B, a large-scale open dataset of 5.8 billion image-text pairs obtained from Common Crawl, an open web repository. LAION-5B has proven good enough to train strong state-of-the-art models without the need for human curation. The work on composing LAION-5B, and on validating it by training CLIP models of various scales on JUWELS Booster, the supercomputer installed at JSC, culminated in the Outstanding Paper Award at NeurIPS 2022, one of the leading international conferences for machine learning and artificial intelligence, which took place in New Orleans from 27 November to 4 December.
Dr. Jenia Jitsev, co-founder and scientific lead at LAION e.V., co-led the collaborative effort, working together with his postdoc Dr. Mehdi Cherti at the Scalable Learning & Multi-Purpose AI Lab at JSC. He is delighted to receive such high recognition from the broad research community. "It is very rewarding to see how such a grand cooperation between many research labs, independent researchers, citizen scientists and industry bears such fruits, opening wide range of further perspectives for transparently studying together important phenomena in large, strongly transferable models, now accessible to academic research labs across the world and not any more restricted to few industry labs with large resources."
Along with LAION and JSC, many institutions made this study possible: UC Berkeley, TU Darmstadt, TU Munich, University of Washington, Allen AI Institute, EleutherAI, HuggingFace and Stability AI. Pivotal for these experiments were JUWELS Booster, the supercomputer hosted by JSC, and the compute time granted by the Gauss Centre for Supercomputing.
This collaborative effort made it clear to all participants that further breakthroughs at the frontiers of machine learning and AI will require further advances in handling model training at very large scales. While the results of this work are already being put to use by numerous research labs across the world, JSC is preparing to host the first European exascale machine, JUPITER. The international collaboration partners involved in the awarded work expect to use the upcoming opportunities for further groundbreaking large-scale research at the site in Jülich.