Unveiling the Re-LAION-5B Data Set: LAION's Latest Move to Combat Illegal Content
LAION, the German research organization behind the data sets used to train various generative AI models, has released a new data set that it says has been scrubbed of all known links to suspected child sexual abuse material (CSAM). The release, Re-LAION-5B, is a reworked version of the earlier LAION-5B data set, refined with input from the Internet Watch Foundation, Human Rights Watch, the Canadian Centre for Child Protection, and the now-defunct Stanford Internet Observatory. It comes in two versions, Re-LAION-5B Research and Re-LAION-5B Research-Safe (which also removes additional NSFW content), both filtered to exclude thousands of links to known and "likely" CSAM.
In a blog post, LAION said it has been committed from the beginning to removing illegal content from its data sets and has taken proactive measures to that end. Importantly, LAION's data sets contain no images themselves; they are indexes of links to images, paired with each image's alt text, drawn from Common Crawl, a repository of crawled websites and webpages.
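For readers unfamiliar with link-based data sets, the structure can be illustrated with a toy index. The field names below are hypothetical, not LAION's actual schema; the point is that the data set stores URLs and captions, while consumers fetch the image bytes themselves:

```python
# Toy illustration of a link-based image index (hypothetical field names):
# the data set holds URLs and alt text, never the image pixels.
index = [
    {"url": "https://example.com/cat.jpg", "alt": "a cat sleeping on a sofa"},
    {"url": "https://example.com/dog.jpg", "alt": "a dog catching a frisbee"},
]

def training_captions(index):
    """The captions live in the index; the image bytes do not."""
    return [entry["alt"] for entry in index]

captions = training_captions(index)
```

This separation is why "removing" content from such a data set means deleting link entries, not deleting images from the web.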
The release of Re-LAION-5B follows an investigation by the Stanford Internet Observatory in December 2023, which found that LAION-5B, in particular a subset called LAION-5B 400M, included links to illegal images scraped from social media platforms and adult websites. The report also found that the subset contained links to other problematic content, including pornographic imagery, racist language, and harmful social stereotypes.
In response, LAION temporarily took LAION-5B offline. The Stanford report recommended that models trained on LAION-5B be deprecated and their distribution halted where possible. Around the same time, AI startup Runway removed its Stable Diffusion 1.5 model from the AI hosting platform Hugging Face.
The new Re-LAION-5B data set, which contains roughly 5.5 billion text-image pairs and is released under an Apache 2.0 license, also includes metadata that third parties can use to purge the same illicit links from existing copies of LAION-5B. LAION stresses that its data sets are intended for research, but organizations including Stability AI and, in the past, Google have used them to train commercial models.
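LAION has not published its exact cleanup procedure here, but in outline, removing flagged links from an existing copy amounts to a set-difference over the index. A minimal sketch, assuming a removal list of URL hashes (field names, file layout, and the use of plain URL hashes are all illustrative assumptions; real removal lists from child-safety organizations typically distribute hashes precisely so the underlying material never has to be shared):

```python
import hashlib

def clean_index(records, flagged_url_hashes):
    """Keep only records whose hashed URL is absent from the removal list."""
    kept = []
    for rec in records:
        digest = hashlib.sha256(rec["url"].encode("utf-8")).hexdigest()
        if digest not in flagged_url_hashes:
            kept.append(rec)
    return kept

# Hypothetical index entries and a removal list with one flagged URL.
records = [
    {"url": "https://example.com/ok.jpg", "alt": "a mountain at sunrise"},
    {"url": "https://example.com/bad.jpg", "alt": "a flagged entry"},
]
flagged = {hashlib.sha256(b"https://example.com/bad.jpg").hexdigest()}

cleaned = clean_index(records, flagged)  # only the first record survives
```

Distributing hashes rather than the flagged URLs themselves lets holders of LAION-5B copies filter their data without anyone redistributing links to the material being removed.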
LAION's latest release marks a significant step toward rooting illegal content out of AI training data. By publishing Re-LAION-5B, LAION hopes to push research labs and other organizations to migrate from the older data sets to the cleaned version, improving both the legality of the data and the integrity of the models trained on it.