Synthetic data is a dangerous teacher

In April 2022, when Dall-E, a text-to-image visiolinguistic model, was released, it reportedly attracted more than a million users within the first three months. This was followed by ChatGPT, in January 2023, which apparently reached 100 million monthly active users just two months after launch. Both mark notable moments in the development of generative AI, which in turn has led to an explosion of AI-generated content on the web. The bad news is that, in 2024, this means we will also see an explosion of fabricated and meaningless information, misinformation and disinformation, and the exacerbation of negative social stereotypes encoded in these AI models.

The AI revolution was not driven by any recent theoretical advance (in fact, most of the fundamental work underlying artificial neural networks has existed for decades), but by the “availability” of massive data sets. Ideally, an AI model captures a given phenomenon (be it human language, cognition, or the visual world) in a way that represents the real phenomenon as closely as possible.

For example, for a large language model (LLM) to generate human-like text, it is important that the model is fed enormous volumes of data that somehow represent human language, interaction, and communication. The belief is that the larger the data set, the better it will capture human affairs, in all their inherent beauty, ugliness, and even cruelty. We are in an era marked by an obsession with scaling up models, data sets, and GPUs. Today’s LLMs, for example, have reached the scale of trillion-parameter machine learning models, meaning they require data sets whose size runs into the billions. Where can we find such data? On the web.

This web-sourced data is supposed to capture the “ground truth” of human communication and interaction, a proxy from which language can be modeled. Although several researchers have shown that online data sets are often of poor quality, tend to exacerbate negative stereotypes, and contain problematic content such as racial slurs and hate speech, often toward marginalized groups, this hasn’t stopped big AI companies from using that data in the race for growth.

With generative AI, this problem is about to get much worse. Instead of objectively representing the social world from input data, these models encode and amplify social stereotypes. In fact, recent work shows that generative models encode and reproduce racist and discriminatory attitudes toward historically marginalized identities, cultures, and languages.

It is difficult, if not impossible, even with state-of-the-art detection tools, to know for sure how much text, image, audio, and video data is currently being generated and at what rate. Stanford University researchers Hans Hanley and Zakir Durumeric estimate a 68 percent increase in the number of synthetic articles posted on Reddit and a 131 percent increase in misinformation news articles between January 1, 2022 and March 31, 2023. Boomy, an online music generation company, claims to have generated 14.5 million songs (or 14 percent of recorded music) so far. In 2021, Nvidia predicted that by 2030, there will be more synthetic data than real data in AI models. One thing is certain: the web is being flooded with synthetically generated data.

The worrying thing is that these large quantities of generative AI output will, in turn, be used as training material for future generative AI models. As a result, in 2024, a very significant part of the training material for generative models will be synthetic data produced by generative models. Soon, we will be stuck in a recursive loop where we train AI models using only synthetic data produced by AI models. Most of this will be contaminated with stereotypes that will continue to amplify historical and social inequalities. Unfortunately, this will also be the data we will use to train generative models applied to high-risk sectors, including medicine, therapy, education, and law. We have yet to reckon with the disastrous consequences of this. By 2024, the generative AI explosion of content that we now find so fascinating will become a huge toxic dump that will come back to haunt us.
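To make the recursive loop concrete, here is a minimal toy sketch in Python. It is not a model of any real LLM or data set: the "model" is just a table of token probabilities, and the function names, vocabulary size, sample size, and starting long-tailed distribution are all assumptions chosen for illustration. Each generation samples a synthetic corpus from the current model and then re-estimates the model from that corpus alone. Under these assumptions, a token that fails to appear in one generation's corpus gets probability zero and can never return, so the rare end of the distribution is the first thing to disappear.

```python
import random
from collections import Counter


def zipf_distribution(vocab_size):
    """Hypothetical long-tailed starting point: token k has weight 1/k."""
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def retrain_on_own_output(probs, sample_size, rng):
    """Sample a synthetic corpus from the current model, then re-estimate
    token probabilities from that corpus alone (no fresh real data)."""
    tokens = rng.choices(range(len(probs)), weights=probs, k=sample_size)
    counts = Counter(tokens)
    return [counts[t] / sample_size for t in range(len(probs))]


def simulate(vocab_size=1000, sample_size=2000, generations=10, seed=0):
    """Return how many tokens still have nonzero probability after each
    generation, starting with the original distribution at index 0."""
    rng = random.Random(seed)
    probs = zipf_distribution(vocab_size)
    support_sizes = [sum(p > 0 for p in probs)]
    for _ in range(generations):
        probs = retrain_on_own_output(probs, sample_size, rng)
        support_sizes.append(sum(p > 0 for p in probs))
    return support_sizes


if __name__ == "__main__":
    for generation, size in enumerate(simulate()):
        print(f"generation {generation}: {size} distinct tokens survive")
```

The count of surviving tokens can only stay flat or shrink from one generation to the next in this setup. Real training pipelines mix in human-written data and are far more complex, so this should be read as an intuition for the loop described above, not as a prediction.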
