AI-Generated Data Training AI Systems

The demand for large amounts of data to train AI systems is well known, and companies often rely on gig workers to complete tasks that are difficult to automate, such as labeling data and solving CAPTCHAs. However, these workers are sometimes underpaid and are expected to complete their tasks very quickly.

As a result, some of them are turning to AI tools like ChatGPT to increase their earnings.

This trend has caught the attention of researchers, who are now studying the implications of using AI-generated data to train AI models.

The Study

Researchers from the Swiss Federal Institute of Technology in Lausanne (EPFL) recently conducted a study on Amazon Mechanical Turk to estimate how many gig workers use AI models.

They hired 44 workers to summarize excerpts from medical research papers and analyzed their responses using an AI model they had trained. The researchers looked for signs of ChatGPT output, such as repetitive word choices, and also examined the workers' keystrokes to detect potential copying and pasting.

The study estimated that between 33% and 46% of the workers had used AI models like ChatGPT.
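To put that range in concrete terms, the short sketch below converts the reported percentages into an approximate head count for the 44 workers in the study. The function name `worker_range` is made up for illustration, and rounding to whole workers is an assumption; the study reports only the percentage range.

```python
def worker_range(total_workers, low_share, high_share):
    """Convert a share range into an approximate head-count range.

    Rounding to the nearest whole worker is an illustrative assumption.
    """
    low = round(total_workers * low_share)
    high = round(total_workers * high_share)
    return low, high


# 44 workers and the 33%-46% range come from the article.
low, high = worker_range(44, 0.33, 0.46)
print(f"Roughly {low} to {high} of 44 workers")
```

With a sample this small, each worker accounts for more than two percentage points, which is one reason the estimate is given as a range rather than a single figure.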

The share of workers using AI is expected to rise further as AI systems become more powerful and accessible. However, using AI-generated data to train AI models introduces additional errors into already flawed systems. Large language models often generate false information, and if this incorrect output is used to train other AI models, the errors can compound over time.
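The compounding effect can be illustrated with a deliberately simple toy model. It is not taken from the study: it assumes that each round of training on the previous model's output corrupts a fixed share of the data that was still correct and never repairs earlier errors. The 5% error rate and the function name `correct_fraction` are hypothetical.

```python
def correct_fraction(error_rate, generations):
    """Share of data still correct after a number of retraining rounds.

    Toy assumption: every round independently corrupts `error_rate` of the
    data that was still correct, and no earlier error is ever fixed.
    """
    return (1.0 - error_rate) ** generations


# Hypothetical 5% error rate per round, shown for rounds 0 through 5.
for generation in range(6):
    share = correct_fraction(0.05, generation)
    print(f"round {generation}: {share:.1%} of data still correct")
```

Under these assumptions, even a small per-round error rate steadily erodes data quality, because each round starts from the already degraded output of the one before it.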

The Consequences Of Using AI-Generated Data

Ilia Shumailov, a junior research fellow at Oxford University, warns that using artificial data can lead to statistical errors and biases in the output of other models. This raises significant concerns about the accuracy and reliability of AI systems.

The study mentioned above underscores the need for new methods to determine whether data has been produced by humans or AI. It also raises questions about the reliance of tech companies on gig workers to clean up data for AI systems.

And while the rise of AI-generated data does not spell the end of crowdsourcing platforms, it certainly changes the dynamics.

The AI community must closely examine which tasks are most vulnerable to automation and develop ways to prevent workers from automating them. Addressing these issues is crucial to ensuring the accuracy and reliability of AI systems in the future.