OpenAI Under Fire: Controversy Surrounds AI Training Data Practices!

  • Editor
  • July 1, 2024
    Updated

Key Takeaways:

  • The Center for Investigative Reporting (CIR) has filed a lawsuit against Microsoft and OpenAI for unauthorized use of their copyrighted material.
  • CIR accuses the tech giants of scraping journalistic content to train AI models like ChatGPT without permission or compensation.
  • This lawsuit is part of a broader trend, with several media organizations taking similar legal actions against AI companies.
  • OpenAI has signed licensing agreements with some media organizations but faces challenges in acquiring sufficient training data legally.

The Center for Investigative Reporting (CIR), the non-profit organization behind Mother Jones and Reveal, has filed a lawsuit against tech giants Microsoft and OpenAI, alleging unauthorized use of their copyrighted material to train AI models. This legal action follows similar lawsuits filed by The New York Times and other media organizations.

CIR accuses OpenAI and Microsoft of scraping their journalistic content without permission or compensation to bolster the capabilities of AI products like ChatGPT.

According to CIR’s CEO Monika Bauerlein, “OpenAI and Microsoft started vacuuming up our stories to make their product more powerful, but they never asked for permission or offered compensation, unlike other organizations that license our material. This free-rider behaviour is not only unfair; it is a violation of copyright.

“The work of journalists, at CIR and everywhere, is valuable, and OpenAI and Microsoft know it.”

The lawsuit argues that this unauthorized use of their content undermines CIR’s relationships with readers and partners while depriving them of rightful revenue.

This lawsuit adds CIR to a growing list of media entities taking legal action against OpenAI and Microsoft over similar copyright concerns.

The New York Times, which has already spent $1 million on its legal battle, is also engaged in litigation, as are Alden Global Capital publications including the New York Daily News, the Chicago Tribune, and The Denver Post, along with The Intercept, Raw Story, and AlterNet.


Interestingly, some media organizations have opted for a different approach, signing licensing deals with OpenAI. These include prominent names like The Associated Press, Axel Springer, the Financial Times, Dotdash Meredith, News Corp, Vox Media, The Atlantic, and Time.

Responding to CIR’s lawsuit, an OpenAI spokesperson told CNBC, “We are working collaboratively with the news industry and partnering with global news publishers to display their content in our products like ChatGPT, including summaries, quotes, and attribution, to drive traffic back to the original articles.”

OpenAI’s nearly unrestricted use of information from the internet as training data for ChatGPT continues to be a legal problem for the company.

To train ChatGPT, OpenAI reportedly leverages publicly available data, including online books and papers, leading to demands for payment from content owners.

Prominent technology corporations such as Microsoft, Google, Anthropic, Meta, and OpenAI are in a rush to locate fresh data sources, with Meta even considering purchasing one of the largest publishing houses, Simon & Schuster, at one point.


The issue stems from publishers’ growing accusations that these businesses are scraping copyrighted material. In responses to the US Copyright Office, Meta and OpenAI contended that copyrighted content posted online is “publicly available” and that using it falls within fair use.

However, that defense has yet to be tested in court, where multiple parties are suing over the use of copyrighted content.

CIR’s attorneys accused Microsoft and OpenAI of using Mother Jones’ copyrighted material to train their GPT and Copilot AI models.

Earlier lawsuits against OpenAI include actions from several prominent newspapers owned by Alden Global Capital, such as the New York Daily News and the Chicago Tribune, which alleged intentional copyright violations.

Court documents in the Authors Guild lawsuit revealed that OpenAI deleted two significant datasets used to train GPT-3, which likely contained more than 100,000 published books.

However, the volume of content these models need in order to keep learning will require more than a handful of licensing agreements. One proposed solution is synthetic data, which is generated artificially, for example by machine learning algorithms, rather than collected from the real world.

OpenAI has considered synthetic data as an option to train its models, but CEO Sam Altman has raised concerns about producing quality data.

The CIR lawsuit highlights the ongoing conflict over AI training data and copyright infringement, emphasizing the need for clear guidelines and fair practices in the use of copyrighted material for AI development.

As legal battles continue, the industry must navigate the complex intersection of technological advancement and intellectual property rights.

Dave Andre

Editor

Digital marketing enthusiast by day, nature wanderer by dusk. Dave Andre blends two decades of AI and SaaS expertise into impactful strategies for SMEs. His weekends? Lost in books on tech trends and rejuvenating on scenic trails.
