AI-Generated Spam Overwhelms Australia’s Internet Archive

  • Editor
  • June 26, 2024 (updated)

The National Library of Australia (NLA) has been archiving the Australian internet since the 1990s to help Australians understand their social, cultural, and intellectual history. The archive began with a select group of websites, including curated collections of news outlets like Crikey.

Since 2004, the NLA has expanded this effort, using automated web crawlers to collect annual snapshots of websites on the .au domain so that people can search them and observe changes over the years.

With the advent of generative AI tools like OpenAI’s ChatGPT, there has been a significant increase in the creation of text and other types of content.

This technology has been used to generate large volumes of spam and low-quality content across various platforms, including social media, e-commerce sites like Amazon, and other internet resources traditionally written by humans.

Researchers say a “shocking” amount of the web now appears to be produced by AI, although AI-generated language remains difficult to detect.

The NLA’s Australian Web Archive (AWA) has also been affected. Searches on NLA’s online research platform Trove yield hundreds of hits for phrases like “As an AI language model” and “as of my last knowledge update,” which are commonly produced by ChatGPT but not typically by humans.
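The kind of phrase search described above can be sketched in a few lines of Python. This is only an illustration, not how Trove's search works: the function name and the sample text are hypothetical, and the phrase list contains just the two examples quoted in the article. A match is a reason to look more closely, not proof of AI authorship, since pages written by humans can quote the same phrases.

```python
import re

# The two telltale phrases quoted in the article; a real search would need more.
TELLTALE_PHRASES = (
    "as an ai language model",
    "as of my last knowledge update",
)


def find_telltale_phrases(text: str) -> list[str]:
    """Return the telltale phrases found in text, ignoring case and extra whitespace."""
    # Collapse runs of whitespace so phrases split across lines still match.
    normalized = re.sub(r"\s+", " ", text).lower()
    return [phrase for phrase in TELLTALE_PHRASES if phrase in normalized]


# Hypothetical page text, invented purely for illustration.
sample_page = "As an AI language model, I cannot recommend a specific pickup truck."
print(find_telltale_phrases(sample_page))
```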

Some of these archived pages were written by humans and include disclosed excerpts of ChatGPT output, such as a page from an Australian IT company showcasing AI chatbot capabilities.

However, many instances involve undeclared, spammy Australian websites automatically generated by ChatGPT, often featuring incongruous content, like an article titled “Do You Need A Pickup Truck” on a wedding planner’s website.

The inclusion of AI-generated content in the 2024 snapshot of the Australian internet raises concerns about how it will affect future understanding of Australian culture.

Earlier this year, tech outlet 404 Media reported that entirely AI-generated books were being indexed by Google Books, potentially affecting language research based on the platform’s data.

An NLA spokesperson acknowledged that Trove is now capturing AI content.

“Our role is to comprehensively collect these publications without making any judgment on content,” they told Crikey in an email.

“The AWA is made available for researchers to interrogate according to their own standards and requirements. In the event that there are complaints relating to copyright, privacy or defamation, the National Library has a comprehensive takedown policy in place,” they said.

Kieran Hegarty, a research fellow at the ARC Centre of Excellence for Automated Decision-Making and Society, highlighted that this issue extends beyond the NLA’s archive.

He said perceptions of library collections need to change: archiving should be viewed as capturing a comprehensive record of public output, including books and web content, rather than as a marker of value or prestige.

While incorporating AI-generated content is part of the NLA’s archiving scope, it poses logistical challenges due to the increasing size of each year’s snapshot. For instance, in 1995, the NLA captured 5,150 web pages, requiring 261 megabytes of storage.

By 2022, the snapshot included over two trillion web pages and consumed 151 terabytes of storage, more than 577,000 times the storage used by the 1995 capture.
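The storage comparison above can be checked with a short calculation. This is a minimal sketch that uses only the figures reported in the article; the constant and function names are illustrative, and it assumes decimal units (1 terabyte = 1,000,000 megabytes), which the article does not specify.

```python
# Storage figures as reported in the article.
SIZE_1995_MB = 261   # megabytes captured in 1995
SIZE_2022_TB = 151   # terabytes captured in 2022

# Assumption: decimal units. The article does not say which convention it uses.
MB_PER_TB = 1_000_000


def storage_growth_ratio(early_mb: float, late_tb: float, mb_per_tb: int = MB_PER_TB) -> float:
    """Return how many times larger the later snapshot is than the earlier one."""
    return (late_tb * mb_per_tb) / early_mb


ratio = storage_growth_ratio(SIZE_1995_MB, SIZE_2022_TB)
print(f"The 2022 snapshot is about {ratio:,.0f} times the size of the 1995 capture")
```

Under the decimal assumption the ratio comes to roughly 578,500, consistent with the "more than 577,000 times" figure. Using binary units (1,048,576 megabytes per terabyte) would give a somewhat larger ratio.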

Hegarty noted that if the influx of AI-produced content continues, the NLA might need to reconsider its archiving strategies to manage costs, especially given its past funding struggles.

“What we’re talking about is an explosion of [AI] content as part of capturing a snapshot of Australia. These are some very hard decisions,” he said.

The challenge of managing AI-generated content is further complicated by the legal and ethical considerations of archiving such material. The NLA must balance the need to preserve a comprehensive historical record with the practicalities of storage and the potential biases introduced by AI content.

This ongoing issue underscores the broader implications of AI on digital preservation and the importance of developing strategies to address the evolving landscape of internet content.

Dave Andre

Editor
