AI Threat: 80% DNPA members restrict use of crawlers on websites to train AI models

Digital news publishers the world over and in India are jumping into action to safeguard their content against powerful web-crawlers like OpenAI’s GPTBot, which collects data from websites to train its AI models.

By Aashrey Baliga | September 14, 2023, 9:43 am
While ML, Python and SQL continue to dominate current skills requirements in AI; GitHub, PyTorch and Databricks are also beginning to emerge as important skills. (Representative image by Andrea De Santis via Unsplash)

The scale and depth of the impact of artificial intelligence systems, and of the companies behind them, are becoming more apparent by the day. While AI’s impact is being felt across industries, it is particularly acute for the news publishing industry. News publishers across the world are moving to protect their content against generative AI tools like ChatGPT, whose makers use powerful web crawlers such as GPTBot to collect that content for training LLMs (large language models).

In August, Microsoft-backed OpenAI announced that website owners can now block its GPTBot from accessing their sites’ content. Bloomberg, CNN and The New York Times are among hundreds of publishers that have shut out OpenAI’s web crawlers. Major media companies in India have also restricted or shut off access to crawlers. The list includes top publishers like The Indian Express group, India TV, HT Group and The Hindu, among others.
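For readers curious about the mechanics: this kind of block is typically expressed in a site's robots.txt file, where a rule addressed to the `GPTBot` user agent disallows some or all paths. The sketch below is illustrative only. The robots.txt text, the `is_blocked` helper, the `SomeOtherBot` agent and the example.com URL are assumptions made for the example, not any publisher's actual configuration. It uses Python's standard `urllib.robotparser` module to check whether a given crawler would be allowed to fetch a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot site-wide, leaves other crawlers alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules forbid user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story"  # placeholder URL
    print("GPTBot blocked:", is_blocked(ROBOTS_TXT, "GPTBot", page))
    print("SomeOtherBot blocked:", is_blocked(ROBOTS_TXT, "SomeOtherBot", page))
```

Note that robots.txt is a voluntary convention: it works only to the extent that a crawler chooses to honour it, which is one reason publishers are also pursuing contractual and regulatory protections.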

These Digital News Publishers Association (DNPA) members have already restricted access to OpenAI. DNPA represents leading news publishers in India such as India Today Group, HT Group, Times Group, DB Corp, Dainik Jagran, Amar Ujala, Hindustan Times, Zee Media, ABP Network, Lokmat, NDTV, New Indian Express, Mathrubhumi, The Hindu, and Network18.

Storyboard18 reached out to Sujata Gupta, secretary general at DNPA, to learn about the association’s recent communications with the Ministry of Information & Broadcasting (MIB) and the Ministry of Electronics and Information Technology (MeitY), and about the kind of help publishers are seeking to protect themselves against AI and safeguard the future of the industry.

Gupta said it is still very early days, as all stakeholders are only beginning to understand who is crawling what data. A few countries have already taken steps to protect their content and rights. “In India too, correct measures and the necessary steps have been made by our publishers. Around 70 to 80 percent of publishers of the DNPA have already restricted the use of crawlers on websites to train AI models,” said Gupta.

“We knew AI was coming. We knew changes had to be made. MIB and MeitY are aware of the concerns,” said Gupta, who also acknowledged the Indian government’s proactive approach to both the opportunity and the threat that AI presents.

Gupta added that the government is working on the issue, and that the new Digital India Act (DIA) should factor in all these changes and address both revenue and copyright for publishers. “This is an established fact that this concern needs to be addressed and we are very hopeful that the Digital India Act will cater to all of this.”

Australia has also taken credible steps in this direction, reopening the Treasury Bill to account for technological advancements in AI. Canada and the EU have made similar provisions. “India too is very proactive. We are not sitting and waiting but the West is ahead. They have already taken the necessary action,” added Gupta.

Union Minister of State for Electronics and Information Technology, Rajeev Chandrasekhar, stated this week that the draft of the Digital India Act is ready and will be released soon. This move comes after India’s successful showcase of its digital infrastructure at the G20 Summit, where it garnered global attention.

Copyright infringements

At the heart of the issue for the news industry are LLMs (large language models), a type of AI trained on published news content available on the internet. It is anticipated that LLMs may go on to transform the media industry.

Global news organisations have written an open letter directed at regulators and AI-focused tech companies. The letter called on lawmakers to frame rules and regulations to safeguard copyright in the use of news content to train generative AI models.

These organisations further sought compensation for publishers from tech giants for the AI-centric use of their published news content. The open letter was signed by Getty Images, Agence France-Presse, Associated Press, European Publishers’ Council, Gannett, Authors Guild, European Pressphoto Agency, National Press Photographers Association, News Media Alliance, and National Writers Union.

Big Tech firms like Google and Meta are investing heavily in developing LLMs to build their own generative AI tools, alarming news organisations across the world, including in India. Reports suggest that Hindi news brands like Dainik Bhaskar and Amar Ujala have shut off “AI and tech firms” from feeding on their content to train LLMs.

The updated terms and conditions of Dainik Bhaskar’s website read, “All materials published or available on the Services are protected by copyright, and owned or controlled by DBCL solely or in association with third parties or with such other parties who are given credit as the provider of the Content. Non-commercial use of the Service shall also include the use of Content only upon obtaining prior written consent from DBCL in connection with: (1) the development of any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system or large language model (LLM); or (2) providing archived or cached data sets containing Content to another person or entity.”
