Dozens of large companies, including Amazon and The New York Times, have rushed to block GPTBot, a tool that OpenAI recently announced it was using to crawl the web for data to feed its popular chatbot, ChatGPT.
As of this week, 70 of the world's top 1,000 websites have moved to block GPTBot, the web crawler OpenAI revealed two weeks ago that it uses to collect massive amounts of information from the internet to train ChatGPT. Originality.ai, a company that checks content to see whether it's AI-generated or plagiarized, conducted an analysis that found more than 15% of the 100 most popular websites have decided to block GPTBot in the past two weeks.
The six largest websites now blocking the bot are amazon.com (along with several of its international counterparts), nytimes.com, cnn.com, wikihow.com, shutterstock.com, and quora.com.
The top 100 sites blocking GPTBot include bloomberg.com, scribd.com, and reuters.com, as well as insider.com and businessinsider.com. Among the top 1,000 sites blocking the bot are ikea.com, airbnb.com, nextdoor.com, nymag.com, theatlantic.com, axios.com, usmagazine.com, lonelyplanet.com, and coursera.org.
“GPTBot launched 14 days ago and the percentage of Top 1,000 sites blocking it has been steadily increasing,” the analysis said.
How these websites block GPTBot is relatively simple, even crude, depending on your perspective: each site hosts a file called robots.txt, and GPTBot has been added to that file's “disallow” list.
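An opt-out entry typically takes the form below, naming the GPTBot user agent and using “Disallow: /” to cover the entire site; a site could instead disallow only specific paths.

```
User-agent: GPTBot
Disallow: /
```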
Robots.txt is a tool created in the 1990s meant to stop web crawlers, such as Google's or Bing's search bots, from extracting data and information from a website. When it revealed the crawler, OpenAI said it would abide by robots.txt and that GPTBot would not crawl websites that disallow it.
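Anyone can check whether a particular site has opted out by reading its robots.txt file. The sketch below shows one way to run that check with Python's standard urllib.robotparser module; the site URL here is purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# Illustrative only: check whether a site's robots.txt disallows GPTBot.
site = "https://www.example.com"  # hypothetical domain; swap in any site

parser = RobotFileParser()
parser.set_url(f"{site}/robots.txt")
parser.read()  # fetches and parses the site's robots.txt

# can_fetch() returns False when the named user agent is barred
# from crawling the given URL.
if parser.can_fetch("GPTBot", f"{site}/"):
    print(f"{site} does not block GPTBot")
else:
    print(f"{site} blocks GPTBot")
```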
Much of what is available on the internet, particularly text and images, is technically under copyright. Crawlers like GPTBot do not ask permission, obtain licenses, or pay to use the data and information they extract. The only way to avoid them at this point is through robots.txt, although companies that deploy crawlers are not legally bound to honor robots.txt restrictions.
Awareness of copyright rules, and of who owns the data these crawlers take to train AI projects built on large language models, or LLMs, has grown as tools like ChatGPT have exploded onto the tech scene. Several lawsuits are already in the works. The author Stephen King, after learning his books had been used in AI training sets, said he's looking to the future with a “certain dreadful fascination.”
For its part, OpenAI has tried to hide that ChatGPT was trained on any copyrighted material.
A representative of OpenAI could not be immediately reached for comment.
See below for the full list of the biggest websites that blocked GPTBot between August 8 and August 22:
- amazon.com
- quora.com
- nytimes.com
- shutterstock.com
- wikihow.com
- cnn.com
- foursquare.com
- healthline.com
- scribd.com
- businessinsider.com
- reuters.com
- medicalnewstoday.com
- amazon.co.uk
- insider.com
- yourdictionary.com
- slideshare.net
- amazon.de
- bloomberg.com
- amazon.in
- studocu.com
- ikea.com
- uol.com.br
- amazon.fr
- geeksforgeeks.org
- pcmag.com
- theverge.com
- nextdoor.com
- amazon.ca
- amazon.co.jp
- airbnb.com
- vulture.com
- polygon.com
- prnewswire.com
- mashable.com
- nymag.com
- detik.com
- theatlantic.com
- trulia.com
- amazon.es
- eater.com
- picclick.com
- bustle.com
- etymonline.com
- teacherspayteachers.com
- archiveofourown.org
- vox.com
- kumparan.com
- theathletic.com
- amazon.it
- alltrails.com
- thrillist.com
- amazon.com.br
- usmagazine.com
- pikiran-rakyat.com
- city-data.com
- hellomagazine.com
- stern.de
- chicagotribune.com
- spanishdict.com
- lonelyplanet.com
- inverse.com
- actu.fr
- fool.com
- coursera.org
- france24.com
- myfitnesspal.com
- dotesports.com
- theglobeandmail.com
- axios.com
Originally published on BusinessInsider.com