Now there’s a growing fight between AI crawlers and the website owners who hate them, and those owners are using aggressive technology and financial demands to punch back. This battle will help decide whether there’s room for both AI and the websites that you rely on.
Why websites object to AI crawlers
Bots have been an internet fixture for decades.
Google’s crawlers regularly grab parts of websites to organise the information into its search results. The Internet Archive’s crawlers save snapshots of websites over time to catalogue internet history.
Website owners have beefs with those automated programmes, but the relationship with Google’s crawlers in particular is generally considered mutually beneficial.
Google crawls and catalogues websites to feed its search engine, and in return websites are found by the billions of people using Google search.
But experts say that AI crawling – which has exploded since the 2022 public debut of ChatGPT kicked off an AI boom – is more problematic in two ways.
Firstly, some websites doubt that they will benefit from AI crawlers grabbing their information to “train” AI or to reply to people’s chatbot questions.
Secondly, many websites say AI companies’ crawlers act like unpredictable greedy jerks in ways that can break websites or drive up their costs.
Michael Weinberg, co-director of the GLAM-E Lab that works with museums, academic archives, and other cultural institutions, said traditional crawlers like those from Google search typically sip doses of website information at fairly regular intervals and blend in with human users.
By contrast, AI crawlers might gobble up a bunch of text, images and videos from a website, all within minutes or hours.
As a result, some cultural organisations have suddenly found their websites straining or busted because of AI crawler swarms, Weinberg detailed in a June report.
The University of North Carolina at Chapel Hill, for example, recently said AI crawlers drove five times the usual number of simultaneous searches of its online library catalogue, “overloading the system and triggering glitches”.
Even one of the world’s most popular websites, Wikipedia, said in April that a huge surge of visits from AI crawlers forced the site to spend more money and scramble to remain online for users.
“The sheer amount of traffic generated by crawlers causes a strain on the underlying infrastructure that keeps our sites available for everyone,” said a spokesperson for the Wikimedia Foundation, the non-profit that oversees Wikipedia.
AI crawlers vs crawler blockers
Eric Holscher, co-founder of Read the Docs, an online project for software developers, echoed many other website owners in saying that his biggest concerns about AI crawlers are fairness and survival.
Holscher, Wikipedia, and other website publishers say that when chatbots spit out AI-generated replies or improve their computer systems with the websites’ information but don’t link to them as Google search does, it deprives sites of valuable brand recognition or online visits that are their lifeblood.
“If the data is just used for [AI] training or summarisation in answers, there is no way to sustain the publisher if they don’t get traffic,” Holscher said by email.
TollBit’s chief executive said the sports website that had 13 million monthly AI crawler visits also had 15 million Google search crawler visits in the same month.
But he said millions of people found the sports website as a result of the Google crawlers, compared with 600 from the AI crawlers.
Some websites maintain lists of unwelcome crawlers, but not all AI crawlers respect those keep-out notices.
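For readers curious how such a keep-out notice works in practice, it is commonly published as a plain-text file called robots.txt, which a well-behaved crawler consults before fetching a page. The sketch below uses Python's standard robots.txt parser to show that check. The crawler names, paths and URLs are made up for illustration and do not come from any website mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: "ExampleAIBot" and the paths below are
# illustrative assumptions, not rules taken from any real website.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the robots.txt rules permit this crawler to fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.org/articles/1"
    print("ExampleAIBot allowed:", is_allowed("ExampleAIBot", page))
    print("SearchBot allowed:", is_allowed("SearchBot", page))
```

Note that this check runs on the crawler's side. Nothing in the file forces a crawler to look at it, which is why a crawler that chooses to ignore the notice can simply carry on.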
The battle with AI crawlers has reached such a breaking point that more websites are using technology to block or confuse AI crawlers.
Some AI companies have also agreed to pay websites for AI activity. (The Washington Post has a content partnership with ChatGPT owner OpenAI.)
Cloudflare, which helps millions of websites manage their online traffic, said yesterday that it can now automatically block or limit AI crawlers for its website customers.
Cloudflare and TollBit also let websites erect AI-only paywalls that demand the crawlers pay or get out.
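As a rough illustration of the pay-or-get-out idea, a server could inspect each visitor's self-reported user agent and answer recognised AI crawlers with the web's standard "402 Payment Required" status unless they have paid. This is a minimal sketch under stated assumptions, not a description of how Cloudflare or TollBit actually build their products. The crawler names and the `has_paid` flag are hypothetical.

```python
from http import HTTPStatus

# Hypothetical crawler identifiers; a real deployment would rely on a
# maintained list and stronger verification than the user-agent string.
AI_CRAWLER_TOKENS = ("exampleaibot", "samplegptcrawler")


def decide_response(user_agent: str, has_paid: bool = False) -> HTTPStatus:
    """Pick an HTTP status for a request based on who appears to be asking."""
    ua = user_agent.lower()
    looks_like_ai_crawler = any(token in ua for token in AI_CRAWLER_TOKENS)
    if looks_like_ai_crawler and not has_paid:
        # Assumed policy: unpaid AI crawlers are asked to pay.
        return HTTPStatus.PAYMENT_REQUIRED
    # Human visitors, other bots and paying crawlers get the page.
    return HTTPStatus.OK


if __name__ == "__main__":
    print(decide_response("Mozilla/5.0 (Windows NT 10.0)"))
    print(decide_response("ExampleAIBot/1.0"))
    print(decide_response("ExampleAIBot/1.0", has_paid=True))
```

Because a user-agent string is easy to fake, a check like this only stops crawlers that identify themselves honestly, the same weakness that affects keep-out lists.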
Some website owners and AI backers say the crawler hatred has gone too far.
Rich Skrenta, executive director of the Common Crawl Foundation, which oversees an open repository of crawled website information for AI and other uses, said it will take time and collaborative effort to figure out how websites can benefit from helping fuel AI chatbots and systems.
Websites may regret blocking AI crawlers instead of experimenting with how to earn money from people using AI as a new type of web search, he said.
But people behind online information and entertainment say that something must change now with AI crawlers.
“If publishers want to thrive, we have to find a solution that is mutually beneficial to both sides,” Panigrahi said.