Reddit, Google, and the Real Cost of AI Data Fever

Photo Illustration: Intelligencer; Photo: Getty Images

As of last week, the only major search engine that still includes Reddit is Google. You won’t find the latest Reddit links on Microsoft’s Bing, or useful results from the platform on the privacy-focused DuckDuckGo. For most people, this change won’t make much of a difference in the short term—after all, most people use Google, which has about 90 percent of the global search market share, and they can still visit Reddit directly—but it’s a strange change. Reddit is a platform with deep ties to the web, which started life as a link aggregator and grew into an online community with more than 80 million daily active users (the more, the better). Why, according to a report in 404 Media, is it suddenly closing its doors?

As is often the case when tech companies behave strangely or unpredictably these days, the answer has something to do with artificial intelligence. Earlier this year, Google struck a deal with Reddit, worth an estimated $60 million per year, to license the site’s data to train AI models. In recent months, Google watchers Also noticed an increase in the number of Reddit posts appearing in search results, with user comments ranking high across a wide range of queries. While these stories are related, they’re also somewhat independent: Google, like others in the AI space, wants to license data so it can build better models and avoid lawsuits; Google users were already adding “Reddit” to search queries as a kind of search quality trick, so the company sort of followed suit.

It’s where the two stories overlap that things start to go wrong. Search engines gather relevant and timely information by crawling the web with bots, indexing what they find, and ranking it based on whether and how it’s relevant to user searches. Sites have some control over whether and how that crawling happens, and there are various reasons why they might not want to crawl parts of their entire sites (an individual might want to keep an old blog online but unsearchable, while a company like Facebook might want Google users to be able to find profiles but not search their content). But for decades, crawling was part of a simple, mutually beneficial arrangement. Search engines collected lots of users by providing a useful service; sites allowed and even adapted to crawling by search engines in order to connect with that audience.

In the last few years, however, indexing has taken on an additional purpose. The robots that crawl your site and read all your data are not just building a search index. They can also build an AI model. For many sites, this is very NO part of the deal. As David Pierce writes in The Verge, the sudden pivot from search index to AI training means that “the basic social contract of the web is unraveling.” A mutually beneficial arrangement is being replaced by an extractive one, fueled by frantic and unilateral actions by both startups and tech giants.

At first, the fallout from this collapse was relatively limited, with large sites and platforms—owned by big companies like Facebook and Amazon—clearly blocking crawlers from companies like OpenAI. That clarity didn’t last long. Google is all in on AI. Bing is owned by Microsoft, OpenAI’s largest investor and partner. Suddenly, all search companies were AI companies, and new crawlers were thrown into the mix. All indexing was AI indexing—it would be naive to assume otherwise. The sudden toll was obvious to anyone paying attention to traffic statistics. The bots were harvesting everything they could.

In response to the 404 Media reports, critics have argued that Google—a company that has otherwise seemed terrified of past and potential antitrust violations—is nonetheless buying an unfair advantage in exchange for a product that already has a near-monopoly. And they’re right: Without Reddit, one of the largest repositories of authentic human text on the internet, smaller search engines can’t compete.

But the story wouldn’t be complete without Reddit, which is the company actually doing the blocking. (Microsoft, for its part, confirmed that its crawlers were blocked.) As a Reddit spokesperson told The Verge:

This is completely unrelated to our recent partnership with Google. We have been in talks with many search engines. We have not been able to reach an agreement with all of them, as some are unable or unwilling to make enforceable promises regarding the use of Reddit content, including its use for AI purposes.

This is practical behavior on the part of Reddit’s management, a public company with a duty to its shareholders, but it’s also simply bad for the public: in addition to restricting access for users who don’t want to use Google, it further subjects Reddit to the specific incentives of Google search, which are already changing rapidly, in part because of AI. Spammers are already cluttering popular threads on Reddit, which is suddenly receiving huge amounts of traffic from Google, in the hope of gaining more visibility in search results.

Google, like Reddit, owes its existence and success to the principles and practices of the open web, but such exclusive arrangements mark the end of a long and incredibly fruitful era. They are also a sign of things to come. The web was already in poor shape, reduced over the past 15 years by the rise of closed platforms, devastated by the consolidation of advertising, and polluted by an excess of content from the AI products that used it for training. The rise of AI scraping threatens to end the work, destroying a flawed but incredibly successful, decades-long experiment in the open web and human communication with a set of antagonistic agreements between warring tech companies.

See everything