For more than a decade, the nonprofit Common Crawl “has been scraping billions of webpages to build a massive archive of the internet,” notes The Atlantic, and it makes that archive freely available for research.
“In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models.
“In the process, my reporting has found, Common Crawl has opened a back door for AI companies to…








