The Protocol Gap: Brazil
Other editions: also published in Portuguese and Spanish
"An overwhelming majority of news websites in Brazil have either no AI-scraping policy at all or a very permissive approach. As of November 2025, only 7.2% of Brazilian news websites disallow at least one AI crawler through the use of a robots.txt file, even though most news sites (75%) have a robots.txt file in their website. Websites in Brazil that implement robots.txt directives focus their efforts on what appear to be the most well known AI crawlers from OpenAI, Common Crawl, Google, ByteDance, Amazon, Apple, Meta and Huawei. This data indicates that many websites do not use robots.txt as a mechanism to signal their preferences regarding AI companies scraping their content. Keeping an updated robots.txt file is not a foolproof solution, but it is a readily available tool for newsrooms to state the preferences regarding the use of their content, in line with the organization’s content strategy and business model. Whatever their stance, organizations need to be able to have a clear public direction as to how their content is used to train new AI models." (Key findings, page 4)
"In recent years, news sites have been receiving an influx of traffic from a new kind of visitor: AI crawlers. From Googlebot to OpenAI’s “GPTbot,” Anthropic’s “Claudebot” and Meta’s “ExternalAgent,” these crawlers all have one thing in common: a persistent need for data to train Large Language Models (LLMs), develop AI assistants, and feed AI-powered search engines. These new visitors are noteworthy especially when we consider that around 30% of the global web traffic today comes from bots. News sites are attractive destinations as they produce timely, reliable information that can help to improve the quality of their AI models and outputs. This surging demand for AI data presents new opportunities—but many challenges—for media organizations. It raises important legal and ethical issues around unauthorized use, copyright infringement, and privacy violations. But the very survival of journalism hangs in the balance, too: without mechanisms in place to signal and enforce permissions on AI crawlers, media organizations are unable to effectively protect and monetize the value of their work as it gets scraped, repackaged, and redistributed by AI systems. This gap risks accelerating the sustainability crisis facing media outlets that are already grappling with declining visibility and traffic from digital platforms—what many publishers have already called a “traffic apocalypse”— and the collapse of traditional business models based on advertising.
News sites are becoming increasingly aware of this growing challenge, but there are currently few ways to manage AI crawlers, and none of them is completely foolproof. Some sites are implementing hard paywalls, while others are launching lengthy lawsuits claiming copyright infringement. Technical responses are also emerging: in July 2025, the web security and infrastructure provider Cloudflare announced that its protections would block AI crawlers by default and help enforce a permission-based model. Content marketplaces, like the one piloted by Microsoft or startups such as ProRata.ai and Tollbit, and new data solutions developed by the likes of FlexOlmo and OpenMined are also emerging as ways to give content producers more control over how AI companies access, use, and compensate them for their content. But many more still manage content scraping by AI crawlers by implementing the Robots Exclusion Protocol (known as robots.txt)." (Introduction, page 4)
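For newsrooms or researchers who want to see where a given site stands, the short sketch below checks a site's robots.txt against a list of AI user agents using Python's standard library. It is a minimal illustration under stated assumptions, not the measurement pipeline used for this report; the agent tokens and the example URL are placeholders.

    # Sketch: which AI crawlers does a site's robots.txt disallow from fetching
    # its homepage? Uses only the Python standard library. The user-agent tokens
    # are illustrative and should be checked against each vendor's documentation.
    from urllib.robotparser import RobotFileParser

    AI_AGENTS = [
        "GPTBot", "CCBot", "Google-Extended", "Bytespider",
        "Amazonbot", "Applebot-Extended", "meta-externalagent", "PetalBot",
    ]

    def blocked_agents(site: str) -> list[str]:
        parser = RobotFileParser()
        parser.set_url(site.rstrip("/") + "/robots.txt")
        parser.read()  # fetch and parse the site's live robots.txt
        # Count an agent as blocked if it may not fetch the site's homepage.
        return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, site)]

    if __name__ == "__main__":
        # example.com is a placeholder; substitute any news site's base URL.
        print(blocked_agents("https://example.com"))

Checking only the homepage is a simplification; a fuller audit would also test representative article URLs, since some sites scope their Disallow rules to specific paths.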
Contents
Introduction, 4
About this project, 5
Robots.txt: An Underutilized Tool in Brazil, 6
Robots.txt Among Brazilian Publishers: What the Data Tells Us, 6
Blocked Agents, 7
Implications for Publishers in Brazil’s Evolving AI Landscape, 8
Conclusion and Next Steps, 10
Methodology, 11