How AIs Learn From Your Content: Considerations Before Blocking Bots

OpenAI’s ChatGPT accesses website content to enhance its learning capabilities. However, concerns have arisen about the lack of a simple way to prevent one’s content from being used to train large language models (LLMs) such as ChatGPT. This article outlines the processes and considerations involved in blocking access to website content for AI training purposes.

Blocking GPTBot

OpenAI allows site owners to block GPTBot through the robots.txt file. GPTBot is the user agent for OpenAI’s web crawler. Although this action may seem straightforward, it’s important to understand that OpenAI does not explicitly state that GPTBot is used to build the datasets that train ChatGPT. Additionally, there is a public dataset by Common Crawl, which already crawls the internet extensively, raising questions about why OpenAI would duplicate that work.

The full user agent details for GPTBot are:

User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)

GPTBot can be disallowed in the robots.txt file using the following lines:

User-agent: GPTBot
Disallow: /

Additionally, GPTBot adheres to Allow and Disallow directives in robots.txt, so crawling can be permitted for some parts of a website and prohibited for others, offering finer-grained control over its access.
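For example, to let GPTBot crawl one section of a site while keeping another off limits, the robots.txt entry could look like the following (the directory names are illustrative, not prescribed by OpenAI):

User-agent: GPTBot
Allow: /public-content/
Disallow: /members-only/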

OpenAI also shares an IP range that can be used to identify the official GPTBot. While it’s possible to block this IP range through .htaccess, it’s important to note that the IP range can change, necessitating regular updates to the .htaccess file.
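As a rough sketch, an Apache 2.4 .htaccess rule that denies a published range could take the form below. The CIDR shown is a documentation placeholder, not OpenAI’s actual range; a real rule would need to mirror OpenAI’s current list and be updated whenever it changes.

# Placeholder CIDR (192.0.2.0/24) shown for illustration only; substitute
# OpenAI's currently published GPTBot ranges and keep them in sync.
<RequireAll>
    Require all granted
    Require not ip 192.0.2.0/24
</RequireAll>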

Given the dynamic nature of the IP range, a more convenient approach is to use the published range only to verify that a visitor identifying itself as GPTBot is genuine, and to rely on robots.txt for blocking the crawler.
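A minimal verification sketch in Python, assuming OpenAI publishes its GPTBot ranges as a plain-text list of CIDRs at the URL shown (treat that location as an assumption and confirm it against OpenAI’s documentation), could look like this:

import ipaddress
import urllib.request

# Assumed location of OpenAI's published GPTBot IP ranges (one CIDR per line);
# confirm against OpenAI's current documentation before relying on it.
GPTBOT_RANGES_URL = "https://openai.com/gptbot-ranges.txt"

def load_gptbot_networks(url=GPTBOT_RANGES_URL):
    """Fetch the published CIDR ranges and parse them into network objects."""
    with urllib.request.urlopen(url) as response:
        lines = response.read().decode("utf-8").splitlines()
    return [ipaddress.ip_network(line.strip()) for line in lines if line.strip()]

def is_official_gptbot(client_ip, networks):
    """Return True if client_ip falls inside any published GPTBot range."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in network for network in networks)

if __name__ == "__main__":
    networks = load_gptbot_networks()
    # 203.0.113.10 is a documentation address used purely as a placeholder.
    print(is_official_gptbot("203.0.113.10", networks))

A request whose user agent claims to be GPTBot but whose IP fails this check can reasonably be treated as an impostor.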


How AIs Learn From Your Content

Large Language Models (LLMs) are trained on data from various sources including Wikipedia, government court records, books, emails, and crawled websites. Numerous open-source portals and websites offer vast amounts of data for AI training purposes.

Datasets Used to Train ChatGPT

The datasets used to train GPT-3.5 are the same as those used for GPT-3; the key difference is the use of reinforcement learning from human feedback (RLHF). These datasets include Common Crawl (filtered), WebText2, Books1, Books2, and Wikipedia.

OpenWebText2

While WebText2 (created by OpenAI) is not publicly available, a publicly accessible open-source version known as OpenWebText2 has been developed. This dataset was created utilizing similar crawl patterns to the original WebText2, potentially offering a comparable dataset of URLs.

Both a cleaned-up version of OpenWebText2 and the raw version are publicly available for download.

There is no specific information available about the user agent used for either crawler. However, it’s important to note that sites linked from Reddit with three or more upvotes have a strong chance of being included in both the private OpenAI WebText2 dataset and its open-source counterpart, OpenWebText2.

Common Crawl

The Common Crawl dataset, created by the non-profit organization Common Crawl, is a widely used corpus of internet content. Website owners can block Common Crawl’s crawler with robots.txt to keep their content out of future Common Crawl datasets, and thus out of other datasets derived from them.

The CCBot User-Agent string is: CCBot/2.0.

To block the Common Crawl bot, add the following to the robots.txt file:

User-agent: CCBot
Disallow: /

Additionally, the use of the “nofollow” robots meta tag helps control the actions of CCBot.
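A page-level example of that directive, assuming the standard robots meta tag is sufficient for CCBot, is:

<meta name="robots" content="nofollow">

This tells compliant crawlers not to follow links from the page, which limits how CCBot discovers further URLs.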


Considerations Before Blocking Bots

It’s essential to consider that datasets such as Common Crawl might be utilized by companies to filter and categorize URLs for advertising purposes. Exclusion from such databases could impact potential advertising opportunities for publishers.

Blocking AI From Using Your Content

Despite search engines and Common Crawl offering the ability to opt out of being crawled, there is currently no efficient way to remove one’s website content from existing datasets. It remains uncertain whether research scientists will provide a way for website publishers to opt out of being crawled by AI systems. The ethical implications of using website data without permission or an opt-out mechanism are explored further in the article “Is ChatGPT Use Of Web Content Fair?”.

In conclusion, the processes involved in controlling AI’s access to website content are intricate, and full opt-out capabilities are yet to be established. As technology advances, it’s crucial for website owners to be informed and proactive in managing access to their content for AI training.