Search engines are curious! Once your website is online, it will regularly receive visits from small programs that crawl your web presence for new content. Whether you have ever requested a search engine listing yourself is irrelevant: much of this happens automatically these days, and since websites are freely accessible anyway, no invitation is necessary. These so-called "bots" (also "robots," "spiders," or "crawlers") are an integral part of the internet. They independently scan web content and follow the hyperlinks they find to the next resource. It is enough for your homepage to be mentioned on another website that a bot has already crawled.
robots.txt is a plain text file that follows the Robots Exclusion Standard. It can be used to admit or exclude certain bots, i.e., the small programs that collect this data. The file is stored in the root directory of a website and is taken into account by compliant bots. This is useful if a website operator does not want certain content to be read by search engines or listed in their results.
In general, without such a file all bots are permitted to access your content, to follow internal and external links on the pages, and to pass the link data on to their search engines. Technically, it is sufficient to define prohibition rules in the robots.txt file. However, the file should only be considered a supplement: if you do not want a page indexed, you should also place a "noindex, nofollow" tag in its metadata to tell search engines that indexing the page and following its links is unwanted.
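If you want to see how a rule-abiding crawler reads such a file, Python's standard library includes a parser for exactly this format. The following is a minimal sketch, assuming a hypothetical site at www.example.com; the bot name "SomeBot" and the paths are only illustrative:

from urllib import robotparser

# Point the parser at the robots.txt of the site in question
# (www.example.com is a placeholder domain).
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # downloads and parses the file

# Ask whether a given bot may fetch a given URL.
print(rp.can_fetch("Googlebot", "https://www.example.com/meineartikel/"))
print(rp.can_fetch("SomeBot", "https://www.example.com/intern/"))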
Creating a robots.txt file
This is probably the simplest step. The file can be created with any text editor. The required filename is "robots.txt"; save the file under exactly this name. Once the rules are defined, the file is placed in the root directory of the website, typically by uploading it with an FTP (File Transfer Protocol) program.
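As a sketch of this step, the snippet below writes a small rule set to a local robots.txt and uploads it to the web root via FTP using Python's standard library. The host, the login data, the /intern/ path, and the target directory /htdocs are placeholders for your own hosting details:

from ftplib import FTP

# Write the rule set locally; any text editor does the same job.
rules = "User-agent: *\nDisallow: /intern/\n"
with open("robots.txt", "w", encoding="utf-8") as f:
    f.write(rules)

# Upload the file into the web root (host, login, and path are placeholders).
with FTP("ftp.example.com") as ftp:
    ftp.login(user="username", passwd="password")
    ftp.cwd("/htdocs")  # whatever directory your web root maps to
    with open("robots.txt", "rb") as f:
        ftp.storbinary("STOR robots.txt", f)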
Defining rules
The rules can refer to entire directories as well as to individual files or web pages, and several rules can be combined into a single group.
The simple syntax of robots.txt follows this structure:
- User-agent: (followed by the name of the bot to which the rules apply; an asterisk "*" here means all bots)
- Allow (or) Disallow: (determines which path or file you want to open up or block for the bots; the root path is given as "/")
You have free rein when designing the rules. The entries in robots.txt could look like the following examples.
Prohibit all bots from crawling the entire site:
User-agent: *
Disallow: /
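You can test this rule set locally before uploading it, again with the standard-library parser (a small sketch; the URLs and the bot name "SomeBot" are placeholders):

from urllib import robotparser

rules = """User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Every path is now blocked for every bot.
print(rp.can_fetch("Googlebot", "https://www.example.com/"))          # False
print(rp.can_fetch("SomeBot", "https://www.example.com/seite.html"))  # False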
Forbid a specific bot from crawling the site, but allow it to access one page and one directory. All other bots remain unaffected, since the rule names only this one bot.
User-agent: Googlebot
Disallow: /
Allow: /index.html
Allow: /meineartikel/
The Google, Bing, and Yahoo bots are not allowed to access the following directories and files, but all others are: the subdirectory, a second directory, one HTML page, and all PDF and DOCX documents are blocked. Note that this prohibition applies only to the user agents named; all other bots may continue to crawl.
User-agent: Googlebot
User-agent: Slurp
User-agent: Bingbot
Disallow: /Verzeichnisname/Unterverzeichnis/
Disallow: /Verzeichnisname2/
Disallow: /steven.html
Disallow: /*.pdf$
Disallow: /*.docx$
Further considerations
- File names are case sensitive.
- The "*" symbol is used as a wildcard when no specific names have been specified. Not all bots honor this wildcard.
- The "$" symbol in this syntax is a line-end anchor. It indicates that the preceding letters must follow the period in this combination.
- The hashtag symbol ("#") indicates a comment in the syntax. These marked comments are ignored when the rules are read. Insert this symbol at the beginning of a line of text if you want to add a personal comment within the file.
- Problem: Pages may still be crawled or indexed even though they have been explicitly excluded in the robots.txt file, if their meta tags (the page-specific information for search engines) contain no exclusion of their own. The meta tag works in addition to the information in robots.txt: add "noindex,nofollow" [<meta name="robots" content="noindex,nofollow">] to the meta section of the relevant web page.
- Well-known bots: Googlebot (Google), Googlebot-Image (Google image search), Bingbot (Bing), Slurp (Yahoo), ia_archiver (archive.org), ExaBot (Exalead), YandexBot (Yandex), Applebot (Apple), DuckDuckBot (DuckDuckGo), Baiduspider (Baidu), Swiftbot (Swiftype) ...
Furthermore, the order of the user-agent groups is irrelevant. The wildcard ("*") group is merely one more entry, not a general clause that has to stand at the beginning. What matters is that everything that is not excluded or mentioned is allowed.
In the following example, all bots except two are banned. For the two named, welcome bots, only a few paths are excluded:
User-agent: Googlebot
User-agent: Bingbot
Disallow: /Verzeichnisname/Unterverzeichnis/
Disallow: /Verzeichnisname2/
Disallow: /sven.html
Disallow: /*.csv$

User-agent: *
Disallow: /
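As a rough plausibility check of this combined rule set, the sketch below feeds it into Python's standard-library parser (the URLs and the bot name "SomeOtherBot" are placeholders). Note that this parser matches paths literally and does not expand the "*" and "$" wildcards the way major search engines do, so the CSV rule is only carried along here; the group logic is still visible:

from urllib import robotparser

rules = """User-agent: Googlebot
User-agent: Bingbot
Disallow: /Verzeichnisname/Unterverzeichnis/
Disallow: /Verzeichnisname2/
Disallow: /sven.html
Disallow: /*.csv$

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The two named bots may still fetch anything that is not listed ...
print(rp.can_fetch("Googlebot", "https://www.example.com/index.html"))       # True
# ... but they are kept out of the excluded directories ...
print(rp.can_fetch("Bingbot", "https://www.example.com/Verzeichnisname2/"))  # False
# ... while every other bot is locked out completely.
print(rp.can_fetch("SomeOtherBot", "https://www.example.com/index.html"))    # False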