2.19.5.2. robots.txt
Important points:
- The directives in the robots.txt file are only recommendations for bots; they do not guarantee that individual services will comply with them.
- The robots.txt file:
  - must have exactly this name;
  - must be located in the site root directory and accessible in a browser at an address of the form http://www.example.com/robots.txt;
  - must be no larger than 32 KB;
  - must be encoded in UTF-8.
- There can be only one robots.txt file on a site.
- By default, bots are allowed access to all pages on the site. To deny access to certain pages, use the Disallow directive.
- Each directive must begin on a new line.
- Rules are case sensitive.
The robots.txt file consists of groups of rules that determine the behavior of search bots on the site.
Syntax
Each group can contain several directives of the same type. This is convenient for listing multiple bots or pages.
The rules in a group must appear in the following order and consist of these directives:
- User-agent — required; can be specified multiple times in a single group of rules.
- Disallow and Allow — required; at least one of the two must be specified in each group of rules.
- Host, Crawl-delay, Sitemap — optional.
To specify regular expressions, use:
- * — a sequence of any characters, of any length;
- $ — end of line.
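A hypothetical illustration of these wildcards (the paths here are invented for the example, not taken from the original):

```
User-agent: *
Disallow: /*.pdf$      # deny any URL ending in .pdf
Disallow: /shop/*/tmp  # deny paths containing /tmp inside any subdirectory of /shop
```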
Basic directives
Attention!
Addresses and names are case sensitive. For example, Example and example will yield different results.
User-agent
The User-agent directive specifies the name of the bot for which the rule will apply. To specify all bots, you can use:
```
User-agent: *
```
When a directive specifies a particular bot by name, that bot ignores the rule with *.
An example of directives that allow access to the Googlebot bot and deny access to others:
```
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
```
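Rules like these can be sanity-checked locally with Python's standard urllib.robotparser (a convenience check, not something the configuration requires):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: Googlebot",
    "Disallow:",
])

# Googlebot matches its own group, so the * group is ignored for it.
print(rp.can_fetch("Googlebot", "http://example.com/page"))  # True
print(rp.can_fetch("OtherBot", "http://example.com/page"))   # False
```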
Disallow
The Disallow directive specifies pages to which bots are denied access.
Deny access to the entire site:
```
Disallow: /
```
Deny access to specific pages:
```
Disallow: /admin
```
Attention!
When you specify /admin, access is denied both to the admin directory and to files whose names begin with admin, such as admin.php and admin.html. To deny access only to the directory, add a trailing slash: /admin/.
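This prefix behavior can be demonstrated with Python's urllib.robotparser, which applies the same prefix matching:

```python
from urllib import robotparser

# Without a trailing slash, /admin also blocks files starting with "admin".
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /admin"])
print(rp.can_fetch("SomeBot", "http://example.com/admin.php"))   # False

# With a trailing slash, only the directory itself is blocked.
rp2 = robotparser.RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /admin/"])
print(rp2.can_fetch("SomeBot", "http://example.com/admin.php"))  # True
print(rp2.can_fetch("SomeBot", "http://example.com/admin/x"))    # False
```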
Allow
The Allow directive specifies the pages that bots are allowed to access. The directive is used to create exceptions when using Disallow.
Deny access to the Googlebot bot to the entire site, except for the pages directory:
```
User-agent: Googlebot
Disallow: /
Allow: /pages/
```
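The effect of such an exception can also be checked with urllib.robotparser. One caveat: Python's parser applies rules in file order (first match wins), rather than the longest-match rule that major search engines follow, so for this local check the Allow line is placed first:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Allow: /pages/",   # exception listed first for Python's first-match parser
    "Disallow: /",
])

print(rp.can_fetch("Googlebot", "http://example.com/pages/about"))  # True
print(rp.can_fetch("Googlebot", "http://example.com/other"))        # False
```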
Host
The Host directive determines which domain of the site should be considered the main one. The other domains will be defined as mirrors, technical addresses, etc. It is used for correct search indexing when several domains are linked to the site.
For a site with the domains example.com and domain.com, the domain example.com will be considered the primary domain for all bots:

```
User-agent: *
Disallow:
Host: example.com
```
Crawl-delay
The Crawl-delay directive specifies the interval in seconds between the end of loading one page and the start of loading the next for bots. It can be used to reduce the number of requests to the site, which helps to reduce the load on the server.
Example of use:
```
User-Agent: *
Disallow:
Crawl-delay: 3
```
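The declared value can be read back with urllib.robotparser's crawl_delay method, which is part of the Python standard library:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow:",
    "Crawl-delay: 3",
])

# The * group applies to any bot that has no group of its own.
print(rp.crawl_delay("AnyBot"))  # 3
```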
Sitemap
The Sitemap directive specifies the URL of the sitemap file on the site. It can be specified multiple times. The address must be in the format protocol://address/path/to/sitemap.
Example of use:
```
Sitemap: https://example.com/sitemap.xml
Sitemap: http://www.example.com/sitemap.xml
```
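urllib.robotparser can also list the declared sitemaps; the site_maps() method requires Python 3.8 or later:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "Sitemap: https://example.com/sitemap.xml",
    "Sitemap: http://www.example.com/sitemap.xml",
])

# Sitemap directives are independent of User-agent groups.
print(rp.site_maps())
# ['https://example.com/sitemap.xml', 'http://www.example.com/sitemap.xml']
```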
Multidomain robots.txt
Important points:
- Any existing robots.txt file must be deleted.
- In the "Site settings" section, the option "Forward request to backend if static file is not found" must be enabled, or the txt extension must be removed from the list of static files.
On sites with multiple domains (for example, when using aliases), the settings in the robots.txt file may differ for each domain due to specific SEO optimization or other tasks.
To implement dynamic robots.txt, do the following:
- Read the important information in this article and make sure that all conditions are met.
- In the site root directory, create files named domain.com-robots.txt (replace domain.com with each domain to which the rules should apply).
- In the created files, specify the necessary rules for each domain.
- At the beginning of the .htaccess file, add the following directives:

```
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/robots\.txt$
RewriteRule ^robots\.txt$ %{HTTP_HOST}-robots.txt [L]
```

- Check the output of the rules for each domain.
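The mapping those rewrite rules perform can be sketched in Python (an illustration only, not part of the hosting configuration): a request for /robots.txt on host H is served from the file H-robots.txt.

```python
# Illustrative model of the .htaccess rewrite above.
def robots_target(http_host: str, request_uri: str):
    """Return the per-domain robots file for a /robots.txt request, else None."""
    if request_uri == "/robots.txt":
        return f"{http_host}-robots.txt"
    return None

print(robots_target("example.com", "/robots.txt"))  # example.com-robots.txt
print(robots_target("example.com", "/index.html"))  # None
```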