Directives specified in the robots.txt file are only recommendations for bots; they do not guarantee that individual services will honor them.

The robots.txt file consists of groups of rules that determine the behavior of robots on the site.

Important points:

  • The file must be named exactly robots.txt, and its encoding must be UTF-8.
  • The robots.txt file must not exceed 32 KB in size.
  • The robots.txt file must reside in the root directory of the site, that is, it must be accessible in a browser at an address of the form http://www.example.com/robots.txt.
  • A site can have only one robots.txt file.
  • Each directive must start on a new line.
  • By default, robots are allowed to process all pages of the site; access to specific pages is blocked with the Disallow directive.
  • The rules are case-sensitive.

Each group can contain several rules of the same kind; this is useful, for example, for targeting multiple robots or pages.

A rule group must follow the order below and consist of the specified directives:

  1. User-agent — a mandatory directive; it can be specified multiple times in one rule group.
  2. Disallow and Allow — mandatory directives. At least one of them must appear in each rule group.
  3. Host, Crawl-delay, Sitemap — optional directives.
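A complete rule group in this order might look as follows (the paths and the sitemap address are illustrative):

User-agent: *
Disallow: /admin/
Allow: /admin/help/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml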

The rules support two wildcard characters:

  • * — matches any sequence of any characters, of any length.
  • $ — matches the end of the URL.
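For example, the following rules (the paths are illustrative) deny access to all URLs whose path ends in .pdf and to all URLs containing a query string:

Disallow: /*.pdf$
Disallow: /*?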


Keep in mind that the addresses and names specified in the rules are case-sensitive: specifying Example and example will give different results.

The User-agent directive defines the name of the robot that the rule group applies to. To target all robots, use:

User-agent: *

If a group is specified for a particular robot by name, that robot ignores the group with *.

The following directives deny access to all robots except the one named Googlebot:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

The Disallow directive defines the pages to which robots are denied access.

You can deny access to the entire site by specifying:

Disallow: /

A ban on individual pages can be specified as follows:

Disallow: /admin


When /admin is specified, access is denied to the admin directory and to files whose names begin with admin, for example admin.php and admin.html. To deny access to the directory only, specify /admin/.
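This prefix matching can be checked with Python's standard urllib.robotparser module (the paths below are illustrative):

```python
# A small check of Disallow prefix matching using Python's stdlib parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /admin
""".splitlines())

# /admin also blocks admin.php, admin.html and everything under /admin/
print(rp.can_fetch("*", "/admin"))        # False
print(rp.can_fetch("*", "/admin.php"))    # False
print(rp.can_fetch("*", "/admin/users"))  # False
print(rp.can_fetch("*", "/public"))       # True
```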

The Allow directive defines pages to which robots are allowed access. It is used to make exceptions to Disallow rules.

The following rules block the entire site for the robot Googlebot, except for the /pages/ directory:

User-agent: Googlebot
Disallow: /
Allow: /pages/
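Note that crawlers resolve Allow/Disallow conflicts differently: Google applies the most specific (longest) matching rule, while simpler parsers such as Python's urllib.robotparser apply rules in file order. With the stdlib parser the same exception works when Allow is listed first (a sketch with illustrative paths):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: Googlebot
Allow: /pages/
Disallow: /
""".splitlines())

# In urllib.robotparser the first matching rule wins.
print(rp.can_fetch("Googlebot", "/pages/index.html"))  # True
print(rp.can_fetch("Googlebot", "/other"))             # False
```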

The Host directive defines the main domain of the site. It is useful when several domain names are bound to the site: for correct search indexing, you can specify which domain is the main one, so that the remaining domains are treated as mirrors, technical addresses, and so on.

An example of using the directive on a site with the domains example.com and domain.com, where example.com will be the main domain for all robots:

User-agent: *
Host: example.com

The Crawl-delay directive defines the interval, in seconds, between the moment a robot finishes loading one page and the moment it starts loading the next. It is useful for reducing the number of requests to the site and thereby the load on the server.

Usage example:

User-Agent: *
Crawl-delay: 3
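The value can be read back with Python's stdlib parser (note that support for Crawl-delay varies between crawlers; Google, for example, ignores it):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-Agent: *
Crawl-delay: 3
""".splitlines())

# Returns the delay, in seconds, for the matching group.
print(rp.crawl_delay("*"))  # 3
```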

The Sitemap directive specifies the URL of a sitemap file on the site. It can be specified multiple times. The address must be a full URL of the form protocol://address/path/to/sitemap.

Usage example:

Sitemap: https://example.com/sitemap.xml
Sitemap: http://www.example.com/sitemap.xml
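Sitemap URLs can likewise be extracted with the stdlib parser (RobotFileParser.site_maps() requires Python 3.8 or later):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
Sitemap: https://example.com/sitemap.xml
Sitemap: http://www.example.com/sitemap.xml
""".splitlines())

# site_maps() returns all Sitemap URLs found in the file, or None.
print(rp.site_maps())
```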


To serve robots.txt dynamically, the existing robots.txt file must be removed; in addition, in the site settings either the «Send requests to the backend if the file is not found» option must be enabled, or the txt extension must be removed from the list of static files.

If the site uses several domains, for example via aliases, the settings in robots.txt may need to differ for each site because of SEO optimization or other tasks. To implement a dynamic robots.txt, do the following:

  1. Read the important information in this article and make sure that all conditions are met.
  2. Create files named domain.com-robots.txt in the root directory of the site, where domain.com is replaced by the domain to which the rules will apply.
  3. Specify the required rules for each domain in the generated files.
  4. Set up serving of the files by adding the following rules at the beginning of the .htaccess file:
    RewriteEngine On
    RewriteCond %{REQUEST_URI} ^/robots\.txt$
    RewriteRule ^robots\.txt$ %{HTTP_HOST}-robots.txt [L]
  5. Check the rules served for each of the domains.