JSitemap Professional is a Joomla extension for generating sitemaps and managing SEO.
The robots.txt file
Pointing search engines to your sitemaps and preventing them from overloading your website with crawl requests
Before crawling a site, Google's crawlers download and parse the site's robots.txt file to extract information about which parts of the site may be crawled and where your sitemaps are stored.
Setting up your file
- Upload the robots.txt.dist file that ships with Joomla! to the root folder of your website
- Rename the file to robots.txt
- In the JSitemap control panel in your website's backend, open the Robots.txt Editor
- Copy and paste the following text and hit "Save robots.txt":
    User-agent: *
    User-agent: AdsBot-Google
    Disallow: /administrator/
    Disallow: /api/
    Disallow: /bin/
    Disallow: /cache/
    Disallow: /cli/
    Disallow: /components/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /layouts/
    Disallow: /libraries/
    Disallow: /logs/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/
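To point search engines at your sitemap, you can also append a Sitemap directive to the same file. The URL below is only a placeholder; use the actual sitemap URL shown in your JSitemap control panel.

    # Placeholder -- replace with the sitemap URL generated by JSitemap
    Sitemap: https://example.com/sitemap.xml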
Resources
- Introduction to robots.txt
- Create a robots.txt file
- How Google interprets the robots.txt specification
A robots.txt file lives at the root of your site.
- It is mainly used to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page. To keep information secure from web crawlers, use other blocking methods such as password-protecting private files on your server.
- A robots.txt file consists of one or more groups (a minimal annotated example follows this list):
- Each group consists of multiple rules or directives, one directive per line. These are case-sensitive.
- A group gives the following information:
- "User-agent" : Who the group applies to. Note that you can group together rules that apply to multiple user agents by repeating user-agent lines for each crawler.
- "Disallow" : Which directories or files that agent cannot access.
- "Allow" : Which directories or files that agent can access. This is the default assumption for all user agents. You only use it when you wish to override a "Disallow" directive to allow crawling of a subdirectory page.
- "Sitemap" : The location of a sitemap for the website. Sitemaps are the only directives that require the entire string (Ex: https://esentialist.com/").
- Each group begins with a User-agent line that specifies which crawlers the group applies to:
- To apply your rules to all crawlers, use the syntax "User-agent: *". This wildcard does not cover the AdsBot crawlers, which must be named explicitly with the syntax "User-agent: AdsBot-Google".
- To target a specific crawler, name it explicitly; for example, "User-agent: Googlebot" applies the rules to Google's main crawler.
- The "#" character marks the beginning of a comment.
Test your file
To test whether your newly uploaded robots.txt file is publicly accessible:
- Open a private browsing window (or equivalent) in your browser and navigate to the location of the robots.txt file, for example https://example.com/robots.txt. If you see the contents of your robots.txt file, you're ready to test the markup.
- Use Google's robots.txt Tester, or run the script sketched below.
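If you prefer to check the file from a script, the following Python sketch uses the standard library's urllib.robotparser to download the file and evaluate its rules (the example.com URLs are placeholders for your own domain):

    from urllib.robotparser import RobotFileParser

    # Placeholder domain -- replace with your own site
    robots_url = "https://example.com/robots.txt"

    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # downloads and parses robots.txt

    # Check whether a generic crawler may fetch given paths;
    # with the file shown above, /administrator/ should be disallowed
    print(parser.can_fetch("*", "https://example.com/administrator/"))
    print(parser.can_fetch("*", "https://example.com/index.php"))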
Excluding specific images and videos folders from sitemaps
JSitemap Pro offers an advanced filtering system to include or exclude images and videos for each single data source, based on string fragments or paths that you specify as a comma-separated list:
- Global Configuration -> Sitemaps settings
- Exclude filters for Images sitemap: e.g. favicons (see the example values below)
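As an illustration, a comma-separated filter such as the one below (the values are hypothetical) would exclude any image whose URL contains one of the listed fragments:

    favicon,apple-touch-icon,/images/banners/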