Documentation

Configuration: Crawler Rules


Note: this feature is only available for upgraded PRO Sitemaps accounts.

"Crawler Rules" configuration section allows you to narrow down the set of pages to be fetched/indexed on the website.

Maximum pages to crawl
Maximum Pages: This limits the number of pages crawled. Enter "0" (or leave the setting empty) to remove the limit.
Maximum Depth: This limits the number of "clicks" it takes to reach a page starting from the homepage. Enter "0" for unlimited depth.
Exclude URLs: noindex, nofollow
To disallow any part of your website from inclusion in the sitemap, use the "Exclude URLs" setting: all URLs that contain the specified strings will be skipped.
For instance, to exclude all pages within "www.domain.com/folder/" add this line:
folder/
If your site has pages with lists that can be reordered by columns, with URLs like "list.php?sort=column2", add this line to exclude duplicate content:
sort=
You may also leave this box empty to have ALL pages listed.
This option works similarly to the robots "noindex, nofollow" directive.
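As an illustration only (not the actual PRO Sitemaps implementation), the substring matching described above works roughly like this Python sketch, where exclude_patterns stands for the strings entered in the "Exclude URLs" box:

# Illustrative sketch: a URL is skipped if it contains any configured substring.
exclude_patterns = ["folder/", "sort="]

def is_excluded(url):
    return any(pattern in url for pattern in exclude_patterns)

print(is_excluded("http://www.domain.com/folder/page.html"))       # True - skipped
print(is_excluded("http://www.domain.com/list.php?sort=column2"))  # True - skipped
print(is_excluded("http://www.domain.com/about.html"))             # False - included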
Add directly in sitemap (do not parse) URLs: index, nofollow
"Do not parse URLs" works together with the option above to increase the speed of creating the sitemap.
If you are sure that some pages at your site do not contain the unique links to other pages, you can tell crawler not to fetch them.

For instance, if your site has "view article" pages with URLs like "viewarticle.php?..", you may want to add them here, because most likely all links inside these pages are already listed on "higher level" documents (such as the list of articles) as well:
viewarticle.php?id=
If you are not sure what to define here, just leave this field empty.
In short, pages found by the crawler that match this setting are included in the sitemap, but their content is not fetched.
This option works similarly to the robots "index, nofollow" directive.
Crawl, but do not include URLs: noindex, follow
Crawl pages that contain these substrings in their URL, but do NOT include them in the sitemap.
This option is similar to the robots "noindex, follow" directive.
"Include ONLY" URLs:
Fill in this field if you would like to include in the sitemap ONLY those URLs that match the specified strings; separate multiple matches with spaces. Leave this field empty by default.
Add directly in sitemap (do not parse) extensions:
These file types will be added to the sitemap but not fetched from your server, saving bandwidth and time, because they usually are not HTML files and have no embedded links.
Please make sure these files should be indexed by search engines, since there is no point in adding them to the sitemap otherwise!
This option works similarly to the robots "index, nofollow" directive.
"Parse ONLY" URLs:
Fill in this field if you would like to parse (fetch from your website) ONLY those URLs that match the specified strings; separate multiple matches with spaces. Leave this field empty by default.
Remove parameters from URLs (separated with spaces):
URL parameters defined in this option will be removed from the link.
For instance, "page.html?article=new&sid=XXXXXX" will be replaced with "page.html?article=new" in case if "sid" is defined in this option.
Allow links not matching original location:
Allow including pages from any website subdomain (like news.yourdomain.com or mobile.yourdomain.com)
Allow both http and https protocol URLs
Follow robots.txt directives:
If this option is enabled, the PRO Sitemaps bot will read and follow blocking directives from your website's robots.txt file (it checks the "User-agent: *" and "User-agent: googlebot" sections).
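As a rough illustration of how such directives are evaluated (not the PRO Sitemaps bot's own code), Python's standard library can check whether a URL is blocked for a given user agent:

# Illustrative sketch: evaluating robots.txt blocking rules.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://yourdomain.com/private/page.html"))  # False - blocked
print(rp.can_fetch("*", "http://yourdomain.com/public/page.html"))   # True - allowed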
Detect canonical URL meta tags
If the canonical meta tag in the HTML source of your page points to a different URL, the PRO Sitemaps bot will follow the link defined in the tag. More details
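For reference, a canonical tag in a page's HTML source looks like this (the URL shown is only a placeholder):

<link rel="canonical" href="http://yourdomain.com/original-page.html" />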
Detect external broken links
You can instruct PRO Sitemaps to check links pointing to external websites for "broken URLs", i.e. those returning a "Not found" error.

Note: the number of links checked depends on the subscription plan and the number of URLs in the sitemap. For example, with a 5,000-URL subscription and 4,000 links included in the sitemap, only up to 1,000 external links will be checked (the remaining part of the subscription plan).
Disable cookies:
With this option enabled, the PRO Sitemaps bot will DISABLE support of cookies when crawling your website.
Additional Starting URLs:
If some pages on your website cannot be reached by clicking links starting from the homepage, you can define additional entry points for the PRO Sitemaps bot in this setting.
Example:
http://yourdomain.com/pageslist.html
http://yourdomain.com/more_pageslist.html
Make a delay between requests:
This option allows you to slow down crawling and avoid overloading your server by adding a delay of X seconds after every N requests. Enter "0" for no delay.
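As an illustrative sketch only (with hypothetical values for X and N, not the bot's actual code), the behavior is roughly the following:

# Illustrative sketch: pause for X seconds after every N requests.
import time

delay_seconds = 2        # "X" in the setting (hypothetical value)
requests_per_batch = 10  # "N" in the setting (hypothetical value)
urls_to_crawl = ["http://yourdomain.com/page%d.html" % i for i in range(25)]

def fetch(url):
    # Placeholder for the actual page request.
    pass

for count, url in enumerate(urls_to_crawl, start=1):
    fetch(url)
    if delay_seconds and count % requests_per_batch == 0:
        time.sleep(delay_seconds)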
Crawling Schedule:
If you want to prevent the PRO Sitemaps bot from crawling your site on specific days/hours, you can define the schedule in this setting.
You can allow/disallow specific days of the week, days of the month, and/or hours of the day. The PRO Sitemaps bot will start a crawling session only when the defined schedule matches.
If there is not enough time to finish the sitemap in a single session, the bot will pause the process and continue the next time the schedule matches.

Sitemap update interval:
Select how often the sitemap should be updated.
"Default" is daily for most websites (depending on the website size).
"Never" disables automatic sitemap updates: the sitemap will be updated only when requested manually via the "Create sitemap" page in your account.