Settings > Crawler Rules

Note: this feature is only available for upgraded PRO Sitemaps accounts

"Crawler Rules" configuration section allows you to narrow down the set of pages to be fetched/indexed on the website.

Maximum pages to crawl
Maximum Pages: Limits the number of pages crawled. Enter "0" (or leave the field empty) to disable the limit.
Maximum Depth: Limits the number of "clicks" it takes to reach a page starting from the homepage. Enter "0" for no limit.
Exclude URLs: noindex, nofollow
Use this setting to control which parts of your website are indexed: all URLs that contain the strings specified will be skipped.

Scenario 1: to exclude all pages within "www.domain.com/folder/" add this line:
folder/
Scenario 2: your website has listing pages that can be reordered by column, with URLs like "list.php?sort=column2"; add this line to exclude the duplicate content:
sort=
Leave this box empty to index all pages.
This option is similar to robots "noindex, nofollow" directive.
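For illustration only, here is a minimal Python sketch of how this kind of substring-based exclusion behaves, assuming a simple "skip any URL that contains the string" match as described above; the rule strings and URLs are hypothetical examples, not the bot's actual implementation:

# Illustrative sketch only: substring-based "Exclude URLs" behaviour.
# The rule strings and URLs below are hypothetical examples.
exclude_rules = ["folder/", "sort="]

urls = [
    "https://www.domain.com/folder/page1.html",
    "https://www.domain.com/list.php?sort=column2",
    "https://www.domain.com/about.html",
]

def is_excluded(url, rules):
    # A URL is skipped if it contains any of the rule substrings.
    return any(rule in url for rule in rules)

for url in urls:
    status = "skipped" if is_excluded(url, exclude_rules) else "indexed"
    print(status + ": " + url)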

Excluded File types (extensions): files with these extensions will be excluded from the main web sitemap. Note that image file types will still be indexed separately in the Images sitemap.
Do not fetch, but Add directly in sitemap URLs: index, nofollow
Speed up sitemap generation by skipping the fetching of URLs that don't contain unique links to other pages.
Enter URLs that can be added directly to the sitemap without being fetched from your website.
Extensions: files with these extensions will be included in the sitemap without being fetched from your server, as they typically don't contain links to other pages.
This option is similar to robots "index, nofollow" directive.
Fetch, but do not include URLs: noindex, follow
Crawl pages that contain these substrings in the URL, but do NOT include them in the sitemap.
This option is similar to robots "noindex, follow" directive.
Filter - "Include ONLY" URLs:
Use this setting to control which parts of your website are indexed: only URLs that contain the strings specified will be included in the sitemap. Leave this field empty by default.
Add directly in sitemap (do not parse) extensions:
Files with these extensions will be added to the sitemap but not fetched from your server, saving bandwidth and time, because they are usually not HTML files and contain no embedded links.
This option works similarly to the robots "index, nofollow" directive.
Filter - "Fetch ONLY" URLs:
Use this setting to control which parts of your website are crawled: only URLs that contain the strings specified will be fetched. Leave this field empty by default.
Remove parameters from URLs (separated with spaces):
Remove query string parameters from URLs by listing them here, separated by spaces. For example, if you enter "sid", the link "page.html?article=new&sid=XXXXXX" will become "page.html?article=new".
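As an illustration of the transformation described above, here is a minimal Python sketch that strips a named query parameter from a URL using the standard library; it reuses the "sid" example and is not tied to the crawler's actual implementation:

# Illustrative sketch only: removing a query parameter ("sid") from a URL.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def strip_params(url, params_to_remove):
    # Rebuild the URL, keeping only the query parameters that are not listed.
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in params_to_remove]
    return urlunparse(parts._replace(query=urlencode(kept)))

print(strip_params("page.html?article=new&sid=XXXXXX", {"sid"}))
# -> page.html?article=new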
Allow links not matching original location:
Allow including pages from any subdomain of your website (like news.yourdomain.com or mobile.yourdomain.com).
Allow both http and https protocol URLs.
Follow robots.txt directives:
If this option is enabled, the PRO Sitemaps bot will read and follow blocking directives from your website's robots.txt file (it will check the "User-agent: *" and "User-agent: googlebot" sections).
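For reference, this is a minimal Python sketch of the kind of robots.txt check described above, using the standard library's robotparser; the domain and URL are hypothetical, and this is not the bot's actual code:

# Illustrative sketch only: checking a URL against robots.txt rules.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://yourdomain.com/robots.txt")
rp.read()  # fetches and parses robots.txt

# can_fetch() applies the "User-agent: googlebot" section if present,
# otherwise it falls back to the "User-agent: *" rules.
print(rp.can_fetch("googlebot", "https://yourdomain.com/private/page.html"))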
Detect canonical URLs meta tags
If your pages include a canonical meta tag pointing to a different URL, the PRO Sitemaps bot will follow that link and use it in the sitemap instead of the original page URL.
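As a rough illustration of canonical detection, the sketch below scans a page's HTML for a <link rel="canonical"> tag with Python's standard HTML parser; the markup and URL are hypothetical examples, not the bot's actual implementation:

# Illustrative sketch only: finding a canonical URL in a page's HTML.
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

page_html = '<html><head><link rel="canonical" href="https://yourdomain.com/page"></head></html>'
finder = CanonicalFinder()
finder.feed(page_html)
print(finder.canonical)  # this URL would be used in the sitemap instead of the original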
Detect external broken links
PRO Sitemaps can be configured to detect broken URLs that are outside of your website. This includes links to external websites that return a "Not found" error.

Note: The number of links checked is limited by the subscription plan and the number of URLs in the sitemap. For example, with a 5,000 URL subscription and 4,000 URLs in the sitemap, only up to 1,000 external links will be checked.
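For illustration only, a minimal Python sketch of checking an external link for a "Not found" response with the standard library; the URL and timeout are hypothetical, and the real checker may behave differently:

# Illustrative sketch only: detecting an external link that returns HTTP 404.
import urllib.request
import urllib.error

def is_broken(url):
    # Returns True only for a confirmed "Not found" response.
    request = urllib.request.Request(url, method="HEAD")
    try:
        urllib.request.urlopen(request, timeout=10)
        return False
    except urllib.error.HTTPError as err:
        return err.code == 404
    except urllib.error.URLError:
        return False  # unreachable, but not a confirmed 404

print(is_broken("https://example.com/missing-page"))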
Disable cookies:
When this option is enabled, PRO Sitemaps bot will not accept cookies while crawling your website.
Additional Starting URLs:
If some parts of your website are not accessible by following links from your homepage, you can define additional starting points for the PRO Sitemaps bot in this setting.
Example:
http://yourdomain.com/landing.html
http://yourdomain.com/another_landing.html
Throttle crawling speed:
To avoid overloading your server, you can throttle the crawling speed by adding a delay of X seconds after N requests. Enter "0" for no delay.
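As a rough sketch of the throttling behaviour ("a delay of X seconds after N requests"), assuming hypothetical values for X and N and a placeholder fetch function:

# Illustrative sketch only: pause for X seconds after every N requests.
import time

DELAY_SECONDS = 2        # X: pause length; "0" would mean no delay
REQUESTS_PER_BATCH = 10  # N: requests between pauses

def crawl(urls, fetch):
    for count, url in enumerate(urls, start=1):
        fetch(url)
        if DELAY_SECONDS and count % REQUESTS_PER_BATCH == 0:
            time.sleep(DELAY_SECONDS)  # give the server a break

# Placeholder fetch function and hypothetical URLs for demonstration.
crawl(["https://yourdomain.com/page" + str(i) for i in range(25)], fetch=print)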
Crawling Schedule:
Schedule when the PRO Sitemaps bot can crawl your website by selecting specific days of the week, days of the month, and hours of the day.
The bot will only start new crawling sessions when the schedule matches. If there is not enough time to finish the sitemap in a single session, the bot will pause the process and continue the next time the schedule matches.
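As an illustration of schedule matching, the sketch below checks whether the current time falls within an allowed window of weekdays and hours (days of the month would work the same way); the selected values are hypothetical examples:

# Illustrative sketch only: does "now" fall inside the allowed crawling window?
from datetime import datetime

ALLOWED_WEEKDAYS = {5, 6}          # Saturday and Sunday (Monday == 0)
ALLOWED_HOURS = set(range(1, 6))   # 01:00-05:59

def schedule_matches(now=None):
    now = now or datetime.now()
    return now.weekday() in ALLOWED_WEEKDAYS and now.hour in ALLOWED_HOURS

# A new crawling session starts (or a paused one resumes) only when this is True.
print(schedule_matches())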

Sitemap update interval:
Choose the frequency of sitemap updates: this determines the time gap between a sitemap's completion and the start of the next update.
By default, updates occur daily for most sites, depending on their size.
Selecting "Never" will turn off automatic updates, meaning the sitemap will only refresh when manually triggered through the "Update sitemap" page in your account.