Netpeak Spider 18.104.22.168: XML Sitemap Crawling and Issues Detection
Today we’ve prepared a somewhat unusual update, but before going into the details, let me ask you one question: how do you usually brief your developers on creating an XML sitemap?
The requirements specification often contains only a list of the necessary directories, categories, and pages, plus links to, for example, Google’s documentation on sitemaps and the Standard Sitemap Protocol. So another question logically arises: have you ever fully checked an XML Sitemap implementation against the search engines’ recommendations and the requirements of the standard protocol?
That is what we’re going to talk about in this review, so get comfortable: there’s a great deal of important information and a few pleasant surprises ahead.
1. XML Sitemap Crawling
XML Sitemaps exist to increase the chance that pages which are hard for search engine robots to find while crawling a website get indexed. They matter most for huge web portals and online stores. That is why we are glad to present a new Netpeak Spider feature: the ‘XML Sitemap’ crawling mode, which lets you quickly determine the final number of URLs to be checked for issues. This way you avoid a resource-intensive crawl of the whole website when the total number of its pages cannot be predicted.
Crawling in ‘XML Sitemap’ mode works as follows:
- one dedicated thread crawls the XML Sitemap (Sitemaps can be very large) and hands the URLs over to the remaining threads
- the remaining threads (their number depends on your settings; the default is five) check all the selected parameters and detect issues
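The division of labor described above resembles a classic producer/consumer pattern. Below is a minimal Python sketch of that idea, not Netpeak Spider’s actual code: the function names, the in-memory URL list, and the trivial `check_url` check are all hypothetical stand-ins.

```python
import queue
import threading

SENTINEL = None  # marks the end of the URL stream


def read_sitemap(urls, url_queue, worker_count):
    """Producer: the single thread that walks the sitemap and hands over URLs."""
    for url in urls:
        url_queue.put(url)
    # One sentinel per worker so that every worker knows when to stop.
    for _ in range(worker_count):
        url_queue.put(SENTINEL)


def check_url(url):
    """Hypothetical check; a real crawler would fetch the page and inspect it."""
    return {"url": url, "is_https": url.startswith("https://")}


def worker(url_queue, results, lock):
    """Consumer: takes URLs from the queue and checks them one by one."""
    while True:
        url = url_queue.get()
        if url is SENTINEL:
            break
        result = check_url(url)
        with lock:
            results.append(result)


def crawl(urls, worker_count=5):
    """Run one producer thread and `worker_count` checking threads."""
    url_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(url_queue, results, lock))
        for _ in range(worker_count)
    ]
    for thread in workers:
        thread.start()
    producer = threading.Thread(
        target=read_sitemap, args=(urls, url_queue, worker_count)
    )
    producer.start()
    producer.join()
    for thread in workers:
        thread.join()
    return results
```

Because the producer only pushes URLs into a queue, the workers can start checking pages while a large sitemap is still being read.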
Note that in this mode Netpeak Spider doesn’t crawl every page of the website; it scans only the URLs listed in the specified sitemap. So make sure the sitemap contains no links to 4xx error pages, 3xx redirects, non-canonical URLs, or disallowed pages.
To make the new crawling mode simple and convenient, we’ve implemented automatic detection of the Sitemap type and adaptive handling of the data. In total, Netpeak Spider handles three Sitemap types:
- XML Sitemap file → a standard XML file with the list of website URLs
- XML Sitemap index file → an XML file that lists standard XML Sitemap files
- TXT Sitemap → yes, it happens :) Here every line must contain exactly one URL, and every URL must begin with the protocol (http / https)
Thus, after choosing the ‘XML Sitemap’ crawling mode, you just enter the initial URL, and Netpeak Spider adjusts its further behavior to the file type.
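To illustrate how the three types can be told apart, here is a rough Python sketch. It assumes the file content is already downloaded as a string and relies on the root element names defined by the Sitemap protocol (`urlset` and `sitemapindex`); the function name and return values are made up for this example and say nothing about how Netpeak Spider does it internally.

```python
import xml.etree.ElementTree as ET


def detect_sitemap_type(content: str) -> str:
    """Return 'sitemap', 'sitemap_index', 'txt', or 'unknown'."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        root = None  # not well-formed XML, so try the TXT format below

    if root is not None:
        # Drop the namespace prefix, e.g. '{http://...sitemap/0.9}urlset'.
        tag = root.tag.rsplit("}", 1)[-1]
        if tag == "urlset":
            return "sitemap"
        if tag == "sitemapindex":
            return "sitemap_index"
        return "unknown"

    # TXT Sitemap: one URL per line, each starting with the protocol.
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    if lines and all(
        line.startswith(("http://", "https://")) and " " not in line
        for line in lines
    ):
        return "txt"
    return "unknown"
```

For example, `detect_sitemap_type("https://example.com/\nhttps://example.com/a\n")` falls through the XML branch and is recognized as a TXT Sitemap.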
When the Sitemap crawl is completed, the ‘XML Sitemap Overview’ window opens by default (you can toggle automatic opening either with the corresponding checkbox in this window or on the ‘General’ tab of the tool settings). In this window, you’ll find the crawling results and some pleasant features we’re going to talk about next.
2. XML Sitemap Issues Detection
‘XML Sitemap Overview’ is a unique report similar to the main tool interface: the results table is on the left, and a panel with Sitemap issues is on the right. The issues are detected based on the official Standard Sitemap Protocol documentation and the validation schemas for Sitemap files and Sitemap index files, which are supported by the Google, Bing, and Yahoo! search engines.
Overall, Netpeak Spider detects more than 20 Sitemap issues, including the following:
- basic Sitemap file issues, ranging from disallowed URLs and cross submission errors to exceeding the maximum number of URLs in a sitemap or its maximum size. Hover over an issue to see a hint on how it’s detected.
- validation issues found by checking the file against the validation schemas mentioned above. The result is an issues log with ‘Error’ or ‘Warning’ severity; we recommend forwarding such issues straight to the website developers. To see these issues, double-click the corresponding value in the ‘Validation Issues’ column.
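As an illustration of the two limit checks mentioned above, here is a small Python sketch. The limits themselves come from the Sitemap protocol (at most 50,000 URLs and 50 MB uncompressed per file); the function name, its parameters, and the message wording are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

MAX_URLS = 50_000               # protocol limit per Sitemap file
MAX_BYTES = 50 * 1024 * 1024    # 52,428,800 bytes, uncompressed
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def check_limits(xml_bytes, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Return a list of human-readable limit violations (empty if none)."""
    issues = []
    size = len(xml_bytes)
    if size > max_bytes:
        issues.append(f"file too large: {size} > {max_bytes} bytes")

    # Count the <url> entries directly under the <urlset> root.
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall("sm:url", NS))
    if url_count > max_urls:
        issues.append(f"too many URLs: {url_count} > {max_urls}")
    return issues
```

The limits are exposed as parameters only so that the check is easy to try on a tiny file; in practice the protocol defaults apply.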
Pay attention to one of the most common issues, which relates to the Sitemap file location: roughly speaking, the protocol (http / https), subdomain (www / non-www), and file directory (e.g., http://example.com/blog/sitemap.xml) determine which URLs can be added to this Sitemap file. URLs that don’t pass this check are marked as ‘Disallowed’ in the results table and aren’t crawled at all.
This follows the official standard, which states that such URLs are excluded from further checking for security reasons. For instance:
- if access privileges in your company are set up so that write access is granted separately for different directories, each Sitemap file located in a given site directory must contain only URLs from that directory
- if you need to host a Sitemap file for one host on another host, you’ll have to confirm your rights to manage the host it points to by referencing the Sitemap in that host’s robots.txt file, in order to avoid a cross submission error
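The location rule can be expressed in a few lines of Python. This is a simplified sketch under the assumption that only protocol, host, and directory are compared; it deliberately ignores the robots.txt exception for cross submission, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def is_allowed_by_location(sitemap_url: str, page_url: str) -> bool:
    """Check whether page_url may be listed in the Sitemap at sitemap_url."""
    sitemap = urlsplit(sitemap_url)
    page = urlsplit(page_url)

    # Protocol and host (including www / non-www) must match exactly.
    if (sitemap.scheme, sitemap.netloc.lower()) != (page.scheme, page.netloc.lower()):
        return False

    # The page must sit in the Sitemap's directory or below it.
    directory = sitemap.path.rsplit("/", 1)[0] + "/"
    page_path = page.path or "/"
    return page_path.startswith(directory)
```

With the Sitemap at http://example.com/blog/sitemap.xml, a URL such as http://example.com/blog/post passes, while http://example.com/about, https://example.com/blog/post, and http://www.example.com/blog/post do not.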
If you’ve found problems in your Sitemap, fix them without delay, and then let’s move on to the next point.
3. Sending Sitemap File to the Search Engines
There are three ways to inform a search engine that your website has a Sitemap:
- submit a link to the file via the search engine's submission interface
- specify its location in the site's robots.txt file
- send an HTTP request directly to the search engine
The last way is what we’ve implemented in the new version of Netpeak Spider. This HTTP request is called a ‘Ping’: with it, we send the link to the Sitemap file straight to the Google and/or Bing search engines, having first checked that access to the file isn’t blocked in robots.txt. If the crawled website and its Sitemap change regularly, it’s worth checking the Sitemap regularly and sending a ‘Ping’ to the search engines so that new pages get indexed faster.
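The two steps behind a ‘Ping’ (the robots.txt check and the request itself) could look roughly like this in Python. The sketch follows the request format from the Sitemap protocol, `<searchengine_URL>/ping?sitemap=<sitemap_url>`, but the endpoint below is a placeholder rather than a real search engine address, and the function names are invented for the example. Check each search engine’s current documentation to see whether it accepts such requests.

```python
from urllib.parse import urlencode
from urllib.robotparser import RobotFileParser


def sitemap_is_crawlable(robots_txt: str, sitemap_url: str, user_agent: str = "*") -> bool:
    """Return True if robots.txt does not block access to the Sitemap file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, sitemap_url)


def build_ping_url(ping_endpoint: str, sitemap_url: str) -> str:
    """Build the ping request URL; the Sitemap URL must be URL-encoded."""
    return f"{ping_endpoint}?{urlencode({'sitemap': sitemap_url})}"


# Hypothetical values for illustration only.
robots_txt = "User-agent: *\nDisallow: /private/\n"
sitemap_url = "https://example.com/sitemap.xml"

if sitemap_is_crawlable(robots_txt, sitemap_url):
    ping_url = build_ping_url("https://searchengine.example/ping", sitemap_url)
    # A real tool would now send an HTTP GET request to ping_url.
```

Building the URL separately from sending it keeps the sketch free of network calls and makes the encoding step easy to inspect.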
Time to sum it up
We’ve worked hard, and now we are happy to share a unique tool that will help you:
- check the Sitemap file for issues
- audit all the URLs in the Sitemap file
- correct all the issues found (this part is up to you)
- check the implemented changes
- send the updated Sitemap file straight to the search engines
That’s all we wanted to tell you about… for now, since we’ll be back very soon with a new long-awaited feature. Stay with us to be the first to learn about the new updates!