Dear friends, this time we’ll tell you about a new feature of Netpeak Spider and give you clear instructions on how to work with Sitemap files. And for dessert, we’ll present a list of several changes to the program that will make your work even safer and more intuitive.
1. Sitemap Generator
With the new tool you can generate the following types of Sitemaps:
- XML Sitemap → a standard Sitemap file which contains only links to crawled pages and is generated according to the Standard Sitemap Protocol.
- Image Sitemap → a file which contains both links to crawled pages and links to all images that are found on these pages. If the page doesn’t have images, the link to such a page won’t be added to the Image Sitemap. For more information on Image Sitemaps follow this link.
- HTML Sitemap (content) → a special HTML file which contains a list of links to all crawled pages and can be placed into a specific directory of your website.
- TXT Sitemap → a text sitemap with a similar list of links to all crawled pages: a less popular but still relevant way to help search engines fully index your website.
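To make the difference between the formats more tangible, here is a minimal Python sketch of what the simplest XML and TXT Sitemaps look like. It is only an illustration based on the public Sitemap protocol, not the code Netpeak Spider uses; the function names and the example.com URLs are our assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_xml_sitemap(urls):
    """Return a minimal XML Sitemap (only <loc> entries) as a string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def build_txt_sitemap(urls):
    """Return a TXT Sitemap: one URL per line, nothing else."""
    return "\n".join(urls) + "\n"


# Hypothetical list of crawled pages.
pages = ["https://example.com/", "https://example.com/blog/"]
xml_sitemap = build_xml_sitemap(pages)
txt_sitemap = build_txt_sitemap(pages)
```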
We have tried to answer a number of questions you might have when generating Sitemap files:
✔ What pages can be added to a Sitemap?
You can add URLs to a Sitemap if they meet the following conditions:
- HTML or PDF files with '200 OK' status code
- access is allowed by robots.txt
- Canonical Tag is missing or pointing to the same URL
- Meta Refresh is missing or pointing to the same URL
- indexing is allowed by X-Robots-Tag or Meta Robots (index)
- following links is allowed by X-Robots-Tag or Meta Robots (follow)
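The conditions above can be read as a single filter. Below is a rough Python sketch, assuming that a URL has to satisfy every condition at once; the `Page` structure and its field names are hypothetical and do not reflect how Netpeak Spider stores crawl results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical crawl record; the field names are illustrative only."""
    url: str
    status_code: int
    content_type: str                    # e.g. "text/html" or "application/pdf"
    allowed_by_robots_txt: bool
    canonical: Optional[str] = None      # None means the tag is missing
    meta_refresh: Optional[str] = None   # None means the tag is missing
    indexable: bool = True               # X-Robots-Tag / Meta Robots: index
    followable: bool = True              # X-Robots-Tag / Meta Robots: follow


def is_sitemap_candidate(page: Page) -> bool:
    """Return True only if the page passes every condition from the list."""
    return (
        page.status_code == 200
        and page.content_type in ("text/html", "application/pdf")
        and page.allowed_by_robots_txt
        and page.canonical in (None, page.url)
        and page.meta_refresh in (None, page.url)
        and page.indexable
        and page.followable
    )
```

For example, a page whose canonical tag points to a different URL is rejected, even if everything else is fine.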
✔ How are subdomains processed?
According to the official standard, each separate Sitemap can contain links to a single host only. Now, you can either generate a Sitemap for all subdomains (each file will belong to its subdomain alone) or choose a separate subdomain and generate a Sitemap for it.
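The idea of one Sitemap per host can be sketched in a few lines of Python. This is an illustration under our own assumptions (the function name is hypothetical and the host is taken from the URL as is), not the program's actual logic.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Group URLs by host so that each group can go into its own Sitemap."""
    groups = defaultdict(list)
    for url in urls:
        host = urlsplit(url).netloc.lower()
        groups[host].append(url)
    return dict(groups)


crawled = [
    "https://example.com/",
    "https://blog.example.com/post-1/",
    "https://example.com/contacts/",
]
sitemaps_by_host = group_by_host(crawled)
```

Here `sitemaps_by_host` holds two groups, one for example.com and one for blog.example.com, and each group would be written to a separate file.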
✔ Are there any Sitemap Generator settings?
Yes. We’ve left only a few to let you generate files in just a couple of clicks:
- Only URLs in ‘All’ / ‘Filters’ tab → allows flexibility in generating a sitemap: by applying a filter you can generate a sitemap that will contain only specific URLs
- Last modified date → the ‘lastmod’ parameter tells search engines whether the page should be reindexed or whether the content on the page has been left unchanged. You can either choose one of the standard values, enter it manually for all URLs or leave the field empty
- Change frequency → the ‘changefreq’ parameter tells search engines how often the content on the page is changed. You can either choose one of the standard values or leave the field empty
- Priority → the ‘priority’ parameter suggests to search engine robots which URLs to index / reindex first. You don’t have to specify the parameter, although there is an option to set the priority based on either the number of incoming links (in this case the most popular pages will get the highest priority) or the number of outgoing links (in this case your website will be indexed faster, since the pages with the most internal links will get the highest priority)
- Compress into .gz archive (only for XML Sitemap, Image Sitemap, and TXT Sitemap) → we recommend turning this function on as it greatly reduces the size of the generated files and thus the load on your server
- Anchor text source (only for HTML Sitemap) → as link text (anchor) you can choose a URL, a title tag, or an h1 header. To use these parameters, make sure they are turned on in the crawling settings
- Split into parts by the number of URLs (only for HTML Sitemap) → this function lets you split the list of URLs into several files with 100, 500, or 1000 URLs in each
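Two of the settings above, compression and splitting into parts, are easy to picture in code. Here is a minimal Python sketch using only the standard library; the function names are hypothetical, and the program itself may handle these steps differently.

```python
import gzip


def split_into_parts(urls, size):
    """Split a list of URLs into consecutive parts of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def compress_sitemap(content: str) -> bytes:
    """Return the gzip-compressed bytes of a Sitemap, ready to save as .gz."""
    return gzip.compress(content.encode("utf-8"))
```

Sitemaps compress well because the URLs in them share long common prefixes, which is why the .gz option is worth turning on.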
✔ What will the result of the generation be?
As a result, you will get a folder which contains the Sitemap files you wanted to generate. Each file has a particular name. XML Sitemap and Image Sitemap files are minimized to occupy less space and include more URLs. The links inside the files are sorted by the number of URL segments: pages with the fewest segments come first.
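Sorting by the number of URL segments can be sketched like this in Python. We assume that a segment is a non-empty part of the URL path between slashes and that ties are broken alphabetically; both are our assumptions for the example, not a description of the program's internals.

```python
from urllib.parse import urlsplit


def segment_count(url):
    """Count the non-empty path segments of a URL."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


def sort_by_segments(urls):
    """Sort URLs so that pages with fewer path segments come first."""
    return sorted(urls, key=lambda url: (segment_count(url), url))


ordered = sort_by_segments([
    "https://example.com/blog/post-1/",
    "https://example.com/",
    "https://example.com/blog/",
])
```

With this input the home page comes first, then /blog/, then /blog/post-1/.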
Netpeak Spider will automatically calculate when a sitemap index file needs to be created. In this case, index files will be generated separately and will contain links to the standard XML Sitemaps of the website.
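For reference, the Sitemap protocol allows up to 50,000 URLs in a single file, and a sitemap index is the standard way to go beyond that. The Python sketch below shows one possible way to decide when an index is needed; the file names and the function are hypothetical, and the sketch checks only the URL count, not the file size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # limit from the Sitemap protocol


def plan_sitemap_files(urls, base_url, limit=MAX_URLS_PER_SITEMAP):
    """Return (files, index_xml).

    `files` maps a hypothetical file name to the URLs it should contain.
    `index_xml` is None when a single Sitemap file is enough.
    """
    chunks = [urls[i:i + limit] for i in range(0, len(urls), limit)]
    if len(chunks) <= 1:
        return {"sitemap.xml": urls}, None

    files = {f"sitemap-{n}.xml": chunk for n, chunk in enumerate(chunks, start=1)}
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in files:
        sitemap_el = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap_el, "loc").text = base_url.rstrip("/") + "/" + name
    return files, ET.tostring(index, encoding="unicode")
```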
✔ Have you considered the changes in the official standard?
If you have any questions we haven’t answered – please leave them in the comments! :-)
2. Instructions for working with XML Sitemaps
The new tool completes the full cycle of work with Sitemap files, so we have prepared instructions on how to quickly create and check an XML Sitemap in Netpeak Spider:
2.1. Website crawling
- select the ‘Entire Website’ crawling mode
- deselect all functions in the ‘General’ tab to maximize crawling speed and crawl the main host URL only
- deselect all parameters in the ‘Parameters’ tab except for the required ones (just click on the ‘Parameters’ checkbox)
- restore the default settings in the ‘Advanced’ tab
- crawl the website
2.2. Sitemap generation
- go to the ‘Tools’ menu at the top right corner of the interface and choose ‘Sitemap Generator’
- select ‘XML Sitemap’ only
- choose the necessary change frequency (gives recommendations to search engines, can lower the load on the server)
- click ‘Generate’
- choose the end folder and click ‘OK’
2.3. Uploading to the website
- after the sitemap is generated, copy the files from the XML Sitemap folder to the root directory of your website
- add the ‘Sitemap’ directive to the robots.txt file with the address of the uploaded sitemap like https://example.com/sitemap.xml or https://example.com/sitemap-index.xml in case of a sitemap index file
2.4. Sitemap check
- choose the ‘XML Sitemap’ crawling mode
- add the address of the uploaded sitemap or the sitemap index file
- click ‘Start’
- open the ‘XML Sitemap Overview’ tool (by default opens automatically)
- make sure that the sitemap doesn’t have any issues (note that each sitemap must contain only links to one particular host and it should be in the root directory of this host)
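If you want to double-check the uploaded files outside the program, the two requirements mentioned in these steps (the ‘Sitemap’ directive in robots.txt and a single host per sitemap) can be verified with a short Python script. This is a rough sketch under our own assumptions: the function names are hypothetical, the inputs are passed in as strings instead of being downloaded, and `site_maps()` requires Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def declared_sitemaps(robots_txt: str):
    """Return the sitemap URLs listed in 'Sitemap' directives of robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []


def foreign_urls(sitemap_xml: str, sitemap_url: str):
    """Return <loc> URLs whose host differs from the host of the sitemap."""
    host = urlsplit(sitemap_url).netloc.lower()
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall(".//sm:loc", NS) if el.text]
    return [url for url in locs if urlsplit(url).netloc.lower() != host]


robots = "User-agent: *\nDisallow:\nSitemap: https://example.com/sitemap.xml\n"
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://blog.example.com/post/</loc></url>"
    "</urlset>"
)
sitemap_urls = declared_sitemaps(robots)
problems = foreign_urls(sample, sitemap_urls[0])
```

In this made-up example `problems` contains the blog.example.com URL, because it belongs to a different host than the sitemap itself.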
2.5. Sending to search engines
- go to ‘Search Engines Ping’ tab in the right panel of the ‘XML Sitemap Overview’ window
- choose the necessary search engines and click ‘Send Sitemap’
- it is also highly recommended to add the link to the generated sitemap to all webmaster tools (Google Search Console, Bing Webmaster, etc.)
3. Other Improvements
The update also includes some more subtle changes that nonetheless make the program more logical and safer:
- we have added a new ‘Last-Modified’ parameter that contains the date and time of the last modification of the file and is used in the Sitemap Generator (if you turn on the corresponding function)
- in the previous version the internal PageRank calculation started automatically when the crawl was paused or successfully completed, which sometimes caused high memory usage when crawling larger sites; that is why we have set a limit of 10,000 results for the automatic calculation: if the crawl produces more results, no automatic calculation takes place, and it can only be run from the internal PageRank tool itself
- restrictions (the maximum number of crawled URLs and the maximum crawling depth) and exclusions (robots.txt instructions, meta robots, crawling rules, etc.) have been removed from the ‘List of URLs’, ‘XML Sitemap’, and ‘Google SERPs’ crawling modes; as a result, if you are using these modes, you can be sure that all pages will be added and no URLs will disappear without a trace
- we have changed the design of the ‘Quick Settings’ button to attract your attention to this menu – remember that you can adjust these settings during the crawl
In a nutshell
In this version of Netpeak Spider we have introduced a tool for generating different types of sitemaps: XML, Image, HTML, and even TXT Sitemap. Now the program covers the complete cycle of work with Sitemap files: you can crawl your website, generate the necessary sitemaps, check the generated files, and send them directly to search engines.
Soon we will tell you about a new feature of Netpeak Spider that will close the season of major program updates and we will finally enjoy some rest… and by resting I mean making a new version of Netpeak Checker! ;-)