Netpeak Spider 2.1.1: Viewing Page Source and HTTP Headers + 5 New Issues


This update is aimed at improving your understanding of certain aspects of search optimization and helping you learn more about SEO. Take the time to read this review to find out about all the new features and ‘upgrade’ your Netpeak Spider skills :-)

View Page Source and HTTP Headers

Prepare to be totally surprised by the update!

Crawl any URL; as soon as it appears in the results table, select the row you want and:

✔ right-click to open a context menu → ‘View the source code and HTTP headers’

...or...

✔ press the famous key combination → Ctrl+U

...or...

✔ hold Shift and double-click any link inside the table, or single-click a link outside the table


A window will open where you can choose what options to analyze:

  • general information about the crawled URL
  • redirects, if the URL redirects to another one
  • HTTP response headers
  • HTTP request headers
  • a list of GET parameters, if there are any in the URL
  • the source code of the page
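Extracting the GET parameters from a URL, for instance, is straightforward in any language; here is a minimal Python sketch using only the standard library (the example URL is hypothetical):

```python
from urllib.parse import urlsplit, parse_qs

def get_parameters(url):
    """Return the GET parameters of a URL as a dict of lists ({} if none)."""
    return parse_qs(urlsplit(url).query)

# A hypothetical URL with two GET parameters:
print(get_parameters("https://example.com/search?q=spider&page=2"))
# {'q': ['spider'], 'page': ['2']}
```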

The field for viewing page source will please the eye and the soul with the following features:

✔ Viewing files of different types

In the current version, you can view the page source of the following document types:

  • HTML
  • PlainText (e.g. TXT-file)
  • JavaScript
  • CSS (style files)
  • XML
  • GZIP → notice that Netpeak Spider will extract the files from the archive if you ask kindly :-)
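
Decompressing such an archive is a one-liner in most languages; here is a minimal Python sketch, with a locally built gzip payload standing in for a downloaded sitemap.xml.gz:

```python
import gzip

# Stand-in for a gzipped sitemap fetched from a server:
sitemap_xml = b'<?xml version="1.0"?><urlset><url><loc>https://example.com/</loc></url></urlset>'
archive = gzip.compress(sitemap_xml)

# Extract the original XML before viewing or parsing it:
extracted = gzip.decompress(archive)
print(extracted.decode("utf-8"))
```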

✔ Code highlighting

You don’t need to squint and struggle with the code to understand where the <title> tag is, or the <meta name="description"> tag, or your favourite links → <a href="URL">anchor</a>. You can easily find them all with the help of code highlighting.

Remember that every document type (see the paragraph above) has its specific highlighting. As a result, it is much easier to work with both standard HTML files and XML sitemaps, even if they are compressed in gzip format.

✔ Line numbering and automatic line-break

To see the whole line you don’t need to scroll horizontally anymore: vertical scrolling alone is enough! Line numbers will help to avoid confusion as to where the line begins or ends.

✔ Code search with additional functions

What good is source code if you cannot search it? For this very reason, we’ve introduced a search box, which is enabled by default. If you’ve closed it, you can always reopen it by pressing the familiar key combination Ctrl+F.

Here you’ll find a great number of features that we suggest using in certain situations:

  • highlight a part of the text and hit Ctrl+F → the highlighted text will automatically appear in the search box and the search will start
  • if the case is essential for your search, you can turn on the corresponding parameter in the menu to the right of the search box → tick ‘Match case’
  • there is also an option to search by whole words only: e.g. if you need to find all instances of the word ‘site’ but exclude the forms ‘website’ or ‘websites’ from your search → tick ‘Find whole words only’
  • the most experienced of you will be able to use regular expressions in your search: it is important here to figure out which tasks need your attention and then only imagination is the limit → tick ‘Use regular expressions’
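The ‘Find whole words only’ behaviour maps directly onto regular-expression word boundaries; here is a small Python sketch of the ‘site’ vs ‘website’ example above:

```python
import re

text = "Our site links to a website and two websites."

# A plain substring search also matches inside "website(s)":
print(len(re.findall(r"site", text)))        # 3

# Whole words only: \b word boundaries exclude "website" and "websites":
print(re.findall(r"\bsite\b", text))         # ['site']

# Case-insensitive search ('Match case' unticked):
print(re.findall(r"\bSITE\b", text, re.IGNORECASE))  # ['site']
```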

Note that the results depend on the following crawling settings:

  • User Agent → keep in mind that you can pick a User Agent from a large number of preset templates in the crawling settings
  • response timeout → this parameter is set in the ‘Restrictions’ tab and is 30,000 ms (or 30 seconds) by default
  • the maximum number of redirects → is located in the same tab, is also customizable, and is 5 by default
  • proxy → when ‘Use proxy server’ option is turned on, there appears a corresponding notice at the top of the window
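These settings correspond to standard HTTP client options; a hedged Python sketch of how a custom User Agent and timeout would be applied (the User-Agent string below is purely illustrative, not one of the program’s actual templates):

```python
from urllib.request import Request

TIMEOUT_MS = 30_000  # default response timeout from the 'Restrictions' tab

def build_request(url, user_agent="Mozilla/5.0 (compatible; ExampleBot/1.0)"):
    """Build a request carrying a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("https://example.com/")
print(req.get_header("User-agent"))
# The request would then be sent with e.g. urlopen(req, timeout=TIMEOUT_MS / 1000)
```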

Honestly, we just had to add this extension. We received many questions like ‘Why do I see one thing on my website and a different thing in Netpeak Spider?’ Now it won’t bother you anymore. In any situation, if you cannot be completely sure why the issue is there, you can always see what Netpeak Spider sees and what it has to deal with when analyzing the page :-)

Here are some of our users’ cases:

1) there is some kind of protection on the website → for example, a certain website would check a user’s cookies before allowing access to its content; if this check succeeded, the website would load and you could work with it. Netpeak Spider, though (like search engine robots), doesn’t use cookies by default – in short, it starts every session from scratch – so it couldn’t see the content of the page and showed an error message;

2) some standard HTML tags are missing on the page → in this case, the user would see that the top part of the page has loaded in the browser and believe that the page is ok; Netpeak Spider, however, wouldn’t be able to see the closing tags </body> and </html> just because they wouldn’t be returned from the server;

3) different User Agents receive different content → because of this, the user would see their ‘own’ website while we would see a completely different one;

4) different IP addresses receive different content → if your IP belongs to a certain country, access to the website may suddenly be blocked, or redirects may appear out of nowhere; for this reason, the program shows whether you are using a proxy server.

In any case, the new option will help you quickly spot the difference between visiting a website in a browser and crawling it with Netpeak Spider, and understand why the program cannot access the web resource.

As a result, there is no need to switch to the developer tools in a browser or use third-party online services → almost any task connected to HTTP headers and source code can be solved directly in the interface of Netpeak Spider – right there and then.

Five New Issues

We love finding problems on your website, but we are especially happy when you can get hold of them before the search engines do. We promised to keep extending the list of issues, and we’ve kept our word. Here are the 5 new issues:

1. Invalid HTML Document Structure

Issue severity: error. We discussed this issue earlier, and now we’ve implemented a solution: the program itself flags an HTML document that doesn’t contain the necessary tags:

<html> <head> </head> <body> </body> </html>

And if your pages don’t have these standard tags... well, search engines will probably not think highly of you :-(
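A simplified sketch of such a structure check in Python, using only the standard library (the program’s real logic may differ):

```python
from html.parser import HTMLParser

REQUIRED_TAGS = {"html", "head", "body"}

class TagCollector(HTMLParser):
    """Collect every opening tag seen in the document."""
    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        self.seen.add(tag)

def has_valid_structure(source):
    collector = TagCollector()
    collector.feed(source)
    return REQUIRED_TAGS.issubset(collector.seen)

print(has_valid_structure("<html><head></head><body></body></html>"))  # True
print(has_valid_structure("<div>page without the skeleton</div>"))     # False
```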

2. Links with bad URL format

Issue severity: error. Netpeak Spider has long been able to detect links with a bad format: they are shown in the internal ‘Incoming links’ / ‘Outgoing links’ tables. These are links that do not follow the standard URI syntax:

scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]

Here are some examples of this issue:

  • <a href="http:// example.com/">Link containing a space character after the protocol</a>
  • <a href="https://#">Link without a host</a>

Now the program detects pages containing such links and raises the alarm. To view a complete report on such links, filter the pages with this issue, click the ‘Current table summary’ button, choose ‘Outgoing links’, and set the appropriate filter (Include → URLs with issue → Bad URL format).
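A rough sketch of such a check in Python, covering the two examples above (real validation against the full URI grammar is stricter):

```python
from urllib.parse import urlsplit

def bad_url_format(href):
    """Flag hrefs with unescaped whitespace or an http(s) URL without a host."""
    if any(ch.isspace() for ch in href):
        return True
    parts = urlsplit(href)
    return parts.scheme in ("http", "https") and not parts.netloc

print(bad_url_format("http:// example.com/"))   # True – space after the protocol
print(bad_url_format("https://#"))              # True – no host
print(bad_url_format("https://example.com/"))   # False
```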

3. External 4xx-5xx Errors Pages

Issue severity: error. When external link analysis is enabled (tick ‘Crawl external links’ in the ‘General’ tab of the crawling settings), any external pages returning a 4xx-5xx HTTP status code will be shown in this filter.

Try to avoid such links, so that the visitors of your website stay satisfied and search engines feel as snug as a bug in a rug.
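The underlying check is simply the HTTP status-code class; here is a minimal sketch (the URLs and codes are made up for the example):

```python
def is_broken(status_code):
    """4xx (client error) and 5xx (server error) responses count as issues."""
    return 400 <= status_code <= 599

# Hypothetical crawl results for external links:
statuses = {
    "https://example.com/ok": 200,
    "https://example.com/gone": 404,
    "https://example.com/down": 503,
}
broken = {url: code for url, code in statuses.items() if is_broken(code)}
print(broken)
# {'https://example.com/gone': 404, 'https://example.com/down': 503}
```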

4. External Redirect

Issue severity: warning. If Netpeak Spider detects that the target URL of a redirect is on another website, this filter collects all the source pages of such redirects.

Keep in mind the ‘Crawl all subdomains’ / ‘Consider all subdomains’ setting (the name depends on the crawling mode): it lets you tell the crawler whether subdomain links should be treated as internal or external. If you untick this option, be ready to see the ‘External Redirect’ issue whenever a target URL is on a subdomain.

5. Non-https Protocol

Issue severity: notice. Nowadays it is popular to switch to the more secure HTTPS protocol, and search engines treat it as a separate ranking factor. Since it is so important and at the same time not extremely difficult (especially if you have one domain without numerous subdomains), we decided to mark it as a low-severity, ‘do-not-forget-about-it’ issue.
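Detecting this issue amounts to a scheme check on the URL; a tiny Python sketch:

```python
from urllib.parse import urlsplit

def uses_https(url):
    """True if the URL uses the secure HTTPS scheme."""
    return urlsplit(url).scheme == "https"

print(uses_https("http://example.com/"))   # False – would trigger the notice
print(uses_https("https://example.com/"))  # True
```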

In a nutshell

Dear friends, in Netpeak Spider 2.1.1 we’ve introduced a new option that allows you to view page source, HTTP headers, and additional URL data inside the program. We’ve also added 5 new issues to our list:

  • Invalid HTML Document Structure
  • Links with bad URL format
  • External 4xx-5xx Errors Pages
  • External Redirect
  • Non-https Protocol

This version is a halfway point on the way to Netpeak Spider 2.1.3 – a very important event in the life of our SEO crawler. We are already working on the new features and will be able to please you again soon!

Read this post in Russian