Netpeak Spider 2.1.1.2: Internal PageRank Calculation

Dear friends, we are finally ready to announce a revolutionary function in Netpeak Spider: calculation of the internal PageRank! Nothing is left of the previous mechanism, and the new one relies on the previous update, in which the crawling algorithm was changed completely. We have prepared this blog post as an instruction, which you can also open directly from the window of the new internal PageRank calculation tool.

What is PageRank

PageRank is the relative weight of a page, calculated by the formula:

PR(A) = (1 - d) / N + d * (PR(B) / L(B) + PR(C) / L(C) + ...)

where:

  • B, C, ... are the pages that link to page A
  • N is the total number of active pages that are part of the calculation
  • d is the damping coefficient (usually, it has the value of 0.85)
  • L is the number of outgoing links on the linking page

It is generally accepted that at iteration zero (0) every page has the same PageRank value, equal to 1 / N. At each following iteration, the weight of all incoming links is added up; the weight of one incoming link is the linking page's PageRank from the previous iteration divided by its number of outgoing links (L in the formula).
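The iteration described above can be sketched in a few lines of Python. This is a minimal illustration of the formula, not Netpeak Spider's implementation: the function name `pagerank`, the `links` dictionary, and the three-page example site are assumptions made for the example, and every link target is assumed to be a key of `links`.

```python
def pagerank(links, d=0.85, iterations=15):
    """links maps each page to the list of pages it links to.

    Returns the PageRank values of every iteration, starting with iteration 0.
    """
    pages = list(links)
    n = len(pages)
    # Iteration 0: every page starts with the same weight, 1 / N.
    pr = {page: 1.0 / n for page in pages}
    history = [pr]
    for _ in range(iterations):
        # Every page receives the (1 - d) / N part of the formula.
        nxt = {page: (1.0 - d) / n for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page without outgoing links passes nothing on
            share = pr[source] / len(targets)  # PR(B) / L(B)
            for target in targets:
                nxt[target] += d * share
        pr = nxt
        history.append(pr)
    return history


# A hypothetical three-page site where every page links to the other two.
site = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
result = pagerank(site)[-1]
```

On this fully symmetric site every page keeps the weight 1 / 3 at every iteration, and the values sum to 1.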

We have made several tables to illustrate the algorithmic process:

[Tables: example of an ideal website; example of a real website]

Google computes this parameter for every page on the Internet, whereas Netpeak Spider calculates the internal PageRank, which is limited to the crawled website or the list of URLs.

Why do we need to calculate the internal PageRank

This function is revolutionary at least because it helps you get the following insights into your project:

1. Find out how link juice is distributed across the website and where it is concentrated.

2. Detect pages unnecessary for search engine optimization that receive too much link juice.

3. Discover dead end pages that burn the incoming link juice.

Suppose there are external links pointing to your website: just imagine how much of the SEO budget you can save by introducing a more effective internal link structure.

How is the internal PageRank calculated

There are two ways you can calculate the internal PageRank in Netpeak Spider:

1. Automatic

Go to the ‘Parameters’ tab in the crawling settings and select ‘Internal PageRank’ to automatically calculate the internal PageRank. The calculation will start when the crawl is paused or after it has been successfully completed.

Note that ‘Outgoing Links’ is a required parameter in this case: outgoing links are essential for analyzing link connections, and without them the internal PageRank cannot be calculated.

2. Manual (using a separate tool)

To access the tool, go to ‘Tools’ → ‘Internal PageRank Calculation’.

Here you will find the following:

2.1. Settings that can be applied for both manual and automatic calculations:

  • Iterations [5 – 50] → a higher number of iterations gives a more accurate calculation; however, according to our observations, 15 iterations is optimal because it delivers the necessary results quickly, so Netpeak Spider uses 15 iterations by default
  • Only Internal Links → a setting that disregards all external outgoing links in the calculation
  • Only Links in Tab: [All] / [Filters] → a setting that limits the calculation to the links displayed in the corresponding tab: use [Filters] when you need to calculate PageRank for a particular category of the crawled website
  • Mode → ‘Real’ mode shows the exact results of the PageRank calculation, which can be inconvenient when dealing with a large number of pages; ‘Adaptive’ mode shows the results multiplied by a special coefficient, which is more convenient when dealing with larger websites

Note that if you deselect ‘Only Internal Links’ and ‘Only Links in Tab: [All] / [Filters]’ at the same time, Netpeak Spider will load and analyze all outgoing links from all crawled pages. In this case, the report can contain links with the ‘Not Crawled’ status code; this is done to calculate the internal PageRank as accurately as possible, based on all relevant outgoing links.

2.2. The formula of the internal PageRank calculation, the above-mentioned parameters N and d, and a link to this article.

2.3. List of ignored URLs: you can add a link to this list to completely exclude it from the PageRank analysis. This gives your calculations flexibility, because you can change the internal linking directly inside the program.

Note that the whole node is excluded, not just a single link on a certain page: imagine that there are no links to this page from anywhere on the website (incoming links) and no links from this page to other pages (outgoing links).
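Excluding a whole node can be pictured as removing the page together with every link that points to it or leaves it. The sketch below is only an illustration under that assumption; `exclude_urls` and the example paths are hypothetical names, not part of Netpeak Spider.

```python
def exclude_urls(links, ignored):
    """Remove ignored pages and every link to or from them.

    links maps each page to the list of pages it links to.
    """
    ignored = set(ignored)
    return {
        source: [target for target in targets if target not in ignored]
        for source, targets in links.items()
        if source not in ignored
    }


# Hypothetical site where every page links to a login page.
links = {"/": ["/a", "/login"], "/a": ["/", "/login"], "/login": ["/"]}
pruned = exclude_urls(links, ["/login"])
```

After the call, `pruned` no longer contains `/login` as a page or as a link target, so the remaining pages share their weight only among themselves.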

2.4. Export data from the table in CSV / Excel format.

2.5. Results table that contains the following columns:

  • ‘Pages’ → number in the table (#) and a link to the page
  • ‘Iterations’ → after the calculation has been started, the columns with the information on every iteration will appear here
  • ‘Relations’ → the number of outgoing and incoming links is displayed here; you can open the corresponding report by double-clicking or through the context menu, and then move forward and back with the usual ‘Next’ / ‘Prev’ buttons to explore the whole connection graph
  • ‘Algorithmic analysis’ → the parameters defined by the PageRank algorithm, namely ‘Link Status’ (read below to learn more about this parameter) and ‘Target Link’, which is shown if a redirect was found in the course of the calculation
  • ‘General parameters’ → the response status code and the content type of the corresponding pages
  • ‘Indexation parameters’ → groups the parameters that are critical for link juice distribution: robots.txt instructions, canonical, X-Robots-Tag, meta robots, as well as the redirect target URL and the refresh tag if there are any on the page

In the lower part of the table, the ‘Total PageRank’ is calculated → on every iteration the sum should equal 1 (in ‘Real’ mode) or 10 to a certain power (in ‘Adaptive’ mode). If the sum differs from these values, it is a sign that the crawled website has dead ends and you are losing link juice.
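To see why dead ends make the total drop, here is a small sketch that applies the formula from the beginning of this post to a hypothetical three-page site with one dead end. The helper `step` and the example graph are assumptions for illustration; the values correspond to ‘Real’ mode and are not taken from Netpeak Spider.

```python
def step(pr, links, d=0.85):
    """Run one PageRank iteration over links (page -> outgoing link targets)."""
    n = len(links)
    nxt = {page: (1.0 - d) / n for page in links}
    for source, targets in links.items():
        for target in targets:
            nxt[target] += d * pr[source] / len(targets)
    return nxt


# A links to B and C, B links back to A, C has no outgoing links (dead end).
links = {"A": ["B", "C"], "B": ["A"], "C": []}
pr = {page: 1.0 / len(links) for page in links}

totals = [sum(pr.values())]  # total PageRank at iteration 0
for _ in range(3):
    pr = step(pr, links)
    totals.append(sum(pr.values()))
```

The total is 1 at iteration 0 and stays below 1 afterwards, because the weight that reaches page C is never passed on.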

2.6. Status panel that together with the results table shows all the steps of the algorithmic process, allowing you to see the calculation dynamics.

After exiting the ‘Internal PageRank Calculation’ tool, the results of the last iteration will be automatically placed into the main table of the program into the corresponding column. New data will replace any previous results.

Calculation algorithm

We would like to remind you again that ‘Outgoing Links’ is a required parameter for calculating the internal PageRank. It shows the relations between pages, which makes it possible to take into account the main indexation instructions, link attributes, and link juice distribution.

The whole process consists of 2 consecutive stages:

1. Establishing the connection graph → the goal of this stage is to establish link connections and evaluate their status:

1.1. Loading and filtering links according to the applied settings.

1.2. Initial analysis → categorizing links according to their status ‘OK’, ‘Dead End’ and ‘Redirect’ (read more below about link statuses).

1.3. Loading outgoing links → at this stage all links with a nofollow attribute are excluded and the fragment part of each URL (everything starting with #) is cut off. As a result, only unique links are left for analysis.

1.4. Calculating incoming links.

1.5. Finishing analysis → a detailed analysis of outgoing and incoming links, detection of ‘Target Links’ and ‘Orphan’ links.

2. Internal PageRank calculation → starting with iteration 0 and continuing up to the number of iterations stated in the settings.
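Steps 1.3 and 1.4 of the first stage can be sketched as follows. This is a simplified illustration, not the program's actual code: `build_graph`, the `raw` input format with (href, rel) pairs, and the example.com URLs are all assumptions made for the example.

```python
from urllib.parse import urldefrag


def build_graph(raw):
    """raw maps a page URL to the (href, rel) pairs found on that page.

    Returns two dicts: outgoing links and incoming links per page.
    """
    outgoing = {}
    for page, anchors in raw.items():
        targets = []
        for href, rel in anchors:
            if "nofollow" in rel.lower().split():
                continue  # nofollow links are excluded
            url = urldefrag(href).url  # cut off the "#fragment" part
            if url not in targets:  # keep only unique links
                targets.append(url)
        outgoing[page] = targets

    # Incoming links are the outgoing links seen from the other side.
    incoming = {page: [] for page in outgoing}
    for page, targets in outgoing.items():
        for target in targets:
            incoming.setdefault(target, []).append(page)
    return outgoing, incoming


raw = {
    "https://example.com/": [
        ("https://example.com/a#top", ""),
        ("https://example.com/a", ""),
        ("https://example.com/b", "nofollow"),
    ],
    "https://example.com/a": [("https://example.com/", "")],
    "https://example.com/b": [],
}
outgoing, incoming = build_graph(raw)
```

In this example the home page ends up with a single outgoing link to `/a`, and `/b` has no incoming links because the only link to it is nofollow.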

Link Status

This is the most interesting part of the PageRank algorithm: logically, all links fall into 4 categories:

1. OK

HTML pages with the ‘200 OK’ status code that contain outgoing links and can have:

  • a noindex tag → noindex pages also pass link juice
  • a canonical tag pointing to itself
  • a refresh tag pointing to itself

2. Dead End

Pages that have 0 outgoing links and, as a result, do not pass link juice.

This category includes:

  • 2xx pages that simply do not contain outgoing links
  • 2xx pages blocked by robots.txt
  • 2xx pages with nofollow in X-Robots-Tag and meta robots instructions
  • non-HTML 2xx pages that cannot have outgoing links
  • 3xx links blocked by robots.txt
  • 3xx links with an endless redirect (‘3xx Redirect Loop’ status code)
  • 4xx pages
  • 5xx pages
  • pages that return any other status code
  • redirected pages (canonical and refresh) that haven’t reached their target URL (Endless Redirect)
  • outgoing links that are not displayed in the ‘All’ results table → note that by default, with the ‘Only Internal Links’ and ‘Only Links in Tab: [All] / [Filters]’ settings deselected, Netpeak Spider will try to find all links on the website regardless of the crawling settings; this is necessary to convey a complete and accurate picture of the link juice distribution

3. Redirect

Links that pass all their link juice to the target page (its URL is stated in the ‘Target Link’ column).

This category includes:

  • 3xx pages
  • 2xx pages with a canonical tag pointing to another URL
  • 2xx pages with a refresh tag pointing to another URL
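One way to picture how a ‘Target Link’ is found, and how an endless redirect turns into a dead end, is to follow the redirect targets until a page without a redirect is reached. The sketch below is an assumption about the general idea, not the program's real logic; `resolve_target`, the `redirects` mapping, and the `max_hops` limit are hypothetical.

```python
def resolve_target(url, redirects, max_hops=10):
    """Follow 3xx / canonical / refresh targets starting from url.

    redirects maps a URL to the URL it points to. Returns the final URL,
    or None if the chain loops or is longer than max_hops.
    """
    seen = {url}
    current = url
    for _ in range(max_hops):
        nxt = redirects.get(current)
        if nxt is None or nxt == current:
            return current  # no redirect, or a tag pointing to itself
        if nxt in seen:
            return None  # endless redirect: the target is never reached
        seen.add(nxt)
        current = nxt
    return None


redirects = {"/old": "/mid", "/mid": "/new", "/x": "/y", "/y": "/x"}
final = resolve_target("/old", redirects)
looped = resolve_target("/x", redirects)
```

Here `/old` resolves to `/new` through `/mid`, while `/x` and `/y` point at each other, so no target is found for `/x`.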

4. Orphan

Pages that have no incoming links.

Such pages may appear when:

  • crawling a website with the indexation instructions turned off (robots.txt, canonical, refresh, X-Robots-Tag, meta robots, and nofollow link attribute) → note that if you choose to disregard the indexation instructions, Netpeak Spider will crawl your website differently from search engine robots. The PageRank algorithm, however, always considers these instructions, so some links found during the crawl may be inaccessible to the PageRank algorithm.
  • crawling a list of URLs → the pages in the list may not be linked to each other.

Note that links with this status are not included in the internal PageRank calculation.
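The four categories can be summarized as a small decision function. This is a rough sketch based only on the descriptions above: the dictionary keys, the helper `make_page`, and especially the order in which the checks are applied are assumptions, and the real algorithm may evaluate them differently.

```python
def link_status(page):
    """Return a link status for a page described by a dict (hypothetical keys)."""
    if page["incoming"] == 0:
        return "Orphan"
    if page["blocked_by_robots"] or page["redirect_loop"]:
        return "Dead End"
    target = page["target"]  # 3xx, canonical, or refresh destination, if any
    if target is not None and target != page["url"]:
        return "Redirect"
    passes_link_juice = (
        page["status"] == 200
        and page["is_html"]
        and not page["nofollow"]
        and page["outgoing"] > 0
    )
    return "OK" if passes_link_juice else "Dead End"


def make_page(url, **overrides):
    """Build an example page record; defaults describe a normal HTML page."""
    page = {
        "url": url, "status": 200, "is_html": True,
        "incoming": 1, "outgoing": 3, "target": None,
        "nofollow": False, "blocked_by_robots": False, "redirect_loop": False,
    }
    page.update(overrides)
    return page


statuses = {
    "home": link_status(make_page("/")),
    "pdf": link_status(make_page("/file.pdf", is_html=False, outgoing=0)),
    "moved": link_status(make_page("/old", status=301, target="/new")),
    "missing": link_status(make_page("/gone", status=404, outgoing=0)),
    "unlinked": link_status(make_page("/hidden", incoming=0)),
}
```

With these example records, the home page is ‘OK’, the PDF and the 404 page are ‘Dead End’, the 301 page is ‘Redirect’, and the page without incoming links is ‘Orphan’.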

[Table: link statuses in Netpeak Spider]

3 new issues

Right after the automatic or manual internal PageRank calculation, 3 types of issues will be displayed in the main interface if they have been found on your website:

  • PageRank: Dead End → as stated above, these pages contain no outgoing links and do not pass link juice, creating an imbalance in the link juice distribution across the website
  • PageRank: Redirect → pages that redirect link juice – these can be pages that return a 3xx redirect or have canonical / refresh tag pointing to another URL
  • PageRank: Orphan → these are inaccessible pages that have no incoming links

In a nutshell

Dear friends, we have released the most accurate internal PageRank algorithm, which gives you a number of insights into the crawled website: find out how link juice is distributed across the website, which pages unnecessary for search engine optimization receive excessive link juice, which pages are dead ends, and, finally, how these issues can be corrected.

Check out this new unique function, experiment with various settings and introduce a new, more effective internal link architecture! ;)
