Netpeak Spider 2.1.1.2: Internal PageRank Calculation

Alex Wise, CEO & Founder at Netpeak Software

November 7, 2016

Dear friends, we are finally ready to announce a revolutionary Netpeak Spider feature: internal PageRank calculation! Nothing remains of the previous mechanism, and the new one depends on the previous update, in which the crawling algorithm was changed completely. We have prepared this blog post as a set of instructions, which you can also open directly from the window of the new internal PageRank calculation tool.

What is PageRank

PageRank is a relative weight of a page calculated by the formula:

PR (A) = (1 - d) / N + d * (PR(B) / L(B) + PR(C) / L(C) + ...)

where:

N is the total number of active pages that are part of the calculation
d is the damping coefficient (usually 0.85)
L is the number of outgoing links on the linking page
B, C, ... are the pages that link to page A

It is generally accepted that at iteration zero (0) every page has the same PageRank value, equal to 1 / N. At each subsequent iteration, the weight of all incoming links is summed, where the weight of a link is the linking page's PageRank from the previous iteration divided by its number of outgoing links (L in the formula).
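The iteration described above can be sketched in a few lines of Python. This is a minimal illustration of the formula, not Netpeak Spider's implementation: the function name `pagerank`, the `links` dictionary, and the three-page example site are all assumptions made for the example, and every link target is assumed to be a key of `links`.

```python
def pagerank(links, d=0.85, iterations=15):
    """Return a list of {page: PR} dicts, one per iteration (index 0 = iteration 0).

    `links` maps each page to the list of pages it links to (hypothetical input).
    """
    pages = sorted(links)
    n = len(pages)
    # Iteration 0: every page starts with the same weight, 1 / N.
    pr = {page: 1.0 / n for page in pages}
    history = [pr]
    for _ in range(iterations):
        # The (1 - d) / N term of the formula.
        new_pr = {page: (1.0 - d) / n for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page without outgoing links passes nothing on
            share = pr[source] / len(targets)  # PR(B) / L(B) from the previous iteration
            for target in targets:
                new_pr[target] += d * share
        pr = new_pr
        history.append(pr)
    return history


# Hypothetical three-page site: A links to B and C, B links to C, C links to A.
site = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
history = pagerank(site)
final = history[-1]
```

Here `history[0]` holds the equal starting weights and `history[-1]` the values after the last iteration, which mirrors the per-iteration columns the tool displays.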

We have made several tables to illustrate the algorithmic process:

[Tables: Example of an ideal website; Example of a real website]

Google computes this parameter for every page of the Internet, and Netpeak Spider calculates the internal PageRank, which is limited by the crawled website or the list of URLs.

Why do we need to calculate the internal PageRank

This function is revolutionary, if only because it helps you get the following insights into your project:

1. Find out how link juice is distributed across the website and where it is concentrated.

2. Detect pages unnecessary for search engine optimization that receive too much link juice.

3. Discover dead end pages that burn the incoming link juice.

If there are external links pointing to your website, just imagine how much of your SEO budget you can save by introducing a more effective internal link structure.

How is the internal PageRank calculated

There are two ways you can calculate the internal PageRank in Netpeak Spider:

1. Automatic

Go to the ‘Parameters’ tab in the crawling settings and select ‘Internal PageRank’ to automatically calculate the internal PageRank. The calculation will start when the crawl is paused or after it has been successfully completed.

Note that ‘Outgoing Links’ is a required parameter in this case, since the outgoing links are essential when analyzing linking connections, without which you cannot calculate the internal PageRank.

2. Manual (using a separate tool)

To access the tool, go to ‘Tools’→ ‘Internal PageRank Calculation’.

Here you will find the following:

2.1. Settings that can be applied for both manual and automatic calculations:

Iterations [5 – 50] → a higher number of iterations gives a more accurate calculation; however, according to our observations, 15 iterations is optimal because it gives the necessary results quickly, so Netpeak Spider uses 15 iterations by default
Only Internal Links → a setting that disregards all external outgoing links in the calculation
Only Links in Tab: [All] / [Filters] → a setting that limits the calculation to the links displayed in the corresponding tab: use [Filters] when you need to calculate PageRank for a particular category of the crawled website
Mode → ‘Real’ mode shows the exact results of the PageRank calculation, which can be inconvenient when dealing with a large number of pages; ‘Adaptive’ mode shows the results multiplied by a special coefficient, which is more convenient for larger websites

Note that if you deselect ‘only internal links’ and ‘only links in tab: [All] / [Filters]’ at the same time, Netpeak Spider will load and analyze all outgoing links from all crawled pages. In this case, the report can contain links with ‘Not Crawled’ status code – this is done to calculate the internal PageRank most accurately, based on relevant outgoing links.
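To see why ‘Real’ values become awkward on large websites, consider the sketch below. The article does not state how the ‘Adaptive’ coefficient is chosen, so the rule used here (the smallest power of 10 that is not less than the number of pages) is purely an assumption for illustration, as is the function name `adaptive_scale`.

```python
def adaptive_scale(real_values):
    """Multiply 'Real' PageRank values by a power of 10.

    Assumption: the coefficient is the smallest power of 10 >= number of pages.
    The actual coefficient used by the tool is not documented in this post.
    """
    n = len(real_values)
    coefficient = 1
    while coefficient < n:
        coefficient *= 10
    scaled = {page: value * coefficient for page, value in real_values.items()}
    return coefficient, scaled


# Hypothetical site of 2,500 pages with equal weights (iteration 0).
real = {f"/page-{i}": 1.0 / 2500 for i in range(2500)}
coefficient, scaled = adaptive_scale(real)
```

With 2,500 pages each ‘Real’ value at iteration 0 is 0.0004, while the scaled values under this assumed rule are close to 4 and sum to about 10,000, a power of 10.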

2.2. The formula of the internal PageRank calculation, the above mentioned parameters N and d, and a link to this article.

2.3. List of ignored URLs: you can add a URL to this list to completely exclude it from the PageRank analysis. This gives your calculations flexibility, letting you change the internal linking directly inside the program.

Note that the whole node is excluded, not a single link on a certain page: imagine that there are no links to this page from anywhere on the website (incoming links) and no links from this page to other pages (outgoing links).
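Excluding a whole node can be pictured as follows. This is only a sketch of the idea under assumed names (`ignore_urls`, a `links` dictionary mapping each page to its outgoing links); it is not the tool's own code.

```python
def ignore_urls(links, ignored):
    """Drop ignored pages as nodes: remove them as sources and as targets."""
    ignored = set(ignored)
    return {
        source: [target for target in targets if target not in ignored]
        for source, targets in links.items()
        if source not in ignored
    }


# Hypothetical site in which /cart should not take part in the calculation.
site = {
    "/": ["/catalog", "/cart"],
    "/catalog": ["/", "/cart"],
    "/cart": ["/"],
}
filtered = ignore_urls(site, ["/cart"])
```

After filtering, `/cart` has neither incoming nor outgoing links in the graph, so the remaining pages share the weight between themselves.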

2.4. Export data from the table in CSV / Excel format.

2.5. Results table that contains the following columns:

‘Pages’ → number in the table (#) and a link to the page
‘Iterations’ → after the calculation has been started, the columns with the information on every iteration will appear here
‘Relations’ → the number of outgoing and incoming links, which you can open by double-clicking with the left mouse button or through the context menu: we have developed a convenient way to view these reports, and you can move forward and back with the usual ‘Next’ / ‘Prev’ buttons to explore the whole connection graph
‘Algorithmic analysis’ → the parameters defined by the PageRank algorithm, namely ‘Link Status’ (read below to learn more about this parameter) and ‘Target Link’, which is shown if a redirect was found in the course of the calculation
‘General parameters’ → the response status code and the content type of the corresponding pages
‘Indexation parameters’ → the parameters that are critical for link juice distribution: robots.txt instructions, canonical, X-Robots-Tag, meta robots, as well as the redirect target URL and the refresh tag if there are any on the page

In the lower part of the table the ‘Total PageRank’ is calculated → on every iteration the sum should equal 1 (in the ‘Real’ mode) and 10 to a certain power (in the ‘Adaptive’ mode). If the sum differs from these values, it is a sign that the crawled website has dead ends and you are losing your link juice.
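The effect of dead ends on the total can be reproduced with a small sketch. It assumes, as the article describes, that a page without outgoing links simply passes nothing on; the function name `totals_per_iteration` and both example sites are hypothetical.

```python
def totals_per_iteration(links, d=0.85, iterations=15):
    """Return the sum of all PageRank values at iteration 0, 1, 2, ..."""
    pages = list(links)
    n = len(pages)
    pr = {page: 1.0 / n for page in pages}
    totals = [sum(pr.values())]
    for _ in range(iterations):
        new_pr = {page: (1.0 - d) / n for page in pages}
        for source, targets in links.items():
            # A dead end has an empty target list, so its weight is not passed on.
            for target in targets:
                new_pr[target] += d * pr[source] / len(targets)
        pr = new_pr
        totals.append(sum(pr.values()))
    return totals


# Every page links somewhere: the total stays at 1.
closed_site = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
# Page C is a dead end: part of the weight is lost on every iteration.
leaky_site = {"A": ["B", "C"], "B": ["A"], "C": []}

closed_totals = totals_per_iteration(closed_site, iterations=5)
leaky_totals = totals_per_iteration(leaky_site, iterations=5)
```

In the closed site every total equals 1 (up to rounding), while in the site with a dead end the total drops below 1 from the first iteration onward, which is the signal the ‘Total PageRank’ row is meant to expose.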

2.6. Status panel that together with the results table shows all the steps of the algorithmic process, allowing you to see the calculation dynamics.

After exiting the ‘Internal PageRank Calculation’ tool, the results of the last iteration will be automatically placed into the main table of the program into the corresponding column. New data will replace any previous results.

Calculation algorithm

We would like to remind you again that ‘Outgoing Links’ is a required parameter for calculating the internal PageRank. It shows the relations between pages, which makes it possible to account for the main indexation instructions, link attributes, and link juice distribution.

The whole process consists of 2 consecutive stages:

1. Establishing the connection graph → the goal of this stage is to establish link connections and evaluate their status:

1.1. Loading and filtering links according to the applied settings.

1.2. Initial analysis → categorizing links according to their status ‘OK’, ‘Dead End’ and ‘Redirect’ (read more below about link statuses).

1.3. Loading outgoing links → at this stage all links with a nofollow attribute are excluded and the fragment (everything from the # sign onward) is stripped from each URL. As a result, only unique links are left for analysis.

1.4. Calculating incoming links.

1.5. Finishing analysis → a detailed analysis of outgoing and incoming links, detection of ‘Target Links’ and ‘Orphan’ links.

2. Internal PageRank calculation → starting with iteration 0 and running up to the number of iterations stated in the settings.
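Steps 1.3 and 1.4 of the first stage can be illustrated with a short sketch. The input format (a list of `(source, target, nofollow)` tuples), the function name `build_graph`, and the example URLs are assumptions made for this example; only `urllib.parse.urldefrag` from the Python standard library is a real API.

```python
from urllib.parse import urldefrag


def build_graph(raw_links):
    """Build outgoing and incoming link maps from (source, target, nofollow) tuples."""
    outgoing = {}
    for source, target, nofollow in raw_links:
        outgoing.setdefault(source, [])
        if nofollow:
            continue  # nofollow links are excluded
        target = urldefrag(target).url  # strip the #fragment
        if target not in outgoing[source]:
            outgoing[source].append(target)  # keep unique links only
    # Incoming links are the outgoing links seen from the other side.
    incoming = {}
    for source, targets in outgoing.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return outgoing, incoming


raw = [
    ("https://example.com/", "https://example.com/a#top", False),
    ("https://example.com/", "https://example.com/a", False),
    ("https://example.com/", "https://example.com/b", True),
    ("https://example.com/a", "https://example.com/", False),
]
outgoing, incoming = build_graph(raw)
```

In this example the two links to `/a` collapse into one after the fragment is stripped, and `/b` receives no incoming link because its only link is nofollow.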

Link Status

The most interesting part of the PageRank algorithm: logically, all links fall into 4 categories:

1. OK

HTML-pages with the ‘200 OK’ status code that contain outgoing links and can have:

a noindex tag → noindex pages also pass link juice
a canonical tag pointing to itself
a refresh tag pointing to itself

2. Dead End

Pages that have 0 outgoing links and, as a result, do not pass link juice.

This category includes:

2xx pages that simply do not contain outgoing links
2xx pages blocked by robots.txt
2xx pages with nofollow in X-Robots-Tag and meta robots instructions
non-HTML 2xx pages that cannot have outgoing links
3xx links blocked by robots.txt
3xx links with endless redirect («3xx Redirect Loop» status code)
4xx pages
5xx pages
pages that return any other status code
redirected pages (canonical and refresh) that haven’t reached their target URL (Endless Redirect)
outgoing links that are not displayed in the ‘All’ results table → note that by default, with ‘only internal links’ and ‘only links in tab [All] / [Filters]’ parameters deselected, Netpeak Spider will try to find all links on the website, disregarding the crawling settings – this is necessary to convey the complete and accurate picture of the link juice distribution

3. Redirect

Links that pass all their link juice to the target page (its URL is stated in the ‘Target Link’ column).

This category includes:

3xx pages
2xx pages with a canonical tag pointing to another URL
2xx pages with a refresh tag pointing to another URL

4. Orphan

Pages that have no incoming links.

Such pages may appear when:

crawling a website with the indexation instructions turned off (robots.txt, canonical, refresh, X-Robots-Tag, meta robots, and nofollow link attribute) → note that if you choose to disregard the indexation instructions, Netpeak Spider will crawl your website in a different way than search engine robots do. The PageRank algorithm, however, always considers these instructions, so some links received in the process of the crawl can be inaccessible for the PageRank algorithm.
crawling the list of URLs → links that are not connected with each other.

Note that links with this status are not included in the internal PageRank calculation.
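The four categories can be summarised as a simplified classifier. This is a sketch under stated assumptions: the `Page` fields and the function `link_status` are hypothetical, the order in which the checks are applied is not given in the article and is chosen here for illustration, and an unresolved redirect (for example a redirect loop) is modelled as a page with no `target`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical per-page record; field names are assumptions."""
    url: str
    status_code: int
    is_html: bool = True
    blocked_by_robots: bool = False
    nofollow_directive: bool = False  # nofollow in X-Robots-Tag or meta robots
    outgoing: int = 0                 # number of outgoing links
    incoming: int = 0                 # number of incoming links
    target: Optional[str] = None      # 3xx, canonical, or refresh target URL


def link_status(page):
    """Return 'OK', 'Dead End', 'Redirect', or 'Orphan' (assumed check order)."""
    if page.incoming == 0:
        return "Orphan"
    if page.blocked_by_robots:
        return "Dead End"
    points_elsewhere = page.target is not None and page.target != page.url
    if 300 <= page.status_code < 400:
        # A redirect that never reaches a target is treated as a dead end.
        return "Redirect" if points_elsewhere else "Dead End"
    if not 200 <= page.status_code < 300:
        return "Dead End"  # 4xx, 5xx, and any other status code
    if points_elsewhere:
        return "Redirect"  # canonical or refresh pointing to another URL
    if not page.is_html or page.nofollow_directive or page.outgoing == 0:
        return "Dead End"
    return "OK"


ok_page = Page("/a", 200, outgoing=5, incoming=3)
self_canonical = Page("/b", 200, outgoing=5, incoming=2, target="/b")
moved_page = Page("/old", 301, incoming=2, target="/new")
empty_page = Page("/pdf-list", 200, outgoing=0, incoming=3)
missing_page = Page("/gone", 404, incoming=1)
unlinked_page = Page("/hidden", 200, outgoing=5, incoming=0)
```

For instance, `self_canonical` is classified as OK because its canonical tag points to itself, while `moved_page` is a Redirect that would hand its weight to `/new`.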

Table with Link Statuses in Netpeak Spider

3 new issues

Right after the automatic or manual internal PageRank calculation, 3 types of issues will be displayed in the main interface if they have been found on your website:

PageRank: Dead End → as stated above, these pages contain no outgoing links and do not pass link juice, creating an imbalance in the link juice distribution across the website
PageRank: Redirect → pages that redirect link juice – these can be pages that return a 3xx redirect or have canonical / refresh tag pointing to another URL
PageRank: Orphan → these are inaccessible pages that have no incoming links

In a nutshell

Dear friends, we have released the most accurate internal PageRank algorithm that gives you a number of insights into the crawled website: find out how link juice is distributed across the website, what pages unnecessary for search engine optimization receive excessive link juice, which pages are dead ends, and, finally, how these issues can be corrected.

Check out this new unique function, experiment with various settings and introduce a new, more effective internal link architecture! ;)

Digging This Update? Let's Discuss Netpeak Spider Perks in Person

1. Ok, this explains how this new feature works, but can you explain how the internal PageRank is working in the eyes of search engines?
2. Any suggestions on best practices, or how to manipulate the internal linking for better impact on search results?
3. Any plans on implementing in the future a feature that would illustrate the structure of internal PageRank?
4. Does rel="nofollow" have an impact on internal PageRank?
5. Do menu, footer and sidebar links pass any internal PageRank weight?

Hi there, Thanks for reaching out.
1. When we were implementing this feature, we built its algorithm on the following: the initial global PageRank formula from Google developers; search engine recommendations; recommendations from these search engines' official representatives. Thus, we can say that at least Google handles your website's internal linking in the same (or a very similar) way.
2. In this case, manipulation is hardly possible; however, you can change the website architecture, and analyze and improve the internal linking. There is a lot of information on this matter, and it differs depending on your website.
3. We've also been thinking about this. Let's imagine that this feature is already released and you can see a visual graph with lots of connections like this one → https://dhs.stanford.edu/wp-content/uploads/2012/09/peer_communities.png. How would you handle this data further?
4. Yes, it has a huge impact. If a link has the rel="nofollow" attribute, it doesn't pass PageRank. You can learn more in this post by Matt Cutts https://www.mattcutts.com/blog/pagerank-sculpting/
5. Certainly yes. We're sure that Google detects a link's location and maybe gives links in low-visibility blocks less weight; however, the weight is passed. We're going to make some changes to how Netpeak Spider calculates internal PageRank and, for now, to consider such links in a simple form.
If you have any other questions, feel free to contact me.

Hey! Thanks for the long response, much appreciated.
1. Interesting
2. I see. I was asking this because on this same post, only in Russian (as much as I can read at a kindergarten level), I read that someone asked a similar question and received some long responses such as "There will soon be a separate post on exactly how to use Netpeak Spider's features in practice :)." Therefore I was hoping to get a similar answer in English. :P
3. Precisely that one data picture: I could use it to visualize competitor sites to get a visual impression of how big their sites are. To look for patterns. If the visualization tool had more marking functions, then I could look at what types of internal links are linking there and so on. If this model could be modified to show levels of depth, or only categories or tags, this would be a powerful visualization tool which with time could give users a new sense. The visualizations could also be in two dimensions, like those examples in https://docs.google.com/spreadsheets/d/1XK3NsDnxPfLVAYrdIb7J-udRTWzoaZChKoVrM_-0yk0/edit#gid=0 , where the arrow lines could visualize all similar links as lines from thin to bold. Just a thought, I am not a coder.
4. Thanks
5. Thanks :)

Hey, Sorry if you had to wait some time for our response. We try to handle all the requests as quickly as possible and during the working week we reach out in less than 24 hours. As for the post you've mentioned in point №2, we do plan to prepare some case studies on how to use Netpeak Spider and it will cover not only PageRank feature. Now we can't tell the exact ETA but I'll inform you when the post is released. And as for the visualization, we'll for sure keep your idea in mind and we'll think about the ways of its implementation. Do not hesitate to contact us in case you have any other questions or suggestions.

Hey Mike, Thanks for contacting us. Google has never stopped considering PageRank as a factor for website ranking. They just stopped showing this data. And Netpeak Spider calculates website internal PageRank in the same (or very similar) way as Google does. If you have any other questions, just drop me a line and I'd be glad to help.

No, Amber, they do not. You are leaving out a key ingredient: the PageRank of a page is determined by incoming links from other site pages. That's how PageRank flows from site to site. If you do not know the amount of PageRank in your site, you CANNOT calculate PageRank on any page. As an SEO for over a decade who has taught other people how to do it, I am well aware that Google still uses PageRank internally, but there is now no way whatsoever to calculate internal or external PR without knowing how much PR is coming to your site. This is VERY deceptive advertising and I am surprised that Netpeak would stoop to this. You may get some newbies with this, but calling this PageRank just makes you look bad to professional SEOs.

Mike, at the beginning of the post we specifically described what PageRank basically is and how Google calculates it. We also noted that Netpeak Spider calculates internal PageRank, a totally different metric that considers only the website's internal linking. With the help of this tool, you can get a relative page weight that can correlate with the real weight, since the algorithm is aimed at learning how link juice is distributed across the website, where it is concentrated, and what pages burn the incoming link juice.

"Page relative weight" tool is fine. Claiming the tool calculates PageRank is utterly false and misleading. Here's a simple description, since you guys obviously do not know what is involved with PageRank: "PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites." https://en.wikipedia.org/wiki/PageRank There's no way anyone can calculate internal PageRank with your tool as you claim, because the concentration of PR all depends on incoming links and the PR quality of those links. The strength of those links flows through the pages. Google has turned that data off, so you do not have a clue about actual PageRank and cannot calculate an unknown rank. Do you guys have any professional SEOs on your staff? Because again, this makes you look pretty bad.

Mike, this - "...learn how link juice is distributed across the website, where it is concentrated, and what pages burn the incoming link juice." Everything else what you are claiming is a straw-man fallacy, and a snobbish behavior.

SamuelStewart, thanks for sticking up for us. Mike, our CEO has more than 10 years experience in SEO, so we do have professionals in our team. We're sorry that you don't find our tool useful after all the explanations and, unfortunately, we can do nothing with it. We strongly believe in this algorithm and do not see any difficulties or conflicts. Anyway, thanks for reaching out. We'd be happy to help you with any other issues concerning our tools.

"Everything else what you are claiming is a straw-man fallacy, and a snobbish behavior." If it's snobbish behavior to point out false advertising, then it's snobbish behavior I am quite proud of. Dream as you wish; when you learn more about SEO you will learn that "where it is concentrated" (PageRank) is much more dependent on where the links are coming in and the quality of those links than anything else. Since you (Stewart) do not know that, or what a straw-man fallacy actually is, I leave you to that ignorance, but since Netpeak seems fully committed to misleading advertising, it will no longer have my recommendation to my clients or students.

"Mike, our CEO has more than 10 years experience in SEO, so we do have professionals in our team." Good, Amber, then kindly ask him what the actual PageRank of this page is, since you claim this calculates PageRank. If he can tell you without his nose growing (and I would require confirmation of his guess with Google), I will take him as the greatest SEO alive today. PageRank already has a definition in the SEO community. Trying to claim the name for something else is misleading advertising. The end. Incidentally, lest anyone think I am some drive-by poster: I have taught a couple hundred people how to do SEO in one course alone, and I used and recommended Netpeak, which is why I was surprised to see them use false naming/advertising techniques.