top of page
Search

A formal explanation of the Google PageRank Algorithm

Updated: Sep 10

Warning This is an advanced-level topic, mathematics and probability are required to understand the Google PageRank algorithm.


The Pages are ranked by Google as the probability of being at all times at a specific page, so a more popular page is more likely to be visited by users more often than a less important page.


The concept is quite simple, spouse we have this network of pages:


Network


A network made of 3 pages linking to each other. Page 1 links to Page 2,  Page 2 links to  Page 3 and Page 3 links to Page 1 and Page 2

The image above represents the internal links between pages.

  • Page 1 has a link to page 2

  • Page 2 has a link to page 3

  • Page 3 has a link to page 1 and page 2


In our reality, this is also the case, but with:

  • Millions of pages are connected by links to each other

  • Sometimes groups of pages that only link between their group

  • Pages without links at all to other pages

Something more like this:


Nodes are representing the number of backlinks coming from different pages
A real representation of a network of pages

Dots or nodes are pages, the size of each dot represents visually the number of backlinks on a particular page.


For practical effects, we will focus on the Network image above.


The Google PageRank algorithm would use a random surfer to explore all the pages and rank them as follows:


A is the page to rank

G is each page that links to A


There 2 possible ways for a random surfer to be on a particular page:


  • With a probability of x, the surfer follows a link from one page to another, x is defined as d which is the damping factor that normally is used as 0.85, 85% of probability for a random surfer to follow a link on a page to another page.


  • The other possible way is that 1 - d is the probability that a random surfer picks at random one of the pages in our network, so if d is 0.85 then this probability would be 0.15, 15%

Given these definitions, the PageRank is defined as:


PR(A) = (1 - d) / N + d * Σ (PR(G) / L(G))


Where:

  • PR(A) is the PageRank of page A.

  • d is the damping factor (typically 0.85).

  • N is the total number of pages in the network.

  • Σ represents the summation over all pages G that link to page A.

  • PR(G) is the PageRank of page G.

  • L(G) is the number of outbound links on page G.



Important note: L(G) refers to the number of exit links that bring the surfer to various pages within the network including page A. These exit links are Commonly called BACKLINKS.


To start we must have a list of all pages and for each page, we must have a list of links on that page.


The image above(Network) does for us that, as long as we can visualize these interconnections is possible to start calculating the PR of A.


Notice that PR(G) must be a known value, so to solve this at the start all pages are taken and assigned to them the probability rank of 1 / N in Network this means one world where the event occurs out of the total number of possible worlds (All the pages in the Network)

So this represents an equal probability of landing on a page.


Therefore, the initial conditions are:


PR(page 1) = 1 / 3 0.3333

PR(page 2) = 1 / 3 0.3333

PR(page 3) = 1 / 3 0.3333


Given these conditions, we can now calculate the PR for each page:


PR(A) = (1 - d) / N + d * Σ (PR(G) / L(G))


PR(page 1) = 1 - 0.85 / 3 + 0.85 * Σ [ (0.3333 / 2) ] = 0.1917


0.1917 is a new value that will be used on the next iteration for PR(page 1), moreover later.

PR(page 2) = 1 - 0.85 / 3 + 0.85 * Σ [ (0.3333 / 1) + (0.3333 / 2 ) ] = 0.5


0.5 will be used on the next iteration for PR(page 2)


The last term of the summation (0.3333 / 2 ) is the default rank for page 3 over the total number of exit links (backlinks) on page 3.


PR(page 3) = 1 - 0.85 / 3 + 0.85 * Σ [ (0.3333 / 1) ] = 0.3333


Notice that at this point all page ranks have been updated from previous page ranks thus:


PR(page 1) = 0.1917

PR(page 2) = 0.5

PR(page 3) = 0.3333


This process can now be repeated again, this is what in Computer Science we call an iterative algorithm.


Also if this process is done n times eventually the change between a set of previous PRs and a new set of PRs will be quite small, when this happens the Summation of the equation meets its convergence. This of course is a smart way to define the tolerance of the algorithm, We can program an algorithm that would stop calculating values only and only if the change between New PageRank values and Previous PageRank values is less than a certain value, a sober value would be something like 0.001.


At this point, the level tolerance has been met and the algorithm can stop iterating to rank pages. All pages have been ranked by their probabilities.


Just to be clear the change can be calculated as

Change = New PageRanks - Previous PageRanks


In conclusion: this is how Google is right now ranking pages The given reasons in this article is the importance of understanding backlinks, especially well-made backlinks those that link to your page without having you link back to their pages, but also coming from likely pages


bottom of page