## On Google's PageRank

The PR is almost entirely about incoming links and their quality. Thus, an accurate page rank determination requires detailed knowledge of the page rank of each and every back-linked page. It would therefore be a formidable task to calculate the page rank of a high-profile site such as cnn.com.

Statistically, however, if a site has many incoming links, their ranks tend to follow the same distribution (many low-quality links, fewer high-quality links, and so forth). So in principle, we could try to estimate the page rank using just the number of inbound links to a site, $N_{in}$. Moreover, since the PageRank is logarithmic in the number of inbound links, we could for example approximate it as:

$$ PR \approx f(N_{in}) \approx a \log_{10}(N_{in}) $$

This approximation is already not too bad, but one can do somewhat better, as we'll see below.

[Figure: PR as a function of $\log_{10}(N_{in})$.]

One can see from this graph that the slope varies. Hence, the PR is not exactly linear in $\log_{10}(N_{in})$. This could arise for several reasons. For example, Google's algorithm may not be exactly logarithmic, or the assumption that the average quality of the inbound links does not depend on the quality of the site is probably an oversimplification. There may be other reasons as well.

The linear fit has a slope of 1.12, which implies an increase of one PR unit for every factor of 7.8 increase in the number of inbound links. Comparing the fit to the actual PR data, one finds that the standard deviation in PR is 1.2 PR units, and that for about 82% of the sites the predicted PR is either correct or off by just 1 PR unit.
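To make the arithmetic concrete, here is a minimal sketch of the linear estimate in Python. It takes the formula $PR \approx a \log_{10}(N_{in})$ at face value with $a = 1.12$ and no intercept term, which is an assumption, since the full fit is not spelled out here. The function names, the rounding to an integer, and the clamping to a 0 to 10 scale are likewise illustrative choices, not part of the fit itself.

```python
import math

SLOPE = 1.12  # slope of the linear fit quoted in the text


def estimate_pagerank(n_inbound, slope=SLOPE):
    """Rough PR estimate from the number of inbound links (hypothetical helper)."""
    if n_inbound < 1:
        # Assumption: a site with no inbound links gets PR 0.
        return 0
    raw = slope * math.log10(n_inbound)
    # Assumption: round to the nearest integer and clamp to a 0-10 scale.
    return max(0, min(10, round(raw)))


def links_factor_per_pr_unit(slope=SLOPE):
    """Factor by which inbound links must grow to gain one PR unit."""
    return 10 ** (1 / slope)


print(estimate_pagerank(1000))               # 1.12 * 3 = 3.36, rounded to 3
print(round(links_factor_per_pr_unit(), 1))  # 7.8
```

The second function is just the inverse of the slope: one PR unit corresponds to $\Delta \log_{10}(N_{in}) = 1/1.12 \approx 0.89$, that is, a factor of $10^{0.89} \approx 7.8$ in inbound links.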

The improved fit uses $\log_{10}(N_{in})$ and also other available data, such as the number of pages in a site, the back-links as seen by different search engines (which would be differently sensitive to pages of different quality), and the number of links within a site. I will spare you the ugly-looking fitting formula (it has 8 different terms).

The standard deviation obtained with the improved fit is only 0.85, and 91% of the sites have a predicted PR which is the correct one or +/-1 PR unit. It is probably impossible to obtain a notably better fit with the data I use. To improve the PR prediction, one would in principle require more data, such as the actual PR of the back-linked pages.
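The two figures of merit quoted for each fit, the standard deviation of the prediction error and the fraction of sites predicted to within 1 PR unit, could be computed along the following lines. This is only a sketch: the function name is hypothetical, the sample values are made up purely for illustration (they are not the data behind the fits), and the spread is computed here as the root-mean-square error about zero, which is an assumption about how the quoted standard deviation was defined.

```python
import math


def fit_quality(predicted, actual):
    """Return (RMS prediction error, fraction of sites within +/-1 PR unit)."""
    errors = [p - a for p, a in zip(predicted, actual)]
    n = len(errors)
    # Assumption: spread measured as RMS error about zero.
    rms = math.sqrt(sum(e * e for e in errors) / n)
    # Count predictions that are exact or off by at most one PR unit.
    within_one = sum(1 for e in errors if abs(e) <= 1) / n
    return rms, within_one


# Made-up values for illustration only, not data from this post.
predicted = [3, 5, 4, 7, 2]
actual = [4, 5, 6, 7, 2]

rms, frac = fit_quality(predicted, actual)
print(rms, frac)  # 1.0 0.8
```

With real data, `predicted` would hold the fitted PR for each site and `actual` the PR reported for it.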

Here is a calculator to estimate the PageRank of any site you wish.
