On Google's PageRank
PageRank (PR) is almost entirely about incoming links and their quality. An accurate PageRank determination therefore requires detailed knowledge of the PageRank of each and every back-linking page. Calculating the PageRank of a high-profile site such as cnn.com would thus be a formidable task.
Statistically, however, if you have many incoming links, they tend to have the same rank distribution (many low quality links and fewer high quality links, and so forth). So in principle, we could try to estimate the PageRank using just the number of inbound links to a site, $N_{in}$. Moreover, since PageRank is logarithmic in the number of inbound links, we could for example approximate it as:
$$ PR \approx f(N_{in}) \approx a \log_{10}(N_{in}) $$
This approximation is already not too bad, but one can do somewhat better, as we'll see below.
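As a rough illustration, the approximation can be written as a few lines of Python. This is only a sketch: the function name `estimate_pagerank` is made up for this example, the default `a=1.12` is the slope quoted for the linear fit later in this post, and the rounding and clamping to an integer between 0 and 10 are my assumptions about how the estimate should be mapped onto the toolbar scale.

```python
import math

def estimate_pagerank(n_inbound, a=1.12):
    """Estimate PR as a * log10(N_in), following the approximation above.

    The zero-link fallback, the rounding and the 0-10 clamp are
    assumptions made for this sketch, not part of the formula itself.
    """
    if n_inbound < 1:
        return 0  # log10 is undefined for zero links; treat as PR 0
    raw = a * math.log10(n_inbound)
    # Round to the nearest integer and keep it inside the 0-10 range.
    return max(0, min(10, round(raw)))

print(estimate_pagerank(1000))     # a site with a thousand inbound links
print(estimate_pagerank(10 ** 6))  # a site with a million inbound links
```

The coefficient `a` is left as a parameter so that a different fitted value can be plugged in.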
Figure 1 - Actual PageRank vs. inbound site links, for 120 random sites. The horizontal scale is logarithmic. The nearly linear relation implies that the Google PR is logarithmic as well. The linear fit gives a logarithm base of about 8.
The linear fit has a slope of 1.12 (which implies one additional PR unit for every factor of 7.8 increase in the number of inbound links). By comparing the fit to the actual PR data, one finds that the standard deviation in PR is 1.2 PR units, and also that for about 82% of the sites the predicted PR is either the correct one or off by +/-1 PR unit.
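For readers who want to reproduce this kind of fit, here is a minimal sketch of an ordinary least-squares line through PR versus $\log_{10}(N_{in})$, together with the conversion from slope to "factor per PR unit". The function names are hypothetical, the fit includes an intercept (which the simple formula above omits), and the sample points are synthetic values placed exactly on a line of slope 1.12, not the 120 sites used in this post.

```python
import math

def fit_log_linear(inbound_counts, pageranks):
    """Least-squares fit of PR = slope * log10(N_in) + intercept."""
    xs = [math.log10(n) for n in inbound_counts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(pageranks) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, pageranks))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def links_factor_per_pr_unit(slope):
    """Factor by which N_in must grow for the fitted PR to rise by one."""
    return 10 ** (1 / slope)

# Synthetic illustration only: points lying exactly on slope 1.12.
counts = [10, 100, 1000, 10000]
prs = [1.12 * math.log10(c) for c in counts]

slope, intercept = fit_log_linear(counts, prs)
print(round(slope, 2), round(links_factor_per_pr_unit(slope), 1))
```

A slope of 1.12 corresponds to a factor of $10^{1/1.12} \approx 7.8$, which is where the "base 8 or so" comes from.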
Figure 2 - Histogram of the predicted PR minus the actual PR for 120 random sites. We see that in just over 50% of the cases, the predicted PR is the same as the real one, while in an extra 40%, it is within one PR unit of the real value.
The standard deviation obtained with the improved fit is only 0.85, and 91% of the sites have a predicted PR which is the correct one or +/-1 PR unit. It is probably impossible to obtain a notably better fit with the data I use. To improve the PR prediction, one would in principle require more data, such as the actual PR of the back-linked pages.
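The two quality measures quoted here (the spread of the prediction error and the share of sites predicted to within one PR unit) can be computed along the following lines. This is a sketch under assumptions: I take the "standard deviation" to be the root-mean-square of predicted minus actual PR, the function name `residual_stats` is hypothetical, and the five sample values are invented purely to show the calculation.

```python
import math

def residual_stats(predicted, actual):
    """Return (rms error, fraction exact, fraction within +/-1 PR unit)."""
    residuals = [p - a for p, a in zip(predicted, actual)]
    count = len(residuals)
    # Root-mean-square of the residuals, used here as the spread measure.
    rms = math.sqrt(sum(r * r for r in residuals) / count)
    exact = sum(1 for r in residuals if r == 0) / count
    within_one = sum(1 for r in residuals if abs(r) <= 1) / count
    return rms, exact, within_one

# Invented sample values, not data from this post.
predicted = [3, 4, 5, 6, 2]
actual = [3, 5, 5, 4, 2]

rms, exact, within_one = residual_stats(predicted, actual)
print(rms, exact, within_one)
```

The list of residuals is also exactly what goes into a histogram like the one in Figure 2.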
Here is a calculator to estimate the PageRank of any site you wish.