PageRank is a patented method to assign a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is also called the PageRank of E and denoted by .
PageRank was developed at Stanford University by Larry Page (hence the name Page-Rank and Malseed, 2005) and Sergey Brin as part of a research project about a new kind of search engine. The project started in 1995 and led to a functional prototype, named Google, in 1998. Shortly after, Page and Brin founded Google Inc., the company behind the Google search engine, which still has PageRank as a key element.
In other words, a PageRank results from a "ballot" among all the other pages on the World Wide Web about how important a page is. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page there is no support for that page.
Numerous academic papers concerning PageRank have been published since Page and Brin's original paper. In practice, the PageRank concept has proven to be vulnerable to manipulation, and extensive research has been devoted to identifying falsely inflated PageRank and ways to ignore links from documents with falsely inflated PageRank.
Important, high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search. Of course, important pages mean nothing to you if they don't match your query. So, Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query.
These eight positions are displayed next to each Website in the Google Directory. cleardot.gif is used for a zero value and a combination of two graphics pos.gif and neg.gif are used for the other 7 values. The pixel widths of the seven values are 5/35, 11/29, 16/24, 22/18, 27/13, 32/8 and 38/2 (pos.gif/neg.gif).
Alternatives to the PageRank algorithm are the HITS algorithm proposed by Jon Kleinberg and the IBM CLEVER project. Many HITS concepts are now incorporated into Teoma and Ask.com.
A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is commonly expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank.
If pages B, C, and D each only link to A, they would each confer 0.25 PageRank to A. All PageRank in this simplistic system would thus gather to A because all links would be pointing to A.
But then suppose page B also has a link to page C, and page D has links to all three pages. The value of the link-votes is divided among all the outbound links on a page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C. Only one third of D
In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the normalized number of outbound links (it is assumed that links to specific URLs only count once per document).
The damping factor is subtracted from 1 (and in some variations of the algorithm, the result is divided by the number of documents in the collection) and this term is then added to the product of (the damping factor and the sum of the incoming PageRank scores).
That is,
or (N = the number of documents in collection)
So any page's PageRank is derived in large part from the PageRanks of other pages. The damping factor adjusts the derived value downward. The second formula above supports the original statement in Page and Brin's paper that "the sum of all PageRanks is one". Unfortunately, however, Page and Brin gave the first formula, which has led to some confusion.
Google recalculates PageRank scores each time it crawls the Web and rebuilds its index. As Google increases the number of documents in its collection, the initial approximation of PageRank decreases for all documents.
The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. It can be understood as a Markov chain in which the states are pages, and the transitions are all equally probable and are the links between pages.
If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. However, the solution is quite simple. If the random surfer arrives at a sink page, it picks another URL at random and continues surfing again.
When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection. Their PageRank scores are therefore divided evenly among all other pages. In other words, to be fair with pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability of usually d = 0.85, estimated from the frequency that an average surfer uses his or her browser's bookmark feature.
So, the equation is as follows:
where are the pages under consideration, is the set of pages that link to , is the number of links coming from page , and N is the total number of pages.
The PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix. This makes PageRank a particularly elegant metric: the eigenvector is
\begin{bmatrix} {(1-d)/ N} \\ {(1-d) / N} \\ \vdots \\ {(1-d) / N} \end{bmatrix}
+ d
\begin{bmatrix} \ell(p_1,p_1) & \ell(p_1,p_2) & \cdots & \ell(p_1,p_N) \\ \ell(p_2,p_1) & \ddots & & \\ \vdots & & \ell(p_i,p_j) & \\ \ell(p_N,p_1) & & & \ell(p_N,p_N) \end{bmatrix}
\mathbf{R}
where the adjacency function is 0 if page does not link to , and normalised such that, for each j
i.e. the elements of each column sum up to 1.
This is a variant of the eigenvector centrality measure used commonly in network analysis.
The values of the PageRank eigenvector are fast to approximate (only a few iterations are needed) and in practice it gives good results.
As a result of Markov theory, it can be shown that the PageRank of a page is the probability of being at that page after lots of clicks. This happens to equal where is the expectation of the number of clicks (or random jumps) required to get from the page back to itself.
The main disadvantage is that it favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing site (a site being a densely connected set of pages). The Google Directory (itself a derivative of the Open Directory Project) is an exception in which PageRank is not used to determine search results rankings.
Several strategies have been proposed to accelerate the computation of PageRank.
Various strategies to manipulate PageRank have been employed in concerted efforts to improve search results rankings and monetize advertising links. These strategies have severely impacted the reliability of the PageRank concept, which seeks to determine which documents are actually highly valued by the Web community.
Google is known to actively penalize link farms and other schemes designed to artificially inflate PageRank. How Google identifies link farms and other PageRank manipulation tools are among Google's trade secrets.
Google's home page is often considered to be automatically rated a 10/10 by the Google Toolbar's PageRank feature, but its PageRank has at times shown a surprising result of only 8/10 (which is lower than other, very few, web pages that are not related to Google) and it seems that this rating was achieved through the PageRank algorithm, and wasn't programmed into the toolbar by Google as constant.
A Web crawler may use Pagerank as one of a number of importance metrics it uses to determine which URL to visit next during a crawl of the web. One of the early working papers which was used in the creation of Google is Efficient crawling through URL ordering, which discusses the use of a number of different importance metrics to determine how deeply, and how much of a site Google will crawl. Pagerank is presented as one of a number of these importance metrics, though there are others listed such as the number of inbound and outbound links for a URL, and the distance from the root directory on a site to the URL.
Google | Search engine optimization | Reputation management
PageRank | PageRank | PageRank | PageRank | PageRank | PageRank | PageRank | PageRank | ページランク | PageRank | PageRank | PageRank | PageRank | PageRank | PageRank
This article is licensed under the GNU Free Documentation License.
It uses material from the
"PageRank".
Home Page • arts • business • computers • games • health • hospitals • home • kids & teens • news • physicians • recreation• reference • regional • science • shopping • society • sports • world