Page ranking algorithm and its implementation

Python Methods and Functions

PageRank (PR) — it is the algorithm used by Google search to rank sites in search results. PageRank was named after Larry Page, one of the founders of Google. PageRank — it is a way of measuring the importance of the pages on a site. According to Google:

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

This is not the only algorithm used by Google to organize search results, but it is the first algorithm used by the company. and it is the most famous.

The above measure of centrality is not implemented for multigraphs.

Algorithm
PageRank Algorithm displays a probability distribution used to represent the probability that a person accidentally clicks on links to land on any particular page. PageRank can be calculated for collections of documents of any size. Several research papers suggest that the distribution is evenly distributed among all documents in a collection at the beginning of the computational process. PageRank calculations require several passes, called “iterations”, through the collection to adjust the estimated PageRank values ​​to more closely reflect the theoretical true value.

Simplified algorithm
Assume that the small universe consists of four web pages: A, B, C, and D. Links from a page to itself or multiple outbound links from one page to another page are ignored. PageRank is initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank across all pages was the total number of pages on the Internet at that time, so each page in this example would have an initial value of 1. However, later versions of PageRank and For the remainder of this section, assume a probability distribution between 0 and 1 Therefore, the starting value for each page in this example is 0.25.

The PageRank carried over from this page to its outbound link targets in the next iteration is split equally among all outbound links.

If the only links in the system were from pages B, C and D to A, each link would pass 0.25 PageRank to A on the next iteration, for a total of 0.75.


Suppose instead that page B contains a link to pages C and A, page C contains a link to page A, and page D contains links on all three pages. So in the first iteration, page B will pass half of its existing value, or 0.125, to page A, and the other half, or 0.125, to page C. Page C will pass all of its existing value, 0.25, only page, to which he links, A. Since D had three outbound links, it would move one third of its existing value, or approximately 0.083, to A. At the end of this iteration, page A will have a PageRank of approximately 0.458.

In other words, the PageRank assigned to an outbound link is equal to the document's own PageRank divided by the number of outbound links L ().

In general, the PageRank value for any page u can be expressed as:

,

i.e. the PageRank value for page u depends on the PageRank values ​​for each page v contained in the set Bu (the set containing all pages that link to page u) divided by the number L (v) of links from page v. The algorithm includes a damping factor to calculate the page rank. This is similar to the income tax that the government extracts from it, despite the fact that it pays it itself.

Below is the code to calculate the page rank.

def pagerank (G, alpha = 0.85 , personalization = None ,

max_iter = 100 , tol = 1.0e - 6 , nstart = None , weight = 'weight' ,

  dangling = None ) :

& quot; & quot; & quot; Returns the PageRank of nodes in a graph.

 

PageRank calculates the ranking of nodes in column G based on

Structure of incoming links. It was originally designed as a

web page ranking algorithm.

 

parameters

----------

G: graph

Chart NetworkX. Non-directional plots will be converted to oriented plots

a graph with two directed edges for each undirected edge.

 

alpha: float, optional

Damping parameter for PageRank, default = 0.85.

 

personalization: dict, optional

"Personalization vector" consisting of a dictionary with

a key for each graph node and a non-zero personalization value for each node.

Equal distribution is used by default.

  

  max_iter: integer, optional

Maximum number of iterations in the power method for solving eigenvalues.

 

tol: float, optional

used to test convergence in the power-law solver.

 

nstart: dictionary, optional

Initial value iterate PageRank for each node.

 

  weight: key, optional

Edge data key to use as weight. If No, the weights are set to 1.

 

hanging: dict, optional

Highlights should be assigned to any "dangling" nodes, that is, nodes without

any side effects The dict key is the node pointed to by the output and the key

The value is the weight of this output. Dangling nodes are given by default

crashes according to personalization vector (if not

to be specified). This must be selected to result in an irreducible transition

matrix (see notes under google_matrix) ... It may be common to have

the hanging dictate will be the same as the personalization dictate.

 

Returns

-------

Pagerank: dictionary

Dictionary of nodes with PageRank as value

 

Notes

  --- -

Eigenvector calculation is performed by the power iteration method

and has no convergence guarantee. The iteration will stop

after max_iter iterations or error

number_of_nodes (G) * Tol was achieved.

 

The PageRank algorithm was designed for directed graphs, but this

the algorithm does not check the directionality of the input graph and

  execute on undirected graphs by converting each edge to

directed graph with two edges.

 

 

"" "

  if len (G) = = 0 :

  return {}

 

if not G.is_directed ():

D = G.to_directed ()

else :

D = G

  

  # Create a copy in (right) stochastic form

W = nx.stochastic_graph (D, weight = weight)

N = W.number_of_nodes ()

 

# Select a fixed starting vector if not specified

  if nstart is None :

x = dict . fromkeys (W, 1.0 / N)

else :

# Normalized nstart vector

  s = float ( sum (nstart.values ​​()))

  x = dict ((k, v / s) for k, v in nstart.items ())

  

  if personalization is None :

 

  # Assign a single personalization vector if not specified

  p = dict . fromkeys (W , 1.0 / N)

else :

missing = set (G) - set (personalization)

if missing:

raise NetworkXError ( 'Personalization dictionary'

'must have a value for every node ... '

  ' Missing nodes% s' % missing)

s = float ( sum (personalization.values ​​()))

p = dict ((k, v / s) for k , v in personalization.items ())

 

  if dangling is None :

 

# Use personalization vector if no dangling vector is specified

dangling_weights = p

else :

missing = set (G) - set (dangling)

  if missing:

raise NetworkXError ( 'Dangling node dictionary'

'must have a value for every node. '

  ' Missing nodes% s' % missing)

s = float ( sum (dangling.values ​​()))

dangling_weights = dict ((k, v / s) for k , v in dangling.items ())

dangling_nodes = [n for n in W if W.out_degree (n, weight = weight) = = 0.0 ]

 

# power iteration: before max_iter iterations

for _ in range (max_iter):

  xlast = x

x = dict . fromkeys (xlast.keys (), 0 )

danglesum = alpha * sum (xlast [n] for n in dangling_nodes)

for n in x:

 

# this matrix multiplies looks weird because it

# do left multiplication x ^ T = xlast ^ T * W

for nbr in W [n]:

x [nbr] + = alpha * xlast [n] * W [n] [nbr] [weight]

x [n] + = danglesum * dangling_weights [n] + ( 1.0 - alpha) * p [n]

 

# convergence check, l1 norm

err = sum ([ abs (x [n] - xlast [n]) for n in x] )

if err & lt; N * tol:

return x

raise NetworkXError ( 'pagerank: power iteration failed to converge'

'in% d iterations.' % max_iter)

The above code is function that was implemented in the networkx library.
To implement the above in networkx, you will need to do the following:

& gt; & gt; & gt; import networkx as nx

& gt; & gt ; & gt; G = nx.barabasi_albert_graph ( 60 , 41 )

& gt; & gt; & gt; pr = nx.pagerank (G, 0.4 )

& gt; & gt; & gt; pr

Below is the output you get on IDLE after the required settings.

{ 0 : 0.012774147598875784 , 1 : 0.013359655345577266 , 2 : 0.013157355731377924

3 : 0.012142198569313045 , 4 : 0.013160014506830858 , 5 : 0.012973342862730735 ,

6 : 0.012166706783753325 , 7 : 0.011985935451513014 , 8 : 0.012973502696061718

9 : 0.013374146193499381 , 10 : 0.01296354505412387 , 11 : 0.013163220326063332 ,

12 : 0.013368514624403237 , 13 : 0.013169335617283102 , 14 : 0.012752071800520563

15 : 0.012951601882210992 , 16 : 0.013776032065400283 , 17 : 0.012356820581336275

18 : 0.013151652554311779 , 19 : 0.012551059531065245 , 20 : 0.012583415756427995 ,

21 : 0.013574117265891684 , 22 : 0.013167552803671937 , 23 : 0.013165528583400423 ,

24 : 0.012584981049854336 , 25 : 0.013372989228254582 , 26 : 0.012569416076848989 ,

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     27 : 0.013165322299539031 , 28 : 0.012954300960607157 , 29 : 0.012776091973397076 ,

30 : 0.012771016515779594 , 31 : 0.012953404860268598 , 32 : 0.013364947854005844

33