Page Rank#

1. Web page organization in the past#

  • Web pages were manually curated and organized.

  • Does not scale.

../_images/011.jpg
  • Next, web search engines were developed: information retrieval

  • Information retrieval focuses on finding documents from a trusted set.

../_images/021.jpg
  • Challenges for web search:

    • Who to trust given the massive variety of information sources?

      • Trustworthy pages may point to each other.

    • What is the best answer to a keyword search (given a context)?

      • Pages using the keyword in the same contexts tend to point to each other.

    • All pages are not equally important.

  • Web pages/sites on the Internet can be represented as a graph:

    • Web pages/sites are nodes on the graph.

    • Links from one page to another represents the incoming/outgoing connections between nodes.

    • We can compute the importance of nodes in this graph based on the distribution/intensity of these links.

2. PageRank: initial formulation#

  • Link as votes:

    • A page (node) is more important if it has more links.

      • Incoming or outgoing?

    • Think of incoming links as votes.

  • Are all incoming links equals?

    • Links from important pages count more.

../_images/036.png
  • Each link’s vote is proportional to the importance of its source page.

  • If page j with importance rj has n outgoing links, each outgoing link has rj/n votes.

  • The importance of page j is the sum of the votes on its incoming links.

../_images/045.png

3. PageRank: the flow model#

  • Summary:

    • A vote from an important page is worth more.

    • A page is important if it is linked to by other important pages.

../_images/054.png
  • Flow equations:

    • ry = ry/2 + ra/2

    • ra = ry/2 + rm

    • rm = ra/2

  • General equation for calculating rank rj for page j:

../_images/063.png
  • Three equations (actually two), three unknown.

    • No unique solutions

  • Additional constraint to force uniqueness:

    • ry + ra + rm = 1

  • Gaussian elimination:

    • ry = 2/5, ra = 2/5, rm = 1/5

  • Does not scale to Internet-size!

4. PageRank: matrix formulation#

  • Setup the flow equations as a stochastic adjacency matrix M

  • Matrix M of size N: N is the number of nodes in the graph (web pages on the Internet).

    • Let page i has *di outgoing links.

    • If there is an outgoing link from i to j, then

      • Mij = 1/di

      • else Mij = 0

    • Stochastic: sum of all values in a column of M is equal to 1.

    • Adjacency: non-zero value indicates the availability of a link.

  • Rank vector r: A vector with an entry per page.

    • The order of pages in the rank vector should be the same as the order of pages in rows and columns of M.

../_images/054.png
  • Flow equations:

    • ry = ry/2 + ra/2

    • ra = ry/2 + rm

    • rm = ra/2

  • General equation for calculating rank rj for page j with regard to all pages:

../_images/073.png
  • This can be rewritten in matrix form:

../_images/083.png
  • Visualization:

    • Suppose page i has importance *ri and has outgoing links to three other pages, including page j.

../_images/093.png
  • Final perspective:

    • Also, this is why we need to study advanced math …

../_images/103.png

5. PageRank: power iteration#

  • We want to find the page rank value r

  • Power method: an iterative scheme.

    • Support there are N web pages on the Internet.

    • All web pages start out with same importance: 1/N.

    • Rank vector r: r(0) = [1/N, …,1/N]T

    • Iterate: r(t+1) = M • r

    • Stopping condition: r(t+1) - r(t) < some small positive error threshold e.

  • Example:

../_images/103.png ../_images/112.png

6. PageRank: the Google formulation#

../_images/073.png ../_images/083.png
  • The three questions

    • Does the above equation converge?

    • Does it converge to what we want?.

    • Are the results reasonable?

Challenge 6.1: Does it converge:#

  • The spider trap problem

  • Build the stochastic adjacency matrix for the following: graph and calculate the ranking for a and b.

../_images/122.png
Solution
../_images/131.png

Challenge 6.2: Does it converge to what we want?:#

  • The dead end problem

  • Build the stochastic adjacency matrix for the following: graph and calculate the ranking for a and b.

../_images/141.png
Solution
../_images/151.png
  • The solution: random teleport.

  • At each time step, the random surfer has two options:

    • A probably of β to follow an out-going link at random.

    • A probably of 1 - β to jump to some random page.

    • The value of β are between 0.8 to 0.9.

  • The surfer will teleport out of the spider trap within a few time steps.

  • The surfer will definitely teleport out of the dead-end.

../_images/161.png
  • Final equation for PageRank:

../_images/171.png

7. Hands-on: Page Rank in Spark#

  • Create a file called small_graph.dat with the following contents:

y y
y a
a y
a m
m a
  • This data file describes the graph in the easlier slides.

8. Hands-on: Page Rank in Spark - Hollins dataset#

  • Import the Hollins dataset into your Kaggle.

  • Hollins University webbot crawl in 2004.

  • Which page is most important (internally).