Contents
List of Figures ii
List of Tables iii
Introduction iv
Acknowledgement vi
Abstract ix
List of Glossaries xi
1 Objects’ ranks and applications to WWW 1
1.1 Rank of objects . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Rank for objects different from web page. . . . . . . . . . 2
1.1.2 Linkage usage in search engine . . . . . . . . . . . . . 4
1.2 Basis of PageRank . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Mathematical basis . . . . . . . . . . . . . . . . . . . 11
1.2.2 Practical issues. . . . . . . . . . . . . . . . . . . . . 13
Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 19
2 Some PageRank-related problems 20
2.1 Accelerating problems . . . . . . . . . . . . . . . . . . . . . 20
2.1.1 Related works . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Exploiting block structure of the Web . . . . . . . . . . . 22
2.2 Connected-component PageRank approach . . . . . . . . . . . 30
2.2.1 Initial ideas. . . . . . . . . . . . . . . . . . . . . . . 30
2.2.2 Mathematical basis of CCP . . . . . . . . . . . . . . . 32
2.2.3 On practical side of CCP. . . . . . . . . . . . . . . . . 35
2.3 Spam and spam detection . . . . . . . . . . . . . . . . . . . 37
2.3.1 Introduction to Web spam . . . . . . . . . . . . . . . . 37
2.3.2 Spam techniques . . . . . . . . . . . . . . . . . . . . 38
Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 42
3 Implementations and experimental results 43
3.1 Search engine Nutch . . . . . . . . . . . . . . . . . . . . . 43
3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.1 Infrastructure and data sets . . . . . . . . . . . . . . . 48
3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . 48
Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 58
Conclusions and future works 59
References
81 trang |
Chia sẻ: maiphuongdc | Lượt xem: 1539 | Lượt tải: 0
Bạn đang xem trước 20 trang tài liệu Luận văn Www and the pagerank-Related problems, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
stinction between
pages because it yields quite different results of ranks. Results become more
neutral when value of α decreases. Medium high-ranked pages do not change
much in ranks but the highest and the lowest change quite clearly. For more
specifically information, in Figure 2.3(a)
pi = (0.1973 0.2637 0.2066 0.1584 0.1574 0.0167)
and in Figure 2.3(i)
pi = (0.1681 0.2183 0.1809 0.1720 0.1773 0.0833)
so pi2 goes down to 0.2183 from 0.2637 and pi6 goes up to 0.0833 from 0.0167.
One’s changing rate is about 0.5 and another one is about 0.6. This can lead
us to a different result in searching if α is not carefully chosen. For a better
performance of system, we can choose a small value for α but the result may
be useless. For a usable result, we should choose α near to 1 but the cost may
increase sharply as in Table 1.1.
16
1.2 Basis of PageRank
1 2 3 4 5 6
0
0.2
0.4
(a)
1 2 3 4 5 6
0
0.2
0.4
(b)
1 2 3 4 5 6
0
0.2
0.4
(c)
1 2 3 4 5 6
0
0.2
0.4
(d)
1 2 3 4 5 6
0
0.2
0.4
(e)
1 2 3 4 5 6
0
0.2
0.4
(f)
1 2 3 4 5 6
0
0.2
0.4
(g)
1 2 3 4 5 6
0
0.2
0.4
(h)
1 2 3 4 5 6
0
0.2
0.4
(i)
Figure 1.3: Figures exemplifying PR vector with (a) α = .9; (b) α = .85; (c)
α = .8;
(d) α = .75; (e) α = .7; (f) α = .65; (g) α = .6; (h) α = .55; (i) α = .5
Table 1.1 shows the number of iterations needed to compute PR vector cor-
responding to each value of α.
Value of α .9 .85 .8 .75 .7 .65 .6 .55 .5
Iterations 18 16 14 12 12 10 10 8 8
Table 1.1: Iterations needed for each value of decay-factor α of SmallSet
Iterations needed for α = .9 is as double as iterations needed for α = .5.
As Figure 1.4 shows, for the biggest value of α chosen, the cost is remarkable
but if reference back Figure 1.3 the result is relatively better because we can
discriminate pages’ ranks. For the smallest value of α, the number of spending
iterations is smaller and therefore time taken is much lower but the result may
be not really good.
17
1.2 Basis of PageRank
0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9
8
10
12
14
16
18
Figure 1.4: Graph of iterations needed with α ∈ [.5; .9] of SmallSet
Algorithm 1.2 will discuss about the PR computation in practical use.
18
1.2 Basis of PageRank
Algorithm 1.2 Iterative method for practical PageRank computation
1: function PRACTICALPAGERANK(G) . G is the Web graph
2: α← any decay value . α is often chosen .85
3: tmp← any random vector . vector tmp is often assigned by
(
1
n
)
1×n
4: ← any threshold value . value to determine if pi converges
5: for indexRow ← 1÷ n do
6: remove rank leak and rank sink nodes
7: end for
8: construct P˜
9: while ||pi − tmp||L > do . makes pi converge to the principal
eigenvector
10: pi ← tmp
11: pi ← tmp× P˜
12: end while
13: return pi . the final PR vector
14: end function
In fact, all rank sink and rank leak nodes are removed by making them
connected by other nodes before constructing matrix P˜ . After that, we apply
the same process as in Algorithm 1.1 to get the final result. And, Algorithm
1.2 also finals this chapter.
Conclusion of chapter
In this chapter, we have gone through a key idea for making an order to objects
in our world. Objects in a database should be ranked, scientific papers should
be ranked and web pages must be ranked. Moreover, the most important thing
it brings us that we can put everything in order to easily mine for knowledge.
We also navigated about PageRank, a method used for ranking web pages,
completely about its base and arisen problems in practical usage. In the next
chapter, we will examine some matters rounding accelerating and spamming
problems.
19
CHAPTER
Some PageRank-related problems
PageRank is a famous occurrence, a trademark and a tool from
which Google benefits much. Because it is famous, many scientists
pay attention and as a result, there are many ways to tweak PageR-
ank up. Some accelerating methods for PageRank will be shown in
this chapter as a proof that PageRank is one of the biggest current
affairs in Internet-related fields. Furthermore, because a page can
easily exploit PageRank to get higher rank so we must deal with a
problem, which is called spam4.
2.1 Accelerating problems
PageRank computation is a big problem which involves many fields of study
such as storage capacity, computational complexity... In this section, we will
discuss one of the major problems rising with the computation of PageRank
which is so-called accelerating problems.
4This kind of spam is different from mail spam
20
2.1 Accelerating problems
2.1.1 Related works
The Web is becoming a massive repository for delivering information. Recent
studies [8] show that most of the users access the Web by means of a search
engine. Nielsen/NetRatings, one of the leading Internet and digital media au-
dience information and analysis services, reports that there were an estimated
151 million active Internet users in the US, for January 2004. Of these, 76%
used a search engine at least once during the month and the average time
spent searching was about 40 minutes. Therefore, to access something on In-
ternet the easiest way is to insert some keywords on a favorite search engine
and to jump to the selected Web site returned as answer. For these reasons,
we witnessed to a growing interest and in large investment efforts to improve
the quality of Web ranking algorithms. Besides, the research community has
devoted an increased attention to reduce the computation time needed by Web
ranking algorithms. Indeed, a particular attention was devoted to improve
PageRank [6, 16], the well known ranking algorithm used by Google. The
core of PageRank exploits an iterative weight assignment of ranks to the Web
pages, until a fixed point is reached. This fixed point turns out to be the com-
putation of the (dominant) eigenpairs of a matrix derived by the Web Graph
itself. Brin and Page originally suggested computing this pair using the well-
known Power method and they also gave a nice interpretation of PageRank in
term of Markov Chains.
The most recent researches about PageRank are aimed to address at least
two different needs. First, the desire to reduce the time spent to weight the
nodes of a Web Graphs containing many billions of pages which requires sev-
eral days of computation. Note that Fetterly et al. in [9] showed that 35% of the
Web changed during the course of their study (eleven downloads over a period
of 11 weeks). More recently Cho et al. [14] showed that in a week more than
25% of links are changed and 5% of "new content" is created. They conclude
that "This result indicates that search engines need to update link based rank-
ing metrics (such as PageRank) very often... a week-old ranking may not reflect
the current ranking of the pages very well. The second need, is to assign many
PageRank values to each Web page, as results of PageRank’s personalization
that was recently presented by Google as beta-service [1].
21
2.1 Accelerating problems
Previous approaches to accelerate PageRank followed three different direc-
tions. The first approach tries to fit the graph structure into main memory
by using compression techniques for the Web Graph [5]. The second approach
efficiently computes in external memory the PageRank, by using a sequence
of scan operations on the Web graph. The last direction exploits efficient nu-
merical methods to reduce the computation time. These kinds of numerical
techniques are the most promising and we have seen many intriguing results
in the last few years. Arasu et al. [3] observe that computing PageRank is
equivalent to solve a linear system and proposed to exploit iterative methods
such as the Gauss-Seidel algorithm. Kamvar et al. [12] suggest accelerating
computation by using extrapolation methods which periodically subtracts off
estimates of non-principal eigenvectors from the current iterate of the power
method. The reported speedup is about 50-300% and it is influenced by the
particular parameter settings used for computing the PageRank. Kamvar et
al. [14] note that the Web Graph’s matrix, permuted by sorting the lexico-
graphically the URLs, assumes an approximate block structure since most of
the links are intra-domains. They suggest to compute separately several "lo-
cal" PageRank vectors, one for each block and to use the concatenation of these
vectors as starting point for the computation of a global web rank. In [13], the
authors suggest to avoid the re-computation of the (sub)-components of PageR-
ank vector as soon as they arrive to a fixed point. This results in a speedup of
about 17%.
We will investigate a method motivated by cost-reducing direction as men-
tioned above in the following part. It exploits the structure of the Web graph
and arranges web pages into blocks if they are from the same host to reduce
computational cost. By introducing BlockRank [14], we can have a nice bridge
to a method proposed in the section 3.2.
2.1.2 Exploiting block structure of the Web
Overview of BlockRank Algorithm
The block structure of the web suggests a fast algorithm for computing
PageRank, wherein a “local PageRank vector” is computed for each host, giving
the relative importance of pages within a host. These local PageRank vectors
22
2.1 Accelerating problems
can then be used to form an approximation to the global PageRank vector that
is used as a starting vector for the standard PageRank computation. This is
the basic idea behind the Block-Rank algorithm, which we summarize here
and describe in this section. The algorithm proceeds as follows:
Algorithm 2.1 General steps of BlockRank algorithm
1: function BLOCKRANK(G) . G is the Web graph
2: Split the web into blocks by domain
3: Compute the Local PageRanks for each block
4: Estimate the relative importance, BlockRank, for each block
5: Form an approximate Global PageRank vector z for web pages
6: Use z as a starting vector for standard PageRank
7: end function
Before describing the steps in detail below, we should get familiar to some
notations used in this section. We will use lower-case indices (i.e. i, j) to repre-
sent indices of individual web sites, and upper case indices (i.e. I, J) to repre-
sent indices of blocks. We use the shorthand notation i ∈ I to denote that page
i ∈ block I. The number of elements in block J is denoted nJ . The graph of a
given block J is given by the nJ × nJ submatrix GJJ of the matrix G.
Computing Local PageRanks
In this section, we describe computing a “local PageRank vector” for each
block in the web. One may wonder how a block is created and that question
would be revealed in this part.
Authors of [14] took all the hyperlinks in their FULL-LARGEWEB data set,
and count how many of these links are “intra-host” links (links from a page to
another page in the same host) and howmany are “inter-host” links (links from
a page to a page in a different host). They had shown that 79.1% of the links in
dataset are intra-host links, and 20.9% are inter-host links. These intra-host
connectivity statistics are not far from the earlier results of Bharat et al. [2].
They also investigated the number of links that are intra-domain links, and
the number of links that are interdomain links. And larger majority of links
are intra-domain links (83.9%). These results make sense intuitively.
23
2.1 Accelerating problems
Take as an example the domain cs.stanford.edu. Most of the links in
cs.stanford.edu are links around the cs.stanford.edu site (such as cs.stanford.edu/admissions
or cs.stanford.edu/research). Furthermore, almost all non-navigational
links are links to other Stanford hosts, such as www.stanford.edu,library.stanford.edu
or www-cs-students.stanford.edu. One might expect that there exists
this structure even in lower levels of the web hierarchy. For example, one might
expect that pages under cs.stanford.edu/admissions/ are highly inter-
connected, and very loosely connected with pages in other sublevels, leading
to a nested block structure. This type of nested block structure can be natu-
rally exposed by sorting the link graph to construct a link matrix in the fol-
lowing way. To sort URLs lexicographically, except that as the sort key, they
reversed the components of the domain. For instance, the sort key for the URL
www.stanford.edu/home/students/would be edu.stanford.www/home/students.
The URLs are then assigned sequential identifiers when constructing the link
matrix. A link matrix contains as its (i, j)th entry a 1 if there is a link from i
to j, and 0 otherwise. This has the desired property that URLs are grouped in
turn by top level domain, domain, hostname, and finally path. The subgraph
for pages in stanford.edu appear as a sub-block of the full link matrix. In
turn, the subgraph for pages in www-db.stanford.edu appear as a nested
sub-block.
Since most blocks have very few links in and out of the block as compared
to the number of links within the block, it seems plausible that the relative
rankings of most of the pages within a block are determined by the inter-block
links. We define the local PageRank vector lJ of a block J (GJJ ) to be the result
of the PageRank algorithm applied only on block J , as if block J represented
the entire web, and as if the links to pages in other blocks did not exist. That
is:
lJ = PAGERANK(GJJ )
in which PAGERANK is exactly as what written in Algorithm 1.1.
Local PageRank accuracies
To investigate how well these local PageRank vectors approximate the rel-
ative magnitudes of the true PageRank vectors within a given host, Kamvar
et. al. ran these experiments. They computed the local PageRank vectors lJ of
24
2.1 Accelerating problems
each host in STANFORD/BERKELEY. They also computed the global PageRank
vector pi for STANFORD/BERKELEY using the standard PageRank algorithm.
After the global PageRank vector computation is completed, they then com-
pared the local PageRank scores of the pages within a given host to the global
PageRank scores of the pages in the same host. Specifically, they took the el-
ements corresponding to the pages in host J of the global PageRank vector pi,
and form the vector gJ from these elements. This newly-created vector gJ is
then normalized so that its elements sum to 1 in order to compare it to the
local PageRank vector lJ , which also has an L1 norm of 1.
Specifically,
gJ =
pij∈J
||pij∈J ||
In Table 2.1, the absolute error, ||lJ−gJ ||1, is on average 0.2383 for the hosts
in STANFORD/BERKELEY.
Error measure Average
||lJ − gJ ||1 0.2383
KDist(lJ , gJ) 0.0571
Table 2.1: The closeness of local PageRank vector to global PageRank vector
Furthermore, they conclude that rank scores induced by local PageRank is
close to the scores obtained from the global PageRank scores. To compare the
orderings, we measure the average “distance” between the local PageRank vec-
tors lJ and global PageRank segments gJ . The KDist distance measure, based
on Kendall’s-τ rank correlation and used for comparing induced rank orders,
is defined as follow.
Consider two partially ordered lists of URLs, τ1 and τ2, each of length m.
Let U be the union of the URLs in τ1 and τ2. If δ1 is U − τ1, then let τ ′1 be the
extension of τ1, where τ ′1 contains δ1 appearing after all the URLs in τ1. We
extend τ2 analogously to yield τ ′2. KDist is then defined as:
KDist(τ1, τ2) =
|{(u, v) : τ ′1, τ
′
2 disagree on order of (u, v), u 6= v}|
|U | × (|U | − 1)
In other words, KDist(τ1, τ2) is the probability that τ ′1 and τ ′2 disagree on the
relative ordering of a randomly selected pair of distinct nodes (u, v) ∈ U × U .
25
2.1 Accelerating problems
In the current work, we only compare lists containing the same sets of ele-
ments, so that KDist is identical to Kendall’s τ distance. The average distance
KDist(lJ , gJ ) is 0.0571 for the hosts in STANFORD/BERKELEY. Notice that this
distance is low. This observation means that the ordering induced by the lo-
cal PageRank is close to being correct, and thus suggests that the majority of
the L1 error in the comparison of local and global PageRanks comes from the
miscalibration of a few pages on each host. Indeed the miscalibration may be
among important pages; discussing next, this miscalibration is corrected by
the final step of our algorithm. Furthermore, the relative rankings of pages
on different hosts is unknown at this point. For these reasons, authors of [14]
do not suggest using local PageRank for ranking pages; just using as a tool
for computing the global PageRank more efficiently. And the local PageRank
results compare favorable to the global PageRank which can be obtained from
the standard PageRank algorithm. They suggest using the local PageRank
vector as a start vector for computing global PageRank vector.
It should be noted that errors of this pattern (where the majority of L1 er-
ror comes from the miscalibration of a few pages) are fixed easily, since once
these miscalibrated pages are fixed (by, for example, a few iterations of global
PageRank), the rest of the pages fall into place. Errors that are more random
take longer to fix.
Local PageRank convergence rates
Another interesting question to investigate is how quickly the local PageR-
ank scores converge. In Figure 4, they showed a histogram of the number of it-
erations it takes for the local PageRank scores for each host in their data set, in
which dangling nodes are removed and contains roughly about 70M pages with
over 600M edges, to converge to an L1 residual < 10−1. Notice that most hosts
converge to this residual in less than 12 iterations. Interestingly, there is no
correlation between the convergence rate of a host and the host’s size. Rather,
the convergence rate is primarily dependent on the extent of the nested block
structure within the host. That is, hosts with strong nested blocks are likely to
converge slowly, since they represent slow-mixing Markov chains. Hosts with
a more random connection pattern converge faster, since they represent a fast-
mixing Markov chain. This observation suggests that one could make the local
26
2.1 Accelerating problems
PageRank computations even faster by wisely choosing the blocks. That is,
if a host has a strong nested block structure, use the directories within that
host as your blocks. However, this is not a crucial issue, because we show in
Section 5 that the local PageRank computations can be performed in a dis-
tributed fashion in parallel with the crawl. Therefore, reducing the cost of the
local PageRank computations are not a bottleneck for computing PageRank
with our scheme, as the local computations can be pipelined with the crawl.
Estimating the relative importance of each block
After calculating all pages’ ranks in each host, the algorithm turns out to
compute each block’s rank, namely BlockRanks. Assume there are k blocks in
the web. The idea used to compute BlockRank is similar to the computation
of page’s rank. In [14] firstly the block graph B is created, where each vertex
corresponds to a block in the web graph. An edge between two pages in the web
is represented as an edge between the corresponding blocks (or a self-edge, if
both pages are in the same block). The edge weights are determined as follows:
the weight of an edge between blocks I and J is defined to be the sum of the
edge-weights from pages in I to pages in J in the web graph, weighted by the
local PageRanks of the linking pages in block I. That is, if P is the web graph
and li is the local Page-Rank of page i in block I, then weight of edge BIJ is
given by:
BIJ =
∑
i∈I,j∈J
Pij.li
They wrote it in matrix notations. Define the local PageRank matrix L to be
the n× k matrix whose columns are the local PageRank vectors lJ .
L =
l1 0 · · · 0
0 l2 · · · 0
...
... . . .
...
0 0 · · · lk
Define the matrix S to be the n × k matrix that has the same structure as L,
but whose nonzero entries are all replaced by 1. Then the block matrix B is
the k × k matrix:
B = LTPS
27
2.1 Accelerating problems
Notice that B is a transition matrix where the element BIJ represents the
transition probability of block I to block J . That is:
BIJ = p(J |I)
Once we have the k × k transition matrix B, we may use the standard PageR-
ank algorithm on this reduced matrix to compute the BlockRank vector b. In
this notation, each block is considered as a web page and the PageRank algo-
rithm is applied same as for web pages.
Using relative importance of blocks for computing PageRank
At this point, there are two sets of rankings, one set contains relative ranks
of web pages in each block and another set contains relative ranks of blocks in
the web graph. Within each block J , we have the local PageRanks of the pages
in the block bJ . We also have the BlockRank vector b whose elements bJ are the
BlockRank for each block J , measuring the relative importance of the blocks.
We may now approximate the global PageRank of a page j ∈ J as its local
PageRank lj , weighted by the BlockRank bJ of the block in which it resides.
That is,
pi
(0)
j = lj × bJ
In matrix notation:
pi(0) = L× b
And pi(0) is used as the start vector for the regular PR algorithm.
Conclusion of approach
More specific, this approach is presented in Algorithm 2.2.
28
2.1 Accelerating problems
Algorithm 2.2 Specific version of BlockRank
1: procedure BLOCKRANK(G) . G is the whole Web graph
2: Sort the web graph lexicographically . to induce intra-host block
3: for all block J do . compute the local PageRank vector of block J
4: lJ = PageRank(J)
5: end for
6: B = LTGS . B is a matrix
7: b = PageRank(B) . B is understood as a graph of blocks
8: pi(0) = L× b
9: pi = PageRank(G) . pi(0) is used as the start vector
10: return pi
11: end procedure
This approach has shown that the hyperlink graph of the web has a nested
block structure, something that has not yet been thoroughly investigated in
studies of the web. And, this structure is exploited to compute PageRank in
a fast manner using an algorithm which is so-called BlockRank. As shown in
Figure 2.1: Convergence rates for standard PageRank (solid line) vs. Block-
Rank (dotted line). The x-axis is the number of iterations, and the y-axis is the
log of the L1-residual.
Figure 2.1, BlockRank speeds up PageRank computations by factors of 2 and
higher, depending on the particular scenario. There are a number of areas
for future work: finding the “best” blocks for BlockRank by splitting up what
would be slow-mixing blocks with internal nested block structure; using the
29
2.2 Connected-component PageRank approach
block structure for hyperlink-based algorithms other than web search, such as
in clustering or classification; and exploring more fully the topics of updates
and personalized PageRank.
2.2 Connected-component PageRank approach
In this section, we will discuss about advantages of using connected compo-
nents, how connected-component concept is applied in computation of PageR-
ank and we will prove some work generated on the theory side. We also propose
an approach which utilizes connected component idea from graph theory. The
proposal is so-called CCP, a short for "Connected Component PageRank".
2.2.1 Initial ideas
While taking care of Markov model [17], we pay special attention to a defini-
tion that states in Markov chains can be divided into classes; states which are
transposable are grouped into one class. This concept is similar to concept of
connected components in graph theory. Moreover, as mentioned, the web graph
is presented using adjacent matrix. This gives us a chance to apply connected
component concept into PageRank computation.
The proposed algorithm is given as follow.
Algorithm 2.3 General steps of CCP algorithm
1: function CCP(G) . G is the Web graph
2: Split the web into blocks by connected component
3: Compute the local PageRanks for each block
4: Form an approximate global PageRank vector pi for web pages by com-
bining eigen-vectors of blocks
5: end function
Figure 2.2 illustrates the idea of CCP.
30
2.2 Connected-component PageRank approach
(a) Unarranged adjacent matrix (b) Arranged adjacent matrix
Figure 2.2: Unarranged matrix and arranged matrix
Applying connected components into PageRank computation, we gain the
following advantages.
¬ Regarding computational complexity: our method follow the trend of re-
ducing cost needed for computing PageRank. We already know that time
expenditure for multiplying a vector with a matrix is O(n2) in which n is
size of common dimension. For that reason, if we compute PageRank by
constructing its adjacent matrix, we face a hard problem when n reaches
billions. By using connected components i
Các file đính kèm theo tài liệu này:
- MSc07_Nguyen_Hoai_Nam_Thesis_English.pdf