Luận văn Www and the pagerank-Related problems

Contents

List of Figures ii

List of Tables iii

Introduction iv

Acknowledgement vi

Abstract ix

List of Glossaries xi

1 Objects’ ranks and applications to WWW 1

1.1 Rank of objects . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 Rank for objects different from web page. . . . . . . . . . 2

1.1.2 Linkage usage in search engine . . . . . . . . . . . . . 4

1.2 Basis of PageRank . . . . . . . . . . . . . . . . . . . . . . 10

1.2.1 Mathematical basis . . . . . . . . . . . . . . . . . . . 11

1.2.2 Practical issues. . . . . . . . . . . . . . . . . . . . . 13

Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 19

2 Some PageRank-related problems 20

2.1 Accelerating problems . . . . . . . . . . . . . . . . . . . . . 20

2.1.1 Related works . . . . . . . . . . . . . . . . . . . . . 21

2.1.2 Exploiting block structure of the Web . . . . . . . . . . . 22

2.2 Connected-component PageRank approach . . . . . . . . . . . 30

2.2.1 Initial ideas. . . . . . . . . . . . . . . . . . . . . . . 30

2.2.2 Mathematical basis of CCP . . . . . . . . . . . . . . . 32

2.2.3 On practical side of CCP. . . . . . . . . . . . . . . . . 35

2.3 Spam and spam detection . . . . . . . . . . . . . . . . . . . 37

2.3.1 Introduction to Web spam . . . . . . . . . . . . . . . . 37

2.3.2 Spam techniques . . . . . . . . . . . . . . . . . . . . 38

Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 42

3 Implementations and experimental results 43

3.1 Search engine Nutch . . . . . . . . . . . . . . . . . . . . . 43

3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.2.1 Infrastructure and data sets . . . . . . . . . . . . . . . 48

3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . 48

Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 58

Conclusions and future works 59

References

pdf81 trang | Chia sẻ: maiphuongdc | Lượt xem: 1450 | Lượt tải: 0download
Bạn đang xem trước 20 trang tài liệu Luận văn Www and the pagerank-Related problems, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
stinction between pages because it yields quite different results of ranks. Results become more neutral when value of α decreases. Medium high-ranked pages do not change much in ranks but the highest and the lowest change quite clearly. For more specifically information, in Figure 2.3(a) pi = (0.1973 0.2637 0.2066 0.1584 0.1574 0.0167) and in Figure 2.3(i) pi = (0.1681 0.2183 0.1809 0.1720 0.1773 0.0833) so pi2 goes down to 0.2183 from 0.2637 and pi6 goes up to 0.0833 from 0.0167. One’s changing rate is about 0.5 and another one is about 0.6. This can lead us to a different result in searching if α is not carefully chosen. For a better performance of system, we can choose a small value for α but the result may be useless. For a usable result, we should choose α near to 1 but the cost may increase sharply as in Table 1.1. 16 1.2 Basis of PageRank 1 2 3 4 5 6 0 0.2 0.4 (a) 1 2 3 4 5 6 0 0.2 0.4 (b) 1 2 3 4 5 6 0 0.2 0.4 (c) 1 2 3 4 5 6 0 0.2 0.4 (d) 1 2 3 4 5 6 0 0.2 0.4 (e) 1 2 3 4 5 6 0 0.2 0.4 (f) 1 2 3 4 5 6 0 0.2 0.4 (g) 1 2 3 4 5 6 0 0.2 0.4 (h) 1 2 3 4 5 6 0 0.2 0.4 (i) Figure 1.3: Figures exemplifying PR vector with (a) α = .9; (b) α = .85; (c) α = .8; (d) α = .75; (e) α = .7; (f) α = .65; (g) α = .6; (h) α = .55; (i) α = .5 Table 1.1 shows the number of iterations needed to compute PR vector cor- responding to each value of α. Value of α .9 .85 .8 .75 .7 .65 .6 .55 .5 Iterations 18 16 14 12 12 10 10 8 8 Table 1.1: Iterations needed for each value of decay-factor α of SmallSet Iterations needed for α = .9 is as double as iterations needed for α = .5. As Figure 1.4 shows, for the biggest value of α chosen, the cost is remarkable but if reference back Figure 1.3 the result is relatively better because we can discriminate pages’ ranks. For the smallest value of α, the number of spending iterations is smaller and therefore time taken is much lower but the result may be not really good.  17 1.2 Basis of PageRank 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 8 10 12 14 16 18 Figure 1.4: Graph of iterations needed with α ∈ [.5; .9] of SmallSet Algorithm 1.2 will discuss about the PR computation in practical use. 18 1.2 Basis of PageRank Algorithm 1.2 Iterative method for practical PageRank computation 1: function PRACTICALPAGERANK(G) . G is the Web graph 2: α← any decay value . α is often chosen .85 3: tmp← any random vector . vector tmp is often assigned by ( 1 n ) 1×n 4: ← any threshold value . value to determine if pi converges 5: for indexRow ← 1÷ n do 6: remove rank leak and rank sink nodes 7: end for 8: construct P˜ 9: while ||pi − tmp||L >  do . makes pi converge to the principal eigenvector 10: pi ← tmp 11: pi ← tmp× P˜ 12: end while 13: return pi . the final PR vector 14: end function In fact, all rank sink and rank leak nodes are removed by making them connected by other nodes before constructing matrix P˜ . After that, we apply the same process as in Algorithm 1.1 to get the final result. And, Algorithm 1.2 also finals this chapter. Conclusion of chapter In this chapter, we have gone through a key idea for making an order to objects in our world. Objects in a database should be ranked, scientific papers should be ranked and web pages must be ranked. Moreover, the most important thing it brings us that we can put everything in order to easily mine for knowledge. We also navigated about PageRank, a method used for ranking web pages, completely about its base and arisen problems in practical usage. In the next chapter, we will examine some matters rounding accelerating and spamming problems. 19 CHAPTER Some PageRank-related problems PageRank is a famous occurrence, a trademark and a tool from which Google benefits much. Because it is famous, many scientists pay attention and as a result, there are many ways to tweak PageR- ank up. Some accelerating methods for PageRank will be shown in this chapter as a proof that PageRank is one of the biggest current affairs in Internet-related fields. Furthermore, because a page can easily exploit PageRank to get higher rank so we must deal with a problem, which is called spam4. 2.1 Accelerating problems PageRank computation is a big problem which involves many fields of study such as storage capacity, computational complexity... In this section, we will discuss one of the major problems rising with the computation of PageRank which is so-called accelerating problems. 4This kind of spam is different from mail spam 20 2.1 Accelerating problems 2.1.1 Related works The Web is becoming a massive repository for delivering information. Recent studies [8] show that most of the users access the Web by means of a search engine. Nielsen/NetRatings, one of the leading Internet and digital media au- dience information and analysis services, reports that there were an estimated 151 million active Internet users in the US, for January 2004. Of these, 76% used a search engine at least once during the month and the average time spent searching was about 40 minutes. Therefore, to access something on In- ternet the easiest way is to insert some keywords on a favorite search engine and to jump to the selected Web site returned as answer. For these reasons, we witnessed to a growing interest and in large investment efforts to improve the quality of Web ranking algorithms. Besides, the research community has devoted an increased attention to reduce the computation time needed by Web ranking algorithms. Indeed, a particular attention was devoted to improve PageRank [6, 16], the well known ranking algorithm used by Google. The core of PageRank exploits an iterative weight assignment of ranks to the Web pages, until a fixed point is reached. This fixed point turns out to be the com- putation of the (dominant) eigenpairs of a matrix derived by the Web Graph itself. Brin and Page originally suggested computing this pair using the well- known Power method and they also gave a nice interpretation of PageRank in term of Markov Chains. The most recent researches about PageRank are aimed to address at least two different needs. First, the desire to reduce the time spent to weight the nodes of a Web Graphs containing many billions of pages which requires sev- eral days of computation. Note that Fetterly et al. in [9] showed that 35% of the Web changed during the course of their study (eleven downloads over a period of 11 weeks). More recently Cho et al. [14] showed that in a week more than 25% of links are changed and 5% of "new content" is created. They conclude that "This result indicates that search engines need to update link based rank- ing metrics (such as PageRank) very often... a week-old ranking may not reflect the current ranking of the pages very well. The second need, is to assign many PageRank values to each Web page, as results of PageRank’s personalization that was recently presented by Google as beta-service [1]. 21 2.1 Accelerating problems Previous approaches to accelerate PageRank followed three different direc- tions. The first approach tries to fit the graph structure into main memory by using compression techniques for the Web Graph [5]. The second approach efficiently computes in external memory the PageRank, by using a sequence of scan operations on the Web graph. The last direction exploits efficient nu- merical methods to reduce the computation time. These kinds of numerical techniques are the most promising and we have seen many intriguing results in the last few years. Arasu et al. [3] observe that computing PageRank is equivalent to solve a linear system and proposed to exploit iterative methods such as the Gauss-Seidel algorithm. Kamvar et al. [12] suggest accelerating computation by using extrapolation methods which periodically subtracts off estimates of non-principal eigenvectors from the current iterate of the power method. The reported speedup is about 50-300% and it is influenced by the particular parameter settings used for computing the PageRank. Kamvar et al. [14] note that the Web Graph’s matrix, permuted by sorting the lexico- graphically the URLs, assumes an approximate block structure since most of the links are intra-domains. They suggest to compute separately several "lo- cal" PageRank vectors, one for each block and to use the concatenation of these vectors as starting point for the computation of a global web rank. In [13], the authors suggest to avoid the re-computation of the (sub)-components of PageR- ank vector as soon as they arrive to a fixed point. This results in a speedup of about 17%. We will investigate a method motivated by cost-reducing direction as men- tioned above in the following part. It exploits the structure of the Web graph and arranges web pages into blocks if they are from the same host to reduce computational cost. By introducing BlockRank [14], we can have a nice bridge to a method proposed in the section 3.2. 2.1.2 Exploiting block structure of the Web Overview of BlockRank Algorithm The block structure of the web suggests a fast algorithm for computing PageRank, wherein a “local PageRank vector” is computed for each host, giving the relative importance of pages within a host. These local PageRank vectors 22 2.1 Accelerating problems can then be used to form an approximation to the global PageRank vector that is used as a starting vector for the standard PageRank computation. This is the basic idea behind the Block-Rank algorithm, which we summarize here and describe in this section. The algorithm proceeds as follows: Algorithm 2.1 General steps of BlockRank algorithm 1: function BLOCKRANK(G) . G is the Web graph 2: Split the web into blocks by domain 3: Compute the Local PageRanks for each block 4: Estimate the relative importance, BlockRank, for each block 5: Form an approximate Global PageRank vector z for web pages 6: Use z as a starting vector for standard PageRank 7: end function Before describing the steps in detail below, we should get familiar to some notations used in this section. We will use lower-case indices (i.e. i, j) to repre- sent indices of individual web sites, and upper case indices (i.e. I, J) to repre- sent indices of blocks. We use the shorthand notation i ∈ I to denote that page i ∈ block I. The number of elements in block J is denoted nJ . The graph of a given block J is given by the nJ × nJ submatrix GJJ of the matrix G. Computing Local PageRanks In this section, we describe computing a “local PageRank vector” for each block in the web. One may wonder how a block is created and that question would be revealed in this part. Authors of [14] took all the hyperlinks in their FULL-LARGEWEB data set, and count how many of these links are “intra-host” links (links from a page to another page in the same host) and howmany are “inter-host” links (links from a page to a page in a different host). They had shown that 79.1% of the links in dataset are intra-host links, and 20.9% are inter-host links. These intra-host connectivity statistics are not far from the earlier results of Bharat et al. [2]. They also investigated the number of links that are intra-domain links, and the number of links that are interdomain links. And larger majority of links are intra-domain links (83.9%). These results make sense intuitively. 23 2.1 Accelerating problems Take as an example the domain cs.stanford.edu. Most of the links in cs.stanford.edu are links around the cs.stanford.edu site (such as cs.stanford.edu/admissions or cs.stanford.edu/research). Furthermore, almost all non-navigational links are links to other Stanford hosts, such as www.stanford.edu,library.stanford.edu or www-cs-students.stanford.edu. One might expect that there exists this structure even in lower levels of the web hierarchy. For example, one might expect that pages under cs.stanford.edu/admissions/ are highly inter- connected, and very loosely connected with pages in other sublevels, leading to a nested block structure. This type of nested block structure can be natu- rally exposed by sorting the link graph to construct a link matrix in the fol- lowing way. To sort URLs lexicographically, except that as the sort key, they reversed the components of the domain. For instance, the sort key for the URL www.stanford.edu/home/students/would be edu.stanford.www/home/students. The URLs are then assigned sequential identifiers when constructing the link matrix. A link matrix contains as its (i, j)th entry a 1 if there is a link from i to j, and 0 otherwise. This has the desired property that URLs are grouped in turn by top level domain, domain, hostname, and finally path. The subgraph for pages in stanford.edu appear as a sub-block of the full link matrix. In turn, the subgraph for pages in www-db.stanford.edu appear as a nested sub-block. Since most blocks have very few links in and out of the block as compared to the number of links within the block, it seems plausible that the relative rankings of most of the pages within a block are determined by the inter-block links. We define the local PageRank vector lJ of a block J (GJJ ) to be the result of the PageRank algorithm applied only on block J , as if block J represented the entire web, and as if the links to pages in other blocks did not exist. That is: lJ = PAGERANK(GJJ ) in which PAGERANK is exactly as what written in Algorithm 1.1. Local PageRank accuracies To investigate how well these local PageRank vectors approximate the rel- ative magnitudes of the true PageRank vectors within a given host, Kamvar et. al. ran these experiments. They computed the local PageRank vectors lJ of 24 2.1 Accelerating problems each host in STANFORD/BERKELEY. They also computed the global PageRank vector pi for STANFORD/BERKELEY using the standard PageRank algorithm. After the global PageRank vector computation is completed, they then com- pared the local PageRank scores of the pages within a given host to the global PageRank scores of the pages in the same host. Specifically, they took the el- ements corresponding to the pages in host J of the global PageRank vector pi, and form the vector gJ from these elements. This newly-created vector gJ is then normalized so that its elements sum to 1 in order to compare it to the local PageRank vector lJ , which also has an L1 norm of 1. Specifically, gJ = pij∈J ||pij∈J || In Table 2.1, the absolute error, ||lJ−gJ ||1, is on average 0.2383 for the hosts in STANFORD/BERKELEY. Error measure Average ||lJ − gJ ||1 0.2383 KDist(lJ , gJ) 0.0571 Table 2.1: The closeness of local PageRank vector to global PageRank vector Furthermore, they conclude that rank scores induced by local PageRank is close to the scores obtained from the global PageRank scores. To compare the orderings, we measure the average “distance” between the local PageRank vec- tors lJ and global PageRank segments gJ . The KDist distance measure, based on Kendall’s-τ rank correlation and used for comparing induced rank orders, is defined as follow. Consider two partially ordered lists of URLs, τ1 and τ2, each of length m. Let U be the union of the URLs in τ1 and τ2. If δ1 is U − τ1, then let τ ′1 be the extension of τ1, where τ ′1 contains δ1 appearing after all the URLs in τ1. We extend τ2 analogously to yield τ ′2. KDist is then defined as: KDist(τ1, τ2) = |{(u, v) : τ ′1, τ ′ 2 disagree on order of (u, v), u 6= v}| |U | × (|U | − 1) In other words, KDist(τ1, τ2) is the probability that τ ′1 and τ ′2 disagree on the relative ordering of a randomly selected pair of distinct nodes (u, v) ∈ U × U . 25 2.1 Accelerating problems In the current work, we only compare lists containing the same sets of ele- ments, so that KDist is identical to Kendall’s τ distance. The average distance KDist(lJ , gJ ) is 0.0571 for the hosts in STANFORD/BERKELEY. Notice that this distance is low. This observation means that the ordering induced by the lo- cal PageRank is close to being correct, and thus suggests that the majority of the L1 error in the comparison of local and global PageRanks comes from the miscalibration of a few pages on each host. Indeed the miscalibration may be among important pages; discussing next, this miscalibration is corrected by the final step of our algorithm. Furthermore, the relative rankings of pages on different hosts is unknown at this point. For these reasons, authors of [14] do not suggest using local PageRank for ranking pages; just using as a tool for computing the global PageRank more efficiently. And the local PageRank results compare favorable to the global PageRank which can be obtained from the standard PageRank algorithm. They suggest using the local PageRank vector as a start vector for computing global PageRank vector. It should be noted that errors of this pattern (where the majority of L1 er- ror comes from the miscalibration of a few pages) are fixed easily, since once these miscalibrated pages are fixed (by, for example, a few iterations of global PageRank), the rest of the pages fall into place. Errors that are more random take longer to fix. Local PageRank convergence rates Another interesting question to investigate is how quickly the local PageR- ank scores converge. In Figure 4, they showed a histogram of the number of it- erations it takes for the local PageRank scores for each host in their data set, in which dangling nodes are removed and contains roughly about 70M pages with over 600M edges, to converge to an L1 residual < 10−1. Notice that most hosts converge to this residual in less than 12 iterations. Interestingly, there is no correlation between the convergence rate of a host and the host’s size. Rather, the convergence rate is primarily dependent on the extent of the nested block structure within the host. That is, hosts with strong nested blocks are likely to converge slowly, since they represent slow-mixing Markov chains. Hosts with a more random connection pattern converge faster, since they represent a fast- mixing Markov chain. This observation suggests that one could make the local 26 2.1 Accelerating problems PageRank computations even faster by wisely choosing the blocks. That is, if a host has a strong nested block structure, use the directories within that host as your blocks. However, this is not a crucial issue, because we show in Section 5 that the local PageRank computations can be performed in a dis- tributed fashion in parallel with the crawl. Therefore, reducing the cost of the local PageRank computations are not a bottleneck for computing PageRank with our scheme, as the local computations can be pipelined with the crawl. Estimating the relative importance of each block After calculating all pages’ ranks in each host, the algorithm turns out to compute each block’s rank, namely BlockRanks. Assume there are k blocks in the web. The idea used to compute BlockRank is similar to the computation of page’s rank. In [14] firstly the block graph B is created, where each vertex corresponds to a block in the web graph. An edge between two pages in the web is represented as an edge between the corresponding blocks (or a self-edge, if both pages are in the same block). The edge weights are determined as follows: the weight of an edge between blocks I and J is defined to be the sum of the edge-weights from pages in I to pages in J in the web graph, weighted by the local PageRanks of the linking pages in block I. That is, if P is the web graph and li is the local Page-Rank of page i in block I, then weight of edge BIJ is given by: BIJ = ∑ i∈I,j∈J Pij.li They wrote it in matrix notations. Define the local PageRank matrix L to be the n× k matrix whose columns are the local PageRank vectors lJ . L =   l1 0 · · · 0 0 l2 · · · 0 ... ... . . . ... 0 0 · · · lk   Define the matrix S to be the n × k matrix that has the same structure as L, but whose nonzero entries are all replaced by 1. Then the block matrix B is the k × k matrix: B = LTPS 27 2.1 Accelerating problems Notice that B is a transition matrix where the element BIJ represents the transition probability of block I to block J . That is: BIJ = p(J |I) Once we have the k × k transition matrix B, we may use the standard PageR- ank algorithm on this reduced matrix to compute the BlockRank vector b. In this notation, each block is considered as a web page and the PageRank algo- rithm is applied same as for web pages. Using relative importance of blocks for computing PageRank At this point, there are two sets of rankings, one set contains relative ranks of web pages in each block and another set contains relative ranks of blocks in the web graph. Within each block J , we have the local PageRanks of the pages in the block bJ . We also have the BlockRank vector b whose elements bJ are the BlockRank for each block J , measuring the relative importance of the blocks. We may now approximate the global PageRank of a page j ∈ J as its local PageRank lj , weighted by the BlockRank bJ of the block in which it resides. That is, pi (0) j = lj × bJ In matrix notation: pi(0) = L× b And pi(0) is used as the start vector for the regular PR algorithm. Conclusion of approach More specific, this approach is presented in Algorithm 2.2. 28 2.1 Accelerating problems Algorithm 2.2 Specific version of BlockRank 1: procedure BLOCKRANK(G) . G is the whole Web graph 2: Sort the web graph lexicographically . to induce intra-host block 3: for all block J do . compute the local PageRank vector of block J 4: lJ = PageRank(J) 5: end for 6: B = LTGS . B is a matrix 7: b = PageRank(B) . B is understood as a graph of blocks 8: pi(0) = L× b 9: pi = PageRank(G) . pi(0) is used as the start vector 10: return pi 11: end procedure This approach has shown that the hyperlink graph of the web has a nested block structure, something that has not yet been thoroughly investigated in studies of the web. And, this structure is exploited to compute PageRank in a fast manner using an algorithm which is so-called BlockRank. As shown in Figure 2.1: Convergence rates for standard PageRank (solid line) vs. Block- Rank (dotted line). The x-axis is the number of iterations, and the y-axis is the log of the L1-residual. Figure 2.1, BlockRank speeds up PageRank computations by factors of 2 and higher, depending on the particular scenario. There are a number of areas for future work: finding the “best” blocks for BlockRank by splitting up what would be slow-mixing blocks with internal nested block structure; using the 29 2.2 Connected-component PageRank approach block structure for hyperlink-based algorithms other than web search, such as in clustering or classification; and exploring more fully the topics of updates and personalized PageRank. 2.2 Connected-component PageRank approach In this section, we will discuss about advantages of using connected compo- nents, how connected-component concept is applied in computation of PageR- ank and we will prove some work generated on the theory side. We also propose an approach which utilizes connected component idea from graph theory. The proposal is so-called CCP, a short for "Connected Component PageRank". 2.2.1 Initial ideas While taking care of Markov model [17], we pay special attention to a defini- tion that states in Markov chains can be divided into classes; states which are transposable are grouped into one class. This concept is similar to concept of connected components in graph theory. Moreover, as mentioned, the web graph is presented using adjacent matrix. This gives us a chance to apply connected component concept into PageRank computation. The proposed algorithm is given as follow. Algorithm 2.3 General steps of CCP algorithm 1: function CCP(G) . G is the Web graph 2: Split the web into blocks by connected component 3: Compute the local PageRanks for each block 4: Form an approximate global PageRank vector pi for web pages by com- bining eigen-vectors of blocks 5: end function Figure 2.2 illustrates the idea of CCP. 30 2.2 Connected-component PageRank approach (a) Unarranged adjacent matrix (b) Arranged adjacent matrix Figure 2.2: Unarranged matrix and arranged matrix Applying connected components into PageRank computation, we gain the following advantages. ¬ Regarding computational complexity: our method follow the trend of re- ducing cost needed for computing PageRank. We already know that time expenditure for multiplying a vector with a matrix is O(n2) in which n is size of common dimension. For that reason, if we compute PageRank by constructing its adjacent matrix, we face a hard problem when n reaches billions. By using connected components i

Các file đính kèm theo tài liệu này:

  • pdfMSc07_Nguyen_Hoai_Nam_Thesis_English.pdf