Khóa luận Ranking in social network – Social rank

Contents

Abstract . 1

Acknowledgement . 3

List of Figures. 5

Chapter 1. 6

Introduction to Social Network.6

1. Social network. 6

2. Network construction . 8

3. Network representation . 10

4. A brief introduction of graph theory. 12

5. Social network’s characteristics. 14

6. Social network analysis – SNA. 17

Chapter 2. 19

Ranking in social network – Socialrank.19

1. Introduction . 19

2. Ranking in social networks. 20

Chapter 3. 29

Ranking bloggers and Experiments .29

1. Background and Motivation. 29

2. Ranking bloggers by PageRank . 34

3. Experiment setup and Results. 35

Conclusion and Future works . 40

Biblography . 41

44 trang | Chia sẻ: maiphuongdc | Lượt xem: 2192 | Lượt tải: 0

Bạn đang xem trước 20 trang tài liệu Khóa luận Ranking in social network – Social rank, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

aph theory A necessary course in social network analysis is graph theory. As social networks can be represented as graphs, understanding fundamental concepts in graph theories is essential. In this section we will give some concepts that are often used when analyzing networks. More details can be found at [29]. The degree of a node is defined as the number of ties incident upon that node. In directed graph, each node has both indegree and outdegree. The indegree is the number of ties pointing to the node, whereas the outdegree is the number of ties pointing out from that nodes. A path is an alternating sequence of nodes and ties, beginning at a node and ending at a node, and which does not visit any node more than once. A walk is like a path except that there is no restriction on the number of times a point can be visited. A path is a kind of walk. A cycle is just like a path except that it starts and ends at the same point. The length of a path or walk (or cycle) is defined as the number of ties in it. A path between two nodes with the shortest length is called a shortest path (also a geodesic) between the two nodes. It is not always unique (that is, there may be several paths between the same two points that are equally short). The graph-theoretic distance between two nodes is defined as the length of the shortest path between them. A graph is connected if there exists a path (of any length) from every node to every other node. The longest possible path between any two nodes in a connected graph is n-1, where n is the number of nodes in the graph. ~ 13 ~ A node is reachable from another node if there exists a path of any length from one to the other. A connected component is a maximal sub-graph in which all nodes are reachable from every other. Maximal means that it is the largest possible sub- graph: you could not find another node anywhere in the graph such that it could be added to the sub-graph and all the nodes in the sub-graph would still be connected. For directed graphs, there are strong components and weak components. A strong component is a maximal sub-graph in which there is a path from every node to every node following all the arcs in the direction they are pointing. A weak component is a maximal sub-graph which would be connected if we ignored the direction of the arcs. A cutpoint is a vertex whose removal from the graph increases the number of components. That is, it makes some points unreachable from some others. It disconnects the graph. A cutset is a collection of points whose removal increases the number of components in a graph. A minimum weight cutset consists of the smallest set of points that must be removed to disconnect a graph. The number of points in a minimum weight cutset is called the point connectivity of a graph. If a graph has a cutpoint, the connectivity of the graph is 1. The minimum number of points separating two nonadjacent points s and t is also the maximum number of point- disjoint paths between s and t. A bridge is an edge whose removal from a graph increases the number of components (disconnects the graph). An edge cutset is a collection of edges whose removal disconnects a graph. A local bridge of degree k is an edge whose removal causes the distance between the endpoints of the edge to be at least k. ~ 14 ~ The edge-connectivity of a graph is the minimum number of lines whose removal would disconnect the graph. The minimum number of edges separating two nonadjacent points s and t is also the maximum number of edge-disjoint paths between s and t. 5. Social network’s characteristics In the late of 1950s, two mathematicians Erdös and Rényi created a great important theory in graph by modeling many real world networks by a special type of graph – random graph. To create a random graph with n nodes and m ties, they put n nodes next to each other, take pair of node at random and tie them together, the process continues until the graph has m ties. Erdös and Rényi realize that “when m is small, the graph is likely to be fragmented into many small clusters” (components), “as m increases the components grow”. For m > n/2, all nodes are connected to each other [31]. Beside regular and random graph, the two extreme types of graph, network analysts also study some other types of networks, two most important of them are small world and scale free networks. 5.1. Small world networks The experiments conducted by Stanley Milgram and his colleagues for social networks of people in the United States raising the concept of “small world”. The phrase captures the initial surprise between two strangers (“What a small world”) when they realize that they are indirectly connected to one another through mutual friends. People in Kansas and Nebraska were asked to direct letters to strangers in Boston by forwarding them to friends who thought might know the strangers in Boston. And half of the letters were successfully delivered through no more than five intermediaries. Another experiments show that there might be “six degrees of separation” between any two individuals in the world. ~ 15 ~ The research was groundbreaking in that it showed that human society is a small world network characterized by shorter than expected path lengths [16]. In small world network, most nodes can be reached from every other by a small number of hops or steps. Figure 6: six degrees of separation Source: ~ 16 ~ Figure 7: Real world example of small world networks. (a) science coauthor network, (b) connected pages on a part of the internet, (c) biochemical pathway network, and (d) New York state electric power grid. Figure 7 (a), (b), (c) are from www.nd.edu/~networks/publications.html#talks0001 by Barabasi, Oltvai, Jeong et al. Figure 6 (d) is from ~ 17 ~ 5.2. Scale free networks Many real world networks are scale free, which means the network will not change its properties no matter how many nodes it has. The degree distribution of scale free networks follows the Yule-Simon distribution – a power law relationship defined by P(k) ~ k-α, where P(k) denotes the probability that a node in the network connects with k other nodes, the coefficient α may very approximately from 2 to 3 for most cases, but some times it can takes a value between 1 and 2 [36]. Scale free networks have some highly connected hubs and the rest of nodes are of low degree. The hubs are thought to serve some specific purposes in their networks, although this depends greatly on the domain. The structure of this kind of network allows the ability of fault tolerant. Because the random occurrence of failures and the number of small degree nodes are enormous, the likelihood that a hub would be affected is negligible. Even if such even occurs, the networks will not lose its connectedness, because of the remaining hubs. This property make scale free network highly stable and robust [36]. 6. Social network analysis – SNA SNA [15] deals with mapping and measuring the nodes and relations between the nodes in a social network. As stated previously, the nodes might be people, organizations, etc, and relations might be friendship, kinship, or water flow. Social network analysis has become a key technique in modern sociology, anthropology, geography, social psychology, sociolinguistics, information science, communication studies, organizational studies, economics, and biology as well. For over a century, people have used social network metaphor to model complex sets of relationships between actors of social systems at all scale. ~ 18 ~ Analysts reason from whole to part, from structure to relation to individual, from behavior to attitude [33]. So why do we have to study social networks and what we can learn about their structure? The reason is that the structure of a network affects its functionalities. For example, the topology of social networks affects the way diseases spread through a population. Consider a city water distribution network, delivering water to households via pipes and junctions. Accidental or malicious instructions can cause contaminants to spread over the network. Jure Leskovec and colleagues at Carnegie Melon University has proposed an algorithm to select a few locations to install sensors in order to detect these contaminants as soon as possible so that it minimizes the population consuming contaminants [18]. The topology of power grid affects the stability and robustness of its power transmission; the power failure in Cleveland, Ohio on August 14, 2003 is an example. When occurred, the shutting down of nuclear power plants in New York State and Ohio led to widespread power blackouts in many parts of the Northeastern United States and Southeastern Canada through an interconnecting grid system, which affected about 50 million people [16]. ~ 19 ~ Chapter 2 Ranking in social network – Social rank 1. Introduction The evolution of the Internet has led to the need of a mechanism to find information in an efficient way, and information retrieval systems (search engines in particular) are born for this reason. Information Retrieval is a science of searching for information in documents. The system is often very complex, and consists of four main parts: crawlers, document processors, indexers and search. Crawlers (or spiders) collect documents and store them on disks. Documents are usually web pages on the internet. Then, these documents will be sent to document processors to process and extract necessary information, for example, remove un-wanted part of documents (advertises, header, footer, etc), extract title, body, descriptions, etc Information extracted from the process will be stored back to disks in a special structure that helps rapid access and search for a query. Web front-end is a interface that standing in front of users, receives queries from users, contacts the search component and present the results back to users. Refer to [28] for more details about information retrieval. As you can imagine, because of the huge size of the internet (Google has about 60 billion documents in its index [17] in 2005), the number of results for each query might be millions, or even billions, for example, the query “search engine” returns over 200 million results from Google. Users do not have time and patient to through every page to find an interested answer for their question. The process of rearrange the results so that the most relevant documents appear first in the list is called ranking. Ranking helps users find their need-answer without much effort, that mean in a rapid and efficient way. ~ 20 ~ Many ranking algorithms have been proposed to rank documents based on context, that mean whether the query terms match in the title, or in the body of a document; how early and how often they appear in a document. It seems that a document is more relevant to a query if the query terms appear in the title rather than in the body of the document. We often use freshness when searching for news. Freshness is a measure of how fresh a document is. Freshness help search engines rank documents based on the time it is created. We also measure the quality of a document to improve ranking. So what is quality? Let me explain it by some examples. Because of your search policy, you might want to favor web sites from government more than those from education, and focus on pages from education more than pages from the left. In other words, you assume pages from government have higher quality than pages from education, and pages from education have higher quality than the left. And another assumption is that home pages’ quality are higher than quality of pages that from a deeper position. It is very useful in the context that users search for a product or a company, the homepage of the company or the product should be on top. For instance, the query “Microsoft” should return the homepage of Microsoft Corporation rather than the page of some users that complains about the Microsoft products. To calculate the quality of a document, we can borrow some algorithms from social network analysis, which use the link structure to determine the importance of individuals in the network. 2. Ranking in social networks In social network analysis, one fundamental problem is ranking individuals in society according to their implicit importance, for example the power or influence, determined by the topology of the networks. Precisely, given a social ~ 21 ~ network, the purpose is producing a rank point, which is a non negative value, assigned to each individual [7]. We measure the importance of individuals in concepts of centrality, including Degree centrality, Betweenness centrality, closeness centrality and Eigenvector centrality. And we focus on Eigenvector centrality with the two most popular variants, PageRank and HITS. 2.1. Centrality Figure 8: The Kite Network Developed by David Krackhardt, source: ~ 22 ~ 2.1.1. Degree Centrality Degree Centrality is measured by the number of direct connections a node has. As illustrated in the kite network above, Diane has the most direct connections in the network, making her the most active node in the network. For several contexts, it is better to have many connections, but not always right. What really matters is where those connections lead to and how they connect the otherwise unconnected. Diane has connections only to others in her clique, who are already connected to each other [26]. If the network is directed, we classify the degree centrality into two types: in-degree and out-degree. In-degree of a node is the number of connections coming from other nodes to the node, whereas out-degree of a node is the count of connections starting from the node [34]. 2.1.2. Betweenness Centrality A node is said to be in high betweenness position if it connects many pairs of actors in the network, or stands in many shortest paths between any pairs of actors. If two non adjacent actors j and k want to interact and actor i is on the path between j and k, then i may have some control over the interactions between j and k. Betweenness measures this control of i over other pairs of actor. Thus, if i is on the paths of many such interactions, then i is an important actor [9]. In the network above, Heather has few direct connections but she has one of the best locations in the network, she is between the two most important clusters. As a cut-point in the shortest path connecting two other nodes, Heather could control the flow of information or exchange of resources. Besides the powerful role Heather has, she is a single point of failure [26]. ~ 23 ~ 2.1.3. Closeness Centrality The idea of closeness centrality is that an actor is central if it can easily interact with all other actors. Look at Figure 8, Fernando and Garth have fewer connections than Diane, but their positions allow them to interact and communicate with all the nodes in the network more quickly than anyone else. They have high closeness centrality, they have the shortest paths to all others, or they are close to everyone else [26]. Computing the betweenness and closeness centralities of all the nodes in a network needs calculating the shortest paths between all pairs of nodes on the network. This takes O(V3) time with Floyd-Warshall algorithm. On a sparse graph, the more efficient Johnson’s algorithm takes O(V2logV + VE) time [34]. 2.1.4. Eigenvector Centrality Eigenvector centrality is a measure of the importance of a node in a network. It assigns relative scores to all nodes in the network based on the principle that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes [34]. 2.2. PageRank & HITS Nearly all the major Web search engines today use link analysis to improve their search results. HITS and PageRank are two of the most popular link analysis algorithms. Both were developed around 1998 and both have dramatically improved the search business. 2.2.1. HITS HITS stands for Hyperlink-Induced Topic Search, a link analysis algorithm proposed by John Kleinberg from Cornell University during his postdoctoral studies at IBM Almade [5, 8, 9, 24]. Specifically, a root set R of nodes is constructed by sending a query q to an information retrieval system. Any node ~ 24 ~ u V (V is the set of nodes in the graph) that has an inlink or outlink to any node r R, that is eur E or eru E, is also included as well. The additional nodes and the root set together form the base set Vq. We now remove any nepotistic links, or eliminates edges that connect nodes from the same host. The remaining edges are called Eq. So we have just constructed a query-specific graph Gq = (Vq, Eq). The HITS algorithm is based on a pattern Kleinberg noticed among Web pages. Some pages that have many in-links are called authorities, which contain definitive high-quality information. Other pages serve as hubs or portal pages. Kleinberg noticed that good hubs seemed to point to good authorities and good authorities were pointed to by good hubs. Every page is both a hub and authority, but with different strengths. So he decided to give each page both a hub score hi and an authority score ai. In fact, for every page i he defined the hub score at iteration k, hi(k), and the authority score, ai(k), as: ai(k) = hi(k) = for k = 1, 2, 3, … To calculate the scores for a page, he started with a uniform scores for all pages, such as hi = 1/n and ai = 1/n where n is the number of pages or nodes in Gq. The hub and authority scores are iteratively refined until convergence to stationary values. In practice, we use two column vectors h and a representing hub and authority scores, where the ith component of the vectors holding the scores of page i. We also use adjacency matrices for representing graphs, let L be the matrix for the graph (Lij = 1 if page i links to page j , and 0, otherwise). The equation of HITS now become ~ 25 ~ a(k) = LT h(k-1) and h(k) = La(k) Using some algebra, we have a(k) = LTLa(k-1) h(k) = LLTh(k-1) Where LTL is called the hub matrix and LLT is the authority matrix. The HITS algorithm deals with solving the eigenvector problems LTLa = a and LLTh = h, where is the largest eigenvalue of LTL and LLT, a and h are corresponding eigenvectors. There are many issues when calculating these score beyond the basic linear algebra, such as convergence, existence, uniqueness, and numerical computation. Several research has proposed some modifications to HITS, and each bringing various advantages and disadvantages [5, 8]. 2.2.2. Pagerank PageRank [5, 8, 9, 21, 24, 27] is one of the most important ranking techniques used in many search engines. The ranking algorithm was proposed by Lawrence Page, the co-founder of the Google Inc. Pagerank is the heart of Google, that made Google really different from the other search engines at the time it came out. The reasons made PageRank the most popular is not because of its simple, robust and reliable in measuring the importance of web pages, but it also gain advantages with other ranking techniques in that it is query independent and content independent. Moreover, it can be calculated offline using only the web graph structure and then used later . The philosophy of PageRank can be described as follows: consider a random surfer that starts from a random page, and at every time randomly chooses the next page by clicking on one of the hyperlinks in the current page. We could define the rank of a page as the fraction of time that the surfer spent on that page ~ 26 ~ on the average. More clearly, important pages, i.e., pages that are linked by many other pages, or by few important pages, will be visited more frequently. If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. The solution for this problem is quite simple, we assume that the random surfer picks another page at random and continue surfing again, with probability (1- α), if it arrives a sink page. α is called the damping factor [27]. Figure 9: An example showing how pagerank works Source: In general case, the PageRank value for any page u can be computed as: PR(u) = ~ 27 ~ Initially, each page u has the same PageRank value as ( Where (v,u) means there is a link from page v to page u, and L(v) is the number of (out) links from page v. For example, consider a small set of four web pages A, B, C and D. The algorithm assigns equally with each page an estimated PageRank score of 0.25 If pages B,C and D each only link to A, they would contribute 0.25 PageRank value to A. The PageRank for A is now computed as PR(A) = PR(B) + PR(C) + PR(D) = 0.75 In the other case, suppose page B also link to page C and page D has links to all pages, so B gives 0.125 rank value to each page A and C, and one third of D’s PageRank is added to the total PageRank of A. PR(A) = + + PR(B) = PR(C) = + Now applying the random surfer model, we have another equation of PageRank for a page u: PR (u) = + For most of the cases, we let α = 0.85, which was estimated from the frequency that an average surfer uses the bookmark feature in his or her browser, this value was original suggested by Lawrence Page when he announced the algorithm. ~ 28 ~ Like HITS, we can write the process using matrix notation. Let P(k)T be the row vector of PageRank at the iteration k, and H is a row normalized hyperlink matrix, i.e., hij = , if there is a link from page I to page j, and hij = 0, otherwise. So, the summation equation for PageRank can be written as P(k+1)T = P(k)TH The convergence of PageRank is very fast. To calculate PageRank on a large 322 million pages, we need only 52 iterations. The convergence on half the pages takes roughly 45 iterations. For a dataset of size n, the PageRank algorithm could converge in log2(n) iterations [21]. ~ 29 ~ Chapter 3 Ranking bloggers and Experiments 1. Background and Motivation From the beginning of the Internet, there are lots of types of information sharing systems, including the Web. Recently, online social networks have become considerably popular and received great attention of people. Statistics from ComScore in 2007 for seven services: My Space, Facebook, Bebo, Orkut, Hi5, Friendster, and Tagged, showed the greatest overall growth. They tracked the numbers of users in the period from June 2006 to June 2007 to benchmark the growth of these social networks. Facebook has grown 270 percent, from 14,083,000 users to 52,167,000. Bebo is shown to grow about 172 percent, and Orkut approximately grown 78 percent. Friendster is growing almost as quickly as MySpace: 65 percent versus 72 percent, according to ComScore’s statistics [11]. We have also known some social networks as blogs, the word inherited from the phrase “web log”, which describes the activities of people writing down their information, such as their everyday life activities, interests, etc, and then people created the verb (we) “blog”. Hence, the people participating in social networks are so called bloggers. In order to participate in a social network, people have to register an account with a site, and be asked for some optional information such as birthday, place of residence, or interests, which is added to the profile at their blog. Three common activities on social networks are making friends, blogging, and commenting. Because the user’s links, along with her profile, are visible to those who visit the user’s account, users are able to explore the social network by ~ 30 ~ following user-to-user links, browsing the information, and any contributed content of visited users, make friends and leave comments as they go. Search is one of the most active research areas nowadays and there is a lot of literature available on the topic. Many breakthrough developments have been observed in the specific area of web search, but the authors of this works are unaware of any existing publication discussing social network specific search problem. In a research [23] collaborated between Google Engineering Belo Horizonte and Federal University of Minas Gerais, Brazil, they created a friendship network from Orkut, an online social networking site with over 40 million users, and developed a searching algorithm that balance

Các file đính kèm theo tài liệu này:

K49_Nguyen_Huu_Binh_Minh_Thesis_English.pdf