Table of Contents
Abstract
List of Images
List of Tables
Chapter 1: Peer to Peer and the Ranking Problem
1.1. Peer to Peer
1.1.1. Peer to Peer Overview
1.1.2. Architecture of Peer to Peer Systems
1.1.3. Distributed Hash Tables
1.2. Ranking in Peer to Peer Networks
1.2.1. Introduction
1.2.2. Ranking Roles
1.2.3. Key Objects of the Research
Chapter 2: Ranking on DHT Peer to Peer Networks
2.1. Chord Protocol
2.2. PageRank
2.2.1. Description
2.2.2. Algorithms
2.3. Distributed Computing
2.3.1. Introduction
2.3.2. Algorithms
2.4. Computing PageRank in a Distributed System
2.5. tf-idf
Chapter 3: Building a New Algorithm for Ranking Nodes in Chord Networks
3.1. Targets and Missions of the Research
3.2. Idea
3.2.1. Major Problems to Exploit
3.2.2. Ranking Idea
Chapter 4: Ranking Algorithm Details
4.1. Ranking Algorithm
4.2. Ranking Features
Chapter 5: Evaluation
Chapter 6: Related Work
Chapter 7: Contributions and Future Work
References
Whenever Google updates its index, it re-evaluates the PageRank scores. As Google increases the number of documents in its collection, the initial PageRank approximation for every document decreases.

The formula uses a model of a random surfer who reaches a target page after several clicks, then switches to a random page. The PageRank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. It can be understood as a Markov chain in which the states are pages and the transitions, which are all equally probable, are the links between pages.

If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. However, the solution is quite simple: if the random surfer arrives at a sink page, it picks another URL at random and continues surfing. When calculating PageRank, pages with no outbound links are therefore assumed to link to all the other pages in the collection, so that their PageRank score is divided evenly among the other pages.
Research on Node Ranking – Peer to Peer …. Hoàng Cường
In other words, to remain fair to pages that are not sinks, these random transitions are added to all nodes of the web. The remaining probability is usually set to d = 0.85, estimated from the frequency with which an average surfer uses his or her browser's bookmark feature.
Therefore, the equation is as follows:

PR(p_i) = (1 - d)/N + d * Σ_{p_j ∈ M(p_i)} PR(p_j)/L(p_j)

where p_1, p_2, ..., p_N are the data under consideration, M(p_i) is the set of data that link to p_i, L(p_j) is the number of outbound links on page p_j, and N is the total number of data.
The PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix. This makes PageRank a particularly elegant metric: the eigenvector is

R = [PR(p_1), PR(p_2), ..., PR(p_N)]^T,

where R is the solution of the equation

R = ((1 - d)/N) * 1 + d * L * R,

with 1 the column vector of N ones and L the matrix whose entries are given by the adjacency function ℓ(p_i, p_j), which is 0 if page p_j does not link to p_i, and normalized such that, for each j,

Σ_i ℓ(p_i, p_j) = 1,

i.e. the elements of each column sum up to 1. This is a variant of the eigenvector centrality measure used commonly in network analysis. Because of the large eigengap of the modified adjacency matrix above, the values of the PageRank eigenvector are fast to approximate (only a few iterations are needed).
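The iteration behind these equations can be sketched as a simple power iteration. The sketch below is illustrative only: the four-page graph at the bottom is a hypothetical toy example, and dangling (sink) pages are handled by spreading their rank evenly, as described earlier.

```python
# Power-iteration sketch of the PageRank equation above, with d = 0.85.
# 'links' maps each page to the pages it links to.

def pagerank(links, d=0.85, eps=1e-8, max_iter=200):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            if not outs:
                # Dangling ("sink") page: spread its rank evenly, as if it
                # linked to every page in the collection.
                for q in pages:
                    new[q] += d * rank[p] / n
            else:
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
        done = sum(abs(new[p] - rank[p]) for p in pages) < eps
        rank = new
        if done:
            break
    return rank

# Hypothetical toy graph: C receives the most in-links, D receives none.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]})
```

Because the transition distribution is column-stochastic, the total rank mass stays at 1 on every iteration, and the damping factor d < 1 guarantees convergence.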
As a result of Markov theory, it can be shown that the PageRank of a page is the probability of being at that page after many clicks. This happens to equal 1/t, where t is the expected number of clicks (or random jumps) required to get from the page back to itself.

One main disadvantage is that it favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing site (a site being a densely connected set of pages, such as Wikipedia). The Google Directory (itself a derivative of the Open Directory Project) allows users to view results within a category sorted by PageRank. The Google Directory is the only service Google provides in which PageRank directly determines the display order. In Google's other search services (for example its main web search), PageRank is used as one of dozens of factors for weighing the relevance of the data shown in search results.

Several strategies have been proposed to accelerate the computation of PageRank. Various strategies to manipulate PageRank have also been employed, in a concerted effort to improve search result rankings and to monetize advertising links. These strategies have severely impacted the reliability of the PageRank concept, which seeks to determine which documents are actually valued by the web community. Google is known to penalize link farms and other schemes designed to inflate PageRank artificially. In December 2007, Google started actively penalizing sites that sell paid text links. How Google identifies link farms and other PageRank manipulation tools is among Google's business secrets.
2.3 Distributed Computing
Distributed computing is the area of computer science that studies distributed systems. A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs. Distributed computing also refers to the use of distributed systems to solve computational problems. In distributed computing, a problem is divided into many tasks, each of which is solved by one of the computers.
Image 2.3: Distributed Nodes Graph example
Many tasks that we would like to automate by using a computer are of the question-answer type: we ask a question, and the computer should produce the answer. In theoretical computer science, such tasks are called computational problems. Formally, a computational problem consists of instances together with a solution for each instance. Instances are the questions that we can ask, and solutions are the desired answers to those questions.

Theoretical computer science seeks to understand which computational problems can be solved by using a computer (computability theory), and how efficiently (computational complexity theory). Traditionally, a problem is said to be solvable by using a computer if we can design an algorithm that produces a correct solution for any given instance. Such an algorithm can be implemented as a computer program that runs on a general-purpose computer: the program reads a problem instance from its input, performs some computation, and produces the solution as output. Formalisms such as random access machines or universal Turing machines can be used as abstract models of a sequential general-purpose computer executing such an algorithm.

The fields of concurrent and distributed computing study similar questions for the case of multiple computers, or a computer executing a network of interacting processes: which computational problems can be solved in such a network, and how efficiently? However, it is not at all obvious what is meant by "solving a problem" in the case of a concurrent or distributed system.
2.4 Computing PageRank in a distributed system
Recent research work on computing PageRank in distributed systems partitions the web graph into sites or domains. The web is modeled as a set of many interconnected web servers. The hyperlinks of the web are divided into two categories: intra-server links and inter-server links. Intra-server links are links between pages within the same server, and they are used to compute a local PageRank vector on each server. Inter-server links are links between pages on different servers, and they are used to compute ServerRank. ServerRank measures the relative importance of the different web servers. The results from the many web servers are finally merged to produce a ranked list of hyperlinks.

A ranking algebra has been proposed that deals with rankings at different levels of granularity; it can also be applied to aggregate local page rankings and site rankings into a global ranking. In work on PageRank approximation in a fully decentralized system, each peer is autonomous, and the content of the peers may overlap. In the proposed JXP algorithm, each peer computes a local PageRank score, then meets other peers at random and gradually extends its knowledge of the global web graph through the exchanged information, after which it recomputes the local PageRank scores at the peer.

This meeting and recomputation process is repeated until the peers have collected enough information. If the peers meet and exchange information a sufficient number of times, the JXP scores eventually converge to the true global PageRank scores. It is assumed that the out-degree of each page in the global graph is known. However, all of these approaches focus on providing approximations of the global graph's ranking, whether in a centralized or a distributed system.
2.5. tf-idf
The tf–idf weight (term frequency–inverse document frequency) is a weight
often used in information retrieval and text mining. This weight is a statistical measure
used to evaluate how important a word is to a document in a collection or corpus. The
importance increases proportionally to the number of times a word appears in the
document but is offset by the frequency of the word in the corpus. Variations of the tf–
idf weighting scheme are often used by search engines as a central tool in scoring and
ranking a document's relevance given a user query.
One of the simplest ranking functions is estimated by summing the tf-idf for
each query term; many more sophisticated ranking functions are variants of this simple
model.
Motivation
Suppose we have a set of English text documents and wish to determine which
document is most relevant to the query "the brown cow." A simple way to start out is
by eliminating documents that do not contain all three words "the," "brown," and
"cow," but this still leaves many documents. To further distinguish them, we might
count the number of times each term occurs in each document and sum them all
together; the number of times a term occurs in a document is called its term frequency.
However, because the term "the" is so common, this will tend to incorrectly emphasize
documents which happen to use the word "the" more, without giving enough weight to
the more meaningful terms "brown" and "cow". Also the term "the" is not a good
keyword to distinguish relevant and non-relevant documents and terms like "brown"
and "cow" that occur rarely are good keywords to distinguish relevant documents from
the non-relevant documents. Hence an inverse document frequency factor is
incorporated which diminishes the weight of terms that occur very frequently in the
collection and increases the weight of terms that occur rarely.
Mathematical details
The term count in the given document is simply the number of times a given
term appears in that document. This count is usually normalized to prevent a bias
towards longer documents (which may have a higher term count regardless of the
actual importance of that term in the document) to give a measure of the importance of
the term ti within the particular document dj. Thus we have the term frequency,
defined as follows:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of occurrences of the considered term (t_i) in document d_j, and the denominator is the sum of the number of occurrences of all terms in document d_j.
The inverse document frequency is a measure of the general importance of the term (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

idf_i = log( |D| / |{d : t_i ∈ d}| )

with
• |D|: total number of documents in the corpus
• |{d : t_i ∈ d}|: number of documents where the term t_i appears (that is, n_{i,j} ≠ 0). If the term is not in the corpus, this will lead to a division by zero; it is therefore common to use 1 + |{d : t_i ∈ d}| instead.

Then

tfidf_{i,j} = tf_{i,j} × idf_i
A high weight in tf–idf is reached by a high term frequency (in the given document)
and a low document frequency of the term in the whole collection of documents; the
weights hence tend to filter out common terms. The tf-idf value for a term will always
be greater than or equal to zero.
Example
Consider a document containing 100 words wherein the word cow appears 3 times.
Following the previously defined formulas, the term frequency (TF) for cow is then
0.03 (3 / 100). Now, assume we have 10 million documents and cow appears in one
thousand of these. Then, the inverse document frequency is calculated as log(10 000
000 / 1 000) = 4. The TF-IDF score is the product of these quantities: 0.03 × 4 = 0.12.
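The formulas and the worked example above can be sketched in a few lines of code; the document and corpus figures below simply reproduce the example.

```python
import math

# Sketch of the tf-idf formulas above, reproducing the worked example:
# a 100-word document where "cow" appears 3 times, in a corpus of
# 10 million documents of which 1,000 contain "cow".

def tf(term, doc):
    # Term frequency: occurrences of the term over the total terms in the doc.
    return doc.count(term) / len(doc)

def idf(corpus_size, docs_with_term):
    # Base-10 logarithm, matching the example (log(10 000 000 / 1 000) = 4).
    return math.log10(corpus_size / docs_with_term)

def tf_idf(term, doc, corpus_size, docs_with_term):
    return tf(term, doc) * idf(corpus_size, docs_with_term)

doc = ["cow"] * 3 + ["grass"] * 97  # stand-in 100-word document
score = tf_idf("cow", doc, 10_000_000, 1_000)  # 0.03 * 4 = 0.12
```

In a production system one would add the 1 + |{d : t_i ∈ d}| smoothing described above to avoid division by zero for unseen terms; it is omitted here to match the example exactly.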
Chapter 3:
Building a new algorithm for ranking nodes in Chord networks
3.1 Targets and Missions of Research
P2P deployments are a natural infrastructure for building distributed search networks. Proposed systems support discovering and retrieving all results, but lack the information necessary to rank them. Users, however, are mainly interested in the most relevant results, not in all possible results. Using random sampling, we extend a class of well-known information retrieval ranking algorithms so that they can be applied in this distributed setting. We analyze the overhead of our method, and quantify how our system scales with the number of documents, the system size, the query-to-document mapping (uniform to non-uniform), and the term frequency (rare to common terms). Our analysis and simulations show that a) these extensions are efficient and scale with small overhead to large systems, and b) the accuracy of the results obtained with distributed ranking is comparable to that of a centralized implementation.
3.2 Idea:
Table 3.2.1: PageRank convergence versus HITS convergence

Writing a small test program, I observed the convergence of the two algorithms, HITS and PageRank: HITS converges better than PageRank, as can be seen in the image from the experiment. The test was run on a graph of 1000 nodes. The blue line shows the convergence (iterations) of PageRank; the red line shows the convergence (iterations) of HITS.
Table 3.2.2: PageRank convergence time increases too fast
I also examined the convergence of the PageRank algorithm on its own. The image from the experiment shows the time it takes for the PageRank algorithm to converge: it grows far too quickly. This cannot be done in a peer to peer network (the time increases much faster than the size of the graph).

The image is taken from the experimental results on the convergence of the PageRank algorithm, simulating networks of 1000, 2000 and 4000 nodes (measuring execution time). The light blue line shows the convergence time of the PageRank algorithm on the 4000-node network, the dark blue line on the 2000-node network, and the red dotted line on the 1000-node network.

Computation time increases exponentially with the number of network nodes. Ranking all the nodes in the network costs significant performance: the larger the network, the greater the efficiency loss (greater than the performance gained from the added nodes).
Table 3.2.3: PageRank convergence is not steady when Epsilon is small

The image is taken from the experimental results on the convergence of the PageRank algorithm, simulating a 1000-node network (measuring execution time).
The dark blue line with red dots shows the convergence time of the PageRank algorithm on the 1000-node network under different random conditions. With a small error threshold Epsilon, the number of iterations k needed for PageRank to converge is not completely stable (the gap changes).
Table 3.2.4: HITS convergence (steady, but takes much more time than PageRank)

The image is taken from the experimental results on the convergence of the HITS algorithm, simulating a 1000-node network (in terms of iterations until convergence). The dark blue line with red dots shows the convergence of the HITS algorithm on the 1000-node network under different random conditions. With a small error threshold Epsilon, the convergence of HITS is more stable than that of PageRank.
Idea: combine HITS and PageRank. The HITS method filters out the n best nodes (the best content authorities), and PageRank is then computed only on those n nodes (n is very small compared with the total number of network nodes).
Advantages of this new approach:
• Accurate results
• Easier computation
• Easily feasible
• Faster
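The combined idea can be sketched as follows. This is only an illustrative sketch, not the thesis's final algorithm: the graph and the parameter n below are hypothetical toy values, and the HITS and PageRank routines are minimal versions of the standard algorithms.

```python
# Sketch of the combined idea: HITS filters the n best authorities,
# then PageRank runs only on the small subgraph they induce.

def hits_authorities(links, iters=50):
    nodes = set(links) | {q for outs in links.values() for q in outs}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth

def pagerank_subset(links, d=0.85, iters=100):
    # Minimal PageRank over a small graph (dangling mass is simply dropped).
    nodes = list(links)
    n = len(nodes)
    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iters):
        rank = {p: (1 - d) / n
                   + d * sum(rank[q] / len(links[q])
                             for q in nodes if links[q] and p in links[q])
                for p in nodes}
    return rank

def combined_rank(links, n_top):
    auth = hits_authorities(links)
    top = set(sorted(auth, key=auth.get, reverse=True)[:n_top])
    sub = {p: [q for q in links.get(p, []) if q in top] for p in top}
    return pagerank_subset(sub)

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
res = combined_rank(graph, 2)  # PageRank over the 2 best HITS authorities
```

Because PageRank only runs on the n selected nodes, the expensive iteration touches a small subgraph instead of the whole network, which is the source of the speed advantage claimed above.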
Later we will look in depth at what it takes to obtain a system with such features. For now, let us analyze a simple example from the Google search engine:
Image 3.2: Google is often not exact

In the results we can see the topics returned. They are different and unrelated; some results are exact, but many are not.
3.2.2 Ranking Idea:
Customized semantic query answering, personalized search (Google was the first of the major search engines to offer personalized results on a massive scale; weighing a number of factors including, but not limited to, user history, bookmarks, community behavior, site click-through rate and stickiness, Google provides results that are specific to what it believes you are searching for; currently this service is only available to users who are logged into their Google account), focused crawlers and localized search engines frequently focus on ranking the nodes contained within a sub-graph of the global Peer to Peer node graph. The challenge for these applications is to estimate PageRank-style scores efficiently on the sub-graph, i.e., the ranking must reflect the global link structure of the Peer to Peer node graph exactly (which means the calculation returns the true global PageRank-score order), but it must do so without bearing the high overhead associated with a global computation. We propose a framework with an exact solution and an approximate solution for computing the ranking on a sub-graph, without running PageRank on the whole node graph. Approaching a single sub-graph and multiple sub-graphs, respectively, is the main contribution of this ranking research.
Image 3.2.2: Intersect idea
The Rank_local_idea algorithm is a rigorous method under the assumption that the scores of the "outside" nodes are known. We show that the Rank_local_idea values for nodes in the sub-graph converge. Since the PageRank-style values of outside nodes may not in practice be obtainable, we introduce the Pagerank_local algorithm to estimate score results for the sub-graph.

We use the graph EG (Outside Graph) to denote the enclosing graph. Both Rank_local_idea and Pagerank_local place the set of outside nodes in an outside node graph EG and extend the sub-graph with links to EG. They also modify the PageRank-style transition matrix (the transition probability distribution, represented by a matrix) with respect to EG.

We analyze the L1 distance between the Rank_local_idea scores and the Pagerank_local scores of the sub-graph and show that it is within a constant factor of the L1 distance of the outside nodes (e.g., between the true PageRank scores and the uniform scores assumed by Pagerank_local). We compare Pagerank_local with a stochastic complementation approach (SC), a current best solution for this problem, on different types of sub-graphs. Pagerank_local has similar or superior accuracy to SC and typically improves on the runtime performance of SC by an order of magnitude or more. We demonstrate that Pagerank_local provides a good approximation to PageRank for a variety of sub-graphs.
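The general shape of such a sub-graph computation can be sketched as follows. This is not the thesis's Pagerank_local algorithm itself, but a simplified illustration of the idea it rests on: rank arriving from outside nodes is treated as a fixed, assumed quantity (here the hypothetical external_in values), and the iteration runs only over the sub-graph.

```python
# Illustrative sketch only: rank a sub-graph while treating rank that
# arrives from outside nodes as a fixed, assumed quantity.

def local_pagerank(sub_links, external_in, d=0.85, iters=100):
    nodes = list(sub_links)
    n = len(nodes)
    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iters):
        new = {}
        for p in nodes:
            # Rank flowing to p along links inside the sub-graph.
            internal = sum(rank[q] / len(sub_links[q])
                           for q in nodes if sub_links[q] and p in sub_links[q])
            # external_in[p] is the assumed rank mass arriving from outside.
            new[p] = (1 - d) / n + d * (internal + external_in.get(p, 0.0))
        rank = new
    return rank

sub = {"X": ["Y"], "Y": ["X", "Z"], "Z": ["X"]}
# Hypothetical assumption: node X receives 0.1 units of rank from outside.
scores = local_pagerank(sub, {"X": 0.1})
```

The quality of such an approximation depends entirely on how close the assumed outside scores are to the true ones, which is exactly the gap that the L1-distance analysis above quantifies.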
3.3 Idea: finding a subset
Many tasks require finding salient and diverse items to satisfy the user's information need. This problem has been addressed in document summarization, information retrieval, and various data mining applications. Here we present a new method that can select salient and diverse items from a large body of information content. We formulate the task as a graph summarization problem: given a weighted graph, how can we find a subset of salient and diverse vertices that summarizes the graph, subject to some pre-specified constraints? We present a linear programming based approximation algorithm to optimize an objective function that takes into account both salience and diversity. The method was applied to two tasks:
• multi-document summarization, which extracts salient sentences from sentence graphs;
• mining representative experts in the data-mining community from co-author graphs.
In comparison to state-of-the-art graph-based methods, our method produces superior results in these applications.
3.4: Finding ranking factors
Image 3.4: Factor Percent
Chapter 4:
Ranking algorithm details
4.1: Idea
The explosion of file sharing available on Peer to Peer networks has made ranking an expensive but unavoidable component of peer to peer systems. The more users search, the more the system is overloaded. Since high-bandwidth links from one node to another usually point to an "authorization" or "commendation", bandwidth-link analysis plays an essential role in electing the "important" nodes in the network. PageRank and HITS are the two foremost approaches in this area. PageRank iteratively estimates the score of a page based on the scores of its parents.
Image 4.1: Bandwidth is the key to trusted ranking

The data follow the equation above.
HITS separates the role of each node into a hub or authority. In the HITS
algorithm, the first step is to retrieve the set of results to the search query. The
computation is performed only on this result set, not across all Graph Nodes.
Authority and hub values are defined in terms of one another in a mutual
recursion. An authority value is computed as the sum of the scaled hub values that
point to that page. A hub value is the sum of the scaled authority values of the pages it
points to. Some implementations also consider the relevance of the linked nodes.
The algorithm performs a series of iterations, each consisting of two basic
steps:
• Authority Update: Update each node's Authority score to be equal to the
sum of the Hub Scores of each node that points to it. That is, a node is
given a high authority score by being linked to by nodes that are
recognized as Hubs for information.
• Hub Update: Update each node's Hub Score to be equal to the sum of
the Authority Scores of each node that it points to. That is, a node is
given a high hub score by linking to nodes that are considered to be
authorities on the subject.
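The two update steps above translate directly into code. The sketch below runs them on a hypothetical three-node toy graph, with L2 normalisation after each round so the mutually recursive scores stay bounded.

```python
# Direct translation of the Authority Update and Hub Update steps above.

def hits(links, iters=100):
    nodes = sorted(set(links) | {q for outs in links.values() for q in outs})
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iters):
        # Authority update: sum of the hub scores of nodes pointing to it.
        auth = {n: sum(hub[m] for m in nodes if n in links.get(m, ())) for n in nodes}
        # Hub update: sum of the authority scores of the nodes it points to.
        hub = {n: sum(auth[q] for q in links.get(n, ())) for n in nodes}
        # Normalise so the mutually recursive scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return auth, hub

# A links to B and C (a good hub); C is linked to by A and B (a good authority).
auth, hub = hits({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In a real search setting the iteration would run only on the result set of a query, as noted above, not on the whole node graph.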
The hub score estimates the value of its bandwidth-links to other nodes and the
authority score estimates the importance of the node. These algorithms are expensive
because of the number of node data/objects involved in the computation. On 15
November 2008, The Pirate Bay announced that it had reached over 25 million unique
peers. And according to the Pirate Bay statistics, as of December 2009, The Pirate Bay
has over 4 million registered users, although registration is not necessary to download
torrents. To make ranking feasible, and to manifest the diversity of node users'
information needs, peer to peer networking applications such as semantic search
("Semantic search seeks to improve search accuracy by understanding searcher intent
and the contextual meaning of terms as they appear in the searchable dataspace,
whether on the Web or within a closed system, to generate more relevant results.
Author Seth Grimes lists "11 approaches that join semantics to search", and
Hildebrand et al. provide an overview that lists semantic search systems and identifies
other uses of semantics in the search process.), focused crawlers (A focused crawler or
topical crawler is a web crawler that attempts to download only web data that are
relevant to a pre-defined topic or set of topics. Topical crawling generally assumes that
only the topic is given, while focused crawling also assumes that some labeled
examples of relevant and not relevant data are available. ), localized search engines,
and personalized search