CreateLatticeIcrementally procedure (G, M, I) receives all data sets (set of objects G includes
files, set of attributes M includes terms in files, correlation I belongs to G, M). AddIntent is the algorithm
in Bottom-Up direction, assigned by {0, M}. It means that, the BottomConcept contains all terms of
lattice L (row 02). The process starts with updating BottomConcept to the bottom of lattice (row 03).
With each object g of the set of objects G (with each file of set of files), AddIntent call procedure
to add gradually the concepts to the concept-lattice, transmit to AddIntent with three parameters: g’
(intent, set of terms in a file), BottomConcept concept (set of terms in files) and lattice L (row 04, 05).
In the procedure-body, AddIntent creates concept (and connections with other concepts), For. End For
loop of the procedure takes in turn each concept of the set of created concepts – to update to Extent, row
06. The finished procedure is the created lattice
27 trang |
Chia sẻ: honganh20 | Ngày: 04/03/2022 | Lượt xem: 360 | Lượt tải: 0
Bạn đang xem trước 20 trang tài liệu Tóm tắt Luận án Some searching techniques for entities based on implicit semantic relations and context - Aware query suggestions, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
na group; Apple - iPhone; ... In the real world, there are
questions like: "knowing Fansipan is the highest mountain in Vietnam, which is the highest mountain in India?",
"If Biden is president-elect the United States, who is the most powerful person in Sweden? ”,
By the keyword search engine, according to statistics, queries are often short, ambiguous, and poly-
semantic [1-6]. Approximately 71% of web search queries contain names of entities, as statistics [7, 8]. If the user
enters the entities like "Vietnam", "Hanoi", "French", then the search engine only results in documents that contain
the above keyboards, but does not immediately answer "Paris". Because of looking for entities only, query
extending and query rewriting techniques are not applied to the type of the implicit semantic relation in the entity
pair. Therefore, a new search morphology is researched. The pattern of the search query is in the form of: (A, B),
(C, ?), in which (A, B) is the source entity pair, (C, ?) is the target entity pair. At the same time, the two pairs (A,
B), (C, ?) have a semantic similarity. In other words, when the user enters the query (A, B), (C,?), the search engine
has the duty of listing entities D so that each entity D satisfies the condition of semantic relation with C as well as
the pair (C, D) have similarity relation with the pair (A, B). With an input consisting of only 3 entities: "Vietnam",
"Hanoi", "France", the semantic relation "is the capital" is not indicated in the query
2.2. Method for searching entities is based on implicit semantic relations
2.2.1. Architecture – Modeling
The concept of searching entities through implicit semantic relation is the most obvious distinction for
search engines based on keywords. Figure 2.1 simulates a query consisting of only three entities, query = (Vietnam,
Mekong), (China, ?).
Write the convention: q = {(A, B), (C,
?)}, where (Vietnam, Mekong) is a pair of source
entities, (China, ?) is a pair of target entities. The
search engine is responsible for identifying the
entity ("?") that has a semantic relation with the
"China" entity, and the entity pair (China, ?) must
be similarly related to the entity pair (Vietnam,
Mekong). Note that the above query does not
explicitly contain the semantic relation between
the two entities. This is because semantic
relations are expressed in various ways around
the pair of entities (Vietnam, Mekong), such as
"the longest river", "big river system", "the
largest basin", etc.
Figure 2.1: Implicit Semantic Relation Search with input consisting of 3 entities
9
Due to the fact that the query consists of only three entities that do not include semantic relations, the
model is called the implicit semantic relation search model.
In case IRS does not find A, B or C, the keyword search engine will be applied.
The morphology of search for entities based on implicit semantic relations must determine the semantic
relation between the two entities and calculate the similarity of the two entity pairs, since that, give the answer to
the unknown entity (entity "?"). On a specific corpus, in general, Implicit Relational Search (IRS) consists of three
main modules: The extracting module of the sematic relations from the corpus; Clustering module of semantic
relations; Calculating module of similar relations between two entity pairs. In practice, the IRS model consists of
two phases: online phase: meeting the real-time search, and offline phase: processing, calculating, storing semantic
relations and similarity relations, in which, the extracting and clustering modules of the semantic relations are in
the off-line phase of the IRS model.
Extracting module of the semantic relations: From the corpus, this module extracts the patterns (the root
sentence contains pairs of entities and context) as illustrated above: A the longest river B, where A, B are 2 named
entities. The pattern set obtained will consist of different patterns, similar patterns, or patterns of different lengths
and terms, but the same semantic expression. For example: A is the largest river of B, A is the river of B has the
largest basin, or B has the longest river as A, etc.
Clustering module of semantic relations : From the obtained pattern set, clustering is performed to identify
clusters of contexts, where each context in the same clusters has a semantic similarity. Make a table of the pattern
indexes and the corresponding entity pairs.
Calculating module of similar relations between two entity pairs is in the online phase of the IRS model.
Pick up the query q = (A, B), (C, ?), IRS will search the entity pair (A, B) and the corresponding semantic relation
(context) set in the index table. From the obtained semantic relation set, find the pairs of entities (C, Di) associated
with this relation. Apply the Cosine measure to calculate and rank the similarity between (A, B) and (C, Di), and
give a list of ranked entities Di to answer the query.
Considering q = {(Vietnam, Mekong),
(China,?)}, the IRS finds a cluster containing
pairs of entities (Vietnam, Mekong) and the
corresponding semantic relation: "longest river"
(from the original sentence: "The Mekong is the
longest river in Vietnam"). This cluster also
contains a similar semantic relation: "the largest
river", in which the relation: "largest river" is
associated with the entity (China, Changjiang)
(from the original sentence: "Changjiang is
river the biggest in China ”). The IRS will put
"Truong Giang" in the list of candidates, rank
semantic relations according to the measure of
similarity, and return results to the searcher.
Figure 2.2: General structure of IRS.
Entity – pairs &
Context
Corpus
Inverted Index for IRS
Extracting
semantic relations
Clustering
semantic relations
Calculating the
relation similarity
(RelSim)
Candidate
answers
Ranked answers
Q
u
er
y
:
{
(V
iệ
t
N
am
,
M
ê
K
ô
n
g
),
(
T
ru
n
g
Q
u
ố
c,
?
)}
10
2.2.2. Extracting module of the semantic relations
Receiving the input query q = (A, B), (C, ?), the general structure of IRS is modelized:
Filter-Entities (Fe) filters/seeks candidate set S containing entity pairs (C, Di) that are related to
the input entity pair (A, B):
Fe(q, D) = Fe({(A, B), (C, ?)}, D) = {
1, 𝑖𝑓𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡(𝑞, 𝐷)
0, 𝑒𝑙𝑠𝑒
(2.1)
Hàm Rank-Entities (Re) ranks the entities Di, Dj in the candidate set S by RelSim (Relational
Similarity), whichever has higher RelSim is ranked lower (i.e. closer with the top or higher rank):
∀ Di, Dj ∊ S, if:
RelSim((A, B),(C, Di)) > RelSim((A, B),(C, Dj)) : Re(q, Di) < Re(q, Dj) (2.2)
2.2.3. Clustering module of semantic relations
The clustering process converts "similar" elements into a cluster. In the semantic entity search model, the
elements in the cluster are semantically similar context sentences. Similarity is a quantity used to compare two or
more elements with each other, reflecting the correlation between two elements. Therefore, the thesis generalizes
the measurements of terms similarity; similarity based on vector space; semantic similarity - of the two contexts.
a) Measurements of the similarity between 2 context
Terms-similarity
Zaro function: Winkler4. Distance Zaro calculates the similarity between 2 strings a, b:
SimZaro(a,b) = {
0,𝑖𝑓𝑚 = 0
1
3
(
𝑚
|𝑎|
+
𝑚
|𝑏|
+
𝑚−𝑠𝑘𝑖𝑝
𝑚
) 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
(2.6)
Constrast model: As proposed by Tversky (“Features of similarity”, Psychological Review, 1977)5,
applying a contrast model to calculate the similarity between two sentences a, b:
Sim(a, b)=α*f(a∩b) − 𝛽*f(a-b) − γ*f(b−a) (2.8)
Jaccard distance: Sim(a, b) =
|𝑎∩𝑏|
|𝑎∪𝑏|
(2.9)
Similarity based on vector space: Euclidean, Manhattan, Cosine.
Semantic similarity
According to the Distribution Hypothesis [17]: Words occurring in the same context tend to have
semantic similarity. Because Vietnamese language does not have Vietwordnet system to calculate the
semantic similarity between 2 terms, the thesis uses PMI correlation to measure and evaluate the semantic
similarity between two sentences (context).
PMI (Pointwise Mutual Information) method: Proposed by Church and Hanks [1990]. Based on the
probability that co-occurs between 2 terms t1, t2 in the corpus, PMI(t1, t2) are calculated by the formula:
PMI(t1, t2) = log2(
𝑷(𝒕𝟏,𝒕𝟐)
𝑷(𝒕𝟏).𝑷(𝒕𝟐)
) (2.16)
b) Clustering module of semantic relations
Applying PMI to improve the similarity measure according to the Distribution Hypothesis:
4 https://en.wikipedia.org/wiki/Jaro-Winkler_distance
5
11
SimDH(p,q) = Cosine(PMI(p, q)) =
∑ (𝑃𝑀𝐼(𝑤𝑖,𝑝)∙𝑃𝑀𝐼(𝑤𝑖,𝑞))𝑖
||𝑃𝑀𝐼(𝑤𝑖,𝑝)||||𝑃𝑀𝐼(𝑤𝑖,𝑞)||
(2.25)
The similarity by terms of 2 context p, q:
Simterm(p, q) =
∑ (weighti(p)∙weighti(q))
n
i=1
||weight(p)||||weight(q)||
(2.26)
Measurement of the combined similarity:
Sim(p,q) = Max(SimDH(p, q),Simterm(p, q)) (2.27)
c) Clustering algorithm:
- Input: Set P = {p1, p2,, pn}; Cluster threshold: θ1, heuristic threshold: θ2; Dmax: Cluster diameter;
Sim_cp: Results of combined similarity measurement function, apply the formula (2.27).
- Output: Set of clusters: Cset (ClusterID; context; weight of each context; pair of respective entities).
Program Clustering_algorithm
01. Cset = {}; iCount=0;
02. for each context pi ∈ P do
03. Dmax = 0; c* = NULL;
04. for each cluster cj ∈ Cset do
05. Sim_cp=Sim(pi,Centroid(cj))
06. if (Sim_cp > Dmax) then
07. Dmax = Sim_cp; c* ← cj;
08. end if
09. end for
10. if (Dmax > θ1) then
11. c*.append(pi)
12. else
13. Cset ∪= new cluster{c*}
14. end if
15. if (iCount > θ2) then
16. iCount++;
17. exit Current_Proc_Cluster_Alg();
18. end if
19. end for
20. Return Cset;
@CallMerge_Cset_from_OtherNodes()
2.2.4. Modules calculating the relational similarity between two pairs of entities
The module calculating the relational similarity between two pairs of entities that perform two tasks:
Filtering (searching) and ranking. As illustrated in 3.1, the input query q = (A, B), (C, ?), through the inverted
index, IRS executes the function Filter-Entities Fe to filter (search) out candidate sets having entity pairs (C, Di)
and the corresponding context, such that (C, Di) similar to (A, B). Then, it executes the function Rank-Entities Re
to rank the entities Di, Dj within the candidate set according to RelSim measure (Relational Similarity), finally -
which results in list of ranked {Di}.
Filter-Entities algorithm: Filter to find the candidate set containing the answer:
Input: Query q = (A, B)(C, ?)
Output: Candidate set S (includes Di entities and corresponding context);
Program Filter_Entities
01. S = {};
02. P(w) = EntPair_from_Cset.Context();
12
03. for each context pi ∈ P(w) do
04. W(p) = Context(pi).EntPairs();
05. If (W(p) contains (C:Di)) then S ∪= W(p);
06. end for
07. retufn S
After executing Filter-Entities, a subset of the entities Di and corresponding context are obtained. RelSim
only processes and calculates on the very small subset. In addition, RelSim uses the threshold α to eliminate
entities Di with low RelSim values.
With: Fe(q,D) = Fe({(A, B),(C,?)}, D):
𝐹𝑒(𝑞, 𝐷𝑖) = {
1, 𝑖𝑓𝑅𝑒𝑙𝑆𝑖𝑚((𝐴, 𝐵), (𝐶, 𝐷𝑖)) > α
0, 𝑒𝑙𝑠𝑒
(2.29)
Rank-Entities function: Rank-Entities Algorithm is responsible for calculating RelSim:
Input: Candidate set S and:
- Source entity pair (A, B), denote s; Candidate entities (C, Di), denote c;
- Contexts corresponding to s, c; The resulting cluster set: Cset;
- Known entities A, B, C corresponding cluster set containing A, B, C are identified;
- Threshold α (compare RelSim value); Threshold α is set during testing the program;
- Initialize the dot product (β); used-context set (γ);
Output: List of answers (ranked entity list) Di;
Denotations:
- P(s), P(c) given in formula (2.19), (2.20);
- f(s, pi), f(c, pi), ɸ(s), ɸ(c) given in (2.21), (2.22);
- γ: Variable (set of context) keep the considered contexts;
- q: Temporary/Intermediate variable (Context);
- Ω: Cluster;
Program Rank_Entities
01. for each context pi ∈ P(c) do
02. if (pi ∈ P(s)) then
03. β ← β + f(s, pi)·f(c, pi)
04. γ ← γ ∪ {p}
05. else
06. Ω ← cluster contains pi
07. max_co-occurs = 0;
08. q← NULL;
09. for each context pj ∈ (P(s)\P(c)\γ) do
10. if (pj ∈ Ω) & (f(s, pj) > max_co-occurs)
11. max_co-occurs ← f(s, pj);
12. q ← pj;
13. end if
14. end for
15. if (max_co-occurs > 0) then
16. β ← β + f(s, q)·f(c, pi)
17. γ ← γ ∪ {q}
18. end if
13
19. end if
20. end for
21. RelSim ← β/L2-norm(ɸ(s), ɸ(c))
22. if (RelSim ≥ α) then return RelSim
Algorithm interpretation: In case two pairs of source and target entities have the same semantic
relationship (sharing the same context, statement 1-2): pi ∊ P(s) ∩ P(c), calculate the dot product as a modified
version of standard Cosine similarity formula.
In the case of pi ∊ P(c) but pi ∉ P(s), the algorithm finds the context pj (or temporary variable, q, line 12),
where pi, pj belong to the same cluster. The loop body (from statements 10-13) chooses the context pj has largest
frequency of co-occurrence with the s. Under the Distribution Hypothesis, the more pairs of entities two contexts
pi, pj co-occur in, the higher Cosine similarity between the two vectors. As the cosine value is higher, pi, pj are
more similar. Therefore, the pair (C, Di) is more accurate and semantically consistent with the source entity pair
(A, B).
The sequence of statements from 15-18 calculate the dot product. Statements from 21-22 calculate the
RelSim value. From the set of RelSim value, whichever entities Di have RelSim higher will be ranked lower (in
the closer top, or higher rank). Finally, the result set Di is the answer list for the query that the end-user wants to
find.
2.3. Experiment Results - Evaluation
2.3.1. Dataset
The dataset is built from the empirical sample dataset, based on four entity subclasses named: PER; ORG;
LOC and TIME;
2.3.2. Test - Parameter adjustment
To evaluate the effectiveness of the
Rank_Entities clustering and ranking algorithm,
Chapter 2 changes the values θ1 and α, then
calculates the Precision, Recall, F-Score
measures corresponding to each value of α, θ1.
Figure 2.3 shows that at α = 0.5, θ1 = 0.4, the F-
Score score has the highest value.
Figure 2.3: F-Score value corresponding to each changed value of α, θ1.
Giải thuật Rank_Entities dòng 22 (if (RelSim ≥ α) return RelSim) cũng cho thấy, khi α nhỏ thì số lượng
ứng viên tăng, có thể có nhiễu, đồng thời thời gian xử lý real-time tốn chi phí thời gian, do hệ thống xử lý nhiều
truy vấn ứng viên. Ngược lại khi α lớn thì giá trị Recall nhỏ, kéo theo F-Score giảm đáng kể.
2.3.3. Evaluation with MRR (Mean Reciprocal Rank)
For the query Q, if the first correct answer rank in the query q ∈ Q is rq, then the MRR measurement of Q
is calculated:
MRR(Q) =
1
|𝑄|
∑
1
𝑟𝑞
𝑞∈𝑄 (2.33)
14
With 4 entity subclasses: PER; ORG; LOC and TIME; the method is based on the co-current frequency
(f) reaching the average value MRR ≈ 0.69; meanwhile, PMI-based method is 0.86. This shows that PMI helps
improve the accuracy of semantic similarity better than the co-current frequency of context-pair entities.
Figure 2.4: Compare PMI with f: frequency (co-current) based on MRR.
2.3.4. Experimental system
The dataset was downloaded from Viwiki (7877 files) and Vn-news (35440 files). The goal of
selecting source Viwiki and Vn-news because these datasets contain samples of named entities (Named Entity).
After reading, extracting file content, separating paragraphs and sentences (main-sentences, sub-
sentences), 1572616 sentences are obtained.
The general labels of NER (Named Entity Recognition) include: PER: Name of person; ORG: Name of
organization; LOC: Name of place; TIME: Time type; NUM: Number type; CUR: Currency; PCT: Percentage
type; MISC: Another entity type; O: Not an entity.
By using the algorithm for extracting context stored in the database, after performing the processing steps
and restriction conditions, Database remains with 404507 context sentences. From this set of context, the clustering
algorithm of semantic relations collects 124805 clusters.
Figure 2.5: IRS experiment with B-PER entity label
Đo evaluate the accuracy, experimentally performed 500 queries to test, the results showed an
accuracy of about 92%.
Table 2.3: Examples of experimental results with input q = {A, B, C} and output D
ID A B C D
15
.. German Angela Merkel Israel Benjamin Netanyahu
.. Harry Kane Tottenham Messi Barca
.. Hoàng Công Lương Hòa Bình Thiên Sơn RO
2.4. Conclusion
The ability to infer information/knowledge is not determined by similar inference is one of the natural
abilities of human. Chapter 2 presents an an Implicit Relational entity search model (IRS) that simulates the
above possibility. The IRS model searches for information/knowledge from an unfamiliar domain and does not
require keyword in advance, using a similar example (similarity relation) from a familiar domain. The main
contribution of Chapter 2: Build the entity search technique based on hidden semantic relation using clustering
method to improve search efficiency. At the same time, the thesis proposes the measure of combined similarity -
terms and distribution hypothesis; From the proposed measurement, and at the same time applying heuristic to
cluster algorithm with improving cluster quality.
CHAPTER 3: CONTEXT-AWARE QUERY SUGGESTION
3.1. Problem
In the field of sugessting the query, traditional approaches like session-based, document-click based, and
so on. Performing Query Logs to generate the suggestion. The approach to "Suggesting context-aware query by
session data mining and click-throught documents" (short call: "context-aware approach" by Huanhuan Cao et al
[9], [10]) is one new approach - this approach considers the queries immediately before the query just entered (the
current query) as a search context, in order to "capture" the user's search intent, in order to provide exact
suggestions worth more. Obviously, the preceding layer of query has a semantic relationship with the current
query. Next, do mining for queries that immediately follow the current query - like a list of suggestions. This
method makes use of the "knowledge" of the community, because the query layer immediately follow the current
query reflects the problems that users often ask after the current query.
The main contributions of chapter 3 include:
1) Apply context-aware techniques, build an vertical search engine that applies context-aware in its own
knowledge base domain (aviation data).
2) Propose to measure combinatorial similarity in the contextual query suggestion problem to improve
the quality of suggestion.
In addition, chapter 3 also has additional experimental contributions: i) Integrating Vietnamese speech
recognition and synthesis as an option into the search engine to create a voice-search system, with speech
interaction. ii) Apply the Concept-lattice structure to classify the returned result set.
3.2. Context-aware method
3.2.1. Definition - Terminology
Search session: Is a continuous sequence of queries. Query strings are represented in
chronological order. Each session corresponds to one user.
General session structure: {sessionID; queryText; queryTime; URL_clicked}
Context: specifies adjacent string before the current query. In a user's search session, context is
the query string immediately preceding the query entered.
Query-layer before qcurrent ↔ ngữ cảnh) .. qcurrent .. (Query-layer after qcurrent )
16
3.2.2. Architecture - Modeling
The ideology of Context-
aware based on two phases: online
and offline, generalized: During a
search session (online phase), the
context-aware waits the current query
and looks at the preceding query
string standing before the current
query as a context.. More precisely,
this process is interpreted to the
concept sequence - this concept
sequence expressed searching
intention of users .
Figure 3.4: Context-aware query suggestion model.
When a search context is obtained, the system performs a match against the built-in context set (phase
offline, the built-in context set is pre-processed on the query set in the past - Query Logs. About structural data
and storage, the built-in context set is stored on a suffix tree data structure). A maximum matching provides a list
of candidates, a list of issues that most users often ask about after the queries they already entered. After the
ranking step, the candidate list becomes a suggestion list.
3.2.4. Offline phase – Clustering algorithm
The idea of clustering algorithm: The algorithm scans all queries in Query Logs once, the clusters will be
generated during the scan. Each cluster is initially initialized with a query, and then expanded gradually by similar
queries. The expansion process stops when the cluster diameter exceeds the threshold Dmax. Because each cluster
is seen as a concept, so the cluster set is the concept set.
Input: Query Logs Q, threshold Dmax;
Output: Set of clusters: Cset;
program Context_Aware_Clustering_alg
// Khởi tạo mảng dim_array[d] = Ø, ∀d (d: document được click)
// Mảng dim_array chứa số chiều các vectors.
01. for each query qi ∈ Q do
02. θ = Ø;
03. for each nonZeroDimension d of 𝑞𝑖⃗⃗⃗ do
04. θ ∪= dim_array[d];
C = arg minC’∈C-Setdistance(qi, C’);
05. if (diameter(C∪{qi}) ≤ Dmax)
06. C ∪= qi; cập nhật lại đường kính và tâm cụm C;
07. else
C = new cluster({qi}); Cset ∪= C;
08. for each nonZeroDimension d of 𝑞𝑖⃗⃗⃗ do
09. if (C ∉ dim_array[d])
dim_array[d] ∪= C;
end for
17
10. return Cset;
3.3.6. Analyze pros and cons
Advantages:
- Context-aware issue is a novel approach. Performing query suggestions, almost all traditional
approaches are often taken the classical queries which existed in Query Logs for proposals. This kind
of queries can only proposed similar or related queries to the current query, rather than giving trends
about which communities often asked after the current query. Likewise, there is no approach which
places the previous queries in preceding of the current query into a search context - as a seamless
expression for the intentions of the users. The context-aware technique, above all, is the idea suggested
by the problems that users often asked after the current query, which is a unique, efficient, and a “smart
focusing” on the field of query suggestions.
Disadvantages:
- When the user enters the first query or some of the queries are new (new compared to past queries) or
not even new - the meaning does not present in the frequent concept string (for example, in data-set,
with 2 conceptual strings c2c3 and c1c2c3, the algorithm for determining the frequently equence is
c2c3, in this case - the user enters c1). Context-aware approach is not generated the suggestion even
though c1 was in the past (already in QLogs).
- Each cluster (each concept) consists of a group of similar queries. The similarity measure is only based
on URL click without basing on similarity of term, which can significantly affect the quality of
clustering technique. includes a group of similar queries. Similarity measure is only based on URL click
without basing on similarity of term.
- Constraint each query belongs to a cluster (concept): This point of view is not reasonable and unnatural
for a polymorphic query like "tiger" or "gladitor", or many other polymorphic words in Vietnamese
langguage, etc.
- Besides, just only query suggestion without considering URL recommendation or document
suggestion. Likewise, “click-through” orientation but does not use clicked Urls information in search
context (when searching on the suffix tree, the input of Concept sequence consists of queries only).
- On the bipartite graph, on the vertex set Q, the vectors are quite sparse (low dimensions), the set of
click URLs also encounter sparse data (URL click sparse), when the vectors are sparse, the quality of
clustering will affected.
- In clustering algorithm, when Query Logs is large, or the number of dimensions of each vector is large,
the dim_array [d] array will be very large, requiring a large amount of memory to be executed.
In fact, in any one search session, the user enters one or more queries, likewise, the user may n
Các file đính kèm theo tài liệu này:
- tom_tat_luan_an_some_searching_techniques_for_entities_based.pdf