LIST OF SYMBOLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
CHAPTER 1 PRELIMINARIES . . . . . . . . . . . . . . . . . . . . . . . 4
1.1 Basic Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Deterministic Finite Automata . . . . . . . . . . . . . . . . . . . . 6
1.1.4 The Galois Field GF(pm) . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Digital Image Steganography . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Exact Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Longest Common Subsequence . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Searchable Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
CHAPTER 2 DIGITAL IMAGE STEGANOGRAPHY BASED ON THE
GALOIS FIELD USING GRAPH THEORY AND AUTOMATA . . . . . 16
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 The Digital Image Steganography Problem . . . . . . . . . . . . . . . . . . 18
2.3 A New Digital Image Steganography Approach . . . . . . . . . . . . . . . . 19
2.3.1 Mathematical Basis based on The Galois Field . . . . . . . . . . . . 19
2.3.2 Digital Image Steganography Based on The Galois Field GF(pm)
Using Graph Theory and Automata . . . . . . . . . . . . . . . . . . 21
2.4 The Near Optimal and Optimal Data Hiding Schemes for Gray and Palette
Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
CHAPTER 3 AN AUTOMATA APPROACH TO EXACT PATTERN
MATCHING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 The New Algorithm - The MRc Algorithm . . . . . . . . . . . . . . . . . . 41
3.3 Analysis of The MRc Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
CHAPTER 4 AUTOMATA TECHNIQUE FOR THE LONGEST
COMMON SUBSEQUENCE PROBLEM . . . . . . . . . . . . . . . . . . 56
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
98 trang |
Chia sẻ: honganh20 | Lượt xem: 405 | Lượt tải: 0
Bạn đang xem trước 20 trang tài liệu Research on development of methods of graph theory and automata in steganography and searchable encryption, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
scheme (2, 9, 8) and the optimal
data hiding scheme (1, 5, 4) for gray and palette images with qcolour = 3. Security analyses
indicated that the application of these schemes to the process of hiding a finite sequence
of secret data in an image can be prevented from brute-force attacks. Since these are blind
analyses, the problem of steganalysis attacks such as visual or chi-square attacks, etc is
being studied. In the near future, author will complete this work.
In comparison with Chang et al.’s approach [18], the embedding and extracting time of
the proposed approach are about 3.38 and 4.24 times faster than that of theirs, respectively.
Through experimental results, we see that the efficiency (ER = 0.86 bpp, the average
value of PSNR is 55.84 dB) of the near optimal data hiding scheme (2, 9, 8) for gray images
with qcolour = 3 is indeed better than the efficiency of the HCIH scheme (ER = 0.75 bpp,
the average value of PSNR is 46.77 dB) in [104]. For palette images, values of ER can
be chosen appropriately to achieve acceptable quality of the stego images by applying the
near optimal data hiding scheme (2, 9, 8) with qcolour = 3 and the optimal data hiding
scheme (1, 2n − 1, n) with qcolour = 1. However, the values of ER is still much lower than
ERmax. So, the problem of improving the quality of stego images for palette images will
be discussed in future work.
An interesting question arises as to whether there exists the optimal data hiding scheme
(2, 8, 8) for 8-bit gray image with qcolour = 3.
To increase the data security of the proposed data hiding schemes, Chapter 5 will study
on the problem of combining cryptography and steganography for SE.
38
CHAPTER 3
AN AUTOMATA APPROACH TO EXACT
PATTERN MATCHING
This chapter proposes a flexible approach using automata to design an effective
algorithm for exact pattern matching in practice, and compares it with some of the most
efficient algorithms, such as AOSO, EBOM, FJS, FSBNDM, HASHq, LBNDM, SA,
BMH-SBNDM, SBNDMq, TVSBS. These results are based on the concept of the degree
of appearance introduced by P. T. Huy et al. in 2002, recalled in Section 1.3 of Chapter
1. Theoretical analyses and experimental results show that in practice the proposed
algorithm is faster than the above mentioned algorithms in most of the given cases of
patterns and alphabets.
The results of Chapter 3 have been published in [T2].
3.1 Introduction
Pattern matching is a classic problem in computer science and one of the most cited
problems in string processing algorithms. The applications of pattern matching algorithms
are used daily to access information, such as in the search engine Google, database queries,
search and replace in text editing systems, etc.
At present, there are two research directions for the problem that are exact and
approximate pattern matching. Moreover, based on the number of patterns found by the
algorithm, the pattern matching problem can be divided into single and multiple pattern
matching.
Along with the development of Internet-based applications, especially when cloud
computing becomes the promising trend in the information technology area, we face a
challenge that is the cloud security problem considered in Section 1.5. In SE solutions to
this problem, SE for exact single pattern matching (called exact pattern matching
popularly) is a new class. The techniques for this class have been introduced based on
approaches to [41, 89] or algorithms for [26] exact pattern matching.
This chapter only focuses on addressing the exact pattern matching problem, recalled in
Section 1.3 of Chapter 1. Research on applying the proposition of the chapter for solving
this problem to SE will be introduced in Chapter 5.
Since 1977, with the two famous publications of the Boyer-Moore [15] and
Knuth-Morris-Pratt [64] exact pattern matching algorithms, there have been about
hundreds, if not thousands, of papers published that dealt with the exact pattern
matching problem [34].
In 2013, S. Faro and T. Lecroq reviewed the 85 exact pattern matching algorithms
published during the 2000 - 2010 period and presented experimental results to compare
them. According to this evaluation, they listed the 10 most efficient sequential algorithms in
practice, as well as their best results. They were AOSO, EBOM, FJS, FSBNDM, HASHq,
LBNDM, SA, BMH-SBNDM, SBNDMq, TVSBS (for short, called 10 algorithms) [33].
39
Nearly all of these algorithms are concerned with how to slide the window. They
scan the text with the help of a window, actually the window is a substring of the text
whose size equals the length of the pattern. For each window of x, they try to find
an occurrence of p in x by comparing the letters of the window with the letters of p
(FJS [15, 34, 64, 90], HASHq for q = 3; 5; 8 [33, 68, 98], TVSBS [10, 84, 91]) or moving
the state of an automaton when one letter of window is scanned at a time (AOSO [35],
EBOM [3, 16, 32], FSBNDM [3, 16, 32, 33], LBNDM [14, 25, 30, 43, 45, 74, 79], SA [7],
BMH-SBNDM [14, 25, 33, 43, 45, 74, 79], SBNDMq for q = 2; 4; 6; 8 [30, 33]) in a certain
way. After each occurrence of p, or a mismatch occurs at any position being scanned in
window, they shift the window to the right by a certain number of positions depending
on the approach used by the algorithms. This mechanism is called the sliding window
mechanism [33] and repeated until the right end of the window does not belong to x. At
the beginning of the matching, the left ends of the window and the text are aligned. The
sliding window mechanism is shown in the following figure.
Window
Window
Window
Độ mờ = 4 (mức xuất hiện p tại vị trí đang duyệt trên xâu đích x)
..... c d b c b a d b c c.....
b c b a c
Xâu đích x:
Xâu mẫu p:
(khúc đầu của p)
The degree of fuzziness of x in y at the position being scanned is equal to 4
At the beginning of the matching
Shift the window to the right
At the end of the matching
Figure 3.1. Sliding window mechanism
According to the author’s knowledge, the 10 algorithms have some common features as
follows. They first do not take advantage of the relationship between the size of the pattern
and the alphabet, then they are not flexible to shift the window regularly to the right by
the most possible maximum number of positions with high probability (except for HASHq,
SBNDMq). Secondly, the appearance of a part of the pattern is not immediately reflected
or updated at a position being scanned in the text, hence when the windows overlap, the
letters of the text could still be scanned more than once (except for SA, but this algorithm
does not use the sliding window mechanism). This chapter’s work will be to handle these
two situations.
The worst case lower bound of the pattern matching problem is O(n). The first
algorithm to achieve the bound was proposed by Morris and Pratt in 1970, afterwards
improved by Knuth et al. in 1977 [33, 64].
As we know that theoretical analyses in the worst case are not sufficient to predict actual
running time accurately for exact pattern matching algorithms [10, 91]. For example, two of
the best algorithms listed in [33], EBOM [32] and TVSBS [91], have O(mn) time complexity
in the worst case, where m and n are lengths of the pattern and the text. So, for the exact
40
pattern matching problem, authors mainly mention efficiency of algorithms in practice.
Theoretically, it is not possible to make the time complexity less than the linear level,
but in fact, the running time might decrease by reducing the number of letters of the text
accessed. To solve this problem, we need to avoid scanning an arbitrary letter of the text
repeatedly and shift the window regularly to the right by a possible maximum number of
letters. Based on the concept of the degree of appearance [52], this chapter proposes a
flexible approach to design an algorithm dealing with two above analyses successfully in
practice.
In the new approach of the chapter, an automaton corresponding to a given pattern to
accept the pattern is constructed, where a state of the automaton is a degree of fuzziness.
This tool will reflect and update the appearance of the longest prefix of the pattern in the
text at any position. If the longest prefix of the pattern is the pattern, then a occurrence
of the pattern in the text is found. A new window is always scanned from left to right.
However, the window scanning process is only started when the right end substring of
length c (for short, called c block) of the window belongs to the pattern, where c is a
given positive integer, 1 ≤ c ≤ m. Further, the window scanning process will be stopped
immediately if the current state of the automaton holds a given condition. These techniques
make the window be shifted more regularly to the right by the most possible maximum
number of positions with high probability. The first position scanned in each new window
is determined based on the last occurrence of the c block in the pattern. Depending on this
position, the initial state of the automaton is also set up newly. These help the proposed
algorithm to scan each letter of the text at most once and reflect exactly the appearance
of a part of the pattern at any position in the text.
The total number of all letters of y accessed by the proposed algorithm is n + 2c for
1 ≤ c ≤ m in the worst case and is cnm−c+1 for 1 ≤ c ≤ m in the best case. Experimental
results show that in practice the proposed algorithm is faster than the 10 algorithms in
most of the cases of the given patterns and alphabets.
The rest of the chapter is organized as follows. Section 3.2 proposes a new approach
using an automaton to design an exact pattern matching algorithm (the MRc algorithm).
Some theoretical analyses of the MRc algorithm are discussed in Section 3.3. Experimental
results comparing the MRc algorithm with the 10 algorithms [33] are presented in Section
3.4. Finally, Section 3.5 draws conclusions from the approach and experimental results.
3.2 The New Algorithm - The MRc Algorithm
From the degree of appearance recalled in Section 1.3 of Chapter 1, this section shows a
way to build an automaton corresponding to a given pattern to accept the pattern (Theorem
3.2), presents a new approach by using the automaton to design an exact pattern matching
algorithm (the MRc algorithm).
According to the design of the automaton, each state of the automaton is a degree of
appearance. So, to determine the degree of appearance, this section gives the following
concept.
Definition 3.1. Let p be a pattern of length m. Then Nextp of p is a function such that
Nextp : {1, . . . ,m} → {0, . . . ,m− 1} defined by
Nextp(l) = max{|u||u is both a proper prefix and a suffix of p[1..l]}
for l ∈ {1, . . . ,m}.
41
Example 3.1. Given a pattern p = acbac. Then Nextp of p is determined as follows.
l 1 2 3 4 5
Nextp 0 0 0 1 2
The following algorithm is used to compute the Nextp of the given pattern p.
Algorithm 1:
Input: An arbitrary pattern p with length m.
Output: The Nextp of p.
Nextp[1] = 0; (3.1)
For l = 2 to m Do
{
k = Nextp[l − 1]; (3.2)
While (k > 0 and p[k + 1] 6= p[l]) k = Nextp[k]; (3.3)
If (k == 0 and p[k + 1] == p[l]) Nextp[l] = 1; (3.4)
If (k == 0 and p[k + 1] 6= p[l]) Nextp[l] = 0; (3.5)
If (k 6= 0) Nextp[l] = k + 1; (3.6)
}
The correctness of the Algorithm 1 is confirmed by the following theorem.
Theorem 3.1. For any given pattern p of length m, Algorithm 1 correctly computes the
Nextp of x.
Proof. By an application of Definition 3.1, Nextp[1] = 0, then (3.1) is true. Consider
l = 2, then k = Nextp[l − 1] = Nextp[1] = 0. Thus, clearly, either (3.4) or (3.5) correctly
computes Nextp[2] by Definition 3.1. Assume that Algorithm 1 correctly computes Nextp[l]
for all 2 ≤ l < m. We must prove that it is also true for l + 1. By (3.2), k = Nextp[l],
at the end of (3.3), then either k = 0 or there exists a maximum number k such that
p[1..k + 1] = p[l − k..l]. In the case k = 0 the proof is analogous to the case l = 2. In the
contrary case, by Definition 3.1, therefore, (3.6) correctly computes the Nextp of p.
Base the function Nextp of p, the following lemma offers a way to compute the degree
of appearance of p at any position when the degree of appearance of p at the adjacent
position ahead in the text x is given.
Lemma 3.1. Let p be a pattern, x be a text over the alphabet Σ and suppose that the
degree of appearance of p in x at the position i equals l, 0 ≤ l ≤ |p|. Then the degree of
appearance l′ at the position i + 1 in x is computed by the formula l′ = Appearancep(l, a),
where a = xi+1 and the function Appearancep corresponding to p is defined by
Appearancep(l, a) =
0 l = 0, a 6= p[1]; or a /∈ p, (3.7a)
1 l = 0, a = p[1], (3.7b)
l + 1 0 < l < |p|, a = p[l + 1], (3.7c)
Appearancep(Nextp(l), a) 0 < l < |p|, a 6= p[l + 1]; or l = |p|. (3.7d)
42
Proof. From Definition 1.6, (3.7a), (3.7b) and (3.7c) are true. (3.7d) has two cases.
Consider the first case of (3.7d) 0 < l < |p| and a 6= p[l + 1], then l = Nextp(l), if
a 6= p[l+1] then the statement l = Nextp(l) is repeated until one of the possibilities (3.7a),
(3.7b) and (3.7c) occurs, as the proof above leads to this case is true. In the second case
of (3.7d) l = |p|, since Nextp(l) < l, then the statement l = Nextp(l) is done the once and
it follows 0 < l < |p|. Then at this time, it means that the second case returns to the first
case. So, this case is true.
Next, the section constructs an automaton corresponding to a given pattern to accept
the pattern in the following theorem by the Lemma 3.1.
Theorem 3.2. Let p be a pattern of length m and Mp = (Σ, Qp, q0, δp, Fp) corresponding
to p be an automaton over the same alphabet Σ, where
• The set of states Qp = {0, 1, . . . ,m},
• The initial state q0 = 0,
• The set of final states Fp = {m},
• The transition function δp : Qp × Σ → Qp such that δp(q, a) = Appearancep(q, a),
where the function Appearancep corresponding to p as given in Lemma 3.1,
• To accept an input string, the transition function δp is extended as follows:
δp : Qp × Σ∗ → Qp
such that ∀q ∈ Qp, ∀s ∈ Σ∗, ∀a ∈ Σ, δp(q, as) = δp(δp(q, a), s) and δp(q, ) = q.
Then Mp accepts the pattern p.
Proof. By Lemma 3.1, clearly Appearancep(i, p[i+ 1]) = i+ 1 for all 0 ≤ i < m. Based on
the above construction of the Mp, for any input pattern p,
δp(q0, p) = δp(δp(0, p[1]), p[2..m]) = δp(Appearancep(0, p[1]), p[2..m]) =
= δp(Appearancep(1, p[2]), p[3..m]) = . . . = δp(Appearancep(m− 1, p[m]), ) =
= δp(m, ) = m ∈ F.
Thus Mp accepts p.
Example 3.2. Consider a pattern p = abcba and the automaton Mp given as in Theorem
3.2, denote an arbitrary letter, which is not in p, by the symbol #. Then transition function
δp is represented by the following table.
δp a b c #
0 1 0 0 0
1 1 2 0 0
2 1 0 3 0
3 1 4 0 0
4 5 0 0 0
5 1 2 0 0
43
To make the window slide more regularly while the text is being scanned, this section
uses the breaking points defined as follows.
Definition 3.2. Let p be a pattern and x be a text over the alphabet Σ. Consider a
position i in x for 1 ≤ i < |x| and the degree of appearance of p in x at positions i and
i+ 1 are q and q′, respectively. Then the position i is called a breaking point if one of the
two following conditions are satisfied.
(a) q = 0,
(b) q′ ≤ q < |p|.
Notice that if the condition (a) in Definition 3.2 only satisfies, then breaking points are
called type a breaking points.
One of problems that needs to solve in the new approach is to avoid scanning each letter
of the text repeatedly. So, this section gives some terminologies and a new concept below.
Given a positive integer c, a string of length c is called a c block. A c block is called
(resp. not) to be in p, denoted by c block ∈ p (resp. c block /∈ p), if the c block is (resp.
not) a substring of p. For a given positive integer c, 1 ≤ c ≤ m and c ≤ i ≤ m, the substring
p[i− c+1..i] is called a c block of p at position i, denoted by c blockip. In particular, c = 1,
then c block is only a letter.
Definition 3.3. Let p be a pattern and z be a c block of p, where c is a positive integer for
1 ≤ c ≤ |p|. Let i be some position in p for c ≤ i ≤ |p|. Then i is called the last position of
appearance of z in p, denoted by Posp(z), if z = c block
i
p and ∀j > i, j ≤ |x|, z 6= c blockjp.
The breaking point occurs, do the next test jump Set q = 0
Window
A backtracking position
b,c,#
c,#
a
b,#
a
c,#
a
b
a
b,c,#
c,#
a b c b a 0 1 2 5
0
3
0
4
0
A c_block p
x
A c_block p, slide the window and do the next test jump
Figure 3.2. The basic idea of the proposed approach
Based on the automaton Mp and the two concepts of the breaking point and Posp, the
basic idea of the proposed approach to exact pattern matching is presented as follows. The
proposed approach is shown in Figure 3.2.
1. Use the automaton Mp to reflect and update the degree of appearance of p in x at
any position. If the degree of appearance of p at some position equals |p|, then mark an
occurrence of p in x.
2. Take advantage of the relationship between the size of the pattern p and the alphabet
Σ flexibly, and use the breaking point. These make the approach flexible and the window
be shifted more regularly to the right by the most possible maximum number of positions
with high probability.
3. A new window is scanned from left to right. The window scanning process is only
started when the right end c block of the window belongs to p (do the test jump). The
44
first position scanned in the window (called the backtracking position) depends on the
Posp value of the c block and the initial state of Mp will be set up again. These make
the proposed approach scan each letter of the text at most once and reflect exactly the
appearance of a part of the pattern at any position in the text. The window scanning
process will be stopped if the breaking point occurs. At this moment, one next window
is determined (slide the window) and immediately do the next test jump. Repeat this
mechanism until reach out of x. This mechanism helps the proposed approach to design a
pattern matching algorithm effectively.
From the above approach, this section designs a new exact pattern matching algorithm.
The pseudo-code for the algorithm (called the MRc algorithm) is shown as follows.
Algorithm 2 (the MRc algorithm):
Input: Two arbitrary pattern p and text x.
Output: All occurrences of the pattern p in x.
q = 0; (3.8)
jump = |p|; (3.9)
While (jump ≤ |x|)
{
If (c blockjumpx ∈ p) (3.10)
{
bt = Posp(c block
jump
x ); (3.11)
If (q == 0) i = jump− bt+ 1; (3.12)
Else If (bt ≤ |p| − q′)
{
q = 0; (3.13)
i = jump− bt+ 1; (3.14)
}
Else
{
q = q′; (3.15)
i = jump− |p|+ q′ + 1; (3.16)
}
q′ = δp(q, x[i]); (3.17)
Do
{
q = q′;
If (q == |p|) Mark an occurrence of p at i− |p|+ 1 in x; (3.18)
i+ +; (3.19)
If (i ≤ |x| and q 6= 0) q′ = δp(q′, x[i]); (3.20)
Else Break;
} While (q 6= 0 and q′ > q or q == |p|); (3.21)
45
If (q == 0) jump = i− 1; (3.22)
Else jump = i− q′; (3.23)
}
Else
{
jump = jump− c+ 1; (3.24)
If (q 6= 0) q = 0; (3.25)
}
jump = jump + |p|; (3.26)
}
The performing steps of the MRc for c = 1 is illustrated in the following example.
Example 3.3. Consider a pattern p = abcba and the automaton Mp given by Theorem
3.2, denote an arbitrary letter, which is not in p, by the symbol #. Then the transition
diagram of Mp is shown in Figure 3.3. Here, the MR1 is used to find p in x. It is easy to
compute Posp values for the given p above, Posp(a) = 5, Posp(b) = 4, Posp(c) = 3. Let
a text x = ababcbadabeegatkau. Then the whole process of the pattern matching of the
MR1 is presented in Table 3.1 as follows.
The breaking point occurs, do the next test jump Set q = 0
Window
A backtracking position
b,c,#
c,#
a
b,#
a
c,#
a
b
a
b,c,#
c,#
a b c b a 0 1 2 5
0
3
0
4
0
A c_block p
x
A c_block p, slide the window and do the next test jump
Figure 3.3. The transition diagram of the automaton Mp, p = abcba
Table 3.1. The performing steps of the MR1 algorithm
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
x[i] a b a b c b a d a b e e g a t k a u
Jump
Test
Back
jump=5
c ∈ p
i = 3
jump=13
g /∈ p
jump=18
u /∈ p
q=0 Ignore 1 2 3 4 5 0 Ignore
The correctness of the MRc algorithm is guaranteed by the following theorem.
46
Theorem 3.3. For any given pattern p and text x, the MRc algorithm finds all occurrences
of the pattern p in x.
Proof. Assume that the process of the scanning x starts at the position 1 and i is the
position being scanned in x. Algorithm 2 is only correct if the following two cases are true:
(1) The state q of automaton Mp is the degree of appearance of p in x at position i.
(2) The process of the scanning y does not ignore any occurrences of p.
Suppose that if (1) and (2) are true, then when q = m at the position i, this means p
occurs at the position i − m + 1 in x by Definition 1.6, at this time the occurrence is
marked immediately by (3.18). So all occurrences of p in x are not never ignored.
Prove (1): The process of the scanning x only occurs if c blockjumpx ∈ p. Starting with i
determined by (3.11), (3.12), (3.14) and (3.16), the state of the automaton Mp is always
initialized newly before scanning x by (3.8), (3.13), (3.15) and (3.25), after each time x
is scanned by (3.17) and (3.20), i increases by one unit by (3.19) when the conditions in
(3.20) and (3.21) hold. Since δp is defined as in Theorem 3.2, this leads to q is always the
degree of appearance of p in x at position i. So (1) is true.
Prove (2): Note that jump is a position in x to test whether or not c blockjumpx is p. If
c blockjumpx /∈ p in (3.10), then the substring x[jump −m + 1..jump] of length m does not
contain p, but c blockjump+1x can belong to p. Thus x[jump −m + 1..jump-c+1] must not
be scanned by (3.9), (3.24) and (3.26), it means that the window is shifted to the right
by m − c + 1 letters and jump is always the left end of this new window. So (2) is true.
Conversely, if c blockjumpx ∈ p in (3.10), then the next process of the scanning x starts
with the new current position i determined by (3.11), (3.12), (3.14) and (3.16). By the
Definitions 1.6, 3.3 and the way to determine i, all occurrences of p are not ignored. Thus
(2) is true and the proof is complete.
3.3 Analysis of The MRc Algorithm
This section considers some theoretical results which analyze the MRc algorithm
(Propositions 3.1, 3.2 and 3.3, Theorem 3.4). Then the section provides an explanation of
why the MRc algorithm has advantage than the others listed in [33].
Propostion 3.1. Let p be a pattern of length m and x be a text of length n over the
alphabet Σ. Let c be a positive integer constant such that 1 ≤ c ≤ m. Then the MRc
algorithm requires cnm−c+1 letters of x accessed in the best case.
Proof. In the best case all letters of x are not matched in whole process of the pattern
matching, x is only accessed by (3.10) in the MRc algorithm. Let T (n) be the number of
all letters of x accessed by the MRc algorithm. Since each time (3.10) is performed and
c blockjumpx /∈ p, the window is shifted by m−c+1 positions and c letters of x are accessed.
Thus T (n) = cnm−c+1 .
Propostion 3.2. Let p be a pattern of length m and x be a text of length n over the
alphabet Σ. Let c be a positive integer constant such that 1 ≤ c ≤ m. Then MRc algorithm
requires n+ 2c letters of x accessed in the worst case.
Proof. Similarly, as in proof of Proposition 3.1, T (n) is the number of all letters of x
accessed by the MRc algorithm. The worst case occurs when all letters are matched in
47
whole process of the pattern matching. By (3.12), (3.17), (3.20) and (3.21), x is scanned
once for every letter, this implies that n letters are accessed. Clearly, the two statements
(3.10) and (3.11) in the MRc algorithm are only performed once in the whole process of
the matching, then there are 2c letters of x accessed by them. Thus T (n) = n+ 2c.
Denote the probability of an arbitrary event by P .
Propostion 3.3. Let p be a pattern of length m over the alphabet Σ. If |Σ| ≥ 4 and
8 ≤ m ≤ 2048, then there exists c, 1 ≤ c ≤ 8 such that for an arbitrary c block z over the
alphabet Σ, P (z ∈ x) ≤ 2−5 with a uniform distribution over the alphabet Σ.
Proof. Set A = P (z ∈ p). Let d be the number of different c blocks in p. Then for a
uniform distribution over Σ, clearly we have
A = P (z ∈ p) = d|Σ|c ≤
m− c+ 1
|Σ|c .
Consider 1 ≤ c, then
A ≤ m|Σ|c .
To have A ≤ 2−5, we let
m
|Σ|c ≤ 2
−5,
then
2−5|Σ|c ≥ m. (3.27)
To (3.27) holds with |Σ| ≥ 4 and 8 ≤ m ≤ 2048, we choose c, 1 ≤ c ≤ 8 such that
22c−5 ≥ m. We let 22c−5 ≥ 2048, then we can choose c = 8.
T
Các file đính kèm theo tài liệu này:
- research_on_development_of_methods_of_graph_theory_and_autom.pdf