Research on development of methods of graph theory and automata in steganography and searchable encryption

LIST OF SYMBOLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

CHAPTER 1 PRELIMINARIES . . . . . . . . . . . . . . . . . . . . . . . 4

1.1 Basic Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.1 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.2 Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.3 Deterministic Finite Automata . . . . . . . . . . . . . . . . . . . . 6

1.1.4 The Galois Field GF(pm) . . . . . . . . . . . . . . . . . . . . . . . 7

1.2 Digital Image Steganography . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.3 Exact Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.4 Longest Common Subsequence . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.5 Searchable Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

CHAPTER 2 DIGITAL IMAGE STEGANOGRAPHY BASED ON THE

GALOIS FIELD USING GRAPH THEORY AND AUTOMATA . . . . . 16

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.2 The Digital Image Steganography Problem . . . . . . . . . . . . . . . . . . 18

2.3 A New Digital Image Steganography Approach . . . . . . . . . . . . . . . . 19

2.3.1 Mathematical Basis based on The Galois Field . . . . . . . . . . . . 19

2.3.2 Digital Image Steganography Based on The Galois Field GF(pm)

Using Graph Theory and Automata . . . . . . . . . . . . . . . . . . 21

2.4 The Near Optimal and Optimal Data Hiding Schemes for Gray and Palette

Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

CHAPTER 3 AN AUTOMATA APPROACH TO EXACT PATTERN

MATCHING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.2 The New Algorithm - The MRc Algorithm . . . . . . . . . . . . . . . . . . 41

3.3 Analysis of The MRc Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 47

3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

CHAPTER 4 AUTOMATA TECHNIQUE FOR THE LONGEST

COMMON SUBSEQUENCE PROBLEM . . . . . . . . . . . . . . . . . . 56

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

98 trang | Chia sẻ: honganh20 | Lượt xem: 514 | Lượt tải: 0

Bạn đang xem trước 20 trang tài liệu Research on development of methods of graph theory and automata in steganography and searchable encryption, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

scheme (2, 9, 8) and the optimal data hiding scheme (1, 5, 4) for gray and palette images with qcolour = 3. Security analyses indicated that the application of these schemes to the process of hiding a finite sequence of secret data in an image can be prevented from brute-force attacks. Since these are blind analyses, the problem of steganalysis attacks such as visual or chi-square attacks, etc is being studied. In the near future, author will complete this work. In comparison with Chang et al.’s approach [18], the embedding and extracting time of the proposed approach are about 3.38 and 4.24 times faster than that of theirs, respectively. Through experimental results, we see that the efficiency (ER = 0.86 bpp, the average value of PSNR is 55.84 dB) of the near optimal data hiding scheme (2, 9, 8) for gray images with qcolour = 3 is indeed better than the efficiency of the HCIH scheme (ER = 0.75 bpp, the average value of PSNR is 46.77 dB) in [104]. For palette images, values of ER can be chosen appropriately to achieve acceptable quality of the stego images by applying the near optimal data hiding scheme (2, 9, 8) with qcolour = 3 and the optimal data hiding scheme (1, 2n − 1, n) with qcolour = 1. However, the values of ER is still much lower than ERmax. So, the problem of improving the quality of stego images for palette images will be discussed in future work. An interesting question arises as to whether there exists the optimal data hiding scheme (2, 8, 8) for 8-bit gray image with qcolour = 3. To increase the data security of the proposed data hiding schemes, Chapter 5 will study on the problem of combining cryptography and steganography for SE. 38 CHAPTER 3 AN AUTOMATA APPROACH TO EXACT PATTERN MATCHING This chapter proposes a flexible approach using automata to design an effective algorithm for exact pattern matching in practice, and compares it with some of the most efficient algorithms, such as AOSO, EBOM, FJS, FSBNDM, HASHq, LBNDM, SA, BMH-SBNDM, SBNDMq, TVSBS. These results are based on the concept of the degree of appearance introduced by P. T. Huy et al. in 2002, recalled in Section 1.3 of Chapter 1. Theoretical analyses and experimental results show that in practice the proposed algorithm is faster than the above mentioned algorithms in most of the given cases of patterns and alphabets. The results of Chapter 3 have been published in [T2]. 3.1 Introduction Pattern matching is a classic problem in computer science and one of the most cited problems in string processing algorithms. The applications of pattern matching algorithms are used daily to access information, such as in the search engine Google, database queries, search and replace in text editing systems, etc. At present, there are two research directions for the problem that are exact and approximate pattern matching. Moreover, based on the number of patterns found by the algorithm, the pattern matching problem can be divided into single and multiple pattern matching. Along with the development of Internet-based applications, especially when cloud computing becomes the promising trend in the information technology area, we face a challenge that is the cloud security problem considered in Section 1.5. In SE solutions to this problem, SE for exact single pattern matching (called exact pattern matching popularly) is a new class. The techniques for this class have been introduced based on approaches to [41, 89] or algorithms for [26] exact pattern matching. This chapter only focuses on addressing the exact pattern matching problem, recalled in Section 1.3 of Chapter 1. Research on applying the proposition of the chapter for solving this problem to SE will be introduced in Chapter 5. Since 1977, with the two famous publications of the Boyer-Moore [15] and Knuth-Morris-Pratt [64] exact pattern matching algorithms, there have been about hundreds, if not thousands, of papers published that dealt with the exact pattern matching problem [34]. In 2013, S. Faro and T. Lecroq reviewed the 85 exact pattern matching algorithms published during the 2000 - 2010 period and presented experimental results to compare them. According to this evaluation, they listed the 10 most efficient sequential algorithms in practice, as well as their best results. They were AOSO, EBOM, FJS, FSBNDM, HASHq, LBNDM, SA, BMH-SBNDM, SBNDMq, TVSBS (for short, called 10 algorithms) [33]. 39 Nearly all of these algorithms are concerned with how to slide the window. They scan the text with the help of a window, actually the window is a substring of the text whose size equals the length of the pattern. For each window of x, they try to find an occurrence of p in x by comparing the letters of the window with the letters of p (FJS [15, 34, 64, 90], HASHq for q = 3; 5; 8 [33, 68, 98], TVSBS [10, 84, 91]) or moving the state of an automaton when one letter of window is scanned at a time (AOSO [35], EBOM [3, 16, 32], FSBNDM [3, 16, 32, 33], LBNDM [14, 25, 30, 43, 45, 74, 79], SA [7], BMH-SBNDM [14, 25, 33, 43, 45, 74, 79], SBNDMq for q = 2; 4; 6; 8 [30, 33]) in a certain way. After each occurrence of p, or a mismatch occurs at any position being scanned in window, they shift the window to the right by a certain number of positions depending on the approach used by the algorithms. This mechanism is called the sliding window mechanism [33] and repeated until the right end of the window does not belong to x. At the beginning of the matching, the left ends of the window and the text are aligned. The sliding window mechanism is shown in the following figure. Window Window Window Độ mờ = 4 (mức xuất hiện p tại vị trí đang duyệt trên xâu đích x) ..... c d b c b a d b c c..... b c b a c Xâu đích x: Xâu mẫu p: (khúc đầu của p) The degree of fuzziness of x in y at the position being scanned is equal to 4 At the beginning of the matching Shift the window to the right At the end of the matching Figure 3.1. Sliding window mechanism According to the author’s knowledge, the 10 algorithms have some common features as follows. They first do not take advantage of the relationship between the size of the pattern and the alphabet, then they are not flexible to shift the window regularly to the right by the most possible maximum number of positions with high probability (except for HASHq, SBNDMq). Secondly, the appearance of a part of the pattern is not immediately reflected or updated at a position being scanned in the text, hence when the windows overlap, the letters of the text could still be scanned more than once (except for SA, but this algorithm does not use the sliding window mechanism). This chapter’s work will be to handle these two situations. The worst case lower bound of the pattern matching problem is O(n). The first algorithm to achieve the bound was proposed by Morris and Pratt in 1970, afterwards improved by Knuth et al. in 1977 [33, 64]. As we know that theoretical analyses in the worst case are not sufficient to predict actual running time accurately for exact pattern matching algorithms [10, 91]. For example, two of the best algorithms listed in [33], EBOM [32] and TVSBS [91], have O(mn) time complexity in the worst case, where m and n are lengths of the pattern and the text. So, for the exact 40 pattern matching problem, authors mainly mention efficiency of algorithms in practice. Theoretically, it is not possible to make the time complexity less than the linear level, but in fact, the running time might decrease by reducing the number of letters of the text accessed. To solve this problem, we need to avoid scanning an arbitrary letter of the text repeatedly and shift the window regularly to the right by a possible maximum number of letters. Based on the concept of the degree of appearance [52], this chapter proposes a flexible approach to design an algorithm dealing with two above analyses successfully in practice. In the new approach of the chapter, an automaton corresponding to a given pattern to accept the pattern is constructed, where a state of the automaton is a degree of fuzziness. This tool will reflect and update the appearance of the longest prefix of the pattern in the text at any position. If the longest prefix of the pattern is the pattern, then a occurrence of the pattern in the text is found. A new window is always scanned from left to right. However, the window scanning process is only started when the right end substring of length c (for short, called c block) of the window belongs to the pattern, where c is a given positive integer, 1 ≤ c ≤ m. Further, the window scanning process will be stopped immediately if the current state of the automaton holds a given condition. These techniques make the window be shifted more regularly to the right by the most possible maximum number of positions with high probability. The first position scanned in each new window is determined based on the last occurrence of the c block in the pattern. Depending on this position, the initial state of the automaton is also set up newly. These help the proposed algorithm to scan each letter of the text at most once and reflect exactly the appearance of a part of the pattern at any position in the text. The total number of all letters of y accessed by the proposed algorithm is n + 2c for 1 ≤ c ≤ m in the worst case and is cnm−c+1 for 1 ≤ c ≤ m in the best case. Experimental results show that in practice the proposed algorithm is faster than the 10 algorithms in most of the cases of the given patterns and alphabets. The rest of the chapter is organized as follows. Section 3.2 proposes a new approach using an automaton to design an exact pattern matching algorithm (the MRc algorithm). Some theoretical analyses of the MRc algorithm are discussed in Section 3.3. Experimental results comparing the MRc algorithm with the 10 algorithms [33] are presented in Section 3.4. Finally, Section 3.5 draws conclusions from the approach and experimental results. 3.2 The New Algorithm - The MRc Algorithm From the degree of appearance recalled in Section 1.3 of Chapter 1, this section shows a way to build an automaton corresponding to a given pattern to accept the pattern (Theorem 3.2), presents a new approach by using the automaton to design an exact pattern matching algorithm (the MRc algorithm). According to the design of the automaton, each state of the automaton is a degree of appearance. So, to determine the degree of appearance, this section gives the following concept. Definition 3.1. Let p be a pattern of length m. Then Nextp of p is a function such that Nextp : {1, . . . ,m} → {0, . . . ,m− 1} defined by Nextp(l) = max{|u||u is both a proper prefix and a suffix of p[1..l]} for l ∈ {1, . . . ,m}. 41 Example 3.1. Given a pattern p = acbac. Then Nextp of p is determined as follows. l 1 2 3 4 5 Nextp 0 0 0 1 2 The following algorithm is used to compute the Nextp of the given pattern p. Algorithm 1: Input: An arbitrary pattern p with length m. Output: The Nextp of p. Nextp[1] = 0; (3.1) For l = 2 to m Do { k = Nextp[l − 1]; (3.2) While (k > 0 and p[k + 1] 6= p[l]) k = Nextp[k]; (3.3) If (k == 0 and p[k + 1] == p[l]) Nextp[l] = 1; (3.4) If (k == 0 and p[k + 1] 6= p[l]) Nextp[l] = 0; (3.5) If (k 6= 0) Nextp[l] = k + 1; (3.6) } The correctness of the Algorithm 1 is confirmed by the following theorem. Theorem 3.1. For any given pattern p of length m, Algorithm 1 correctly computes the Nextp of x. Proof. By an application of Definition 3.1, Nextp[1] = 0, then (3.1) is true. Consider l = 2, then k = Nextp[l − 1] = Nextp[1] = 0. Thus, clearly, either (3.4) or (3.5) correctly computes Nextp[2] by Definition 3.1. Assume that Algorithm 1 correctly computes Nextp[l] for all 2 ≤ l < m. We must prove that it is also true for l + 1. By (3.2), k = Nextp[l], at the end of (3.3), then either k = 0 or there exists a maximum number k such that p[1..k + 1] = p[l − k..l]. In the case k = 0 the proof is analogous to the case l = 2. In the contrary case, by Definition 3.1, therefore, (3.6) correctly computes the Nextp of p. Base the function Nextp of p, the following lemma offers a way to compute the degree of appearance of p at any position when the degree of appearance of p at the adjacent position ahead in the text x is given. Lemma 3.1. Let p be a pattern, x be a text over the alphabet Σ and suppose that the degree of appearance of p in x at the position i equals l, 0 ≤ l ≤ |p|. Then the degree of appearance l′ at the position i + 1 in x is computed by the formula l′ = Appearancep(l, a), where a = xi+1 and the function Appearancep corresponding to p is defined by Appearancep(l, a) =  0 l = 0, a 6= p[1]; or a /∈ p, (3.7a) 1 l = 0, a = p[1], (3.7b) l + 1 0 < l < |p|, a = p[l + 1], (3.7c) Appearancep(Nextp(l), a) 0 < l < |p|, a 6= p[l + 1]; or l = |p|. (3.7d) 42 Proof. From Definition 1.6, (3.7a), (3.7b) and (3.7c) are true. (3.7d) has two cases. Consider the first case of (3.7d) 0 < l < |p| and a 6= p[l + 1], then l = Nextp(l), if a 6= p[l+1] then the statement l = Nextp(l) is repeated until one of the possibilities (3.7a), (3.7b) and (3.7c) occurs, as the proof above leads to this case is true. In the second case of (3.7d) l = |p|, since Nextp(l) < l, then the statement l = Nextp(l) is done the once and it follows 0 < l < |p|. Then at this time, it means that the second case returns to the first case. So, this case is true. Next, the section constructs an automaton corresponding to a given pattern to accept the pattern in the following theorem by the Lemma 3.1. Theorem 3.2. Let p be a pattern of length m and Mp = (Σ, Qp, q0, δp, Fp) corresponding to p be an automaton over the same alphabet Σ, where • The set of states Qp = {0, 1, . . . ,m}, • The initial state q0 = 0, • The set of final states Fp = {m}, • The transition function δp : Qp × Σ → Qp such that δp(q, a) = Appearancep(q, a), where the function Appearancep corresponding to p as given in Lemma 3.1, • To accept an input string, the transition function δp is extended as follows: δp : Qp × Σ∗ → Qp such that ∀q ∈ Qp, ∀s ∈ Σ∗, ∀a ∈ Σ, δp(q, as) = δp(δp(q, a), s) and δp(q, ) = q. Then Mp accepts the pattern p. Proof. By Lemma 3.1, clearly Appearancep(i, p[i+ 1]) = i+ 1 for all 0 ≤ i < m. Based on the above construction of the Mp, for any input pattern p, δp(q0, p) = δp(δp(0, p[1]), p[2..m]) = δp(Appearancep(0, p[1]), p[2..m]) = = δp(Appearancep(1, p[2]), p[3..m]) = . . . = δp(Appearancep(m− 1, p[m]), ) = = δp(m, ) = m ∈ F. Thus Mp accepts p. Example 3.2. Consider a pattern p = abcba and the automaton Mp given as in Theorem 3.2, denote an arbitrary letter, which is not in p, by the symbol #. Then transition function δp is represented by the following table. δp a b c # 0 1 0 0 0 1 1 2 0 0 2 1 0 3 0 3 1 4 0 0 4 5 0 0 0 5 1 2 0 0 43 To make the window slide more regularly while the text is being scanned, this section uses the breaking points defined as follows. Definition 3.2. Let p be a pattern and x be a text over the alphabet Σ. Consider a position i in x for 1 ≤ i < |x| and the degree of appearance of p in x at positions i and i+ 1 are q and q′, respectively. Then the position i is called a breaking point if one of the two following conditions are satisfied. (a) q = 0, (b) q′ ≤ q < |p|. Notice that if the condition (a) in Definition 3.2 only satisfies, then breaking points are called type a breaking points. One of problems that needs to solve in the new approach is to avoid scanning each letter of the text repeatedly. So, this section gives some terminologies and a new concept below. Given a positive integer c, a string of length c is called a c block. A c block is called (resp. not) to be in p, denoted by c block ∈ p (resp. c block /∈ p), if the c block is (resp. not) a substring of p. For a given positive integer c, 1 ≤ c ≤ m and c ≤ i ≤ m, the substring p[i− c+1..i] is called a c block of p at position i, denoted by c blockip. In particular, c = 1, then c block is only a letter. Definition 3.3. Let p be a pattern and z be a c block of p, where c is a positive integer for 1 ≤ c ≤ |p|. Let i be some position in p for c ≤ i ≤ |p|. Then i is called the last position of appearance of z in p, denoted by Posp(z), if z = c block i p and ∀j > i, j ≤ |x|, z 6= c blockjp. The breaking point occurs, do the next test jump Set q = 0 Window A backtracking position b,c,# c,# a b,# a c,# a b a b,c,# c,# a b c b a 0 1 2 5 0 3 0 4 0 A c_block p x A c_block p, slide the window and do the next test jump Figure 3.2. The basic idea of the proposed approach Based on the automaton Mp and the two concepts of the breaking point and Posp, the basic idea of the proposed approach to exact pattern matching is presented as follows. The proposed approach is shown in Figure 3.2. 1. Use the automaton Mp to reflect and update the degree of appearance of p in x at any position. If the degree of appearance of p at some position equals |p|, then mark an occurrence of p in x. 2. Take advantage of the relationship between the size of the pattern p and the alphabet Σ flexibly, and use the breaking point. These make the approach flexible and the window be shifted more regularly to the right by the most possible maximum number of positions with high probability. 3. A new window is scanned from left to right. The window scanning process is only started when the right end c block of the window belongs to p (do the test jump). The 44 first position scanned in the window (called the backtracking position) depends on the Posp value of the c block and the initial state of Mp will be set up again. These make the proposed approach scan each letter of the text at most once and reflect exactly the appearance of a part of the pattern at any position in the text. The window scanning process will be stopped if the breaking point occurs. At this moment, one next window is determined (slide the window) and immediately do the next test jump. Repeat this mechanism until reach out of x. This mechanism helps the proposed approach to design a pattern matching algorithm effectively. From the above approach, this section designs a new exact pattern matching algorithm. The pseudo-code for the algorithm (called the MRc algorithm) is shown as follows. Algorithm 2 (the MRc algorithm): Input: Two arbitrary pattern p and text x. Output: All occurrences of the pattern p in x. q = 0; (3.8) jump = |p|; (3.9) While (jump ≤ |x|) { If (c blockjumpx ∈ p) (3.10) { bt = Posp(c block jump x ); (3.11) If (q == 0) i = jump− bt+ 1; (3.12) Else If (bt ≤ |p| − q′) { q = 0; (3.13) i = jump− bt+ 1; (3.14) } Else { q = q′; (3.15) i = jump− |p|+ q′ + 1; (3.16) } q′ = δp(q, x[i]); (3.17) Do { q = q′; If (q == |p|) Mark an occurrence of p at i− |p|+ 1 in x; (3.18) i+ +; (3.19) If (i ≤ |x| and q 6= 0) q′ = δp(q′, x[i]); (3.20) Else Break; } While (q 6= 0 and q′ > q or q == |p|); (3.21) 45 If (q == 0) jump = i− 1; (3.22) Else jump = i− q′; (3.23) } Else { jump = jump− c+ 1; (3.24) If (q 6= 0) q = 0; (3.25) } jump = jump + |p|; (3.26) } The performing steps of the MRc for c = 1 is illustrated in the following example. Example 3.3. Consider a pattern p = abcba and the automaton Mp given by Theorem 3.2, denote an arbitrary letter, which is not in p, by the symbol #. Then the transition diagram of Mp is shown in Figure 3.3. Here, the MR1 is used to find p in x. It is easy to compute Posp values for the given p above, Posp(a) = 5, Posp(b) = 4, Posp(c) = 3. Let a text x = ababcbadabeegatkau. Then the whole process of the pattern matching of the MR1 is presented in Table 3.1 as follows. The breaking point occurs, do the next test jump Set q = 0 Window A backtracking position b,c,# c,# a b,# a c,# a b a b,c,# c,# a b c b a 0 1 2 5 0 3 0 4 0 A c_block p x A c_block p, slide the window and do the next test jump Figure 3.3. The transition diagram of the automaton Mp, p = abcba Table 3.1. The performing steps of the MR1 algorithm i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 x[i] a b a b c b a d a b e e g a t k a u Jump Test Back jump=5 c ∈ p i = 3 jump=13 g /∈ p jump=18 u /∈ p q=0 Ignore 1 2 3 4 5 0 Ignore The correctness of the MRc algorithm is guaranteed by the following theorem. 46 Theorem 3.3. For any given pattern p and text x, the MRc algorithm finds all occurrences of the pattern p in x. Proof. Assume that the process of the scanning x starts at the position 1 and i is the position being scanned in x. Algorithm 2 is only correct if the following two cases are true: (1) The state q of automaton Mp is the degree of appearance of p in x at position i. (2) The process of the scanning y does not ignore any occurrences of p. Suppose that if (1) and (2) are true, then when q = m at the position i, this means p occurs at the position i − m + 1 in x by Definition 1.6, at this time the occurrence is marked immediately by (3.18). So all occurrences of p in x are not never ignored. Prove (1): The process of the scanning x only occurs if c blockjumpx ∈ p. Starting with i determined by (3.11), (3.12), (3.14) and (3.16), the state of the automaton Mp is always initialized newly before scanning x by (3.8), (3.13), (3.15) and (3.25), after each time x is scanned by (3.17) and (3.20), i increases by one unit by (3.19) when the conditions in (3.20) and (3.21) hold. Since δp is defined as in Theorem 3.2, this leads to q is always the degree of appearance of p in x at position i. So (1) is true. Prove (2): Note that jump is a position in x to test whether or not c blockjumpx is p. If c blockjumpx /∈ p in (3.10), then the substring x[jump −m + 1..jump] of length m does not contain p, but c blockjump+1x can belong to p. Thus x[jump −m + 1..jump-c+1] must not be scanned by (3.9), (3.24) and (3.26), it means that the window is shifted to the right by m − c + 1 letters and jump is always the left end of this new window. So (2) is true. Conversely, if c blockjumpx ∈ p in (3.10), then the next process of the scanning x starts with the new current position i determined by (3.11), (3.12), (3.14) and (3.16). By the Definitions 1.6, 3.3 and the way to determine i, all occurrences of p are not ignored. Thus (2) is true and the proof is complete. 3.3 Analysis of The MRc Algorithm This section considers some theoretical results which analyze the MRc algorithm (Propositions 3.1, 3.2 and 3.3, Theorem 3.4). Then the section provides an explanation of why the MRc algorithm has advantage than the others listed in [33]. Propostion 3.1. Let p be a pattern of length m and x be a text of length n over the alphabet Σ. Let c be a positive integer constant such that 1 ≤ c ≤ m. Then the MRc algorithm requires cnm−c+1 letters of x accessed in the best case. Proof. In the best case all letters of x are not matched in whole process of the pattern matching, x is only accessed by (3.10) in the MRc algorithm. Let T (n) be the number of all letters of x accessed by the MRc algorithm. Since each time (3.10) is performed and c blockjumpx /∈ p, the window is shifted by m−c+1 positions and c letters of x are accessed. Thus T (n) = cnm−c+1 . Propostion 3.2. Let p be a pattern of length m and x be a text of length n over the alphabet Σ. Let c be a positive integer constant such that 1 ≤ c ≤ m. Then MRc algorithm requires n+ 2c letters of x accessed in the worst case. Proof. Similarly, as in proof of Proposition 3.1, T (n) is the number of all letters of x accessed by the MRc algorithm. The worst case occurs when all letters are matched in 47 whole process of the pattern matching. By (3.12), (3.17), (3.20) and (3.21), x is scanned once for every letter, this implies that n letters are accessed. Clearly, the two statements (3.10) and (3.11) in the MRc algorithm are only performed once in the whole process of the matching, then there are 2c letters of x accessed by them. Thus T (n) = n+ 2c. Denote the probability of an arbitrary event by P . Propostion 3.3. Let p be a pattern of length m over the alphabet Σ. If |Σ| ≥ 4 and 8 ≤ m ≤ 2048, then there exists c, 1 ≤ c ≤ 8 such that for an arbitrary c block z over the alphabet Σ, P (z ∈ x) ≤ 2−5 with a uniform distribution over the alphabet Σ. Proof. Set A = P (z ∈ p). Let d be the number of different c blocks in p. Then for a uniform distribution over Σ, clearly we have A = P (z ∈ p) = d|Σ|c ≤ m− c+ 1 |Σ|c . Consider 1 ≤ c, then A ≤ m|Σ|c . To have A ≤ 2−5, we let m |Σ|c ≤ 2 −5, then 2−5|Σ|c ≥ m. (3.27) To (3.27) holds with |Σ| ≥ 4 and 8 ≤ m ≤ 2048, we choose c, 1 ≤ c ≤ 8 such that 22c−5 ≥ m. We let 22c−5 ≥ 2048, then we can choose c = 8. T

Các file đính kèm theo tài liệu này:

research_on_development_of_methods_of_graph_theory_and_autom.pdf