Research on developing method of mining fuzzy association rules based on linguistic information and its application

The experimental results are compared with the previously published results in Table

3.10, which statistics the number of frequent set with support threshold from 20% to 80%.

Table 3.11 is the experimental results with three methods: the proposed method using

multi-granular representation, the single- granular representation method proposed in

chapter 3 and Herrera method (2009). The results show that the method of using Multigranular representation for 1-ItemSet number is better than the number with the other two

methods (as shown in Figure 4.3). Apparently, (a list of comparative properties: coverage,

overlap is presented in section 3.3.3) and the methods used for comparison are all

performed with single-granular representation. The experimental results show the

advantages of using multi- granular representation and HA, reinforcing the research results23

related to the use of multi- granular representation (some works published in some in

recent years using multi- granular representations

26 trang | Chia sẻ: honganh20 | Lượt xem: 766 | Lượt tải: 1

Bạn đang xem trước 20 trang tài liệu Research on developing method of mining fuzzy association rules based on linguistic information and its application, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

≤ m, 𝐿𝑟: The set of universal itemsets is hedge label r (itemset) 1 ≤ r ≤ m 8 The algorithm of mining database based on HA for quantitative value is carried out as follows: Input: Transaction database D, HAs for the fuzzy attribute, Min_sup and Min_conf Output: Fuzzy AR Step 1: Convert the quantitative values vj (k) of the transaction Aj in D (i), where i belongs to the range (1, N). For vj (k) , if vj (k) is outside 1 of the 2 ends (2 labels of the maximum and minimum hedge), there will be only one hedge label for that end in vj (k) . In contrast vj (k) is represented by two successive hedge labels with the smallest value segment on the value field of vj (k) , each label corresponding to a value representing the membership degree fjk (i) (j = 1, 2) in vj (k) with that hedge label. This attribute is computed as the distance of vj (k) to the value represented by the corresponding hedge label. Step 2: Carry out the algorithm of compressed transactions (Algorithm 1) with the fuzzy DB obtained in the step 1. As a result of this step, we have the compressed transaction and quantification table. Similar to the Apriori algorithm, we apply the algorithm to the compressed DB to create a frequent ItemSets. Step 3: Based on the value in TL1 of the quantification table, value in TL1is the support of các 𝑅𝑗𝑘 . If 𝑆𝑢𝑝(𝑅𝑗𝑘) ≥ min_𝑠𝑢𝑝, then Rjk is put into L1. Step 4: If L1 ≠ ∅, go to the next step; if L1 = ∅ the algorithm is ended. Step 5: The algorithm that builds the frequent itemset of level r from the frequent itemset of level r −1 by choosing 2 frequent itemsets of level r −1 when these 2 itemsets are different from each other in only one set. After joining these two itemsets, we have the candidate itemset 𝐶𝑟. Before using the compressed DB to compute the support degree of itemsets in 𝐶𝑟, we can eliminate some candidates without revising compressed DB, based on the value of TLr in the quantification table. Step 6: Approve compressed DB basing on the formula (4) in order to compute the support degree of each itemset in Cr . If there is any itemset which has the support degree appropriate with minimum support, it is taken to Lr Step 7: Follow the next steps and repeat frequent itemsets with greater levels, which are produced with form (or +1), the frequent itemset S with the item (𝑠1, 𝑠2, , 𝑠𝑡 , , 𝑠𝑟+1) trong 𝐶𝑟+1, 1 ≤ 𝑡 ≤ 𝑟 + 1. (a) Compute the support degree sup(S) of S in the transaction; (b) If 𝑆𝑢𝑝(𝑆) ≥ 𝑀𝑖𝑛_𝑠𝑢𝑝 then S is taken to 𝐿𝑟+1 Step 8: If Lr +1 is null, then the next step is carried out; in contrast, propose r = r + 1, step 6 and step 7 are repeated. Step 9: Give the AR from the collected frequent itemset. 2.5. The experimental results The experimental results were performed with two algorithms: the proposed algorithm and the Apriori fuzzy algorithm in C # programming language and run tests on computers with the following details: Intel (R) Core (TM) i5 CPU 1.7GHz, 6GB RAM. In this chapter, the thesis uses two DB for testing: FAM95 and STULONG. 2.5.1. Experiment with the database FAM95 In Table 2.4, the statistics of the number of AR are obtained by three methods: the method used: uncompressed DB, compressed DB, and compressed DB and quantitative table. With the support of 20%, 30% of the number of combined rules of the proposed 9 thesis method is different from the method using the Apriori algorithm, with the support degree from 40% to 70%, the number of AR collected by the three methods is the same. Table 2.4: Number of AR obtained with confidence 80% Minimum Support (%) Uncompressed DB Compress DB Compressed DB, and Quantification table 20 238 255 255 30 98 94 94 40 34 34 34 50 18 18 18 60 6 6 6 70 2 2 2 Table 2.5: AR obtained with 60% support and 80% confidence Association Rules Support Confidence Uncompressed DB 1 { VL_INCHEAD } ==> { VL_INCFAM } 92% 97% 2 { VL_INCFAM } ==> { VL_INCHEAD } 92% 98% 3 { LY_AGE } ==> { VL_INCHEAD } 69% 98% 4 { LY_AGE } ==> { VL_INCFAM } 70% 99% 5 { VL_INCHEAD, LY_AGE } ==> { VL_INCFAM } 69% 99% 6 { VL_INCFAM, LY_AGE } ==> { VL_INCHEAD } 69% 99% Compress DB 1 { VL_INCHEAD } ==> { VL_INCFAM } 91% 98% 2 { VL_INCFAM } ==> { VL_INCHEAD } 91% 99% 3 { LY_AGE } ==> { VL_INCHEAD } 69% 99% 4 { LY_AGE } ==> { VL_INCFAM } 69% 100% 5 { VL_INCHEAD, LY_AGE } ==> { VL_INCFAM } 69% 100% 6 { VL_INCFAM, LY_AGE } ==> { VL_INCHEAD } 69% 99% Compressed DB, and Quantification table 1 { VL_INCHEAD } ==> { VL_INCFAM } 91% 98% 2 { VL_INCFAM } ==> { VL_INCHEAD } 91% 99% 3 { LY_AGE } ==> { VL_INCHEAD } 69% 99% 4 { LY_AGE } ==> { VL_INCFAM } 69% 100% 5 { LY_AGE, VL_INCHEAD } ==> { VL_INCFAM } 69% 100% 6 { LY_AGE, VL_INCFAM } ==> { VL_INCHEAD } 69% 99% Table 0.3: AR obtained with 70% support and 80% confidence Association Rules Support Confidence Uncompressed DB 1 { VL_INCHEAD } ==> { VL_INCFAM } 92% 97% 2 { VL_INCFAM } ==> { VL_INCHEAD } 92% 98% Compressed DB 1 { VL_INCHEAD } ==> { VL_INCFAM } 91% 98% 2 { VL_INCFAM } ==> { VL_INCHEAD } 91% 99% Compressed DB, and Quantification table 1 { VL_INCHEAD } ==> { VL_INCFAM } 91% 98% 2 { VL_INCFAM } ==> { VL_INCHEAD } 91% 99% It can be seen in Table 2.5, Table 2.6, the number of combination rules of the three trials (with uncompressed DB, a compressed DB without using a quantitative table, a compression DB using a quantitative table) has the same number. In Table 2.5, the 10 comparisons of each rule of the three corresponding methods show that the support and confidence of each law are different but not significant. Figure 0.2: Processing duration with compressed database Figure 2.3 presents the real-time comparison of the fuzzy Apriori algorithm with the uncompressed DB and the processing duration with the compressed DB without the quantitative table. In figure 2.4, the algorithm duration is compared with compression DB using quantitative table and compression DB without the quantitative table. The time it took to compress the DB was 135 seconds, the number of transactions obtained after compression was 2402 transactions. Experimental results show 60% confidence. The thesis tested with two algorithms: The association rule with the approach of HA [2] and the proposed thesis algorithm is compressing fuzzy DB in the direction of HA. The test results show that the proposed DB compression method gives faster results than the method proposed in [2]. Moreover, the values of frequent itemsets are similar to those using the uncompressed DB. 2.5.2. Experiment with the database STULONG In Table 2.7, the statistics of the number of association rules are obtained by three methods: the method using uncompressed DB, compressed DB, and compressed DB and quantitative table. Table 0.4: The number of AR obtained with confidence 80% Minimum Support (%) Uncompressed DB, Compressed DB Compressed DB, and Quantification table 5% 7822 8188 8185 10% 5076 5532 5527 20% 2149 2528 2528 30% 1096 1348 1318 40% 587 599 599 50% 248 287 287 60% 107 155 155 70% 75 75 75 80% 23 35 35 Comment: The number of AR of the proposed methodology of the thesis using compressed DB and the one using the quantification table and without using the quantitative table is the same. Table 0.5: Comparison between the processing duration of AR mining with the confidence 80% Minimum Support (%) Uncompressed DB, Compressed DB Compressed DB, and Quantification table 5% 669 41.4 41.4 10% 580 26.4 26.3 20% 187 8.3 8.3 30% 72 3.6 3.5 40% 26 1.1 1.1 50% 8 0.4 0.4 60% 3 0.2 0.2 70% 1 0.1 0.1 Table 2.9 and Table 2.10 show that the number of combination rules of three trials (with uncompressed DB, compressed DB without using quantification table, compressed 11 DB using quantification table) is the same. In Table 2.9, Table 2.10 comparing each rule of the three methods presents that the support and confidence of each rule are different but not significant. Table 0.6: AR obtained with the support 85% and the confidence 80% STT Association rules Support Confidence Uncompressed DB 1 { LL_A5 } ==> { LH_A2 } 86 % 97 % 2 { LH_A2 } ==> { LL_A5 } 86 % 93 % 3 { LL_A5 } ==> { VH_A1 } 88 % 99 % 4 { VH_A1 } ==> { LL_A5 } 88 % 91 % 5 { LH_A2 } ==> { VH_A1 } 92 % 99 % 6 { VH_A1 } ==> { LH_A2 } 92 % 95 % 7 { LL_A5, VH_A1 } ==> { LH_A2 } 85 % 97 % 8 { LH_A2, VH_A1 } ==> { LL_A5 } 85 % 93 % 9 { LH_A2, LL_A5 } ==> { VH_A1 } 85 % 100 % Compressed DB 1 { LL_A5 } ==> { LH_A2 } 88 % 99 % 2 { LH_A2 } ==> { LL_A5 } 88 % 95 % 3 { LL_A5 } ==> { VH_A1 } 88 % 100 % 4 { VH_A1 } ==> { LL_A5 } 88 % 91 % 5 { LH_A2 } ==> { VH_A1 } 92 % 100 % 6 { VH_A1 } ==> { LH_A2 } 92 % 95 % 7 { LL_A5, VH_A1 } ==> { LH_A2 } 87 % 99 % 8 { LH_A2, VH_A1 } ==> { LL_A5 } 87 % 95 % 9 { LH_A2, LL_A5 } ==> { VH_A1 } 87 % 100 % Compressed DB, and Quantification table 1 { B3 } ==> { A4 } 92 % 100 % 2 { A4 } ==> { B3 } 92 % 95 % 3 { E2 } ==> { A4 } 88 % 100 % 4 { A4 } ==> { E2 } 88 % 91 % 5 { E2 } ==> { B3 } 88 % 99 % 6 { B3 } ==> { E2 } 88 % 95 % 7 { B3, E2 } ==> { A4 } 87 % 100 % 8 { A4, E2 } ==> { B3 } 87 % 99 % 9 { A4, B3 } ==> { E2 } 87 % 95 % Table 0.7: AR obtained with the support 90% and the confidence 80% STT Association rule Support Confidence Uncompressed DB 1 { LH_A2 } ==> { VH_A1 } 92 % 99 % 2 { VH_A1 } ==> { LH_A2 } 92 % 95 % Compressed DB 1 { LH_A2 } ==> { VH_A1 } 92 % 100 % 2 { VH_A1 } ==> { LH_A2 } 92 % 95 % Compressed DB, and Quantification table 1 { B3 } ==> { A4 } 92 % 100 % 2 { A4 } ==> { B3 } 92 % 95 % 12 2.6. Conclusion In this chapter, thesis researches HA and develops the transaction DB compression algorithm used for solving fuzzy AR. With this approach, close transactions are combined to form new transactions, reducing the size of the input DB. Transaction DB compression algorithm was tested on the DB: FAM95 and STULONG. Experimental results with 2 DB showed that the proposed method of DB compression gave faster results than the proposed method in [2] and the values of frequent itemsets were similar to those when we used the uncompressed DB. The content of this chapter is shown in the works [i, ii]. In this chapter, the thesis uses HA with single-granular representations for properties with the same parameters. In order to improve the efficiency of combining AR and to find more meaningful rules, in Chapter 3, the thesis studies and proposes methods to optimize fuzzy parameters to suit each attribute with single-granular and multi-granular representation. CHAPTER 3. FUZZY PARTITIONS OF ATTRIBUTES BASED ON GRANULAR REPRESENTATION OF HEDGE ALGEBRA In this chapter, the thesis presents some ways of fuzzy domain division and proposes the method of fuzzy domain division using HA theory based on single-granular and multi- granular representation. HA allowed to model and design linguistic elements along with semantics based on fuzzy sets. The thesis proposes optimal algorithms of MFs based on HA theory for the problem of fuzzy AR mining. The experimental results show that the results of the proposed methods have advantages more than some previously proposed methods. 3.1. Partitioning the domain values of the attributes 3.1.1. Introduction The partition of determined domain to the quantitative attributes of an input data set is as follows: Propose that there is a certain domain of a given attribute (only qualitative attributes are considered). Each quantitative attribute has a definite domain (or value domain) that is the domain on the real number axis including the values that the quantitative attribute can receive. The requirement is to divide the attribute domain into granulars, and each granular has a language label denoted by a fuzzy set. In the fuzzy set theory approach, the authors divide the attribute value domain into fuzzy sets, and adjust the parameters of fuzzy sets. The assignment of linguistic labels to fuzzy sets is based on the designer's intuition. HA is derived from a linguistic awareness. Then, it can design linguistic terms and semantics based on their fuzzy set. 3.1.2. Discrete quantitative attribute There are two ways of dividing a definite domain of attributes into fuzzy and clear sub domains. The division into subdomains is evident in the following example: If A is a quantitative & discrete attribute or a categorical attribute with a finite range of values {v1, v2, , vk} and k is small, the attributes will be changed. This becomes k binary properties of the form A_V1, A_V2, A_Vk. The value of a record in the field A_Vi is 1 if the value of that record of the original attribute A is equal to 𝑣𝑖 . In the remaining cases, the value of A_Vi is 0. If A is a quantitative & continuous attribute or A is a discrete quantitative attribute or a category attribute with a value domain of form {v1, v2, , vp} (large p), we will reflect map into q binary properties , , , < 13 𝐴: startq. . endq >. The value of a record in the field will be 1 if the value of that record at the original A attribute is in the range [starti. . endi], otherwise it will be 0 . In mining fuzzy AR, we need to divide the value domain of attributes into fuzzy domains in which each fuzzy domain usually associated with a MF and linguistic label. The way of dividing a defined domain into fuzzy sub-domains has more advantages and is the way that the thesis uses . Therefore, it will be elaborated in section 3.1.3. 3.1.3. Divide the value domain of the attributes based on the fuzzy set approach Some common methods of fuzzy domain division: a) Random partitioning: A fixed number of sub-domains (partitions) is chosen and divide an item into equal areas. This method is simple and effective when there is not enough information about the DB. b) Partitioning Partition clustering: Using clustering method to find out the fuzzy set, this method takes into account the diversity of data distribution. c) Dynamically constrained partitioning: The fuzzy domain division helps us build MFs for fuzzy domains. Each MF usually has parameters to adjust the membership of values into the fuzzy domain. Optimizing the parameters of the MFs plays an important role in mining fuzzy AR. Therefore, some studies use evolutionary algorithms to increase its optimization. 3.2. Method of fuzzy partitioning by granular representation on HA In this section, the thesis presents the domain division method to determine quantitative attributes based on the approach of HA by single-granular and multi-granular representation of data. HA gives us a good mathematical structure built on the domain of identified attributes, helps us not only have a simple domain partition defined, but also allows us to attach the semantic of fuzzy subdomains to linguistic labels which it represents, always ensures the natural order of the linguistic labels. Moreover, according to the thesis, partition based on HA is always a strong one. With this approach, the AR mined will reflect better and more diverse knowledge hidden in the information explored, from highly generalized knowledge to highly separated knowledge, more detailed meet the needs of the manager. 3.2.1. Partitioning of attribute domain values using single-granular representation With some results related to the fuzzy domain of elements of HA, in section 1.2.4, it can be seen that there is a way to compute the degree of membership of any value in the numerical DB into the fuzzy sets used subdivision of fuzzy domain of section [25, 26]. Apparently, on the defined domain of item, it may be normalized to range [0,1], any value is between two quantification of semantics values of two consecutive fuzzy domains or coinciding with a price). The geographic property value of a fuzzy interval is due to the nature that creates the defined domain partition of the fuzzy spaces. Thus, the distance between the 𝑥𝑖𝑗 values and the two quantification of semantics values can be used to ccompute the attribute of 𝑥𝑖𝑗 into fuzzy sets represented by those fuzzy compartments (in case of overlapping with a quantification of semantics value, there is only 1 degree of membership): the smaller the distance is, the greater the degree of membership becomes, if they are identical, it can be considered as reaching 1. In Figure 3.1, quantification of semantics values are used to divide the defined domain of the attributes into fuzzy domains. Corresponding to each fuzzy domain, the construction of triangles represents MFs of the fuzzy set with 1 vertex whose coordinates are (𝜐(𝑥𝑖), 1), the other two vertices 14 are on the defined domain, with corresponding coordinates (𝜐(𝑥𝑖−1),0), (𝜐(𝑥𝑖+1), 0), where 𝜐(𝑥𝑖−1), 𝜐(𝑥𝑖), 𝜐(𝑥𝑖+1) are 3 consecutive quantification of semantics values Figure 0.1). Figure 0.1: Constructing partitioning of attribute domain values based on HA approach It can be seen that these two constructions are essentially equivalent. Indeed, suppose that we have a point E which is an arbitrary point on the axis representing the domain of the property 𝐼𝑖 . Then, in the first way, the distance 𝐸𝜈(𝑥2) and 𝐸𝜈(𝑥3) will be used to determine the degree of membership of E on the fuzzy set ZZs represented by the MFs - triangle 𝜈(𝑥1) 𝐵 𝜈(𝑥3) and 𝜈(𝑥2) 𝐶 𝜈(𝑥4), through normalization so that the membership is always in the range [0,1]. As for the second way, we have EG and EF are the dependence of E on these two fuzzy sets. Apparently, as EG is parallel to 𝜈(𝑥2) 𝐵 so 𝐸𝐺 𝜈(𝑥2)𝐵 = 𝐸 𝜈(𝑥3) 𝜈(𝑥2)𝜈(𝑥3) . Similarly, 𝐸𝐹 𝑣(𝑥3)𝐶 = 𝜈(𝑥2)𝐸 𝜈(𝑥2)𝜈(𝑥3) . Moreover, 𝜈(𝑥2) 𝐵 = 𝜈(𝑥3) 𝐶 = 1 so in the end we have 𝐸𝐹 𝐸𝐺 = 𝐸 𝜈(𝑥2) 𝐸 𝜈(𝑥3) . Thus, it is easy to deduce that the two ways of attaching these degrees of membership are equivalent. It also emphasizes how to attach the degree of membership by HA is proper in terms of perception. The way of constructing MFs or equivalents are fuzzy sets to divide a defined domain of attributes based on the HA approach and it has the following advantages: - Because the construction using HA is based on the semantics that humans feel, it can be emotionally seen that the built-in MFs reflect the semantics of the fuzzy set that it represents. - It is easy to see that the coverage of the MFs is good (always covering the defined domain). From this we see if we need to optimize the suitability level of MF, we only need to optimize the overlap level and the coverage of the MF. The problem of optimizing the parameters of HA according to overlap and usefulness can be solved by a GA algorithm. - The parameters to be managed during construction are few (each parameter is a triangle, the value of LBP), when changing the initial parameters of HA, it is easy to redefine the new MF and the MF remains the same. measurements overlap and cover the same. This method is simple and reasonable. 3.2.2. Partitioning of attribute domain values using multi- granular representation The fuzzy domain subdivision method based on HA approach using single granular representation has advantages as presented, there are still limitations related to the semantics of the data. According to HA, the MFs we create above are based on the partition in terms of the same length. That means the AR that we discover only include terms of the same length, which reduces the Figure 0.2: Partitioning of attribute domain values using single- granular representation 15 meaning of the discoverable rules. If we are not very concerned about the data semantics, we simply divide the domain in a very mechanically defined way (as most fuzzy set approaches do), then the proposed method is single granular representation using HA is presented in section 3.2.1 which is quite good. However, considering the semantics of data - which is extremely important to acquire good knowledge in association rule mining - we must take a deeper approach. It is possible to construct semantic fuzzy domains in order to create the partitions but this way is not very exact because the partitions created are not unique. In this chapter, the thesis chooses an approach based on data representation according to the multi granular structure. With this method, in order to improve the knowledge of AR, the combined rules will be more productive. Figure 0.3: Multi-level granular structure On the ideological side, using the multi- granular representation, as mentioned, gives us a more diverse view about input information. The construction, representation and use of a granular structure often adhere to the rules of multilevel and and multiview. The multilevel rule is the advantage brought about by the granular structure shown in the multi-granulars understanding and representation. Diversity rules are associated with the objective existence of data (information granulars) and the subjective view of the data- using researcher, whereby at each level of the granular structure, information can divide in different ways. With granular computation following the two rules mentioned above, we have a structured view of data, which is both systematic and simpler in solving data mining problems. In addition, it is very important in the research direction according to the HA's approach of the thesis, granular calculation and attached to it is the representation of multi-granulars data according to the above rules to satisfy the requirements of interpretation. The requirement is that the division of granulars needs to preserve the natural linguistic order (such as "young" <"old", then when dividing further, every part of the "young" language label, such as "fairly" young "must be smaller than every part of" middle-aged ", such as" quite old ", ie" quite young "<" quite old "and preserving the general-private relationship, the higher the general magnetic, the more fuzzy its price contain fuzzy set prices of more specific words; multi-granulars are structures that can satisfy both of these requirements.An important highlight is that with HA approach, the transition to multi-level and multi- level granular calculations will be a completely simple form that the thesis will demonstrate later. For fuzzy set theory (according to L.Zadeh), one of the limitations of the methods using multi-granular representations is that sometimes it is not easy to select the MFs because there is little basis for defining MFs at different levels, and it is not possible to construct a binding between them. Most of this determination is based on empirical experience. At the same time, conducting calculations with different levels of data will complicate the process, resulting in a much greater cost of time and memory. Figure 0.4: Partitioning the value domain of the attributes based on the multi-granular representation 16 In contrast, with HA, the design of fuzzy partition on the value domain of attributes at different levels of multi-granular representation is easy because it is in the way of building HA itself. In HA theory, for each value domain of the attributes, we only need to define the fuzzy parameter set of HA, we can determine the fuzzy distance of all the terms through the calculation formula to determine whether the rank how long the word is (ie, no matter how much the term is in the multi-granular representation system). Decentralization is one of the main ways that GrC uses in the construction of HA. According to HA, each term from x with the length k can be partitioned into elements ℎ𝑖𝑥 (where ℎ𝑖 is all the hedges of HA being considered) with the length k + 1. It can be said HA is a very suitable tool for ccomputing multi-granulars. Figure 3.4 is an example of 3 granulars formulated based on semantic quantitative values of HA. The granular of level 0 has 3 MFs, the granular level 1 has 4 MFs, and the particle level 2 has 6 MFs. 3.3. The optimal method of fuzzy parameters based HA for the mining AR Figure 0.5: Optimal partition search scheme for attribute domain identification and AR mining To find the optimal MF for fuzzy AR mining, the previous authors have used several criteria to evaluate the MFs for attributes. Specifically, the suitability of the set MF used to divid

Các file đính kèm theo tài liệu này:

research_on_developing_method_of_mining_fuzzy_association_ru.pdf