Table of Contents
Introduction
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis
1.1. Introduction
1.2. The Early Methods
1.2.1. Latent Semantic Analysis
1.2.2. Probabilistic Latent Semantic Analysis
1.3. Latent Dirichlet Allocation
1.3.1. Generative Model in LDA
1.3.2. Likelihood
1.3.3. Parameter Estimation and Inference via Gibbs Sampling
1.3.4. Applications
1.4. Summary
Chapter 2. Frameworks of Learning with Hidden Topics
2.1. Learning with External Resources: Related Works
2.2. General Learning Frameworks
2.2.1. Frameworks for Learning with Hidden Topics
2.2.2. Large-Scale Web Collections as Universal Dataset
2.3. Advantages of the Frameworks
2.4. Summary
Chapter 3. Topics Analysis of Large-Scale Web Dataset
3.1. Some Characteristics of Vietnamese
3.1.1. Sound
3.1.2. Syllable Structure
3.1.3. Vietnamese Word
3.2. Preprocessing and Transformation
3.2.1. Sentence Segmentation
3.2.2. Sentence Tokenization
3.2.3. Word Segmentation
3.2.4. Filters
3.2.5. Remove Non Topic-Oriented Words
3.3. Topic Analysis for VnExpress Dataset
3.4. Topic Analysis for Vietnamese Wikipedia Dataset
3.5. Discussion
3.6. Summary
Chapter 4. Deployments of General Frameworks
4.1. Classification with Hidden Topics
4.1.1. Classification Method
4.1.2. Experiments
4.2. Clustering with Hidden Topics
4.2.1. Clustering Method
4.2.2. Experiments
4.3. Summary
Conclusion
Achievements throughout the thesis
Future Works
References
Vietnamese References
English References
Appendix: Some Clustering Results
2.1. Learning with External Resources: Related Works

Depending on the kind of external resources they exploit, existing methods can be roughly classified into two categories: those that make use of unlabeled data, and those that exploit structured or semi-structured data.
The first category is commonly referred to as semi-supervised learning. The
key argument is that unlabeled examples are significantly easier to collect than labeled
ones. One example of this is web-page classification. Suppose that we want a program to
electronically visit some web site and download all the web pages of interest to us, such
as all the Computer Science faculty pages, or all the course home pages at some
university. To train such a system to automatically classify web pages, one would
typically rely on hand labeled web pages. Unfortunately, these labeled examples are fairly
expensive to obtain because they require human effort. In contrast, the web has hundreds
of millions of unlabeled web pages that can be inexpensively gathered using a web
crawler. Therefore, we would like the learning algorithms to be able to take as much
advantage of the unlabeled data as possible.
Semi-supervised learning has received a great deal of attention over the last decade. Yarowsky (1995) uses self-training for word sense disambiguation, e.g. deciding whether the word "plant" means a living organism or a factory in a given context. Rosenberg et al. (2005) apply it to object detection in images and show that the semi-supervised technique compares favorably with a state-of-the-art detector. In 2000, Nigam and Ghani [30] performed extensive empirical experiments comparing co-training with generative mixture models and Expectation Maximization (EM). Jones (2005) used co-training, co-EM and other related methods for information extraction from text. In addition, a number of works have applied Transductive Support Vector Machines (TSVMs), which exploit unlabeled data to determine the optimal decision boundary.
The second category covers works that exploit structured or semi-structured resources such as Wikipedia to support the learning process. Gabrilovich et al. (2007) [16] demonstrated the value of using Wikipedia as an additional source of features for text classification and for determining the semantic relatedness between texts. Banerjee et al. (2007) [3] also extract the titles of Wikipedia articles and use them as features for clustering short texts. Unfortunately, this kind of approach is not very flexible in the sense that it depends heavily on the particular external resource and application.
This chapter describes frameworks for learning with the support of a topic model estimated from a large universal dataset. This topic model can be considered background knowledge for the domain of application. It helps the learning process capture the hidden topics of the domain as well as the relationships between topics and words, and among the words themselves, thus partially overcoming the problem of different word choices in text.
2.2. General Learning Frameworks
This section presents general frameworks for learning with the support of hidden topics.
The main motivation is to benefit from huge sources of online data in order to enhance the quality of text/Web clustering and classification. Unlike previous studies of learning with external resources, we approach this issue from the point of view of text/Web data analysis based on recently successful latent topic models such as LSA, pLSA, and LDA. The underlying idea of the frameworks is that for each
learning task, we collect a very large external data collection called “universal dataset”,
and then build a learner on both the learning data and a rich set of hidden topics
discovered from that data collection.
2.2.1. Frameworks for Learning with Hidden Topics
Corresponding to the two typical learning problems, i.e. classification and clustering, we describe two frameworks that differ slightly in their architectures.
a. Framework for Classification
Figure 2.1. Classification with Hidden Topics
Nowadays, the continuous development of the Internet has created a huge number of documents which are difficult to manage, organize and navigate. As a result, the task of automatic classification, which is to categorize textual documents into two or more predefined classes, has received a lot of attention.
Several machine-learning methods have been applied to text classification, including decision trees, neural networks, support vector machines, etc. In typical applications of machine-learning methods, the training data is passed to a learning phase. The result of the learning step is an appropriate classifier capable of categorizing new documents. However, when the training data is smaller than expected or the data to be classified is very sparse [52], learning with the training data alone cannot provide a satisfactory classifier. Inspired by this fact, we propose a framework that enriches both the training data and the new, incoming data with hidden topics from an available large dataset so as to enhance the performance of text classification.
Classification with hidden topics is described in Figure 2.1. We first collect a very large external data collection called the "universal dataset". Next, a topic analysis technique such as pLSA, LDA, etc. is applied to this dataset. The result of this step is an estimated topic model, which consists of hidden topics and the probability distributions of words over these topics. Based on this model, we can perform topic inference for the training dataset and for new data. For each document, the output of topic inference is a probability distribution over the hidden topics – the topics analyzed in the estimation phase – given the document. The topic distributions of the training dataset are then combined with the training dataset itself for learning the classifier. Similarly, new documents to be classified are combined with their topic distributions to create the so-called "new data with hidden topics" before being passed to the learned classifier.
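To make the data flow of Figure 2.1 concrete, the following is a minimal sketch of the framework assuming scikit-learn; the thesis itself estimates the topic model with GibbsLDA++, and the toy documents, the number of topics, and the linear SVM used here are hypothetical choices for illustration only.

    # Toy sketch of Figure 2.1 with scikit-learn (not the tools used in the
    # thesis); documents, labels, and parameters are hypothetical.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.svm import LinearSVC

    # 1. Estimate a topic model on the (unlabeled) universal dataset.
    universal_docs = [
        "trường lớp học sinh giáo dục giáo viên",       # education-like text
        "bóng đá cầu thủ đội tuyển thể thao thi đấu",   # sports-like text
    ]
    vectorizer = CountVectorizer()
    X_universal = vectorizer.fit_transform(universal_docs)
    lda = LatentDirichletAllocation(n_components=10, random_state=0)  # K is 100 or 200 in Chapter 3
    lda.fit(X_universal)

    # 2. Topic inference: append P(topic | document) to the term vector.
    def with_hidden_topics(docs):
        X_terms = vectorizer.transform(docs)
        theta = lda.transform(X_terms)                 # inferred topic distributions
        return np.hstack([X_terms.toarray(), theta])   # "data with hidden topics"

    train_docs = ["học sinh thi tốt nghiệp", "cầu thủ vô địch mùa giải"]
    train_labels = ["education", "sports"]
    new_docs = ["giáo viên dạy tiểu học"]

    # 3. Learn a classifier on the enriched training data, then classify
    #    the enriched new data.
    classifier = LinearSVC().fit(with_hidden_topics(train_docs), train_labels)
    print(classifier.predict(with_hidden_topics(new_docs)))

In practice, of course, the universal dataset contains tens of thousands of documents and the number of hidden topics is much larger, as reported in Chapter 3.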
b. Framework for Clustering
Figure 2.2. Clustering with Hidden Topics
Text clustering is to automatically generate groups (clusters) of documents based on the similarity or distance among documents. Unlike classification, the clusters are not known in advance. The user can optionally specify the desired number of clusters. The documents are then organized into clusters, each of which contains "close" documents.
Clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and divide it into smaller ones.
The distance measure, which determines how the similarity of two documents is calculated, is key to the success of any text clustering algorithm. Some documents may be close to one another according to one distance and farther away according to another. Common distance functions are the Euclidean distance, the Manhattan distance (also called the taxicab norm or 1-norm), and the maximum norm, to name but a few.
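As a small numeric illustration (with hypothetical three-topic document vectors), the three distance functions can be computed as follows:

    # Hypothetical topic distributions of two documents, illustrating the
    # distance functions named above.
    import numpy as np

    d1 = np.array([0.7, 0.2, 0.1])
    d2 = np.array([0.3, 0.5, 0.2])

    euclidean = np.sqrt(np.sum((d1 - d2) ** 2))   # 2-norm, ~0.51
    manhattan = np.sum(np.abs(d1 - d2))           # 1-norm (taxicab), 0.8
    max_norm  = np.max(np.abs(d1 - d2))           # maximum norm, 0.4
    print(euclidean, manhattan, max_norm)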
Web clustering, which is text clustering specific to web pages, can be offline or online. Offline clustering is to cluster the whole storage of available web documents and has no constraint on response time. In online clustering, the algorithms need to meet a "real-time" condition, i.e. the system needs to perform clustering as fast as possible. For example, the algorithm should take document snippets instead of whole documents as input, since downloading the original documents is time-consuming. The question here is how to enhance the quality of clustering for such document snippets in online web clustering. Inspired by the fact that snippets are only small pieces of text (and thus poor in content), we propose a framework that enriches them with hidden topics for clustering (Figure 2.2). The framework and its topic analysis are similar to those for classification; the differences stem only from the essential differences between classification and clustering.
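The sketch below illustrates this snippet-enrichment step, again assuming scikit-learn and toy data; it mirrors the classification sketch in the previous subsection but feeds the enriched representations to a clustering algorithm (k-means here, a hypothetical choice) instead of a classifier.

    # Toy sketch of Figure 2.2 with scikit-learn; snippets and parameters
    # are hypothetical.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.cluster import KMeans

    # Topic model estimated on a (tiny, hypothetical) universal dataset.
    universal_docs = [
        "trường lớp học sinh giáo dục giáo viên",
        "bóng đá cầu thủ đội tuyển thể thao thi đấu",
    ]
    vectorizer = CountVectorizer()
    lda = LatentDirichletAllocation(n_components=5, random_state=0)
    lda.fit(vectorizer.fit_transform(universal_docs))

    # Enrich sparse snippets with their inferred topic distributions, then cluster.
    snippets = [
        "học sinh thi tốt nghiệp",
        "giáo viên dạy tiểu học",
        "cầu thủ bóng đá vô địch",
        "đội tuyển thi đấu mùa giải",
    ]
    X_snippets = vectorizer.transform(snippets)
    theta = lda.transform(X_snippets)                     # P(topic | snippet)
    features = np.hstack([X_snippets.toarray(), theta])   # "snippets with hidden topics"
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
    print(labels)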
2.2.2. Large-Scale Web Collections as Universal Dataset
Despite the obvious differences between the two learning frameworks, they share one key phase: topic analysis of the previously collected dataset. Some important considerations for this phase are:
- The degree of coverage of the dataset: the universal dataset should be large enough to cover the topics underlying the domain of application.
- Preprocessing: this step is very important for obtaining good analysis results. Although there is no general recipe for all languages, the common advice is to remove as many noise words as possible, such as functional words, stop words, and overly frequent or rare words.
- Methods for topic analysis: some applicable analysis methods were mentioned in Chapter 1. The trade-off between the quality of topic analysis and time complexity should be taken into account; for example, topic analysis for snippets in online clustering should take as little time as possible to meet the "real-time" condition.
2.3. Advantages of the Frameworks
- The general frameworks are flexible and general enough to apply to any domain/language. Once we have estimated a topic model from a universal dataset, its hidden topics can be reused for several learning tasks in the same domain.
- They are particularly useful for sparse data mining. Sparse data such as snippets returned from a search engine can be enriched with hidden topics, and thus enhanced performance can be achieved.
- Because the learner itself is trained on smaller data, the presented methods require fewer computational resources than semi-supervised learning.
- Thanks to the generative model for analyzing topics of new documents (in the case of LDA), we have a natural way to map documents from the term space into the topic space. This is an advantage over the heuristic-based mappings of previous approaches [16][3][10].
2.4. Summary
This chapter described two general frameworks for learning with hidden topics and their advantages: one framework for classification and one for clustering. The main advantages of our frameworks are that they are flexible and general enough to apply to any domain/language and that they are able to deal with sparse data. The key phase shared by the two frameworks is topic analysis of a large-scale web collection called the "universal dataset". The quality of the topic model estimated from this data strongly influences the performance of learning in the later phases.
Chapter 3. Topics Analysis of Large-Scale Web Dataset
As mentioned earlier, topic analysis of a universal dataset is key to the success of our proposed methods. Thus, toward Vietnamese text mining, this chapter addresses the problem of topic analysis for large-scale Vietnamese web datasets.
3.1. Some Characteristics of Vietnamese
Vietnamese is the national and official language of Vietnam [48]. It is the mother tongue of the Vietnamese people, who constitute 86% of Vietnam's population, and of about three million overseas Vietnamese. It is also spoken as a second language by some ethnic minorities of Vietnam. Many words in Vietnamese are borrowed from Chinese, and the language was originally written in a Chinese-like writing system. The current writing system of Vietnamese is a modification of the Latin alphabet, with additional diacritics for tones and certain letters.
3.1.1. Sound
a. Vowels
Like other Southeast Asian languages, Vietnamese has a comparatively large number of vowels. Below is a chart of the vowels in Vietnamese:
Table 3.1. Vowels in Vietnamese
The correspondence between the orthography and the pronunciation is rather complicated. For example, the vowel i is often written as y; both may represent [i], in which case the difference lies in the quality of the preceding vowel. For instance, "tai" (ear) is [tāi] while "tay" (hand/arm) is [tāj].
In addition to single vowels (or monophthongs), Vietnamese has diphthongs (âm đôi). Three diphthongs consist of a vowel plus "a": "ia", "ua", and "ưa" (when followed by a consonant, they become "iê", "uô", and "ươ", respectively). The other diphthongs consist of a vowel plus a semivowel. There are two such semivowels: y (written i or y) and w (written o or u). A majority of diphthongs in Vietnamese are formed this way.
Furthermore, these semivowels may also follow the first three diphthongs ("ia", "ua", "ưa"), resulting in triphthongs.
b. Tones
Vietnamese vowels are all pronounced with an inherent tone. Tones differ in pitch, length, contour melody, intensity, and glottalization (with or without accompanying constriction of the vocal cords).
Tone is indicated by diacritics written above or below the vowel (most of the tone diacritics appear above the vowel; however, the "nặng" tone dot diacritic goes below the vowel). The six tones in Vietnamese are:
Table 3.2. Tones in Vietnamese
c. Consonants
The consonants of the Hanoi variety are listed in Table 3.3 in the Vietnamese orthography, except for the bilabial approximant, which is written here as "w" (in the writing system it is written in the same way as the vowels "o" and "u").
Some consonant sounds are written with only one letter (like "p"), other consonant sounds are written with a two-letter digraph (like "ph"), and others are written with more than one letter or digraph (the velar stop is written variously as "c", "k", or "q").
Table 3.3. Consonants of the Hanoi variety
3.1.2. Syllable Structure
Syllables are elementary units that have a single pronunciation. In documents, they are usually delimited by white space. Despite being elementary units, Vietnamese syllables are not indivisible elements but have an internal structure. Table 3.4 depicts the general structure of a Vietnamese syllable:
Table 3.4. Structure of Vietnamese syllables

                         TONE MARK
    First consonant  |                   Rhyme
                     |  Secondary vowel | Main vowel | Last consonant
Generally, a Vietnamese syllable has up to five parts: a first consonant, a secondary vowel, a main vowel, a last consonant and a tone mark. For instance, the syllable "tuần" (week) has a tone mark (grave accent), a first consonant (t), a secondary vowel (u), a main vowel (â) and a last consonant (n). However, except for the main vowel, which is required in all syllables, the other parts may be absent. For example, the syllable "anh" (brother) has no tone mark, no secondary vowel and no first consonant. Another example is the syllable "hoa" (flower), which has a secondary vowel (o) but no last consonant.
3.1.3. Vietnamese Word
Vietnamese is often erroneously considered to be a "monosyllabic" language. It is true that Vietnamese has many words that consist of only one syllable; however, most words indeed contain more than one syllable.
Based on the way words are constructed from syllables, we can classify them into three classes: single words, complex words and reduplicative words. Each single word has only one syllable that carries a specific meaning, for example: "tôi" (I), "bạn" (you), "nhà" (house), etc. Words that involve more than one syllable are called "complex words". The syllables in complex words are combined based on semantic relationships, which are either coordinated ("bơi lội" – swim) or "principal and accessory" ("đường sắt" – railway). A word is considered a reduplicative word if its syllables have reduplicated phonic components (Table 3.4), for instance: "đùng đùng" (fully reduplicated), "lung linh" (first consonant reduplicated), etc. This type of word is usually used for scene or sound description, particularly in literary works.
3.2. Preprocessing and Transformation
Data preprocessing and transformation are necessary steps for any data mining process in general and for hidden topic mining in particular. After these steps, the data is clean, complete, reduced, partially free of noise, and ready to be mined. The main steps of our preprocessing and transformation are described in the subsequent sections and shown in the following chart:
Figure 3.1. Pipeline of Data Preprocessing and Transformation
3.2.1. Sentence Segmentation
Sentence segmentation is to determine whether a 'sentence delimiter' is really a sentence boundary. As in English, the sentence delimiters in Vietnamese are the full stop, the exclamation mark and the question mark (. ! ?). The exclamation mark and the question mark do not really pose problems. The critical element is the period: (1) the period can be a sentence-ending character (full stop); (2) it can denote an abbreviation; (3) it can be used in expressions such as URLs, email addresses, numbers, etc.; and (4) in some cases, a period can assume both functions (1) and (2) at the same time.
Given an input string, the result of this detector is a set of sentences, each on one line. This output is then passed to the sentence tokenization step.
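A rough, rule-based sketch of this period disambiguation is shown below; the abbreviation list and patterns are hypothetical and much simpler than a real sentence detector.

    # Rough rule-based sentence splitter; abbreviation list and patterns
    # are hypothetical and cover only the cases (1)-(4) described above.
    import re

    ABBREVIATIONS = {"TS.", "GS.", "TP."}   # hypothetical abbreviation list

    def split_sentences(text):
        sentences, current = [], []
        for token in text.split():
            current.append(token)
            if token.endswith(("!", "?")):
                sentences.append(" ".join(current)); current = []
            elif token.endswith("."):
                # Keep the period attached if it marks an abbreviation,
                # a number/date, an email address, or a URL.
                if token in ABBREVIATIONS:
                    continue
                if re.fullmatch(r"[\d.,]+\.", token):
                    continue
                if "@" in token or token.lower().startswith(("http", "www")):
                    continue
                sentences.append(" ".join(current)); current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    print(split_sentences("GS. Nam dạy toán. Ông sống ở TP. Hà Nội."))
    # -> ['GS. Nam dạy toán.', 'Ông sống ở TP. Hà Nội.']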
3.2.2. Sentence Tokenization
Sentence tokenization is the process of detaching punctuation marks from the words in a sentence. For example, we would like to detach "," from the word that precedes it.
3.2.3. Word Segmentation
As mentioned in Section 3.1, Vietnamese words are not always separated by white space, since a word can contain more than one syllable. This gives rise to the task of word segmentation, i.e. segmenting a sentence into a sequence of words. Vietnamese word segmentation is a prerequisite for any further processing and text mining. Though quite basic, it is not a trivial task because of the following ambiguities:
- Overlapping ambiguity: a string αβγ is called an overlapping ambiguity when both αβ and βγ are valid Vietnamese words. For example, in "học sinh học sinh học" (A student studies biology), both "học sinh" (student) and "sinh học" (biology) are found in the Vietnamese dictionary.
- Combination ambiguity: a string αβ is called a combination ambiguity when α, β and αβ are all possible choices. For instance, in "bàn là một dụng cụ" (A table is a tool), "bàn" (table), "bàn là" (iron) and "là" (is) are all found in the Vietnamese dictionary.
In this work, we use a Conditional Random Fields (CRF) approach to segment Vietnamese words [31]. The output of this step is a sequence of syllables joined to form words.
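To make the overlapping ambiguity concrete, the following dictionary-based longest-matching sketch shows how a greedy heuristic (not the CRF segmenter [31] used in this work) segments the example above; the dictionary is a tiny hypothetical sample.

    # Greedy longest-matching segmentation over a tiny hypothetical dictionary,
    # shown only to illustrate the overlapping ambiguity.
    DICTIONARY = {"học sinh", "sinh học", "học", "sinh",
                  "bàn", "bàn là", "là", "một", "dụng cụ"}

    def longest_matching(syllables, max_len=3):
        words, i = [], 0
        while i < len(syllables):
            # Take the longest span of syllables that forms a dictionary word.
            for n in range(min(max_len, len(syllables) - i), 0, -1):
                candidate = " ".join(syllables[i:i + n])
                if candidate in DICTIONARY or n == 1:
                    words.append(candidate)
                    i += n
                    break
        return words

    # Greedy matching picks "học sinh" twice and never recovers
    # "sinh học" (biology), illustrating the overlapping ambiguity.
    print(longest_matching("học sinh học sinh học".split()))
    # -> ['học sinh', 'học sinh', 'học']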
3.2.4. Filters
After word segmentation, tokens are separated by white space. Filters remove tokens that are trivial for the analysis process, i.e. tokens representing numbers, dates/times, and too-short tokens (shorter than 2 characters). Sentences that are too short, English sentences, and Vietnamese sentences written without tone marks (Vietnamese text is sometimes written without tones) should also be filtered out or otherwise handled in this phase.
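A minimal sketch of such token filters, with hypothetical patterns, might look like this:

    # Minimal token filters with hypothetical patterns.
    import re

    def keep_token(token):
        if len(token) < 2:                       # too-short tokens
            return False
        if re.fullmatch(r"[\d.,/:-]+", token):   # numbers, dates, times
            return False
        return True

    tokens = ["ngày", "25/12/2007", "3,5", "a", "giáo_dục"]
    print([t for t in tokens if keep_token(t)])  # -> ['ngày', 'giáo_dục']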
3.2.5. Remove Non Topic-Oriented Words
Non topic-oriented words are those we consider trivial for the topic analysis process. These words can introduce a lot of noise and negatively affect our analysis. Here, we treat functional words and words that are too rare or too common as non topic-oriented words. The following table gives more details about functional words in Vietnamese:
Table 3.5. Functional words in Vietnamese

Part of Speech (POS)        Examples
Classifier noun             cái, chiếc, con, bài, câu, cây, tờ, lá, việc
Major/minor conjunction     bởi chưng, bởi vậy, chẳng những, …
Combination conjunction     cho, cho nên, cơ mà, cùng, dẫu, dù, và
Introductory word           gì, hẳn, hết, …
Numeral                     nhiều, vô số, một, một số, …
Pronoun                     anh ấy, cô ấy, …
Adjunct                     sẽ, sắp sửa, suýt, …
3.3. Topic Analysis for VnExpress Dataset
We collected a large dataset from VnExpress [47] using Nutch [36] and then performed preprocessing and transformation. The topics assigned by humans and other statistics of the dataset are shown in the tables below:
Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset
Society: Education, Entrance Exams, Life of Youths …
International: Analysis, Files, Lifestyle …
Business: Business man, Stock, Integration …
Culture: Music, Fashion, Stage – Cinema …
Sport: Football, Tennis
Life: Family, Health …
Science: New Techniques, Natural Life, Psychology
And Others …
Note that the information about topics assigned by humans is listed here only for reference and is not used in the topic analysis process. After data preprocessing and transformation, we get about 53MB of data (40,268 documents, 257,533 words; vocabulary size of 128,768). This data is fed into GibbsLDA++ [38], a tool for Latent Dirichlet Allocation using Gibbs sampling (see Section 1.3). The results of topic analysis with K = 100 topics are shown in Table 3.8.
Table 3.7. Statistics of the VnExpress dataset
After removing HTML and doing sentence and word segmentation:
size ≈ 219MB, number of docs = 40,328
After filtering and removing non topic-oriented words:
size ≈ 53MB, number of docs = 40,268
number of words = 5,512,251; vocabulary size = 128,768
Table 3.8. Most likely words for sample topics (topic analysis with K = 100 topics)

Topic 1:
  Tòa (Court) 0.0192
  Điều tra (Investigate) 0.0180
  Luật sư (Lawyer) 0.0162
  Tội (Crime) 0.0142
  Tòa án (Court) 0.0108
  Kiện (Lawsuits) 0.0092
  Buộc tội (Accuse) 0.0076
  Xét xử (Judge) 0.0076
  Bị cáo (Accused) 0.0065
  Phán quyết (Sentence) 0.0060
  Bằng chứng (Evidence) 0.0046
  Thẩm phán (Judge) 0.0050

Topic 3:
  Trường (School) 0.0660
  Lớp (Class) 0.0562
  Học sinh (Pupil) 0.0471
  Giáo dục (Education) 0.0192
  Dạy (Teach) 0.0183
  Giáo viên (Teacher) 0.0179
  Môn (Subject) 0.0080
  Tiểu học (Primary school) 0.0070
  Hiệu trưởng (Rector) 0.0067
  Trung học (High school) 0.0064
  Tốt nghiệp (Graduation) 0.0063
  Năm học (Academic year) 0.0062

Topic 7:
  Game 0.0869
  Trò chơi (Game) 0.0386
  Người chơi (Gamer) 0.0211
  Nhân vật (Characters) 0.0118
  Online 0.0082
  Giải trí (Entertainment) 0.0074
  Trực tuyến (Online) 0.0063
  Phát hành (Release) 0.0055
  Điều khiển (Control) 0.0052
  Nhiệm vụ (Mission) 0.0041
  Chiến đấu (Fight) 0.0038
  Phiên bản (Version) 0.0038

Topic 9:
  Du lịch (Tourism) 0.0542
  Khách (Passengers) 0.0314
  Khách sạn (Hotel) 0.0276
  Du khách (Tourists) 0.0239
  Tour 0.0117
  Tham quan (Visit) 0.0097
  Biển (Sea) 0.0075
  Chuyến đi (Journey) 0.0050
  Giải trí (Entertainment) 0.0044
  Khám phá (Discovery) 0.0044
  Lữ hành (Travel) 0.0039
  Điểm đến (Destination) 0.0034

Topic 14:
  Thời trang (Fashion) 0.0482
  Người mẫu (Models) 0.0407
  Mặc (Wear) 0.0326
  Mẫu (Sample) 0.0305
  Trang phục (Clothing) 0.0254
  Đẹp (Nice) 0.0249
  Thiết kế (Design) 0.0229
  Sưu tập (Collection) 0.0108
  Váy (Skirt) 0.0105
  Quần áo (Clothes) 0.0092
  Phong cách (Styles) 0.0089
  Trình diễn (Perform) 0.0051

Topic 15:
  Bóng đá (Football) 0.0285
  Đội (Team) 0.0273
  Cầu thủ (Football players) 0.0241
  HLV (Coach) 0.0201
  Thi đấu (Compete) 0.0197
  Thể thao (Sports) 0.0176
  Đội tuyển (Team) 0.0139
  CLB (Club) 0.0138
  Vô địch (Championship) 0.0089
  Mùa (Season) 0.0063
  Liên đoàn (Federation) 0.0056
  Tập huấn (Training) 0.0042
3.4. Topic Analysis for Vietnamese Wikipedia Dataset
The second dataset was collected from Vietnamese Wikipedia and contains D = 29,043 documents. We preprocessed this dataset in the same way as described in Section 3.2. This led to a vocabulary size of V = 63,150 and a total of 4,784,036 word tokens. In the hidden topic mining phase, the number of topics K was fixed at 200, and the hyper-parameters α and β were set to 0.25 and 0.1, respectively.
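For illustration only, the same hyper-parameter settings can be expressed with scikit-learn's LDA implementation; note that scikit-learn uses variational inference rather than the collapsed Gibbs sampling of GibbsLDA++, and the toy documents below are hypothetical stand-ins for the preprocessed Wikipedia collection.

    # Illustration only: K = 200, alpha = 0.25, beta = 0.1 expressed with
    # scikit-learn's LDA (variational inference, not GibbsLDA++'s sampler).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["hải quân hạm đội tàu chiến",          # toy documents; the real
            "tổng thống lãnh đạo chính quyền"]     # corpus has 17,428 docs
    X = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(
        n_components=200,        # K: number of hidden topics
        doc_topic_prior=0.25,    # alpha
        topic_word_prior=0.1,    # beta
        max_iter=100,
        random_state=0,
    ).fit(X)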
Table 3.9. Statistics of the Vietnamese Wikipedia dataset
After removing HTML and doing sentence and word segmentation:
size ≈ 270MB, number of docs = 29,043
After filtering and removing non topic-oriented words:
size ≈ 48MB, number of docs = 17,428
number of words = 4,784,036; vocabulary size = 63,150
Table 3.10. Most likely words for sample topics (topic analysis with K = 200 topics)

Topic 2:
  Tàu (Ship) 0.0527
  Hải quân (Navy) 0.0437
  Hạm đội (Fleet) 0.0201
  Thuyền (Ship) 0.0100
  Đô đốc (Admiral) 0.0097
  Tàu chiến (Warship) 0.0092
  Cảng (Harbour) 0.0086
  Tấn công (Attack) 0.0081
  Lục chiến (Marine) 0.0075
  Thủy quân (Seaman) 0.0067
  Căn cứ (Army Base) 0.0066
  Chiến hạm (Gunboat) 0.0058

Topic 5:
  Độc lập (Independence) 0.0095
  Lãnh đạo (Lead) 0.0088
  Tổng thống (President) 0.0084
  Đất nước (Country) 0.0070
  Quyền lực (Power) 0.0069
  Dân chủ (Democratic) 0.0068
  Chính quyền (Government) 0.0067
  Ủng hộ (Support) 0.0065
  Chế độ (System) 0.0063
  Kiểm soát (Control) 0.0058
  Lãnh thổ (T