Table of Contents
Introduction
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis
1.1. Introduction
1.2. The Early Methods
1.2.1. Latent Semantic Analysis
1.2.2. Probabilistic Latent Semantic Analysis
1.3. Latent Dirichlet Allocation
1.3.1. Generative Model in LDA
1.3.2. Likelihood
1.3.3. Parameter Estimation and Inference via Gibbs Sampling
1.3.4. Applications
1.4. Summary
Chapter 2. Frameworks of Learning with Hidden Topics
2.1. Learning with External Resources: Related Works
2.2. General Learning Frameworks
2.2.1. Frameworks for Learning with Hidden Topics
2.2.2. Large-Scale Web Collections as Universal Dataset
2.3. Advantages of the Frameworks
2.4. Summary
Chapter 3. Topics Analysis of Large-Scale Web Dataset
3.1. Some Characteristics of Vietnamese
3.1.1. Sound
3.1.2. Syllable Structure
3.1.3. Vietnamese Word
3.2. Preprocessing and Transformation
3.2.1. Sentence Segmentation
3.2.2. Sentence Tokenization
3.2.3. Word Segmentation
3.2.4. Filters
3.2.5. Remove Non Topic-Oriented Words
3.3. Topic Analysis for VnExpress Dataset
3.4. Topic Analysis for Vietnamese Wikipedia Dataset
3.5. Discussion
3.6. Summary
Chapter 4. Deployments of General Frameworks
4.1. Classification with Hidden Topics
4.1.1. Classification Method
4.1.2. Experiments
4.2. Clustering with Hidden Topics
4.2.1. Clustering Method
4.2.2. Experiments
4.3. Summary
Conclusion
Achievements throughout the thesis
Future Works
References
Vietnamese References
English References
Appendix: Some Clustering Results
external resources, these methods can be roughly classified into two categories: those that make use of unlabeled data, and those that exploit structured or semi-structured data.
The first category is commonly referred to as semi-supervised learning. The
key argument is that unlabeled examples are significantly easier to collect than labeled
ones. One example of this is web-page classification. Suppose that we want a program to
electronically visit some web site and download all the web pages of interest to us, such
as all the Computer Science faculty pages, or all the course home pages at some
university. To train such a system to automatically classify web pages, one would
typically rely on hand labeled web pages. Unfortunately, these labeled examples are fairly
expensive to obtain because they require human effort. In contrast, the web has hundreds
of millions of unlabeled web pages that can be inexpensively gathered using a web
crawler. Therefore, we would like the learning algorithms to be able to take as much
advantage of the unlabeled data as possible.
Semi-supervised learning has received a lot of attention in the last decade. Yarowsky (1995) uses self-training for word sense disambiguation, e.g. deciding whether the word “plant” means a living organism or a factory in a given context. Rosenberg et al. (2005) apply it to object detection in images and show that the semi-supervised technique compares favorably with a state-of-the-art detector. In 2000, Nigam and Ghani [30] performed extensive empirical experiments to compare co-training with generative mixture models and Expectation Maximization (EM). Jones (2005) used co-training, co-EM, and other related methods for information extraction from text. In addition, many works have applied Transductive Support Vector Machines (TSVMs), which use unlabeled data to determine the optimal decision boundary.
The second category covers works that exploit resources like Wikipedia to support the learning process. Gabrilovich et al. (2007) [16] demonstrated the value of using Wikipedia as an additional source of features for text classification and for determining the semantic relatedness between texts. Banerjee et al. (2007) [3] also extract titles of Wikipedia articles and use them as features for clustering short texts. Unfortunately, this approach is not very flexible in the sense that it depends heavily on the external resource and the application.
This chapter describes frameworks for learning with the support of a topic model estimated from a large universal dataset. The topic model can be considered background knowledge for the application domain. It helps the learning process capture the hidden topics of the domain, as well as the relationships between topics and words and among words themselves, thus partially overcoming the problem of different word choices in text.
2.2. General Learning Frameworks
This section presents general frameworks for learning with the support of hidden topics.
The main motivation is to gain benefit from huge sources of online data in order to enhance the quality of text/Web clustering and classification. Unlike previous studies of learning with external resources, we approach this issue from the perspective of text/Web data analysis, building on recently successful latent topic analysis models such as LSA, pLSA, and LDA. The underlying idea of the frameworks is that, for each
learning task, we collect a very large external data collection called “universal dataset”,
and then build a learner on both the learning data and a rich set of hidden topics
discovered from that data collection.
2.2.1. Frameworks for Learning with Hidden Topics
Corresponding to two typical learning problems, i.e. classification and clustering, we
describe two frameworks with some differences in the architectures.
a. Framework for Classification
Figure 2.1. Classification with Hidden Topics
Nowadays, the continuous development of the Internet has created a huge number of documents that are difficult to manage, organize, and navigate. As a result, the task of automatic classification, which is to categorize textual documents into two or more predefined classes, has received a lot of attention.
Several machine-learning methods have been applied to text classification, including decision trees, neural networks, support vector machines, etc. In typical applications of machine-learning methods, the training data is passed to a learning phase. The result of the learning step is a classifier capable of categorizing new documents. However, when the training data is not as plentiful as expected or the data to be classified is too sparse [52], learning from the training data alone cannot provide a satisfactory classifier. Inspired by this fact, we propose a framework that enables us to enrich both the training data and newly arriving data with hidden topics discovered from a large available dataset, so as to enhance the performance of text classification.
Classification with hidden topics is described in Figure 2.1. We first collect a very large external data collection called the “universal dataset”. Next, a topic analysis technique such as pLSA or LDA is applied to this dataset. The result of this step is an estimated topic model, which consists of hidden topics and the probability distributions of words over these topics. Using this model, we can do topic inference for both the training data and new data. For each document, the output of topic inference is a probability distribution over the hidden topics (the topics analyzed in the estimation phase) given the document. The topic distributions of the training documents are then combined with the training data itself for learning the classifier. In a similar way, new documents to be classified are combined with their topic distributions to create the so-called “new data with hidden topics” before being passed to the learned classifier.
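To make this data flow concrete, below is a minimal sketch of the framework in Python. It assumes gensim for LDA estimation/inference and scikit-learn for the classifier; the concrete tools used in this thesis are described in Chapters 3 and 4, so the library choices, function names, and parameter values here are illustrative assumptions only.

```python
# Minimal sketch of "classification with hidden topics" (Figure 2.1).
# Assumptions: gensim for LDA estimation/inference, scikit-learn for the
# classifier; documents are already preprocessed into lists of tokens.
import numpy as np
import scipy.sparse as sp
from gensim import corpora, models
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def estimate_topic_model(universal_docs, num_topics=100):
    """Estimate an LDA topic model on the universal dataset."""
    dictionary = corpora.Dictionary(universal_docs)
    corpus = [dictionary.doc2bow(doc) for doc in universal_docs]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
    return dictionary, lda

def topic_features(docs, dictionary, lda):
    """Topic inference: P(topic | doc) for every document, as a dense matrix."""
    feats = np.zeros((len(docs), lda.num_topics))
    for i, doc in enumerate(docs):
        bow = dictionary.doc2bow(doc)
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            feats[i, topic_id] = prob
    return feats

def classify_with_hidden_topics(universal_docs, train_docs, train_labels, new_docs):
    dictionary, lda = estimate_topic_model(universal_docs)
    # Word-level features; the analyzer simply passes through pre-tokenized docs.
    tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
    x_words_train = tfidf.fit_transform(train_docs)
    x_words_new = tfidf.transform(new_docs)
    # "Data with hidden topics": word features concatenated with topic distributions.
    x_train = sp.hstack([x_words_train,
                         sp.csr_matrix(topic_features(train_docs, dictionary, lda))],
                        format="csr")
    x_new = sp.hstack([x_words_new,
                       sp.csr_matrix(topic_features(new_docs, dictionary, lda))],
                      format="csr")
    classifier = LinearSVC().fit(x_train, train_labels)
    return classifier.predict(x_new)
```

In practice, the number of topics and the way word and topic features are combined would of course be tuned on held-out data.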
b. Framework for Clustering
Figure 2.2. Clustering with Hidden Topics
Text clustering aims to automatically generate groups (clusters) of documents based on the similarity or distance among them. Unlike classification, the clusters are not known in advance; the user can optionally specify the desired number of clusters. The documents are then organized into clusters, each of which contains “close” documents.
Clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find
successive clusters using previously established clusters, whereas partitional algorithms
determine all clusters at once. Hierarchical algorithms can be agglomerative (“bottom-
up”) or divisive (“top-down”). Agglomerative algorithms begin with each element as a
separate cluster and merge them into larger ones. Divisive algorithms begin with the whole set and divide it into smaller clusters.
The distance measure, which determines how the similarity of two documents is calculated, is key to the success of any text clustering algorithm. Some documents may be close to one another according to one distance and farther away according to another. Common distance functions are the Euclidean distance, the Manhattan distance (also called the taxicab or 1-norm distance), and the maximum norm, to name but a few.
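For concreteness, here is a small sketch of these three distances on two document vectors; any representation (term weights, topic proportions, etc.) can stand in for x and y.

```python
import numpy as np

x = np.array([0.2, 0.5, 0.3])   # two documents represented as vectors
y = np.array([0.1, 0.7, 0.2])

euclidean = np.linalg.norm(x - y)       # 2-norm
manhattan = np.sum(np.abs(x - y))       # 1-norm (taxicab)
maximum   = np.max(np.abs(x - y))       # maximum norm (Chebyshev)
```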
Web clustering, a type of text clustering specific to web pages, can be offline or online. Offline clustering clusters the whole collection of available web documents and has no constraint on response time. In online clustering, the algorithms need to meet a “real-time” condition, i.e. the system needs to perform clustering as fast as possible. For example, the algorithm should take document snippets instead of whole documents as input, since downloading the original documents is time-consuming. The question here is how to enhance the quality of clustering for such document snippets in online web clustering. Inspired by the fact that snippets are only small pieces of text (and thus poor in content), we propose a framework that enriches them with hidden topics before clustering (Figure 2.2). The framework and its topic analysis step are similar to those for classification; the differences are only due to the essential differences between classification and clustering.
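Reusing the topic model and the topic_features helper from the classification sketch above, the snippet-enrichment step of Figure 2.2 can be outlined as follows. KMeans is used here only as a stand-in for whatever clustering algorithm is applied (the clustering method actually used is described in Chapter 4), so this is an illustrative sketch rather than the implementation.

```python
# Illustrative sketch of Figure 2.2: enrich search snippets with inferred
# topic distributions, then run a standard clustering algorithm on them.
from sklearn.cluster import KMeans

def cluster_snippets(snippets, dictionary, lda, num_clusters=10):
    """snippets: list of token lists, preprocessed like the universal dataset."""
    # Topic inference turns each short, content-poor snippet into a
    # K-dimensional topic vector (its "hidden topic" enrichment).
    topic_vectors = topic_features(snippets, dictionary, lda)
    # Cluster the enriched representation; term features could be concatenated
    # here as well, exactly as in the classification sketch.
    return KMeans(n_clusters=num_clusters).fit_predict(topic_vectors)
```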
2.2.2. Large-Scale Web Collections as Universal Dataset
Despite the obvious differences between the two learning frameworks, there is a key phase shared between them: the phase of analyzing topics in the previously collected dataset. Here are some important considerations for this phase:
- The degree of coverage of the dataset: the universal dataset should be large enough to cover the topics underlying the application domain.
- Preprocessing: this step is very important for obtaining good analysis results. Although there is no general recipe for all languages, the common advice is to remove as many noise words as possible, such as functional words, stop words, and too frequent or too rare words.
- Methods for topic analysis: some applicable analysis methods were mentioned in Chapter 1. The tradeoff between the quality of topic analysis and its time complexity should be taken into account; for example, topic inference for snippets in online clustering should be as fast as possible to meet the “real-time” condition.
2.3. Advantages of the Frameworks
- The general frameworks are flexible and general enough to apply to any domain or language. Once we have estimated a topic model from a universal dataset, its hidden topics can be reused for several learning tasks in the same domain.
- They are particularly useful for mining sparse data. Sparse data such as snippets returned from a search engine can be enriched with hidden topics, and enhanced performance can thus be achieved.
- Since learning is performed on smaller data, the presented methods require fewer computational resources than semi-supervised learning.
- Thanks to the generative model for analyzing topics of new documents (in the case of LDA), we have a natural way to map documents from the term space into the topic space. This is a real advantage over the heuristic-based mappings of previous approaches [16][3][10].
2.4. Summary
This chapter described two general frameworks for learning with hidden topics, one for classification and one for clustering, together with their advantages. The main advantages of our frameworks are that they are flexible and general enough to apply to any domain or language, and that they are able to deal with sparse data. The key phase common to the two frameworks is topic analysis of a large-scale web collection called the “universal dataset”. The quality of the topic model estimated from this data strongly influences the performance of learning in the later phases.
Chapter 3. Topics Analysis of Large-Scale Web Dataset
As mentioned earlier, topic analysis of a universal dataset is key to the success of our proposed methods. Thus, toward Vietnamese text mining, this chapter considers the problem of topic analysis for large-scale Vietnamese web datasets.
3.1. Some Characteristics of Vietnamese
Vietnamese is the national and official language of Vietnam [48]. It is the mother tongue of the Vietnamese people, who constitute 86% of Vietnam’s population, and of about three million overseas Vietnamese. It is also spoken as a second language by some ethnic minorities of Vietnam. Many Vietnamese words are borrowed from Chinese, and the language was originally written in a Chinese-like writing system. The current writing system of Vietnamese is a modification of the Latin alphabet, with additional diacritics for tones and certain letters.
3.1.1. Sound
a. Vowels
Like other Southeast Asian languages, Vietnamese has a comparatively large number of
vowels. Below is a chart of the vowels in Vietnamese:
Table 3.1. Vowels in Vietnamese
The correspondence between the orthography and pronunciation is rather complicated.
For example, the vowel i is often written as y; both may represent [i], in which case the
difference is in the quality of the preceding vowel. For instance, “tai” (ear) is [tāi] while
“tay” (hand/arm) is [tāj].
In addition to single vowels (monophthongs), Vietnamese has diphthongs (âm đôi). Three diphthongs consist of a vowel plus “a”: these are “ia”, “ua”, and “ưa” (when followed by a consonant, they become “iê”, “uô”, and “ươ”, respectively). The other diphthongs consist of a vowel plus a semivowel. There are two such semivowels: /j/ (written i or y) and /w/ (written o or u), and a majority of Vietnamese diphthongs are formed this way. Furthermore, these semivowels may also follow the first three diphthongs (“ia”, “ua”, “ưa”), resulting in triphthongs.
b. Tones
Vietnamese vowels are all pronounced with an inherent tone. Tones differ in pitch, length, contour (melody), intensity, and glottalization (with or without accompanying constricted vocal cords).
Tone is indicated by diacritics written above or below the vowel (most of the tone
diacritics appear above the vowel; however, the “nặng” tone dot diacritic goes below the
vowel). The six tones in Vietnamese are:
Table 3.2. Tones in Vietnamese
c. Consonants
The consonants of the Hanoi variety are listed in Table 3.3 using the Vietnamese orthography, except for the bilabial approximant, which is written here as “w” (in the writing system it is written the same way as the vowels “o” and “u”).
Some consonant sounds are written with only one letter (like “p”), other consonant sounds
are written with a two-letter digraph (like “ph”), and others are written with more than
one letter or digraph (the velar stop is written variously as “c”, “k”, or “q”).
Table 3.3. Consonants of the Hanoi variety
3.1.2. Syllable Structure
Syllables are elementary units that have one way of pronunciation. In documents, they are
usually delimited by white-space. Despite being elementary units, Vietnamese syllables are not indivisible elements but have an internal structure. Table 3.4 depicts the general structure of a Vietnamese syllable:
Table 3.4. Structure of Vietnamese syllables
TONE MARK
Rhyme
First
Consonant Secondary
Consonant
Main
Vowel
Last
Consonant
In the general case, a Vietnamese syllable has all five parts: first consonant, secondary vowel, main vowel, last consonant, and a tone mark. For instance, the syllable “tuần” (week) has a tone mark (grave accent), a first consonant (t), a secondary vowel (u), a main vowel (â), and a last consonant (n). However, except for the main vowel, which is required in all syllables, the other parts may be absent. For example, the syllable “anh” (brother) has no tone mark, no secondary vowel, and no first consonant. Another example is the syllable “hoa” (flower), which has a secondary vowel (o) but no last consonant.
3.1.3. Vietnamese Word
Vietnamese is often erroneously considered to be a "monosyllabic" language. It is true
that Vietnamese has many words that consist of only one syllable; however, most words
indeed contain more than one syllable.
Based on the way words are constructed from syllables, we can classify them into three classes: single words, complex words, and reduplicative words. Each single word has only one syllable that carries a specific meaning, for example “tôi” (I), “bạn” (you), “nhà” (house), etc. Words that consist of more than one syllable are called complex words. The syllables in complex words are combined based on semantic relationships, which are either coordinated (“bơi lội” – to swim) or of the principal-and-accessory type (“đường sắt” – railway). A word is considered a reduplicative word if its syllables have reduplicated phonic components (Table 3.4), for instance “đùng đùng” (fully reduplicative) or “lung linh” (first consonant reduplicated). This type of word is usually used for scene or sound description, particularly in literary texts.
3.2. Preprocessing and Transformation
Data preprocessing and transformation are necessary steps for any data mining process in general and for hidden topic mining in particular. After these steps, the data is clean, complete, reduced, partially free of noise, and ready to be mined. The main steps of our preprocessing and transformation are described in the subsequent sections and shown in the following chart:
Figure 3.1. Pipeline of Data Preprocessing and Transformation
3.2.1. Sentence Segmentation
Sentence segmentation determines whether a “sentence delimiter” really marks a sentence boundary. As in English, the sentence delimiters in Vietnamese are the full stop, the exclamation mark, and the question mark (. ! ?). The exclamation mark and the question mark do not really pose problems. The critical element is the period: (1) the period can be a sentence-ending character (full stop); (2) the period can denote an abbreviation; (3) the period can be used in expressions such as URLs, e-mail addresses, numbers, etc.; (4) in some cases, a period can assume both functions (1) and (2) at once. Given an input string, the result of this detector is a list of sentences, each on one line. This output is then passed to the sentence tokenization step.
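A minimal rule-based sketch of such a detector is shown below; the abbreviation list and regular expressions are illustrative assumptions, not the exact rules used in this work.

```python
import re

# Illustrative (incomplete) abbreviation list; a real detector would use a larger one.
ABBREVIATIONS = {"tp.", "ts.", "gs.", "mr.", "dr."}
PROTECTED = re.compile(r"(https?://\S+|www\.\S+|\S+@\S+|\d+(?:[.,/]\d+)+)")

def segment_sentences(text):
    """Split on '.', '!', '?' unless the period belongs to an abbreviation,
    a URL, an e-mail address, or a number/date expression."""
    masked = PROTECTED.sub(lambda m: m.group(0).replace(".", "<DOT>"), text)
    sentences, current = [], []
    for token in masked.split():
        current.append(token)
        if token[-1] in "!?" or (token.endswith(".")
                                 and token.lower() not in ABBREVIATIONS):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return [s.replace("<DOT>", ".") for s in sentences]
```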
3.2.2. Sentence Tokenization
Sentence tokenization is the process of detaching punctuation marks from the words in a sentence. For example, we would like to detach “,” from the word that precedes it.
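A one-line sketch of this step (the set of marks handled here is an illustrative assumption):

```python
import re

def tokenize_sentence(sentence):
    """Detach punctuation such as , . ; : ! ? ( ) " from the adjacent words."""
    return re.sub(r'([,\.;:!\?\(\)"])', r' \1 ', sentence).split()

# tokenize_sentence('Hà Nội, Việt Nam') -> ['Hà', 'Nội', ',', 'Việt', 'Nam']
```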
3.2.3. Word Segmentation
As mentioned in Section 3.1, Vietnamese words are not always delimited by white space, since a word can contain more than one syllable. This gives rise to the task of word segmentation, i.e. segmenting a sentence into a sequence of words. Vietnamese word segmentation is a prerequisite for any further processing and text mining. Though quite basic, it is not a trivial task because of the following ambiguities:
- Overlapping ambiguity: a string αβγ contains an overlapping ambiguity when both αβ and βγ are valid Vietnamese words. For example, in “học sinh học sinh học” (the student studies biology), both “học sinh” (student) and “sinh học” (biology) are found in the Vietnamese dictionary.
- Combination ambiguity: a string αβ contains a combination ambiguity when α, β, and αβ are all valid Vietnamese words, so that both (αβ) and (α)(β) are possible segmentations. For instance, in “bàn là một dụng cụ” (a table is a tool), “bàn” (table), “là” (is), and “bàn là” (iron) are all found in the Vietnamese dictionary.
In this work, we used a Conditional Random Fields (CRF) approach to segment Vietnamese words [31]. The output of this step is a sequence of syllables joined to form words.
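The CRF segmenter itself is beyond the scope of this sketch, but its output can be viewed as a B/I label per syllable (B: the syllable begins a word, I: it continues the previous word); joining labeled syllables into underscore-connected words is then a simple post-processing step. The label scheme below is a common convention assumed here for illustration.

```python
def merge_syllables(syllables, labels):
    """Join syllables into words according to B/I labels, e.g.
    ['học', 'sinh', 'học', 'sinh', 'học'] with ['B', 'I', 'B', 'B', 'I']
    -> ['học_sinh', 'học', 'sinh_học']  (student / studies / biology)."""
    words = []
    for syllable, label in zip(syllables, labels):
        if label == "B" or not words:
            words.append(syllable)
        else:
            words[-1] += "_" + syllable
    return words
```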
3.2.4. Filters
After word segmentation, tokens are separated by white space. Filters remove tokens that are trivial for the analysis process, i.e. number and date/time tokens and too-short tokens (fewer than 2 characters). Too-short sentences, English sentences, and Vietnamese sentences without tones (Vietnamese is sometimes written without tone marks) should also be filtered out or handled in this phase.
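A compact sketch of these filters at the token level (the patterns and the length threshold are illustrative assumptions):

```python
import re

NUMBER_OR_DATE = re.compile(r"^\d+(?:[.,/:]\d+)*$")   # e.g. 2008, 3.14, 20/11/2008

def filter_tokens(tokens):
    """Drop number/date tokens and tokens shorter than 2 characters."""
    return [t for t in tokens
            if len(t) >= 2 and not NUMBER_OR_DATE.match(t)]
```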
3.2.5. Remove Non Topic-Oriented Words
Non-topic-oriented words are those we consider trivial for the topic analysis process. These words can introduce noise and negatively affect our analysis. Here, we treat functional words and too rare or too common words as non-topic-oriented words. See the following table for more details about functional words in Vietnamese:
Table 3.5. Functional words in Vietnamese

Part of speech (POS): Examples
Classifier noun: cái, chiếc, con, bài, câu, cây, tờ, lá, việc
Major/minor conjunction: bởi chưng, bởi vậy, chẳng những, …
Combination conjunction: cho, cho nên, cơ mà, cùng, dẫu, dù, và
Introductory word: gì, hẳn, hết, …
Numeral: nhiều, vô số, một, một số, …
Pronoun: anh ấy, cô ấy, …
Adjunct: sẽ, sắp sửa, suýt, …
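In practice, functional words can be removed with a stop-word list such as the one above, while “too rare or too common” words can be removed with document-frequency thresholds, roughly as follows (the threshold values are illustrative assumptions):

```python
from collections import Counter

def remove_non_topic_words(docs, stopwords, min_df=5, max_df_ratio=0.5):
    """docs: list of token lists. Remove stop words and words whose document
    frequency is below min_df or above max_df_ratio * number of documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    max_df = max_df_ratio * len(docs)
    keep = {word for word, count in df.items()
            if min_df <= count <= max_df and word not in stopwords}
    return [[w for w in doc if w in keep] for doc in docs]
```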
3.3. Topic Analysis for VnExpress Dataset
We collected a large dataset from VnExpress [47] using Nutch [36] and then performed preprocessing and transformation. The statistics of the topics assigned by humans and other parameters of the dataset are shown in the tables below:
Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset
Society: Education, Entrance Exams, Life of Youths …
International: Analysis, Files, Lifestyle …
Business: Business man, Stock, Integration …
Culture: Music, Fashion, Stage – Cinema …
Sport: Football, Tennis
Life: Family, Health …
Science: New Techniques, Natural Life, Psychology
And Others …
Note that the information about topics assigned by humans is listed here only for reference and is not used in the topic analysis process. After data preprocessing and transformation, we obtain about 53 MB of data (40,268 documents, 257,533 words; vocabulary size of 128,768). This data is put into GibbsLDA++ [38], a tool for Latent Dirichlet Allocation using Gibbs sampling (see Section 1.3). The results of topic analysis with K = 100 topics are shown in Table 3.8.
Table 3.7. Statistics of VnExpress dataset
After removing html, doing sentence and word segmentation:
size ≈219M, number of docs = 40,328
After filtering and removing non-topic oriented words:
size ≈53M, number of docs = 40,268
number of words = 5,512,251; vocabulary size = 128,768
Table 3.8. Most likely words for sample topics (topic analysis with 100 topics)

Topic 1: Tòa (Court) 0.0192; Điều tra (Investigate) 0.0180; Luật sư (Lawyer) 0.0162; Tội (Crime) 0.0142; Tòa án (Court) 0.0108; Kiện (Lawsuits) 0.0092; Buộc tội (Accuse) 0.0076; Xét xử (Judge) 0.0076; Bị cáo (Accused) 0.0065; Phán quyết (Sentence) 0.0060; Bằng chứng (Evidence) 0.0046; Thẩm phán (Judge) 0.0050

Topic 3: Trường (School) 0.0660; Lớp (Class) 0.0562; Học sinh (Pupil) 0.0471; Giáo dục (Education) 0.0192; Dạy (Teach) 0.0183; Giáo viên (Teacher) 0.0179; Môn (Subject) 0.0080; Tiểu học (Primary school) 0.0070; Hiệu trưởng (Rector) 0.0067; Trung học (High school) 0.0064; Tốt nghiệp (Graduation) 0.0063; Năm học (Academic year) 0.0062

Topic 7: Game 0.0869; Trò chơi (Game) 0.0386; Người chơi (Gamer) 0.0211; Nhân vật (Characters) 0.0118; Online 0.0082; Giải trí (Entertainment) 0.0074; Trực tuyến (Online) 0.0063; Phát hành (Release) 0.0055; Điều khiển (Control) 0.0052; Nhiệm vụ (Mission) 0.0041; Chiến đấu (Fight) 0.0038; Phiên bản (Version) 0.0038

Topic 9: Du lịch (Tourism) 0.0542; Khách (Passengers) 0.0314; Khách sạn (Hotel) 0.0276; Du khách (Tourists) 0.0239; Tour 0.0117; Tham quan (Visit) 0.0097; Biển (Sea) 0.0075; Chuyến đi (Journey) 0.0050; Giải trí (Entertainment) 0.0044; Khám phá (Discovery) 0.0044; Lữ hành (Travel) 0.0039; Điểm đến (Destination) 0.0034

Topic 14: Thời trang (Fashion) 0.0482; Người mẫu (Models) 0.0407; Mặc (Wear) 0.0326; Mẫu (Sample) 0.0305; Trang phục (Clothing) 0.0254; Đẹp (Nice) 0.0249; Thiết kế (Design) 0.0229; Sưu tập (Collection) 0.0108; Váy (Skirt) 0.0105; Quần áo (Clothes) 0.0092; Phong cách (Styles) 0.0089; Trình diễn (Perform) 0.0051

Topic 15: Bóng đá (Football) 0.0285; Đội (Team) 0.0273; Cầu thủ (Football players) 0.0241; HLV (Coach) 0.0201; Thi đấu (Compete) 0.0197; Thể thao (Sports) 0.0176; Đội tuyển (Team) 0.0139; CLB (Club) 0.0138; Vô địch (Championship) 0.0089; Mùa (Season) 0.0063; Liên đoàn (Federal) 0.0056; Tập huấn (Training) 0.0042
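For reference, preparing a corpus for GibbsLDA++ can be sketched as follows. As far as we recall from the GibbsLDA++ manual, the input file starts with the number of documents, followed by one document per line with words separated by spaces; the command shown in the comment and its flag values are indicative only and should be checked against the tool's documentation.

```python
def write_gibbslda_input(docs, path):
    """docs: list of token lists after the preprocessing of Section 3.2.
    Assumed GibbsLDA++ format: first line = number of documents, then one
    document per line with space-separated words."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(f"{len(docs)}\n")
        for doc in docs:
            out.write(" ".join(doc) + "\n")

# Estimation is then run from the command line, roughly:
#   lda -est -alpha 0.5 -beta 0.1 -ntopics 100 -niters 1000 \
#       -savestep 200 -twords 20 -dfile vnexpress.dat
# (flag names and values as recalled from the manual; verify before use)
```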
3.4. Topic Analysis for Vietnamese Wikipedia Dataset
The second dataset was collected from the Vietnamese Wikipedia and contains D = 29,043 documents. We preprocessed this dataset in the same way as described in Section 3.2. This led to a vocabulary size of V = 63,150 and a total of 4,784,036 word tokens. In the hidden topic mining phase, the number of topics K was fixed at 200, and the hyperparameters α and β were set to 0.25 and 0.1, respectively.
Table 3.9. Statistics of the Vietnamese Wikipedia dataset
After removing html, doing sentence and word segmentation:
size ≈270M, number of docs = 29,043
After filtering and removing non-topic oriented words:
size ≈48M, number of docs = 17,428
number of words = 4,784,036; vocabulary size = 63,150
Table 3.10. Most likely words for sample topics (topic analysis with 200 topics)

Topic 2: Tàu (Ship) 0.0527; Hải quân (Navy) 0.0437; Hạm đội (Fleet) 0.0201; Thuyền (Ship) 0.0100; Đô đốc (Admiral) 0.0097; Tàu chiến (Warship) 0.0092; Cảng (Harbour) 0.0086; Tấn công (Attack) 0.0081; Lục chiến (Marine) 0.0075; Thủy quân (Seaman) 0.0067; Căn cứ (Army Base) 0.0066; Chiến hạm (Gunboat) 0.0058

Topic 5: Độc lập (Independence) 0.0095; Lãnh đạo (Lead) 0.0088; Tổng thống (President) 0.0084; Đất nước (Country) 0.0070; Quyền lực (Power) 0.0069; Dân chủ (Democratic) 0.0068; Chính quyền (Government) 0.0067; Ủng hộ (Support) 0.0065; Chế độ (System) 0.0063; Kiểm soát (Control) 0.0058; Lãnh thổ (Territory) …

Topic 6: …