In this thesis, we proposed two models to enhance the
quality of machine translation systems by using cross-lingual
word embedding models. The first model enriches the
phrase-table in a PBSMT system by recomputing the phrase
weights and generating new phrase pairs for the phrase-table.
The second model addresses the unknown word problem in
an NMT system by replacing unknown words with the most
appropriate in-vocabulary words. The analyses and experimental
results show that our models help translation systems
overcome data sparsity for less-common and low-resource
languages.
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN MINH THUAN
ENHANCING THE QUALITY OF MACHINE TRANSLATION
SYSTEM USING CROSS-LINGUAL WORD EMBEDDING
MODELS
Major: Computer Science
Code: 8480101.01
SUMMARY OF COMPUTER SCIENCE MASTER THESIS
SUPERVISOR: Associate Professor Nguyen Phuong Thai
Publication: Minh-Thuan Nguyen, Van-Tan Bui, Huy-Hien Vu, Phuong-Thai
Nguyen, Chi-Mai Luong, Enhancing the quality of Phrase-table in Statistical Machine
Translation for Less-Common and Low-Resource Languages, in the 2018 International
Conference on Asian Language Processing (IALP 2018).
Hanoi, 10/2018
Chapter 1: Introduction
This chapter introduces the motivation of the thesis, related
works and our proposed models. Nowadays, machine
translation systems attain much success in practice, and two
approaches that have been widely used for MT are Phrase-
based statistical machine translation (PBSMT) and Neural
Machine Translation (NMT). In PBSMT, a good
phrase-table is likely to improve translation quality.
However, obtaining a rich phrase-table is
a challenge, since the phrase-table is extracted and trained from
large bilingual corpora, which require much effort
and financial support to build, especially for less-common
languages such as Vietnamese and Lao. In NMT, to reduce
computational complexity, conventional systems
often limit their vocabularies to the top 30K-80K most
frequent words in the source and target languages, and all
words outside the vocabulary, called unknown words, are
replaced by a single unk symbol. This approach makes the
system unable to generate proper translations for these unknown
words during testing.
Recently, several approaches have been proposed to address
these problems. In particular, techniques based on word
embeddings have received much interest from the natural
language processing community. A word embedding is a vector
representation of a word that preserves semantic information
about the word and its context words. Additionally, embeddings
can be exploited to represent words in different vector spaces.
Cross-lingual word embedding models are also
receiving a lot of interest; they learn cross-lingual
representations of words in a joint embedding space to
represent meaning and transfer knowledge in cross-lingual
scenarios. Inspired by the advantages of cross-lingual
embedding models, we propose a model to enhance the quality
of a phrase-table by recomputing the phrase weights and
generating new phrase pairs for the phrase-table, and a model
to address the unknown word problem in the NMT system by
replacing unknown words with the most appropriate
in-vocabulary words.
The rest of this thesis is organized as follows. Chapter 2
gives an overview of the related background. In Chapter 3, we
describe our two proposed models: one enhances the
quality of the phrase-table in SMT, and the other
tackles the unknown word problem in NMT. The settings and
results of our experiments are presented in Chapter 4. We
present our conclusions and future work in Chapter 5.
Chapter 2: Literature review
2.1 Machine Translation
This section reviews the history, approaches, evaluation
methods, and open-source toolkits of MT.
2.1.1 History
In the mid-1930s, Georges Artsrouni attempted to build
“translation machines” by using paper tape to create an
automatic dictionary. After that, Peter Troyanskii proposed a
model including a bilingual dictionary and a method for
handling grammatical issues between languages based on the
grammatical system of Esperanto. During the 2000s, research
in MT saw major changes. Much research focused
on example-based machine translation and statistical machine
translation (SMT). Researchers also took more
interest in hybridization, combining morphological and
syntactic knowledge with statistical systems, as well as
combining statistics with existing rule-based systems.
Recently, the dominant trend in MT has been the use of large
artificial neural networks, called Neural Machine Translation
(NMT). (Cho et al., 2014) published one of the first papers on
using neural networks in MT, followed by a lot of research in
the following few years.
2.1.2 Approaches
In this section, we describe typical approaches to MT
based on linguistic rules, statistics, and neural networks. These
are Rule-based Machine Translation (RBMT), Statistical
Machine Translation (SMT), Example-based Machine
Translation (EBMT), and Neural Machine Translation (NMT).
2.1.3 Evaluation
This section describes BLEU, a popular method for
automatically evaluating MT output that is quick, inexpensive,
and language-independent. The basic idea of this method is to
compare n-grams of the MT output with n-grams of the
reference translation and count the number of matches. The
more matches there are, the better the MT output is.
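The n-gram matching idea can be illustrated with a minimal sketch. It computes only the clipped (modified) n-gram precision for a single candidate and a single reference; full BLEU additionally combines several n-gram orders and applies a brevity penalty, which are omitted here. The function names and the toy sentences are illustrative assumptions, not taken from the thesis.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    # A candidate n-gram is credited at most as often as it occurs
    # in the reference translation.
    matches = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matches / sum(cand_counts.values())


# Toy example (hypothetical sentences).
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
unigram_p = modified_precision(candidate, reference, 1)
bigram_p = modified_precision(candidate, reference, 2)
```

In this toy case five of the six candidate unigrams and three of the five candidate bigrams also occur in the reference, so the candidate scores higher on unigrams than on bigrams.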
2.1.4 Open-Source Machine Translation
This subsection introduces a list of free and complete
toolkits for MT and describes the two MT systems used
in our work. The first is Moses, an open-source system for
SMT, and the second is OpenNMT, an open-source
system for NMT.
2.2 Word Embedding
In this section, we introduce monolingual and
cross-lingual word embedding models.
2.2.1 Monolingual Word Embedding Models
This subsection introduces models used for
estimating continuous representations of words based on
monolingual data.
2.2.2 Cross-Lingual Word Embedding Models
This subsection introduces models used for learning
cross-lingual representations of words in a joint embedding
space to represent meaning and transfer knowledge in
cross-lingual applications.
Chapter 3: Using Cross-Lingual Word Embedding Models
for Machine Translation Systems
In this chapter, we propose two models for improving the
quality of machine translation systems based on cross-lingual
word embedding models. The first model enhances the quality
of the phrase-table in an SMT system by recomputing
phrase-table weights and generating new phrase pairs. The
second model addresses the unknown word problem in an NMT
system by replacing unknown words with similar words.
3.1 Enhancing the Quality of Phrase-table in SMT Using
Cross-Lingual Word Embedding
Phrase-based statistical machine translation (PBSMT)
systems have been developed for years and have achieved
success in practice. The core of PBSMT is the phrase-table, which
contains the words and phrases that the SMT system uses to
translate. In the translation process, sentences are split into
distinct parts (phrases), each of which is translated using the
phrase-table. Hence, a good phrase-table is likely to improve
translation quality. However, obtaining
a rich phrase-table is a challenge, since the phrase-table is
extracted and trained from large bilingual corpora,
which require much effort and financial support to build. In this
work, our contribution focuses on enhancing the phrase-table
quality by recomputing phrase weights and integrating new
translations into the phrase-table by using the cross-lingual
embedding models described in Section 2.2.2.
3.1.1 Recomputing Phrase-table weights
This subsection describes the details of our method for
recomputing the phrase-table weights. Phrase scoring is one of
the most important parts of a statistical machine translation
system. It estimates weights for phrase pairs based on a large
bilingual corpus. Therefore, for less-common and low-resource
languages, the estimation is often inaccurate. To
resolve this problem, we recompute phrase weights by using
monolingual data. The traditional phrase-table in an SMT
system normally contains four weights: inverse phrase
translation probability, inverse lexical weighting, direct phrase
translation probability, and direct lexical weighting. To
recompute those weights, we borrow the idea of using a
linear mapping (shown in Section 2.2.2) between word vectors
to explore similarities among languages. In our work, we use
all three cross-lingual embedding models to learn the linear
mapping W and choose the most appropriate method.
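As a rough illustration of how a linear mapping W relates two embedding spaces, the sketch below projects a source-language word vector into the target space and picks the target word with the highest cosine similarity. It assumes W has already been learned, and it does not reproduce how the thesis turns such similarities into the four phrase-table weights, which this summary does not specify. The matrix, the vectors, and all function names are hypothetical toy values.

```python
import math


def apply_mapping(W, x):
    """Project a source-language vector x into the target space: z = W x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_target_word(W, source_vec, target_vectors):
    """Return the target word whose vector is closest to the projection."""
    projected = apply_mapping(W, source_vec)
    return max(target_vectors,
               key=lambda word: cosine(projected, target_vectors[word]))


# Toy 2-D data (assumption): W is a 90-degree rotation that aligns
# the two hypothetical embedding spaces.
W = [[0.0, -1.0],
     [1.0, 0.0]]
source_vectors = {"mèo": [1.0, 0.0], "chó": [0.0, 1.0]}
target_vectors = {"cat": [0.0, 1.0], "dog": [-1.0, 0.0]}

translations = {word: nearest_target_word(W, vec, target_vectors)
                for word, vec in source_vectors.items()}
```

The same similarity score could, in principle, be computed for every source and target phrase pair in a phrase-table, which is the kind of signal that monolingual data can provide when bilingual counts are unreliable.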
3.1.2 Generating new phrase pairs
This subsection describes the details of our method for
generating new phrase pairs by using projections of word vector
representations. We also show how to integrate the new phrase
pairs into a traditional phrase-table.
3.2 Addressing the Unknown Word Problem in NMT
Using Cross-Lingual Word Embedding Models
In this section, we propose a model to solve the
unknown word problem by using cross-lingual word
embedding models. Our model contains two phases: training
and testing.
The flow of the training phase is shown in Figure 3.1.
The flow of the testing phase is shown in Figure 3.2.
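The core replacement step can be sketched as follows, under strong simplifying assumptions: all words share one embedding space (as a joint cross-lingual space would provide), similarity is measured by cosine, and the training and testing flows of the figures are not modelled. The vocabulary, the vectors, and the function names are hypothetical and only illustrate the idea of swapping an unknown word for its most similar in-vocabulary word.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def replace_unknown_words(tokens, vocabulary, embeddings):
    """Replace each out-of-vocabulary token with its most similar
    in-vocabulary word; tokens without an embedding are left unchanged."""
    # Only in-vocabulary words that have an embedding can be candidates.
    candidates = sorted(w for w in vocabulary if w in embeddings)
    replaced = []
    for token in tokens:
        if token in vocabulary or token not in embeddings or not candidates:
            replaced.append(token)
            continue
        best = max(candidates,
                   key=lambda w: cosine(embeddings[token], embeddings[w]))
        replaced.append(best)
    return replaced


# Toy data (assumption): "puppies" is outside the NMT vocabulary.
vocabulary = {"i", "like", "dogs", "cars"}
embeddings = {
    "i": [0.4, 0.6],
    "like": [0.5, 0.5],
    "dogs": [0.9, 0.1],
    "cars": [0.1, 0.9],
    "puppies": [0.8, 0.2],
}
result = replace_unknown_words("i like puppies".split(), vocabulary, embeddings)
```

Here "puppies" is replaced by "dogs", its nearest in-vocabulary neighbour, so the NMT system receives a sentence it can translate instead of one containing an unk symbol.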
Chapter 4: Experiments and Results
In this chapter, we show our experiments and results.
4.1 Settings
This section describes the corpora and tools we use in our
experiments. We also present the details of our experimental
settings for evaluating the effect of our proposed models on MT
systems.
Table 4.1 shows the size of the monolingual corpora.
Table 4.2 shows the size of the bilingual corpora.
4.2 Results
In this section, we first present the results on word
translation to choose the best approaches for the
Vietnamese-English and Japanese-Vietnamese language pairs. We then
report the results of the PBSMT system in terms of BLEU
score to evaluate the effect of our proposed model for
enhancing the quality of the phrase-table. Finally, we report
the results of the NMT system, which incorporates our unknown
word replacement model, and show some translation
examples.
4.2.1 Word Translation Task
Table 4.4 presents the precision of the word translation task
using various models on different datasets for the
Vietnamese-English and Japanese-Vietnamese language pairs.
In short, the results of the word translation task show that the
method of (Xing et al., 2015), trained on a small manual
bilingual dictionary, is the best approach for learning
cross-lingual word embeddings for the Vietnamese-English and
Japanese-Vietnamese language pairs.
4.2.2 Impact of Enriching the Phrase-table on SMT system
As mentioned above, we proposed two methods to enhance
the quality of the phrase-table: recomputing phrase-table
weights and generating new phrase pairs for the
phrase-table. This section shows the impact of these methods
on the SMT system.
Table 4.5 shows the results of the experiments on the
PBSMT system in terms of BLEU score.
Table 4.6 shows some translation examples from our PBSMT
system, which uses both recomputed phrase-table weights and
new phrase pairs, for the Vietnamese-English
language pair.
This method was presented as a long paper in the 2018
International Conference on Asian Language Processing
(IALP 2018).
4.2.3 Impact of Removing the Unknown Words on NMT
system
This section shows the impact of our proposed method,
which replaces unknown words with similar in-vocabulary
words in the NMT system by using cross-lingual word embedding
models.
Table 4.7 shows the results of our experiment on the NMT
system in terms of BLEU score for Vietnamese-English and
Japanese-Vietnamese.
Table 4.8 shows some translation examples from our NMT
system for Vietnamese-English in the testing phase.
Chapter 5: Conclusion
In this thesis, we proposed two models to enhance the
quality of machine translation systems by using cross-lingual
word embedding models. The first model enriches the
phrase-table in a PBSMT system by recomputing the phrase
weights and generating new phrase pairs for the phrase-table.
The second model addresses the unknown word problem in
an NMT system by replacing unknown words with the most
appropriate in-vocabulary words. The analyses and experimental
results show that our models help translation systems
overcome data sparsity for less-common and low-resource
languages. In the PBSMT system, using both the recomputed
phrase weights and the new phrase pairs significantly improved
the quality of the phrase-table. As a
result, the BLEU score increased by 0.23 and 1.16 for
Vietnamese-English and Japanese-Vietnamese, respectively.
Likewise, in the NMT system, integrating our proposed model
to address unknown words improved the BLEU score by
0.56 and 1.66 for Vietnamese-English and
Japanese-Vietnamese, respectively. However, our approach has
some drawbacks: our methods also created incorrect entries in
the phrase-table and bad translations for some unknown words.
In the future, we will work on the specific cases in which bad
phrase pairs are generated for the phrase-table and bad
translations are produced for unknown words. We would also
like to continue experimenting with different cross-lingual word
embedding models to enhance the quality of machine translation
systems.
BIBLIOGRAPHY
[1] Kyunghyun Cho, Bart van Merrienboer, Caglar
Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. Learning phrase
representations using RNN encoder–decoder for statistical
machine translation. In Proceedings of the 2014
Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1724–1734. Association
for Computational Linguistics, 2014. doi:
10.3115/v1/D14-1179.
[2] Alexis Conneau, Guillaume Lample, Marc’Aurelio
Ranzato, Ludovic Denoyer, and Hervé Jégou. Word
translation without parallel data. CoRR,
abs/1710.04087, 2017.
[3] Hien Vu Huy, Phuong-Thai Nguyen, Tung-Lam
Nguyen, and M. L. Nguyen. Bootstrapping phrase-based
statistical machine translation via WSD integration. In
IJCNLP, 2013.
[4] Philipp Koehn. Statistical Machine Translation.
Cambridge University Press, New York, NY, USA, 1st
edition, 2010. ISBN 0521874157, 9780521874151.
[5] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
Callison-Burch, Marcello Federico, Nicola Bertoldi,
Brooke Cowan, Wade Shen, Christine Moran, Richard
Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin,
and Evan Herbst. Moses: Open source toolkit for
statistical machine translation. In Proceedings of the
45th Annual Meeting of the ACL on Interactive Poster
and Demonstration Sessions, ACL ’07, pages 177–180,
Stroudsburg, PA, USA, 2007. Association for
Computational Linguistics.
[6] Xiaoqing Li, Jiajun Zhang, and Chengqing Zong.
Towards zero unknown word in neural machine
translation. In Proceedings of the Twenty-Fifth
International Joint Conference on Artificial Intelligence,
IJCAI’16, pages 2852–2858. AAAI Press, 2016. ISBN
978-1-57735-770-4.
[7] Thang Luong, Hieu Pham, and Christopher D. Manning.
Effective approaches to attention-based neural machine
translation. In Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing,
pages 1412–1421. Association for Computational
Linguistics, 2015a. doi: 10.18653/v1/D15-1166.
[8] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and
Jeffrey Dean. Efficient estimation of word
representations in vector space. CoRR, abs/1301.3781,
2013a.
[9] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever.
Exploiting similarities among languages for machine
translation, 2013b.
[10] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. BLEU: A method for automatic evaluation of
machine translation. In Proceedings of the 40th Annual
Meeting on Association for Computational Linguistics,
ACL ’02, pages 311–318, Stroudsburg, PA, USA, 2002.
Association for Computational Linguistics. doi:
10.3115/1073083.1073135. URL
https://doi.org/10.3115/1073083.1073135.
[11] Jeffrey Pennington, Richard Socher, and Christopher D.
Manning. GloVe: Global vectors for word
representation. In EMNLP, volume 14, pages 1532–1543,
2014.
[12] Lê Hồng Phương, Nguyễn Thị Minh Huyền, Azim
Roussanaly, and Hồ Tường Vinh. Language and
automata theory and applications. chapter A Hybrid
Approach to Word Segmentation of Vietnamese Texts,
pages 240–249. Springer-Verlag, Berlin, Heidelberg,
2008. ISBN 978-3-540-88281-7. doi:
10.1007/978-3-540-88282-4_23.
[13] Sebastian Ruder, Ivan Vulić, and Anders Søgaard. A
survey of cross-lingual word embedding models. 2017.
[14] Matthew S. Ryan and Graham R. Nudd. The Viterbi
algorithm. Technical report, Coventry, UK, 1993.
[15] Peter H. Schönemann. A generalized solution of the
orthogonal procrustes problem. Psychometrika, 31(1):1–
10, Mar 1966. ISSN 1860-0980. doi:
10.1007/BF02289451. URL
https://doi.org/10.1007/BF02289451.
[16] Rico Sennrich, Barry Haddow, and Alexandra Birch.
Neural machine translation of rare words with subword
units. In Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics (Volume 1:
Long Papers), pages 1715–1725. Association for
Computational Linguistics, 2016. doi: 10.18653/v1/P16-1162.
[17] Stephan Vogel, Hermann Ney, and Christoph Tillmann.
HMM-based word alignment in statistical translation. In
Proceedings of the 16th Conference on Computational
Linguistics - Volume 2, COLING ’96, pages 836–841,
Stroudsburg, PA, USA, 1996. Association for
Computational Linguistics. doi:
10.3115/993268.993313. URL
https://doi.org/10.3115/993268.993313.
[18] Chao Xing, Dong Wang, Chao Liu, and Yiye Lin.
Normalized word embedding and orthogonal transform
for bilingual word translation. In HLT-NAACL, 2015.
[19] Xiaoning Zhu, Zhongjun He, Hua Wu, Conghui Zhu,
Haifeng Wang, and Tiejun Zhao. Improving pivot-based
statistical machine translation by pivoting the
cooccurrence count of phrase pairs. In EMNLP, 2014.