Enhancing the quality of machine translation system using cross-lingual word embedding models

In this thesis, we propose two models to enhance the quality of machine translation systems by using cross-lingual word embedding models. The first model enriches the phrase-table in a PBSMT system by recomputing the phrase weights and generating new phrase pairs for the phrase-table. The second model addresses the unknown word problem in an NMT system by replacing unknown words with the most appropriate in-vocabulary words. The analyses and experimental results indicate that our models help translation systems overcome the data sparsity of less-common and low-resource languages.


VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

NGUYEN MINH THUAN

ENHANCING THE QUALITY OF MACHINE TRANSLATION SYSTEM USING CROSS-LINGUAL WORD EMBEDDING MODELS

Major: Computer Science
Code: 8480101.01

SUMMARY OF COMPUTER SCIENCE MASTER THESIS

SUPERVISOR: Associate Professor Nguyen Phuong Thai

Publication: Minh-Thuan Nguyen, Van-Tan Bui, Huy-Hien Vu, Phuong-Thai Nguyen, Chi-Mai Luong, Enhancing the quality of Phrase-table in Statistical Machine Translation for Less-Common and Low-Resource Languages, in the 2018 International Conference on Asian Language Processing (IALP 2018).

Hanoi, 10/2018

Chapter 1: Introduction

This chapter introduces the motivation of the thesis, related work, and our proposed models.

Machine translation (MT) systems have attained much success in practice, and the two most widely used approaches are phrase-based statistical machine translation (PBSMT) and neural machine translation (NMT). In PBSMT, a good phrase-table can improve translation quality. However, obtaining a rich phrase-table is a challenge, since the phrase-table is extracted and trained from large bilingual corpora, which require much effort and financial support to build, especially for less-common languages such as Vietnamese and Lao. In NMT, to reduce computational complexity, conventional systems often limit their vocabularies to the top 30K-80K most frequent words in the source and target languages, and all words outside the vocabulary, called unknown words, are replaced by a single unk symbol. As a result, the system cannot generate proper translations for these unknown words during testing.

Recently, several approaches have been proposed to address these problems. In particular, techniques using word embeddings have received much interest from the natural language processing community.
A word embedding is a vector representation of a word that preserves semantic information about the word and its context words. In addition, embeddings can be exploited to represent words in diverse, distinct spaces. Cross-lingual word embedding models, which learn representations of words in a joint embedding space in order to represent meaning and transfer knowledge in cross-lingual scenarios, are also receiving a lot of interest.

Inspired by the advantages of cross-lingual embedding models, we propose two models. The first enhances the quality of a phrase-table by recomputing the phrase weights and generating new phrase pairs for the phrase-table. The second addresses the unknown word problem in an NMT system by replacing unknown words with the most appropriate in-vocabulary words.

The rest of this thesis is organized as follows. Chapter 2 gives an overview of the related background. Chapter 3 describes our two proposed models: one enhances the quality of the phrase-table in SMT, and the other tackles the unknown word problem in NMT. Chapter 4 presents the settings and results of our experiments. Chapter 5 gives our conclusions and future work.

Chapter 2: Literature review

2.1 Machine Translation

This section covers the history, approaches, evaluation, and open-source systems of MT.

2.1.1 History

In the mid-1930s, Georges Artsrouni attempted to build “translation machines” by using paper tape to create an automatic dictionary. After that, Peter Troyanskii proposed a model including a bilingual dictionary and a method for handling grammatical issues between languages based on the grammatical system of Esperanto. During the 2000s, research in MT saw major changes. Much research focused on example-based machine translation and statistical machine translation (SMT).
In addition, researchers became more interested in hybridization, combining morphological and syntactic knowledge with statistical systems, as well as combining statistics with existing rule-based systems. Recently, the prevailing trend in MT is the use of large artificial neural networks, called Neural Machine Translation (NMT). In 2014, Cho et al. (2014) published the first paper on using neural networks in MT, followed by a lot of research in the following few years.

2.1.2 Approaches

In this section, we present typical approaches to MT based on linguistic rules, statistics, and neural networks: Rule-based Machine Translation (RBMT), Statistical Machine Translation (SMT), Example-based Machine Translation (EBMT), and Neural Machine Translation (NMT).

2.1.3 Evaluation

This section describes BLEU, a popular method for automatically evaluating MT output that is quick, inexpensive, and language-independent. The basic idea is to compare the n-grams of the MT output with the n-grams of a reference translation and count the number of matches. The more matches there are, the better the MT output is judged to be.

2.1.4 Open-Source Machine Translation

This subsection lists free and complete toolkits for MT and describes the two MT systems used in our work: Moses, an open-source system for SMT, and OpenNMT, an open-source system for NMT.

2.2 Word Embedding

In this section, we introduce monolingual and cross-lingual word embedding models.

2.2.1 Monolingual Word Embedding Models

This subsection introduces models used for estimating continuous representations of words from monolingual data.

2.2.2 Cross-Lingual Word Embedding Models

This subsection introduces models used for learning cross-lingual representations of words in a joint embedding space to represent meaning and transfer knowledge in cross-lingual applications.
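As a rough illustration of why a joint embedding space is useful, the following sketch retrieves word translations by nearest-neighbour search with cosine similarity. It is not the implementation used in the thesis: the vectors are toy values invented for this example, the helper names `cosine` and `nearest_word` are hypothetical, and the two vocabularies are simply assumed to have been mapped into one space already.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_word(query_vec, candidates):
    """Return the candidate word whose vector is most similar to query_vec."""
    return max(candidates, key=lambda word: cosine(query_vec, candidates[word]))


# Toy, hand-made vectors (an assumption for illustration only); both
# vocabularies are assumed to live in the same joint embedding space.
vi_vectors = {"mèo": [0.9, 0.1, 0.0], "chó": [0.1, 0.9, 0.0]}
en_vectors = {
    "cat": [0.8, 0.2, 0.1],
    "dog": [0.2, 0.8, 0.1],
    "car": [0.0, 0.1, 0.9],
}

# For each Vietnamese word, pick the closest English word in the joint space.
translations = {word: nearest_word(vec, en_vectors) for word, vec in vi_vectors.items()}
print(translations)
```

With real embeddings, the same lookup could also be applied within one language, for example to find an in-vocabulary word close to an unknown word.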
Chapter 3: Using Cross-Lingual Word Embedding Models for Machine Translation Systems

In this chapter, we propose two models for improving the quality of machine translation systems based on cross-lingual word embedding models. The first model enhances the quality of the phrase-table in an SMT system by recomputing phrase-table weights and generating new phrase pairs. The second model addresses the unknown word problem in an NMT system by replacing unknown words with similar words.

3.1 Enhancing the Quality of Phrase-table in SMT Using Cross-Lingual Word Embedding

Phrase-based statistical machine translation (PBSMT) systems have been developed for years and have attained success in practice. The core of PBSMT is the phrase-table, which contains the words and phrases that the SMT system uses to translate. In the translation process, sentences are split into distinct parts, which are translated using the phrase-table. Hence, a good phrase-table can improve translation quality. However, obtaining a rich phrase-table is a challenge, since the phrase-table is extracted and trained from large bilingual corpora, which require much effort and financial support to build. In this work, our contribution focuses on enhancing phrase-table quality by recomputing phrase weights and integrating new translations into the phrase-table, using the cross-lingual embedding models described in Section 2.2.2.

3.1.1 Recomputing Phrase-table weights

This subsection describes our method for recomputing the phrase-table weights. Phrase scoring is one of the most important parts of a statistical machine translation system. It estimates weights for phrase pairs from a large bilingual corpus. Therefore, for less-common and low-resource languages, where such corpora are small, the estimation is often inaccurate. To resolve this problem, we recompute phrase weights by using monolingual data.
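To see why estimates from a small bilingual corpus can be unreliable, consider the usual relative-frequency estimate of the direct phrase translation probability in phrase-based SMT, count(source, target) / count(source). The sketch below is only an illustration under that assumption: the list of extracted phrase pairs is made up, and the function name `direct_phrase_probabilities` is hypothetical.

```python
from collections import Counter


def direct_phrase_probabilities(phrase_pairs):
    """Estimate p(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(source for source, _ in phrase_pairs)
    return {
        (source, target): count / source_counts[source]
        for (source, target), count in pair_counts.items()
    }


# Made-up phrase pairs standing in for the output of phrase extraction.
extracted = [
    ("xin chào", "hello"),
    ("xin chào", "hello"),
    ("xin chào", "hi"),
    ("cảm ơn", "thanks"),
]

probs = direct_phrase_probabilities(extracted)
for pair, prob in sorted(probs.items()):
    print(pair, round(prob, 3))
```

In this toy data, the pair seen only once receives probability 1.0 simply because no alternative was observed. Overconfident scores of this kind are one reason to look for additional evidence beyond the bilingual corpus.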
The traditional phrase-table in an SMT system normally contains four weights: inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability, and direct lexical weighting. To recompute those weights, we borrow the idea of using a linear mapping between word vectors (described in Section 2.2.2) to explore similarities among languages. In our work, we use all three cross-lingual embedding models to learn the linear mapping W and choose the most appropriate method.

3.1.2 Generating new phrase pairs

This subsection describes our method for generating new phrase pairs by using projections of word vector representations. We also show how to combine the new phrase pairs into a traditional phrase-table.

3.2 Addressing the Unknown Word Problem in NMT Using Cross-Lingual Word Embedding Models

In this section, we propose a model to solve the unknown word problem described in Chapter 1 by using cross-lingual word embedding models. Our model contains two phases: training and testing. The flow of the training phase is shown in Figure 3.1. The flow of the testing phase is shown in Figure 3.2.

Chapter 4: Experiments and Results

In this chapter, we present our experiments and results.

4.1 Settings

This section describes the corpora and tools we use in our experiments. We also present the details of the experimental settings used to evaluate the effect of our proposed models on MT systems. Table 4.1 shows the size of the monolingual corpora. Table 4.2 shows the size of the bilingual corpora.

4.2 Results

In this section, we first present the word translation results used to choose the best approaches for the Vietnamese-English and Japanese-Vietnamese language pairs. We then report the results of the PBSMT system in terms of BLEU score to evaluate the effect of our proposed model for enhancing the quality of the phrase-table.
Finally, we report the results of the NMT system that incorporates our unknown word replacement model and show some translation examples.

4.2.1 Word Translation Task

Table 4.4 presents the precision of the word translation task using various models on different datasets for the Vietnamese-English and Japanese-Vietnamese language pairs. In short, the results show that the method of (Xing et al., 2015), trained on a small manual bilingual dictionary, is the best approach for learning cross-lingual word embeddings for the Vietnamese-English and Japanese-Vietnamese language pairs.

4.2.2 Impact of Enriching the Phrase-table on SMT system

As mentioned above, we proposed two methods to enhance the quality of the phrase-table: recomputing phrase-table weights and generating new phrase pairs for the phrase-table. This section shows the impact of these methods on the SMT system. Table 4.5 shows the results of the experiments on the PBSMT system in terms of BLEU score. Table 4.6 shows some translation examples from our PBSMT system, which uses both recomputed phrase-table weights and new phrase pairs, for the Vietnamese-English language pair. This method was presented as a long paper at the 2018 International Conference on Asian Language Processing (IALP 2018).

4.2.3 Impact of Removing the Unknown Words on NMT system

This section shows the impact of our proposed method, which replaces unknown words with similar in-vocabulary words in the NMT system by using cross-lingual word embedding models. Table 4.7 shows the results of our experiment on the NMT system in terms of BLEU score for Vietnamese-English and Japanese-Vietnamese. Table 4.8 shows some translation examples from our NMT system for Vietnamese-English in the testing phase.

Chapter 5: Conclusion

In this thesis, we proposed two models to enhance the quality of machine translation systems by using cross-lingual word embedding models.
The first model enriches the phrase-table in a PBSMT system by recomputing the phrase weights and generating new phrase pairs for the phrase-table. The second model addresses the unknown word problem in an NMT system by replacing unknown words with the most appropriate in-vocabulary words. The analyses and experimental results indicate that our models help translation systems overcome the data sparsity of less-common and low-resource languages. In the PBSMT system, using both recomputed phrase weights and new phrase pairs significantly improved the quality of the phrase-table. As a result, the BLEU score increased by 0.23 and 1.16 for Vietnamese-English and Japanese-Vietnamese, respectively. Likewise, in the NMT system, integrating our proposed model for addressing unknown words improved the BLEU score by 0.56 and 1.66 for Vietnamese-English and Japanese-Vietnamese, respectively. However, our approach has some drawbacks, since our methods created some incorrect entries in the phrase-table and some bad translations for unknown words.

In the future, we will study the specific cases that produce bad phrase pairs for the phrase-table and bad translations for unknown words. We would also like to continue experimenting with different cross-lingual word embedding models to enhance the quality of machine translation systems.

BIBLIOGRAPHY

[1] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics, 2014. doi: 10.3115/v1/D14-1179.
[2] Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. CoRR, abs/1710.04087, 2017.
[3] Hien Vu Huy, Phuong-Thai Nguyen, Tung-Lam Nguyen, and M. L. Nguyen. Bootstrapping phrase-based statistical machine translation via WSD integration. In IJCNLP, 2013.
[4] Philipp Koehn. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition, 2010. ISBN 0521874157, 9780521874151.
[5] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pages 177–180, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics.
[6] Xiaoqing Li, Jiajun Zhang, and Chengqing Zong. Towards zero unknown word in neural machine translation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pages 2852–2858. AAAI Press, 2016. ISBN 978-1-57735-770-4.
[7] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Association for Computational Linguistics, 2015a. doi: 10.18653/v1/D15-1166.
[8] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013a.
[9] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation, 2013b.
[10] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://doi.org/10.3115/1073083.1073135.
[11] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
[12] Lê Hồng Phương, Nguyễn Thị Minh Huyền, Azim Roussanaly, and Hồ Tường Vinh. Language and automata theory and applications. Chapter A Hybrid Approach to Word Segmentation of Vietnamese Texts, pages 240–249. Springer-Verlag, Berlin, Heidelberg, 2008. ISBN 978-3-540-88281-7. doi: 10.1007/978-3-540-88282-4_23.
[13] Sebastian Ruder, Ivan Vulić, and Anders Søgaard. A survey of cross-lingual word embedding models. 2017.
[14] Matthew S. Ryan and Graham R. Nudd. The Viterbi algorithm. Technical report, Coventry, UK, 1993.
[15] Peter H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, Mar 1966. ISSN 1860-0980. doi: 10.1007/BF02289451. URL https://doi.org/10.1007/BF02289451.
[16] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics, 2016. doi: 10.18653/v1/P16-1162.
[17] Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, COLING ’96, pages 836–841, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics. doi: 10.3115/993268.993313. URL https://doi.org/10.3115/993268.993313.
[18] Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. Normalized word embedding and orthogonal transform for bilingual word translation. In HLT-NAACL, 2015.
[19] Xiaoning Zhu, Zhongjun He, Hua Wu, Conghui Zhu, Haifeng Wang, and Tiejun Zhao. Improving pivot-based statistical machine translation by pivoting the cooccurrence count of phrase pairs. In EMNLP, 2014.
