Lexical Ambiguity Resolution for Turkish in Direct Transfer Machine Translation Models
A. Cüneyd TANTUĞ 1, Eşref ADALI 1, Kemal OFLAZER 2

1 Istanbul Technical University, Faculty of Electrical-Electronic Engineering, Computer Engineering Department, 34469 Maslak, Istanbul, Türkiye
{cuneyd, adali}@cs.itu.edu.tr

2 Sabancı University, Faculty of Engineering and Natural Sciences, 34956 Orhanlı, Tuzla, Türkiye
[email protected]
Abstract. This paper presents a statistical lexical ambiguity resolution method for direct transfer machine translation models in which the target language is Turkish. Since direct transfer MT models lack full syntactic information, most lexical ambiguity resolution methods are of little help. Our disambiguation model is based on statistical language models. We have investigated the performance of several statistical language model types and parameters for lexical ambiguity resolution in our direct transfer MT system.
1. Introduction
This paper presents a statistical lexical ambiguity resolution method for direct transfer machine translation models in which the target language is Turkish. The resolution method is based on statistical language models (LMs) that exploit collocational occurrence probabilities. Although lexical ambiguity resolution methods are generally required for NLP tasks such as accent restoration, word sense disambiguation, and homograph and homophone disambiguation, we focus only on the lexical ambiguity of word choice selection in machine translation (MT).
The direct transfer model in MT transfers a sentence in the source language to a sentence in the target language on a word-by-word basis. While this is the simplest MT technique, it nevertheless works well for closely related language pairs such as Czech-Slovak [1] and Spanish-Catalan [2]. We have been implementing an MT system between Turkic languages; the first stage of the project covers the Turkmen-Turkish language pair.
The lexical ambiguity problem arises when the bilingual transfer dictionary produces more than one corresponding target language root word for a source language root word. Hence, the MT system has to choose the right target root word by means of some evaluation criteria. This process is called word choice selection in MT-related tasks. Most of the methods developed to overcome this problem are based on syntactic analysis or domain knowledge. However, no such syntactic analysis or other deeper knowledge is available in MT systems that employ direct transfer models.
2. Previous Work
The general task of "word sense disambiguation" (WSD) studies the assignment of correct sense labels to ambiguous words in naturally occurring free text. WSD has many application areas, such as MT, information retrieval, content and thematic analysis, grammatical analysis, spelling correction, and case changes. In general terms, WSD involves the association of a given word in a text or discourse with a definition or meaning (sense) that is distinguishable from other meanings potentially attributable to that word. In the history of the WSD area, supervised and unsupervised statistical machine learning techniques have been used, as well as other AI methods. A Bayesian classifier was used in the work of Gale et al. [5]. Leacock et al. [6] compared the performance of Bayesian, content vector, and neural network classifiers in WSD tasks. Schütze's experiments showed that unsupervised clustering techniques can achieve results approaching those of supervised techniques by using a large-scale context (1000 neighbouring words) [7]. However, all these statistical WSD methods suffer from two major problems: the need for a manually sense-tagged corpus (certainly for supervised techniques) and data sparseness.
Lexical ambiguity resolution in MT is a task related to WSD that focuses on deciding on the most appropriate word in the target language among a group of possible translations of a word in the source language. There are some general lexical ambiguity resolution techniques; unfortunately, most of the successful ones require complex information sources such as lemmas, inflected forms, parts of speech, local and distant collocations, and trigrams [8].
Most disambiguation work uses syntactic information together with other information such as taxonomies, selectional restrictions, and semantic networks. In recent work, complex processing is avoided by using partial (shallow) parsing instead of full parsing. For example, Hearst [9] segments text into noun and prepositional phrases and verb groups and discards all other syntactic information. Often the syntactic information used is simply the part of speech, in conjunction with other information sources [10]. In our work, there is no syntactic transfer, so no syntactic-level knowledge is available other than the part of speech. We have therefore developed a selection model based on statistical language models.
Our disambiguation module performs not only lexical ambiguity resolution but also the disambiguation of source language morphological ambiguities. Our translation system is designed mainly for translation from Turkic languages into Turkish. One serious problem is the almost total lack of computational resources and tools for these languages that could help with some of these problems on the source language side, so all morphological ambiguities are transferred to the target language side. Apart from the lexical disambiguation problem, our disambiguation module must therefore also decide on the right root word by considering the morphological ambiguity. For example, the Turkmen word "akmak" (foolish) has two ambiguous morphological analyses:
akmak+Adj                                   (fool)
ak+Verb+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom      (to flow)
These analyses cannot be disambiguated until the transfer phase because there is no POS tagger for the source language. The transfer module converts all the root words into the target language as shown below:

ahmak+Adj                                   (fool)
budala+Adj                                  (stupid)
ak+Verb+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom      (to flow)
Note that there are two possible translations of the source word for its "fool" sense. The disambiguation module to be designed has to handle both the lexical ambiguity and the morphological ambiguity.
3. Language Models
Statistical language models define probability distributions over word sequences. Using a language model, one can compute the probability of a sentence S = w1 w2 w3 … wn with the following formula:

P(S) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|w1…wn-1)

This means that the probability of any word sequence can be calculated by chain-rule decomposition; however, due to data sparseness, most of the terms above would be 0, so n-gram approximations are used.
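As a concrete sketch of the n-gram approximation (our illustration, not the toolkit used in the paper), the following computes a bigram estimate P(S) ≈ P(w1)P(w2|w1)…P(wn|wn-1) from raw counts over a hypothetical toy corpus:

```python
from collections import Counter

# Hypothetical toy corpus of root-word sequences, with <s> and </s>
# marking sentence boundaries.
corpus = [
    ["<s>", "dil", "konuş", "</s>"],
    ["<s>", "dil", "konuş", "</s>"],
    ["<s>", "dil", "söyle", "</s>"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def bigram_prob(w, prev):
    # Maximum-likelihood estimate P(w | prev) = c(prev, w) / c(prev)
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(words):
    # Bigram approximation: P(S) ~ product of P(w_i | w_{i-1})
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= bigram_prob(w, prev)
    return p

print(sentence_prob(["<s>", "dil", "konuş", "</s>"]))  # 0.666...
```

In practice the raw maximum-likelihood estimates are smoothed, as discussed below.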
3.1. Training Corpus
Since Turkish has an agglutinative and highly inflectional morphology, the training corpus cannot simply be raw text collected from various sources. In order to calculate the frequencies of the root words, these texts must be processed by a lemmatizer. Moreover, some of our models require not only the root words but also other morphological structures, so the words in the training corpus must be processed with a morphological analyzer, and the morphological ambiguity must be resolved manually or by using a POS tagger.
We have used such a corpus, composed of texts from a daily Turkish newspaper. Some statistics about the corpus are given in Table 1.
Table 1. Training Corpus Statistics

# of tokens                               948,000
root word vocabulary size                  25,787
root words occurring 1 or 2 times          14,830
root words occurring more than 2 times     10,957

3.2. Baseline Model
At first glance, the simplest form of word choice selection can be implemented by using word occurrence frequencies collected from a corpus. We take this model as our baseline. Note that this is the same as a unigram (1-gram) language model.
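A minimal sketch of this baseline rule, with hypothetical frequency counts, simply picks the most frequent candidate:

```python
# Hypothetical unigram counts collected from a corpus.
unigram_counts = {"söyle": 1200, "konuş": 800, "seyrek": 15}

def baseline_choice(candidates, counts):
    # Choose the candidate root word with the highest corpus frequency;
    # candidates absent from the counts get frequency zero.
    return max(candidates, key=lambda w: counts.get(w, 0))

print(baseline_choice(["söyle", "konuş"], unigram_counts))  # söyle
```

Because the rule ignores context entirely, it always picks the same translation for a given candidate set, whatever the surrounding words are.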
3.3. Language Model Types
We have used two different types of language models for lexical ambiguity resolution. The first, LM Type 1, is built using only root words, discarding all other morphological information. LM Type 2 uses the probability distributions of root words together with their part of speech information.
Both language model types are back-off language models with Good-Turing discounting for smoothing. Additionally, we have used a cutoff of 2, which means that n-grams occurring fewer than two times are discarded. We computed our language models using the CMU-Cambridge Statistical Language Modeling Toolkit [11].
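The back-off idea can be illustrated with a much-simplified sketch; note that it uses a fixed back-off factor rather than the Good-Turing discounting actually used, and all counts are hypothetical:

```python
from collections import Counter

CUTOFF = 2      # n-grams occurring fewer than two times are discarded
BACKOFF = 0.4   # fixed factor standing in for a proper back-off weight

# Hypothetical counts.
unigrams = Counter({"dil": 10, "konuş": 6, "söyle": 4})
bigrams = Counter({("dil", "konuş"): 3, ("dil", "söyle"): 1})
TOTAL = sum(unigrams.values())

def backoff_prob(w, prev):
    # Use the bigram estimate only if the bigram survives the cutoff;
    # otherwise back off to a scaled unigram estimate.
    c = bigrams[(prev, w)]
    if c >= CUTOFF:
        return c / unigrams[prev]
    return BACKOFF * unigrams[w] / TOTAL

print(backoff_prob("konuş", "dil"))  # 0.3  (bigram kept)
print(backoff_prob("söyle", "dil"))  # 0.08 (backed off to the unigram)
```

A real back-off model computes the back-off weight per history so that the distribution normalizes; the toolkit handles this internally.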
3.4. LM Parameters
Apart from the LM type, there are two major parameters in constructing an LM. The first is the order of the model: the number of successive words considered. The second is the vocabulary size. It might seem best to include all words in an LM; however, this is not practical, because a large LM is hard to handle and manage in real-world applications and is prone to sparseness problems. Therefore, reducing the vocabulary size is common practice, and deciding how many of the most frequent words to use when building the LM becomes the second parameter to be determined.
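Truncating the vocabulary to the K most frequent root words can be sketched as follows (the counts are hypothetical); out-of-vocabulary words would typically be mapped to an unknown-word token:

```python
from collections import Counter

# Hypothetical root-word frequencies.
counts = Counter({"ve": 900, "bir": 850, "dil": 40, "çöl": 3, "seyrek": 1})

def build_vocabulary(counts, k):
    # Keep only the k most frequent root words.
    return {w for w, _ in counts.most_common(k)}

vocab = build_vocabulary(counts, 3)

def map_token(w):
    # Words outside the vocabulary are replaced by <unk>.
    return w if w in vocab else "<unk>"

print([map_token(w) for w in ["bir", "çöl", "dil"]])  # ['bir', '<unk>', 'dil']
```

The choice of k trades model size against coverage, which is exactly the trade-off studied in the evaluation section.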
4. Implementation
We have carried out our tests on a direct MT system that translates text in the Turkmen language into Turkish. The main motivation behind this MT system is the design of a generic MT system that performs automatic text translation between all Turkic languages. Although the system is currently under development and has some weaknesses (such as the small coverage of the source language morphological analyzer and an insufficient multi-word transfer block), the preliminary results are at an acceptable level. The system has the following processing blocks:
1. Tokenization
2. Source Language (SL) Morphological Analysis
3. Multi-Word Detection
4. Root Word Transfer
5. Morphological Structure Transfer
6. Lexical & Morphological Ambiguity Resolution
7. Target Language (TL) Morphological Synthesis
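The seven blocks above can be sketched as a pipeline of functions; each function below is a hypothetical stub standing in for the corresponding module, so only the pipeline shape is shown, not the real logic:

```python
# Hypothetical stubs for the seven processing blocks; real modules
# would replace each function body.
def tokenize(text):                   return text.split()
def sl_morph_analysis(tokens):        return tokens  # analyses per token
def multi_word_detection(units):      return units   # merge multi-word units
def root_word_transfer(units):        return units   # SL roots -> TL candidates
def morph_structure_transfer(units):  return units   # map morphological features
def ambiguity_resolution(units):      return units   # LM-based selection step
def tl_morph_synthesis(units):        return " ".join(units)

PIPELINE = [tokenize, sl_morph_analysis, multi_word_detection,
            root_word_transfer, morph_structure_transfer,
            ambiguity_resolution, tl_morph_synthesis]

def translate(text):
    # Feed the output of each stage into the next one.
    data = text
    for stage in PIPELINE:
        data = stage(data)
    return data

print(translate("näme üçin"))  # näme üçin (the stubs pass the text through)
```

The disambiguation module described in this paper corresponds to stage 6, which consumes the candidate lattice produced by stages 4 and 5.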
Our word-to-word direct MT system generates all possible Turkish counterparts of each input source root word by using a bilingual transfer dictionary (while the morphological features are transferred directly). As the next step, all candidate Turkish words are used to generate a directed acyclic graph of possible word sequences, as shown in Fig. 1.
Source language sentence: Näme üçin adamlar dürli dillerde gepleýärler ?

Fig. 1. The process of decoding the most probable target language sentence
As seen in the figure, each root word of the source language sentence can have one or more Turkish translations, which produces lexically ambiguous output. The transition probabilities are determined by the language model. For example, the probability of the transition between "dil" and "konuş" is given by P("konuş"|"dil"), which is calculated from the corpus in the training stage of the bigram language model. The ambiguity is then resolved by finding the path with the maximum probability using the Viterbi algorithm [13].
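A minimal sketch of this decoding step follows; the candidate lattice and the bigram probability table are hypothetical, and unseen bigrams receive a small smoothed probability:

```python
import math

# Hypothetical candidate lattice: one slot per source root word.
lattice = [["<s>"], ["dil"], ["konuş", "söyle"], ["</s>"]]

# Hypothetical bigram probabilities P(w | prev); unlisted pairs
# fall back to a small smoothed value.
bigram = {("<s>", "dil"): 0.9, ("dil", "konuş"): 0.6,
          ("dil", "söyle"): 0.3, ("konuş", "</s>"): 0.8,
          ("söyle", "</s>"): 0.8}
SMOOTH = 1e-6

def viterbi(lattice, bigram):
    # best[w] = (log-probability of the best path ending in w, that path)
    best = {lattice[0][0]: (0.0, [lattice[0][0]])}
    for slot in lattice[1:]:
        new_best = {}
        for w in slot:
            score, path = max(
                ((s + math.log(bigram.get((prev, w), SMOOTH)), p)
                 for prev, (s, p) in best.items()),
                key=lambda x: x[0])
            new_best[w] = (score, path + [w])
        best = new_best
    # Return the highest-scoring complete path.
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(lattice, bigram))  # ['<s>', 'dil', 'konuş', '</s>']
```

Working in log space avoids numerical underflow on longer sentences, and keeping only the best path per candidate makes the search linear in sentence length.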
5. Evaluation
In the evaluation phase of our work, the aim is to find the LM type, order, and vocabulary size that perform best with our direct MT model. We conducted our tests for n = 1, 2, 3, 4 and root word vocabulary sizes of 3K, 4K, 5K, 7K, and 10K. Note that 10K is nearly the vocabulary size of the training corpus, which means that all the root words are used to construct the LM.
The performance of the proposed resolution method is evaluated by means of the NIST scores achieved by each LM type for different parameters. NIST is a widely used, well-known method for evaluating the performance of MT systems [14]. We used the BLEU/NIST evaluation tool mteval, which can be obtained from NIST. These evaluations are calculated on 255 sentences. In order to find out which LM type and parameters are better, we ran our direct transfer system with different LMs. For a fair evaluation, we measured the performance of these LMs against that of our baseline model. The results are given in Fig. 2 and Fig. 3.
Fig. 2. LM Type 1 (Root Word) Results (NIST score improvement over the baseline vs. LM order n = 1-4, for vocabulary sizes 3K, 4K, 5K, 7K, and 10K)
Fig. 3. LM Type 2 (Root Word + POS Tag) Results (NIST score improvement over the baseline vs. LM order n = 1-4, for vocabulary sizes 3K, 4K, 5K, 7K, and 10K)
In our tests, the root word language model (Type 1, n = 2, vocabulary size 7K) performs best. We observed that there is practically no difference between the 7K and 10K vocabulary selections, which is why the 7K and 10K lines in the graphs are superposed. An interesting and important result is the decrease of the score for n higher than 2 for all LM types and parameters (except Type 1 with a 3K vocabulary). This can be explained by the fact that most dependencies hold between two consecutive words. Also, as expected, the NIST score improvement increases with larger vocabulary sizes, although an LM with a 5K vocabulary already gives reasonable results.
As an example, translations of the Turkmen sentence in Fig. 1 produced with our baseline model and with a bigram model are shown below:

Input  : Näme üçin adamlar dürli dillerde gepleýärler ?
Output1: ne için insanlar türlü dillerde söylüyorlar ?   (Type 1, n=1, 3K)
Output2: ne için insanlar türlü dillerde konuşuyorlar ?  (Type 2, n=2, 3K)
Although Output1 is an acceptable translation, the quality of the translation can be improved by using an LM of Type 1 with n = 2. P("söyle") (to say) is larger than P("konuş") (to speak) in the baseline model. On the contrary, P("konuş"|"dil") ("dil" means language) is larger than P("söyle"|"dil") in the LM with n = 2, which makes the translation more fluent.
In the example below, one can see the positive effect of increasing the vocabulary size. The source root word "çöl" (desert) has three translations in the transfer dictionary: "seyrek" (scarce), "kıt" (inadequate), and "çöl" (desert). There is no probability entry for these words in a 3K vocabulary, so all three target words receive probabilities from the smoothing process. The system chooses the first one, "seyrek" (scarce), which results in a wrong translation. This error is corrected by the LM with a 7K vocabulary, which includes the word "çöl" (desert).

Input  : Bir adam çölüñ içinde kompas tapypdyr .
Output1: bir insan seyreğin içinde pusula bulmuştur . (Type 1, n=1, 3K)
Output2: bir insan çölün içinde pusula bulmuştur .    (Type 1, n=1, 7K)
There are also examples of erroneous translations. In the following translation instances, the LM with n = 2 and a 5K vocabulary cannot correctly choose the word "dur" (to stay); instead, it chooses the verb "dikil" (to stand) because of the higher probability of P("dikil"|"sonra") ("sonra" means "after"). In this case, the baseline translation model outperforms all other LMs.

Input  : Hamyr gyzgyn tamdyrda esli wagtlap durandan soñ , ondan tagamly bir zadyñ ysy çykyp ugrapdyr .
Output1: hamur kızgın tandırda epeyce süre durduktan sonra , ondan tatlı bir şeyin kokusu çıkıp başlamıştır .  (Type 1, n=1, 3K)
Output2: hamur kızgın tandırda epeyce süre dikildikten sonra , ondan tatlı bir şeyin kokuyu çıkıp başlamıştır . (Type 1, n=2, 5K)
In some cases, POS information decreases the NIST score, contrary to expectation. For instance, the following translations show a situation where the baseline system produces the right result by choosing the verb "var" (to exist), whereas the more complex LM (Type 2 with n = 2 and a 5K vocabulary) generates a false word choice by selecting the verb "git" (to go).

Input  : Sebäbi meniñ içimde goşa ýumruk ýaly gyzyl bardy
Output1: nedeni benim içimde çift yumruk gibi altın vardı . (Type 1, n=1, 3K)
Output2: nedeni benim içimde çift yumruk gibi altın gitti . (Type 2, n=2, 5K)
6. Conclusions
Our major goal in this work is to propose a lexical ambiguity resolution method for Turkish to be used in our direct transfer MT model. The lexical ambiguity occurs mainly because of the transfer of the source language root words. We have built language models for statistical resolution of this lexical ambiguity, and these LMs are then used to generate Hidden Markov Models. Finding the path with the highest probability in the HMMs is done by using the Viterbi method. In this way, we can resolve the lexical ambiguity and choose the most probable word sequences in the target language.
Two types of language models (root words, and root words with POS tags) were used, and we investigated the effects of the other parameters, such as LM order and vocabulary size. The LM built using root words performs best with the parameters n = 2 and a 7K vocabulary.
Since Turkish is an agglutinative language, taking NIST as the evaluation method may not be very meaningful, because NIST considers only surface form matching of the words, not root word matching. Even when the model chooses the right root word, the other morphological structures can make the surface form of the translated word differ from the surface forms of the words in the reference sentences. This means that the real performance is equal to or probably higher than the measured scores.
We expect higher NIST score improvements as our translation model is developed and extended. Remaining problems include source language morphological analyzer mistakes and transfer rules that are insufficient to handle some cases.
Acknowledgments
This project is partly supported by TÜBİTAK (The Scientific and Technical Research Council of Turkey) under contract no. 106E048.
References

1. Hajič, J., Hric, J., Kubon, V.: Machine Translation of Very Close Languages. Applied NLP Processing, NAACL, Washington (2000)
2. Canals, R., et al.: interNOSTRUM: A Spanish-Catalan Machine Translation System. EAMT Machine Translation Summit VIII, Spain (2001)
3. Hirst, G., Charniak, E.: Word Sense and Case Slot Disambiguation. In: AAAI-82, pp. 95-98 (1982)
4. Black, E.: An Experiment in Computational Discrimination of English Word Senses. IBM Journal of Research and Development 32(2), pp. 185-194 (1988)
5. Gale, W., Church, K. W., Yarowsky, D.: A Method for Disambiguating Word Senses in a Large Corpus. Statistical Research Report 104, Bell Laboratories (1992)
6. Leacock, C., Towell, G., Voorhees, E.: Corpus-Based Statistical Sense Resolution. Proceedings of the ARPA Human Language Technology Workshop, San Francisco, Morgan Kaufmann (1993)
7. Schütze, H.: Automatic Word Sense Discrimination. Computational Linguistics 24(1) (1998)
8. Yarowsky, D.: Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), Cambridge, MA (1995)
9. Hearst, M. A.: Noun Homograph Disambiguation Using Local Context in Large Corpora. Proceedings of the 7th Annual Conf. of the University of Waterloo Centre for the New OED and Text Research, Oxford, United Kingdom, pp. 1-19 (1991)
10. McRoy, S. W.: Using Multiple Knowledge Sources for Word Sense Discrimination. Computational Linguistics 18(1), pp. 1-30 (1992)
11. Clarkson, P. R., Rosenfeld, R.: Statistical Language Modeling Using the CMU-Cambridge Toolkit. Proceedings ESCA Eurospeech (1997)
12. Forney, G. D., Jr.: The Viterbi Algorithm. Proceedings of the IEEE, Vol. 61, pp. 268-278 (1973)
13. NIST Report (2002): Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. http://www.nist.gov/speech/tests/mt/doc/ngramstudy.pdf
