Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
Spoken Language Processing Group (TLP)
Machine Translation @ LIMSI
SMT@LIMSI > Research themes
Research on machine translation is primarily oriented towards improving existing statistical machine translation (SMT) systems, or more generally data-driven machine translation engines. In a nutshell, SMT systems rely on the statistical analysis of large bilingual corpora to train stochastic models of the mapping between a source and a target language. In their simplest form, these models correspond to probabilistic rational relations between source and target strings of words, as initially formulated in the famous IBM models in the early nineties. More recently, these models have been extended to capture more complex representations (e.g. chunks, trees, or dependency structures) and the possible probabilistic relationships between these representations. Such models are typically trained from parallel corpora, i.e. from examples of source texts aligned with their translation(s), where the alignment is typically defined at the subsentential level.
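As an illustration of how such a model can be trained from a parallel corpus, here is a minimal sketch of the EM procedure for the simplest of the IBM models (Model 1), which estimates word translation probabilities t(target word | source word). The function name, the three-sentence toy corpus, and the omission of the NULL source word are assumptions made for brevity; this is not the code of any LIMSI system.

```python
from collections import defaultdict


def train_ibm_model1(corpus, iterations=20):
    """Estimate t(tgt_word | src_word) with EM (IBM Model 1, NULL word omitted)."""
    src_vocab = {s for src, _ in corpus for s in src}
    tgt_vocab = {w for _, tgt in corpus for w in tgt}
    # Uniform initialisation of the translation table.
    t = {(w, s): 1.0 / len(tgt_vocab) for w in tgt_vocab for s in src_vocab}

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (tgt_word, src_word) links
        total = defaultdict(float)  # expected count of links leaving src_word
        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in corpus:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalise the expected counts per source word.
        t = {(w, s): count[(w, s)] / total[s] for (w, s) in t}
    return t


# Toy parallel corpus (hypothetical): German source, English target.
TOY_CORPUS = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]

if __name__ == "__main__":
    table = train_ibm_model1(TOY_CORPUS)
    for (w, s), p in sorted(table.items(), key=lambda kv: -kv[1])[:4]:
        print(f"t({w} | {s}) = {p:.3f}")
```

Even on three sentence pairs, the co-occurrence statistics are enough for EM to prefer the pairs das/the, Haus/house, Buch/book and ein/a; the most probable link for each target word then gives a word alignment.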
In this context, LIMSI is developing its research activities in several directions, from the design of word and phrase alignment models to the conception of novel translation or language models, and from the exploration of new training or tuning methodologies to the development of new decoding strategies. All these innovations need to be evaluated and diagnosed, and we also devote a significant fraction of our efforts to addressing the vexing issue of quality measurement in MT outputs. The results of these activities have been published in a number of international conferences and journals (see the Publications section). Finally, we are involved in a number of national and international projects (see the Projects section below).
Regarding alignment models, most of our recent work deals with the design and training of discriminative alignment techniques (Tomeh et al, 2011a, 2011b, 2010b; Allauzen & Wisniewski, 2009), which can be used to compute word alignments directly, to symmetrize existing word alignments, or to refine the extraction process. Recent work (Lardilleux et al, 2011; 2012; 2013) explores alternative alignment techniques based on phrase association measures: the goal is to explore flexible, on-demand alignment strategies.
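To make the notion of symmetrization concrete, the sketch below combines two directional word alignments with the standard intersection and union heuristics. This is the usual baseline, not the discriminative technique of the papers cited above; the function name and the toy alignments are hypothetical.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Combine two directional alignments into high-precision and high-recall link sets.

    src_to_tgt: iterable of (source_index, target_index) links
    tgt_to_src: iterable of (target_index, source_index) links
    """
    forward = set(src_to_tgt)
    # Flip the reverse direction so that both sets use (source, target) order.
    backward = {(i, j) for (j, i) in tgt_to_src}
    intersection = forward & backward  # links both directions agree on
    union = forward | backward         # links proposed by either direction
    return intersection, union


if __name__ == "__main__":
    # Hypothetical alignments for a three-word sentence pair.
    s2t = [(0, 0), (1, 2), (2, 1)]
    t2s = [(0, 0), (2, 1), (2, 2)]
    inter, union = symmetrize(s2t, t2s)
    print(sorted(inter), sorted(union))
```

A discriminative model can be seen as replacing these fixed set operations with a learned decision, for each candidate link, on whether to keep it.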
Our main decoder, N-code, belongs to the class of n-gram based systems. In a nutshell, these systems define translation as a two-step process, where an input source sentence is first reordered non-deterministically, yielding an input word lattice containing several possible reorderings. This lattice is then translated monotonically using a bilingual n-gram model; as in the more standard approach, hypotheses are scored using a battery of probabilistic models, whose weights are tuned with minimum error rate training. Recent evolutions of this approach are described in (Crego & Yvon 2009, 2010a, 2010b). This system is now released as open source software (see the Ncode web pages and Crego et al 2012); an online demo is also available. As an alternative training strategy, we have recently proposed a CRF-based translation model (Lavergne et al. 2011; 2013).
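The two-step process can be pictured with the following toy sketch. It is an illustration under strong assumptions, not N-code itself: a handful of explicitly enumerated reorderings stands in for the word lattice, the bilingual units are single word pairs scored by a hand-written bigram table, and the feature weights are fixed by hand rather than tuned. All names and numbers are hypothetical.

```python
import math
from itertools import permutations

# Hypothetical one-to-one lexicon defining the bilingual units (tuples).
LEXICON = {"la": "the", "maison": "house", "bleue": "blue"}

# Hypothetical bigram model over bilingual units: log p(unit | previous unit).
BIGRAM_LOGPROB = {
    ("<s>", ("la", "the")): math.log(0.9),
    (("la", "the"), ("bleue", "blue")): math.log(0.6),
    (("bleue", "blue"), ("maison", "house")): math.log(0.7),
    (("la", "the"), ("maison", "house")): math.log(0.3),
    (("maison", "house"), ("bleue", "blue")): math.log(0.05),
}
UNSEEN_LOGPROB = math.log(1e-4)  # crude stand-in for smoothing

# Feature weights; in a real system these would be tuned on held-out data.
WEIGHTS = {"tuple_lm": 1.0, "reordering": 0.5}


def reorderings(words, max_jump=1):
    """Step 1: enumerate source reorderings where no word moves more than max_jump."""
    for perm in permutations(range(len(words))):
        if all(abs(pos - src) <= max_jump for pos, src in enumerate(perm)):
            yield [words[src] for src in perm]


def score(order, original):
    """Weighted sum of feature values for one reordered source sentence."""
    history, lm = "<s>", 0.0
    for unit in ((w, LEXICON[w]) for w in order):
        lm += BIGRAM_LOGPROB.get((history, unit), UNSEEN_LOGPROB)
        history = unit
    moved = sum(1 for a, b in zip(order, original) if a != b)
    features = {"tuple_lm": lm, "reordering": -float(moved)}
    return sum(WEIGHTS[name] * value for name, value in features.items())


def translate(sentence):
    """Step 2: translate each reordering monotonically and keep the best-scoring one."""
    words = sentence.split()
    best = max(reorderings(words), key=lambda order: score(order, words))
    return " ".join(LEXICON[w] for w in best)


if __name__ == "__main__":
    print(translate("la maison bleue"))
```

Because all reordering decisions are made on the source side, the translation step itself never changes word order, which is what allows it to be modeled as an n-gram sequence over bilingual units.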
Our activities are not restricted to these core modules of SMT systems, and we are investigating many other aspects of SMT, such as tuning (Sokolov & Yvon, 2011; Wisniewski & Yvon 2013), multi-source machine translation (Crego & al 2010a, 2010b), evaluation of MT (Max & al 2010, Wisniewski & al, 2010), confidence estimation for MT (Wisniewski et al 2012, 2013, 2014), WSD in SMT (Apidianaki et al, 2011), extraction of parallel sentences from comparable corpora (Braham-Ghabiche & al 2011), etc.
Finally, activities in SMT are closely related to the work carried out on language modeling, a theme to which LIMSI has been contributing for many years. A major recent contribution is the work on neural network language models, initiated in (Gauvain & Schwenk, 2002), and recently revisited in (Le & al, 2010, 2011, 2012).
Our research activities are conducted in close relationship with several academic and industrial partners in the context of several national and international projects. A partial list of these projects is given below.
LIMSI's systems have taken part in several international MT evaluation campaigns. This includes a yearly participation in the WMT evaluation series (2006-2014), where LIMSI has consistently been amongst the top ranking systems, especially when translation into French is concerned. We have also taken part in the 2009 NIST MT evaluation for the Arabic-English task, as well as the IWSLT evaluations in 2010 and 2011.
LIMSI has recently been actively involved in the organization of various scientific events: EAMT 2010 in St Raphaël and IWSLT 2010 in Paris, as well as the Tralogy series. A. Allauzen has launched the series of ACL workshops on learning representations (2013 in Sofia, 2014 in Gothenburg). F. Yvon is again chairing the scientific committee of IWSLT 2014 in Lake Tahoe.
The LIMSI system performed best in the SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking for English.
SMT@LIMSI > Publications
The full list of publications
SMT@LIMSI > People
If you would like to join us, do not hesitate to send us your CV: we are always looking for good PhD candidates or postdoctoral research associates.
Visitors and collaborators
SMT@LIMSI > Recent Seminars
They have visited LIMSI in the past, so why don't you? If you are interested and happen to be visiting Paris, just drop us an e-mail!
SMT@LIMSI > Projects
SMT@LIMSI > Software and Demos
Last modified: Friday,17-April-15 18:57:41 CEST