Contents of /OpenMaTrEx/trunk/ABOUT

Revision 244
Thu Jan 27 16:35:42 2011 UTC by mikel
File size: 19159 byte(s)
Added wordpacking support for version 0.97
6 July 2010

We describe OpenMaTrEx, a free/open-source (FOS) example-based machine
translation (EBMT) system based on the marker hypothesis. It comprises
a marker-driven chunker, a collection of chunk aligners, and two
engines: one based on the simple proof-of-concept monotone
recombinator (previously released as Marclator,
http://www.openmatrex.org/marclator/) and a Moses-based decoder (Koehn
et al. 2007, http://www.sf.net/projects/mosesdecoder/). OpenMaTrEx is
a FOS version of the basic components of MaTrEx, the Dublin City
University machine translation (MT) system (Stroppa and Way 2006,
Stroppa et al. 2006). Much of the code in OpenMaTrEx is written in
Java, although many important tasks are performed in a variety of
scripting languages.

The architecture of OpenMaTrEx is the same as that of a baseline
MaTrEx system (Stroppa and Way 2006, Stroppa et al. 2006); like MaTrEx,
it can wrap around the Moses statistical decoder, using a hybrid
translation table containing marker-based chunks as well as
statistically extracted phrase pairs.

OpenMaTrEx has been released as a FOS package so that MaTrEx
components which have successfully been used (Groves and Way 2006,
Hassan et al. 2007, Tinsley et al. 2008) may be combined with
components from other FOS machine translation (FOSMT) toolkits such as
Cunei (Phillips and Brown 2009, http://www.cunei.org), Apertium
(Tyers, Forcada and Ramírez-Sánchez 2009, http://www.apertium.org),
etc. (for a longer list of FOSMT systems, see
http://www.fosmt.info). Indeed, using components released in
OpenMaTrEx, researchers have previously: used statistical models to
rerank the results of recombination (Groves and Way 2006); used
aligned, marker-based chunks in an alternative decoder which uses a
memory-based classifier (van den Bosch, Stroppa and Way 2007);
combined the marker-based chunkers with rule-based components
(Sánchez-Martínez, Forcada and Way 2009); and used the chunker to
filter out Moses phrases for linguistic motivation (Sánchez-Martínez
and Way 2009).

This file is organized as follows. Section 2 describes the principles
of training and translation in OpenMaTrEx; section 3 describes the
EBMT-specific components in OpenMaTrEx; section 4 describes word
packing; section 5 lists the software requirements and briefly
explains how to run the available components. Concluding remarks
follow.

2.1 Training

Training with OpenMaTrEx may be performed in two different modes. In
MaTrEx mode:

1. Each example sentence in the sentence-aligned source text and its
   counterpart in the target training text are divided into
   subsentential segments using a marker-based /chunker/. For
   instance, the English sentence

      That delay was intended to allow time for the other institutions
      to consider our amendments and for the Commission to formulate
      amended proposals.

   would be chunked as

      that delay was intended ||| to allow time ||| for the other
      institutions ||| to consider ||| our amendments ||| and for the
      commission ||| to formulate amended proposals

   Chunks may optionally be tagged according to the tag of the marker
   words used to delimit them (and tags may be used to guide the
   alignment process), hence the name chunker/chunk tagger (from now
   on, simply "chunker").

2. A complete Moses--GIZA++ training run is performed up to step 5
   (phrase extraction). Moses is used to learn a maximum likelihood
   lexical translation table and to extract phrase-pair tables.

3. The subsentential chunks are aligned using one of the aligners
   provided (using, among other information, probabilities generated
   by GIZA++).

4. Aligned chunk pairs from step 3 are merged with the phrase pairs
   generated by Moses in step 2 (more details in section 3).

From then on, training proceeds as a regular Moses job from step 6
(scoring) onwards. MERT (Och 2003) may be used on a development set
for tuning.

In Marclator mode (see below), the last two steps are not necessary
and Moses is only run up to step 4.

2.2 Translation

Translation, like training, may be performed in two ways:

* Marclator mode uses a monotone ("naïve") decoder (released as part
  of Marclator): each source sentence is run through the marker-based
  chunker; the most probable translations for each chunk are
  retrieved, along with their weights; if no chunk translations are
  found, the decoder backs off to the most likely translations for
  words (as aligned by GIZA++) and concatenates them in the same
  order; when no translation is found, the source word is left
  untranslated. This decoder has obvious limitations, but it is fast
  and likely to be of most use for closely related language pairs.

* MaTrEx mode is, however, the usual way to use OpenMaTrEx; that is,
  the Moses decoder is run on a merged phrase table, as in Stroppa
  and Way (2006) and Stroppa et al. (2006).
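
As a rough illustration of the back-off behaviour of the monotone decoder
described above, the following Python sketch translates chunk by chunk and
keeps the source order. It is not the Marclator code (which is written in
Java): the function name, the two dictionary-shaped tables and the choice of
keeping only the single best candidate are assumptions made for this example.

```python
def translate_monotone(chunks, chunk_table, word_table):
    """Translate chunk by chunk, keeping the source order.

    chunk_table maps a source chunk to {target chunk: weight};
    word_table maps a source word to {target word: probability}.
    Both structures are hypothetical stand-ins for the real tables.
    """
    output = []
    for chunk in chunks:
        candidates = chunk_table.get(chunk)
        if candidates:
            # Keep the highest-weighted chunk translation.
            output.append(max(candidates, key=candidates.get))
            continue
        # Back off to word-for-word translation in the same order.
        for word in chunk.split():
            options = word_table.get(word)
            # Unknown words are left untranslated.
            output.append(max(options, key=options.get) if options else word)
    return " ".join(output)
```

A chunk found in the chunk table is translated as a unit; otherwise each of
its words is looked up separately and unknown words are copied through.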

3.1 Chunker

The main chunker/chunk tagger (class chunkers/MarkerBasedChunker, base
class chunkers/Chunker) is based on Green's "marker hypothesis" (Green
1979), which states that the syntactic structure of a language is
marked at the surface level by a set of marker (closed-category) words
or morphemes. In English, markers are predominantly right-facing, and
it is therefore a left-marking language (for instance, determiners or
prepositions mark the start of noun phrases or prepositional phrases,
respectively); there are also right-marking languages such as
Japanese, with left-facing markers. OpenMaTrEx deals with
left-marking languages: a chunk starts where a marker word appears,
and must contain at least one non-marker (content, open-category)
word. In addition to marker words, punctuation is used to delimit the
end of chunks. The chunker includes a tokenizer and lowercaser. There
is also another chunker, chunkers/SillyChunker, which just divides the
sentence into three-word chunks, using no markers.
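
The chunking rule for left-marking languages can be sketched in a few lines
of Python. This is only an illustration, not the code of
chunkers/MarkerBasedChunker: the tiny marker set below is an assumption
chosen to reproduce the example sentence of section 2.1, and attaching a
marker-only remainder to the previous chunk is our own simplification.

```python
import re

# Toy marker set (an assumption); real marker files are much larger.
MARKERS = {"that", "to", "for", "the", "our", "and"}


def tokenize(sentence):
    """Lowercase the sentence and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def chunk(tokens, markers=MARKERS):
    """Split tokens into chunks that start at a marker word.

    A new chunk is opened at a marker only if the current chunk already
    holds a non-marker word; punctuation closes the current chunk.
    """
    chunks, current, has_content = [], [], False

    def flush():
        nonlocal current, has_content
        if current:
            if has_content or not chunks:
                chunks.append(current)
            else:
                # Marker-only remainder: attach it to the previous chunk.
                chunks[-1].extend(current)
        current, has_content = [], False

    for tok in tokens:
        if not re.match(r"\w", tok):  # punctuation ends the chunk
            flush()
            continue
        if tok in markers and has_content:
            flush()
        current.append(tok)
        if tok not in markers:
            has_content = True
    flush()
    return [" ".join(c) for c in chunks]
```

Joining the result of chunk(tokenize(...)) with " ||| " for the example
sentence of section 2.1 gives the chunking shown there. Runs of consecutive
markers such as "and for the" stay in one chunk because no content word has
been seen yet when the second and third markers arrive.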

3.1.1 Marker files

Currently OpenMaTrEx provides marker files for Catalan, Czech, English,
Portuguese, Spanish, French, Italian, Irish and German. Marker files specify
one marker word or punctuation mark per line: its surface form, its category
and (optionally) its subcategory. A typical marker word file contains a few
hundred entries.
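
To make the three-field layout concrete, here is a minimal Python sketch of a
reader for such lines. It assumes that the fields are separated by
whitespace, which this file does not state, and the category labels used in
the usage note are invented for the example; check the marker files in the
distribution for the actual layout and tag names.

```python
def parse_marker_lines(lines):
    """Map each surface form to a (category, subcategory) pair.

    Assumes whitespace-separated fields (an assumption, not a documented
    format); the subcategory is None when the line has only two fields.
    """
    markers = {}
    for raw in lines:
        fields = raw.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        surface, category = fields[0], fields[1]
        subcategory = fields[2] if len(fields) > 2 else None
        markers[surface.lower()] = (category, subcategory)
    return markers
```

For instance, the hypothetical lines "the DET" and "he PRON SUBJ" would yield
the entries ("DET", None) and ("PRON", "SUBJ") respectively.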

3.2 Chunk aligners

There are different chunk aligners (base class aligners/ChunkAligner)
in Marclator:

* The class aligners/EditDistanceChunksAligner aligns chunks using a
  regular Levenshtein edit distance with costs (base class
  aligners/ChunkDistance) specified by aligners/CombinedChunkDistance,
  which combines at runtime the component costs listed by the user in
  the configuration file used to call it, such as

  - aligners/NumCharsChunkDistance (based on the number of characters
    in the chunk),

  - aligners/NumWordsChunkDistance (based on the number of words in
    the chunk),

  - aligners/TagBasedChunkDistance (based on the tags assigned to each
    chunk by the chunker/tagger),

  - aligners/WordProbsChunkDistance (see below), and

  - aligners/CognateChunkDistance (which calculates cognate distance
    based on a combination of the Levenshtein distance, the longest
    common subsequence ratio and the Dice coefficient).

  A combination of the latter two is customarily used (for more
  details on alignment strategies, see Stroppa and Way 2006). Given
  that no substantial improvement was obtained by modifying these
  weights (Stroppa and Way 2006), the code uses equal weights for all
  component costs specified.

* aligners/EditDistanceWJChunksAligner is a version of
  aligners/EditDistanceChunksAligner that uses a modified edit
  distance with /jumps/ or block movements (Stroppa and Way 2006).

* Other aligners such as aligners/HmmChunksAligner,
  aligners/MostProbableChunkAligner, aligners/GreedyChunksAligner, and
  aligners/SillyChunksAligner are also available for experimentation.
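
The idea behind the edit-distance aligner (a Levenshtein alignment over
sequences of chunks, with a substitution cost obtained by averaging several
component costs with equal weights) can be sketched as follows. This is an
illustrative Python sketch, not the Java classes named above: the two
component distances are simplified stand-ins for the length-based costs, and
the fixed gap cost of 1.0 is an assumption.

```python
def num_words_distance(src, tgt):
    """Relative difference in number of words (simplified stand-in)."""
    a, b = len(src.split()), len(tgt.split())
    return abs(a - b) / max(a, b, 1)


def num_chars_distance(src, tgt):
    """Relative difference in number of characters (simplified stand-in)."""
    a, b = len(src), len(tgt)
    return abs(a - b) / max(a, b, 1)


def combined_distance(src, tgt, components):
    """Average the component costs with equal weights."""
    return sum(d(src, tgt) for d in components) / len(components)


def align_chunks(src_chunks, tgt_chunks, components, gap_cost=1.0):
    """Return (i, j) index pairs of chunks aligned on a minimum-cost path."""
    n, m = len(src_chunks), len(tgt_chunks)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_cost
    for j in range(1, m + 1):
        cost[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = combined_distance(src_chunks[i - 1], tgt_chunks[j - 1], components)
            cost[i][j] = min(cost[i - 1][j - 1] + sub,   # align the two chunks
                             cost[i - 1][j] + gap_cost,  # skip a source chunk
                             cost[i][j - 1] + gap_cost)  # skip a target chunk
    # Trace the path back from the bottom-right corner.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        sub = combined_distance(src_chunks[i - 1], tgt_chunks[j - 1], components)
        if cost[i][j] == cost[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + gap_cost:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs
```

Passing a different list of component functions changes the substitution
cost without touching the alignment code, which mirrors the way the component
costs are listed by the user at runtime.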

Since aligners/WordProbsChunkDistance uses word translation
probabilities calculated using the GIZA++ statistical aligner
(http://www.fjoch.com/GIZA++.html) and scripts available as part of
the Moses MT engine (http://www.statmt.org/moses/), these
free/open-source packages need to be installed in advance (see

3.3 Translation table merging

To run the system in MaTrEx mode, marker-based chunk pairs are merged
with phrase pairs from alternative resources (here, Moses
phrases). First, each chunk pair is assigned a word alignment based
on the refined GIZA++ alignments, for example

   please show me ||| por favor muéstreme ||| 0-0 0-1 1-2 2-2

In cases where there is no word alignment for a particular chunk pair
according to GIZA++, the chunk pair is discarded. Using these word
alignments, we additionally extract a phrase orientation-based
lexicalised reordering model à la Moses (Koehn et al. 2005). Finally,
we may also limit the maximum length of the chunk pairs that will be
used. The resulting set of chunk pairs is in the same format as the
phrase pairs extracted by Moses. The next step is to combine the
chunk pairs with the Moses phrase pairs. In order to do this, the two
sets of chunk/phrase pairs are merged into a single file. Moses
training is then carried out from step 6 (scoring), which calculates
the required scores for all feature functions, including the
reordering model, based on the combined counts. A binary feature
distinguishing EBMT chunks from SMT chunks may be added for subsequent
MERT optimization (Srivastava et al. 2009).
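
The filtering and merging steps can be illustrated with a short Python
sketch that reads lines in the "source ||| target ||| alignment" layout shown
above. It is not the merging script shipped with OpenMaTrEx: the function
names, the default maximum length of 7 and the in-memory tuple representation
(with a final 1/0 flag standing in for the binary EBMT/SMT feature) are
assumptions made for the example.

```python
def parse_pair(line):
    """Split 'source ||| target ||| i-j ...' into its three fields."""
    src, tgt, align = (field.strip() for field in line.split("|||"))
    links = [tuple(int(k) for k in link.split("-")) for link in align.split()]
    return src, tgt, links


def merge_tables(chunk_lines, phrase_lines, max_len=7):
    """Merge filtered chunk pairs with phrase pairs, tagging their origin."""
    merged = []
    for line in chunk_lines:
        src, tgt, links = parse_pair(line)
        if not links:
            continue  # no word alignment: the chunk pair is discarded
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue  # over the (assumed) maximum chunk length
        merged.append((src, tgt, links, 1))  # 1 marks an EBMT chunk pair
    for line in phrase_lines:
        src, tgt, links = parse_pair(line)
        merged.append((src, tgt, links, 0))  # 0 marks an SMT phrase pair
    return merged
```

In the real pipeline the merged pairs are written to a single file and scored
by Moses; the sketch stops at the point where both sets share one list.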

4.1 Word packing

The current version of OpenMaTrEx incorporates "word packing" (Ma, Stroppa
and Way 2007), a simple method to pack words for statistical word alignment.
Word packing simplifies the task of automatic word alignment by packing
several consecutive words together when they are believed to correspond to a
single word in the opposite language. This is done using the word aligner
itself, i.e. by bootstrapping on its output, as follows: (i) a word aligner
is used to obtain 1-to-n alignments; (ii) candidates for word packing are
extracted; (iii) the reliability of these candidates is estimated; (iv) the
groups of words to pack are replaced by a single token in the parallel
corpus; (v) the alignment process is repeated using the updated corpus.
The first three steps are performed in both directions, and produce two
bilingual dictionaries (source-target and target-source) of groups of words
to pack. The resulting alignments are combined and word-level alignments are
reconstructed by unpacking the packed tokens, ready for phrase extraction.
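
Steps (ii) to (iv) can be sketched for one direction in Python as follows.
This is an illustration only: the function names are hypothetical, and the
reliability estimate is replaced here by a plain frequency threshold, which
is a simplifying assumption rather than the measure used by Ma, Stroppa and
Way (2007).

```python
from collections import Counter


def extract_candidates(sentences, alignments):
    """Count groups of consecutive target words aligned to one source word.

    sentences is a list of (source tokens, target tokens) pairs;
    alignments is a parallel list of sets of (i, j) word links.
    """
    counts = Counter()
    for (_, tgt), links in zip(sentences, alignments):
        by_src = {}
        for i, j in links:
            by_src.setdefault(i, []).append(j)
        for js in by_src.values():
            js.sort()
            # Keep 1-to-n links whose target positions are consecutive.
            if len(js) > 1 and js[-1] - js[0] == len(js) - 1:
                counts[tuple(tgt[j] for j in js)] += 1
    return counts


def select_reliable(counts, min_count=2):
    """Stand-in for reliability estimation: keep frequent groups, longest first."""
    kept = (group for group, c in counts.items() if c >= min_count)
    return sorted(kept, key=len, reverse=True)


def pack(tokens, groups, joiner="_"):
    """Replace each occurrence of a selected group by a single token."""
    out, k = [], 0
    while k < len(tokens):
        for group in groups:
            if tuple(tokens[k:k + len(group)]) == group:
                out.append(joiner.join(group))
                k += len(group)
                break
        else:
            out.append(tokens[k])
            k += 1
    return out
```

After packing, the aligner would be run again on the updated corpus (step v),
and the packed tokens would be split on the joiner to recover word-level
alignments.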

5.1 Required software

OpenMaTrEx requires the installation of the following software: GIZA++,
Moses, IRSTLM (Federico and Cettolo 2007), and Kohsuke Kawaguchi's args4j
command-line argument processor. Refer to the INSTALL file that comes with
the distribution for details.

Optionally, a set of auxiliary scripts for corpus preprocessing
(http://homepages.inf.ed.ac.uk/jschroe1/how-to/scripts.tgz) and the Meteor
evaluation software (Lavie and Agarwal 2007,
http://www.cs.cmu.edu/~alavie/METEOR/) may be installed.

5.2 Installing OpenMaTrEx itself

OpenMaTrEx may be built by invoking ant or an equivalent tool on the
build.xml provided. The resulting OpenMaTrEx.jar contains all the
relevant classes, some of which will be invoked using a shell,
OpenMaTrEx (see below).

5.3 Running

A shell (OpenMaTrEx) has options to initialise the training, testing and
development sets, to call the chunker and the aligner, to train a target
language model with IRSTLM, to run GIZA++ and Moses training jobs, to merge
marker-based chunk pairs with Moses phrase pairs, to run MERT optimization
jobs, to execute the decoders, and to evaluate the output. Future versions
will contain higher-level ready-made options for the most usual training and
translation jobs. For detailed instructions on how to perform complete
training and translation jobs in both MaTrEx and Marclator mode, see the
README file. Test files will be provided in the examples directory of the
OpenMaTrEx package.

We have presented OpenMaTrEx, a FOS EBMT system including a
marker-driven chunker (with marker word files for a few languages),
chunk aligners, a simple monotone recombinator, and a wrapper around
Moses so that it can be used as a decoder for a merged translation
table containing Moses phrases and marker-based chunk
pairs. OpenMaTrEx releases the basic components of MaTrEx, the Dublin
City University machine translation system, under a FOS license, so
that they are easily accessible to researchers and developers of MT
systems.

As for future work, version 1.0 will contain, among other
improvements, better marker files and an improved install procedure;
further versions will release additional MaTrEx components as
free/open-source software. See also the TODO file.

The original MaTrEx code on which OpenMaTrEx is based was developed by,
among others, S. Armstrong, Y. Graham, N. Gough, D. Groves, H. Hassan, Y. Ma,
N. Stroppa, J. Tinsley, A. Way, and B. Mellebeek. We especially thank Y.
Graham and Y. Ma for their advice. P. Pecina helped with Czech markers.
Jimmy O'Regan helped with Irish markers. Sarah Ebling built the German
marker file. M.L. Forcada's sabbatical stay at Dublin City
University is supported by Science Foundation Ireland (SFI) through ETS
Walton Award 07/W.1/I1802 and by the Universitat d'Alacant (Spain). Support
from SFI through grant 07/CE/I1142 is acknowledged.

Armstrong, Stephen, Marian Flanagan, Yvette Graham, Declan Groves,
Bart Mellebeek, Sara Morrissey, Nicolas Stroppa, and Andy Way. 2006.
Matrex: Machine translation using examples. In /TC-STAR Open-Lab
Workshop on Speech Translation/, Trento, Italy.

Federico, M. and M. Cettolo. 2007. Efficient handling of n-gram
language models for statistical machine translation. In /ACL 2007:
Proceedings of the Second Workshop on Statistical Machine
Translation/, June 23, 2007, Prague, Czech Republic, pages 88-95.

Gough, N. and A. Way. 2004. Robust large-scale EBMT with
marker-based segmentation. In /Proceedings of the Tenth Conference
on Theoretical and Methodological Issues in Machine Translation
(TMI-04)/, pages 95-104, Baltimore, MD.

Green, T. 1979. The necessity of syntax markers: Two experiments
with artificial languages. /Journal of Verbal Learning and Verbal
Behavior/, 18:481-496.

Groves, D. and A. Way. 2005. Hybrid example-based SMT: the best of
both worlds? /Building and Using Parallel Texts: Data-Driven
Machine Translation and Beyond/, 100:183.

Groves, D. and A. Way. 2006. Hybridity in MT: Experiments on the
Europarl corpus. In /Proceedings of the 11th Annual Conference of
the European Association for Machine Translation (EAMT-2006)/,
pages 115-124.

Hassan, H., Y. Ma, and A. Way. 2007. MaTrEx: the DCU
machine translation system for IWSLT 2007. In /Proceedings of the
International Workshop on Spoken Language Translation/, pages
69-75.

Koehn, P., A. Axelrod, A.B. Mayne, C. Callison-Burch, M. Osborne,
and D. Talbot. 2005. Edinburgh system description for the
2005 IWSLT speech translation evaluation. In /Proceedings of IWSLT 2005/.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch,
Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen,
Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra
Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for
statistical machine translation. In /Annual Meeting of the
Association for Computational Linguistics (ACL), demonstration
session/, Prague, Czech Republic, June 2007.

Lavie, A. and A. Agarwal. 2007. METEOR: An automatic metric for MT
evaluation with high levels of correlation with human judgments.
In /Proceedings of the Workshop on Statistical Machine Translation at
the 45th Annual Meeting of the Association of Computational Linguistics
(ACL-2007)/, Prague, June 2007.

Ma, Yanjun, Nicolas Stroppa, and Andy Way. 2007. Bootstrapping word
alignment via word packing. In /Proceedings of the 45th Annual Meeting of
the Association for Computational Linguistics (ACL 2007)/, Prague, Czech
Republic, pages 304-311.

Och, F.J. 2003. Minimum error rate training in statistical machine
translation. In /Proceedings of the 41st Annual Meeting on
Association for Computational Linguistics/, Volume 1, pages
160-167.

Phillips, Aaron B. and Ralf D. Brown. 2009. Cunei machine
translation platform: System description. In Mikel L. Forcada and
Andy Way, editors, /Proceedings of the 3rd International Workshop
on Example-Based Machine Translation/, pages 29-36, November.

Sánchez-Martínez, Felipe, Mikel L. Forcada, and Andy Way. 2009.
Hybrid rule-based - example-based MT: Feeding Apertium with
sub-sentential translation units. In Mikel L. Forcada and Andy
Way, editors, /Proceedings of the 3rd Workshop on Example-Based
Machine Translation/, pages 11-18, Dublin, Ireland, November.

Sánchez-Martínez, Felipe and Andy Way. 2009. Marker-based
filtering of bilingual phrase pairs for SMT. In /Proceedings of
EAMT-09, the 13th Annual Meeting of the European Association for
Machine Translation/, pages 144-151, Barcelona, Spain.

Srivastava, A., S. Penkale, D. Groves, and J. Tinsley. 2009.
Evaluating syntax-driven approaches to phrase extraction for MT.
In /Proceedings of the 3rd International Workshop on Example-Based
Machine Translation/, pages 19-28, Dublin, Ireland.

Stroppa, N., D. Groves, A. Way, and K. Sarasola. 2006. Example-based
machine translation of the Basque language. In /Proceedings of AMTA
2006/, pages 232-241.

Stroppa, N. and A. Way. 2006. MaTrEx: DCU machine translation
system for IWSLT 2006. In /Proceedings of the International
Workshop on Spoken Language Translation/, pages 31-36.

Tinsley, J., Y. Ma, S. Ozdowska, and A. Way. 2008. MaTrEx: the DCU
MT system for WMT 2008. In /Proceedings of the Third Workshop on
Statistical Machine Translation/, pages 171-174. Association for
Computational Linguistics.

Tyers, Francis M., Mikel L. Forcada, and Gema Ramírez-Sánchez. 2009.
The Apertium machine translation platform: Five years on. In F.
Sánchez-Martínez, J.A. Pérez-Ortiz and F.M. Tyers, editors, /Proceedings
of the First International Workshop on Free/Open-Source Rule-Based
Machine Translation/, pages 3-10, November.

van den Bosch, A., N. Stroppa, and A. Way. 2007. A memory-based
classification approach to marker-based EBMT. In /Proceedings of
the METIS-II Workshop on New Approaches to Machine Translation/,
pages 63-72, Leuven, Belgium.

Way, A. and N. Gough. 2005. Comparing example-based and
statistical machine translation. /Natural Language Engineering/,
11(03):295-309.

Mikel L. Forcada