
Contents of /OpenMaTrEx/trunk/README

Revision 271
Tue May 24 11:00:46 2011 UTC (6 years, 7 months ago) by mikel
File size: 12313 byte(s)
changed references to 0.97.1 to 0.98
#-------------------------------------------------------------------------------
# This file is part of OpenMaTrEx: a marker-driven corpus-based machine
# translation system.
#
# Copyright (c) 2004-2011 Dublin City University
# (c) 2004-2007 Steve Armstrong, Yvette Graham, Nano Gough, Declan Groves,
# Yanjun Ma, Nicolas Stroppa, John Tinsley, Andy Way, Bart Mellebeek
# (c) 2010-2011 Sandipan Dandapat, Mikel L. Forcada, Declan Groves, John
# Tinsley, Sergio Penkale, Andy Way
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>
#-------------------------------------------------------------------------------

Running OpenMaTrEx


1 INTRODUCTION

A shell script (OpenMaTrEx) wraps a call to make on a Makefile
provided with the distribution; the target given to the script is
passed on to make. Targets are available to initialise and filter the
training set, to call the chunker/tagger and the aligner, to train a
target language model with IRSTLM, to run GIZA++ and Moses training
jobs, to merge marker-based chunk pairs with Moses 'phrase' pairs, to
run MERT (minimum-error-rate training) tuning jobs, to execute the
decoders, and to evaluate their results using BLEU and NIST.

Some of these targets build an XML configuration file (usually called
OpenMaTrEx.ini) which is used to pass information to classes
such as main/Chunk, main/Align, and main/Decode.

Support for higher-level ready-made options for the most usual
training and translation jobs is currently being added.

OpenMaTrEx may be run in two modes:

* 'Marclator mode' does not make use of the Moses decoder but instead
uses a left-to-right monotone recombinator ("decoder") that was
released with Marclator (http://www.OpenMaTrEx.org/marclator/).

* 'MaTrEx mode' merges Moses 'phrase' pairs with aligned chunk pairs
and uses Moses itself as a decoder.


2 HOW TO RUN

What follows is a sketch of the procedure for running OpenMaTrEx.

To simplify, in the documentation that follows the command used to
launch the system is referred to as "OpenMaTrEx".

To use the examples that follow, replace "OpenMaTrEx" by the actual
command you need to run, for example
"/home/mikel/OpenMaTrEx-test/OpenMaTrEx-0.98/OpenMaTrEx".

2.1 Training

Here is the training sequence. Unless explicitly stated otherwise, a
step is common to Marclator and MaTrEx modes.

1. Create a working directory. In this directory, create a
subdirectory 'orig' to store your training and test data: train.S and
train.T, and testset.S and testset.T, where S denotes the source
language and T the target language. If you plan to use MERT to
optimize the Moses weights (in MaTrEx mode), you will also need a
devset.S file and a devset.T file. Each source file and its target
counterpart must have the same number of lines (one sentence per
line).

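The layout and the line-count requirement of step 1 can be checked
before the system is initialised. The following is a minimal sketch in
plain shell and is not part of OpenMaTrEx; the function names
setup_workdir and check_parallel are hypothetical.

```shell
# Hypothetical helpers, not shipped with OpenMaTrEx.

# setup_workdir DIR: create the working directory and its 'orig' subdirectory.
setup_workdir() {
    mkdir -p "$1/orig"
}

# check_parallel SRC TGT: succeed only if both files are readable and
# have the same number of lines (one sentence per line).
check_parallel() {
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "cannot read $1 or $2" >&2
        return 2
    fi
    src_lines=$(( $(wc -l < "$1") ))
    tgt_lines=$(( $(wc -l < "$2") ))
    if [ "$src_lines" -ne "$tgt_lines" ]; then
        echo "line count mismatch: $1 has $src_lines, $2 has $tgt_lines" >&2
        return 1
    fi
}
```

For a French to English task one might run setup_workdir on the
working directory, copy the data into orig/, and then call
check_parallel on orig/train.fr and orig/train.en (and likewise on the
testset and devset pairs) before moving on to step 2.
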
NOTE: OpenMaTrEx has been tested on tokenized and lowercased files
containing neither zero-word sentences nor overly long ones. To
tokenize a file, you may use the script tokenizer/tokenizer.perl in the
Moses scripts directory scripts-YYYYMMDD-HHMM. To filter sentences for
length, you may use the file training/clean-corpus-n.perl, which may
also be used to tokenize them at the same time.


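If the Moses scripts are not at hand, zero-word and overly long
sentence pairs can also be dropped with standard tools. The sketch
below is an alternative under stated assumptions, not a
reimplementation of clean-corpus-n.perl: the function name filter_pairs
and the default limit of 80 words are hypothetical, and sentences are
assumed to contain no tab characters.

```shell
# Hypothetical helper, not shipped with OpenMaTrEx or Moses.
# filter_pairs SRC TGT OUT_SRC OUT_TGT [MAX_WORDS]
# Keeps a sentence pair only if both sides have between 1 and MAX_WORDS
# whitespace-separated words. Assumes no tab characters in the input.
filter_pairs() {
    max=${5:-80}          # assumed default limit, adjust as needed
    : > "$3"              # make sure both output files exist
    : > "$4"
    paste "$1" "$2" | awk -F '\t' -v max="$max" -v out_s="$3" -v out_t="$4" '
        {
            ns = split($1, s, " ")   # number of source words
            nt = split($2, t, " ")   # number of target words
            if (ns >= 1 && nt >= 1 && ns <= max && nt <= max) {
                print $1 > out_s
                print $2 > out_t
            }
        }'
}
```

Both output files stay parallel because a pair is either written to
both files or to neither.
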
2. When these files are ready, type

    ./OpenMaTrEx SL="S" TL="T" ENCODING="UTF-8"

to initialize the system (one can also easily invoke the Moses filtering
scripts to remove certain sentence pairs, etc.).

To get more detailed information about the work of the chunker, the
aligners, etc., you can add DEBUG="YES" like this:

    ./OpenMaTrEx SL="S" TL="T" ENCODING="UTF-8" DEBUG="YES"


3. Then, your corpus probably needs to be filtered:

    ./OpenMaTrEx filter

This creates a filtered directory containing the filtered training data.
Note that you should put your testset.S and testset.T in this directory.


4. In MaTrEx mode, we need to train a target language model for the Moses
decoder:

    ./OpenMaTrEx language_model

Currently OpenMaTrEx uses IRSTLM with modified Kneser-Ney smoothing
and 5-gram models.


5. Run the source- and target-language marker-based chunker/tagger
using the following commands respectively:

    ./OpenMaTrEx marker-based_chunking_source

and

    ./OpenMaTrEx marker-based_chunking_target

This will create a directory chunked which contains the
marker-based chunked files train.S and train.T. Source and target
chunking can be run in parallel.


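Since source and target chunking can be run in parallel, the two
targets of step 5 may be launched as background jobs and waited for.
This is a minimal sketch; the variable OPENMATREX (the path to the
wrapper script) and the function name chunk_in_parallel are
assumptions introduced here for illustration.

```shell
# OPENMATREX is an assumed variable holding the command to run;
# it defaults to ./OpenMaTrEx as used in the steps above.
chunk_in_parallel() {
    "${OPENMATREX:-./OpenMaTrEx}" marker-based_chunking_source &
    src_pid=$!
    "${OPENMATREX:-./OpenMaTrEx}" marker-based_chunking_target &
    tgt_pid=$!

    # Wait for both jobs and remember each exit status.
    src_status=0
    tgt_status=0
    wait "$src_pid" || src_status=$?
    wait "$tgt_pid" || tgt_status=$?

    # Succeed only if both chunkers succeeded.
    [ "$src_status" -eq 0 ] && [ "$tgt_status" -eq 0 ]
}
```

Checking both exit statuses avoids moving on to alignment when only
one side has been chunked.
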
6. Run Moses training up to step 5 (phrase extraction) if you are going
to execute OpenMaTrEx in MaTrEx mode:

    ./OpenMaTrEx moses_training_steps FIRST_STEP=1 LAST_STEP=5

or up to step 4 if you are going to execute OpenMaTrEx in
Marclator mode:

    ./OpenMaTrEx moses_training_steps FIRST_STEP=1 LAST_STEP=4



7. Align the marker-based chunked files using the marker-based
aligner. Currently there are several entry points for different
aligners (e.g. ebmt_alignments, ebmt_alignments_on_disk,
ebmt_alignments_on_disk_with_id, ebmt_alignments_with_context).

You should use

    ./OpenMaTrEx ebmt_alignments_on_disk

if you are going to execute OpenMaTrEx in Marclator mode or

    ./OpenMaTrEx ebmt_alignments_on_disk_with_id

if you are going to execute OpenMaTrEx in MaTrEx mode. These aligners use
two distances: one based on lexical probabilities as computed by GIZA++
and Moses, and a cognate distance that takes into account the similarity
of the words. One can add a third distance that favours the alignment of
chunks when the markers have the same part of speech. The corresponding
commands would be

    ./OpenMaTrEx ebmt_alignments_on_disk_with_tags

for Marclator mode and

    ./OpenMaTrEx ebmt_alignments_on_disk_with_tags_with_id

for MaTrEx mode.

Any of these creates the ebmt_alignments directory with the chunk-aligned
file in the desired format.


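The choice among the four on-disk aligner targets of step 7 depends
only on the mode and on whether the part-of-speech distance is wanted.
The mapping can be sketched as a small helper; the function name
aligner_target and the mode labels "marclator" and "matrex" are
hypothetical and exist only in this example.

```shell
# Hypothetical helper: print the aligner target for a given mode.
# aligner_target MODE [WITH_TAGS]
#   MODE       marclator | matrex   (labels assumed for this sketch)
#   WITH_TAGS  yes | no (default no): add the part-of-speech distance
aligner_target() {
    mode=$1
    with_tags=${2:-no}
    target=ebmt_alignments_on_disk
    if [ "$with_tags" = yes ]; then
        target=${target}_with_tags
    fi
    case $mode in
        marclator) ;;                          # no ID needed
        matrex)    target=${target}_with_id ;; # Moses format needs IDs
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    printf '%s\n' "$target"
}
```

For instance, ./OpenMaTrEx "$(aligner_target matrex yes)" would run
the ebmt_alignments_on_disk_with_tags_with_id target.
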
8. Only in MaTrEx mode, we then combine the aligned chunks and Moses
phrase pairs into a single file so that their combined counts can
be used for calculating relative frequencies. To do this, the
aligned Marclator chunks must be in the format expected by Moses
(with ID). This involves assigning a word alignment to each chunk
pair according to the aligned.grow-diag-final file. We also create
a Moses-style reordering model in this fashion. This can be carried
out by running the following command:

    ./OpenMaTrEx merge_ebmt_with_moses

This is not needed in Marclator mode.



9. Only in MaTrEx mode, we have to execute steps 6 to 9 of Moses training:

    ./OpenMaTrEx moses_training_steps FIRST_STEP=6 LAST_STEP=9


10. Optionally, in MaTrEx mode, one can add to each "phrase" pair an
extra feature that takes one of two values, 0 if the pair has
been found by Moses and 1 if the pair is an aligned marker-based
chunk pair. This feature is added to those that may be optimized
using MERT.

    ./OpenMaTrEx add_ebmt_feature

11. In MaTrEx mode, MERT may be used to optimize the weights
corresponding to each feature.

    ./OpenMaTrEx moses_mert

12. Optionally, phrase tables and reordering tables may be binarized.
Binarizing the phrase table and reordering models helps decrease
memory usage, as only the phrase pairs that are needed for each
sentence are read from file into memory. Decoding may however be a
bit slower. Command:

    ./OpenMaTrEx binarize

[The Makefile contains examples of higher-level targets that may be
edited to simplify the whole training run.]


2.2 Translating

13. In Marclator mode, translate the file testset.S with the naive
decoder, using the marker-based chunk alignments and the word
alignments obtained in the previous steps:

    ./OpenMaTrEx ebmt_decode

This creates the ebmt-results directory. The translation of testset.S is
found in the file testset.S.translated.T


14. In MaTrEx mode, optionally, one can filter the 'phrase' table
against the testset to speed up decoding during evaluation:

    ./OpenMaTrEx filter_model


15. In MaTrEx mode, we must then run the Moses decoder to use the
combined phrase table.

    ./OpenMaTrEx moses_decode


2.3 Evaluation

16. To check the output quality of the translation both in Marclator
mode and in MaTrEx mode, evaluate the output by invoking the following
command:

    ./OpenMaTrEx eval_output

This creates a folder 'eval' containing the results of evaluation
using BLEU and NIST scores. If Meteor is installed, Meteor results will
also be reported.


3 EXAMPLE

Sample files may be found in the examples/ directory of the OpenMaTrEx
package.

The examples/ directory contains sample files for a French (fr) to
English (en) translation task. They correspond to sections of the
Europarl corpus (http://www.statmt.org/europarl/). The authors of the
website say they are not aware of any copyright restrictions on the
material, but this should be checked in future versions, as there
might be restrictions on redistributing them under the OpenMaTrEx
license. A README file in the examples/ directory describes the files.

3.1 Marclator mode

To perform French to English translation in Marclator mode, copy the
examples/ directory as orig/ in your working directory and invoke the
following steps for training:

    OpenMaTrEx SL="fr" TL="en" ENCODING="UTF-8" init

    OpenMaTrEx filter

    OpenMaTrEx marker-based_chunking_source

    OpenMaTrEx marker-based_chunking_target

    OpenMaTrEx moses_training_steps FIRST_STEP=1 LAST_STEP=4

    OpenMaTrEx ebmt_alignments_on_disk

Then to translate:

    OpenMaTrEx ebmt_decode

Filtering and evaluation for Marclator mode have not been implemented yet.

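The Marclator sequence above can be collected in a small function that
stops at the first failing step. This is only a sketch: the variable
OPENMATREX (the actual command to run) and the function name
run_marclator are assumptions made for this example.

```shell
# OPENMATREX is an assumed variable holding the actual command to run.
OPENMATREX=${OPENMATREX:-./OpenMaTrEx}

# run_marclator SL TL: train and translate in Marclator mode.
# The && chain stops as soon as one step returns a non-zero status.
run_marclator() {
    sl=$1
    tl=$2
    "$OPENMATREX" SL="$sl" TL="$tl" ENCODING="UTF-8" init &&
    "$OPENMATREX" filter &&
    "$OPENMATREX" marker-based_chunking_source &&
    "$OPENMATREX" marker-based_chunking_target &&
    "$OPENMATREX" moses_training_steps FIRST_STEP=1 LAST_STEP=4 &&
    "$OPENMATREX" ebmt_alignments_on_disk &&
    "$OPENMATREX" ebmt_decode
}
```

With the example data in orig/, calling run_marclator fr en from the
working directory would run the seven commands listed above in order.
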
3.2 MaTrEx mode

To perform the same translation in MaTrEx mode, the steps are:

    OpenMaTrEx SL="fr" TL="en" ENCODING="UTF-8" init

    OpenMaTrEx filter

    OpenMaTrEx language_model

    OpenMaTrEx marker-based_chunking_source

    OpenMaTrEx marker-based_chunking_target

    OpenMaTrEx moses_training_steps FIRST_STEP=1 LAST_STEP=5

    OpenMaTrEx ebmt_alignments_on_disk_with_id

    OpenMaTrEx merge_ebmt_with_moses

    OpenMaTrEx moses_training_steps FIRST_STEP=6 LAST_STEP=9

    [optional] OpenMaTrEx add_ebmt_feature

    [optional] OpenMaTrEx moses_mert

    [optional] OpenMaTrEx filter_model

    OpenMaTrEx moses_decode

Then, to evaluate:

    OpenMaTrEx eval_output

A shell script, sample-run.sh, is provided which exemplifies a series of
three MaTrEx mode runs and may be used to prepare similar jobs.

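The optional steps of the MaTrEx sequence can be switched on and off
from the environment. The sketch below is independent of sample-run.sh
and does not reproduce it; the variables OPENMATREX, WITH_EBMT_FEATURE,
WITH_MERT and WITH_FILTER and the function name run_matrex are
hypothetical names chosen for this example.

```shell
# OPENMATREX is an assumed variable holding the actual command to run.
OPENMATREX=${OPENMATREX:-./OpenMaTrEx}

# run_matrex SL TL: train, translate and evaluate in MaTrEx mode.
# Optional steps run only when the (hypothetical) variables
# WITH_EBMT_FEATURE, WITH_MERT or WITH_FILTER are set to "yes".
run_matrex() {
    sl=$1
    tl=$2
    "$OPENMATREX" SL="$sl" TL="$tl" ENCODING="UTF-8" init &&
    "$OPENMATREX" filter &&
    "$OPENMATREX" language_model &&
    "$OPENMATREX" marker-based_chunking_source &&
    "$OPENMATREX" marker-based_chunking_target &&
    "$OPENMATREX" moses_training_steps FIRST_STEP=1 LAST_STEP=5 &&
    "$OPENMATREX" ebmt_alignments_on_disk_with_id &&
    "$OPENMATREX" merge_ebmt_with_moses &&
    "$OPENMATREX" moses_training_steps FIRST_STEP=6 LAST_STEP=9 || return 1

    if [ "${WITH_EBMT_FEATURE:-no}" = yes ]; then
        "$OPENMATREX" add_ebmt_feature || return 1
    fi
    if [ "${WITH_MERT:-no}" = yes ]; then
        "$OPENMATREX" moses_mert || return 1
    fi
    if [ "${WITH_FILTER:-no}" = yes ]; then
        "$OPENMATREX" filter_model || return 1
    fi

    "$OPENMATREX" moses_decode &&
    "$OPENMATREX" eval_output
}
```

For example, WITH_MERT=yes run_matrex fr en would include the MERT
tuning step but skip the extra feature and the model filtering.
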
3.3 Word packing

Word packing obtains advanced word alignments which may be used on their own
or to improve phrase extraction and, subsequently, machine translation
results.

Before running word packing, a complete SMT training job consisting of
steps 1, 2, 4, 6 and 9 has to be run.

The Makefile offers two targets for word packing:

3.3.1 word_packing_4_mt

This obtains advanced word alignments through word packing and then uses
them to obtain a better machine translation system.

Using this option, you need to make sure you have both training and
development data and that you have completed a run in MaTrEx mode.

Run it using

    OpenMaTrEx word_packing_4_mt

3.3.2 word_packing_4_alignment (not tested yet)

This obtains advanced word alignments through word packing and then
evaluates them against a gold standard. Using this option, you need to make
sure you have a gold standard word alignment in the following format:

    00001 1 1
    00001 3 2
    00001 2 3
    00001 4 4
    00001 5 5
    00001 6 5
    00001 7 6
    00001 8 6
    00002 2 1
    00002 4 1

where the first number is the sentence identifier and the second and third
numbers are the indices of the aligned words. Run it using

    OpenMaTrEx word_packing_4_alignment NO_GS_SENT=your_number_of_gold_standard_sentences GS_FILE=your_gold_standard_file

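A gold standard file in the format above can be sanity-checked, and its
sentences counted, with awk. This is a minimal sketch: the function
name check_gold_standard is hypothetical, and it is an assumption that
NO_GS_SENT is meant to be the number of distinct sentence identifiers
in the file.

```shell
# Hypothetical helper, not shipped with OpenMaTrEx.
# check_gold_standard FILE: fail if a non-empty line is not made of
# three whitespace-separated non-negative integers; otherwise print
# the number of distinct sentence identifiers (first column).
check_gold_standard() {
    awk '
        NF == 0 { next }     # ignore empty lines
        NF != 3 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf("line %d is malformed: %s\n", NR, $0) > "/dev/stderr"
            bad = 1
            next
        }
        !($1 in seen) { seen[$1] = 1; n++ }
        END {
            if (bad) exit 1
            print n + 0
        }
    ' "$1"
}
```

Under that assumption, the printed count could be passed as
NO_GS_SENT and the checked file as GS_FILE.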