
Contents of /OpenMaTrEx/trunk/TODO

Revision 249
Thu Jan 27 16:57:21 2011 UTC by mikel
File size: 8609 byte(s)
more info about word packing
#-------------------------------------------------------------------------------
# This file is part of OpenMaTrEx: a marker-driven corpus-based machine
# translation system.
#
# Copyright (c) 2004-2010 Dublin City University
# (c) 2004-2007 Steve Armstrong, Yvette Graham, Nano Gough, Declan Groves,
# Yanjun Ma, Nicolas Stroppa, John Tinsley, Andy Way, Bart Mellebeek
# (c) 2010-2011 Sandipan Dandapat, Mikel L. Forcada, Declan Groves, John Moran,
# Sergio Penkale, Andy Way, Yanjun Ma
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>
#-------------------------------------------------------------------------------

2011.01.27

OPENMATREX TO-DO, TO-EXPLORE AND MISCELLANEOUS ITEMS LIST

DIFFERENT WAYS TO INVOKE OPENMATREX COMPONENTS

Only one of the ways of invoking some of the Java components in the system
is currently being used: the one that uses a configuration file (class
learning/ConfigFile.java) and the Reflection API. Code for the other
one (class learning/CommandLineLearningContext.java) is included but not
used yet.

WRITING DOCUMENTATION

Many classes (such as aligners/HmmChunksAligner.java) are barely
mentioned and not documented; these classes have not been tested and
may not even work. There is a lot of work to be done here.

MAKEFILE

Some lines were added to the Makefile to invert the first two fields of
ts_words.probs in the alignment stanzas (see ALIGNERS below).

The addition of higher-level Makefile targets to perform typical
MaTrEx-mode and Marclator-mode training and testing jobs should be
finalized. Targets should check for dependencies of certain tasks on
preceding tasks, as is usual in Makefiles.

John Moran started to write a Makefile.MacOS for Marclator
(http://www.openmatrex.org/marclator/), but this has to be redone so that
the OpenMaTrEx shell can also run on a Mac. Sarah Ebling is working on that.

ALIGNERS

A number of problems were detected in the original MaTrEx
aligners. Workarounds are in place to solve them:

* Of the two lexical probability files st_words.probs and
  ts_words.probs, the aligners, as called by the original Makefile,
  only used ts_words.probs. This file, which derives from the Moses
  file lex.0-0.f2e, has lines with the structure "t s p(t|s)", where t
  is a target word, s is a source word, and p(t|s) is the lexical
  probability of t given s. This can be easily confirmed by checking
  that sum_t p(t|s) is 1 for each s. However, as may easily be checked
  in class aligners/WordProbsChunkDistance.java by using some
  debugging code (which is commented out), the system assumes that the
  structure of the file is "s t p(t|s)". As a result, it looks up the
  words of the source chunk in the target column of ts_words.probs and
  the words of the target chunk in the source column, and therefore it
  only succeeds when the source and target chunks contain proper
  nouns, numbers, etc.; most of the time it does not find anything.
  After inverting the first two fields of ts_words.probs to turn it
  into the format assumed by aligners/WordProbsChunkDistance.java
  (there is code in the Makefile to do so), many more matches are
  found, and many of them make complete sense: translations of a word
  in the source chunk are found in the target chunk.

* Alignment jobs using the targets ebmt_alignments and
  ebmt_alignments_on_disk generate a train.aw file which is useless
  for the naïve decoder (see NAÏVE DECODER below). It contains what
  look like negative logarithms of probabilities, and some lines
  appear to be repeated.

* There are currently some ebmt_alignments* targets that have not been
  tested and may have to be removed.

* Some of the aligners and chunk distances in java/src/aligners have
  not been tested (see WRITING DOCUMENTATION above).

* At the moment only the default constructor of class
  aligners/CongnateChunkDistance.java is called by the
  EditDistanceChunksAligner; it contains two hard-coded thresholds,
  one for the Levenshtein distance (10.0) and one for the cognate
  distance metric (0.3). Word pairs that have a Levenshtein distance
  below the first threshold and a cognate distance score above the
  second are counted as cognates. These thresholds need to be
  parameterised so that the Makefile can either use user-defined
  thresholds or estimate relative thresholds (e.g. based on the
  relative length of the word pairs for which the cognate distance is
  being calculated).

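The field inversion and the normalization check described in the first
item above can be sketched in a few lines of Python. This is only an
illustration, not the code that is in the Makefile: the function names
are hypothetical, the toy data are invented, and the input is assumed to
have exactly three whitespace-separated fields per line.

```python
from collections import defaultdict


def swap_first_two_fields(lines):
    """Turn "t s p(t|s)" lines into "s t p(t|s)" lines."""
    swapped = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        target, source, prob = fields
        swapped.append(f"{source} {target} {prob}")
    return swapped


def unnormalized_sources(lines, tolerance=1e-3):
    """Return {s: sum_t p(t|s)} for every s whose sum is not close to 1.

    Expects lines that are already in "s t p(t|s)" order.
    """
    totals = defaultdict(float)
    for line in lines:
        source, _target, prob = line.split()
        totals[source] += float(prob)
    return {s: total for s, total in totals.items()
            if abs(total - 1.0) > tolerance}


if __name__ == "__main__":
    # Toy data, invented for illustration, in the original "t s p(t|s)" order.
    ts_lines = [
        "house casa 0.8",
        "home casa 0.2",
        "dog perro 1.0",
    ]
    st_lines = swap_first_two_fields(ts_lines)
    print(st_lines)
    print(unnormalized_sources(st_lines))
```

If the second function reports many source words, the file is probably
not in the assumed order (or is not a p(t|s) table at all).
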
NAÏVE DECODER

An experimental interface to the decoder (class main/Decode) and a
decoding target in the Makefile were already working in
Marclator. They use ts_words.probs as the align.aw file that the
decoder expects. Initial tests yield interesting results, but more
tests are needed.

A target to filter the files train.aw and train.ac against the test
file may be convenient to avoid loading large tables into Java
memory. There is already some code to do so in tools/ (filter_aw.py
and filter_tt.py).

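The kind of filtering meant above could look like the following sketch.
It does not reproduce tools/filter_aw.py; the function names are
hypothetical, and it assumes (this file does not say so) that each line
of the word table starts with the source word and that the test file is
already tokenized on whitespace.

```python
def vocabulary_of(sentences):
    """Collect the set of word forms that occur in the test sentences."""
    vocab = set()
    for sentence in sentences:
        vocab.update(sentence.split())
    return vocab


def filter_word_table(table_lines, vocab):
    """Keep only the table lines whose first field occurs in vocab.

    Assumption: the first whitespace-separated field is the source word.
    """
    kept = []
    for line in table_lines:
        fields = line.split()
        if fields and fields[0] in vocab:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Toy data, invented for illustration.
    test_sentences = ["la casa", "el perro"]
    table = ["casa house 0.8", "casa home 0.2", "gato cat 1.0"]
    for line in filter_word_table(table, vocabulary_of(test_sentences)):
        print(line)
```

Only the entries that can actually be used on the test set are kept, so
the table loaded by the decoder shrinks with the size of the test file.
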
MARKER FILES, CHUNKERS

Some of the marker files are very preliminary and should be
reviewed. Marker files use very loosely defined parts of speech: for
instance, they assume that "aquel" (es) is a determiner and cannot be
a pronoun, or that "cuatrocientos" (es) is a determiner even though it
can be preceded by another determiner ("los cuatrocientos"). It would
be worth seeing what the improvement would be if more "linguistic"
thought were devoted to defining markers more carefully (for instance,
to use tag-based distances to align chunks). Better documentation
about how to write marker files should also be produced (see
java/marker_files/NOTES-on-markers.txt). The format of marker files
should also be reconsidered: an XML format could be handled with the
XML processing code that already exists.

Some information is currently hard-coded into some files. For
instance, one may have to change data/Tag.java to deal with new
languages or with a more refined set of markers; some of this material
should be moved to some sort of configuration file.

Indeed, data/Tag.java uses repetitive three-line code stanzas to
populate two Maps (why two?) and this could be improved. The code
could perhaps be simplified to avoid ".hashCode()". Perhaps there is
no need either to convert between two different representations of the
tagset.

chunker/MarkerBasedChunker.java: the current chunker processes a
sentence by building regular expressions that define the chunking on
the fly, presumably at runtime, after reading the marker
files. One-pass finite-state chunking techniques could be explored
because they are likely to be faster and easier to specify.

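As an illustration of what a one-pass chunker might look like, here is
a minimal sketch. It is not the algorithm in
chunker/MarkerBasedChunker.java: the function name is hypothetical, the
marker set is a toy one, and the rule used (a marker word opens a new
chunk only if the current chunk already contains a non-marker word) is
an assumption made for the example.

```python
def marker_chunks(tokens, markers):
    """Split tokens into chunks in a single left-to-right pass.

    A marker word starts a new chunk, but only when the chunk being
    built already holds at least one non-marker word, so runs of
    consecutive markers stay together.
    """
    chunks = []
    current = []
    has_content_word = False
    for token in tokens:
        is_marker = token.lower() in markers
        if is_marker and has_content_word:
            chunks.append(current)
            current = []
            has_content_word = False
        current.append(token)
        if not is_marker:
            has_content_word = True
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Toy marker set, invented for illustration.
    toy_markers = {"the", "on"}
    print(marker_chunks("the dog sat on the mat".split(), toy_markers))
```

The whole chunker is a two-state machine (content word seen or not seen
in the current chunk), so no regular expressions have to be built after
reading the marker files.
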
RECASING

De-casing and re-casing targets should be added to be able to tackle
"real text" translation tasks.

BUILDING PROCEDURE

The installation procedure could be greatly improved. Currently, the
INSTALL file specifies a number of manual install operations that
could be automated using some of the available standard build
tools. In particular, we should explore using Maven instead of
Ant so that adequate versions of other Java software such as args4j
are obtained at compile time if needed.

SPURIOUS WARNINGS

Target word_packing_4_mt throws two sets of spurious warnings that
should be ignored. First, probably because targets of the Makefile
execute make on that same Makefile again, many warnings like these
may appear:

  [...] Makefile:NN: warning: overriding commands for target `init'
  [...] Makefile:NN: warning: ignoring old commands for target `init'

Second, due to an unnecessary check when the word-packing shells call
"OpenMaTrEx init ...", the following warning is produced:

  Warning: using orig/train.en to train target language model

This warning is unnecessary, since the target language model is taken
(symbolically linked) from the output of the regular SMT training run
that is necessary before word-packing.

We are working to solve these warnings.
