Monday, March 12, 2007

When Open Source, Machine Translation and Computer Assisted Translation meet

Nice title, right? Well, what scenarios are possible here? Where is the potential? What are the problems?

OK, so let's start: there is more translation work in Open Source and Open Content than humans can handle right now. Everything we need is probably already available and just needs some adjustments and additional features.

Imagine software like Mozilla Firefox that needs both its UI and its manual translated. The UI should be translated using a CAT tool, keeping in mind some basic facts: translators are not programmers, and each language is very different - this must be accounted for somewhere (and yes, I know they are working on it). Then take the manual and the help files: they need translation and, over time, updates. A workflow I could imagine is:

1) Machine translation of the manual/help files using Apertium
2) Alignment of the source and target texts, which are then loaded into OmegaT for proofreading and for translation of the parts that came out completely wrong.
3) Creation of the final documents + feedback to Apertium
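The alignment in step 2 could be sketched roughly like this - a minimal, naive sentence aligner that simply pairs source and target sentences one-to-one so they can be reviewed side by side in a CAT tool. Real aligners handle sentence splits and merges; all function names here are my own illustration, not part of OmegaT or Apertium:

```python
import re

# Naive sentence alignment sketch: pair source and target sentences
# positionally. Only a sketch - real aligners are much smarter.

def split_sentences(text):
    """Very rough sentence splitter on '.', '!' and '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def align(source_text, target_text):
    """Pair sentences one-to-one; anything left over is unaligned."""
    src = split_sentences(source_text)
    tgt = split_sentences(target_text)
    pairs = list(zip(src, tgt))
    leftovers = src[len(pairs):] + tgt[len(pairs):]
    return pairs, leftovers

pairs, leftovers = align(
    "The window opens. Click OK.",
    "La finestra s'obre. Feu clic a OK.",
)
for s, t in pairs:
    print(s, "|", t)
```

Each resulting pair is one segment the proofreader reviews; mismatched counts (the leftovers) are exactly the "out of order" parts that need manual translation.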

The next time the translation process starts, it would go like this:
1) Pretranslation of the manual/help files using OmegaT (sentences that remain the same receive 100% matches).
2) Machine translation of the new parts using Apertium
3) Proof reading within OmegaT
4) Creation of final documents + feedback to Apertium
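The pretranslation in step 1 works off the translation memory built during the first pass: sentences that are already in the memory get their stored translation (a "100% match"), and only the new parts go on to machine translation and review. A minimal sketch, with purely illustrative names:

```python
# Pretranslation sketch: look each sentence up in the translation
# memory; exact hits become 100% matches, the rest is left for MT.

def pretranslate(sentences, memory):
    """Return (source, translation, origin) triples."""
    result = []
    for s in sentences:
        if s in memory:
            result.append((s, memory[s], "TM 100%"))
        else:
            result.append((s, None, "needs MT"))
    return result

memory = {"Click OK.": "Feu clic a OK."}
rows = pretranslate(["Click OK.", "A brand new sentence."], memory)
for src, tgt, origin in rows:
    print(origin, "|", src, "->", tgt)
```

This is why updates get cheaper over time: only the sentences flagged "needs MT" cost any human effort in the second round.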

For this to work, Apertium needs to be programmed for the required language pairs - Apertium's approach is to build machine translation directly from one language to another. As you can see when you try out the Spanish-Catalan pair, which is already very mature, you get a factually correct translation where only about 4-5% needs manual changes.

Some time ago I also showed this to a translator who works professionally with these languages and to a Wikipedian. Both confirmed that the factual translation is correct, so it is well suited for manuals and encyclopaedic entries. Of course, if you put a machine-translated article online, you need to mark it as such until it has been proofread.

Of course, some software-specific terminology may still be missing from its engine ... well, that needs to be added. For now Apertium has its own dictionaries for this, but we would very much like to see the content created within OmegaWiki. The reason is simple: on the one hand, people who need Apertium to work better and better will create their dictionaries there; on the other hand, their work can also be reused for other purposes, such as spell checkers, offline bilingual dictionaries, dictionaries for the OLPC laptop, other software localization projects, etc. So the work is done only once and then reused. If we get various projects to work together, all of them will get better results with less work. This means their time is used more effectively and they can do more good in the same time.

As for OmegaWiki, there is one feature we are still missing, but it is already planned: inflections. Inflections are needed for Apertium, for spell checkers and for various other applications. So again: co-operation among projects becomes a means of making sure they move ahead and survive in the future.
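To make the reuse idea concrete, here is one way inflection data might be stored once and consumed by several projects: a paradigm maps grammatical features to affixes, and a single lemma entry can then feed a machine translator, a spell-checker word list, or a dictionary. This is purely my illustration, not OmegaWiki's actual data model:

```python
# Sketch: one inflection paradigm, reusable by several applications.
# A paradigm maps feature tuples to affixes attached to the lemma.

PARADIGMS = {
    "en-regular-noun": {("sg",): "", ("pl",): "s"},
}

def inflect(lemma, paradigm, features):
    """Build a surface form from a lemma and a feature tuple."""
    return lemma + PARADIGMS[paradigm][features]

def all_forms(lemma, paradigm):
    """Every form of a lemma - e.g. as a spell-checker word list."""
    return [inflect(lemma, paradigm, f) for f in PARADIGMS[paradigm]]

print(all_forms("laptop", "en-regular-noun"))  # → ['laptop', 'laptops']
```

The point is that the paradigm is entered once; whether a consumer wants a single inflected form (Apertium) or the full list (a spell checker) is just a different query over the same data.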

One last note: please don't think only of our everyday languages, where you can normally find people to co-operate with (and where we really should value their time investment for what it is actually worth - exactly this consideration and respect is often missing). Think also of all those rare, under-resourced, non-governmental languages - or better, language entities (just to introduce this term) - that would be helped a lot by such ways of working.