Thursday, December 06, 2007

Translating Wikipedia articles ...

... into less resourced languages. Well, the time has come to start thinking about how to speed up content creation for the many small Wikipedias. As you all know, we often have just a handful of people creating, translating and then adapting articles. By combining various Open Source and Open Content projects we can now go a step further towards fast content creation. But that does not mean stub uploads. This is a completely different way of doing things.

Apertium is a machine translation tool that works really well with closely related languages. About a year ago I had a text translated from Spanish to Catalan by Apertium through the online interface (http://xixona.dlsi.ua.es/apertium/) and asked some people from the Catalan Wikipedia to have a look at it. They told me that of course it was not perfect, but that it would be easy to proofread and much faster than translating from scratch. In March I ran a similar test during a master's course in translation studies in Pisa. I asked one of the students, who was bilingual in Spanish and Catalan, to have a look at the machine translation of a general text. The grammar was almost perfect, and so was the terminology: there were just 5 corrections in a bit more than half a page (A4).

Now, what does this mean for us? If we have a bilingual wordlist for two similar languages under a free license, we can pass it on to the Apertium people. From there we are a step closer to getting machine translation for that specific language combination on its way.
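To give you a rough idea of what such a wordlist turns into: in Apertium, translation pairs live in a bilingual dictionary in XML format, where each entry maps a source-language word to its target-language equivalent together with part-of-speech tags. The entry below is only an illustrative sketch (the Spanish-Catalan word pair is my own example, not taken from an actual Apertium dictionary):

```xml
<!-- Illustrative entry for an Apertium bilingual dictionary (.dix).
     <l> holds the left-language (Spanish) side, <r> the right-language
     (Catalan) side; <s n="n"/> tags both sides as nouns. -->
<e><p><l>perro<s n="n"/></l><r>gos<s n="n"/></r></p></e>
```

A plain bilingual wordlist is essentially the raw material for many such entries; the tags and the rest of the markup are what still needs to be added.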

One note in between for the Apertium people who might read this: please don't mind me not using specific terminology to describe what needs to be done. It could become too techy.

So the next step is to identify what each term is and how it needs to be handled. For example, a verb needs to be declared as such, and then it needs a tag that indicates which conjugation scheme applies. This needs doing for all word types: verbs, nouns, adjectives and so on. After that, grammar rules need to be considered. Step by step the level of correctness will improve, and the time invested in completing the wordlists (which will be available as a Google Docs spreadsheet) and adding all the additional information will save a lot of time later. That is: it takes longer now, but once the engine has "learnt" how to deal with the terminology and grammar of that specific language combination, creating content will become much faster. This will help the small projects because the few editors can concentrate on proofreading and adapting, and it will result in faster content growth of quite high quality.
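To make the tagging step a bit more concrete: in Apertium's monolingual dictionaries, a word is not listed with all its forms spelled out. Instead the entry points to a paradigm that encodes the conjugation or declension pattern, so all regular verbs of one class can share a single definition. A sketch of such an entry (the verb and the paradigm name are illustrative, following Apertium's usual naming conventions rather than copied from a real dictionary):

```xml
<!-- Illustrative Apertium monolingual dictionary entry.
     The verb "cantar" keeps the stem "cant" and inherits all its
     conjugated forms from the paradigm "cant/ar__vblex", i.e. the
     pattern for regular -ar verbs. -->
<e lm="cantar"><i>cant</i><par n="cant/ar__vblex"/></e>
```

This is why declaring the word type and its conjugation scheme is so valuable: one tag on the entry gives the engine the whole inflection table for free.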

This project, which is going to care for less resourced languages, will be one of the first led by Vox Humanitatis. Should you be interested in helping with the wordlists, please let us know which language combination you would like to work on (we are starting from English right now, and step by step from other languages, since most of the terminology is there in English). We will give you access to the online document. If you need to work offline, please let us know. You can contact me by e-mail: s.cretella (at) voxhumanitatis.org

I just received a list of the supported language combinations, as well as an example for Catalan-Occitan and some notes on evaluating machine translation in co-operation with a Wikipedia community. This means I have quite a bit more to tell you. I'll post that info tomorrow, otherwise this post would become too long.

Please also note that the documents will be released under a CC-BY license, so they can be integrated into any Wiktionary.