Friday, December 07, 2007

Translating Wikipedia articles (2)

As I said yesterday, I am coming back to this topic today.

Apertium is already used in some projects, one of which is the Occitan Wikipedia. For those who are not familiar with wikis: a wiki lets you compare the version of a page that has not yet been proofread with the proofread version, and you can see such a comparison by clicking here.

What you see on the left-hand side is the text as it was after the machine translation, and on the right-hand side the proofread version of the text. The changes are highlighted in green on the left and in blue on the right. Some parts of the text were not changed at all.
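If you would like to make a similar comparison on your own texts, here is a minimal sketch in Python. It is not how the wiki computes its diff; it simply uses the standard `difflib` module to compare two texts word by word. The function name `word_changes` and the sample sentences are made up for illustration.

```python
import difflib


def word_changes(machine_text, proofread_text):
    """Compare two texts word by word.

    Returns a list of (tag, old_words, new_words) for every span that
    differs, and the share of machine-translated words left unchanged.
    """
    old = machine_text.split()
    new = proofread_text.split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)

    changes = []
    unchanged = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Words the proofreader did not touch.
            unchanged += i2 - i1
        else:
            # tag is "replace", "delete" or "insert".
            changes.append((tag, old[i1:i2], new[j1:j2]))

    share = unchanged / len(old) if old else 1.0
    return changes, share


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the Occitan Wikipedia.
    changes, share = word_changes(
        "the cat sit on the mat",
        "the cat sits on the mat",
    )
    for tag, old_words, new_words in changes:
        print(tag, old_words, "->", new_words)
    print(f"unchanged share: {share:.2f}")
```

The share of unchanged words gives a rough idea of how much post-editing a machine translation needed, although it says nothing about how serious each individual change was.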

The work on the glossary and the grammar rules (I am deliberately not using the specialist terminology here, so that everyone can follow) has been going on for approximately one year now.

At a certain stage the problems arise from missing vocabulary and not so much from the rules. Of course these translations will probably never be 100% perfect, but the quality depends very much on us and on how much terminology we add and classify.

Compared with the result above, what you would see for Spanish-Catalan is much better, because that language pair has been under development for years.

You can find further reading about co-operation between Wikipedia and Apertium on the Apertium Wiki.

The language pairs currently available are:

  • Spanish←→Catalan
  • Spanish←→Galician
  • Spanish←→Portuguese (pt and pt_BR)
  • Catalan←→English
  • Catalan←→French
  • Catalan←→Occitan (oc and oc@aran)
  • Romanian→Spanish

Many other language pairs are under development. Of course, you may start on any language combination that you are comfortable with. Please keep in mind: the more similar two languages are, the easier it is to program the rules and the sooner the translation engine will produce good translations.

If you want to start working on wordlists, please write to me at: s.cretella (at) and tell me which language pair you are interested in. You can also reach me on Skype at: sabinecretella

I will upload a wordlist to Google Docs and give you access. Please let me know if you have difficulty working online (for example, if you use a dial-up connection).

The Apertium Chat is on Freenode.

One more thing: I have just received the criticism that machine translation would flatten the language. Well, any translated text, in particular a literary translation, is post-edited by a second person. A translation is never published directly, because during translation, even if you are the best translator in the world, there are always some bits and pieces that sound a little strange or that do not really carry the scene over into the other culture. Please allow me to introduce here the concept of cultural localization, which will be explained in a future post. The term was coined by Dr. Martin Benjamin, who is a member of the advisory board of Vox Humanitatis, and the concept immediately became part of the scope of the association.

And since I am adding notes here: please remember that the Wikimedia Foundation fundraiser is still running, and that you can help by donating and by telling others about it. For more information and to donate, please click here.