Medical Translation Insight: Assessing quality in machine translation - ForeignExchange Translations

Assessing quality in machine translation

How do you measure the quality of a translation if you don't understand the translated text?

Kirti Vashee gave his answer in a recent guest post on the Tomedes Blog.

Kirti advocates the use of BLEU. Essentially, the BLEU metric measures how many words and word sequences (n-grams) in the machine translation output also appear in a human reference translation, giving more credit to longer matching sequences. BLEU scores a translation on a 0 to 1 scale. The higher the score (i.e., the closer to 1), the more overlap there is with the reference translation and thus, the metric assumes, the better the translation is.
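
To make the idea concrete, here is a minimal sketch of a BLEU-style score in Python. It is an illustration under simplifying assumptions, not the implementation used by any particular MT toolkit: it compares one candidate sentence against a single reference, splits on whitespace, and applies no smoothing. The function names (ngrams, modified_precision, simple_bleu) are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) found in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def simple_bleu(candidate, reference, max_n=4):
    """Single-reference, sentence-level BLEU sketch without smoothing."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean is zero if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "the patient should take the tablet twice a day".split()
candidate = "the patient should take the pill twice a day".split()
print(simple_bleu(reference, reference))
print(simple_bleu(candidate, reference))
```

A candidate identical to the reference scores 1.0. Swapping the single word "tablet" for "pill" drops the score to roughly 0.6, even though a human reader might consider the two sentences equivalent, because every n-gram that contains the changed word stops matching.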

He is not minimizing the challenge of measuring translation quality:

"Anybody who has tried to measure translation quality will understand the difficulty of doing this in a way that has any general credibility. Developers of statistical machine translation systems, in particular have to grapple with this issue on a constant basis to understand how to evolve the state of the technology."

And while there may be a role for the BLEU metric, it clearly has shortcomings as well. Kirti's post, a recent academic paper [PDF link], and a newspaper article all highlight serious problems and limitations of the BLEU approach.

What's interesting is that all of these discussions are moving away from the goal of having "perfect" output for machine translation tools. Academics, computational linguists and software engineers are doing a nice job resetting expectations to "good enough" translation output.

ForeignExchange Translations provides specialized medical translation services to the world's leading pharmaceutical and medical device companies.


    Tapani Ronni said...
    This BLEU approach is pretty worthless in languages like Finnish, where words have endings depending on the context; nouns have 16 different cases. This system would fail to recognize inflected words as actually being the same word. I have tried two different English-Finnish machine translation systems and they are bad, especially with technical texts.
    Kirti said...
    There is a lot more discussion from academics and practitioners in the LinkedIn group on Automated Translation. You can follow the discussion thread here if you are a member of the Automated Language Translation group:


    I admit that BLEU is quick and dirty, but flawed as it is, it is also consistent and relatively cheap.

    There are two new measures that are gaining more momentum: METEOR and TERp which come closer to human evaluations and address some of the shortcomings of BLEU.

    I disagree with your characterization that these measures are driving a "good enough" quality acceptance.

    The real value of these measures is to help improve MT (SMT in particular) and enable 10X to 100X the amount of content to be translated, reducing the lack of information access that much of the world faces today if they do not speak a dominant language.

    There is ongoing research
    Jaakko said...
    The BLEU score acts more like a document similarity score than a translation quality score, because it is mainly concerned with the overlap of n-grams rather than their meaning. I believe it is useful when automatic MT systems are being developed, but one should not think that it directly measures translation quality as humans think of it.

    There are also a lot of other automatic evaluation metrics, such as DP, ULC(h), DR, SR, meteor-ranking/baseline, posbleu, pos4gram, F-measure, svm-rank, mbleu, BADGER, etc.

    We have recently evaluated (and submitted a paper about) a score called NCD (Normalized Compression Distance), which is a general similarity measure and clustering tool for binary strings. It is simple and more theoretically justified than, for instance, BLEU, but performs roughly at the same level as BLEU and METEOR.
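
For readers unfamiliar with NCD, a minimal sketch in Python follows. It is not the commenter's implementation. It assumes the commonly cited definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes, and it assumes zlib as the compressor; the choice of compressor and the function names here are illustrative only.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two strings.

    Values near 0 suggest the strings share most of their content;
    values near 1 suggest they share very little.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


reference = "the patient should take the tablet twice a day"
close = "the patient should take the pill twice a day"
unrelated = "quarterly revenue rose sharply after the merger closed in march"
print(ncd(reference, close))
print(ncd(reference, unrelated))
```

Because the score depends only on how well the two texts compress together, it needs no tokenizer or n-gram counting. On very short strings the compressor's fixed overhead dominates, so values for single sentences should be read with caution.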
    Arabic Editor said...
    I have seen and tried many machine translations on different subjects and all I can say is "a waste of time and ridiculous". I am a translator of Arabic, English, French and other language pairs. My recommendation is to stay away from ridiculously low rates for translations and if machine translation is carried out, it is imperative to have a human linguist in the target language carry out a comprehensive edit/proof of the output.


(c) Copyright 2010, ForeignExchange Translations, Inc.