(this assignment is inspired by and based on HW3 in JHU's 2012 mt-class)
Machine translation systems are typically evaluated through relative ranking. For instance, given the following German sentences:
Die Prager Börse stürzt gegen Geschäftsschluss ins Minus.
Nach dem steilen Abfall am Morgen konnte die Prager Börse die Verluste korrigieren.
...One machine translation system produces this output:
The Prague stock exchange throws into the minus against closing time.
After the steep drop in the morning the Prague stock exchange could correct the losses.
...And a second machine translation system produces this output:
The Prague stock exchange risks geschäftsschluss against the down.
After the steilen waste on the stock market Prague, was aez almost half of the normal tagesgeschäfts.
A plausible ranking of the translation systems would place the first system higher than the second. While neither translation is perfect, the first one is clearly easier to understand and conveys more of the original meaning.
In class, we discussed one very popular evaluation metric: BLEU.
This assignment has two parts. In the first part, you will implement the BLEU metric and score the outputs of translation systems. We didn't go into the exact details in class, but you can find them in the BLEU paper. You will get 95% of your grade for this part.
As you will see, BLEU judgements correlate well with, but are not quite the same as, human judgements.
In the second part, you are encouraged to try to improve over the BLEU metric, and write a program that ranks the systems in the same order that a human evaluator would. You can get up to 25% additional points for this part (yes, this adds up to more than 100%, but you'll have to work for it).
Data: In the file data.txt you will find a human translation and many machine translations of some German documents.
Goal: You need to write a program that takes the human translation and machine translations as input, and provides a BLEU score for each system.
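To make the task concrete, here is a minimal sketch of corpus-level BLEU in Python (all names here are illustrative, not a required API; it assumes one reference translation per sentence, whitespace-tokenized input, and the standard maximum n-gram order of 4; consult the BLEU paper for the authoritative definition):

import math
from collections import Counter

def ngrams(tokens, n):
    # All n-grams of a token list, as tuples (empty if the list is too short).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidates, references, max_n=4):
    # Corpus-level BLEU for one system. `candidates` and `references` are
    # parallel lists of tokenized sentences, one reference per candidate.
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # candidate n-gram counts, per order
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clip: a candidate n-gram only counts as many times as it
            # appears in the reference ("modified n-gram precision").
            matches[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in cand_counts.items())
            totals[n - 1] += sum(cand_counts.values())
    if cand_len == 0 or min(matches) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    # Geometric mean of the modified n-gram precisions...
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # ...scaled by the brevity penalty, which punishes short candidates.
    bp = 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)
    return 100.0 * bp * math.exp(log_prec)

Note that full implementations often also handle multiple references and smoothing; the data here has a single human reference, so this sketch keeps things simple.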
Output: The output of your program should be in the following format: each line consists of a system name, followed by a tab, followed by the system's BLEU score. The lines should be sorted so that the highest-scoring systems are at the top. For example:
SystemE 15.88
SystemF 15.29
[...]
SystemL 4.16
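If you work in Python, the formatting itself can be as simple as the following sketch (the `scores` dict and its values are illustrative, mirroring the example above):

# `scores` maps system name to BLEU score.
scores = {"SystemF": 15.29, "SystemE": 15.88, "SystemL": 4.16}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}\t{score:.2f}")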
The file human-ranking.txt provides the real human ranking for these systems, where the first line is the best system and the last line is the worst. As you can see, this ranking is (hopefully) similar to the ranking you produced in part 1, but not identical to it.
In this part of the assignment you are expected to write a program that improves upon the scoring provided by BLEU and corresponds better to human judgement.
One way to improve over the default system is to experiment with BLEU's parameters, or to retokenize the data in some way (a minimal retokenization sketch follows below). However, there are many, many alternatives to BLEU; the topic of evaluation is so popular that Yorick Wilks, a well-known researcher, once remarked that more has been written about machine translation evaluation than about machine translation itself. Some of the techniques people have tried may result in stronger correlation with human judgement, such as matching stems and synonyms rather than exact words (as in METEOR), or counting the edits needed to fix a translation (as in TER).
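As a concrete example, the retokenization mentioned above could be as small as this (a sketch, with no guarantee that it improves correlation; the pattern keeps word characters together and splits punctuation into separate tokens):

import re

def retokenize(line):
    # Lowercase, then split into word tokens and individual punctuation marks.
    return re.findall(r"\w+|[^\w\s]", line.lower())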
But the sky's the limit! There are many, many ways to automatically evaluate machine translation systems, and you can try anything you want as long as you don't just download some software specifically designed for this task and use it as your submission.
You can program the assignment in any language you want. Your code should be able to run on Linux, from the command line, without installing or using an IDE.
Your submission should include a .zip or .tar.gz file with your code, as well as a README file that explains how to run your program (both for part 1 and part 2).
The input file name should be specified on the command line, either as a parameter or through a pipe (your program may be run on a different file than the one provided to you).
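In Python, for instance, both invocation styles can be supported with a few lines (a sketch; `score.py` is a hypothetical script name):

import sys

# Read from the file named on the command line if one is given,
# otherwise from standard input, so both
#   python score.py data.txt
# and
#   cat data.txt | python score.py
# work.
if len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as f:
        lines = f.read().splitlines()
else:
    lines = sys.stdin.read().splitlines()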
Submit your assignment by email. The subject of your email should be "mt course ass 1" (without the quotes).
The official deadline is March 20. However, you can also submit after Passover with no penalty. Keep in mind that the next assignment will be released just before Passover as well, so plan your time accordingly.