Assignment 3: Phrase-based decoding

(this assignment is inspired by and based on HW2 in JHU's 2012 mt-class)

In Machine Translation, "decoding" means finding the best translation according to a given translation model (and language model). This is a hard search problem. Dedocing algorithms are at the core of any machine-translation system.

In this assignment you will experiment with decoding algorithms by implementing a phrase-based decoder and inspecting its behavior.

Decoding with realistic translation and language models is challenging due to the size of the models and requires some major engineering efforts. Instead, I provided you with tiny translation and language models, derived from the data you used in assignment 2 (after alignment with the berkeley aligner), and reduced to the vocabulary of the sentences you need to translate. You cannot expect to get good translations from such a model, but it is large enough to demonstrate the key properties of a decoding algorithm. And remember: a decoder is not measured by translation quality, but by its ability to recover translations with good scores according to the given model.

The data file is described in the README file

Part 1 - Implementing a Phrase-based Decoder (60 pts)

Implement a machine-translation phrase based decoder, as seen in class. Your decoder should get as input a language model and a translation model (phrase table), which are passed on as commandline arguments, and use them to decode (translate) sentences. Your program should work at least with the input, tm and lm files supplied in the data directory.

For each decoded sentence, your program should output: (1) The chosen translation. (2) It's segmentation (which phrases are used). (3) It's model score.

You can assume a uniform distortion cost. Use a beam of size 100.

What to submit: For this part, you need to submit a single .zip or .tgz file including your code and a README file explaining how to compile and run it. In addition, the zip file should include a file named output, which is the result of running your program on the supplied input file.

Part 2 - Inspecting the decoder behavior (20)

Two important parameters in decoding a phrase-based system are the distortion limit and the beam size. Add to your decoder the ability to take the beam size and the distorion limit as parameters.

How does changing the beam-size affect the model scores? How does it affect the translation time? How does changing the distortion limit affect the model scores? How does it affect the translation time? Is there a beam size in which the model score stops increasing?

What to submit A report on your experiments and finding.

Alternative 1 (90 pts)

Instead of doing parts 1 and 2 above and for an additional 10 points, you could implement a decoder by posing the translation problem as the (NP hard) traveling-salesman problem or as an integer-linear-program, and then use a traveling-saleseman or ILP solver to solve it.

This is possible because of the relatively small size of the translation table and language model.

What to submit: A report explaining how you encoded the translation model as the NP hard problem, as well as the outputs produced by your system.

Alternative 2 (60 pts)

Instead of doing parts 1 and 2 above, and for 20 points less, you could download the moses toolkit, install it, and use it to translate the input sentneces given the provided translation and language models, as well as to explore the effect of beam-size and distortion limit on the model scores and translation times.

What to submit: The output of moses on the input file, and a report as required in Part 2.

Part 3 - Forced Decoding (20 points)

Let's assume you know a reference translation. Change your decoder so that it will give you the best derivation for this translation (and its model score), or indicate if the reference translation is not reachable using the translation model.

Test your program on the following input file.

What to submit: A single .zip or .tgz file including your code and a README file explaining how to compile and run it. In addition, the zip file should include a file named output, which is the result of running your program on the supplied input file.

Submission details, deadlines, etc:

You can program the assignment in any language you want. Your code should be able to run on linux, from the commandline, without installing or using an IDE.

If you use python, you may find the following pieces of code useful. You are not required to use them. utils.py

Your submission should include a .zip or .tar.gz file with your code, as well as a README file that explains how to run your program.

The input file names should be specified on the commandline, either as a parameter or through a pipe (your program may be run on a different file than the one provided to you).

Submit your assignment by email. The subject of your email should be "mt course ass 3" (without the quotes).

The should submit the assignment by the last week of the semester.