Assignment 2: Word Alignment

(this assignment is inspired by and based on HW1 in JHU's 2012 mt-class)

Word alignment is at the core of practically every statistical machine translation system.

In class, we discussed several methods for learning word alignments from parallel text.

In this assignment, you will experiment with word alignments by (1) manually aligning some sentences between Hebrew and English, and (2) implementing a word-alignment program.

Part 1 - The alignment task (10 pts)

In this task, you will experiment with word alignment of some "real world" sentences, to get a feeling for the task.

In the files hebrew.txt and english.txt you will find 10 Hebrew and 10 English sentences. The files are encoded in UTF-8. You need to manually align the words in these sentences. Many-to-many mappings are allowed in both directions, as are unaligned words. You can mark alignments as either "sure" or "probable".

What to submit: For this part, you need to submit your manual alignments. You can either print them out and draw lines manually, or submit a text file in the format described in the README file provided in part 2.

Part 2 - Statistical Word Alignment (80 pts)

Download and extract the contents of this file.

You will find French data, English data, some gold alignments, an example of an alignment file output, some scripts, and a README file. Read the README file.

Goal: You need to write a program that takes the English and French data, aligns them, and produces an alignment file in the format described in the README file. In addition, you should write the translation parameters t(e,f) to a file.

For 80% of the credit, implement IBM Model 1. For the additional 20%, implement Model 2. You can find the details of Model 2's EM procedure in Michael Collins' notes, available from the course webpage.
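
As a rough illustration of the Model 1 part, the EM loop could be sketched in Python along the following lines. This is a minimal sketch under stated assumptions, not a reference solution: it assumes the parameters are stored as t[(f, e)] and estimate p(f | e), it adds a hypothetical "<NULL>" token to every English sentence, and it returns 0-based (french_index, english_index) links. The names train_model1 and align are made up for this example, and the required output format is the one described in the README.

```python
from collections import defaultdict

NULL = "<NULL>"  # hypothetical placeholder for the empty English word


def train_model1(bitext, iterations=5):
    """bitext: list of (french_words, english_words) pairs.
    Returns a dict-like t where t[(f, e)] estimates p(f | e)."""
    bitext = [(fs, [NULL] + es) for fs, es in bitext]
    f_vocab = {f for fs, _ in bitext for f in fs}
    uniform = 1.0 / max(1, len(f_vocab))
    t = defaultdict(lambda: uniform)  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (f, e) link
        total = defaultdict(float)  # expected count of each e
        for fs, es in bitext:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normaliser for this f
                for e in es:
                    p = t[(f, e)] / z  # posterior that f is linked to e
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalise the expected counts per English word
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def align(fs, es, t):
    """Link each French position to its most probable English position.
    A French word stays unaligned when NULL is at least as probable."""
    links = []
    for j, f in enumerate(fs):
        best_i, best_p = None, t[(f, NULL)]
        for i, e in enumerate(es):
            if t[(f, e)] > best_p:
                best_i, best_p = i, t[(f, e)]
        if best_i is not None:
            links.append((j, best_i))
    return links
```

Storing t in a dictionary keyed by word pairs that actually co-occur keeps memory proportional to the observed pairs rather than to the product of the two vocabularies.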

The supplied .a file contains manual alignments for the first few sentence pairs. This file should not be used in training your aligner (it is too small anyway), but you can use it to measure the quality of your alignments in terms of AER using the supplied evaluation script.
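
To make the metric concrete, here is a small Python sketch of the commonly used definition AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links, S the sure links and P the probable links (taken here to include S). The function name aer and the representation of links as tuples are assumptions for this example; the supplied evaluation script remains the authoritative implementation.

```python
def aer(sure, possible, predicted):
    """Alignment error rate over sets of hashable links,
    e.g. (sentence_id, french_index, english_index) tuples."""
    possible = possible | sure  # every sure link also counts as probable
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0  # nothing to predict and nothing predicted
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator
```

For example, with one sure link, one additional probable link, and three predicted links of which two are correct, the function returns 1 - 3/4 = 0.25.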

What to submit: For this part, you need to submit a zip file containing (a) your code, (b) a README file describing how to run your code to align the data with Model 1 and Model 2, and (c) an output alignment file for Model 1 and an output alignment file for Model 2.

A note about memory: While developing/debugging your code, you can use only the first k lines of the data. However, your final submission is expected to run on the entire dataset. This is a relatively small dataset, and should easily fit in the memory of a relatively modern personal computer. If it does not fit, maybe you should re-structure your program.
If you are using Java, make sure you allocate enough heap space for the JVM, as the default may be too small. The maximum heap size can be specified using the -Xmx flag:

  java -Xmx1g YourClassName

sets the maximum heap size to 1 GB.
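
For the development tip about using only the first k lines of the data, one possible way to load a prefix of the parallel text in Python is sketched below. The function name read_bitext and its parameters are hypothetical, and the sketch assumes one whitespace-tokenised sentence per line with the two files aligned line by line.

```python
from itertools import islice


def read_bitext(f_path, e_path, k=None):
    """Read the first k sentence pairs (all pairs when k is None)."""
    with open(f_path, encoding="utf-8") as ff, open(e_path, encoding="utf-8") as ef:
        pairs = zip(ff, ef)  # lazily pair up corresponding lines
        return [(f.split(), e.split()) for f, e in islice(pairs, k)]
```

Because zip and islice are lazy, only the first k lines of each file are read when k is given.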

Part 2.5 - Looking at the produced alignments (10 pts)

After running your aligner, look at the alignments and the learned parameters.

Can you identify the garbage-collection effect of Model 1? Please provide an example.
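
One way to look for candidates is to list words that receive unusually many links within a single sentence pair, since the garbage-collection effect is usually described as a rare word absorbing links from many words on the other side. The Python sketch below assumes links are 0-based (french_index, english_index) pairs and that the English side is the one collecting links; the function name high_fanout_words and the threshold are hypothetical, and the candidates still need to be inspected by hand.

```python
from collections import Counter


def high_fanout_words(es, links, min_links=3):
    """es: English tokens of one sentence pair.
    links: (french_index, english_index) pairs for that pair.
    Returns (english_word, link_count) for words with many links."""
    fanout = Counter(i for _, i in links)
    return [(es[i], n) for i, n in sorted(fanout.items()) if n >= min_links]
```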

Run your aligner several times, with various percentages of the training data. How does the AER change when you change the amount of training material?
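
A possible way to organise this experiment in Python is sketched below. The function learning_curve and the callback train_and_score are hypothetical names: the callback stands for whatever code trains your aligner on the given sentence pairs and returns its AER. The sketch takes prefixes of the data rather than random samples, on the assumption that the sentence pairs covered by the gold alignments (the first few) should always be part of the training data.

```python
def learning_curve(bitext, train_and_score, percentages=(10, 25, 50, 100)):
    """Train on growing prefixes of bitext and record the score of each run.
    Returns a list of (percentage, num_sentence_pairs, score) tuples."""
    results = []
    for pct in percentages:
        n = max(1, len(bitext) * pct // 100)  # always keep at least one pair
        results.append((pct, n, train_and_score(bitext[:n])))
    return results
```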

What to submit: For this part, you need to provide a text file containing (a) examples of the garbage-collection effect in your implementation of Model 1, and (b) a description of the relation between the AER and the amount of training material.

Submission details, deadlines, etc:

You can program the assignment in any language you want. Your code should be able to run on Linux, from the command line, without installing or using an IDE.

Your submission should include a .zip or .tar.gz file with your code, as well as a README file that explains how to run your program.

The input file names should be specified on the command line, either as a parameter or through a pipe (your program may be run on a different file than the one provided to you).
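
If you write the aligner in Python, taking the file names as parameters could look roughly like the sketch below, which uses the standard argparse module. The argument names (french, english, --num-sentences, --model) are assumptions made for this example and are not required by the assignment.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Word aligner (sketch)")
    parser.add_argument("french", help="path to the French side of the data")
    parser.add_argument("english", help="path to the English side of the data")
    parser.add_argument("-n", "--num-sentences", type=int, default=None,
                        help="use only the first N sentence pairs")
    parser.add_argument("-m", "--model", type=int, choices=(1, 2), default=1,
                        help="which IBM model to train")
    return parser.parse_args(argv)
```

With this sketch, a hypothetical invocation such as "python align.py data.f data.e -m 2" would select Model 2 on the two given files.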

Submit your assignment by email. The subject of your email should be "mt course ass 2" (without the quotes).

You should submit the assignment by April 11.