The data directory contains files derived from the Canadian Hansards,
originally aligned by Ulrich Germann:

-input: French sentences to translate.

-tm: a phrase-based translation model, in the format:

    French phrase ||| English phrase ||| log_10(translation_prob)

-lm: a trigram language model file in ARPA format:

    log_10(ngram_prob)   ngram   log_10(backoff_prob)

  The backoff (bo) prob is the weight to apply when this ngram is used as
  the context of a backoff. The probability of a trigram (w1,w2,w3) is
  given by the following recipe:

  trigram probability p(w3 | w1, w2):

    p(w3|w1,w2) = if (trigram exists)            p_3(w1,w2,w3)
                  else if (bigram w1,w2 exists)  bo_wt_2(w1,w2) * p(w3|w2)
                  else                           p(w3|w2)

  bigram probability p(w2 | w1):

    p(w2|w1) = if (bigram exists)  p_2(w1,w2)
               else                bo_wt_1(w1) * p_1(w2)

  To get the probability of an unknown single word, use the "<unk>" token.

  Keep in mind that the values in the file are in log10, and the recipe is
  not in log-space: each product in the recipe becomes a sum of the log10
  values from the file.

  The lm assumes sentences start with "<s>" and end with "</s>".

The language model and translation model are computed from the data in the
align directory, using alignments from the Berkeley aligner.
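
The backoff recipe above can be sketched in Python as follows. This is a minimal illustration, not the code that ships with the data: the table layout (an n-gram tuple mapped to a pair of log10 probability and log10 backoff weight), the function names, and the toy values are all assumptions made for the example. It also assumes the unknown-word token is spelled "<unk>", as is conventional in ARPA files, and that a context with no entry in the table contributes a backoff weight of 1 (0.0 in log10), which is how the final "else" branch of the trigram recipe behaves. Because everything is kept in log10, the products in the recipe appear as sums.

```python
from typing import Dict, Tuple

# Hypothetical in-memory table: n-gram tuple -> (log10 prob, log10 backoff weight).
NgramTable = Dict[Tuple[str, ...], Tuple[float, float]]

UNK = "<unk>"  # assumed spelling of the unknown-word token


def backoff_weight(table: NgramTable, context: Tuple[str, ...]) -> float:
    """log10 backoff weight of a context; 0.0 (weight 1) if the context is absent."""
    entry = table.get(context)
    return entry[1] if entry is not None else 0.0


def log_p1(table: NgramTable, w: str) -> float:
    """log10 p_1(w), falling back to the unknown-word entry."""
    key = (w,) if (w,) in table else (UNK,)
    return table[key][0]


def log_p2(table: NgramTable, w1: str, w2: str) -> float:
    """log10 p(w2 | w1): the bigram if present, else bo_wt_1(w1) * p_1(w2)."""
    if (w1, w2) in table:
        return table[(w1, w2)][0]
    return backoff_weight(table, (w1,)) + log_p1(table, w2)


def log_p3(table: NgramTable, w1: str, w2: str, w3: str) -> float:
    """log10 p(w3 | w1, w2): the trigram if present, else back off to the bigram."""
    if (w1, w2, w3) in table:
        return table[(w1, w2, w3)][0]
    # If the bigram (w1, w2) is absent, backoff_weight returns 0.0,
    # which reduces this line to p(w3 | w2) as in the recipe.
    return backoff_weight(table, (w1, w2)) + log_p2(table, w2, w3)


# Toy values chosen for illustration only; they do not come from the lm file.
toy: NgramTable = {
    ("<unk>",): (-3.0, 0.0),
    ("the",): (-1.0, -0.5),
    ("cat",): (-2.0, -0.25),
    ("the", "cat"): (-0.5, -0.125),
    ("<s>", "the", "cat"): (-0.25, 0.0),
}

print(log_p3(toy, "<s>", "the", "cat"))  # trigram found directly
print(log_p3(toy, "the", "cat", "dog"))  # backs off twice, ends at <unk>
```

To turn a result back into a plain probability, raise 10 to that power, for example 10 ** log_p3(toy, "<s>", "the", "cat").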