Instructions for the final project



Download this collection of spam and non-spam emails. (Open them only in a text editor, to avoid infection by any viruses they may contain.)

Build a model that distinguishes spam from non-spam. To do this, you should use two different feature sets:
1. Character n-grams only. The choice of how many and which n-grams to use (for each value of n) is up to you.
2. Any features you think might work. Be creative.
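
For the first feature set, character n-gram counts can be computed along the following lines. This is only a minimal sketch: the function names, the choice of n = 2 and 3, and the "keep the k most frequent n-grams" rule are illustrative assumptions, not requirements of the assignment.

```python
from collections import Counter


def char_ngrams(text, n_values=(2, 3)):
    """Count every character n-gram in text, for each n in n_values."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def build_vocabulary(texts, n_values=(2, 3), k=1000):
    """Map the k most frequent n-grams in the corpus to indices 1..k.

    Keeping the most frequent n-grams is just one possible selection
    rule (an assumption made for this sketch).
    """
    total = Counter()
    for text in texts:
        total.update(char_ngrams(text, n_values))
    return {gram: i + 1 for i, (gram, _) in enumerate(total.most_common(k))}


def vectorize(text, vocabulary, n_values=(2, 3)):
    """Return {feature index: count} for the n-grams found in vocabulary."""
    counts = char_ngrams(text, n_values)
    return {vocabulary[g]: c for g, c in counts.items() if g in vocabulary}
```

Build the vocabulary from the training emails only, then reuse it unchanged when you vectorize the test emails.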

Your learner should be one of the SVM learners available at http://svmlight.joachims.org. The choice of parameters (kernel, slack, false-positive cost vs. false-negative cost, etc.) is up to you.
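
SVMlight reads training examples from a text file with one example per line: a target value followed by feature:value pairs, with feature indices as positive integers in increasing order. The helper below sketches how one such line could be written. The function name is hypothetical, and treating spam as the +1 class is an assumption; check the documentation on the site above for the details of the learner you choose.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...'.

    label:    positive number for spam, anything else for non-spam
              (spam as the +1 class is an assumption of this sketch).
    features: dict mapping feature index (int >= 1) to a numeric value.
    """
    parts = ["+1" if label > 0 else "-1"]
    for index in sorted(features):  # indices must appear in increasing order
        value = features[index]
        if value != 0:  # zero-valued features can simply be left out
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)
```

For example, to_svmlight_line(1, {3: 2, 1: 0.5}) gives the line "+1 1:0.5 3:2".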

Your objective is to maximize spam recall while keeping spam precision at or above 99%.
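
One way to work toward this objective is to shift the decision threshold on the classifier's output scores. The sketch below assumes you have real-valued scores for a held-out set (not the test set), with labels +1 for spam and -1 for non-spam; the function names are hypothetical, and adjusting the threshold is only one option alongside the cost parameters mentioned above.

```python
def precision_recall(labels, scores, threshold):
    """Spam precision and recall when score >= threshold means 'spam'."""
    tp = fp = fn = 0
    for y, s in zip(labels, scores):
        predicted_spam = s >= threshold
        if predicted_spam and y == 1:
            tp += 1
        elif predicted_spam:
            fp += 1
        elif y == 1:
            fn += 1
    # Precision is undefined when nothing is predicted as spam; report 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(labels, scores, min_precision=0.99):
    """Return (threshold, recall) with the highest recall among thresholds
    whose precision is at least min_precision, or None if none qualifies."""
    best = None
    for t in sorted(set(scores)):
        p, r = precision_recall(labels, scores, t)
        if p >= min_precision and (best is None or r > best[1]):
            best = (t, r)
    return best
```

A threshold chosen this way is fitted to the held-out emails, so precision on new data may fall below 99%; leaving some margin is a reasonable precaution.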

Explain all your design choices. Discuss which features were particularly useful.

Your models and reports are due on September 7. The next day, you will receive test data and must run your models on it.

UPDATE: Due date extended to September 14.
UPDATE 2: Test set is available here. Please submit a confusion matrix.
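
The confusion matrix can be tallied from the true labels and your model's predictions on the test set. A minimal sketch, assuming both are lists of +1 (spam) and -1 (non-spam); the function name and the dictionary layout are illustrative only.

```python
def confusion_matrix(labels, predictions):
    """Count true/false positives and negatives, with spam as the positive class."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for y, p in zip(labels, predictions):
        if y == 1:
            counts["tp" if p == 1 else "fn"] += 1
        else:
            counts["fp" if p == 1 else "tn"] += 1
    return counts


def format_matrix(counts):
    """Lay the counts out as a 2x2 table: rows are actual, columns predicted."""
    rows = [
        ("", "pred. spam", "pred. non-spam"),
        ("actual spam", counts["tp"], counts["fn"]),
        ("actual non-spam", counts["fp"], counts["tn"]),
    ]
    return "\n".join(f"{a:<16}{b:>12}{c:>16}" for a, b, c in rows)
```

Spam precision is then tp / (tp + fp) and spam recall is tp / (tp + fn).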