Instructions for the final project
Download this collection of spam and non-spam emails.
(Open them only in a text editor, to avoid being infected by any viruses they
may contain.)
Build a model that distinguishes spam from non-spam. To do this, you should use
two different feature sets:
1. Character n-grams only. The choice of how many and which n-grams to use (for
each value of n) is up to you.
2. Any features you think might work. Be creative.
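As a starting point for the first feature set, here is a minimal sketch of character n-gram counting in Python. The function names, the choice of n = 2 and 3, and the cutoff of 1000 features are all illustrative assumptions, not requirements of the assignment; how many and which n-grams to keep is still your decision.

```python
from collections import Counter


def char_ngrams(text, n_values=(2, 3)):
    """Count overlapping character n-grams of each length in n_values."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def build_vocabulary(documents, n_values=(2, 3), max_features=1000):
    """Map the most frequent n-grams in the training set to 1-based feature ids.

    Keeping only the most frequent n-grams is an assumed selection rule;
    other criteria (e.g. how well an n-gram separates the classes) may work better.
    """
    totals = Counter()
    for doc in documents:
        totals.update(char_ngrams(doc, n_values))
    most_common = totals.most_common(max_features)
    return {gram: idx for idx, (gram, _) in enumerate(most_common, start=1)}


def vectorize(text, vocabulary, n_values=(2, 3)):
    """Turn one email into a sparse {feature id: count} dictionary."""
    counts = char_ngrams(text, n_values)
    return {vocabulary[g]: c for g, c in counts.items() if g in vocabulary}
```

The vocabulary should be built from the training emails only, so that the same feature ids can be reused unchanged on the test data.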
Your learner should be one of the SVM learners available at http://svmlight.joachims.org.
The choice of parameters (kernel, slack, false-positive cost vs.
false-negative cost, etc.) is up to you.
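SVMlight reads training examples from a text file with one example per line: a target value followed by feature:value pairs in increasing order of feature id. The sketch below writes such lines in Python. The helper name is hypothetical, and treating spam as the positive class (+1) is an assumption you are free to reverse.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVMlight input line.

    label: +1 for spam, -1 for non-spam (assumed convention).
    features: dict mapping positive integer feature ids to numeric values.
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    parts = [f"{label:+d}"]
    # SVMlight expects feature ids in increasing order; zero values can be omitted.
    for feature_id in sorted(features):
        value = features[feature_id]
        if value != 0:
            parts.append(f"{feature_id}:{value:g}")
    return " ".join(parts)


print(to_svmlight_line(1, {3: 2, 1: 0.5}))   # +1 1:0.5 3:2
```

The parameters mentioned above correspond to svm_learn options: to the best of my knowledge, -t selects the kernel, -c sets the trade-off between training error and margin, and -j sets the factor by which errors on positive examples outweigh errors on negative examples. Check the documentation on the SVMlight site for the exact meaning and defaults in the version you download.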
Your objective is to maximize spam recall while keeping spam precision at 99% or higher.
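One way to work toward this objective is to move the decision threshold on the classifier's output scores, rather than always cutting at zero. The sketch below assumes you have a held-out set of labeled emails (+1 for spam, -1 for non-spam, an assumed convention) and the real-valued score the classifier assigned to each; the function names are illustrative.

```python
def precision_recall(labels, scores, threshold):
    """Spam precision and recall when every score >= threshold is called spam."""
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted_spam = score >= threshold
        if predicted_spam and label == 1:
            tp += 1          # spam correctly flagged
        elif predicted_spam:
            fp += 1          # non-spam wrongly flagged
        elif label == 1:
            fn += 1          # spam that slipped through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(labels, scores, min_precision=0.99):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None if none qualifies."""
    best = None
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall(labels, scores, threshold)
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best
```

Choose the threshold on held-out training data only. Tuning it on the test data would make the reported precision and recall overly optimistic.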
Explain all your design choices. Discuss which features were particularly
useful.
Your models and reports are due on September 7. The next day, you'll get test
data and will have to use your models on that test data.
UPDATE: Due date extended to September 14.
UPDATE 2: Test set is available here. Please submit a confusion matrix.
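For the confusion matrix, a minimal sketch in Python follows. It assumes labels and predictions are already coded as +1 for spam and -1 for non-spam; if your predictions are real-valued scores, convert them first using whatever threshold your model uses. The function name and the dictionary layout are illustrative.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count the four outcomes, treating spam (+1) as the positive class."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],     # spam predicted as spam
        "fn": counts[(1, -1)],    # spam predicted as non-spam
        "fp": counts[(-1, 1)],    # non-spam predicted as spam
        "tn": counts[(-1, -1)],   # non-spam predicted as non-spam
    }


m = confusion_matrix([1, 1, -1, -1, 1], [1, -1, 1, -1, 1])
print("                 predicted spam  predicted non-spam")
print(f"actual spam      {m['tp']:>14}  {m['fn']:>18}")
print(f"actual non-spam  {m['fp']:>14}  {m['tn']:>18}")
```

Spam precision is tp / (tp + fp) and spam recall is tp / (tp + fn), so both can be read directly off the matrix.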