Here is an old test that can serve as a rough model of what you can expect.
Problem II: You are given movie reviews labeled as positive or negative. Your task is to learn a classifier that assigns an unlabeled review to the correct class.
Perform all the following steps. Write a report explaining exactly what you did in each step and what results you obtained for that step.
1. Chunk each book into some number of chunks, each of which will serve as a training example. Explain how you chose to chunk and why.
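As a concrete illustration, one simple chunking scheme splits a text into fixed-length blocks of words. This is a minimal sketch; the 500-word size is an arbitrary choice for illustration, not a recommendation:

```python
def chunk_text(text, chunk_size=500):
    """Split a text into consecutive chunks of roughly chunk_size words.

    Each chunk becomes one training example; a trailing chunk shorter
    than chunk_size is dropped so all examples are comparable in length.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words) - chunk_size + 1, chunk_size)
    ]
```

Dropping the short trailing chunk keeps all training examples the same length; whether that is the right trade-off is exactly the kind of decision this step asks you to justify.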
2. Choose three different feature sets to use for this purpose. For example, you can use all words that appear in any of the books, or all words that appear at least k times, or all function words, or parts-of-speech. Be creative. (But don’t work too hard, since experience shows that simple feature sets work at least as well as, if not better than, sophisticated ones.)
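The three example feature sets named above can be sketched as follows. The function-word list here is a tiny placeholder for illustration; a real one would have a few hundred entries:

```python
from collections import Counter

# A tiny illustrative function-word list; a real one would be much longer.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "but", "not", "is"}

def all_words(chunks):
    """Feature set 1: every word that appears in any chunk."""
    return {w for chunk in chunks for w in chunk.lower().split()}

def frequent_words(chunks, k=5):
    """Feature set 2: words appearing at least k times overall."""
    counts = Counter(w for chunk in chunks for w in chunk.lower().split())
    return {w for w, c in counts.items() if c >= k}

def function_words(chunks):
    """Feature set 3: function words only (topic-independent style markers)."""
    return all_words(chunks) & FUNCTION_WORDS
```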
3. For each feature set you’ve chosen, use Weka’s “Select Attributes” menu to identify the features with highest infogain. Check that these features make sense and aren’t just artifacts of text formatting or other irrelevancies. List the key distinguishing features and see if you can divine some underlying pattern.
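Infogain ranks each feature by the reduction in class entropy it yields. As a sanity check on what the “Select Attributes” output means, here is that measure computed by hand for a single discrete feature (a from-scratch sketch, not Weka’s code):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Information gain of a discrete feature with respect to the class:
    H(class) minus the weighted entropy of the class within each feature value."""
    n = len(labels)
    by_value = {}
    for v, y in zip(feature_values, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder
```

A feature that perfectly predicts the class scores the full class entropy; one independent of the class scores zero.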
4. Choose three different learning algorithms. Describe how each of them works. Identify the main parameters that need to be chosen for each of them.
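To make one such description concrete, here is a minimal multinomial Naive Bayes over bag-of-words chunks; its main tunable parameter is the Laplace smoothing constant alpha. This is a from-scratch sketch, not Weka’s implementation:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes over bag-of-words chunks.

    alpha is the Laplace smoothing constant, the learner's main parameter.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, chunks, labels):
        self.word_counts = defaultdict(Counter)  # class -> word counts
        self.class_counts = Counter(labels)
        self.vocab = set()
        for chunk, y in zip(chunks, labels):
            words = chunk.lower().split()
            self.word_counts[y].update(words)
            self.vocab.update(words)
        return self

    def predict(self, chunk):
        words = chunk.lower().split()
        n = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best, best_lp = None, float("-inf")
        for y, cc in self.class_counts.items():
            lp = math.log(cc / n)  # log prior
            total = sum(self.word_counts[y].values())
            for w in words:
                # Smoothed log likelihood of each word given the class.
                lp += math.log((self.word_counts[y][w] + self.alpha)
                               / (total + self.alpha * vocab_size))
            if lp > best_lp:
                best, best_lp = y, lp
        return best
```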
5. For each algorithm (and some reasonable assortment of parameter values for each), use Weka (or some other package, or even write some learning algorithms on your own) to learn a classifier distinguishing the classes. Report results for each <feature_set, learner+parameter_settings> pair, testing on the provided validation set for the authorship problem and the provided validation set for the movie review problem.
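The bookkeeping for this grid of runs reduces to two small helpers: projecting a chunk onto a chosen feature set, and scoring a trained classifier on the validation set. The `predict(chunk)` interface here is this sketch’s own convention, not Weka’s API:

```python
def project(chunk, features):
    """Restrict a chunk to the words in the chosen feature set."""
    return " ".join(w for w in chunk.lower().split() if w in features)

def accuracy(classifier, chunks, labels):
    """Fraction of chunks the classifier labels correctly."""
    hits = sum(classifier.predict(c) == y for c, y in zip(chunks, labels))
    return hits / len(labels)
```

Looping `accuracy` over every <feature_set, learner+parameter_settings> combination yields the table of results this step asks for.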
6. For each of your three learning algorithms, try bagging and boosting. (If you use Weka, do this by choosing your learner from the “meta” menu and then choosing Bagging and AdaBoost, respectively.) Report results for testing on the validation set. Do bagging and boosting improve accuracy?
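If you implement bagging yourself rather than use Weka’s meta menu, the core idea fits in a few lines: train several copies of a base learner on bootstrap resamples of the training data and predict by majority vote. The `fit`/`predict` interface and the `base_factory` argument are this sketch’s own conventions:

```python
import random
from collections import Counter

class Bagging:
    """Minimal bagging: train n_estimators copies of a base learner on
    bootstrap resamples and predict by majority vote.

    base_factory is any zero-argument callable returning an object with
    fit(chunks, labels) and predict(chunk) methods.
    """

    def __init__(self, base_factory, n_estimators=10, seed=0):
        self.base_factory = base_factory
        self.n_estimators = n_estimators
        self.seed = seed

    def fit(self, chunks, labels):
        rng = random.Random(self.seed)
        n = len(chunks)
        self.models = []
        for _ in range(self.n_estimators):
            idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
            model = self.base_factory()
            model.fit([chunks[i] for i in idx], [labels[i] for i in idx])
            self.models.append(model)
        return self

    def predict(self, chunk):
        votes = Counter(m.predict(chunk) for m in self.models)
        return votes.most_common(1)[0][0]
```

Boosting (AdaBoost) differs in that each successive learner is trained with higher weight on the examples its predecessors got wrong, and votes are weighted by each learner’s accuracy.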
7. Summarize your conclusions about what works well and what doesn’t. How many chunks is right? Which learners work well? With which parameters? Which has bigger impact on results: choice of feature set or choice of learner?
8. Once you’ve handed in your reports, a new test set will be posted. Run your best methods (as determined on the validation set) on the new test set and report the results. Compare performance on the validation set and on the test set. Are they well correlated? If not, explain why not.
Please put a hard copy of your report in my mailbox on or before August 15. Please include a link to your code and any other large files.
© Last updated on 16/7/2013 by Navot Akiva