Topics in Text Categorization

Assignment 3 here

*******************************************************************
Assignment 1: Here are texts from Maariv and Yediot. Use the first 75% of the documents in each folder as the training corpus and hold out the remaining 25% for testing. Your task is to learn a classifier that assigns each document to one of four topics: news, sports, finance, or culture.

Take the 1000 most common words in the corpus as the base vocabulary. Use two methods, frequency and information gain (infogain), to filter this vocabulary down to 750, 500, and 250 words. For each resulting feature set, learn classifiers with Naive Bayes and SMO, and compare their results on the test set.
*******************************************************************
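
The feature-selection steps of Assignment 1 can be sketched roughly as follows. This is a minimal, standard-library-only Python sketch, not a reference solution: the function names are hypothetical, documents are assumed to be already tokenized into word lists and listed in folder order, and information gain is assumed to be computed on word presence versus absence in the training documents.

```python
import math
from collections import Counter


def split_train_test(docs, train_fraction=0.75):
    """Use the first train_fraction of the documents for training, the rest for testing."""
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]


def top_words(token_lists, n):
    """Return the n most frequent words over all documents."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [word for word, _ in counts.most_common(n)]


def entropy(class_counts):
    """Entropy in bits of a collection of per-class document counts."""
    class_counts = list(class_counts)
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c)


def info_gain(word, token_sets, labels):
    """H(class) minus the expected class entropy after splitting on word presence."""
    present, absent = Counter(), Counter()
    for tokens, label in zip(token_sets, labels):
        (present if word in tokens else absent)[label] += 1
    n = len(labels)
    gain = entropy(Counter(labels).values())
    for part in (present, absent):
        size = sum(part.values())
        if size:
            gain -= size / n * entropy(part.values())
    return gain


def select_by_info_gain(vocab, token_lists, labels, k):
    """Keep the k words of vocab with the highest information gain."""
    token_sets = [set(tokens) for tokens in token_lists]  # fast membership tests
    ranked = sorted(vocab, key=lambda w: info_gain(w, token_sets, labels), reverse=True)
    return ranked[:k]


# Toy data standing in for the real corpus (assumption: one label per document).
docs = [["goal", "match"], ["goal", "team"], ["stock", "market"], ["stock", "bank"]]
labels = ["sports", "sports", "finance", "finance"]
```

With the real data, the frequency filter would be something like top_words(train_docs, 750) (and likewise 500 and 250), while the information-gain filter would be select_by_info_gain(base_vocab, train_docs, train_labels, 750) applied to the 1000-word base vocabulary. Training the Naive Bayes and SMO classifiers themselves is left to whatever toolkit is used; SMO is the name of the SVM trainer in Weka, for instance, but the choice of toolkit here is an assumption.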

1. General introduction

An overview of text categorization

Machine learning in automated text categorization
F Sebastiani - ACM Computing Surveys, 2002

2. Individual learning algorithms for text categorization

A comparison of event models for naïve Bayes text classification
A McCallum, K Nigam

BoosTexter: A Boosting-based System for Text Categorization
RE Schapire, Y Singer

Text Categorization with Support Vector Machines
T Joachims

3. Comparison studies: learning algorithms

Inductive Learning Algorithms and Representations for Text Categorization
S Dumais, J Platt, M Sahami, D Heckerman

A re-examination of text categorization methods
Y Yang, X Liu

4. Comparison studies: feature selection

A comparative study on feature selection in text categorization
Y Yang, JO Pedersen

An extensive empirical study of feature selection metrics for text classification
G Forman

5. Feature reduction (not selection)

Indexing by latent semantic analysis
S Deerwester, ST Dumais, GW Furnas, TK Landauer

The power of word clusters for text classification
N Slonim, N Tishby

6. Authorship attribution

Computer-Based Authorship Attribution Without Lexical Measures
E Stamatatos, N Fakotakis, G Kokkinakis

Language independent authorship attribution using character level language models
F Peng, D Schuurmans, V Keselj, S Wang

Author Identification on the Large Scale
D Madigan, A Genkin, DD Lewis, S Argamon, D Fradkin

Authorship Attribution with Thousands of Candidate Authors
M Koppel, J Schler, S Argamon, E Messeri

7. Author profiling

Determining an Author's Native Language by Mining a Text for Errors (Slides)
M Koppel, J Schler, K Zigdon

Automatically categorizing written texts by author gender (Slides)
M Koppel, S Argamon, A Shimoni

8. Authorship verification

Authorship Verification as a One-Class Classification Problem (Slides)
M Koppel, J Schler

9. Sentiment analysis – words and phrases

Predicting the semantic orientation of adjectives
V Hatzivassiloglou, KR McKeown

Thumbs Up or Thumbs Down?
P Turney

Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis (Slides)
T Wilson, J Wiebe, P Hoffmann

10. Sentiment analysis – documents

Thumbs up?: sentiment classification using machine learning techniques
B Pang, L Lee, S Vaithyanathan


Mining the peanut gallery: opinion extraction and semantic classification of product reviews
K Dave, S Lawrence, DM Pennock


The Importance of Neutral Examples for Learning Sentiment (Slides)
M Koppel, J Schler

89469: a list of NLP resources and NLP for TC - Oren's slides

Slides – Lecture1

Slides – Lecture2

Slides – Lecture3

Slides – Lecture4

Slides – Lecture5

Slides – Lecture6

Slides – Lecture7

Slides – Lecture8

Slides – Lecture9

Slides – Lecture10