Topics in Text Categorization

Assignment: Using this blog corpus, do the following. Choose 100 blogs, each with at least 1000 words. Label the first 500 words of blog i as begin_i and the last 500 words of blog i as end_i. Your task is to learn to distinguish pairs <begin_i, end_i> from pairs <begin_i, end_j>, where i != j. You can use any method you like, and you can use other blogs in the corpus as background information. To test the effectiveness of your method, generate similar pairs from different blogs (don't use any you've used in development in any fashion).
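
The pair construction described in the assignment could be sketched roughly as follows. This is only an illustration, not a required approach: the function names split_blog and make_pairs are hypothetical, words are assumed to be whitespace-separated tokens, and drawing exactly one mismatched pair per blog is an arbitrary choice. Loading the corpus and the learning method itself are left out.

```python
import random


def split_blog(text, n=500):
    """Return (first n words, last n words), or None if the blog has fewer than 2n words.

    Assumption: a "word" is a whitespace-separated token.
    """
    words = text.split()
    if len(words) < 2 * n:
        return None
    return " ".join(words[:n]), " ".join(words[-n:])


def make_pairs(blogs, n=500, seed=0):
    """Build labelled pairs: (begin_i, end_i, 1) and (begin_i, end_j, 0) with i != j.

    Assumption: one mismatched pair is sampled per blog; the assignment
    does not fix the ratio of matched to mismatched pairs.
    """
    rng = random.Random(seed)
    # Keep only blogs long enough to give two non-overlapping chunks.
    halves = [h for h in (split_blog(b, n) for b in blogs) if h is not None]
    if len(halves) < 2:
        raise ValueError("need at least two blogs that are long enough")
    pairs = []
    for i, (begin, end) in enumerate(halves):
        pairs.append((begin, end, 1))
        # Pick the ending of a different blog for the negative example.
        j = rng.choice([k for k in range(len(halves)) if k != i])
        pairs.append((begin, halves[j][1], 0))
    return pairs


if __name__ == "__main__":
    # Toy data with n=3 instead of 500, just to show the shape of the output.
    toy = ["a b c d e f", "g h i j k l m", "too short"]
    for begin, end, label in make_pairs(toy, n=3):
        print(label, "|", begin, "|", end)
```

To respect the assignment's restriction, development pairs and test pairs would be built by calling make_pairs on two disjoint lists of blogs.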



Lecture 1. General introduction; comparison of learning algorithms

Slides: Lecture1



Machine learning in automated text categorization
F Sebastiani


Inductive Learning Algorithms and Representations for Text Categorization
S Dumais, J Platt, M Sahami, D Heckerman


A re-examination of text categorization methods
Y Yang, X Liu

Text Categorization with Support Vector Machines
T Joachims




Lecture 2. Naïve Bayes; feature selection

Slides: Lecture2



A comparison of event models for Naïve Bayes text classification
A McCallum, K Nigam


A comparative study on feature selection in text categorization
Y Yang, JO Pedersen


An extensive empirical study of feature selection metrics for text classification
G Forman




Lecture 3. Authorship attribution


Slides: Lecture3




Computational Methods in Authorship Attribution

M. Koppel, J. Schler, S. Argamon




Lecture 4. Authorship verification


Slides: Lecture4




Measuring Differentiability: Unmasking Pseudonymous Authors

M. Koppel, J. Schler, E. Bonchek-Dokow


Authorship Attribution in the Wild

M. Koppel, J. Schler, S. Argamon




Lecture 5. Author profiling

Slides: Lecture5



Determining an Author's Native Language by Mining a Text for Errors
M. Koppel, J. Schler, K. Zigdon


Automatically Profiling the Author of an Anonymous Text

S. Argamon, M. Koppel, J. Pennebaker and J. Schler




Lecture 6. Bottom-up sentiment analysis

Slides: Lecture6




Predicting the semantic orientation of adjectives
V Hatzivassiloglou, KR McKeown


Thumbs Up or Thumbs Down?
P Turney


Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis
T Wilson, J Wiebe, P Hoffmann

Lecture 7. Top-down sentiment analysis

Slides: Lecture7



Opinion Mining and Sentiment Analysis

B Pang, L Lee

Mining the peanut gallery: opinion extraction and semantic classification of product reviews
K Dave, S Lawrence, DM Pennock

The Importance of Neutral Examples for Learning Sentiment
M Koppel, J Schler




Lecture 8. Spam filtering

Slides: Lecture8




Lecture 9. Text clustering


Slides: Lecture9




Introduction to Information Retrieval (Chapter 16)

C Manning, P Raghavan, H Schütze


On Spectral Clustering: Analysis and an Algorithm

A Ng, M Jordan, Y Weiss




Lecture 10. Latent semantic analysis


Slides: Lecture10




Introduction to Information Retrieval (Chapter 18)

C Manning, P Raghavan, H Schütze


Indexing by latent semantic analysis
S Deerwester, ST Dumais, GW Furnas, TK Landauer