Topics in Text Categorization

*******************************************************************
Assignment: Using this blog corpus, do the following. Choose 100 blogs, each with at least 1000 words. Label the first 500 words of blog i as begin_i and the last 500 words of blog i as end_i. Your task is to learn to distinguish pairs <begin_i, end_i> from pairs <begin_i, end_j>, where i != j. You may use any method you like, and you may use other blogs in the corpus as background information. Then generate similar pairs from different blogs (do not use any blog that you used in development in any way) and test the effectiveness of your method.

*******************************************************************
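
The pair construction described in the assignment could be sketched roughly as follows. This is only an illustration, not a required approach: the function names split_blog and build_pairs are hypothetical, words are taken to be whitespace-separated tokens, and one mismatched pair is sampled per blog (the assignment does not fix the ratio of matched to mismatched pairs).

```python
import random

def split_blog(text, n_words=500, min_words=1000):
    """Return (begin, end) for one blog, or None if the blog is too short.

    Assumption: a "word" is a whitespace-separated token.
    """
    words = text.split()
    if len(words) < min_words:
        return None
    return " ".join(words[:n_words]), " ".join(words[-n_words:])

def build_pairs(blogs, seed=0):
    """Build labeled pairs from a list of raw blog texts.

    Label 1 marks <begin_i, end_i>; label 0 marks <begin_i, end_j>, i != j.
    Assumption: exactly one mismatched pair is sampled per blog.
    """
    rng = random.Random(seed)
    halves = [h for h in (split_blog(b) for b in blogs) if h is not None]
    if len(halves) < 2:
        raise ValueError("need at least two blogs of sufficient length")
    pairs = []
    for i, (begin_i, end_i) in enumerate(halves):
        pairs.append((begin_i, end_i, 1))
        # Pick the ending of a different blog for the mismatched pair.
        j = rng.choice([k for k in range(len(halves)) if k != i])
        pairs.append((begin_i, halves[j][1], 0))
    return pairs
```

To respect the requirement that test blogs play no role in development, call build_pairs separately on two disjoint lists of blogs, one for development and one for testing.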

 

Lecture 1. General introduction; comparison of learning algorithms

Slides : Lecture1

Readings:

 

Machine learning in automated text categorization
F Sebastiani

 

Inductive Learning Algorithms and Representations for Text Categorization
S Dumais, J Platt, M Sahami, D Heckerman

 

A re-examination of text categorization methods
Y Yang, X Liu

Text Categorization with Support Vector Machines
T Joachims

 

 

 

Lecture 2. Naïve Bayes; feature selection

Slides : Lecture2

Readings:

 

A comparison of event models for Naïve Bayes text classification
A McCallum, K Nigam

 

A comparative study on feature selection in text categorization
Y Yang, JO Pedersen

 

An extensive empirical study of feature selection metrics for text classification
G Forman

 

 

 

Lecture 3. Authorship attribution

 

Slides : Lecture3

 

Readings:

 

Computational Methods in Authorship Attribution

M. Koppel, J. Schler, S. Argamon

 

 

 

Lecture 4. Authorship verification

 

Slides : Lecture4

 

Readings:

 

Measuring Differentiability: Unmasking Pseudonymous Authors

M. Koppel, J. Schler, E. Bonchek-Dokow

 

Authorship Attribution in the Wild

M. Koppel, J. Schler, S. Argamon

 

 

 

Lecture 5. Author profiling

Slides : Lecture5

Readings:

 

Determining an Author's Native Language by Mining a Text for Errors
M. Koppel, J. Schler, K. Zigdon

 

Automatically Profiling the Author of an Anonymous Text

S. Argamon, M. Koppel, J. Pennebaker and J. Schler

 

 

 

Lecture 6. Bottom-up sentiment analysis

Slides : Lecture6

 

Readings:

 

Predicting the semantic orientation of adjectives
V Hatzivassiloglou, KR McKeown

 

Thumbs Up or Thumbs Down?
P Turney

 

Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis
T Wilson, J Wiebe, P Hoffmann



Lecture 7. Top-down sentiment analysis

Slides : Lecture7

 

Readings:


Opinion Mining and Sentiment Analysis

B Pang, L Lee


Mining the peanut gallery: opinion extraction and semantic classification of product reviews
K Dave, S Lawrence, DM Pennock


The Importance of Neutral Examples for Learning Sentiment
M Koppel, J Schler

 

 

 

Lecture 8. Spam filtering

Slides : Lecture8

 

 

 

Lecture 9. Text clustering

 

Slides: Lecture9

 

Readings:

 

Introduction to Information Retrieval (Chapter 16)

C Manning, P Raghavan, H Schutze

 

On Spectral Clustering: Analysis and an Algorithm

A Ng, M Jordan, Y Weiss

 

 

 

Lecture 10. Latent semantic analysis

 

Slides: Lecture10

 

Readings:

 

Introduction to Information Retrieval (Chapter 18)

C Manning, P Raghavan, H Schutze

 

Indexing by latent semantic analysis
S Deerwester, ST Dumais, GW Furnas, TK Landauer