Topics in Text Categorization
******************************
Assignment 1: Here are texts from Maariv and Yediot.
Use the first 75% of the documents in each folder as a training corpus and save
the rest for testing. Your task is to learn a classifier for the topics news,
sports, finance and culture.
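A minimal sketch of the per-folder split, assuming that "first" means first in sorted filename order (the assignment does not say how the documents are ordered). The function name split_folder and its parameter names are hypothetical.

```python
def split_folder(filenames, train_fraction=0.75):
    """Split one folder's documents into (train, test).

    Assumption: documents are ordered by sorted filename, and the
    training set is the leading train_fraction of that ordering.
    """
    ordered = sorted(filenames)
    cut = int(len(ordered) * train_fraction)  # rounds down
    return ordered[:cut], ordered[cut:]
```

Apply it to each topic folder separately, so that every topic keeps the same 75/25 proportion in the training and test sets.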
Choose the 1000 most common words in the corpus as the base vocabulary. Use two
methods, frequency and information gain, to filter this
vocabulary down to 750, 500 and 250 words. Using each of these feature
sets, learn classifiers using Naive Bayes and SMO.
Compare results over the test set.
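The two filtering methods might be sketched as follows. This is an illustration under stated assumptions, not a required implementation: documents are already tokenized into lists of words, information gain is computed on word presence or absence in a document, and all function names are hypothetical. Training the Naive Bayes and SMO classifiers is not shown.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def base_vocabulary(docs, size=1000):
    """Return the `size` most frequent words, most frequent first."""
    freq = Counter(tok for doc in docs for tok in doc)
    return [word for word, _ in freq.most_common(size)]

def information_gain(word, doc_sets, labels):
    """IG(word) = H(C) - H(C | word present or absent).

    doc_sets holds one set of words per document (assumed binary features).
    """
    classes = sorted(set(labels))
    present, absent = Counter(), Counter()
    for words, label in zip(doc_sets, labels):
        (present if word in words else absent)[label] += 1
    n = len(labels)
    n_present = sum(present.values())
    n_absent = n - n_present
    h_class = entropy([present[c] + absent[c] for c in classes])
    h_cond = (n_present / n) * entropy([present[c] for c in classes]) \
           + (n_absent / n) * entropy([absent[c] for c in classes])
    return h_class - h_cond

def filter_vocabulary(vocab, docs, labels, k, method):
    """Reduce a frequency-ordered vocabulary to k words."""
    if method == "frequency":
        return vocab[:k]  # vocab is already ordered by frequency
    if method == "infogain":
        doc_sets = [set(doc) for doc in docs]
        ranked = sorted(vocab,
                        key=lambda w: information_gain(w, doc_sets, labels),
                        reverse=True)
        return ranked[:k]
    raise ValueError("method must be 'frequency' or 'infogain'")
```

Both the vocabulary and the information gain scores should be computed from the training documents only, so that the test set plays no part in feature selection.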
******************************
1. General introduction
An overview of text categorization
Machine learning in automated text categorization
F Sebastiani - ACM Computing Surveys, 2002
2. Individual learning algorithms for text categorization
A comparison of event models for naïve Bayes text classification
A McCallum, K Nigam
BoosTexter: A Boosting-based System for Text Categorization
RE Schapire, Y Singer
Text Categorization with Support Vector Machines
T Joachims
3. Comparison studies: learning algorithms
Inductive Learning Algorithms and Representations for Text Categorization
S Dumais, J Platt, M Sahami, D Heckerman
A re-examination of text categorization methods
Y Yang, X Liu
4. Comparison studies: feature selection
A comparative study on feature selection in text categorization
Y Yang, JO Pedersen
An extensive empirical study of feature selection metrics for text classification
G Forman
5. Feature reduction (not selection)
Indexing by latent semantic analysis
S Deerwester, ST Dumais, GW Furnas, TK Landauer
The power of word clusters for text classification
N Slonim, N Tishby
6. Authorship Attribution
Computer-Based Authorship Attribution Without Lexical Measures
E Stamatatos, N Fakotakis, G Kokkinakis
Language independent authorship attribution using character level language models
F Peng, D Schuurmans, V Keselj, S Wang
Author Identification on the Large Scale
D Madigan, A Genkin, DD Lewis, S Argamon, D Fradkin
Authorship Attribution with Thousands of Candidate Authors
M Koppel, J Schler, S Argamon, E Messeri
7. Author profiling
Determining an Author's Native Language by Mining a Text for Errors (Slides)
M Koppel, J Schler, K Zigdon
Automatically categorizing written texts by author gender (Slides)
M Koppel, S Argamon, A Shimoni
8. Authorship verification
Authorship Verification as a One-Class Classification Problem (Slides)
M Koppel, J Schler
9. Sentiment analysis: words and phrases
Predicting the semantic orientation of adjectives
V Hatzivassiloglou, KR McKeown
Thumbs Up or Thumbs Down?
P Turney
Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis (Slides)
T Wilson, J Wiebe, P Hoffmann
10. Sentiment analysis: documents
Thumbs up?: sentiment classification using machine learning techniques
B Pang, L Lee, S Vaithyanathan
Mining the peanut gallery: opinion extraction and semantic classification of product reviews
K Dave, S Lawrence, DM Pennock
The Importance of Neutral Examples for Learning Sentiment (Slides)
M Koppel, J Schler
89469: a list of NLP resources and NLP for TC - Oren's slides
Slides Lecture1
Slides Lecture2
Slides Lecture3
Slides Lecture4
Slides Lecture5
Slides Lecture6
Slides Lecture7
Slides Lecture8
Slides Lecture9
Slides Lecture10