The archive below contains the dataset used in Kotlerman et al. (2012) and the output of the four methods compared on this dataset.
The data includes:
1. Gold-standard dataset of 194 sentences crawled from Twitter, expressing reasons for customer dissatisfaction with Citibank. The sentences were gathered automatically by a rule-based extraction algorithm and manually grouped into clusters according to the reasons stated in them.
2. A small corpus of tweets from the banking domain.
3. Output produced by the novel method suggested in the paper, and the baseline methods.
Download: data.zip (30M)
Reference:
Lili Kotlerman, Ido Dagan, Maya Gorodetsky and Ezra Daya. "Sentence clustering via projection over term clusters." Proceedings of the First Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, 2012.
1. Gold-standard dataset
The file "GS_dataset.txt" contains the sentences grouped into clusters, in the following format:
cluster1 name
\t sentence1
...
\t sentenceK
cluster2 name
...
Note that the gold standard is soft-clustered, placing a sentence in several clusters if more than one reason is stated in it.
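The format above can be read with a few lines of Python. The following is a minimal sketch, not part of the original distribution. It assumes that "\t" in the format description stands for a literal tab character at the start of each sentence line and that cluster-name lines do not start with a tab. The function names and the sample lines are invented for illustration.

```python
from collections import defaultdict


def parse_gold_standard(lines):
    """Map each cluster name to the list of sentences listed under it.

    Assumption: sentence lines start with a tab, cluster names do not.
    """
    clusters = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("\t"):
            if current is None:
                raise ValueError("sentence line found before any cluster name")
            clusters[current].append(line.strip())
        else:
            current = line.strip()
            clusters.setdefault(current, [])
    return clusters


def clusters_per_sentence(clusters):
    """Invert the mapping: sentence -> set of cluster names (soft clustering)."""
    membership = defaultdict(set)
    for name, sentences in clusters.items():
        for sentence in sentences:
            membership[sentence].add(name)
    return membership


# Invented sample in the assumed layout (not taken from GS_dataset.txt).
sample = [
    "fees\n",
    "\t they charged me a fee and the app is down\n",
    "app\n",
    "\t they charged me a fee and the app is down\n",
    "\t the app crashed again\n",
]

gold = parse_gold_standard(sample)
membership = clusters_per_sentence(gold)

# Sentences that state more than one reason appear in several clusters.
multi = [s for s, names in membership.items() if len(names) > 1]
print(multi)
```

To read the real file, the same function could be called as `parse_gold_standard(open("GS_dataset.txt", encoding="utf-8"))`; the encoding is also an assumption.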
2. A small corpus of tweets from the banking domain
The file "corpus.zip" contains a corpus of 20,476 tweets mentioning Bank of America and 11,422 tweets mentioning Citibank.
3. Output produced by our method and the baseline methods
The folder "output" contains the resulting clusterings in the following format:
cluster1 size \t cluster1 name
\t sentence1
...
\t sentenceK
cluster2 size \t cluster2 name
...
The clusters are sorted by size in descending order. For the baseline methods, cluster labels contain the terms ordered by descending frequency in the cluster. For our method, cluster labels list all the terms in the underlying term cluster, with the terms that appear in the sentences given in uppercase.
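The output files can be parsed in much the same way as the gold standard. The sketch below is illustrative only and makes several assumptions: header lines have the form size, tab, label; sentence lines start with a tab; and the cluster size equals the number of sentences listed under the header. The function names and the sample lines are invented and are not taken from the "output" folder.

```python
def parse_output(lines):
    """Return a list of (size, label, sentences) tuples in file order.

    Assumption: header lines look like "<size>\\t<label>" and sentence
    lines start with a tab.
    """
    clusters = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("\t"):
            if not clusters:
                raise ValueError("sentence line found before any cluster header")
            clusters[-1][2].append(line.strip())
        else:
            size_text, _, label = line.partition("\t")
            clusters.append((int(size_text.strip()), label.strip(), []))
    return clusters


def is_sorted_by_size(clusters):
    """Check that cluster sizes never increase from one cluster to the next."""
    sizes = [size for size, _, _ in clusters]
    return all(a >= b for a, b in zip(sizes, sizes[1:]))


# Invented sample in the assumed layout.
sample_output = [
    "2\tFEE charge APP\n",
    "\t they charged me a fee\n",
    "\t the app is down\n",
    "1\tATM machine\n",
    "\t the atm ate my card\n",
]

parsed = parse_output(sample_output)
sizes_match = all(size == len(sentences) for size, _, sentences in parsed)
print(is_sorted_by_size(parsed), sizes_match)
```

Keeping the declared size next to the parsed sentences makes it easy to check a file against the stated format before using it in an evaluation.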