The archive below contains the dataset used in Kotlerman et al. (2012) and the output of the four methods compared on this dataset.

The data includes:

1. Gold-standard dataset of 194 sentences crawled from Twitter, expressing reasons for customer dissatisfaction with Citibank. The sentences were gathered automatically by a rule-based extraction algorithm and manually grouped into clusters according to the reasons stated in them.

2. A small corpus of tweets from the banking domain.

3. Output produced by the novel method suggested in the paper, and the baseline methods.

Download: data.zip (30M)

Reference:

Lili Kotlerman, Ido Dagan, Maya Gorodetsky and Ezra Daya. "Sentence clustering via projection over term clusters." Proceedings of the First Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, 2012.

1. Gold-standard dataset

The file "GS_dataset.txt" contains the sentences grouped into clusters, in the following format:

cluster1 name

\t sentence1

...

\t sentenceK

cluster2 name

...

Note that the gold standard is soft-clustered, placing a sentence in several clusters if more than one reason is stated in it.
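The format above can be read with a few lines of Python. The following is a minimal sketch, not part of the original distribution: the function names parse_gold_standard and sentence_memberships are hypothetical, and the code assumes that sentence lines begin with a tab, that cluster-name lines do not, and that blank lines can be ignored. The second function inverts the mapping so that the soft clustering (one sentence in several clusters) is easy to inspect.

```python
from collections import defaultdict


def parse_gold_standard(lines):
    """Map each cluster name to the list of sentences listed under it.

    Assumption: a line starting with a tab is a sentence; any other
    non-blank line is a cluster name.
    """
    clusters = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("\t"):
            if current is None:
                raise ValueError("sentence found before any cluster name")
            clusters[current].append(line.strip())
        else:
            current = line.strip()
            clusters.setdefault(current, [])
    return clusters


def sentence_memberships(clusters):
    """Invert the mapping: sentence -> set of cluster names it belongs to."""
    membership = defaultdict(set)
    for name, sentences in clusters.items():
        for sentence in sentences:
            membership[sentence].add(name)
    return dict(membership)


# Usage (the file encoding is an assumption):
# with open("GS_dataset.txt", encoding="utf-8") as f:
#     clusters = parse_gold_standard(f)
# multi = {s: c for s, c in sentence_memberships(clusters).items() if len(c) > 1}
```

Sentences whose membership set has more than one element are the ones that state more than one reason.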

2. A small corpus of tweets from the banking domain

The file "corpus.zip" contains a corpus of 20,476 tweets mentioning Bank of America and 11,422 tweets mentioning Citibank.

3. Output produced by our method and the baseline methods

The folder "output" contains the resulting clusterings in the following format:

cluster1 size \t cluster1 name

\t sentence1

...

\t sentenceK

cluster2 size \t cluster2 name

...

The clusters are sorted by size in descending order. For the baseline methods, cluster labels list the terms in descending order of their frequency in the cluster. For our method, cluster labels list all the terms in the underlying term cluster, with the terms that appear in the sentences given in uppercase.
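The output files can be read in much the same way as the gold standard. The following is a minimal sketch under stated assumptions, not code shipped with the archive: the names parse_clustering and is_sorted_by_size are hypothetical, header lines are assumed to have the form "size, tab, label" with an integer size, sentence lines are assumed to start with a tab, and blank lines are skipped. The label is kept as a single string because the separator between its terms is not specified here.

```python
def parse_clustering(lines):
    """Parse one output file into a list of cluster records.

    Each record is a dict with keys "size" (int, as declared in the
    header line), "label" (str) and "sentences" (list of str).
    """
    clusters = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("\t"):
            if not clusters:
                raise ValueError("sentence found before any cluster header")
            clusters[-1]["sentences"].append(line.strip())
        else:
            size_text, sep, label = line.partition("\t")
            if not sep:
                raise ValueError("header line without a tab: %r" % line)
            clusters.append({
                "size": int(size_text.strip()),
                "label": label.strip(),
                "sentences": [],
            })
    return clusters


def is_sorted_by_size(clusters):
    """Check that declared cluster sizes are in descending order."""
    return all(a["size"] >= b["size"] for a, b in zip(clusters, clusters[1:]))
```

Calling is_sorted_by_size on a parsed file is a quick sanity check that the file was read in the order described above.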