Domain: Topic Modeling

Overview

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
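
The cats-and-dogs intuition above can be illustrated with a small simulation. This is only a sketch of the generative view (pick a topic according to the document's mix, then pick a word from that topic), not an inference algorithm such as LDA. The `TOPICS` table, its word probabilities, and the `generate_document` helper are hypothetical names and values chosen for illustration.

```python
import random
from collections import Counter

# Hypothetical toy topics: each maps words to probabilities (illustrative values).
TOPICS = {
    "dogs": {"dog": 0.5, "bone": 0.5},
    "cats": {"cat": 0.5, "meow": 0.5},
}


def generate_document(topic_mix, n_words, rng):
    """Draw each word by first picking a topic, then a word from that topic."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so the run is repeatable
doc = generate_document({"dogs": 0.9, "cats": 0.1}, n_words=10_000, rng=rng)

counts = Counter(doc)
dog_words = counts["dog"] + counts["bone"]
cat_words = counts["cat"] + counts["meow"]
print(dog_words, cat_words, round(dog_words / cat_words, 2))
```

With a 90%/10% mix, the ratio of dog words to cat words should come out close to 9. A topic model works in the opposite direction: it starts from the observed word counts and estimates both the topics and each document's mix.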

Dataset List

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. More detail: http://qwone.com/~jason/20Newsgroups/...
Baseline             Purity   NMI      ARI
Biterm Topic Model   0.6050   0.5920   0.4250
LDA                  0.5900   0.5500   0.4110
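
Purity, NMI, and ARI are clustering metrics that compare predicted cluster assignments with reference labels (here, presumably the newsgroup of each document). The sketch below shows one common way to compute them using only the standard library. The source does not say which NMI normalization was used; the arithmetic mean of the two entropies is an assumption, and the function names are illustrative.

```python
from collections import Counter
from math import comb, log


def purity(labels_true, labels_pred):
    """Fraction of items that belong to the majority class of their cluster."""
    pair_counts = Counter(zip(labels_pred, labels_true))
    best = {}
    for (cluster, _cls), n in pair_counts.items():
        best[cluster] = max(best.get(cluster, 0), n)
    return sum(best.values()) / len(labels_true)


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean entropy (assumed normalization)."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    mi = sum(
        (c / n) * log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )
    denom = (_entropy(labels_true) + _entropy(labels_pred)) / 2
    return mi / denom if denom > 0 else 1.0


def ari(labels_true, labels_pred):
    """Adjusted Rand index from contingency counts (needs at least 2 items)."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


# Toy data: six documents, two true classes, two predicted clusters.
truth = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
print(purity(truth, clusters), nmi(truth, clusters), ari(truth, clusters))
```

All three scores equal 1 when the clusters match the labels exactly, regardless of how the clusters are numbered. ARI is corrected for chance, so it can drop to around 0 (or slightly below) for random assignments.
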
189,080 questions from Baidu ZhiDao, each with a category label. Format: docID category|question. Questions are segmented with ICTCLAS. Encoding: UTF-8. Application: short text classification...
Baseline             Acc
Biterm Topic Model   0.5450
LDA                  0.4820
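
The `docID category|question` format described for the Baidu ZhiDao questions could be read along the following lines. This is a minimal sketch under stated assumptions: the docID is taken to be separated from the rest by whitespace, the first `|` is taken to end the category, and the segmented question is taken to be space-delimited tokens. The `parse_line` helper and the sample line are hypothetical and are not taken from the dataset.

```python
def parse_line(line):
    """Split one record into (doc_id, category, tokens).

    Assumes 'docID<whitespace>category|token token ...'; the real file
    may use a different delimiter between docID and category.
    """
    doc_id, rest = line.rstrip("\n").split(maxsplit=1)
    category, question = rest.split("|", 1)
    return doc_id, category, question.split()


# Hypothetical sample record, written in English for readability.
sample = "42 computers|how install printer driver\n"
doc_id, category, tokens = parse_line(sample)
print(doc_id, category, tokens)
```

When reading the actual file, opening it with `encoding="utf-8"` would match the encoding stated in the dataset description.
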
The NIPS data set contains papers from the NIPS conferences held between 1987 and 1999. More detail: http://www.cs.toronto.edu/~roweis/data.html...
Baseline             Acc