Domain: Word Representation


Word representation is central to many fields, such as natural language processing and information retrieval. Vector space models of language represent each word with a real-valued vector that captures both semantic and syntactic information about the word.

The representations can be used as basic features in a variety of applications, such as information retrieval, named entity recognition, question answering, disambiguation, and parsing. The tasks on this page use the publicly available April 2010 dump as the corpus. In total, the corpus has 3,035,070 articles and about 1 billion tokens. In preprocessing, we lowercase the corpus and remove pure-digit words and non-English characters. The preprocessed corpus is available for download.
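The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used for the corpus: the function name `preprocess` is hypothetical, whitespace tokenization is an assumption, and "non-English characters" is interpreted here as anything outside ASCII letters, digits, and whitespace.

```python
import re
from typing import List

# Assumption: characters other than a-z, 0-9 and whitespace count as
# "non-English" and are replaced by a space. The page does not define this.
NON_ENGLISH = re.compile(r"[^a-z0-9\s]")


def preprocess(text: str) -> List[str]:
    """Lowercase, strip non-English characters, drop pure-digit words."""
    lowered = text.lower()
    cleaned = NON_ENGLISH.sub(" ", lowered)
    tokens = cleaned.split()  # assumption: whitespace tokenization
    # Keep mixed tokens such as "word2vec"; drop tokens made only of digits.
    return [tok for tok in tokens if not tok.isdigit()]
```

For example, `preprocess("Word2Vec was released in 2013!")` keeps `word2vec` but drops `2013` and the exclamation mark.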

Dataset List

The word analogy task was introduced by Mikolov et al. (2013) to quantitatively evaluate the linguistic regularities between pairs of word representations. The task consists of questions like “a is to b as c is to __”, where the fourth word is missing and must be guessed from the entire vocabulary....
CBOW 73.5800 65.9500 69.5000
PDC 72.7700 67.6800 70.3500
GloVe 71.3900 53.7200 61.5700
HDC 69.5700 63.7500 66.6700
SG 65.6200 56.6100 60.6400
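One common way to answer the analogy questions described above is the vector offset method: pick the vocabulary word whose vector is closest, by cosine similarity, to b - a + c. The sketch below assumes that method and assumes the three question words are excluded from the candidates; this page does not state which procedure produced the scores listed here. The names `solve_analogy` and `analogy_accuracy` and the dictionary-of-lists embedding format are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Embeddings = Dict[str, List[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(emb: Embeddings, a: str, b: str, c: str) -> str:
    """Answer "a is to b as c is to __" with the vector offset method."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    # Assumption: the question words themselves are not valid answers.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb: Embeddings,
                     questions: Sequence[Tuple[str, str, str, str]]) -> float:
    """Percentage of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in questions)
    return 100.0 * correct / len(questions)
```

A question counts as correct only on an exact match with the expected word, so the search runs over the whole vocabulary for every question.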
Minh-Thang Luong et al. introduced a new dataset focusing on rare words. Its 2034 word pairs contain more morphological complexity than other well-established word similarity datasets, e.g. the pair (crudeness, impoliteness). Details can be found in this paper. Reference Minh-Thang Luong, Richard Socher, a...
PDC 47.5400
CBOW 45.9200
HDC 44.2600
GloVe 42.8600
SG 42.6800
WordSim 353 is a standard dataset for evaluating vector-space models. It consists of 353 pairs of words. Each pair is presented without context and rated by 13 or 16 human annotators for similarity or relatedness on a scale from 0 (totally unrelated words) to 10 (very much related or identical words). Details...
PDC 74.1200
CBOW 73.2500
HDC 70.2500
GloVe 68.9300
SG 68.6900
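Word similarity datasets such as this one are often scored by comparing the model's cosine similarities against the human ratings with Spearman's rank correlation. The sketch below assumes that metric and assumes pairs with out-of-vocabulary words are skipped; this page does not state how its scores were computed. The function names and the `(word1, word2, rating)` tuple format are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate_similarity(emb: Dict[str, List[float]],
                        pairs: Sequence[Tuple[str, str, float]]) -> float:
    """Spearman's rho between model similarities and human ratings."""
    # Assumption: pairs with an out-of-vocabulary word are skipped.
    known = [(a, b, r) for a, b, r in pairs if a in emb and b in emb]
    model_scores = [cosine(emb[a], emb[b]) for a, b, _ in known]
    human_scores = [r for _, _, r in known]
    return spearman(model_scores, human_scores)
```

Because only the rank order matters, the 0 to 10 rating scale and the cosine range do not need to be aligned before comparison.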
Huang et al. (2012) introduced a new dataset with human judgments on pairs of words in sentential context, Stanford’s Contextual Word Similarities (SCWS). The dataset consists of 2003 word pairs and their sentential contexts. It includes 1328 noun-noun pairs, 399 verb-verb pairs, 140 verb-noun, ...
PDC 66.5900
CBOW 64.8200
HDC 62.7600
SG 62.4700
GloVe 62.3700