Stanford Contextual Word Similarity


Huang et al. (2012) introduced a new dataset with human judgments on pairs of words in sentential context, Stanford’s Contextual Word Similarities (SCWS). The dataset consists of 2003 word pairs and their sentential contexts. By part of speech, it contains 1328 noun-noun pairs, 399 verb-verb pairs, 140 verb-noun, 97 adjective-adjective, 30 noun-adjective, and 9 verb-adjective pairs, which together account for all 2003. Of these, 241 are same-word pairs.


Each line in ratings.txt consists of a pair of words, their parts of speech, their respective contexts, the average human rating, and the 10 individual human ratings. The target word is surrounded by <b>...</b> in its context. Each line is tab-delimited, with fields in the following order:

<id> <word1> <POS of word1> <word2> <POS of word2> <word1 in context> <word2 in context> <average human rating> <10 individual human ratings>
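
The format above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset distribution: the names ScwsRecord, parse_line, target_word, and load_ratings are hypothetical, and it assumes every line has exactly 18 tab-separated fields (8 leading fields plus 10 ratings), that ratings are numeric, and that the file is UTF-8 encoded.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches the target word marked as <b>...</b> inside a context string.
TARGET_RE = re.compile(r"<b>(.*?)</b>")

# Assumption: 8 leading fields followed by 10 individual ratings.
EXPECTED_FIELDS = 18


@dataclass
class ScwsRecord:
    pair_id: str
    word1: str
    pos1: str
    word2: str
    pos2: str
    context1: str
    context2: str
    average_rating: float
    ratings: List[float]


def parse_line(line: str) -> ScwsRecord:
    """Split one tab-delimited line of ratings.txt into a record."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            f"expected {EXPECTED_FIELDS} fields, got {len(fields)}"
        )
    return ScwsRecord(
        pair_id=fields[0],
        word1=fields[1],
        pos1=fields[2],
        word2=fields[3],
        pos2=fields[4],
        context1=fields[5],
        context2=fields[6],
        average_rating=float(fields[7]),
        ratings=[float(r) for r in fields[8:]],
    )


def target_word(context: str) -> Optional[str]:
    """Return the text inside the first <b>...</b> span, or None if absent."""
    match = TARGET_RE.search(context)
    return match.group(1).strip() if match else None


def load_ratings(path: str) -> List[ScwsRecord]:
    """Parse every non-empty line of the file at `path`.

    The UTF-8 encoding is an assumption; adjust it if the file differs.
    """
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]
```

Calling load_ratings("ratings.txt") would return one record per word pair; target_word(record.context1) then recovers the marked occurrence of the first word. Raising an error on an unexpected field count is a deliberate choice here, so that any line that departs from the documented format is noticed rather than silently misread.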


Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL ’12, pages 873–882, Stroudsburg, PA, USA. Association for Computational Linguistics.
