ClueWeb09B is a large Web collection whose topics are drawn from the TREC Web Tracks of 2009, 2010, and 2011. The collection is filtered to the set of documents with spam scores in the 60th percentile, using the Waterloo Fusion spam scores [1]. It consists of 34M documents and 150 queries. The vocabulary size is 38M and the collection length is about 26B. Here, ClueWeb09B-Title means that the titles of the topics are used as queries.
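
The spam filtering step can be illustrated with a minimal sketch. It is not the procedure used to build this collection. It assumes a score listing with one "percentile docid" pair per line, that higher percentiles indicate less spam, and that "in the 60th percentile" means a score of 60 or above. The function name, the threshold parameter, and the sample document IDs are hypothetical.

```python
from typing import Iterable, Iterator


def filter_by_spam_percentile(
    lines: Iterable[str], min_percentile: int = 60
) -> Iterator[str]:
    """Yield document IDs whose spam percentile is at least min_percentile.

    Assumed input format (one record per line): "<percentile> <docid>".
    Assumed semantics: a higher percentile means the page is less likely spam.
    """
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines rather than failing.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        score, doc_id = int(parts[0]), parts[1]
        if score >= min_percentile:
            yield doc_id


# Illustrative records only; these are not real scores.
sample = [
    "81 clueweb09-en0000-00-00000",
    "12 clueweb09-en0000-00-00001",
    "60 clueweb09-en0000-00-00002",
]
kept = list(filter_by_spam_percentile(sample))
print(kept)
```

If the threshold is meant to be exclusive, or the scores run in the opposite direction, only the comparison in the last `if` needs to change.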

These data can only be used for academic research purposes.

[1] G. V. Cormack, M. D. Smucker, and C. L. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14(5):441–465, 2011.


clueweb09B-title.tar.bz2