ClueWeb09B is a large Web collection whose topics are drawn from the TREC Web Tracks of 2009, 2010, and 2011. The collection is filtered to the set of documents with spam scores in the 60th percentile, using the Waterloo Fusion spam scores [1]. It consists of 34M documents and 150 queries. The vocabulary size is 38M, and the collection length is about 26B. Here, ClueWeb09B-Title means that the titles of the topics are used as queries.
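
The spam filtering step can be illustrated with a minimal sketch. It rests on assumptions not stated above: that each line of the spam score file has the form "percentile-score document-id", that a higher percentile means a less spammy document, and that the filter keeps documents scoring at or above 60. The function name and the sample document IDs are hypothetical.

```python
from typing import Iterable, Iterator


def filter_by_spam_percentile(lines: Iterable[str], threshold: int = 60) -> Iterator[str]:
    """Yield the IDs of documents whose percentile score is >= threshold.

    Assumes each line looks like "<percentile-score> <document-id>" and that
    a higher score means a less spammy document (both are assumptions).
    """
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        score, doc_id = parts
        if int(score) >= threshold:
            yield doc_id


# Made-up sample lines, for illustration only.
sample = [
    "12 clueweb09-en0000-00-00001",
    "60 clueweb09-en0000-00-00002",
    "87 clueweb09-en0000-00-00003",
]
kept = list(filter_by_spam_percentile(sample))
print(kept)
```

Because the function consumes lines lazily, an open file handle can be passed in place of the sample list, so the full score file never has to be held in memory.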

These data can only be used for academic research purposes.

