Dataset Introductions

MQ2007 is a query set from Million Query track of TREC 2007. There are about 1700 queries in it with labeled documents. In MQ2007, the 5-fold cross

validation strategy is adopted and the 5-fold partitions are included in the package. In each fold, there are three subsets for learning: training set,

validation set and testing set.


  Folds     Training set   Validation set   Test set

  Fold1     {S1,S2,S3}          S4              S5 

  Fold2     {S2,S3,S4}          S5              S1 

  Fold3     {S3,S4,S5}          S1              S2 

  Fold4     {S4,S5,S1}          S2              S3 

  Fold5     {S5,S1,S2}          S3              S4 


Dataset Descriptions

Each row is a query-document pair. The first column is relevance label of this pair, the second column is query id, the following columns are features, and

the end of the row is comment about the pair, including id of the document. The larger the relevance label, the more relevant the query-document pair. A

query-document pair is represented by a 46-dimensional feature vector. Here are several example rows from MQ2007 dataset:


2 qid:10032 1:0.056537 2:0.000000 3:0.666667 4:1.000000 5:0.067138 ... 45:0.000000 46:0.076923 #docid=GX029-35-5894638 inc=0.0119881192468859 prob=0.139842

0 qid:10032 1:0.279152 2:0.000000 3:0.000000 4:0.000000 5:0.279152 ... 45:0.250000 46:1.000000 #docid=GX030-77-6315042 inc=1 prob=0.341364

0 qid:10032 1:0.130742 2:0.000000 3:0.333333 4:0.000000 5:0.134276 ... 45:0.750000 46:1.000000 #docid=GX140-98-13566007 inc=1 prob=0.0701303

1 qid:10032 1:0.593640 2:1.000000 3:0.000000 4:0.000000 5:0.600707 ... 45:0.500000 46:0.000000 #docid=GX256-43-0740276 inc=0.0136292023050293 prob= 0.400738



MQ2007.rar     Downloads  66  times