Recent developments in neural information retrieval models have been promising, but a problem remains: human relevance judgments are expensive to produce, while neural models require a considerable amount of training data. In an attempt to fill this gap, we present an approach that, given a weak training set of pseudo-queries, documents, and relevance information, filters the data to produce effective positive and negative query-document pairs. This allows large corpora to be used as neural IR model training data, while eliminating training examples that do not transfer well to relevance scoring. The filters include unsupervised ranking heuristics and a novel measure of interaction similarity. We evaluate our approach using a news corpus with article headlines acting as pseudo-queries and article content as documents, with implicit relevance between an article's headline and its content. By using our approach to train state-of-the-art neural IR models and comparing to established baselines, we find that training data generated by our approach can lead to good results on a benchmark test collection.
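To make the filtering idea concrete, below is a minimal sketch of how an unsupervised ranking heuristic could filter weak headline-article pairs. The abstract does not specify the heuristics used, so the details here are assumptions: BM25 stands in for the unsupervised ranker, a pair is kept as a positive only if the article ranks in the top-k results for its own headline, and hard negatives are drawn from other high-scoring articles. The interaction-similarity filter is omitted; `tokenize`, `filter_pairs`, and `top_k` are hypothetical names, not from the paper.

```python
# Hypothetical sketch of ranking-heuristic filtering for weak IR training data.
# Assumption: BM25 self-retrieval is the filter; the paper's actual heuristics
# and interaction-similarity measure are not specified in the abstract.

import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class BM25:
    """Plain BM25 scorer over a list of tokenized documents."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in docs) / len(docs)
        self.tfs = [Counter(d) for d in docs]
        df = Counter()
        for d in docs:
            df.update(set(d))
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5))
                    for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            num = tf[t] * (self.k1 + 1)
            den = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += self.idf.get(t, 0.0) * num / den
        return s


def filter_pairs(headlines, articles, top_k=10):
    """Yield (query, positive_doc, negative_doc) training triples."""
    docs = [tokenize(a) for a in articles]
    bm25 = BM25(docs)
    for i, h in enumerate(headlines):
        q = tokenize(h)
        ranking = sorted(range(len(docs)),
                         key=lambda j: bm25.score(q, j), reverse=True)
        # Discard pairs whose headline fails to retrieve its own article:
        # these are the examples assumed not to transfer to relevance scoring.
        if i not in ranking[:top_k]:
            continue
        # Hard negative: the highest-ranked article that is not the true one.
        neg = next(j for j in ranking if j != i)
        yield h, articles[i], articles[neg]
```

Under these assumptions, the filter serves two roles at once: it removes headline-article pairs that behave poorly as query-document pairs, and it mines hard negatives from the same ranking pass, so that large unlabeled corpora yield both sides of the training signal.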