Inter-document similarity is the critical information which determines whether or not the cluster-based retrieval improves the baseline. However, a theoretical work on inter-document similarity has not been investigated, even though such work can provide a principle to define a more improved similarity in a well-motivated direction. To support this theory, this paper starts from pursuing an ideal inter-document similarity that optimally satisfies the cluster-hypothesis. We propose a probabilistic principle of inter-document similarities; the optimal similarity of two documents should be proportional to the probability that they are co-relevant to an arbitrary query. Based on this principle, the study of the inter-document similarity is formulated to attack the estimation problem of the co-relevance model of documents. Furthermore, we obtain that the optimal inter-document similarity should be defined using queries as its basic unit, not terms, namely a query-based similarity. We strictly derive a novel query-based similarity from the co-relevance model, without any heuristics. Experimental results show that the new query-based inter-document similarity significantly improves the previously-used term-based similarity in the context of Voorhee's evaluation measure.
展开▼