Measuring similarity between sentences plays an important role in textual applications such as document summarization and question answering. While various sentence similarity measures have recently been proposed, these measures typically only take into account word importance by virtue of inverse document frequency (IDF) weighting. IDF values are based on global information compiled over a large corpus of documents, and we hypothesise that at the sentence level better performance can be achieved by using a measure of the importance of a word within the sentence that it appears. In this paper we show how the PageRank graph-centrality algorithm can be used to assign a numerical measure of importance to each word in a sentence, and how these values can be incorporated within various sentence similarity measures. Results from applying the measures to a difficult sentence clustering task demonstrates that incorporation of sentential word importance leads to statistically significant improvement in clustering performance as evaluated using a range of external clustering criteria.
展开▼