This paper demonstrates the usefulness of summaries in an extrinsic task of relevance judgment based on a new method for measuring agreement, Relevance-Prediction, which compares subjects’ judgments on summaries with their own judgments on full text documents. We demonstrate that, because this measure is more reliable than previous gold-standard measures, we are able to make stronger statistical statements about the benefits of summarization. We found positive correlations between ROUGE scores and two different summary types, where only weak or negative correlations were found using other agreement measures. However, we show that ROUGE may be sensitive to the choice of summarization style. We discuss the importance of these results and the implications for future summarization evaluations.
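As an illustration of the comparison that Relevance-Prediction makes, the following is a minimal sketch in Python. The abstract does not give the exact formula, so the scoring rule here (the fraction of documents on which a subject's summary-based judgment matches that same subject's full-text judgment) is an assumption, and the function name `relevance_prediction` and its parameters are hypothetical.

```python
from typing import Sequence


def relevance_prediction(
    summary_judgments: Sequence[bool],
    fulltext_judgments: Sequence[bool],
) -> float:
    """Fraction of documents where one subject's judgment on the summary
    matches that same subject's judgment on the full text.

    Assumed scoring rule for illustration; not taken from the paper.
    """
    if len(summary_judgments) != len(fulltext_judgments):
        raise ValueError("both judgment lists must cover the same documents")
    if not summary_judgments:
        raise ValueError("at least one judged document is required")

    # Count documents where the two judgments agree.
    matches = sum(s == f for s, f in zip(summary_judgments, fulltext_judgments))
    return matches / len(summary_judgments)


# Hypothetical judgments by one subject on four documents
# (True = judged relevant, False = judged not relevant).
on_summaries = [True, False, True, True]
on_full_text = [True, False, False, True]
score = relevance_prediction(on_summaries, on_full_text)
```

The point of the sketch is that the reference judgments come from the same subject rather than from an external gold standard, which is the property the abstract credits for the measure's reliability.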