Information retrieval can be greatly enhanced if the semantics of document contents are made explicit as labels that can be queried by markup-sensitive languages. We focus on labelling small text fragments, such as parts of sentences or paragraphs, with frequent topics. We propose WORDtrain, a sequence miner that builds topics for small document regions, such as sentences with many subsentences. WORDtrain splits regions into non-overlapping fragments such that the topics derived for them are frequent. Because WORDtrain discovers frequent topics rather than choosing from a predefined reference list, evaluating the quality of its results becomes an issue in its own right. For this purpose, we have designed two evaluation schemes, one requiring expert involvement and one automatic. Our first experiments with these schemes show that WORDtrain yields promising results.