Conference: International Conference on Data Warehousing and Knowledge Discovery

Deriving Multiple Topics to Label Small Document Regions

Abstract

Information retrieval can be greatly enhanced if the semantics of document contents are made explicit as labels that can be queried by markup-sensitive languages. We focus on labelling small text fragments, such as parts of sentences or paragraphs, with frequent topics. We propose WORDtrain, a sequence miner that builds topics for small document regions, such as sentences with many subsentences. WORDtrain splits regions so that the resulting fragments do not overlap and the topics derived for them are frequent. WORDtrain discovers frequent topics rather than choosing from a predefined reference list. This raises the issue of evaluating the quality of its results. To this end, we have designed two evaluation schemes, one that requires expert involvement and one that is automatic. Our first experiments with these schemes show that WORDtrain yields promising results.
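
The abstract does not spell out WORDtrain's algorithm, so the following is only a toy illustration of the general idea it describes: derive frequent word sequences from a collection of small regions, then split each region into non-overlapping fragments labelled by those sequences. The function names (`frequent_ngrams`, `label_region`), the parameters `max_len` and `min_support`, the use of contiguous n-grams as "topics", and the greedy longest-match split are all assumptions made for this sketch, not details taken from the paper.

```python
from collections import Counter


def frequent_ngrams(regions, max_len=3, min_support=2):
    """Return contiguous word n-grams that occur in at least min_support regions.

    Assumption for this sketch: a "topic" is a contiguous word n-gram and
    "frequent" means region-level support >= min_support.
    """
    support = Counter()
    for words in regions:
        seen = set()  # count each n-gram at most once per region
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                seen.add(tuple(words[i:i + n]))
        support.update(seen)
    return {gram: count for gram, count in support.items() if count >= min_support}


def label_region(words, frequent, max_len=3):
    """Greedily split a region into non-overlapping fragments with frequent labels.

    Scans left to right and, at each position, takes the longest frequent
    n-gram that starts there. Words that start no frequent n-gram stay unlabelled.
    Returns a list of (start, end, label) with end exclusive.
    """
    fragments = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            gram = tuple(words[i:i + n])
            if gram in frequent:
                fragments.append((i, i + n, gram))
                i += n  # jump past the fragment so fragments never overlap
                break
        else:
            i += 1  # no frequent topic starts at this word
    return fragments


# Hypothetical example data, invented for illustration only.
regions = [
    "data warehouse query performance".split(),
    "data warehouse design".split(),
    "query performance tuning".split(),
]
frequent = frequent_ngrams(regions, max_len=2, min_support=2)
for words in regions:
    print(words, label_region(words, frequent, max_len=2))
```

A greedy longest-match split is the simplest way to guarantee non-overlapping fragments; a real sequence miner would likely weigh alternative splits against each other, which this sketch does not attempt.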