
An Iterative Approach to Text Segmentation


Abstract

We present divSeg, a novel method for text segmentation that iteratively splits a portion of text at its weakest point in terms of the connectivity strength between two adjacent parts. To search for the weakest point, we apply two different measures: one based on language modeling of text segmentation, the other on the interconnectivity between two segments. Our solution produces a deep and narrow binary tree, a dynamic object that describes the structure of a text and that is fully adaptable to a user's segmentation needs. We treat it as a separate task to flatten the tree into a broad and shallow hierarchy, either through supervised learning on a document set or through explicit input of how a text should be segmented. The rich structure of the resulting tree further allows us to segment documents at varying levels such as topic, sub-topic, etc. We evaluated our new solution on a set of 265 articles from Discover magazine where the topic structures are unknown and need to be discovered. Our experimental results show that the iterative approach has the potential to generate better segmentation results than several leading baselines, and the separate flattening step allows us to adapt the results to different levels of detail and user preferences.
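
The divisive idea in the abstract (split at the weakest point, recurse on both halves, then flatten the resulting binary tree) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define its two connectivity measures, so the sketch assumes a simple bag-of-words cosine similarity between the two sides of a candidate split, and it assumes flattening by a fixed depth cut as a stand-in for the supervised or user-driven flattening step. All names (`Node`, `weakest_point`, `build_tree`, `flatten`) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt
from typing import List, Optional, Tuple


@dataclass
class Node:
    start: int                      # index of the first sentence (inclusive)
    end: int                        # index past the last sentence (exclusive)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def bag(sentences: List[str]) -> Counter:
    """Lower-cased word counts over a run of sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def weakest_point(sentences: List[str], start: int, end: int) -> int:
    """Split index k (start < k < end) where the two sides are least similar.

    ASSUMPTION: cosine similarity stands in for the paper's connectivity
    measures, which the abstract does not specify.
    """
    return min(
        range(start + 1, end),
        key=lambda k: cosine(bag(sentences[start:k]), bag(sentences[k:end])),
    )


def build_tree(sentences: List[str], start: int = 0, end: Optional[int] = None) -> Node:
    """Recursively split at the weakest point until single sentences remain."""
    if end is None:
        end = len(sentences)
    node = Node(start, end)
    if end - start > 1:
        k = weakest_point(sentences, start, end)
        node.left = build_tree(sentences, start, k)
        node.right = build_tree(sentences, k, end)
    return node


def flatten(node: Node, depth: int) -> List[Tuple[int, int]]:
    """Cut the tree at a fixed depth and return (start, end) sentence spans.

    ASSUMPTION: a depth cut is a simplification of the flattening step.
    """
    if depth == 0 or node.left is None or node.right is None:
        return [(node.start, node.end)]
    return flatten(node.left, depth - 1) + flatten(node.right, depth - 1)
```

For the four sentences `["cats purr", "cats sleep", "stocks fall", "stocks rise"]`, the first split falls between the second and third sentence, where the two sides share no words, so `flatten(build_tree(sentences), 1)` gives the spans `(0, 2)` and `(2, 4)`. A larger `depth` yields finer segments from the same tree, which mirrors the abstract's point that one tree can serve several levels of detail.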

Bibliographic details

  • Source
    Advances in Information Retrieval | 2011 | pp. 629-640 | 12 pages
  • Conference location: Dublin (IE)
  • Author affiliations

    School of Computer Science, University of Guelph, 50 Stone Road East, Guelph, Ontario, N1G 2W1, Canada;

    School of Computer Science, University of Guelph, 50 Stone Road East, Guelph, Ontario, N1G 2W1, Canada;

    School of Computer Science, University of Guelph, 50 Stone Road East, Guelph, Ontario, N1G 2W1, Canada;

    PryLynx Corporation, 21 Oneida Place, Kitchener, Ontario, N2A 3G2, Canada;

  • Conference organizer
  • Original format: PDF
  • Language: English
  • Chinese Library Classification: Information processing
  • Keywords

    text segmentation; language modeling

  • Date indexed: 2022-08-26 13:47:04
