Journal: The Electronic Library

Multi-granularity hierarchical topic-based segmentation of structured, digital library resources



Abstract

Purpose - Current segmentation systems almost invariably perform linear segmentation, dividing text only into a linear sequence of segments. This suits cohesive text such as news feeds, but not coherent texts such as digital library documents, which have hierarchical structures. To move beyond linear segmentation and achieve hierarchical segmentation of a digital library's structured resources, this paper proposes a new multi-granularity hierarchical topic-based segmentation system (MHTSS) to decide section breaks.

Design/methodology/approach - MHTSS adopts a top-down segmentation strategy to divide a structured digital library document into a document segmentation tree. Specifically, it works in three stages: document parsing, coarse segmentation based on document access structures and fine-grained segmentation based on lexical cohesion.

Findings - This paper analyzed the limitations of document segmentation methods for structured digital library resources. The authors found that document access structures and lexical cohesion techniques complement each other and, combined, allow for better segmentation of such resources. Based on this finding, the paper proposed MHTSS. In an evaluation against the TT and C99 algorithms on real-world digital library corpora, MHTSS achieved the best overall performance.

Practical implications - With MHTSS, digital library users can retrieve relevant information directly as segments instead of receiving the whole document. This will improve retrieval performance and dramatically reduce information overload.

Originality/value - This paper proposed MHTSS for structured digital library resources, which combines document access structures and lexical cohesion techniques to decide section breaks. With this system, end-users can access a document by sections through a document structure tree.
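
The three-stage process named in the abstract (parsing, coarse segmentation on access structures, fine-grained segmentation on lexical cohesion) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so everything here is an assumption made for illustration. Access structures are reduced to `#`-prefixed heading lines, lexical cohesion is approximated by bag-of-words cosine similarity between adjacent sentences, the `threshold` of 0.1 is arbitrary, and the names `Segment`, `parse`, `coarse_segment`, `fine_segment` and `segment_document` are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One node of the document segmentation tree (hypothetical structure)."""
    title: str
    sentences: list = field(default_factory=list)
    children: list = field(default_factory=list)


def parse(text):
    """Stage 1 (assumed input format): split raw text into non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def coarse_segment(lines, doc_title="document"):
    """Stage 2: build the tree top-down from headings ('#', '##', ...)."""
    root = Segment(doc_title)
    stack = [(0, root)]  # (heading depth, node); the root is never popped
    for line in lines:
        match = re.match(r"(#+)\s+(.*)", line)
        if match:
            depth = len(match.group(1))
            node = Segment(match.group(2))
            while stack[-1][0] >= depth:
                stack.pop()
            stack[-1][1].children.append(node)
            stack.append((depth, node))
        else:
            stack[-1][1].sentences.append(line)
    return root


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def fine_segment(node, threshold=0.1):
    """Stage 3: split each leaf where adjacent-sentence similarity drops."""
    for child in node.children:
        fine_segment(child, threshold)
    if node.children or len(node.sentences) < 2:
        return  # only leaves with at least two sentences are split further
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in node.sentences]
    groups = [[node.sentences[0]]]
    for prev, cur, sentence in zip(bags, bags[1:], node.sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([])  # low cohesion: start a new sub-segment
        groups[-1].append(sentence)
    if len(groups) > 1:
        node.children = [
            Segment(f"{node.title} / part {i}", g) for i, g in enumerate(groups, 1)
        ]
        node.sentences = []


def segment_document(text, threshold=0.1):
    """Run the three stages and return the root of the segmentation tree."""
    root = coarse_segment(parse(text))
    fine_segment(root, threshold)
    return root
```

In this sketch the headings fix the upper levels of the tree, and the cohesion step only refines leaves, which mirrors the coarse-then-fine order the abstract describes. A real system would likely compare windows of sentences rather than single sentences and would handle richer access structures such as tables of contents.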
