Conference: International Conference on Pattern Recognition

Segmenting Messy Text: Detecting Boundaries in Text Derived from Historical Newspaper Images

Abstract

Text segmentation, the task of dividing a document into sections, is often a prerequisite for performing additional natural language processing tasks. Existing text segmentation methods have typically been developed and tested using clean, narrative-style text with segments containing distinct topics. Here we consider a challenging text segmentation task: dividing newspaper marriage announcement lists into units of one announcement each. In many cases the information is not structured into sentences, and adjacent segments are not topically distinct from each other. In addition, the text of the announcements, which is derived from images of historical newspapers via optical character recognition, contains many typographical errors. As a result, these announcements are not amenable to segmentation with existing techniques. We present a novel deep learning-based model for segmenting such text and show that it significantly outperforms an existing state-of-the-art method on our task.