Segmenting Messy Text: Detecting Boundaries in Text Derived from Historical Newspaper Images

Abstract

Text segmentation, the task of dividing a document into sections, is often a prerequisite for performing additional natural language processing tasks. Existing text segmentation methods have typically been developed and tested using clean, narrative-style text with segments containing distinct topics. Here we consider a challenging text segmentation task: dividing newspaper marriage announcement lists into units of one announcement each. In many cases the information is not structured into sentences, and adjacent segments are not topically distinct from each other. In addition, the text of the announcements, which is derived from images of historical newspapers via optical character recognition, contains many typographical errors. As a result, these announcements are not amenable to segmentation with existing techniques. We present a novel deep learning-based model for segmenting such text and show that it significantly outperforms an existing state-of-the-art method on our task.

Bibliographic details

Source
International Conference on Pattern Recognition | 2021 | pp. 5543-5550 | 8 pages
Authors
Carol Anderson; Phil Crone
Original format: PDF
Keywords
Image segmentation; Production systems; Tagging; Position measurement; Optical imaging; Tokenization; Optical character recognition software;
