Conference: Workshop on building and evaluating resources for biomedical text mining
Supervised classification of end-of-lines in clinical text with no manual annotation

Abstract

In some plain text documents, end-of-line marks may or may not mark the boundary of a text unit (e.g., of a paragraph). This vexing problem is likely to impact subsequent natural language processing components, but is seldom addressed in the literature. We propose a method which uses no manual annotation to classify whether end-of-lines must actually be seen as simple spaces (soft line breaks) or as true text unit boundaries. This method, which includes self-training and co-training steps based on token and line length features, achieves 0.943 F-measure on a corpus of short e-books with controlled format, F=0.904 on a random sample of 24 clinical texts with soft line breaks, and F=0.898 on a larger set of mixed clinical texts which may or may not contain soft line breaks, a fairly high value for a method with no manual annotation.
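
The self-training idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The feature set, the seed rules, the naive Bayes classifier, the 0.8 line-length ratio, and the 0.7 confidence threshold are all assumptions chosen for illustration, and the co-training step is not shown. High-precision heuristics label a few end-of-lines automatically in place of manual annotation, a classifier is trained on those labels, and its confident predictions are added to the training set.

```python
import math
from collections import Counter, defaultdict

def features(lines, i, width):
    """Discrete features for the end-of-line after lines[i] (hypothetical feature set)."""
    line = lines[i].rstrip()
    nxt = lines[i + 1].lstrip()
    return (
        "len=" + ("long" if len(line) >= 0.8 * width else "short"),
        "last=" + ("punct" if line.endswith((".", "!", "?", ":")) else "other"),
        "next=" + ("lower" if nxt[:1].islower() else "other"),
    )

def seed_label(feats):
    """Assumed high-precision rules that stand in for manual annotation."""
    if feats == ("len=short", "last=punct", "next=other"):
        return "hard"
    if feats == ("len=long", "last=other", "next=lower"):
        return "soft"
    return None

def train(examples):
    """Fit naive Bayes counts on (features, label) pairs."""
    prior = Counter(label for _, label in examples)
    counts = defaultdict(Counter)
    for feats, label in examples:
        counts[label].update(feats)
    return prior, counts

def predict(model, feats):
    """Return (best label, posterior probability) with add-one smoothing."""
    prior, counts = model
    total = sum(prior.values())
    scores = {}
    for label in prior:
        logp = math.log(prior[label] / total)
        for f in feats:
            # Every feature is binary, hence +2 in the denominator.
            logp += math.log((counts[label][f] + 1) / (prior[label] + 2))
        scores[label] = logp
    best = max(scores, key=scores.get)
    norm = sum(math.exp(s - scores[best]) for s in scores.values())
    return best, 1.0 / norm

def self_train(lines, rounds=3, threshold=0.7):
    """Label each end-of-line as 'soft' or 'hard' without manual annotation."""
    width = max(len(l.rstrip()) for l in lines)
    feats = [features(lines, i, width) for i in range(len(lines) - 1)]
    labels = [seed_label(f) for f in feats]
    model = None
    for _ in range(rounds):
        labeled = [(f, y) for f, y in zip(feats, labels) if y is not None]
        if len({y for _, y in labeled}) < 2:
            break  # the seeds must cover both classes
        model = train(labeled)
        changed = False
        for i, f in enumerate(feats):
            if labels[i] is None:
                y, p = predict(model, f)
                if p >= threshold:  # keep only confident predictions
                    labels[i], changed = y, True
        if not changed:
            break
    if model is not None:  # label whatever is still undecided
        labels = [y if y is not None else predict(model, f)[0]
                  for f, y in zip(feats, labels)]
    return labels

def unwrap(lines, labels):
    """Join lines, turning soft breaks into spaces and keeping hard breaks."""
    out = lines[0].strip()
    for line, label in zip(lines[1:], labels):
        out += (" " if label == "soft" else "\n") + line.strip()
    return out

sample = [
    "Discharge summary",
    "The patient was admitted with chest pain and was given",
    "aspirin on arrival. He reported no prior history of",
    "cardiac disease.",
    "Plan:",
    "Continue current medication and review in two weeks at",
    "the outpatient clinic.",
]
labels = self_train(sample)
print(unwrap(sample, labels))
```

In this toy sample, the break after "Discharge summary" matches neither seed rule, so its label comes from the classifier trained on the automatically seeded lines. A real corpus would need richer token and line length features than the three binary ones assumed here.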