Conference: Fifth workshop on building and evaluating resources for biomedical text mining

Supervised classification of end-of-lines in clinical text with no manual annotation

Abstract

In some plain text documents, end-of-line marks may or may not mark the boundary of a text unit (e.g., of a paragraph). This vexing problem is likely to impact subsequent natural language processing components, but is seldom addressed in the literature. We propose a method which uses no manual annotation to classify whether end-of-lines must actually be seen as simple spaces (soft line breaks) or as true text unit boundaries. This method, which includes self-training and co-training steps based on token and line length features, achieves 0.943 F-measure on a corpus of short e-books with controlled format, F=0.904 on a random sample of 24 clinical texts with soft line breaks, and F=0.898 on a larger set of mixed clinical texts which may or may not contain soft line breaks, a fairly high value for a method with no manual annotation.
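
The abstract names token and line length features but does not spell them out, so the following Python sketch is only an illustration of how such features might yield seed labels with no manual annotation. The rule it uses is an assumption, not the authors' method: if the first token of the next line would still have fit within the widest observed line, a wrapping editor would not have broken there, so the break is probably a true boundary. All names (`EolFeatures`, `extract_features`, `seed_labels`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EolFeatures:
    line_length: int         # characters on the line before the break
    next_token_length: int   # length of the first token on the next line
    ends_with_punct: bool    # line ends with ., !, ? or :
    next_starts_upper: bool  # next line starts with an uppercase letter


def extract_features(lines: List[str]) -> List[EolFeatures]:
    """Build one feature record per end-of-line that is followed by a line."""
    feats = []
    for cur, nxt in zip(lines, lines[1:]):
        cur = cur.rstrip()
        tokens = nxt.split()
        first = tokens[0] if tokens else ""
        feats.append(EolFeatures(
            line_length=len(cur),
            next_token_length=len(first),
            ends_with_punct=cur.endswith((".", "!", "?", ":")),
            next_starts_upper=first[:1].isupper(),
        ))
    return feats


def seed_labels(feats: List[EolFeatures],
                width: Optional[int] = None) -> List[Optional[str]]:
    """Assumed seed rule: label a break 'hard' when the next token would
    have fit on the current line; otherwise leave it unlabeled (None)."""
    if width is None:
        # Assumption: the widest observed line approximates the wrap width.
        width = max((f.line_length for f in feats), default=0)
    labels: List[Optional[str]] = []
    for f in feats:
        if f.next_token_length == 0:
            labels.append("hard")  # next line is blank
        elif f.line_length + 1 + f.next_token_length <= width:
            labels.append("hard")  # the token would have fit, so not a wrap
        else:
            labels.append(None)    # ambiguous: left for later training steps
    return labels
```

Breaks left as `None` are the ones a self-training or co-training loop would then have to resolve, using the confidently labeled breaks as its initial training data.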