Journal: Computational Linguistics

Unsupervised multilingual sentence boundary detection

Abstract

In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using three criteria that only require information about the candidate type itself and are independent of context: Abbreviations can be defined as a very tight collocation consisting of a truncated word and a final period, abbreviations are usually short, and abbreviations sometimes contain internal periods. We also show the potential of collocational evidence for two other important subtasks of sentence boundary disambiguation, namely, the detection of initials and ordinal numbers. The proposed system has been tested extensively on eleven different languages and on different text genres. It achieves good results without any further amendments or language-specific resources. We evaluate its performance against three different baselines and compare it to other systems for sentence boundary detection proposed in the literature.
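
The three context-independent criteria named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `abbreviation_scores`, the ratio used for collocation tightness, the exponential length penalty, and the way the three factors are multiplied together are all assumptions chosen for brevity, and the article's actual statistics may differ.

```python
import math
import re
from collections import Counter


def abbreviation_scores(tokens):
    """Score period-final candidate types; a higher score is more abbreviation-like.

    Hypothetical heuristic for illustration only, not the published method.
    """
    with_period = Counter()     # type counts when followed by a final period
    without_period = Counter()  # type counts when seen without a final period
    for tok in tokens:
        if tok.endswith(".") and len(tok) > 1:
            with_period[tok[:-1].lower()] += 1
        else:
            without_period[tok.lower()] += 1

    scores = {}
    for cand, c_with in with_period.items():
        if not re.search(r"[^\W\d_]", cand):
            continue  # skip candidates that contain no letter at all
        c_without = without_period[cand]
        # Criterion 1: the word and its final period form a tight collocation
        # (assumed measure: share of occurrences that carry the period).
        tightness = c_with / (c_with + c_without)
        # Criterion 2: abbreviations are usually short
        # (assumed penalty: exponential in the number of non-period characters).
        shortness = math.exp(-len(cand.replace(".", "")))
        # Criterion 3: internal periods add evidence (assumed: count + 1).
        internal = cand.count(".") + 1
        scores[cand] = tightness * shortness * internal
    return scores


# Toy, whitespace-tokenized sample; the text is invented for this example.
sample = ("Dr. Smith met Mr. Jones at 5 p.m. in the park. "
          "Dr. Smith left the park. Mr. Jones stayed in the park .").split()
for cand, score in sorted(abbreviation_scores(sample).items(),
                          key=lambda kv: kv[1], reverse=True):
    print(f"{cand}\t{score:.4f}")
```

Because every feature is computed from the candidate type alone, the sketch needs no capitalization cues or surrounding context, which matches the spirit of the criteria above. A real system would still need a decision threshold and the separate handling of initials and ordinal numbers that the abstract mentions.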
