Journal: Computational Linguistics

Extracting the Lowest-Frequency Words: Pitfalls and Possibilities

Abstract

In a medical information extraction system, we use common word association techniques to extract side-effect-related terms. Many of these terms have a frequency of less than five. Standard word-association-based applications disregard the lowest-frequency words, and hence disregard useful information. We therefore devised an extraction system for the full word frequency range. This system computes the significance of association by the log-likelihood ratio and Fisher's exact test. The output of the system shows a recurrent, corpus-independent pattern in both recall and the number of significant words. We will explain these patterns by the statistical behavior of the lowest-frequency words. We used Dutch verb-particle combinations as a second and independent collocation extraction application to illustrate the generality of the observed phenomena. We will conclude that a) word-association-based extraction systems can be enhanced by also considering the lowest-frequency words, b) significance levels should not be fixed but adjusted for the optimal window size, c) hapax legomena, words occurring only once, should be disregarded a priori in the statistical analysis, and d) the distribution of the targets to extract should be considered in combination with the extraction method.
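
The two association measures named in the abstract, the log-likelihood ratio and Fisher's exact test, can both be computed from a 2x2 contingency table of co-occurrence counts. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the table layout (k11 = windows containing both words, k12 and k21 = windows containing only one of them, k22 = windows containing neither), and the example counts are all illustrative and do not come from the paper.

```python
from math import comb, log


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of co-occurrence counts (assumed layout)."""
    n = k11 + k12 + k21 + k22
    row_totals = (k11 + k12, k21 + k22)
    col_totals = (k11 + k21, k12 + k22)
    table = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                total += observed * log(observed / expected)
    return 2.0 * total


def fisher_exact_right_tail(k11, k12, k21, k22):
    """One-sided Fisher p-value: P(X >= k11) with all margins held fixed."""
    n = k11 + k12 + k21 + k22
    row1 = k11 + k12
    col1 = k11 + k21
    upper = min(row1, col1)
    # Hypergeometric tail: sum over tables at least as extreme as the observed one.
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(k11, upper + 1)
    )
    return numerator / comb(n, col1)


# Hypothetical counts: two hapax legomena that co-occur once in 10,000 windows.
hapax_table = (1, 0, 0, 9999)
g2_hapax = log_likelihood_ratio(*hapax_table)
p_hapax = fisher_exact_right_tail(*hapax_table)
print(f"G^2 = {g2_hapax:.2f}, Fisher p = {p_hapax:.6f}")
```

With these made-up counts, the Fisher p-value is exactly 1/10,000, so a single chance co-occurrence of two hapax legomena already looks highly significant. This is one way to read the abstract's recommendation that hapax legomena be set aside a priori.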
