Journal of the American Medical Informatics Association

Corpus-based statistical screening for phrase identification.

Abstract

PURPOSE: The authors study the extraction of useful phrases from a natural language database by statistical methods. The aim is to leverage human effort by providing preprocessed phrase lists that contain a high percentage of useful material.

METHOD: The approach is to develop six different scoring methods, each based on a different aspect of phrase occurrence. The emphasis is not on lexical information or syntactic structure but on the statistical properties of word pairs and triples that can be obtained from a large database.

MEASUREMENTS: The Unified Medical Language System (UMLS) incorporates, as part of its structure, a large list of medical phrases that humans have judged acceptable. The authors use this list as a gold standard for validating their methods. A good method is one that ranks the UMLS phrases high among all phrases studied. Measurements are 11-point average precision values and precision-recall curves based on the rankings.

RESULTS: The authors find that each of the six scoring methods is effective in identifying UMLS-quality phrases in a large subset of MEDLINE. The methods apply to both word pairs and word triples. The six methods are optimally combined to produce composite scoring methods that are more effective than any single method. The quality of the composite methods appears sufficient to support the automatic placement of hyperlinks in text at the site of highly ranked phrases.

CONCLUSION: Statistical scoring methods provide a promising approach to extracting useful phrases from a natural language database for the purpose of indexing or providing hyperlinks in text.
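The 11-point average precision named under MEASUREMENTS can be illustrated with a small sketch. This is not the authors' code: the function name, the toy phrases, and the choice to compute recall against only those gold-standard phrases that appear in the ranked list are all assumptions made for illustration. The ranked list is assumed to contain no duplicates.

```python
from typing import Sequence, Set


def eleven_point_average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Mean interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    ranked: candidate phrases ordered best-first by some scoring method.
    gold:   phrases accepted by the gold standard (a UMLS-style list).
    Assumption: recall is measured against the gold phrases that occur
    in `ranked`, and `ranked` has no duplicate entries.
    """
    relevant_total = len(gold.intersection(ranked))
    if relevant_total == 0:
        return 0.0

    # Record (recall, precision) at every rank where a gold phrase occurs.
    points = []
    hits = 0
    for rank, phrase in enumerate(ranked, start=1):
        if phrase in gold:
            hits += 1
            points.append((hits / relevant_total, hits / rank))

    # Interpolated precision at a recall level is the highest precision
    # observed at any recall greater than or equal to that level.
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)


# Toy data, invented for illustration only.
toy_ranking = ["heart attack", "of the", "blood pressure", "in a"]
toy_gold = {"heart attack", "blood pressure"}
score = eleven_point_average_precision(toy_ranking, toy_gold)
print(round(score, 3))  # 0.848
```

In the toy ranking, precision is 1.0 up to recall 0.5 and 2/3 beyond it, so the average over the 11 levels is 28/33. A scoring method that places every gold-standard phrase ahead of every other candidate would score 1.0.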
