Journal: International Journal on Digital Libraries

On the applicability of word sense discrimination on 201 years of modern English


Abstract

As language evolves over time, documents stored in long-term archives become inaccessible to users. Automatically detecting and handling language evolution will become a necessity to meet users' information needs. This work examines algorithms applied to modern English to find word senses that will later serve as a basis for detecting language evolution. We apply the curvature clustering algorithm to all nouns and noun phrases extracted from The Times Archive (1785-1985). We use natural language processors for part-of-speech tagging and lemmatization and report on the performance of these processors over the entire period. We evaluate our clusters using WordNet to verify whether they correspond to valid word senses. Because The Times Archive contains OCR errors, we investigate the effects of such errors on word sense discrimination results. Finally, we present a novel approach to correcting OCR errors present in the archive and show that the coverage of the curvature clustering algorithm improves: the number of clusters increases by 24%. To verify our results, we use the New York Times corpus (1987-2007), a recent collection that is considered error free, as a ground truth for our experiments. We find that after correcting OCR errors in The Times Archive, the performance of word sense discrimination applied to The Times Archive is comparable to the ground truth.
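The abstract names curvature clustering but does not spell it out. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the common formulation in which a word's curvature is its local clustering coefficient in a co-occurrence graph, words with low curvature (typically ambiguous ones) are removed, and the remaining connected components are read as sense clusters. The 0.5 threshold, the toy graph, and all function names are illustrative assumptions.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (word, word) co-occurrence pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def curvature(graph, node):
    """Local clustering coefficient: fraction of neighbour pairs that are linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))


def curvature_clusters(graph, threshold=0.5):
    """Drop nodes below the (assumed) threshold; return connected components."""
    kept = {n for n in graph if curvature(graph, n) >= threshold}
    clusters, seen = [], set()
    for start in sorted(kept):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            # Only walk edges between nodes that survived the curvature filter.
            stack.extend(m for m in graph[n] if m in kept and m not in component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical toy data: "rock" co-occurs with music terms and with stone terms.
edges = [
    ("rock", "jazz"), ("rock", "pop"), ("jazz", "pop"),
    ("rock", "stone"), ("rock", "pebble"), ("stone", "pebble"),
]
graph = build_graph(edges)
clusters = curvature_clusters(graph)
```

In this toy graph the ambiguous word "rock" bridges two otherwise unconnected groups, so its curvature is low and removing it leaves one cluster per sense. Real pipelines usually re-attach removed words to neighbouring clusters afterwards; that step is omitted here.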
