Conference: International Conference on Intelligent Text Processing and Computational Linguistics

Empirical Formula for Testing Word Similarity and Its Application for Constructing a Word Frequency List

Abstract

Many document categorization and clustering tasks require automatically learning a word frequency list from a corpus. However, morphological variation distorts the statistics when the program treats words as mere letter strings. It is therefore important to identify the strings that result from morphological variation of the same base meaning. Since large morphological dictionaries have well-known technical disadvantages, we propose a heuristic approximate method for such identification, based on an empirical formula for testing the similarity of two words. We give a simple method for determining the formula parameters. The formula is based on the number of coincident letters in the initial parts of the two words and the number of non-coincident letters in their final parts. An iterative algorithm constructs the word frequency list using the common parts of all similar words. We give English and Spanish examples. The described technology is implemented in our system, Dictionary Designer.
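
The abstract names the two quantities the formula uses but does not give the formula or its parameters. The following is therefore only a minimal sketch of the general idea, not the authors' method: it assumes a simple ratio test (non-coincident final letters divided by total letters) with hypothetical parameters `max_ratio` and `min_prefix`, and a single alphabetical pass in place of the paper's iterative algorithm. All function names are illustrative.

```python
from collections import Counter


def common_prefix_len(a: str, b: str) -> int:
    """Number of coincident letters at the start of both words."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def similar(a: str, b: str, max_ratio: float = 0.4, min_prefix: int = 3) -> bool:
    """Assumed similarity test (hypothetical thresholds, not from the paper).

    y = coincident initial letters, n = non-coincident final letters.
    Words count as similar if y is long enough and n is a small
    share of the total number of letters in both words.
    """
    y = common_prefix_len(a, b)
    n = (len(a) - y) + (len(b) - y)
    return y >= min_prefix and n / (len(a) + len(b)) <= max_ratio


def build_frequency_list(tokens, max_ratio: float = 0.4, min_prefix: int = 3):
    """Merge similar words under their common initial part and sum counts.

    Words are visited in alphabetical order, so variants that share a
    prefix are adjacent; each word is compared with the current stem.
    """
    counts = Counter(t.lower() for t in tokens)
    groups = []  # list of [stem, frequency]
    for word in sorted(counts):
        if groups and similar(groups[-1][0], word, max_ratio, min_prefix):
            stem, freq = groups[-1]
            shared = stem[:common_prefix_len(stem, word)]
            groups[-1] = [shared, freq + counts[word]]
        else:
            groups.append([word, counts[word]])
    return [(stem, freq) for stem, freq in groups]


tokens = ["walk", "walked", "walking", "walks", "table", "tables", "cat"]
frequency_list = build_frequency_list(tokens)
print(frequency_list)
```

In this sketch the thresholds are fixed by hand; the paper instead determines the formula parameters empirically, so any real use would need values fitted to the target language.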