Empirical Formula for Testing Word Similarity and Its Application for Constructing a Word Frequency List

Abstract

Many document categorization and clustering tasks require automatically learning a word frequency list from a corpus. However, morphological variants of words distort the statistics when the program treats words as mere letter strings. It is therefore important to identify the strings that result from morphological variation of the same base meaning. Since large morphological dictionaries have well-known technical disadvantages, we propose a heuristic approximate method for such identification, based on an empirical formula for testing the similarity of two words. We give a simple method for determining the formula parameters. The formula is based on the number of coincident letters in the initial parts of the two words and the number of non-coincident letters in their final parts. An iterative algorithm constructs the word frequency list using the common parts of all similar words. We give English and Spanish examples. The described technology is implemented in our system Dictionary Designer.
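
The abstract does not state the formula itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes one plausible reading: two words count as similar when the share of non-coincident final letters among all letters of both words stays below a threshold that is linear in the length of the common initial part. The function names, the linear threshold, and the parameter values `a` and `b` are all hypothetical and chosen for illustration; the greedy single-pass grouping stands in for the iterative algorithm the abstract mentions.

```python
from collections import Counter


def common_prefix_len(w1, w2):
    """Number of coincident letters in the initial parts of two words."""
    n = 0
    for c1, c2 in zip(w1, w2):
        if c1 != c2:
            break
        n += 1
    return n


def are_similar(w1, w2, a=0.4, b=0.0):
    """Hypothetical similarity test (a and b are illustrative, not from the paper).

    y: length of the common initial part
    n: non-coincident letters in the final parts of both words
    s: total number of letters in both words
    The words are treated as similar when n / s <= a + b * y.
    """
    y = common_prefix_len(w1, w2)
    if y == 0:
        return False
    n = (len(w1) - y) + (len(w2) - y)
    s = len(w1) + len(w2)
    return n / s <= a + b * y


def build_frequency_list(tokens, a=0.4, b=0.0):
    """Greedy sketch: merge similar words into their common initial part."""
    counts = Counter(tokens)
    stems = {}  # common part -> accumulated frequency
    for word in sorted(counts):
        for stem in list(stems):
            if are_similar(stem, word, a, b):
                merged = stem[:common_prefix_len(stem, word)]
                freq = stems.pop(stem) + counts[word]
                stems[merged] = stems.get(merged, 0) + freq
                break
        else:
            # No existing entry is similar: start a new one.
            stems[word] = stems.get(word, 0) + counts[word]
    return stems


if __name__ == "__main__":
    tokens = ["connect", "connected", "connecting", "connection",
              "table", "tables", "zebra"]
    print(build_frequency_list(tokens))
```

With a constant threshold such as the one assumed here, very short words are merged too eagerly, which is one reason a real parameterization would need to be fitted on examples for each language, as the abstract indicates.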


Bibliographic record

Source: International Conference on Intelligent Text Processing and Computational Linguistics, 2002, 8 pages
Authors: Pavel Makagonov; Mikhail Alexandrov
Original format: PDF
CLC classification: TP3-53

