Journal of Information Science

A new method to compose long unknown Chinese keywords

Abstract

A huge number of electronic documents are now stored on the internet. To retrieve information from this data, each document is commonly represented as a set of keywords, and all documents are then analysed on the basis of these discriminative words. Recognizing the words in an article is therefore an essential step in information retrieval; unlike English words, however, Chinese words are not separated by spaces, and many approaches have been devised to parse them. Most current systems use a dictionary-based approach for text segmentation, but general-purpose dictionaries cannot always provide the references needed to parse domain-specific words accurately, especially unknown words. This paper proposes a new method for classifying longer keywords from Chinese documents that incorporates previously unknown keywords into a keyword list without the effort of building domain-specific dictionaries. The method first takes the words parsed by existing parsers and filters keywords using term frequency-inverse document frequency (TF-IDF) values. Then, based on the parsed words and keywords, a T tree is used to store the candidates for composing unknown words. Each candidate is evaluated against an unknown word (UW) coefficient threshold: a newly composed word is deemed a newly discovered unknown word if its UW coefficient is higher than the pre-defined threshold. Finally, the parsed words and newly composed words are re-filtered to form long keywords. Experiments comparing the results with those of Google and Yahoo show that the proposed method significantly outperforms the other methods on recall, precision and F-measure alike.
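The first two stages described in the abstract (TF-IDF keyword filtering, then composing adjacent parsed words into longer candidates) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the thresholds, and the toy documents are hypothetical; the TF-IDF variant is the plain textbook one; a `Counter` stands in for the T tree; and the candidate score (how often the first word is followed by the second) is only a placeholder, because the abstract does not define the UW coefficient.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: tf-idf} dict per document.

    `docs` is a list of token lists, assumed to come from an existing
    Chinese word parser (the segmentation step itself is not shown).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        })
    return scores


def filter_keywords(docs, threshold):
    """Keep, per document, the tokens whose TF-IDF exceeds `threshold`."""
    return [
        {tok for tok, score in doc_scores.items() if score > threshold}
        for doc_scores in tf_idf(docs)
    ]


def compose_candidates(doc, keywords, score_threshold):
    """Join adjacent parsed words into candidate longer words.

    Placeholder score (NOT the paper's UW coefficient):
    count(first, second) / count(first).
    Only pairs that touch at least one keyword are considered.
    """
    unigrams = Counter(doc)
    bigrams = Counter(zip(doc, doc[1:]))
    composed = set()
    for (first, second), pair_count in bigrams.items():
        if first not in keywords and second not in keywords:
            continue
        if pair_count / unigrams[first] >= score_threshold:
            composed.add(first + second)
    return composed


# Hypothetical pre-segmented documents.
docs = [
    ["数据", "挖掘", "技术", "数据", "挖掘"],
    ["数据", "库", "系统"],
    ["网络", "系统", "技术"],
]
keywords = filter_keywords(docs, threshold=0.2)
long_words = compose_candidates(docs[0], keywords[0], score_threshold=0.6)
```

In this toy run the parser splits 数据挖掘 ("data mining") into 数据 and 挖掘; because 挖掘 passes the TF-IDF filter and always follows 数据 in the first document, the pair is recomposed into a single longer candidate. The re-filtering stage mentioned in the abstract is not shown.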
