A new method to compose long unknown Chinese keywords

Yu-Chin Liu; Chun-Wei Lin

Abstract

A huge number of electronic documents are now stored on the internet. To retrieve information from this data, each document is commonly represented as a set of keywords, and all documents are then analysed based on this set of discriminative words. In information retrieval, the recognition of words in articles is an essential step; however, unlike English words, Chinese words are not separated by spaces. Many approaches have therefore been devised to parse Chinese words. Most current systems use a dictionary-based approach for text segmentation. However, general-purpose dictionaries cannot always provide the references needed to parse domain-specific words accurately, especially unknown words. This paper proposes a new method for classifying longer keywords from Chinese documents by incorporating previously unknown keywords into a keyword list, without the effort of building domain-specific dictionaries. The method first takes the parsed words from existing parsers and filters the keywords using term frequency-inverse document frequency (TF-IDF) values. Then, based on the parsed words and keywords, a T tree is used to store the candidates for composing unknown words. The candidates are evaluated against an unknown word (UW) coefficient threshold: a newly composed word is deemed a newly discovered unknown word if its UW coefficient is higher than a pre-defined threshold. Finally, the parsed words and newly composed words are re-filtered to form long keywords. Experiments comparing the results with Google and Yahoo show that the proposed method significantly outperforms the other methods in recall, precision and F-measure.
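The pipeline described in the abstract (filter parsed words by TF-IDF, compose candidates from adjacent words, keep candidates whose coefficient passes a threshold) can be illustrated with a minimal Python sketch. Several parts are assumptions made only for illustration: the abstract does not define the UW coefficient, so `stand_in_coefficient` below is a hypothetical co-occurrence ratio and not the paper's formula; a plain `Counter` replaces the T tree; the `min_support` parameter, the function names and the sample documents are invented; and the final re-filtering step is omitted.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of documents, each a list of already-parsed words.
    Returns one {word: tf-idf} dict per document."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (c / len(doc)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores


def count_candidates(docs, keywords):
    """Count adjacent word pairs in which at least one word is a keyword.
    (The paper stores candidates in a T tree; a Counter is used here.)"""
    pair_counts = Counter()
    for doc in docs:
        for a, b in zip(doc, doc[1:]):
            if a in keywords or b in keywords:
                pair_counts[(a, b)] += 1
    return pair_counts


def stand_in_coefficient(pair, pair_counts, word_counts):
    """Hypothetical score, NOT the paper's UW coefficient: how often the
    pair occurs together, relative to its rarer member."""
    a, b = pair
    return pair_counts[pair] / min(word_counts[a], word_counts[b])


def discover_long_keywords(docs, tfidf_min, coeff_min, min_support=2):
    """Return composed words whose stand-in coefficient passes the threshold."""
    keywords = {w for s in tf_idf(docs) for w, v in s.items() if v >= tfidf_min}
    word_counts = Counter(w for doc in docs for w in doc)
    pair_counts = count_candidates(docs, keywords)
    return sorted(
        a + b
        for (a, b), c in pair_counts.items()
        if c >= min_support
        and stand_in_coefficient((a, b), pair_counts, word_counts) >= coeff_min
    )


# Invented sample: a parser has split "雲端運算" (cloud computing) in two.
docs = [
    ["雲端", "運算", "服務", "成長"],
    ["雲端", "運算", "平台", "服務"],
    ["資料", "分析", "服務", "平台"],
]
print(discover_long_keywords(docs, tfidf_min=0.1, coeff_min=0.8))
# prints ['雲端運算']
```

In this toy input, "服務" occurs in every document, so its TF-IDF is zero and it is not kept as a keyword, while the pair "雲端" + "運算" always occurs together and is recomposed into one longer keyword. The minimum-support check is there only to stop pairs seen once from scoring a perfect ratio.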


Bibliographic details

Source: Journal of Information Science, 2012, No. 4, pp. 366-382 (17 pages)
Authors: Yu-Chin Liu; Chun-Wei Lin
Affiliations:

Department of Information Management, Shih Hsin University, No. 1, Lane 17, Sec. 1, Mu-Cha Road, Wenshan, Taipei 116, Taiwan, R.O.C.

Wistron Corporation, Taiwan, R.O.C.

Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords: Chinese word segmentation; unknown Chinese word; keyword retrieval

