International Conference on Natural Language Processing and Knowledge Engineering; October 26-29, 2003; Beijing, China

AUTOMATIC EXTRACTION OF THE UNLISTED TERMS IN THE FIELD OF INFORMATION TECHNOLOGY BASED ON THE DYNAMIC CIRCULATION CORPUS



Abstract

This paper discusses the automatic extraction of unlisted terms in the field of Information Technology (IT) from the large-scale Dynamic Circulation Corpus (DCC), under the theory of Dynamic Updating of Language and Knowledge. It proposes the concept of a Concatenation Index to decide whether a character string is a word or phrase. It also presents a new approach, named "Concatenation Index + TFIDF", for extracting unlisted terms from a large-scale corpus of a given field. The experiment uses IT texts of around 17 million Chinese characters as the object corpus, and general-usage texts of around 600 million Chinese characters as the contrast corpus. As a result, a tentative workflow has been established, and the approach proved efficient.
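The two-stage idea in the abstract (first decide whether a string is a word or phrase, then weigh it against a contrast corpus) can be illustrated with a minimal sketch. The abstract does not define the Concatenation Index, so this sketch swaps in pointwise mutual information (PMI) as a stand-in cohesion measure, and it uses a simple TF-IDF-style weight that penalizes strings frequent in the contrast corpus. All function names, thresholds, and sample texts are hypothetical, and only two-character candidates are considered for brevity.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count every character n-gram of length n in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cohesion(candidate, unigrams, bigrams, total_chars):
    """PMI of a two-character string.

    ASSUMPTION: this is a stand-in for the paper's Concatenation Index,
    whose actual definition is not given in the abstract.
    """
    p_xy = bigrams[candidate] / max(total_chars - 1, 1)
    p_x = unigrams[candidate[0]] / total_chars
    p_y = unigrams[candidate[1]] / total_chars
    return math.log(p_xy / (p_x * p_y))


def domain_weight(candidate, domain_bigrams, contrast_bigrams,
                  domain_size, contrast_size):
    """TF-IDF-style weight (an assumed form, not the paper's formula):
    relative frequency in the object corpus times the log of the inverse
    smoothed relative frequency in the contrast corpus."""
    tf = domain_bigrams[candidate] / domain_size
    contrast_rate = (contrast_bigrams[candidate] + 1) / (contrast_size + 1)
    return tf * math.log(1 / contrast_rate)


def extract_candidates(domain_text, contrast_text,
                       min_count=2, min_cohesion=0.0):
    """Keep cohesive bigrams, then rank them by the domain weight."""
    unigrams = ngram_counts(domain_text, 1)
    domain_bigrams = ngram_counts(domain_text, 2)
    contrast_bigrams = ngram_counts(contrast_text, 2)
    domain_size = max(len(domain_text) - 1, 1)
    contrast_size = max(len(contrast_text) - 1, 1)

    scored = []
    for cand, count in domain_bigrams.items():
        if count < min_count:
            continue  # too rare to judge
        if cohesion(cand, unigrams, domain_bigrams, len(domain_text)) <= min_cohesion:
            continue  # characters do not stick together strongly enough
        weight = domain_weight(cand, domain_bigrams, contrast_bigrams,
                               domain_size, contrast_size)
        scored.append((cand, weight))
    # Highest weight first: frequent in the object corpus, rare in the contrast corpus.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy texts standing in for the object (IT) and contrast (general) corpora.
    domain = "云计算平台支持云计算服务云计算很重要"
    contrast = "我们计算一下今天的天气"
    for term, weight in extract_candidates(domain, contrast):
        print(term, round(weight, 3))
```

Filtering on cohesion before ranking mirrors the order described in the abstract: the word-or-phrase decision comes first, and the contrast-corpus weight only ranks strings that pass it. A real system would need longer candidates and far larger corpora than this toy input.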
