Conference: International Conference on Large-Scale Knowledge Resources

Design and Prototype of a Large-Scale and Fully Sense-Tagged Corpus


Abstract

Sense-tagged corpora play a crucial role in Natural Language Processing, especially in research on word sense disambiguation and natural language understanding. A large-scale Chinese sense-tagged corpus is essential, yet the lack of such a corpus is a critical deficiency at the current stage. This paper presents the design of a large-scale, full-text Chinese sense-tagged corpus containing over 110,000 words. The Academia Sinica Balanced Corpus of Modern Chinese (also called the Sinica Corpus) serves as the tagging source, and 56 full texts are extracted from it. The preparation for automatic sense tagging draws on N-gram statistics and collocation information, combining machine learning techniques with a probability model. To achieve high precision, the output of automatic sense tagging is then revised manually.
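
The abstract does not spell out the tagging model, so the following is only a minimal sketch of one way collocation counts and a simple probability model could be combined to propose a sense automatically. It is a smoothed naive Bayes scorer over neighbouring words. The function names `train` and `tag`, the sense labels, and the toy sentences are all hypothetical illustrations; none of them come from the paper or from the Sinica Corpus.

```python
import math
from collections import Counter, defaultdict


def train(tagged_sentences, window=1):
    """Count sense frequencies and neighbouring-word (collocation) counts.

    Each sentence is a list of (token, sense) pairs; sense is None for
    tokens that carry no sense tag in this toy setting.
    """
    sense_counts = defaultdict(Counter)   # word -> sense -> count
    colloc_counts = defaultdict(Counter)  # (word, sense) -> neighbour -> count
    vocab = set()
    for sentence in tagged_sentences:
        tokens = [w for w, _ in sentence]
        vocab.update(tokens)
        for i, (word, sense) in enumerate(sentence):
            if sense is None:
                continue
            sense_counts[word][sense] += 1
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    colloc_counts[(word, sense)][tokens[j]] += 1
    return sense_counts, colloc_counts, vocab


def tag(tokens, i, model, window=1):
    """Return the highest-scoring sense for tokens[i], or None if unseen."""
    sense_counts, colloc_counts, vocab = model
    word = tokens[i]
    if word not in sense_counts:
        return None
    total = sum(sense_counts[word].values())
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    context = [tokens[j] for j in range(lo, hi) if j != i]
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts[word].items():
        # Log prior of the sense plus add-one smoothed collocation likelihoods.
        score = math.log(count / total)
        neighbours = colloc_counts[(word, sense)]
        denom = sum(neighbours.values()) + len(vocab)
        for c in context:
            score += math.log((neighbours[c] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Invented toy data: the verb "打" tagged with two hypothetical senses.
TOY_CORPUS = [
    [("我", None), ("打", "play"), ("篮球", None)],  # "I play basketball"
    [("他", None), ("打", "play"), ("排球", None)],  # "he plays volleyball"
    [("他", None), ("打", "hit"), ("人", None)],     # "he hits people"
    [("别", None), ("打", "hit"), ("人", None)],     # "don't hit people"
]

model = train(TOY_CORPUS)
proposed = tag(["她", "打", "篮球"], 1, model)
```

In a workflow like the one the abstract describes, a proposal such as `proposed` would be a candidate tag only, to be checked in the manual revision pass rather than accepted as final.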
