Procedia Computer Science

Domain Adaptation for Part-of-Speech Tagging of Indonesian Text Using Affix Information

Abstract

Part-of-speech (POS) tagging is the process of assigning a word class to each word in a text. A POS tagger for a given language is usually built from a generic-domain corpus, for example newspaper text. When such a tagger is applied to words from a new or more specific domain, it may assign word classes inaccurately. Domain adaptation can be approached in several ways: using clustering to change the word representation, using a model with a large lexicon, or training the model on labelled text from the target domain. In this research we apply a domain adaptation method that uses an additional lexicon built from affix rules. The target domain is beauty products. The system consists of a generic-domain POS tagger and an unlabelled lexicon from the target domain. Word classes in the target-domain lexicon are assigned based on affix information, and the remaining words are labelled manually. Observation of the dataset showed that English words are used frequently, so the lexicon was developed in both Indonesian and English. The processed lexicon is added to the original POS tagger's lexicon to supply domain-specific information to the generic-domain tagger. This study focuses on the noun, proper noun, adjective, and adverb tags because the tagger's output is used for aspect and opinion extraction. The tagger with the added lexicon achieves 68.99% accuracy, and 92.36% of words are successfully recognized by the tagger.
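
The lexicon-building step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the four affix patterns, the tag names (NOUN, ADJ, ADV), the function names, and the merge policy (generic entries are kept when a word appears in both lexicons) are all assumptions made for the example. The abstract does not list the actual rules or say how conflicts are resolved.

```python
import re

# Illustrative affix rules (assumptions, not the paper's actual rule set).
# Each rule is (compiled pattern, tag); the first matching rule wins.
AFFIX_RULES = [
    (re.compile(r"^se\w+nya$"), "ADV"),   # se-...-nya, e.g. "sebaiknya"
    (re.compile(r"^ke\w+an$"), "NOUN"),   # ke-...-an, e.g. "kecantikan"
    (re.compile(r"^pe\w+an$"), "NOUN"),   # pe-...-an, e.g. "perawatan"
    (re.compile(r"^ter\w+$"), "ADJ"),     # ter-, e.g. "terbaik"
]


def tag_by_affix(word):
    """Return a tag if an affix rule matches the word, else None."""
    lowered = word.lower()
    for pattern, tag in AFFIX_RULES:
        if pattern.match(lowered):
            return tag
    return None


def build_domain_lexicon(words, manual_labels):
    """Label words by affix rule first, then fall back to manual labels.

    Returns (lexicon, unresolved), where unresolved lists the words that
    neither a rule nor the manual labels could cover.
    """
    lexicon = {}
    unresolved = []
    for word in words:
        key = word.lower()
        tag = tag_by_affix(key) or manual_labels.get(key)
        if tag is None:
            unresolved.append(word)
        else:
            lexicon[key] = tag
    return lexicon, unresolved


def merge_lexicons(generic, domain):
    """Add domain entries to a copy of the generic lexicon.

    Assumption: entries already in the generic lexicon are left unchanged.
    """
    merged = dict(generic)
    for word, tag in domain.items():
        merged.setdefault(word, tag)
    return merged


# Hypothetical beauty-domain words, mixing Indonesian and English.
words = ["kecantikan", "terbaik", "sebaiknya", "serum", "glowing", "lipstik"]
manual = {"serum": "NOUN", "glowing": "ADJ"}
lexicon, unresolved = build_domain_lexicon(words, manual)
print(lexicon)      # the first three words are tagged by rule, the next two manually
print(unresolved)   # ['lipstik']
```

Simple surface patterns like these over-generate (for instance, any word that merely begins with "ter" would be tagged as an adjective), which is one reason a manual labelling pass for the remaining or doubtful words is still needed.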