Conference: Australasian Joint Conference on Artificial Intelligence

Unsupervised Text Normalization Approach for Morphological Analysis of Blog Documents

Abstract

In this paper, we propose an algorithm that reduces the number of unknown words in blog documents by replacing peculiar expressions with formal ones. Japanese blog documents contain many peculiar expressions that morphological analyzers treat as unknown sequences, and reducing these unknown sequences improves the accuracy of morphological analysis for blog documents. The conventional solution is to register peculiar expressions in the morphological dictionaries by hand, which is costly and requires specialized knowledge. In our algorithm, substitution candidates for peculiar expressions are automatically retrieved from formally written documents such as newspapers and stored as substitution rules. To make the correct replacement, a substitution rule is selected based on three criteria: its frequency of appearance during the retrieval process, the edit distance between the substituted sequence and the original text, and the estimated improvement in word segmentation accuracy after the substitution. Experimental results show that our algorithm reduces the number of unknown words by 30.3%, twice the reduction rate of conventional methods, while maintaining the same segmentation accuracy.
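The rule selection described in the abstract can be illustrated with a small Python sketch. The abstract names the three criteria but does not say how they are combined, so the weighted sum below, the `Rule` fields, the weights, and the toy strings are all assumptions made for illustration and are not the authors' method. The segmentation gain is taken as a precomputed number, since the abstract does not describe how it is estimated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical substitution rule: peculiar expression -> formal expression."""
    peculiar: str
    formal: str
    frequency: int            # times the candidate appeared during retrieval
    segmentation_gain: float  # estimated segmentation improvement (assumed given)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def score(rule: Rule, text: str,
          w_freq: float = 1.0, w_dist: float = 1.0, w_gain: float = 1.0) -> float:
    """Assumed scoring: reward frequency and gain, penalize large edits."""
    substituted = text.replace(rule.peculiar, rule.formal)
    return (w_freq * rule.frequency
            - w_dist * edit_distance(text, substituted)
            + w_gain * rule.segmentation_gain)


def select_rule(rules, text):
    """Return the best-scoring rule that applies to the text, or None."""
    applicable = [r for r in rules if r.peculiar in text]
    if not applicable:
        return None
    return max(applicable, key=lambda r: score(r, text))


# Toy example with romanized strings (invented data, not from the paper).
rules = [
    Rule("kawaiiii", "kawaii", frequency=5, segmentation_gain=0.2),
    Rule("kawaiiii", "kowai", frequency=5, segmentation_gain=0.2),
]
best = select_rule(rules, "kawaiiii desu")
print(best.formal if best else None)
```

In this toy case both rules share the same frequency and gain, so the smaller edit distance decides the choice. In practice the three criteria sit on different scales, and the weights would need to be tuned or replaced by whatever combination the paper actually uses.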
