Conference: AAAI Conference on Artificial Intelligence

Word Segmentation for Chinese Novels



Abstract

Word segmentation is a necessary first step for automatic syntactic analysis of Chinese text. Chinese segmentation is highly accurate on news data, but the accuracies drop significantly on other domains, such as science and literature. For scientific domains, a significant portion of out-of-vocabulary words are domain-specific terms, and therefore lexicons can be used to improve segmentation significantly. For the literature domain, however, there is not a fixed set of domain terms. For example, each novel can contain a specific set of person, organization and location names. We investigate a method for automatically mining common noun entities for each novel using information extraction techniques, and use the resulting entities to improve a state-of-the-art segmentation model for the novel. In particular, we design a novel double-propagation algorithm that mines noun entities together with common contextual patterns, and use them as plug-in features to a model trained on the source domain. An advantage of our method is that no retraining for the segmentation model is needed for each novel, and hence it can be applied efficiently given the huge number of novels on the web. Results on five different novels show significantly improved accuracies, in particular for OOV words.
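
The abstract does not spell out the double-propagation algorithm. As a rough illustration of the general idea only (alternately growing a set of entities and a set of context patterns until neither changes), a minimal Python sketch on pre-tokenized toy data might look like the following. The function name `double_propagation`, the (left, right) neighbour pattern shape, the `min_support` threshold, and the toy sentences are all assumptions made for this example, not the authors' implementation.

```python
from collections import Counter

def _contexts(tokens):
    """Yield (token, (left_neighbour, right_neighbour)) for each token."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    for i in range(1, len(padded) - 1):
        yield padded[i], (padded[i - 1], padded[i + 1])

def double_propagation(sentences, seed_entities, min_support=2, max_iters=10):
    """Hypothetical bootstrapping loop: entities suggest patterns,
    patterns suggest entities, repeated until a fixed point."""
    entities = set(seed_entities)
    patterns = set()
    for _ in range(max_iters):
        # Step 1: collect contexts that surround known entities.
        pattern_counts = Counter(
            ctx for s in sentences for tok, ctx in _contexts(s) if tok in entities
        )
        new_patterns = {p for p, c in pattern_counts.items() if c >= min_support}

        # Step 2: collect tokens that appear inside known patterns.
        known = patterns | new_patterns
        entity_counts = Counter(
            tok for s in sentences for tok, ctx in _contexts(s) if ctx in known
        )
        new_entities = {e for e, c in entity_counts.items() if c >= min_support}

        # Stop when neither set grows.
        if new_patterns <= patterns and new_entities <= entities:
            break
        patterns |= new_patterns
        entities |= new_entities
    return entities, patterns

# Toy, already-tokenized data (assumed for illustration only).
SENTENCES = [
    ["Zhang", "said", "yes"],
    ["Zhang", "said", "no"],
    ["Li", "said", "yes"],
    ["Li", "said", "maybe"],
    ["Wang", "said", "no"],   # seen only once, below min_support
    ["they", "left"],
]

entities, patterns = double_propagation(SENTENCES, {"Zhang"})
```

In this toy run the seed "Zhang" yields the pattern `("<s>", "said")`, which in turn picks up "Li"; "Wang" is left out because it fills the pattern only once. How the mined entities and patterns are turned into plug-in features for the segmentation model is not described in the abstract and is not attempted here.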
