Conference: Intelligence and Security Informatics

Chinese Word Segmentation for Terrorism-Related Contents



Abstract

To analyze security- and terrorism-related content in Chinese, it is important to perform word segmentation on Chinese documents. Chinese word segmentation has been studied extensively, and the two major approaches are statistical and dictionary-based. Purely statistical methods have lower precision, while purely dictionary-based methods cannot handle new words and are limited by the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. By combining a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can identify new words through MI-based token merging and dictionary updates. In addition, with the Improved Bigram method it can also process N-grams. To evaluate the performance of our segmenter, we compare it with the Hylanda segmenter and the ICTCLAS segmenter on a terrorism-related corpus. The experimental results show that IASeg outperforms both benchmarks in precision and recall.
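The abstract does not spell out how IASeg merges tokens, so the following is only a minimal sketch of the general idea of MI-based token merging on top of a dictionary segmenter. It assumes forward maximum matching for the dictionary pass and pointwise mutual information over adjacent tokens for the merge decision. The function names, the threshold, and the minimum count are hypothetical choices for illustration; the suffix tree and the Improved Bigram method are not modeled here.

```python
import math
from collections import Counter


def dictionary_segment(text, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def pointwise_mi(pair_count, left_count, right_count, n_tokens, n_pairs):
    """PMI(x, y) = log(P(x, y) / (P(x) * P(y))), estimated from raw counts."""
    p_xy = pair_count / n_pairs
    p_x = left_count / n_tokens
    p_y = right_count / n_tokens
    return math.log(p_xy / (p_x * p_y))


def find_new_words(sentences, dictionary, threshold=1.0, min_count=2):
    """Propose new words by merging adjacent tokens whose PMI exceeds a threshold.

    threshold and min_count are illustrative values, not taken from the paper.
    """
    segmented = [dictionary_segment(s, dictionary) for s in sentences]
    unigrams = Counter(tok for sent in segmented for tok in sent)
    bigrams = Counter(pair for sent in segmented for pair in zip(sent, sent[1:]))
    n_tokens = sum(unigrams.values())
    n_pairs = sum(bigrams.values())

    new_words = set()
    for (left, right), count in bigrams.items():
        if count < min_count:
            continue  # too rare to trust the estimate
        score = pointwise_mi(count, unigrams[left], unigrams[right], n_tokens, n_pairs)
        if score > threshold:
            new_words.add(left + right)
    return new_words


# Toy corpus: "基地" is not in the starting dictionary.
corpus = ["基地组织", "分析基地", "组织分析"]
dictionary = {"组织", "分析"}

new_words = find_new_words(corpus, dictionary)
updated_dictionary = dictionary | new_words  # dictionary update step
resegmented = dictionary_segment("基地组织", updated_dictionary)
```

In this toy run the two characters of "基地" are first emitted as separate tokens because the word is missing from the dictionary. They always occur together, so their PMI is high, the pair is proposed as a new word, and segmenting again with the updated dictionary keeps it whole. A real system would need a much larger domain corpus for the counts to be meaningful.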
