Domain phrase identification using atomic word formation in Chinese text

Abstract

Chinese word segmentation is difficult because Chinese text has no white space to mark word boundaries, and the result depends largely on the quality of the segmentation dictionary. Many domain phrases are cut into single words because they are not contained in the general dictionary. This paper presents a Chinese domain phrase identification algorithm based on atomic word formation. First, an atomic word formation algorithm extracts candidate strings from the preprocessed corpus, and the extracted strings are stored as the candidate domain phrase set. Second, several strategies, including repeated substring screening, part-of-speech (POS) combination filtering, and prefix and suffix filtering, are used to filter the candidate domain phrases. Third, a domain phrase refining method decides whether a string is a domain phrase by calculating its domain relevance. Finally, all identified strings are sorted and output to users. Guided by morphological rules, the method combines statistical information with rules instead of relying on corpus-based machine learning. Experiments show that the method obtains better results than traditional n-gram methods.
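
The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so every function name, the `max_words`, `min_freq` and `threshold` parameters, and the smoothed frequency-ratio used as "domain relevance" are assumptions chosen for illustration. The input is assumed to be already segmented into atomic words, and the POS combination and prefix/suffix filters are omitted because they would need a tagger and rule lists that the abstract does not specify.

```python
from collections import Counter

def extract_candidates(sentences, max_words=3, min_freq=2):
    """Join runs of 2..max_words adjacent atomic words and count each string."""
    counts = Counter()
    for words in sentences:
        for n in range(2, max_words + 1):
            for i in range(len(words) - n + 1):
                counts["".join(words[i:i + n])] += 1
    return {s: c for s, c in counts.items() if c >= min_freq}

def screen_repeated_substrings(candidates):
    """Drop a candidate when a longer candidate contains it with the same count."""
    kept = {}
    for s, c in candidates.items():
        absorbed = any(s != t and s in t and c == ct
                       for t, ct in candidates.items())
        if not absorbed:
            kept[s] = c
    return kept

def domain_relevance(count_domain, total_domain, count_general, total_general):
    """Assumed score: ratio of add-one smoothed relative frequencies."""
    p_domain = (count_domain + 1) / (total_domain + 1)
    p_general = (count_general + 1) / (total_general + 1)
    return p_domain / p_general

def identify_domain_phrases(domain_sents, general_sents, threshold=1.5):
    """Extract, screen, score, and sort candidates by descending relevance."""
    domain = screen_repeated_substrings(extract_candidates(domain_sents))
    general = extract_candidates(general_sents, min_freq=1)
    total_d = sum(len(s) for s in domain_sents)   # atomic words in domain corpus
    total_g = sum(len(s) for s in general_sents)  # atomic words in general corpus
    scored = [(s, domain_relevance(c, total_d, general.get(s, 0), total_g))
              for s, c in domain.items()]
    return sorted((p for p in scored if p[1] >= threshold),
                  key=lambda p: -p[1])

# Toy corpora (hypothetical data), each sentence pre-split into atomic words.
domain_corpus = [["支持", "向量", "机", "分类"], ["支持", "向量", "机", "模型"]]
general_corpus = [["我", "支持", "你"], ["分类", "模型"]]
phrases = identify_domain_phrases(domain_corpus, general_corpus)
```

On the toy data, the shorter strings 支持向量 and 向量机 occur exactly as often as 支持向量机, so the screening step keeps only the longer string before it is scored against the general corpus.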

Bibliographic details

  • Source
    Knowledge-Based Systems, 2011, No. 8, pp. 1254-1260 (7 pages)
  • Author affiliations

    National Engineering Research Center for E-Learning, Huazhong Normal University, Wuhan, Hubei 430079, China; Department of Information Technology, Huazhong Normal University, Wuhan, Hubei 430079, China;

    National Engineering Research Center for E-Learning, Huazhong Normal University, Wuhan, Hubei 430079, China;

    National Engineering Research Center for E-Learning, Huazhong Normal University, Wuhan, Hubei 430079, China;

    Department of Information Technology, Huazhong Normal University, Wuhan, Hubei 430079, China;

  • Indexing information
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

    domain phrase; word formation; atomic word; string filtering; domain relevance;
