Journal: Information Retrieval

Collection-based compound noun segmentation for Korean information retrieval



Abstract

Compound noun segmentation is a key first step in Korean language processing. Most existing approaches require some form of human supervision, such as pre-existing dictionaries, segmented compound nouns, or heuristic rules, and therefore suffer from the unknown word problem. Unsupervised approaches can overcome this problem. However, previous unsupervised methods normally do not consider all possible segmentation candidates and/or rely on character-based segmentation clues such as bi-grams or all-length n-grams, so they are prone to settling on a locally optimal solution. To address this, this paper proposes an unsupervised segmentation algorithm that searches all possible segmentation candidates for the most likely segmentation, using a word-based segmentation context. To supply word-based segmentation clues, a dictionary is generated automatically from a corpus. Experiments on three test collections show that the algorithm can be applied successfully to Korean information retrieval and improves on a dictionary-based longest-matching algorithm.
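The abstract does not give the paper's scoring function or its dictionary-construction procedure, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a unigram log-probability score, a dictionary built by simply counting corpus tokens, and a fixed per-character penalty for unknown pieces; `build_dictionary`, `best_segmentation`, `longest_match`, and `unknown_penalty` are hypothetical names introduced for illustration. The dynamic program considers every possible split of the compound, in contrast to the greedy longest-matching baseline shown alongside it.

```python
import math
from collections import Counter


def build_dictionary(corpus_tokens):
    """Count token frequencies (assumed stand-in for an automatically built dictionary)."""
    return Counter(corpus_tokens)


def best_segmentation(compound, freq, unknown_penalty=-20.0):
    """Return the highest-scoring split of `compound` over all candidate splits.

    Score = sum of log unigram probabilities of the pieces (an assumption).
    Pieces missing from `freq` cost `unknown_penalty` per character.
    """
    total = sum(freq.values())
    n = len(compound)
    # best[i] holds (score, pieces) for the prefix compound[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * n
    for end in range(1, n + 1):
        for start in range(end):
            prev_score, prev_pieces = best[start]
            if prev_pieces is None:
                continue
            piece = compound[start:end]
            if piece in freq:
                score = math.log(freq[piece] / total)
            else:
                score = unknown_penalty * len(piece)
            if prev_score + score > best[end][0]:
                best[end] = (prev_score + score, prev_pieces + [piece])
    return best[n][1]


def longest_match(compound, freq):
    """Greedy left-to-right longest-matching baseline (single characters as fallback)."""
    pieces, i = [], 0
    while i < len(compound):
        for j in range(len(compound), i, -1):
            if compound[i:j] in freq or j == i + 1:
                pieces.append(compound[i:j])
                i = j
                break
    return pieces


# Toy frequencies, invented purely for illustration.
toy = build_dictionary(
    ["대학"] * 5 + ["대학생"] * 3 + ["생선"] * 2 + ["선교회"] * 2 + ["교회"] * 4
)
print(best_segmentation("대학생선교회", toy))
```

Because every prefix keeps only its best-scoring split, the search covers all candidate segmentations in quadratic time rather than enumerating them one by one.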
