Journal: Communications, China

An improved unsupervised approach to word segmentation


Abstract

ESA is an unsupervised approach to word segmentation previously proposed by Wang. It is an iterative process consisting of three phases: Evaluation, Selection and Adjustment. In this article, we propose ExESA, an extension of ESA. In ExESA, the original approach is extended to a two-pass process, and the ratio of different word lengths is introduced as a third type of information, combined with cohesion and separation. A maximum strategy is adopted to determine the best segmentation of a character sequence in the Selection phase. In the Adjustment phase, ExESA re-evaluates separation information and individual information to overcome the overestimation of frequencies. Additionally, a smoothing algorithm is applied to alleviate sparseness. The experimental results show that ExESA can further improve performance and save time by properly utilizing more information from un-annotated corpora. Moreover, the parameters of ExESA can be predicted by a set of empirical formulae or in combination with the minimum description length principle.
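
The abstract does not give the scoring formulae, so the following is only a minimal sketch of what a maximum strategy for the Selection phase might look like: a dynamic program that picks the segmentation of a character sequence with the highest total score. The names `best_segmentation`, `toy_score` and `max_len` are hypothetical, and the toy score table is an assumption standing in for whatever combination of cohesion, separation and word-length ratio ExESA actually uses.

```python
from typing import Callable, List, Tuple

def best_segmentation(
    chars: str,
    score: Callable[[str], float],
    max_len: int = 4,
) -> Tuple[float, List[str]]:
    """Return the highest-scoring segmentation of `chars`.

    `score` is a caller-supplied goodness measure for a candidate word
    (an assumption; the paper's actual measure is not given in the abstract).
    """
    n = len(chars)
    # best[i] holds (total score, word list) for the prefix chars[:i].
    best: List[Tuple[float, List[str]]] = [(0.0, [])] + [(float("-inf"), [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = chars[start:end]
            prev_total, prev_words = best[start]
            total = prev_total + score(word)
            if total > best[end][0]:
                best[end] = (total, prev_words + [word])
    return best[n]

# Toy scores, purely illustrative: known "words" score high,
# single characters score low, unknown longer strings are penalised.
TOY_SCORES = {"ab": 2.0, "cd": 2.0, "abc": 2.5}

def toy_score(word: str) -> float:
    if word in TOY_SCORES:
        return TOY_SCORES[word]
    return 0.5 if len(word) == 1 else -1.0

if __name__ == "__main__":
    print(best_segmentation("abcd", toy_score))
```

With the toy table, "ab" + "cd" beats "abc" + "d" because the summed score is higher, which is the sense in which the best segmentation is taken to be the maximum over all candidates here.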
