Pacific Asia Conference on Language, Information and Computation

Vietnamese Word Segmentation with CRFs and SVMs: An Investigation

Abstract

Word segmentation for Vietnamese, as for most Asian languages, is an important task with a significant impact on higher levels of language processing. However, it has received little attention from the community because no common annotated corpus exists for evaluation and comparison. In addition, most previous studies relied on unsupervised statistical approaches or combined too many techniques, so their accuracies are not as high as expected. This paper reports a careful investigation of conditional random fields (CRFs) and support vector machines (SVMs), two of the most successful statistical learning methods in NLP and pattern recognition, for this task. We first build a moderately sized annotated corpus from different sources of material. For a careful evaluation, we train different CRF and SVM models with different feature settings and compare and contrast their results. We also discuss several important points about accuracy, computational cost, corpus size, and other aspects that might influence the overall quality of Vietnamese word segmentation.
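
As an illustration only: the abstract does not state how the task is encoded for the CRF and SVM models. A common formulation, assumed here, treats each whitespace-separated Vietnamese syllable as a token and labels it B (begins a word) or I (continues a word), so that segmentation becomes a tagging problem. The sketch below shows that encoding and a simple context-window feature extractor. The function names, the B/I scheme, and the feature template are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def words_to_tags(words: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Flatten a segmented sentence (a list of words, each a list of
    syllables) into parallel syllable and B/I tag sequences."""
    syllables: List[str] = []
    tags: List[str] = []
    for word in words:
        for position, syllable in enumerate(word):
            syllables.append(syllable)
            # Assumed scheme: first syllable of a word is B, the rest are I.
            tags.append("B" if position == 0 else "I")
    return syllables, tags


def tags_to_words(syllables: List[str], tags: List[str]) -> List[List[str]]:
    """Rebuild words from syllables and predicted B/I tags."""
    words: List[List[str]] = []
    for syllable, tag in zip(syllables, tags):
        if tag == "B" or not words:
            words.append([syllable])      # start a new word
        else:
            words[-1].append(syllable)    # extend the current word
    return words


def window_features(syllables: List[str], i: int, size: int = 2) -> Dict[str, object]:
    """Hypothetical feature template: lowercased syllables in a window
    around position i, plus a capitalization flag for the current one."""
    features: Dict[str, object] = {}
    for offset in range(-size, size + 1):
        j = i + offset
        inside = 0 <= j < len(syllables)
        features[f"syl[{offset}]"] = syllables[j].lower() if inside else "<PAD>"
    features["is_capitalized"] = syllables[i][:1].isupper()
    return features


# "Học sinh" (student) + "học" (to study) + "sinh học" (biology)
sentence = [["Học", "sinh"], ["học"], ["sinh", "học"]]
syllables, tags = words_to_tags(sentence)
per_token_features = [window_features(syllables, i) for i in range(len(syllables))]
```

The per-token feature dictionaries and tags produced this way could be passed to a sequence model such as a CRF, or to a per-token classifier such as an SVM. Changing the window size or adding features is one way to vary the "feature settings" that the abstract mentions, though the paper's actual settings are not given here.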