Pacific Asia Conference on Language, Information and Computation; November 1-3, 2006; Wuhan, China
Vietnamese Word Segmentation with CRFs and SVMs: An Investigation

Abstract

Word segmentation for Vietnamese, as for most Asian languages, is an important task with a significant impact on higher levels of language processing. However, it has received little attention from the community because there is no common annotated corpus for evaluation and comparison. In addition, most previous studies focused on unsupervised statistical approaches or combined too many techniques, and consequently their accuracy is not as high as expected. This paper reports a careful investigation of conditional random fields (CRFs) and support vector machines (SVMs), two of the most successful statistical learning methods in NLP and pattern recognition, for solving the task. We first build a moderately sized annotated corpus from several different sources. For a careful evaluation, we trained different CRF and SVM models with different feature settings and compared and contrasted their results. Finally, we discuss several important points about accuracy, computational cost, corpus size, and other aspects that might influence the overall quality of Vietnamese word segmentation.
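
The abstract does not state how segmentation is presented to the CRF and SVM learners. One common formulation, assumed here and not confirmed by this record, treats it as syllable tagging: each whitespace-separated syllable is labeled B (begins a word) or I (continues the current word), and a classifier or sequence model predicts those labels from features of the surrounding syllables. The sketch below illustrates only that data representation. The function names, the two-tag scheme, and the features are hypothetical choices for illustration, and no model is trained.

```python
def words_to_tags(words):
    """Convert segmented words into (syllable, tag) pairs.

    Each word is a string of one or more space-separated syllables.
    Assumed tag scheme: "B" = first syllable of a word, "I" = any later syllable.
    """
    pairs = []
    for word in words:
        for i, syllable in enumerate(word.split()):
            pairs.append((syllable, "B" if i == 0 else "I"))
    return pairs


def tags_to_words(pairs):
    """Rebuild words from (syllable, tag) pairs produced by a tagger."""
    words = []
    for syllable, tag in pairs:
        if tag == "B" or not words:
            words.append(syllable)          # start a new word
        else:
            words[-1] += " " + syllable     # extend the current word
    return words


def syllable_features(syllables, i, window=1):
    """Illustrative features for position i: nearby syllables and capitalization."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        inside = 0 <= j < len(syllables)
        feats[f"syl[{offset}]"] = syllables[j].lower() if inside else "<PAD>"
    feats["is_capitalized"] = syllables[i][:1].isupper()
    return feats


# "Học sinh học sinh học": the same two syllables form different words
# depending on position, which is why context features matter.
segmented = ["Học sinh", "học", "sinh học"]
tagged = words_to_tags(segmented)
restored = tags_to_words(tagged)
```

Feature dictionaries of this kind could be passed to either a per-syllable classifier such as an SVM or a sequence model such as a CRF. The difference is that the CRF scores the whole tag sequence jointly, while the SVM classifies each position separately.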