Vietnamese Word Segmentation with CRFs and SVMs: An Investigation

Abstract

Word segmentation for Vietnamese, as for most Asian languages, is an important task with a significant impact on higher levels of language processing. However, it has received little attention from the community because no common annotated corpus exists for evaluation and comparison. In addition, most previous studies focused on unsupervised statistical approaches or combined too many techniques, so their accuracy is not as high as expected. This paper reports a careful investigation of conditional random fields (CRFs) and support vector machines (SVMs), two of the most successful statistical learning methods in NLP and pattern recognition, for solving the task. We first build a moderately sized annotated corpus from different sources of material. For a careful evaluation, CRF and SVM models with different feature settings were trained, and their results are compared and contrasted with each other. In addition, we discuss several important points about accuracy, computational cost, corpus size, and other aspects that might influence the overall quality of Vietnamese word segmentation.

Bibliographic details

Source
Pacific Asia Conference on Language, Information and Computation; November 1-3, 2006; Wuhan (CN) | 2006 | pp. 215-222 | 8 pages
Conference location: Wuhan (CN)
Authors
Cam-Tu Nguyen; Trung-Kien Nguyen; Xuan-Hieu Phan; Le-Minh Nguyen; Quang-Thuy Ha;
Author affiliations

College of Technology, Vietnam National University, Hanoi;

School of Information Science, Japan Advanced Institute of Science and Technology;

Original format: PDF
Language: English
Chinese Library Classification: Computer networks
Keywords
word segmentation; segmenting and labeling sequence data; conditional random fields; support vector machines; maximum matching;

Added to database: 2022-08-26 14:21:00

Similar literature

1. Word Segmentation for Burmese Based on Dual-Layer CRFs [J]. Zhang Shaoning, Mao Cunli, Yu Zhengtao. ACM Transactions on Asian Language Information Processing. 2019, No. 1.
2. Segmentation-free word spotting with exemplar SVMs [J]. Jon Almazán, Albert Gordo, Alicia Fornés. Pattern Recognition: The Journal of the Pattern Recognition Society. 2014, No. 12.
3. Chinese Word Segmentation via BiLSTM+Semi-CRF with Relay Node [J]. Nuo Qun, Hang Yan, Xi-Peng Qiu. Journal of Computer Science and Technology (English edition). 2020, No. 5.
4. Vietnamese Word Segmentation with CRFs and SVMs: An Investigation [C]. Cam-Tu Nguyen, Trung-Kien Nguyen, Xuan-Hieu Phan. Pacific Asia Conference on Language, Information and Computation. 2006.
5. Learning a two-stage SVM/CRF sequence classifier [D]. Hoefel, Guilherme. 2008.
6. Combined SVM-CRFs for Biological Named Entity Recognition with Maximal Bidirectional Squeezing [O]. Fei Zhu, Bairong Shen. 2009.
7. Vietnamese Word Segmentation with CRFs and SVMs: An Investigation [O]. Nguyen Cam-Tu, Nguyen Trung-Kien, Phan Xuan-Hieu. 2006.
