Combining Machine Learning with Linguistic Heuristics for Chinese Word Segmentation

Abstract

This paper describes a hybrid model that combines machine learning with linguistic heuristics for integrating unknown word identification with Chinese word segmentation. The model consists of two components: a position-of-character (POC) tagging component that annotates each character in a sentence with a POC tag that indicates its position in a word, and a merging component that transforms a POC-tagged character sequence into a word-segmented sentence. The tagging component uses a support vector machine based tagger to produce an initial tagging of the text and a transformation-based tagger to improve the initial tagging. In addition to the POC tags assigned to the characters, the merging component incorporates a number of linguistic and statistical heuristics to detect words with regular internal structures, recognize long words, and filter non-words. Experiments show that, without resorting to a separate unknown word identification mechanism, the model achieves an F-score of 95.0% for word segmentation and a competitive recall of 74.8% for unknown word recognition.
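
The merging step described above, which turns a POC-tagged character sequence into a word-segmented sentence, might be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not name the POC tag labels, so the four labels used here (`L`, `M`, `R`, `S`) and the function name `merge_poc` are assumptions. The sketch also omits the linguistic and statistical heuristics the merging component applies (detecting words with regular internal structure, recognizing long words, filtering non-words).

```python
from typing import List, Sequence

# Hypothetical POC tag labels, assumed for illustration only:
#   "L" = word-initial character, "M" = word-medial character,
#   "R" = word-final character,   "S" = single-character word.
WORD_INITIAL = {"L", "S"}
WORD_FINAL = {"R", "S"}


def merge_poc(chars: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Merge a POC-tagged character sequence into a list of words."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")

    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        # A word-initial tag closes any unfinished word, so an
        # inconsistent tag sequence still yields a segmentation.
        if tag in WORD_INITIAL and current:
            words.append(current)
            current = ""
        current += ch
        # A word-final tag completes the word being built.
        if tag in WORD_FINAL:
            words.append(current)
            current = ""
    # Flush a trailing word that never received a word-final tag.
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    sentence = list("我喜欢自然语言")
    tags = ["S", "L", "R", "L", "R", "L", "R"]
    print(" / ".join(merge_poc(sentence, tags)))
```

In a pipeline like the one in the abstract, the tags would come from the SVM-based tagger after correction by the transformation-based tagger; here they are supplied by hand.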


Bibliographic details

Source
International Florida Artificial Intelligence Research Society Conference (FLAIRS 2007); May 7-9, 2007; Key West, FL (US) | 2007 | pp. 241-246 | 6 pages
Conference location: Key West, FL (US)
Author
Xiaofei Lu
Author affiliation

Department of Linguistics and Applied Language Studies, The Pennsylvania State University, University Park, PA 16802, USA

Original format: PDF
Text language: English
Chinese Library Classification: Artificial intelligence theory
