In this paper, we present an unsupervised model for Chinese word segmentation based on the word-formation power of character strings (the word form model, WFM) and the affinity of character junctures (the character juncture model, CJM). We also propose a formula to measure the size of the segmentation space and adopt a two-way segmentation algorithm in our system. Finally, we devise a modified version of Chinese word-formation patterns to identify unknown words. Since all parameters can be estimated directly from unsegmented text, the proposed approaches are highly adaptable and proved efficient in our preliminary experiments.
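To make the idea of a character juncture score estimated from unsegmented text concrete, the following is a minimal sketch. It is not the CJM formula of the paper, which the abstract does not state. As a stand-in it assumes pointwise mutual information (PMI) between adjacent characters as the affinity measure, and it cuts the text at every juncture whose PMI does not exceed a threshold. The names `juncture_affinity` and `segment` and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter


def juncture_affinity(corpus):
    """Build a PMI scorer for adjacent character pairs from raw, unsegmented lines.

    Assumption: PMI stands in for the paper's juncture affinity, which the
    abstract does not define.
    """
    chars = Counter()
    pairs = Counter()
    for line in corpus:
        chars.update(line)
        pairs.update(zip(line, line[1:]))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())

    def pmi(a, b):
        # A pair never seen in the corpus gets the lowest possible affinity.
        if pairs[(a, b)] == 0:
            return float("-inf")
        p_ab = pairs[(a, b)] / n_pairs
        p_a = chars[a] / n_chars
        p_b = chars[b] / n_chars
        return math.log(p_ab / (p_a * p_b))

    return pmi


def segment(text, pmi, threshold=0.0):
    """Split text at every juncture whose affinity does not exceed threshold."""
    if not text:
        return []
    words = [text[0]]
    for a, b in zip(text, text[1:]):
        if pmi(a, b) > threshold:
            words[-1] += b      # strong juncture: keep the characters together
        else:
            words.append(b)     # weak juncture: start a new word
    return words
```

Because both counters are filled from raw text alone, no segmented training data is needed, which is the property the abstract emphasizes. A realistic system would also need smoothing for rare pairs and a principled way to choose the threshold.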