Advanced techniques for Chinese chunk segmentation and the similarity measure of Chinese sentences.

Abstract

This thesis addresses two important problems in Chinese information processing: Chinese chunk segmentation and the similarity measure of Chinese sentences. It reports three main contributions: (1) a novel Chinese chunk segmentation technique that combines a statistical model with correction rules generated by an error-correction mechanism; (2) a novel similarity measure for Chinese sentences that uses both the word/chunk sequences and the POS (part-of-speech) tag sequences of the sentences; and (3) the optimization of the parameters of the combined similarity measure by applying a relevance feedback technique and a neural network model.

In the first investigation, a statistical model combined with correction rules generated by an error-correction mechanism is proposed for Chinese chunk segmentation. The Chinese sentences in the training corpus were segmented into chunks manually to provide the ground truth for training the statistical model, from which preliminary chunk segmentation results are obtained. The output of the statistical model (both correctly and incorrectly segmented chunks) is then used to generate a set of correction rules for refining the segmentation result. The error-correction mechanism generates these rules by comparing the preliminary segmentation result with the manually segmented result. The statistical model and the learned correction rules can then be used to segment unseen sentences into chunks.

In the second investigation, novel similarity measures for Chinese sentences are proposed that use the word/chunk sequences and POS tag sequences of the sentences. The sentence similarity measure is an important component of example-based machine translation (EBMT). Unlike English sentences, Chinese sentences have no delimiter between words, so word/chunk delimitation must be performed before a sentence similarity measure can be computed. Both the word/chunk sequence feature and the POS tag sequence feature used in the proposed similarity measures are based on word/chunk segmentation. Sentence structure information is partially reflected in the POS tag sequence. The proposed word-sequence-matching-based (WSMB) method considers three factors between two sentences: the number of identical word sequences, the length of each identical word sequence, and the average weighting (AW) of each identical word sequence. In computing AW, every POS tag is weighted according to its importance. The POS-tag-sequence-matching-based (PTSMB) method measures the similarity of Chinese sentences in terms of their structures: if the constituents of two Chinese sentences are similar, the two sentences are judged to be similar in structure. The main idea of this measure is to match the POS tags of two Chinese sentences using directed graphs. POS weighting is also used in this process.

In the third investigation, we propose a human-computer interaction approach, based on a relevance feedback scheme and a neural network model, to optimize the parameters of the combined similarity measure of Chinese sentences. In the relevance feedback process, users' intentions and preferences in ranking the candidate sentences are captured and used to modify the parameters of the similarity measure. For the parameter optimization research, a web-based questionnaire was designed to collect users' feedback data. In this pioneering study, we constructed 50 groups of sentences. Each group contains one source sentence and ten sentences to be retrieved. The ten test sentences are shown in descending order of similarity to the source sentence. Users who disagree with the computer's ranking are asked to provide a new rank according to their own judgment. The new rank is converted
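
The three WSMB factors named in the abstract (the number of identical word sequences, their lengths, and their average POS weighting) can be illustrated with a minimal sketch. The abstract gives neither the POS weights nor the formula that combines the factors, so everything below is an assumption made for illustration, not the thesis's actual method: the `POS_WEIGHT` table, the greedy longest-first extraction in `common_runs`, the `FRAGMENT_PENALTY` constant, and the combined score in `wsmb_similarity` are all hypothetical.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag) from an already segmented, tagged sentence

# Hypothetical POS weights: the abstract says tags are weighted by importance
# but does not list the values.
POS_WEIGHT: Dict[str, float] = {"n": 1.0, "v": 1.0, "a": 0.8, "d": 0.5, "p": 0.3, "u": 0.2}
DEFAULT_WEIGHT = 0.5
FRAGMENT_PENALTY = 0.1  # hypothetical: discounts a match split into many short runs


def weight(token: Token) -> float:
    return POS_WEIGHT.get(token[1], DEFAULT_WEIGHT)


def common_runs(a: List[Token], b: List[Token]) -> List[List[Token]]:
    """Greedily extract non-overlapping identical word sequences, longest first."""
    used_a = [False] * len(a)
    used_b = [False] * len(b)
    runs: List[List[Token]] = []
    while True:
        best = (0, 0, 0)  # (length, start in a, start in b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and not used_a[i + k] and not used_b[j + k]
                       and a[i + k][0] == b[j + k][0]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length == 0:
            return runs
        runs.append(a[i:i + length])
        for k in range(length):
            used_a[i + k] = True
            used_b[j + k] = True


def average_weight(run: List[Token]) -> float:
    """AW of one identical word sequence: the mean POS weight of its tokens."""
    return sum(weight(t) for t in run) / len(run)


def wsmb_similarity(a: List[Token], b: List[Token]) -> float:
    """Assumed combination of the three factors, scaled to the range [0, 1]."""
    if not a or not b:
        return 0.0
    runs = common_runs(a, b)
    if not runs:
        return 0.0
    # Length and AW: each run contributes length * AW (its total POS weight).
    matched = sum(len(r) * average_weight(r) for r in runs)
    total = max(sum(weight(t) for t in a), sum(weight(t) for t in b))
    # Number of runs: the same coverage in fewer, longer runs scores higher.
    return matched / total / (1.0 + FRAGMENT_PENALTY * (len(runs) - 1))


if __name__ == "__main__":
    s1 = [("我", "r"), ("喜欢", "v"), ("苹果", "n")]
    s2 = [("他", "r"), ("喜欢", "v"), ("苹果", "n")]
    print(common_runs(s1, s2))
    print(wsmb_similarity(s1, s2))
```

Under these assumptions, two identical sentences score 1 and two sentences that share no words score 0. The sketch matches on word forms only; the directed-graph matching of POS tags used by the PTSMB method is not covered here.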

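Bibliographic details

  • Author

    Wang, Rongbo

  • Author affiliation

    Hong Kong Polytechnic University (People's Republic of China)

  • Degree-granting institution: Hong Kong Polytechnic University (People's Republic of China)
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2006
  • Pages: 156 p.
  • Total pages: 156
  • Original format: PDF
  • Language of text: English (eng)
  • Chinese Library Classification: Automation technology, computer technology
  • Date added to database: 2022-08-17 11:39:47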
