Journal: Pattern Recognition and Artificial Intelligence (模式识别与人工智能)

Unsupervised Automatic Segmentation and Unsupervised Feature Selection for Uyghur Text*

     

Abstract

Commonly used segmentation methods for Uyghur text produce a large number of semantically abstract, and even polysemous, word features, which makes it difficult for learning algorithms to discover the hidden structure in high-dimensional data. This paper proposes an unsupervised segmentation method, dme-TS, and an unsupervised feature selection method, UMRMR-UFS. dme-TS automatically derives word-based bi-grams and contextual information from a large-scale raw text corpus, and takes a linear combination of the difference of t-test, mutual information, and the adjacency entropy of double-word contexts as a combined statistic (dme) to measure the agglutinative strength between two adjacent Uyghur words. Text is thereby segmented into a feature set of semantically concrete, independent linguistic units. UMRMR-UFS uses an improved unsupervised feature selection criterion (UMRMR), which jointly considers maximum relevance and minimum redundancy, to estimate the importance of each feature, and moves the most important features into the feature subset one at a time. Experimental results show that dme-TS effectively controls the size of the original feature set and improves the quality of the features themselves, and that the learning algorithm achieves its highest performance when text is represented by the feature subset selected by UMRMR-UFS.
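The abstract does not give the formula for dme, so the following is only a minimal sketch of the general idea of scoring adjacent word pairs from raw-corpus statistics and joining strongly associated pairs into one unit. It uses pointwise mutual information alone as a stand-in for the combined statistic; the t-test difference and adjacency entropy terms are omitted. The function names `pmi_scores` and `merge_strong_pairs`, the threshold value, and the English placeholder tokens are all assumptions made for illustration and do not come from the paper.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Pointwise mutual information for every adjacent word pair.

    Assumption: PMI stands in for the paper's dme statistic, whose exact
    formula is not stated in the abstract.
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (left, right), count in bigrams.items():
        p_pair = count / n_bi
        p_left = unigrams[left] / n_uni
        p_right = unigrams[right] / n_uni
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores


def merge_strong_pairs(words, scores, threshold):
    """Join adjacent words whose score exceeds the (hypothetical) threshold."""
    if not words:
        return []
    units = []
    current = [words[0]]
    for left, right in zip(words, words[1:]):
        if scores.get((left, right), float("-inf")) > threshold:
            current.append(right)  # strong association: extend the unit
        else:
            units.append(" ".join(current))  # weak association: cut here
            current = [right]
    units.append(" ".join(current))
    return units


# Toy corpus with placeholder tokens (not Uyghur data).
corpus = [
    ["new", "york", "is", "big"],
    ["new", "york", "is", "old"],
    ["big", "is", "old"],
]
scores = pmi_scores(corpus)
print(merge_strong_pairs(["new", "york", "is", "big"], scores, threshold=2.5))
```

On this toy corpus the pair ("new", "york") scores highest, so it is the only pair joined at a threshold of 2.5. A closer approximation of dme would replace the single PMI score with a weighted sum of several statistics, with weights that would have to be taken from the full paper.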
