IEEE Transactions on Knowledge and Data Engineering

Mining Distinction and Commonality across Multiple Domains Using Generative Model for Text Classification

Abstract

The distribution difference among multiple domains has been exploited for cross-domain text categorization in recent years. Along this line, we present two new observations in this study. First, the data distribution difference is often due to the fact that different domains use different index words to express the same concept. Second, the association between the conceptual feature and the document class can be stable across domains. These two observations indicate the distinction and commonality across domains. Inspired by them, we propose a generative statistical model, named Collaborative Dual-PLSA (CD-PLSA), to simultaneously capture both the distinction and the commonality among multiple domains. Different from Probabilistic Latent Semantic Analysis (PLSA), which has only one latent variable, the proposed model has two latent factors y and z, corresponding to word concept and document class, respectively. The shared commonality intertwines with the distinctions over multiple domains and also serves as the bridge for knowledge transfer. An Expectation Maximization (EM) algorithm is developed to solve the CD-PLSA model, and a distributed version is further developed to avoid uploading all the raw data to a centralized location, which helps to mitigate privacy concerns. After the training phase with all the data from multiple domains, we propose to refine the outputs of this phase using only the corresponding local data. In summary, we propose a two-phase method for cross-domain text classification: the first phase for collaborative training with all the data, and the second phase for local refinement. Finally, we conduct extensive experiments over hundreds of classification tasks with multiple source domains and multiple target domains to validate the superiority of the proposed method over existing state-of-the-art supervised and transfer learning methods. It is worth noting that, as shown by the experimental results, CD-PLSA in the collaborative training phase is more tolerant of distribution differences, and the local refinement also brings a significant improvement in classification accuracy.
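
To make the distinction-versus-commonality idea concrete, one illustrative way to write such a dual-latent-factor generative model (a sketch based only on the abstract above, not the paper's exact equations) is

P(w, d \mid r) = \sum_{y} \sum_{z} P(w \mid y, r)\, P(d \mid z, r)\, P(y, z),

where r indexes the domain, w a word, and d a document. In this sketch the domain-specific factors P(w | y, r) and P(d | z, r) play the role of the distinction (each domain may use different index words to express the same concept y), while the shared association P(y, z) between word concepts and document classes plays the role of the commonality that bridges the domains.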
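
The abstract also mentions an EM algorithm for fitting the model, together with a distributed variant. Below is a minimal NumPy sketch of plain (centralized) EM under the illustrative factorization above; all names are hypothetical, the dense posterior tensor is kept only for clarity, and the updates are not taken from the paper.

import numpy as np

def normalize(a, axis=None):
    # Scale so the entries along `axis` (or all entries) sum to 1.
    s = a.sum(axis=axis, keepdims=True) if axis is not None else a.sum()
    return a / (s + 1e-12)

def cd_plsa_em_sketch(counts, n_concepts, n_classes, n_iter=50, seed=0):
    # counts: one word-by-document count matrix per domain r.
    rng = np.random.default_rng(seed)
    # Domain-specific "distinction": P(w|y,r) and P(d|z,r).
    Pw_y = [normalize(rng.random((X.shape[0], n_concepts)), axis=0) for X in counts]
    Pd_z = [normalize(rng.random((X.shape[1], n_classes)), axis=0) for X in counts]
    # Shared "commonality": P(y,z), estimated jointly from all domains.
    Pyz = normalize(rng.random((n_concepts, n_classes)))
    for _ in range(n_iter):
        Pyz_acc = np.zeros_like(Pyz)
        for r, X in enumerate(counts):
            # E-step: posterior P(y,z | w,d,r) for every (word, document) cell.
            post = np.einsum('wy,dz,yz->wdyz', Pw_y[r], Pd_z[r], Pyz)
            post /= post.sum(axis=(2, 3), keepdims=True) + 1e-12
            exp_counts = X[:, :, None, None] * post  # expected counts n(w,d)*P(y,z|w,d,r)
            # M-step for the domain-specific factors.
            Pw_y[r] = normalize(exp_counts.sum(axis=(1, 3)), axis=0)  # -> P(w|y,r)
            Pd_z[r] = normalize(exp_counts.sum(axis=(0, 2)), axis=0)  # -> P(d|z,r)
            Pyz_acc += exp_counts.sum(axis=(0, 1))
        # M-step for the shared factor, pooled across domains.
        Pyz = normalize(Pyz_acc)
    return Pw_y, Pd_z, Pyz

In a distributed setting of the kind the abstract alludes to, each domain could run its E-step and the local sums above on its own data, exchanging only the aggregated (y, z) statistics needed to update the shared P(y, z), so raw documents would never leave their site.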