IEEE Transactions on Knowledge and Data Engineering

Mining Distinction and Commonality across Multiple Domains Using Generative Model for Text Classification

Abstract

The distribution difference among multiple domains has been exploited for cross-domain text categorization in recent years. Along this line, we present two new observations in this study. First, the data distribution difference is often due to the fact that different domains use different index words to express the same concept. Second, the association between the conceptual feature and the document class can be stable across domains. These two observations indicate the distinction and commonality across domains. Inspired by them, we propose a generative statistical model, named Collaborative Dual-PLSA (CD-PLSA), to simultaneously capture both the distinction and the commonality among multiple domains. Different from Probabilistic Latent Semantic Analysis (PLSA), which has only one latent variable, the proposed model has two latent factors y and z, corresponding to word concept and document class, respectively. The shared commonality intertwines with the distinctions over multiple domains and also serves as the bridge for knowledge transfer. An Expectation Maximization (EM) algorithm is developed to solve the CD-PLSA model, and a distributed version is further developed to avoid uploading all the raw data to a centralized location, which helps to mitigate privacy concerns. After the training phase with all the data from multiple domains, we propose to refine the outputs of this phase using only the corresponding local data. In summary, we propose a two-phase method for cross-domain text classification: the first phase for collaborative training with all the data, and the second phase for local refinement. Finally, we conduct extensive experiments over hundreds of classification tasks with multiple source domains and multiple target domains to validate the superiority of the proposed method over existing state-of-the-art supervised and transfer learning methods. It is worth noting that, as shown by the experimental results, CD-PLSA in the collaborative training phase is more tolerant of distribution differences, and the local refinement also brings a significant improvement in classification accuracy.
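
To make the distinction-versus-commonality idea concrete, one illustrative way to write such a dual-latent-factor generative model (a sketch based only on the abstract above, not the paper's exact equations) is

P(w, d \mid r) = \sum_{y} \sum_{z} P(w \mid y, r)\, P(d \mid z, r)\, P(y, z),

where r indexes the domain, w a word, and d a document. In this sketch the domain-specific factors P(w | y, r) and P(d | z, r) play the role of the distinction (each domain may use different index words to express the same concept y), while the shared association P(y, z) between word concepts and document classes plays the role of the commonality that bridges the domains.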
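
The abstract also mentions an EM algorithm for fitting the model, together with a distributed variant. Below is a minimal NumPy sketch of plain (centralized) EM under the illustrative factorization above; all names are hypothetical, the dense posterior tensor is kept only for clarity, and the updates are not taken from the paper.

import numpy as np

def normalize(a, axis=None):
    # Scale so the entries along `axis` (or all entries) sum to 1.
    s = a.sum(axis=axis, keepdims=True) if axis is not None else a.sum()
    return a / (s + 1e-12)

def cd_plsa_em_sketch(counts, n_concepts, n_classes, n_iter=50, seed=0):
    # counts: one word-by-document count matrix per domain r.
    rng = np.random.default_rng(seed)
    # Domain-specific "distinction": P(w|y,r) and P(d|z,r).
    Pw_y = [normalize(rng.random((X.shape[0], n_concepts)), axis=0) for X in counts]
    Pd_z = [normalize(rng.random((X.shape[1], n_classes)), axis=0) for X in counts]
    # Shared "commonality": P(y,z), estimated jointly from all domains.
    Pyz = normalize(rng.random((n_concepts, n_classes)))
    for _ in range(n_iter):
        Pyz_acc = np.zeros_like(Pyz)
        for r, X in enumerate(counts):
            # E-step: posterior P(y,z | w,d,r) for every (word, document) cell.
            post = np.einsum('wy,dz,yz->wdyz', Pw_y[r], Pd_z[r], Pyz)
            post /= post.sum(axis=(2, 3), keepdims=True) + 1e-12
            exp_counts = X[:, :, None, None] * post  # expected counts n(w,d)*P(y,z|w,d,r)
            # M-step for the domain-specific factors.
            Pw_y[r] = normalize(exp_counts.sum(axis=(1, 3)), axis=0)  # -> P(w|y,r)
            Pd_z[r] = normalize(exp_counts.sum(axis=(0, 2)), axis=0)  # -> P(d|z,r)
            Pyz_acc += exp_counts.sum(axis=(0, 1))
        # M-step for the shared factor, pooled across domains.
        Pyz = normalize(Pyz_acc)
    return Pw_y, Pd_z, Pyz

In a distributed setting of the kind the abstract alludes to, each domain could run its E-step and the local sums above on its own data, exchanging only the aggregated (y, z) statistics needed to update the shared P(y, z), so raw documents would never leave their site.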