Unsupervised Text Topic-Related Gene Extraction for Large Unbalanced Datasets

Jing-Ming Li; Jing-Tao Sun; Wen-Han Huang; Qiu-Yu Zhang; Zhen-Zhou Tian; Ning Lu


Abstract

Traditional unsupervised feature extraction algorithms generally assume that the clusters in a dataset are balanced. However, when topic keywords are extracted from large unlabelled, unbalanced text datasets, feature selection is guided by the similarities computed among features. As a result, the selected features cannot truly reflect the information in the original dataset, which degrades the subsequent performance of classifiers. To solve this problem, this paper proposes a new unsupervised method for extracting text topic-related genes. First, sample clusters are obtained by factor analysis and a density peak algorithm, and the dataset is labelled accordingly. Then, to account for the influence of the unbalanced distribution of sample clusters on feature selection, a CHI statistical matrix feature selection method that combines average local density and information entropy is used to strengthen the features of low-density small-sample clusters. Finally, a related gene extraction method based on exploring high-order relevance in multidimensional statistical data is described; it uses independent component analysis to enhance the generalisability of the selected features. In this way, text topic-related genes can be extracted without supervision from large unbalanced datasets. Experimental results suggest that the proposed method outperforms existing methods in extracting text subject terms from low-density small-sample clusters, and has higher prematurity and feature dimension-reduction ability.
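The first two steps described in the abstract (pseudo-labelling by density peaks, then CHI scoring of terms against those labels) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a cutoff-count local density in the style of Rodriguez and Laio's density-peak clustering and the standard 2x2 chi-square statistic. The function names `density_peak_labels` and `chi_square`, the cutoff `d_c`, and the toy data are all hypothetical. The 2-D points merely stand in for factor-analysis scores, and the paper's weighting by average local density and information entropy, as well as the independent component analysis step, are not reproduced because the abstract does not define them.

```python
import math

def density_peak_labels(points, d_c, n_clusters):
    """Density-peak-style clustering; returns (labels, rho). Illustrative only."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Visit points from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # densest point: distance to farthest point
            continue
        # Nearest neighbour among the points visited earlier (denser or tied).
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j
    # Cluster centres: the points with the largest rho * delta.
    centres = sorted(order, key=lambda i: -(rho[i] * delta[i]))[:n_clusters]
    labels = [-1] * n
    for c, i in enumerate(centres):
        labels[i] = c
    # Every other point inherits the label of its denser nearest neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, rho

def chi_square(docs, labels, term, cluster):
    """2x2 chi-square statistic of `term` against membership in `cluster`."""
    a = b = c = d = 0
    for doc, lab in zip(docs, labels):
        has_term = term in doc
        if lab == cluster:
            a += has_term          # in cluster, contains term
            c += not has_term      # in cluster, lacks term
        else:
            b += has_term          # outside cluster, contains term
            d += not has_term      # outside cluster, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Toy unbalanced data: six documents in one dense group, two in a small one.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (0.05, 0.05), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
docs = [{"market", "stock"}, {"market", "bank"}, {"stock", "bank"},
        {"market", "stock"}, {"bank", "market"}, {"stock", "market"},
        {"gene", "protein"}, {"gene", "cell"}]

labels, rho = density_peak_labels(points, d_c=0.5, n_clusters=2)
gene_score = chi_square(docs, labels, "gene", cluster=1)
market_score = chi_square(docs, labels, "market", cluster=1)
print(labels, gene_score, market_score)
```

On this toy data the two isolated documents form their own cluster, and "gene" scores higher than "market" for that cluster. In a realistic setting the small cluster's terms can be swamped by the large cluster's statistics, which is the situation the density and entropy weighting described in the abstract is meant to address.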


Bibliographic details

Source: Technical Gazette, 2020, Issue 3, 11 pages
Authors: Jing-Ming Li; Jing-Tao Sun; Wen-Han Huang; Qiu-Yu Zhang; Zhen-Zhou Tian; Ning Lu
Original format: PDF
Keywords: CHI statistical selection method; density peaks; factor analysis; information entropy; independent component analysis; text feature gene...


