Empirical evaluation of ensemble feature subset selection methods for learning from a high-dimensional database in drug design

Abstract

Discovering a new drug is one of the most important goals not only in the pharmaceutical field but also in a variety of other fields, including molecular biology, chemistry, and medical science. The importance of computationally understanding the relationship between a given chemical compound and its drug activity has become increasingly clear. In a data set on the drug activity of chemical compounds, each row corresponds to a chemical compound, and the columns are the descriptors of the compound plus a label indicating its drug activity. Recently, the number of descriptors has grown in order to obtain more detailed information from a given set of compounds. Indeed, the number of columns (attributes or features) in some drug data sets reaches hundreds of thousands or even a million. The purpose of this paper is to empirically evaluate the performance of ensemble feature subset selection strategies by applying them to such a high-dimensional data set, one actually used in the process of drug design. We examined the performance of three ensemble methods, including a query learning based method, and compared it with that of one of the latest feature subset selection methods. The evaluation was performed on a data set containing approximately 140,000 features. Our results show that the query learning based methodology outperformed the other three methods in terms of both final prediction accuracy and time efficiency. We also examined the effect of noise in the data and found that the advantage of the method becomes more pronounced at larger noise levels.
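
The general idea of an ensemble over feature subsets can be illustrated with a minimal sketch. This is not the query learning based method evaluated in the paper, which the abstract does not specify. It is a plain random-subset ensemble with majority voting, run on a small synthetic table laid out as described above (rows as compounds, binary descriptor columns, one activity label). All function names, the data generator, and the parameter values (200 features, 10 informative descriptors, 10% noise, 25 members) are assumptions made for illustration only.

```python
import random


def make_data(n_rows, n_features, informative, noise, rng):
    """Synthetic stand-in for a compound table (hypothetical, not the paper's data).

    Each row is one compound with binary descriptors; the label marks activity.
    Informative descriptors copy the label, flipped with probability `noise`.
    """
    rows, labels = [], []
    for _ in range(n_rows):
        label = rng.randint(0, 1)
        row = [rng.randint(0, 1) for _ in range(n_features)]
        for j in informative:
            row[j] = label if rng.random() >= noise else 1 - label
        rows.append(row)
        labels.append(label)
    return rows, labels


def fit_stump(rows, labels, subset):
    """Pick the single descriptor in `subset` that best tracks the label."""
    best = None
    for j in subset:
        agree = sum(1 for r, y in zip(rows, labels) if r[j] == y) / len(rows)
        score = abs(agree - 0.5)  # distance from chance-level agreement
        if best is None or score > best[0]:
            best = (score, j, agree >= 0.5)
    _, feature, same_sign = best
    return feature, same_sign


def predict_stump(stump, row):
    feature, same_sign = stump
    return row[feature] if same_sign else 1 - row[feature]


def fit_ensemble(rows, labels, n_members, subset_size, rng):
    """Train one stump per randomly drawn feature subset."""
    n_features = len(rows[0])
    return [
        fit_stump(rows, labels, rng.sample(range(n_features), subset_size))
        for _ in range(n_members)
    ]


def predict_ensemble(ensemble, row):
    """Majority vote over the members' binary predictions."""
    votes = sum(predict_stump(s, row) for s in ensemble)
    return 1 if 2 * votes > len(ensemble) else 0


def accuracy(ensemble, rows, labels):
    hits = sum(1 for r, y in zip(rows, labels) if predict_ensemble(ensemble, r) == y)
    return hits / len(rows)


rng = random.Random(0)
informative = list(range(10))  # assumed: the first 10 descriptors carry signal
train_x, train_y = make_data(200, 200, informative, 0.1, rng)
test_x, test_y = make_data(200, 200, informative, 0.1, rng)
ensemble = fit_ensemble(train_x, train_y, n_members=25, subset_size=40, rng=rng)
test_accuracy = accuracy(ensemble, test_x, test_y)
print(f"test accuracy: {test_accuracy:.2f}")
```

Each member sees only 40 of the 200 columns, so training cost per member scales with the subset size, not the full feature count. This is one reason subset-based ensembles are attractive when the number of descriptors is very large. Raising the `noise` argument gives a rough way to explore how label-descriptor noise affects such an ensemble.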