Conference: International Conference on Systems Engineering

Experiments in text-based mining and analysis of biological information from MEDLINE on functionally-related genes


Abstract

Technological advancements such as microarrays have enabled biologists to generate unprecedented quantities of data about biological entities. This has led to the development of a large number of algorithms for processing and analyzing biological data. Challenges remain, however; for instance, genes that function cooperatively need not have similar expression patterns. This suggests using non-numerical sources of information to explore the underlying biology. We experimentally study various factors inherent in algorithmic methodologies for text analysis. The proposed method accesses MEDLINE dynamically to account for the latest research, and the available literature corresponding to the genes is analyzed to develop lists of keywords. Natural language processing (NLP) techniques such as stop-word filtering and stemming are then applied to the lists, and keyword frequencies are weighted using the term frequency-inverse document frequency (TFIDF) scheme. The results are input to a hierarchical clustering algorithm to derive groupings of genes by functionality. The process is repeated using z-score weighting and latent semantic analysis (LSA) to determine which yields the most accurate clustering. The study examines the importance of these steps and their influence on the overall efficacy of the system. We believe that the analysis conducted as part of this research is invaluable to the development and fine-tuning of text mining methodologies for biological literature.
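The TFIDF branch of the pipeline described in the abstract (stop-word filtering, stemming, TFIDF weighting, hierarchical clustering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the gene names and texts are made-up toy data standing in for MEDLINE abstracts, the stop-word list and `crude_stem` are simplified stand-ins for a real stop list and stemmer, the weight `tf * log(N / df)` is one common TFIDF variant, and average linkage over cosine similarity is an assumed choice of clustering criterion. The z-score and LSA variants are not shown.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; real lists are far longer).
STOP_WORDS = {"a", "and", "by", "for", "in", "is", "of", "the", "to"}


def crude_stem(word):
    """Rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keywords(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Map each gene to a sparse {term: tf * log(N / df)} vector."""
    counts = {gene: Counter(keywords(text)) for gene, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts.values() for term in c)
    return {
        gene: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in c.items()}
        for gene, c in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, k):
    """Average-linkage agglomerative clustering until k clusters remain."""
    clusters = [[gene] for gene in sorted(vectors)]

    def similarity(a, b):
        total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Merge the two most similar clusters.
        i, j = max(pairs, key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical toy "literature" per gene; not real MEDLINE content.
toy_docs = {
    "geneA": "DNA repair and damage response in the nucleus",
    "geneB": "repair of damaged DNA by nuclear proteins",
    "geneC": "glucose transport and insulin signaling",
    "geneD": "insulin regulates glucose uptake",
}

vectors = tfidf_vectors(toy_docs)
groups = cluster(vectors, 2)
print(groups)
```

On this toy input the two repair-related genes share the terms "dna" and "repair", and the two metabolism-related genes share "glucose" and "insulin", so they end up in separate groups. Stopping at a fixed `k` is only for brevity; a full hierarchical clustering would keep the whole merge tree.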
