Journal: Bioinformatics

Allerdictor: fast allergen prediction using text classification techniques

Abstract

Motivation: Accurately identifying and eliminating allergens from biotechnology-derived products are important for human health. From a biomedical research perspective, it is also important to identify allergens in sequenced genomes. Many allergen prediction tools have been developed in recent years. Although these tools have achieved certain levels of specificity, when applied to large-scale allergen discovery (e.g. at a whole-genome scale) they still yield many false positives and thus low precision, even at low recall, because the data are extremely skewed (allergens are rare). Moreover, the most accurate tools are relatively slow because they use protein sequence alignment to build feature vectors for allergen classifiers. Additionally, the current allergen prediction tools are publicly available only as web servers, which do not support large batch submissions. These weaknesses make large-scale allergen discovery ineffective and inefficient in the public domain.

Results: We developed Allerdictor, a fast and accurate sequence-based allergen prediction tool that models protein sequences as text documents and uses a support vector machine in text classification for allergen prediction. Test results on multiple highly skewed datasets demonstrated that Allerdictor predicted allergens with high precision over high recall at fast speed. For example, Allerdictor took only ~6 min on a single-core PC to scan the entire Swiss-Prot database of ~540 000 sequences and identified <1% of them as allergens.
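To make the idea of treating a protein sequence as a text document more concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not say how sequences are split into words, so the overlapping k-mer tokenization, the value of `k`, and the function names `sequence_to_document` and `bag_of_words` are all assumptions made for illustration. The resulting word counts are the kind of feature vector a text classifier such as a support vector machine could be trained on.

```python
from collections import Counter
from typing import List


def sequence_to_document(seq: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mer "words".

    Assumption: overlapping k-mers stand in for words; the abstract
    does not specify the tokenization actually used.
    """
    seq = seq.strip().upper()
    # A sequence shorter than k yields an empty document.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_words(seq: str, k: int = 3) -> Counter:
    """Count k-mer occurrences, giving a sparse bag-of-words vector."""
    return Counter(sequence_to_document(seq, k))


# Hypothetical toy sequences, not real allergen data.
words = sequence_to_document("MKTAY", k=3)
features = bag_of_words("MKMKMK", k=2)
```

Because this representation needs no sequence alignment, building the features is a single linear pass over each sequence, which is consistent with the speed argument made in the abstract.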
