Journal: BMC Genomics

Identifying the status of genetic lesions in cancer clinical trial documents using machine learning


Abstract

Background: Many cancer clinical trials now specify the particular status of a genetic lesion in a patient's tumor in the inclusion or exclusion criteria for trial enrollment. To facilitate search and identification of gene-associated clinical trials by potential participants and clinicians, it is important to develop automated methods to identify genetic information from narrative trial documents.

Methods: We developed a two-stage classification method to identify genes and genetic lesion statuses in clinical trial documents extracted from the National Cancer Institute's (NCI's) Physician Data Query (PDQ) cancer clinical trial database. The method consists of two steps: 1) to distinguish gene entities from non-gene entities such as English words; and 2) to determine whether and which genetic lesion status is associated with an identified gene entity. We developed and evaluated the performance of the method using a manually annotated data set containing 1,143 instances of the eight most frequently mentioned genes in cancer clinical trials. In addition, we applied the classifier to a real-world task of cancer trial annotation and evaluated its performance using a larger sample size (4,013 instances from 249 distinct human gene symbols detected from 250 trials).

Results: Our evaluation using a manually annotated data set showed that the two-stage classifier outperformed the single-stage classifier and achieved the best average accuracy of 83.7% for the eight most frequently mentioned genes when optimized feature sets were used. It also showed better generalizability when we applied the two-stage classifier trained on one set of genes to another independent gene. When a gene-neutral, two-stage classifier was applied to the real-world task of cancer trial annotation, it achieved a highest accuracy of 89.8%, demonstrating the feasibility of developing a gene-neutral classifier for this task.

Conclusions: We presented a machine learning-based approach to detect gene entities and the genetic lesion statuses from clinical trial documents and demonstrated its use in cancer trial annotation. Such methods would be valuable for building information retrieval tools targeting gene-associated clinical trials.
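
The two-step structure described in the Methods can be sketched in a few lines of Python. This is only an illustration of the cascade idea, not the authors' implementation: the abstract does not name the learning algorithm or the features, so the sketch assumes a small bag-of-words naive Bayes model for both stages. The class names `TinyNaiveBayes` and `TwoStageClassifier`, the status labels (`mutated`, `amplified`, `no-status`), and the training sentences are all invented for the example.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes over whitespace tokens, add-one smoothing.

    Assumption: the abstract does not say which learner was used; this is
    a stand-in chosen because it fits in a few lines.
    """

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)  # log prior
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class TwoStageClassifier:
    """Stage 1: gene vs. non-gene. Stage 2: lesion status of gene mentions."""

    NON_GENE = "non-gene"

    def __init__(self):
        self.stage1 = TinyNaiveBayes()
        self.stage2 = TinyNaiveBayes()

    def fit(self, contexts, labels):
        # Stage 1 sees every instance, collapsed to a binary label.
        binary = [self.NON_GENE if y == self.NON_GENE else "gene" for y in labels]
        self.stage1.fit(contexts, binary)
        # Stage 2 is trained only on instances that really are gene mentions.
        gene_rows = [(x, y) for x, y in zip(contexts, labels) if y != self.NON_GENE]
        self.stage2.fit([x for x, _ in gene_rows], [y for _, y in gene_rows])
        return self

    def predict(self, context):
        if self.stage1.predict(context) == self.NON_GENE:
            return self.NON_GENE
        return self.stage2.predict(context)


# Invented toy data: "met" and "kit" are both English words and gene symbols.
train_contexts = [
    "patients who met the eligibility criteria",
    "starter kit provided to participants",
    "tumors with met amplification",
    "kit mutation positive tumors",
    "met expression measured at baseline",
]
train_labels = ["non-gene", "non-gene", "amplified", "mutated", "no-status"]

model = TwoStageClassifier().fit(train_contexts, train_labels)
print(model.predict("tumors with kit amplification"))  # -> amplified
print(model.predict("patients who met the criteria"))  # -> non-gene
```

Separating the two decisions means the status model never has to spend capacity on ordinary English uses of a symbol. That is one plausible reading of why a cascade could generalize better across genes, though the abstract itself reports only the accuracy figures.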
