Conference: IEEE International Conference on Bioinformatics and Biomedicine

Uncovering Machine Learning-Ready Data from Public Clinical Trial Resources: A case-study on normalization across Aggregate Content of ClinicalTrials.gov


Abstract

The state of clinical data is a barrier to developing machine learning models that improve healthcare. Uncontrolled clinical free text is common in both patient and clinical trial data: the resulting spelling and grammatical errors, phrasing variation, and other variability make the data difficult to leverage. As part of our effort to harmonize the Aggregate Analysis of ClinicalTrials.gov (AACT) drop-withdrawal reasons to a controlled vocabulary, we explored two solutions. The first used Elastic's fuzzy matching capability to match entries in the AACT Drop-Withdrawal table to a list of user-specified terms (74.6% coverage). The second was a custom pipeline employing NLP preprocessing, Levenshtein distance (fuzzy matching), and semantic similarity mapping using a pre-trained FastText model (98% coverage). Although manual oversight is still required, the effort needed to harmonize with a controlled vocabulary is notably reduced. This work enables the rapid harmonization of clinical databases, allowing them to be leveraged for machine learning and analytics.
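The preprocessing and Levenshtein-matching stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the example vocabulary terms, and the 0.8 similarity threshold are all assumptions chosen for illustration, and the FastText semantic-similarity stage is omitted because it requires a pre-trained model.

```python
import re
from typing import Optional, Sequence


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_term(entry: str, vocabulary: Sequence[str],
               threshold: float = 0.8) -> Optional[str]:
    """Return the closest vocabulary term, or None if below the threshold.

    The threshold is an assumed value, not one reported in the paper.
    """
    cleaned = normalize(entry)
    best_term, best_score = None, 0.0
    for term in vocabulary:
        score = similarity(cleaned, normalize(term))
        if score > best_score:
            best_term, best_score = term, score
    return best_term if best_score >= threshold else None


# Hypothetical controlled vocabulary, for illustration only.
VOCABULARY = ["Adverse Event", "Lost to Follow-up", "Withdrawal by Subject"]

if __name__ == "__main__":
    for raw in ["Advrese event", "lost to follow up", "Sponsor decision"]:
        print(raw, "->", match_term(raw, VOCABULARY))
```

Entries that return `None` here are the ones a semantic-similarity stage or a manual reviewer would have to handle, which is consistent with the abstract's note that manual oversight is still required.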
