Identification of Malignancies from Free-Text Histopathology Reports Using a Multi-Model Supervised Machine Learning Approach

Abstract

We explored various Machine Learning (ML) models to evaluate how each performs in the task of classifying histopathology reports. We trained, optimized, and performed classification with Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN), Adaptive Boosting (AB), Decision Trees (DT), Gaussian Naïve Bayes (GNB), Logistic Regression (LR), and a Dummy classifier. We started with 60,083 histopathology reports, which were reduced to 60,069 after pre-processing. The F1-scores for SVM, SGD, KNN, RF, DT, LR, AB, and GNB were 97%, 96%, 96%, 96%, 92%, 96%, 84%, and 88%, respectively, while the misclassification rates were 3.31%, 5.25%, 4.39%, 1.75%, 3.5%, 4.26%, 23.9%, and 19.94%, respectively. The approximate run times were 2 h, 20 min, 40 min, 8 h, 40 min, 10 min, 50 min, and 4 min, respectively. RF had the longest run time but the lowest misclassification rate on the labeled data. Our study demonstrated that ML techniques can be applied to the processing of free-text pathology reports for cancer registries, in support of cancer incidence reporting in a Sub-Saharan Africa setting. This matters for resource-constrained environments, which can leverage ML techniques to reduce workloads and improve the timeliness of cancer statistics reporting.
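
As a rough illustration of the two metrics reported above, the following minimal sketch computes an F1-score and a misclassification rate from true and predicted labels, and compares a model against a majority-class baseline of the kind a Dummy classifier provides. The label names, the toy data, and the helper functions are hypothetical, and the sketch assumes a binary F1 for the "malignant" class; the abstract does not state how the study averaged its F1-scores.

```python
from collections import Counter


def misclassification_rate(y_true, y_pred):
    """Fraction of reports whose predicted label differs from the true label."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def f1_score(y_true, y_pred, positive="malignant"):
    """Binary F1 for the positive class (assumed here; averaging may differ)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def dummy_predict(y_train, n):
    """Majority-class baseline: always predict the most frequent training label."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return [most_common] * n


# Toy labels, invented for illustration only (not data from the study).
y_true = ["malignant", "benign", "malignant", "benign",
          "benign", "malignant", "benign", "benign"]
y_model = ["malignant", "benign", "benign", "benign",
           "benign", "malignant", "malignant", "benign"]
y_dummy = dummy_predict(y_true, len(y_true))

for name, y_pred in [("model", y_model), ("dummy", y_dummy)]:
    print(name,
          "F1 =", round(f1_score(y_true, y_pred), 3),
          "misclassification =", round(misclassification_rate(y_true, y_pred), 3))
```

On imbalanced labels a majority-class baseline can have a moderate misclassification rate and still score an F1 of zero for the minority class, which is one reason to report both metrics alongside a dummy baseline.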
