Machine learning classification of surgical pathology reports and chunk recognition for information extraction noise reduction

Abstract

Background and aims: Machine learning techniques for text mining of cancer-related clinical documents have not been sufficiently explored. This paper presents techniques for pre-processing free-text breast cancer pathology reports, with the aim of facilitating the extraction of information relevant to cancer staging.

Materials and methods: The first technique was implemented using the freely available software RapidMiner to classify reports by their general layout as either ‘semi-structured’ or ‘unstructured’. The second technique was developed using the open source language engineering framework GATE and aimed to predict which chunks of the report text contain information about the cancer morphology, the tumour size, its hormone receptor status and the number of positive nodes. The classifiers were trained and tested on sets of 635 and 163 reports respectively, manually classified or annotated, from the Northern Ireland Cancer Registry.

Results: For layout classification, the best result was 99.4% accuracy (only one semi-structured report was predicted as unstructured), obtained with the k-nearest neighbours algorithm using the binary term occurrence word vector type with stopword filter and pruning. For chunk recognition, the best results were found using the PAUM algorithm with the same parameters for all cases, except for the prediction of chunks containing cancer morphology. For semi-structured reports, precision ranged from 0.97 to 0.94 and recall from 0.92 to 0.83, while for unstructured reports precision ranged from 0.91 to 0.64 and recall from 0.68 to 0.41. Results were poor when the classifier was trained on semi-structured reports but tested on unstructured ones.

Conclusions: These results show that predicting the layout of reports is both possible and beneficial, and that the accuracy of predicting which segments of a report may contain certain information is sensitive to the report layout and the type of information sought.
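
As a rough illustration of the layout classification step described in the abstract, the following Python sketch builds binary term occurrence vectors with a stopword filter and simple pruning, then applies k-nearest-neighbour voting. It is not the authors' RapidMiner process: the stopword list, the minimum document frequency used for pruning, the Jaccard similarity measure, the value of k, and all example report texts are assumptions invented for this sketch.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; the paper's filter is not given here).
STOPWORDS = {"the", "a", "of", "and", "is", "in", "with", "to", "no"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def build_vocabulary(docs, min_df=2):
    """Pruning (assumed form): keep terms that occur in at least min_df documents."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term for term, n in df.items() if n >= min_df}

def binary_vector(text, vocab):
    """Binary term occurrence: the set of vocabulary terms present in the text."""
    return set(tokenize(text)) & vocab

def jaccard(a, b):
    """Similarity between two binary vectors represented as sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_predict(text, train, vocab, k=3):
    """Majority vote among the k training reports most similar to the query."""
    query = binary_vector(text, vocab)
    ranked = sorted(
        train,
        key=lambda item: jaccard(query, binary_vector(item[0], vocab)),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented toy reports, not data from the Northern Ireland Cancer Registry.
TRAIN = [
    ("Tumour size: 22 mm. Grade: 2. Nodes positive: 1. ER status: positive.",
     "semi-structured"),
    ("Tumour size: 15 mm. Grade: 3. Nodes positive: 0. ER status: negative.",
     "semi-structured"),
    ("Tumour size: 30 mm. Grade: 1. Nodes positive: 4. ER status: positive.",
     "semi-structured"),
    ("Sections show an invasive ductal carcinoma measuring approximately 22 mm; "
     "the tumour cells express oestrogen receptor.", "unstructured"),
    ("Sections show invasive lobular carcinoma; three lymph nodes examined "
     "show metastatic deposits.", "unstructured"),
    ("The specimen shows an invasive carcinoma measuring 18 mm; lymph nodes "
     "show no metastatic deposits.", "unstructured"),
]
VOCAB = build_vocabulary([text for text, _ in TRAIN])

if __name__ == "__main__":
    new_report = "Tumour size: 12 mm. Grade: 2. Nodes positive: 2. ER status: negative."
    print(knn_predict(new_report, TRAIN, VOCAB))
```

With a real corpus, the vocabulary would be built from the training reports only, and k, the pruning threshold and the similarity measure would need tuning on held-out data. This toy set says nothing about the accuracy figures reported above.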