Journal of Medical Internet Research

Application of Efficient Data Cleaning Using Text Clustering for Semistructured Medical Reports to Large-Scale Stool Examination Reports: Methodology Study



Abstract

Background: Since medical research based on big data has become more common, the community's interest in and effort to analyze large amounts of semistructured or unstructured text data, such as examination reports, have rapidly increased. However, such large-scale text data are often not readily usable for analysis owing to typographical errors, inconsistencies, and data entry problems. An efficient data cleaning process is therefore required to ensure the veracity of the data.

Objective: In this paper, we propose an efficient data cleaning process for large-scale medical text data that employs text clustering methods and a value-converting technique, and we evaluate its performance on medical examination text data.

Methods: The proposed data cleaning process consists of text clustering and value-merging. In the text clustering step, we suggest using key collision and nearest neighbor methods in a complementary manner; the words (called values) in each resulting cluster are expected to comprise one correct value and its erroneous representations. In the value-converting step, the wrong values in each identified cluster are converted into the correct value. We applied this process to 574,266 stool examination reports produced for parasite analysis at Samsung Medical Center from 1995 to 2015, and we compared its performance with that of data cleaning processes based on a single clustering method. We used OpenRefine 2.7, an open source application that provides various text clustering methods and an efficient user interface for value-converting with common-value suggestions.

Results: A total of 1,167,104 words in the stool examination reports were surveyed. During data cleaning, we identified 30 correct words and 45 patterns of typographical errors and duplicates. We observed high correction rates for words with typographical errors (98.61%) and for typographical error patterns (97.78%). The resulting data accuracy was nearly 100% based on the total word count.

Conclusions: Our data cleaning process, based on the combined use of key collision and nearest neighbor methods, cleans large-scale text data efficiently and thereby improves data accuracy.
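The cluster-then-merge pipeline described in the Methods can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `fingerprint` follows OpenRefine's documented key-collision fingerprint keying (lowercase, strip punctuation, sort and deduplicate tokens), while the nearest-neighbor pass is simplified to pairwise Levenshtein comparisons of cluster representatives rather than OpenRefine's blocked kNN; the parasite-name values in the usage example are hypothetical.

```python
import re
from collections import Counter, defaultdict

def fingerprint(value: str) -> str:
    """Key-collision keying (OpenRefine-style fingerprint): lowercase,
    strip punctuation, then sort and deduplicate the tokens."""
    tokens = re.sub(r"[^\w\s]", " ", value.lower()).split()
    return " ".join(sorted(set(tokens)))

def levenshtein(a: str, b: str) -> int:
    """Edit distance, used by the simplified nearest-neighbor pass."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cluster_values(values, radius=2):
    """Pass 1 (key collision): group values with identical fingerprints.
    Pass 2 (nearest neighbor, simplified): merge clusters whose
    representatives are within `radius` edits, catching typos that
    key collision alone misses."""
    by_key = defaultdict(list)
    for v in values:
        by_key[fingerprint(v)].append(v)
    merged = []
    for group in by_key.values():
        for cluster in merged:
            if levenshtein(group[0], cluster[0]) <= radius:
                cluster.extend(group)
                break
        else:
            merged.append(group)
    return merged

def clean(records):
    """Value-converting: rewrite every value in a cluster as the
    cluster's most frequent member (assumed to be the correct one)."""
    counts = Counter(records)
    mapping = {}
    for cluster in cluster_values(counts):
        canonical = max(cluster, key=lambda v: counts[v])
        mapping.update({v: canonical for v in cluster})
    return [mapping[r] for r in records]

# Hypothetical report values: one correct spelling plus two typo variants.
records = ["Ascaris lumbricoides"] * 3 + ["ascaris lumbricoides",
                                          "Ascaris lumbricodes"]
print(clean(records))  # all five values normalized to the frequent spelling
```

Choosing the most frequent cluster member as the canonical value mirrors OpenRefine's common-value suggestion; in the paper's workflow the final choice is confirmed by a human reviewer in the value-converting interface.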
