Journal: Clinical Medicine & Research

C-C4-01: Rapid Exploration of Large Clinical Text Corpora for Information Extraction Feasibility Studies

Abstract

Background/Aims: Large amounts of information are "buried" in unstructured clinical text such as chart notes, pathology reports, and radiology reports. Through electronic medical record systems, much of this text is available for computer-aided analysis. Determining the specific language used in clinical text to express content of interest is an important early step in text-mining efforts.

Methods: We copied approximately 29.2 million clinical documents from our Epic Clarity database and other data sources to a secure SQL Server 2008 database and added a full-text index to the textual content. Details will be discussed. To query and view clinical text, we developed a Clinical Text Explorer application using Microsoft Access. None of the clinical text documents are de-identified; IRB approval is required for use. Features include an intuitive interface for testing and refining search schemes; fast retrieval of chart documents containing specified text; review of either random or "best match" samples of documents to inform estimates of the sensitivity and specificity of a search; and highlighting of terms of interest to facilitate visual scanning.

Results: Clinical Text Explorer allows researchers to quickly and easily identify patients who could not have been identified reliably by searching only on structured data. The following example describes the iterative process of defining the best search terms to find records mentioning results of the Oncotype DX test for breast cancer. In less than an hour we determined that the search "oncotype dx" was too narrow, while "oncot*" was too broad (drawing in records with words like "oncotech"). The search "oncotyp*" was the most comprehensive without losing specificity. When limiting the search to test results, we found that additionally requiring "oncotyp*" to occur near "score" eliminated most irrelevant documents, while also requiring "recurrence" narrowed the results too far. This search returned substantially more records than were discovered by searching structured data from lab results alone.

Conclusions: The ease with which this application executes complex searches over large amounts of clinical text eliminates barriers to text exploration posed by conventional methods such as regular expressions in SAS, allowing the domain expert (epidemiologist, physician, chart abstractor, etc.) to directly evaluate the results and refine the search.
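
The iterative refinement described in the Results can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation: the abstract describes a SQL Server full-text index queried through Microsoft Access, whereas this sketch only imitates prefix terms ("oncotyp*") and proximity matching in memory. The four sample documents, the function names, and the 10-token proximity window are all assumptions made for illustration.

```python
import re

# Toy documents invented for illustration; they are not real clinical text.
DOCS = {
    1: "Oncotype DX recurrence score was 18.",
    2: "OncotypeDX sent; score pending.",
    3: "Sample shipped to Oncotech for chemosensitivity assay.",
    4: "Patient asked about Oncotype testing; will discuss at next visit.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_term(token, term):
    """Compare one token to one search term; a trailing '*' means prefix match."""
    if term.endswith("*"):
        return token.startswith(term[:-1])
    return token == term


def phrase_positions(tokens, phrase):
    """Return the start index of every place the phrase occurs in the tokens."""
    words = phrase.lower().split()
    n = len(words)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if all(matches_term(tokens[i + j], words[j]) for j in range(n))
    ]


def search(docs, phrase):
    """Return sorted ids of documents containing the phrase or prefix term."""
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if phrase_positions(tokenize(text), phrase)
    )


def search_near(docs, phrase_a, phrase_b, window=10):
    """Return sorted ids of documents where the two phrases start within
    `window` tokens of each other (the window size is an assumption)."""
    hits = []
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        pos_a = phrase_positions(tokens, phrase_a)
        pos_b = phrase_positions(tokens, phrase_b)
        if any(abs(a - b) <= window for a in pos_a for b in pos_b):
            hits.append(doc_id)
    return sorted(hits)


if __name__ == "__main__":
    for query in ("oncotype dx", "oncot*", "oncotyp*"):
        print(query, "->", search(DOCS, query))
    print("oncotyp* near score ->", search_near(DOCS, "oncotyp*", "score"))
```

On these toy documents the exact phrase misses the one-word spelling "OncotypeDX", the short prefix "oncot*" also pulls in "Oncotech", and the proximity condition drops the note that mentions the test without a score, mirroring the pattern the abstract reports.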
