Text Mining and Natural Language Processing Approaches for Automatic Categorization of Lay Requests to Web-Based Expert Forums

Wolfgang Himmel; Ulrich Reincke; Hans Wilhelm Michelmann

Abstract

Background: Both healthy and sick people increasingly use electronic media to obtain medical information and advice. For example, Internet users may send requests to Web-based expert forums, or so-called “ask the doctor” services.

Objective: To automatically classify lay requests to an Internet medical expert forum using a combination of different text-mining strategies.

Methods: We first manually classified a sample of 988 requests directed to an involuntary childlessness forum on the German website “Rund ums Baby” (“Everything about Babies”) into one or more of 38 categories belonging to two dimensions (“subject matter” and “expectations”). After creating start and synonym lists, we calculated the average Cramér’s V statistic for the association of each word with each category. We also used principal component analysis and singular value decomposition as further text-mining strategies. With these measures we trained regression models and, on the basis of the best regression models, determined for each request the probability of belonging to each of the 38 categories, with a cutoff of 50%. Recall and precision on a test sample were calculated as measures of quality for the automatic classification.

Results: According to the manual classification of 988 documents, 102 (10%) documents fell into the category “in vitro fertilization (IVF),” 81 (8%) into the category “ovulation,” 79 (8%) into “cycle,” and 57 (6%) into “semen analysis.” These were the four most frequent categories in the subject matter dimension (consisting of 32 categories). The expectation dimension comprised six categories; we classified 533 documents (54%) as “general information” and 351 (36%) as a wish for “treatment recommendations.” The generation of indicator variables based on the chi-square analysis and Cramér’s V proved to be the best approach for automatic classification in about half of the categories. In combination with the two other approaches, 100% precision and 100% recall were realized in 18 (47%) out of the 38 categories in the test sample. For 35 (92%) categories, precision and recall were better than 80%. For some categories, the input variables (ie, “words”) also included variables from other categories, most often with a negative sign. For example, absence of words predictive for “menstruation” was a strong indicator for the category “pregnancy test.”

Conclusions: Our approach suggests a way of automatically classifying and analyzing unstructured information in Internet expert forums. The technique can perform a preliminary categorization of new requests and help Internet medical experts to better handle the mass of information and to give professional feedback.
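
Illustrative sketch: the abstract names two quantities that are easy to state concretely, Cramér's V for the association between a word and a category, and precision and recall at a 50% probability cutoff. The Python below is a minimal sketch of both under stated assumptions and is not the authors' implementation; the abstract does not specify their software or exact formulas. The function names, the 2x2 word-by-category table layout, and the choice to treat a probability of exactly 0.5 as a positive prediction are all assumptions made for illustration.

```python
import math


def cramers_v(n11, n10, n01, n00):
    """Cramér's V for a 2x2 table of word presence vs. category membership.

    Assumed (hypothetical) layout of the counts:
      n11: requests that contain the word and belong to the category
      n10: requests that contain the word but do not belong to the category
      n01: requests that lack the word but belong to the category
      n00: requests that lack the word and do not belong to the category
    """
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    denom = row1 * row0 * col1 * col0
    if denom == 0:
        # A word or category that never (or always) occurs carries no signal.
        return 0.0
    # Pearson chi-square for a 2x2 table (no continuity correction).
    chi2 = n * (n11 * n00 - n10 * n01) ** 2 / denom
    # V = sqrt(chi2 / (n * (min(rows, cols) - 1))); the last factor is 1 here.
    return math.sqrt(chi2 / n)


def precision_recall(probabilities, labels, cutoff=0.5):
    """Precision and recall for one category at a probability cutoff.

    probabilities: model-estimated probability per request
    labels: manual classification per request (True = in the category)
    Assumption: a probability equal to the cutoff counts as a positive.
    """
    predicted = [p >= cutoff for p in probabilities]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and not y)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

In a pipeline of the kind the Methods describe, a function like `cramers_v` would be evaluated for each word and category pair to select indicator words, and `precision_recall` would be evaluated per category on held-out requests. Any counts or probabilities passed in are the caller's own data; none are taken from the study.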

Bibliographic details

Source
Journal of Medical Internet Research | 2009, Issue 3 | 13 pages
Authors
Wolfgang Himmel; Ulrich Reincke; Hans Wilhelm Michelmann
Original format
PDF
Chinese Library Classification
Medicine and health
