Statistical Analysis and Data Mining

Conducting sparse feature selection on arbitrarily long phrases in text corpora with a focus on interpretability

Abstract

We propose a general framework for topic-specific summarization of large text corpora, and illustrate how it can be used for analysis in two quite different contexts: an Occupational Safety and Health Administration (OSHA) database of fatality and catastrophe reports (to facilitate surveillance for patterns in circumstances leading to injury or death), and legal decisions on workers' compensation claims (to explore relevant case law). Our summarization framework, built on sparse classification methods, is a compromise between the simple word-frequency-based methods currently in wide use and more heavyweight, model-intensive methods such as latent Dirichlet allocation (LDA). For a particular topic of interest (e.g., mental health disability, or carbon monoxide exposure), we regress a labeling of documents onto the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found to be predictive is then harvested as the summary. Using a branch-and-bound approach, this method can incorporate phrases of arbitrary length, which allows for potentially rich summarization. We discuss how focusing on the purpose of the summaries can inform choices of tuning parameters and model constraints. We evaluate this tool by comparing the computational time and summary statistics of the resulting word lists to three other methods in the literature. We also present a new R package, textreg. Overall, we argue that sparse methods have much to offer in text analysis and are a branch of research that should be considered further in this context. © 2016 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2016
