Statistical Analysis and Data Mining

Conducting sparse feature selection on arbitrarily long phrases in text corpora with a focus on interpretability

Abstract

We propose a general framework for topic-specific summarization of large text corpora, and illustrate how it can be used for analysis in two quite different contexts: an Occupational Safety and Health Administration (OSHA) database of fatality and catastrophe reports (to facilitate surveillance for patterns in circumstances leading to injury or death), and legal decisions on workers' compensation claims (to explore relevant case law). Our summarization framework, built on sparse classification methods, is a compromise between the simple word-frequency-based methods currently in wide use and more heavyweight, model-intensive methods such as latent Dirichlet allocation (LDA). For a particular topic of interest (e.g., mental health disability, or carbon monoxide exposure), we regress a labeling of documents onto the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found to be predictive is then harvested as the summary. Using a branch-and-bound approach, this method can incorporate phrases of arbitrary length, which allows for potentially rich summarization. We discuss how focusing on the purpose of the summaries can inform choices of tuning parameters and model constraints. We evaluate this tool by comparing the computational time and summary statistics of the resulting word lists to three other methods in the literature. We also present a new R package, textreg. Overall, we argue that sparse methods have much to offer in text analysis and are a branch of research that should be considered further in this context. © 2016 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2016
