A methodology of machine learning in automated entity summarization.

Abstract

Conducting background research is a time-consuming yet important part of every research endeavor. It includes compiling relevant sources, reading those sources, and comprehending the information. The volume of this information grows rapidly in the current information age. Automated text summarization, among other techniques (e.g., search engines), helps improve the efficiency of data exploration by distilling the large amounts of information that are becoming prevalent.

This dissertation presents a methodology of automatic entity summarization for summarizing entity and topic interaction in large information stores. The methodology is broken into three steps: Reading, Assembly, and Interpretation. In the Reading step, the appropriate information sources are determined and, subsequently, the interrelated entities are extracted within each source. Four inputs are necessary in this step: a topic extraction algorithm, a named entity recognition algorithm, information sources, and property information for the entities. In the Assembly step, the relationships between entities across sources are represented through knowledge networks. A trimodal weighted co-occurrence hypergraph is presented and then projected into unimodal and bimodal graphs. Finally, in the Interpretation step, graph analytics are presented to summarize the graphs. A novel diversity heuristic is derived based on information entropy to compare information diversity in different streams of literature over time.

To test the methodology, three experiments were conducted. Data from the PubMed Central Open Access Subset, which consisted of 740,418 journal citations in 4,404 journals, were downloaded on July 14, 2014. The first experiment addressed the relationship between the size of the information network and the number of files input into the methodology. A power law relationship was found to exist, as shown in linguistic theory. The second experiment addressed the validity of the methodology in extracting meaningful connections and predicting the top chemicals using two gold standards. Results indicate that the methodology can be used to determine the top chemicals and that meaningful connections are those with the highest weight in the network. Finally, the diversity heuristic was used in the third experiment to empirically compare the diversity of information in a stream of articles relating to honeybee research to the diversity of information in a stream of articles relating to diabetes research. The existing heuristic was seen to provide quite noisy results when applied to information networks, and the new heuristic has better asymptotic properties. This research is among the first efforts towards building improved literature-based discovery algorithms that are capable of automating the hypothesis generation process in large literature sets.
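To make the Assembly and Interpretation steps more concrete, the following is a minimal Python sketch under stated assumptions; it is not the dissertation's implementation. It builds only a unimodal entity-entity co-occurrence graph (not the trimodal hypergraph), weights each edge by the number of documents in which both entities appear, and uses plain Shannon entropy over the edge weights as a stand-in for an entropy-based diversity measure. The function names and the toy entity lists are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def cooccurrence_weights(docs):
    """For each unordered entity pair, count the documents containing both."""
    weights = Counter()
    for entities in docs:
        # De-duplicate within a document so each document counts a pair once.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights


def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution implied by the counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts)


# Toy input (assumed): entities already extracted from three documents.
docs = [
    ["glucose", "insulin", "metformin"],
    ["glucose", "insulin"],
    ["insulin", "metformin"],
]

weights = cooccurrence_weights(docs)
# Highest-weight edges are the candidates for "meaningful connections".
ranked_edges = weights.most_common()
# Stand-in diversity score: entropy of the edge-weight distribution.
diversity = shannon_entropy(weights.values())
```

Ranking edges by weight mirrors the abstract's finding that the highest-weight connections are the meaningful ones. The entropy score here is only a generic baseline, since the abstract does not give the form of the novel heuristic.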

Bibliographic details

  • Author

    Chonde, Seifu

  • Author affiliation

    The Pennsylvania State University

  • Degree-granting institution: The Pennsylvania State University
  • Subjects: Industrial engineering; Information science; Information technology; Management
  • Degree: Ph.D.
  • Year: 2016
  • Pages: 201 p.
  • Total pages: 201
  • Original format: PDF
  • Language of text: English