Graph-based Algorithms for Keyphrase Extraction in Social Text.


Abstract

The sheer volume of text on the web mandates automated approaches for identifying keyphrases that distinguish documents and help recognize important topics within them. Automatic keyphrase extraction is accomplished by designing algorithms capable of quantifying saliency in text. Saliency scores of textual units have traditionally been measured with bag-of-words (BOW) approaches, which rank units without considering context. Such approaches have several limitations, for example with polysemy and synonymy, which are natural characteristics of text that are hard to detect without understanding the context. In contrast, graph-based approaches model the relations between textual units, which alleviates these problems. In this dissertation, we introduce a collection of novel approaches for graph-based keyphrase ranking.

First, we propose NE-Rank, a novel random walk extension for graph-based ranking that can leverage weights on both vertices and edges. The ranking algorithm combines additional ranking methods that enhance existing graph-based approaches. Specifically, we combine a discriminative ranking approach, such as tf-idf, with the co-occurrence ranking in graphs. Moreover, the ranking model uses social tags, such as Twitter hashtags, and explores boosting their weights for the task of keyphrase extraction in social microposts. Additionally, we propose a lexical graph expansion through social tags for keyphrase extraction. After modeling the textual content of microposts in a lexical graph, we expand the graph by finding more similar content linked by tags. We present a number of different approaches to lexical graph expansion through Twitter hashtags and show a significant improvement over using the textual content alone.

Second, we propose a new approach for measuring saliency in short documents. We model the textual units in a hypergraph, with words as vertices and short documents as hyperedges, and we study a high-order co-occurrence relation that goes beyond the pairwise relation in graphs. To that end, we propose a novel probabilistic random walk over hypergraphs that uses weights on both vertices and hyperedges to rank vertices. We compare the proposed random walk with other random walk approaches for hypergraphs and show the validity of the approach. Finally, we propose a complete ranking framework for extracting keyphrases from short documents using the proposed hypergraph random walk. The ranking takes into account temporal and social attributes that are important for a dynamic genre such as Twitter.
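
The abstract does not give the update rules for NE-Rank or for the hypergraph random walk, so the following is only a minimal illustrative sketch of the two general ideas, not the dissertation's actual algorithms. It assumes a PageRank-style walk in which vertex weights (for example tf-idf scores) act as the restart distribution and edge or hyperedge weights steer the transitions. The function names, parameters, the damping value 0.85, and the choice to spread hyperedge mass in proportion to vertex weight are all assumptions made for illustration. All weights are assumed to be positive.

```python
from collections import defaultdict


def node_edge_weighted_rank(edges, node_weight, damping=0.85, iters=100):
    """Random walk on an undirected word graph with vertex and edge weights.

    edges: dict mapping (u, v) -> co-occurrence weight (assumed format)
    node_weight: dict mapping word -> prior weight, e.g. a tf-idf score
    """
    # Build a symmetric weighted adjacency map.
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w
    nodes = sorted(adj)
    # Normalized vertex weights serve as the restart distribution.
    total = sum(node_weight[n] for n in nodes)
    prior = {n: node_weight[n] / total for n in nodes}
    out_sum = {n: sum(adj[n].values()) for n in nodes}
    score = dict(prior)
    for _ in range(iters):
        # Each neighbor u passes on its score in proportion to edge weight.
        score = {
            v: (1 - damping) * prior[v]
            + damping * sum(score[u] * adj[u][v] / out_sum[u] for u in adj[v])
            for v in nodes
        }
    return score


def hypergraph_rank(hyperedges, edge_weight, node_weight, damping=0.85, iters=100):
    """Random walk on a hypergraph: words are vertices, documents are hyperedges.

    hyperedges: dict mapping doc_id -> set of words (assumed format)
    edge_weight: dict mapping doc_id -> hyperedge weight
    node_weight: dict mapping word -> vertex weight
    """
    nodes = sorted({w for words in hyperedges.values() for w in words})
    total = sum(node_weight[n] for n in nodes)
    prior = {n: node_weight[n] / total for n in nodes}
    # incident[v] lists the hyperedges (documents) that contain word v.
    incident = defaultdict(list)
    for e, words in hyperedges.items():
        for w in words:
            incident[w].append(e)
    score = dict(prior)
    for _ in range(iters):
        # Step 1: each vertex picks an incident hyperedge by hyperedge weight.
        edge_mass = defaultdict(float)
        for v in nodes:
            z = sum(edge_weight[e] for e in incident[v])
            for e in incident[v]:
                edge_mass[e] += score[v] * edge_weight[e] / z
        # Step 2: each hyperedge picks a member vertex by vertex weight.
        new_score = {v: (1 - damping) * prior[v] for v in nodes}
        for e, words in hyperedges.items():
            z = sum(node_weight[w] for w in words)
            for w in words:
                new_score[w] += damping * edge_mass[e] * node_weight[w] / z
        score = new_score
    return score


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    graph_scores = node_edge_weighted_rank(
        {("graph", "rank"): 2.0, ("rank", "tweet"): 1.0, ("graph", "tweet"): 1.0},
        {"graph": 3.0, "rank": 1.0, "tweet": 1.0},
    )
    print(sorted(graph_scores, key=graph_scores.get, reverse=True))
```

In both functions the scores remain a probability distribution over words, so the highest-scoring vertices can be read off as candidate keywords. The hypergraph variant lets a whole short document contribute at once, which is one way to model co-occurrence beyond word pairs.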

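Bibliographic record

  • Author

    Al-Dhelaan, Mohammed.;

  • Author affiliation

    The George Washington University.;

  • Degree-granting institution The George Washington University.;
  • Subject Computer science.
  • Degree Ph.D.
  • Year 2014
  • Pages 154 p.
  • Total pages 154
  • Original format PDF
  • Language of text eng
  • Chinese Library Classification
  • Keywords

  • Date indexed 2022-08-17 11:53:28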
