
Advanced Information Retrieval within Blogosphere and Micro-Blogosphere.

Abstract

Social media, such as blogs and microblogs (e.g., Twitter), are gaining popularity in people's daily lives. The blogosphere and micro-blogosphere are growing rapidly in volume, which motivates applying information retrieval to social media to address users' information needs. However, simple ad-hoc information retrieval cannot meet those needs. An ad-hoc information retrieval system retrieves the documents that are relevant to queries, whereas users usually want social media documents, such as blog posts and tweets, that are not only relevant to queries but also satisfy some extra conditions. In this thesis, we study two such advanced information retrieval problems: faceted blog distillation over the blogosphere and real-time tweet ad-hoc retrieval over the micro-blogosphere.

Faceted blog distillation aims at retrieving blogs (i.e., RSS feeds) that are not only relevant to a given query but also satisfy a facet of interest. In this thesis, the facets under consideration are opinionated vs. factual, personal vs. official, and in-depth vs. shallow. Opinionated blogs provide blog posts that contain opinions relevant to queries, while factual blogs consist of blog posts that describe the topics of queries without opinionated content. The blog posts in personal blogs depict topics related to the personal experiences of bloggers, while those in official blogs serve the commercial purposes of bloggers. In-depth blogs have blog posts that provide deep analysis of the topics of interest, while the posts in shallow blogs simply mention the topics without analyzing the implications of the provided information. For the opinionated and factual facets, we propose a classifier using syntactic and semantic features to determine whether opinions are relevant to queries in the context of blog posts.
For the personal and official facets, we propose two categories of methods to identify personal and official posts. The first category is based on classification. We propose three classifiers built on different assumptions about two research issues. The first issue is whether a blog post's personal or official facet depends on the query topic it mentions. The second issue is whether the personal or official facet of a blog post depends on whether it exhibits the opinionated or factual facet, respectively. The second category is based on generative models. Specifically, we first propose a generative model that calculates the probability distributions of blog posts exhibiting topics and of blog posts exhibiting the personal and official facets. Observing that the posts from a feed are likely to exhibit the same facet, we improve the first model by proposing a second generative model that constrains the posts from a feed to have the same facet distribution in its generative process. For the in-depth and shallow facets, we propose to calculate the depth of a blog post's coverage of a given query from the occurrences, within the post, of concepts related to the query. We also discuss the relationships among the facets. For example, we validate that the personal or official facet of a post depends on its opinionated or factual facet, respectively. Experimental results on the TREC Blogs06 and TREC Blogs08 collections show that the proposed techniques are not only effective in finding facet-oriented blogs (or posts) but also significantly outperform the best known results reported over both collections.

Real-time tweet ad-hoc retrieval ranks the tweets relevant to a query in reverse-chronological order of their publishing times.
In the context of this problem, to respond to a query with a timestamp t, the retrieved tweets should satisfy the following three conditions: (a) they are relevant to the query, (b) they were published on or before time t, and (c) they are ranked in reverse-chronological order of their publishing times. In this thesis, we propose a two-phase approach: we retrieve tweets in an ad-hoc way during the first phase and then use the temporal information of queries and tweets to enhance retrieval effectiveness during the second phase. Tweets can be categorized into two types. One type consists of short messages that do not contain the URL of any web page. The other type contains at least one URL of a web page in addition to a short message. These two types of tweets have different structures. In the first phase, we propose a method that ranks tweets using a divide-and-conquer strategy to address this structural difference. Specifically, we first rank the two types of tweets separately, which produces two rankings, one for each type. Then we merge these two rankings into one. In the second phase, we first categorize queries into several types by exploring the temporal distributions of their top-retrieved tweets from the first phase; then we calculate the time-related relevance scores of tweets according to the classified query types; finally, we combine the time scores with the IR scores from the first phase to produce a ranking of tweets. Experimental results obtained with the TREC 2011 and TREC 2012 queries over the TREC Tweets2011 collection show that the proposed divide-and-conquer method of ranking tweets yields better retrieval effectiveness than ranking both types together, and that incorporating temporal information into the retrieval process yields further improvements. Our method also compares favorably with state-of-the-art methods in retrieval effectiveness.
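
The two-phase approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not say how the two rankings are merged or how the time and IR scores are combined, so the min-max normalization within each tweet type, the linear interpolation weight `alpha`, the cutoff `k`, and the `Tweet` fields (including the precomputed `ir_score` and `time_score`) are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tweet:
    tweet_id: str
    published: int     # publishing time as a Unix timestamp
    has_url: bool      # True if the tweet links to a web page
    ir_score: float    # hypothetical phase-one relevance score
    time_score: float  # hypothetical phase-two temporal score in [0, 1]


def normalize(scores: List[float]) -> List[float]:
    """Min-max normalize scores to [0, 1] (an assumed merging device)."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def retrieve(tweets: List[Tweet], query_time: int,
             alpha: float = 0.7, k: int = 30) -> List[Tweet]:
    # Condition (b): keep only tweets published on or before the query time.
    eligible = [t for t in tweets if t.published <= query_time]

    # Phase 1 (divide and conquer): handle the two tweet types separately,
    # then merge them by putting their scores on a common [0, 1] scale.
    scored = []
    for flag in (False, True):
        group = [t for t in eligible if t.has_url == flag]
        scored.extend(zip(group, normalize([t.ir_score for t in group])))

    # Phase 2: combine the IR score with the time score (assumed linear mix)
    # and keep the k highest-scoring tweets as the relevant set.
    combined = [(alpha * ir + (1 - alpha) * t.time_score, t) for t, ir in scored]
    combined.sort(key=lambda pair: pair[0], reverse=True)
    top = [t for _, t in combined[:k]]

    # Condition (c): present the selected tweets newest first.
    return sorted(top, key=lambda t: t.published, reverse=True)
```

Normalizing within each type before merging is one simple way to keep URL-bearing tweets, whose raw scores may sit on a different scale, from dominating the merged list; other merging schemes would fit the description equally well.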

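Bibliographic record

  • Author

    Jia, Lifeng

  • Author affiliation

    University of Illinois at Chicago

  • Degree-granting institution: University of Illinois at Chicago
  • Subject: Computer Science; Mass Communications
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 185 p.
  • Total pages: 185
  • Original format: PDF
  • Language of text: eng
  • Chinese Library Classification: Remote sensing technology
  • Keywords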
