Journal: Distributed and Parallel Databases

Scalable entity-based summarization of web search results using MapReduce


Abstract

Although Web search engines index and provide access to huge numbers of documents, user queries typically return only a linear list of hits. While this is often satisfactory for focused search, it does not support exploration or deeper analysis of the results. One way to provide advanced exploration facilities, exploiting the structured (and semantic) data available in Web search, is to enrich the search with entity mining over the full contents of the search results. Such services give users an initial overview of the information space and allow them to restrict it gradually until they locate the desired hits, even if those hits are ranked low. This is especially important in areas of professional search such as medical search and patent search. In this paper we consider a general scenario of providing such services as meta-services (that is, layered over systems that support keyword search) without a priori indexing of the underlying document collection(s). To make such services feasible for large amounts of data, we use the MapReduce distributed computation model on a Cloud infrastructure (Amazon EC2). Specifically, we show how the required computational tasks can be factorized and expressed as MapReduce functions. A key contribution of our work is a thorough evaluation of platform configuration and tuning, an aspect that is often disregarded or inadequately addressed in prior work but is crucial for the efficient utilization of resources. Finally, we report experimental results on the speedup achieved in various settings.
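The abstract does not spell out the MapReduce functions themselves. The following is only a minimal, single-process sketch of how entity mining over search hits could be split into a map step and a reduce step. The toy gazetteer lookup stands in for a real entity recognizer, and every name here (`GAZETTEER`, `map_fn`, `shuffle`, `reduce_fn`, `summarize`, `HITS`) is a hypothetical illustration, not the paper's implementation.

```python
from collections import defaultdict

# Hypothetical toy gazetteer (assumption): a real service would run a proper
# named-entity recognizer over the full text of each hit.
GAZETTEER = {"aspirin": "Drug", "ibuprofen": "Drug", "amazon": "Organization"}


def map_fn(doc_id, text):
    """Map: emit one ((entity, category), doc_id) pair per distinct entity in a hit."""
    tokens = {tok.strip(".,;:()").lower() for tok in text.split()}
    for tok in tokens:
        if tok in GAZETTEER:
            yield (tok, GAZETTEER[tok]), doc_id


def shuffle(pairs):
    """Group intermediate values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, doc_ids):
    """Reduce: collect the hits that mention one entity and count them."""
    entity, category = key
    docs = sorted(set(doc_ids))
    return {"entity": entity, "category": category, "count": len(docs), "docs": docs}


def summarize(hits):
    """Run map, shuffle, and reduce in-process; order entities by hit count."""
    pairs = (pair for doc_id, text in hits for pair in map_fn(doc_id, text))
    rows = [reduce_fn(key, values) for key, values in shuffle(pairs).items()]
    return sorted(rows, key=lambda r: (-r["count"], r["entity"]))


# Made-up hits (assumption), standing in for downloaded search results.
HITS = [
    (1, "Aspirin and ibuprofen are common painkillers."),
    (2, "Amazon sells aspirin online."),
    (3, "Nothing relevant here."),
]

if __name__ == "__main__":
    for row in summarize(HITS):
        print(row)
```

Each hit is processed independently in the map step, so the work can be spread across many nodes. The reduce step then builds the per-entity overview that lets a user narrow the result list down to the hits mentioning a chosen entity.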
