Journal: Journal of Web Semantics

Searching web data: An entity retrieval and high-performance indexing model


Abstract

More and more (semi-)structured information is becoming available on the web in the form of documents embedding metadata (e.g., RDF, RDFa, Microformats and others). There are already hundreds of millions of such documents accessible and their number is growing rapidly. This calls for large-scale systems providing effective means of searching and retrieving this semi-structured information, with the ultimate goal of making it exploitable by humans and machines alike. This article examines the shift from the traditional web document model to a web data object (entity) model and studies the challenges faced in implementing a scalable and high-performance system for searching semi-structured data objects over a large, heterogeneous and decentralised infrastructure. Towards this goal, we define an entity retrieval model, develop novel methodologies for supporting this model and show how to achieve a high-performance entity retrieval system. We introduce an indexing methodology for semi-structured data which offers a good compromise between query expressiveness, query processing and index maintenance compared to other approaches. We address high performance by optimising the index data structure using appropriate compression techniques. Finally, we demonstrate that the resulting system can index billions of data objects and provides keyword-based as well as more advanced search interfaces for retrieving relevant data objects in sub-second time. This work has been part of the Sindice search engine project at the Digital Enterprise Research Institute (DERI), NUI Galway. The Sindice system currently maintains more than 200 million pages downloaded from the web and is being used actively by many researchers within and outside of DERI.
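
The abstract does not say which index layout or compression scheme the article uses, so the following is only an illustrative sketch of the general idea: an inverted index whose posting lists hold entity identifiers, stored as delta gaps in variable-byte coding. Variable-byte coding is a well-known stand-in chosen here for illustration, not the paper's technique, and every name (`build_index`, `search`, `compress_postings`, the sample entities) is hypothetical.

```python
from collections import defaultdict


def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers (7 data bits per byte)."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits; high bit clear means "more follows"
            n >>= 7
        out.append(n | 0x80)      # last byte of a number has its high bit set
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:           # terminating byte: emit the finished number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


def compress_postings(entity_ids):
    """Sort and deduplicate IDs, store gaps between neighbours, then vbyte-encode."""
    ids = sorted(set(entity_ids))
    gaps = [b - a for a, b in zip([0] + ids, ids)]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the sorted entity IDs by summing the decoded gaps."""
    ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        ids.append(total)
    return ids


def build_index(entities):
    """entities maps an integer entity ID to its text; returns term -> compressed postings."""
    postings = defaultdict(list)
    for entity_id, text in entities.items():
        for term in set(text.lower().split()):
            postings[term].append(entity_id)
    return {term: compress_postings(ids) for term, ids in postings.items()}


def search(index, query):
    """Keyword search with AND semantics over entity IDs."""
    result = None
    for term in query.lower().split():
        ids = set(decompress_postings(index.get(term, b"")))
        result = ids if result is None else result & ids
    return sorted(result or [])


# Hypothetical sample data: entity ID -> text extracted from its description.
entities = {
    1: "Galway research institute",
    130: "semantic web search engine",
    131: "web data search",
}
index = build_index(entities)
matches = search(index, "web search")
```

Storing gaps rather than absolute identifiers keeps most numbers small, which is what lets a byte-aligned code spend a single byte on many postings; a production system would typically use block-based codecs and richer per-entity structure than this toy shows.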
