
Searching and ranking XML data in a distributed environment.


Abstract

Due to the increasing number of independent data providers on the web, a growing number of web applications require searching and querying data sources distributed at different locations over the internet. Since XML is rapidly gaining popularity as a universal format for data exchange and integration, locating and ranking distributed XML data on the web is gaining importance in the database community. Most existing XML indexing techniques combine structure indexes and inverted lists extracted from XML documents to fully evaluate a full-text query against these indexes and return the actual XML fragments of the query answer. In general, these approaches are well suited to a centralized data repository, since they perform costly containment joins over long inverted lists in order to evaluate full-text XML queries, which does not scale well to large distributed systems.

In this thesis, we present a novel framework for indexing, locating, and ranking schema-less XML documents based on concise summaries of their structural and textual content. Instead of indexing every element or term in a document, we extract a structural summary and a small number of data synopses from the document, which are indexed in a way suitable for query evaluation. The search query language used in our framework is XPath extended with full-text search. We introduce a novel data synopsis structure that summarizes the textual content of an XML document and correlates textual with positional information in a way that improves query precision. In addition, we present a two-phase containment filtering algorithm based on these synopses that speeds up the search process. To return a ranked list of answers, we integrate an effective aggregated document ranking scheme into the query evaluation, inspired by TF*IDF ranking and term proximity, to score documents and return a ranked list of document locations to the client. Finally, we extend our framework to structured peer-to-peer systems, routing a full-text XML query from peer to peer, collecting relevant documents along the way, and returning a list of document locations to the user. We conduct extensive experiments over XML benchmark data to demonstrate the advantages of our indexing scheme, the query precision improvement of our data synopses, the efficiency of the optimization algorithm, the effectiveness of our ranking scheme, and the scalability of our framework.

We expect that the framework developed in this thesis will serve as an infrastructure for collaborative work environments within public web communities that share data and resources. The best candidates to benefit from our framework are collaborative applications that host online repositories of data and operate on a very large scale. Good candidates also include applications that seek high system and data availability and scalability as the network grows. Finally, our framework can also benefit applications that require complex or hierarchical data (such as scientific data), schema flexibility, and complex querying capabilities, including full-text search and approximate matching.
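The abstract mentions a ranking scheme inspired by TF*IDF and term proximity but does not give its formula. As a rough illustration of that general idea only, the sketch below scores plain-text documents with a smoothed TF*IDF sum and multiplies it by a bonus that grows as the query terms appear closer together. Everything here is an assumption made for illustration: the function names (`rank`, `min_window`), the smoothing, the way the two signals are combined, the sample document locations, and the use of a plain keyword query instead of XPath with full-text predicates. It is not the thesis's actual scoring scheme.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def min_window(tokens, terms):
    """Width of the smallest token window containing every term, or None."""
    hits = [(i, t) for i, t in enumerate(tokens) if t in terms]
    seen = Counter()
    covered = 0
    left = 0
    best = None
    for pos, term in hits:
        if seen[term] == 0:
            covered += 1
        seen[term] += 1
        # Shrink the window from the left while it still covers all terms.
        while covered == len(terms):
            left_pos, left_term = hits[left]
            width = pos - left_pos + 1
            if best is None or width < best:
                best = width
            seen[left_term] -= 1
            if seen[left_term] == 0:
                covered -= 1
            left += 1
    return best


def rank(docs, query):
    """Return [(location, score)] sorted by descending score.

    docs maps a document location to its text content.
    Hypothetical combination: score = tfidf * (1 + len(terms) / window).
    """
    terms = set(tokenize(query))
    tokenized = {loc: tokenize(text) for loc, text in docs.items()}
    n = len(tokenized)

    # Document frequency of each query term.
    df = Counter()
    for toks in tokenized.values():
        for t in terms & set(toks):
            df[t] += 1

    results = []
    for loc, toks in tokenized.items():
        if not toks:
            continue
        tf = Counter(toks)
        tfidf = sum(
            (tf[t] / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t in terms
        )
        if tfidf == 0:
            continue  # no query term occurs in this document
        window = min_window(toks, terms)
        bonus = len(terms) / window if window else 0.0
        results.append((loc, tfidf * (1 + bonus)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = {
        "peer1/a.xml": "xml query ranking in distributed systems",
        "peer2/b.xml": "xml data with ranking far away from any query about xml",
        "peer3/c.xml": "relational databases",
    }
    for location, score in rank(sample, "xml ranking"):
        print(location, round(score, 4))
```

Like the framework described in the abstract, the sketch returns document locations and not document fragments, and a document in which the query terms sit close together is ranked above one that mentions them far apart.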

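Bibliographic record

  • Author

    He, Weimin

  • Author affiliation

    The University of Texas at Arlington

  • Degree-granting institution: The University of Texas at Arlington
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2008
  • Pages: 119 p.
  • Total pages: 119
  • Original format: PDF
  • Language of text: English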