首页> 外文期刊>IEEE Transactions on Knowledge and Data Engineering >Category-Based Infidelity Bounded Queries over Unstructured Data Streams
【24h】

Category-Based Infidelity Bounded Queries over Unstructured Data Streams

机译:非结构化数据流上基于类别的不忠有界查询

获取原文
获取原文并翻译 | 示例

摘要

We present the Caicos system that supports continuous infidelity bounded queries over a data stream, where each data item (of the stream) belongs to multiple categories. Caicos is made up of four primitives: Keywords, Queries, Data items, and Categories. A Category is a virtual entity consisting of all those data items that belong to it. The membership of a data item to a category is decided by evaluating a Boolean predicate (associated with each category) over the data item. Each data item and query in turn are associated with multiple keywords. Given a keyword query, unlike conventional unstructured data querying techniques that return the top-$(K)$ documents, Caicos returns the top-$(K)$ categories with infidelity less than the user specified infidelity bound. Caicos is designed to continuously track the evolving information present in a highly dynamic data stream. It, hence, computes the relevance of a category to the continuous keyword query using the data items occurring in the stream in the recent past (i.e., within the current "window"). To efficiently provide up-to-date answers to the continuous queries, Caicos needs to maintain the required metadata accurately. This requires addressing two subproblems: 1) Identifying the "right" metadata that needs to be updated for providing accurate results and 2) updating the metadata in an efficient manner. We show that the problem of identifying the right metadata can be further broken down into two subparts. We model the first subpart as an inequality constrained minimization problem and propose an innovative iterative algorithm for the same. The second subpart requires us to build an efficient dynamic programming-based algorithm, which helps us to find the right metadata that needs to be updated. Updating the metadata on multiple processors is a scheduling problem whose complexity is exponential in the length of the input. An approximate multiprocessor scheduling algorithm is, hence,- proposed. Experimental evaluation of Caicos using real-world dynamic data shows that Caicos is able to provide fidelity close to 100 percent using 45 percent less resources than the techniques proposed in the literature. This ability of Caicos to work accurately and efficiently even in scenarios with high data arrival rates makes it suitable for data intensive application domains.
机译:我们提出了Caicos系统,该系统支持对数据流进行连续的无限保真度有界查询,其中(该流的)每个数据项都属于多个类别。 Caicos由四个基元组成:关键字,查询,数据项和类别。类别是一个虚拟实体,由属于该类别的所有那些数据项组成。通过评估数据项上的布尔谓词(与每个类别相关联)来确定数据项对类别的隶属度。每个数据项和查询又与多个关键字关联。给定一个关键字查询,与传统的非结构化数据查询技术返回前$(K)$个文档不同,Caicos返回的前$(K)$个类别的保真度小于用户指定的保真度界限。 Caicos旨在连续跟踪高度动态数据流中存在的不断发展的信息。因此,它使用最近过去(即,在当前“窗口”内)在流中出现的数据项来计算类别与连续关键词查询的相关性。为了有效地提供对连续查询的最新答案,Caicos需要准确地维护所需的元数据。这需要解决两个子问题:1)标识需要更新以提供准确结果的“正确”元数据,以及2)以有效方式更新元数据。我们表明,识别正确的元数据的问题可以进一步细分为两个子部分。我们将第一部分建模为不等式约束的最小化问题,并为此提出了一种创新的迭代算法。第二部分要求我们构建一个有效的基于动态编程的算法,该算法可以帮助我们找到需要更新的正确元数据。在多个处理器上更新元数据是一个调度问题,其复杂度在输入长度上呈指数增长。因此,提出了一种近似的多处理器调度算法。使用真实世界动态数据对Caicos进行的实验评估表明,与文献中提出的技术相比,Caicos能够使用少45%的资源提供接近100%的保真度。 Caicos的这种能力即使在数据到达率较高的情况下也可以准确有效地工作,使其适合于数据密集型应用程序域。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号