METHODS AND SYSTEMS TO INCREMENTALLY COMPUTE SIMILARITY OF DATA SOURCES

Abstract

Methods and systems are disclosed for efficiently determining a similarity between two or more datasets. In one embodiment, the similarity is determined by comparing a subset of sorted frequency-weighted blocks from one dataset to a subset of sorted frequency-weighted blocks from another dataset. Data blocks of a dataset are converted into hash values that are frequency-weighted. These frequency-weighted hash values can be compared to the frequency-weighted hash values of another dataset to determine the similarity of the two datasets. In another embodiment, upon a change of a block in a subset of the dataset, the similarity value is re-determined without re-sorting or re-hashing any blocks of the dataset other than the blocks of that subset, which improves the performance of the similarity comparison. In another embodiment, blocks of a dataset are excluded based on a block-filtering rule to increase the accuracy of the similarity comparison.
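
The three embodiments in the abstract can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the patented method: the block size, the SHA-1 hash, the all-zero filtering rule, the weighted-Jaccard score, and every name (`BLOCK_SIZE`, `is_filtered`, `block_hashes`, `signature`, `similarity`, `Dataset`) are assumptions made for illustration. The sketch rehashes only the changed subset on an update, but it still re-ranks the aggregated counts when a signature is requested; the abstract does not say how the source avoids re-sorting.

```python
import hashlib
from collections import Counter

BLOCK_SIZE = 4  # hypothetical block size in bytes, kept tiny for the demo


def is_filtered(block: bytes) -> bool:
    """Hypothetical block-filtering rule: drop blocks made only of zero bytes."""
    return block.count(0) == len(block)


def block_hashes(data: bytes) -> Counter:
    """Hash every non-filtered block and count how often each hash occurs."""
    counts = Counter()
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        if not is_filtered(block):
            counts[hashlib.sha1(block).hexdigest()] += 1
    return counts


def signature(counts: Counter, k: int) -> dict:
    """Keep the k most frequent hashes; ties are broken by hash value."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def similarity(sig_a: dict, sig_b: dict) -> float:
    """Weighted Jaccard: sum of min frequencies over sum of max frequencies."""
    keys = set(sig_a) | set(sig_b)
    if not keys:
        return 1.0
    shared = sum(min(sig_a.get(h, 0), sig_b.get(h, 0)) for h in keys)
    total = sum(max(sig_a.get(h, 0), sig_b.get(h, 0)) for h in keys)
    return shared / total


class Dataset:
    """Dataset split into subsets; each subset keeps its own hash counts."""

    def __init__(self, subsets):
        self.subset_counts = [block_hashes(s) for s in subsets]
        self.total = Counter()
        for c in self.subset_counts:
            self.total.update(c)

    def update_subset(self, index: int, new_data: bytes) -> None:
        """Rehash only the changed subset and patch the dataset-wide counts."""
        self.total.subtract(self.subset_counts[index])
        self.subset_counts[index] = block_hashes(new_data)
        self.total.update(self.subset_counts[index])
        self.total = +self.total  # drop hashes whose count fell to zero


if __name__ == "__main__":
    a = Dataset([b"AAAABBBB", b"CCCCAAAA"])
    b = Dataset([b"AAAABBBB", b"CCCCDDDD"])
    print(similarity(signature(a.total, 4), signature(b.total, 4)))  # 0.6
    b.update_subset(1, b"CCCCAAAA")  # only subset 1 is rehashed
    print(similarity(signature(a.total, 4), signature(b.total, 4)))  # 1.0
```

Keeping a separate counter per subset is what lets `update_subset` touch only the blocks that changed: the old subset counts are subtracted from the dataset-wide totals and the freshly hashed counts are added back.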

Bibliographic data

Publication number: EP2652649A4

Patent type:
Publication date: 2015-10-07

Original format: PDF
Applicant/Assignee: NETAPP INC.

Application number: EP20110848750
Inventors: DIXIT SAGAR; GAONKAR SHRAVAN

Filing date: 2011-12-19
Classification: G06F17/40; G06F12/00; G06F17/30
Country: EP
Date added to database: 2022-08-21 15:06:36
