METHODS AND SYSTEMS TO INCREMENTALLY COMPUTE SIMILARITY OF DATA SOURCES
Abstract
Methods and systems for efficiently determining a similarity between two or more datasets. In one embodiment, the similarity is determined by comparing a subset of sorted, frequency-weighted blocks from one dataset to a subset of sorted, frequency-weighted blocks from another dataset. Data blocks of a dataset are converted into hash values that are frequency-weighted. These frequency-weighted hash values can be compared to the frequency-weighted hash values of another dataset to determine the similarity of the two datasets. In another embodiment, upon a change of a block in a subset of the dataset, the similarity value is re-determined without re-sorting or re-hashing any blocks of the dataset other than the blocks of that subset, which improves the performance of the similarity comparison. In another embodiment, blocks of a dataset are excluded based on a block-filtering rule to increase the accuracy of the similarity comparison.
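The abstract does not fix a block size, hash function, weighting scheme, or similarity formula, so the following Python sketch is only one possible reading of it, not the claimed method. It assumes fixed-size blocks, SHA-1 hashes, occurrence counts as the frequency weights, and a weighted Jaccard ratio as the similarity value. All function names and parameters (count_blocks, top_k, similarity, replace_block, block_size, skip) are hypothetical and chosen for illustration.

```python
import hashlib
from collections import Counter


def hash_block(block: bytes) -> str:
    """Hash one data block (SHA-1 is an assumption, not from the abstract)."""
    return hashlib.sha1(block).hexdigest()


def count_blocks(data: bytes, block_size: int = 4, skip=lambda block: False) -> Counter:
    """Split data into fixed-size blocks and count each block hash.

    `skip` stands in for a block-filtering rule: blocks for which it
    returns True are excluded from the counts.
    """
    counts = Counter()
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        if not skip(block):
            counts[hash_block(block)] += 1
    return counts


def top_k(counts: Counter, k: int = 8) -> dict:
    """Keep the k most frequent hashes, sorted by count, ties broken by hash."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def similarity(sig_a: dict, sig_b: dict) -> float:
    """Weighted Jaccard: sum of min counts divided by sum of max counts."""
    keys = set(sig_a) | set(sig_b)
    if not keys:
        return 1.0
    shared = sum(min(sig_a.get(h, 0), sig_b.get(h, 0)) for h in keys)
    total = sum(max(sig_a.get(h, 0), sig_b.get(h, 0)) for h in keys)
    return shared / total


def replace_block(counts: Counter, old_block: bytes, new_block: bytes) -> None:
    """Update counts after one block changes; only the two affected
    blocks are hashed, and no other block of the dataset is touched."""
    old_hash = hash_block(old_block)
    counts[old_hash] -= 1
    if counts[old_hash] <= 0:
        del counts[old_hash]
    counts[hash_block(new_block)] += 1


a = count_blocks(b"abcdabcdabcdwxyz")   # "abcd" x3, "wxyz" x1
b = count_blocks(b"abcdabcdwxyzwxyz")   # "abcd" x2, "wxyz" x2
score = similarity(top_k(a), top_k(b))  # (2 + 1) / (3 + 2) = 0.6
```

In this sketch, a changed block is handled by calling replace_block on the existing counts and then recomputing the score. The sketch still re-ranks the hash counts inside top_k; a fuller implementation might keep that ordering in a structure that is updated in place.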