Automatic indexing of relevant domains in a data lake for data discovery and integration

Abstract

Techniques are provided for data discovery and data integration in a data lake. One method comprises obtaining data files from a data lake, wherein each data file comprises multiple records having multiple fields; selecting multiple candidate fields from a data file based on a record type; determining a relevance score for each candidate field from the data file based on multiple features extracted from the data file; and clustering the scored candidate fields into clusters of similar domains using a hashing algorithm, wherein a given cluster comprises candidate fields, wherein multiple data files can be integrated based on a domain of the candidate fields in the given cluster. The relevance score for each candidate field is based on multiple features comprising, for example, features that take into account a morphological or semantic similarity between file name, file metadata and/or file records and features that consider statistics of candidate fields in a data file.
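The pipeline in the abstract (score candidate fields, then cluster them by domain with a hashing algorithm) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the hashing algorithm or the exact features, so this sketch swaps in MinHash signatures with banded locality-sensitive hashing, and a toy score that averages one name-similarity feature with two value statistics. The function names relevance_score, minhash_signature and cluster_fields are hypothetical and do not come from the patent.

```python
import hashlib
import re
from collections import defaultdict


def tokens(text):
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance_score(field_name, values, file_name):
    """Toy score (assumption): the mean of three features in [0, 1]."""
    name_tokens = tokens(field_name)
    # Morphological feature: share of field-name tokens found in the file name.
    overlap = len(name_tokens & tokens(file_name)) / max(len(name_tokens), 1)
    # Statistical features: how filled and how distinct the field values are.
    non_empty = [v for v in values if v not in ("", None)]
    fill_ratio = len(non_empty) / max(len(values), 1)
    distinct_ratio = len(set(non_empty)) / max(len(non_empty), 1)
    return (overlap + fill_ratio + distinct_ratio) / 3.0


def minhash_signature(values, num_hashes=32):
    """MinHash signature of a non-empty set of strings (assumed algorithm)."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{v}".encode(), digest_size=8).digest(),
                "big",
            )
            for v in values
        ))
    return tuple(signature)


def cluster_fields(fields, num_hashes=32, rows_per_band=4):
    """Group field ids whose value sets collide in at least one LSH band.

    fields maps a field id to an iterable of values; returns a list of sets.
    """
    parent = {fid: fid for fid in fields}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    buckets = defaultdict(list)
    for fid, values in fields.items():
        value_set = {str(v) for v in values}
        if not value_set:
            continue  # an empty field stays in its own cluster
        sig = minhash_signature(value_set, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            buckets[(start, sig[start:start + rows_per_band])].append(fid)

    # Merge every pair of fields that share a bucket.
    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    clusters = defaultdict(set)
    for fid in fields:
        clusters[find(fid)].add(fid)
    return list(clusters.values())
```

In this sketch, two fields from different files that land in the same cluster (for example a country column in one file and a shipping-country column in another) are treated as sharing a domain, which makes them candidate join keys for integrating those files. Raising rows_per_band makes the grouping stricter, and lowering it makes it more permissive.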

Bibliographic data

Publication number: US11120031B2

Patent type:
Publication date: 2021-09-14

Original format: PDF
Applicant/Assignee: EMC IP HOLDING COMPANY LLC

Application number: US201916669678
Inventors: ADRIANA BECHARA PRADO; VITOR SILVA SOUSA; MARCIA LUCAS PESCE; PAULO DE FIGUEIREDO PIRES; FÁBIO ANDRÉ MACHADO PORTO; ALTOBELLI DE BRITO MANTUAN; RODOLPHO ROSA DA SILVA; WAGNER DOS SANTOS VIEIRA

Filing date: 2019-10-31
Classification codes: G06F17; G06F16/2458; G06N20; G06F16/28; G06F16/22
Country: US
Date added to database: 2022-08-24 21:01:26
