Automatic indexing of relevant domains in a data lake for data discovery and integration

Abstract

Techniques are provided for data discovery and data integration in a data lake. One method comprises: obtaining data files from a data lake, wherein each data file comprises multiple records having multiple fields; selecting multiple candidate fields from a data file based on a record type; determining a relevance score for each candidate field based on multiple features extracted from the data file; and clustering the scored candidate fields into clusters of similar domains using a hashing algorithm. Multiple data files can then be integrated based on the domain of the candidate fields in a given cluster. The relevance score for each candidate field is based on multiple features comprising, for example, features that take into account a morphological or semantic similarity among the file name, file metadata, and/or file records, and features that consider statistics of the candidate fields in a data file.
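The steps named in the abstract (candidate selection, relevance scoring, hash-based clustering) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not the patented method: the abstract does not name a hashing algorithm, so MinHash with banded bucketing is swapped in; "record type" is approximated by the Python type of the values; and the two scoring features (name similarity via `difflib.SequenceMatcher`, distinct-value ratio) are hypothetical stand-ins for the morphological and statistical features. The function names, the `min_score` threshold, and the band sizes are likewise invented.

```python
import hashlib
from collections import defaultdict
from difflib import SequenceMatcher


def candidate_fields(records, wanted_type=str):
    """Keep fields whose values all have the wanted type (stand-in for 'record type')."""
    names = records[0].keys()
    return [n for n in names if all(isinstance(r[n], wanted_type) for r in records)]


def relevance_score(field, records, file_name):
    """Average of two illustrative features: name similarity and distinct-value ratio."""
    name_sim = SequenceMatcher(None, field.lower(), file_name.lower()).ratio()
    values = [r[field] for r in records]
    distinct_ratio = len(set(values)) / len(values)
    return (name_sim + distinct_ratio) / 2


def minhash(values, num_hashes=16):
    """MinHash signature of a value set, using seed-salted MD5 as the hash family."""
    unique = set(values)
    return tuple(
        min(int(hashlib.md5(f"{seed}:{v}".encode()).hexdigest(), 16) for v in unique)
        for seed in range(num_hashes)
    )


def cluster_fields(files, min_score=0.3, bands=8, num_hashes=16):
    """Group scored candidate fields whose signatures collide in at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for file_name, records in files.items():
        for field in candidate_fields(records):
            if relevance_score(field, records, file_name) < min_score:
                continue  # drop fields judged irrelevant
            sig = minhash([r[field] for r in records], num_hashes)
            for b in range(bands):
                key = (b, sig[b * rows:(b + 1) * rows])
                buckets[key].add((file_name, field))
    # A cluster is any bucket shared by two or more (file, field) pairs.
    return {frozenset(members) for members in buckets.values() if len(members) > 1}
```

Called with a mapping from file name to a list of record dictionaries, `cluster_fields` returns sets of `(file, field)` pairs whose values overlap heavily; fields in the same set are candidates for a shared domain, and so for joining the files that contain them. The sketch does not merge overlapping buckets transitively, which a fuller implementation would likely need.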
