Conference: Knowledge-Based Systems for Safety Critical Applications

Text joins for data cleansing and integration in an RDBMS

Abstract

An organization's data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching is effectively performed via the cosine similarity metric from the information retrieval field. For robustness and scalability, these "text joins" are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. We propose an approximate, sampling-based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.
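
To make the idea of a text join concrete, the following is a minimal sketch of an exact cosine-similarity join run as plain SQL in an unmodified database engine. It is an illustration under stated assumptions, not the authors' implementation: the q-gram tokenization, the `log(1 + n/df)` IDF formula, the shared weighting corpus, the table names `r1_weights` and `r2_weights`, and the helper names `qgrams`, `tfidf_vectors`, and `text_join` are all hypothetical choices, and SQLite stands in for the RDBMS.

```python
import math
import sqlite3
from collections import Counter


def qgrams(text, q=3):
    """Split a lowercased, '#'-padded string into overlapping q-grams.

    The tokenization scheme is an assumption made for this sketch.
    """
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def tfidf_vectors(strings, q=3):
    """Return one unit-length {token: weight} dict per input string."""
    token_counts = [Counter(qgrams(s, q)) for s in strings]
    # Document frequency: number of strings that contain each token.
    doc_freq = Counter(tok for counts in token_counts for tok in counts)
    n = len(strings)
    vectors = []
    for counts in token_counts:
        raw = {t: tf * math.log(1 + n / doc_freq[t]) for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


def text_join(r1, r2, threshold=0.5, q=3):
    """Return (s1, s2, similarity) for pairs with cosine similarity >= threshold.

    Weights are computed over both relations together (an assumption), then
    stored in two tables so that the join itself is a single SQL query.
    """
    vectors = tfidf_vectors(list(r1) + list(r2), q)
    v1, v2 = vectors[:len(r1)], vectors[len(r1):]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE r1_weights (tid INTEGER, token TEXT, weight REAL)")
    conn.execute("CREATE TABLE r2_weights (tid INTEGER, token TEXT, weight REAL)")
    for table, vecs in (("r1_weights", v1), ("r2_weights", v2)):
        rows = [(tid, t, w) for tid, vec in enumerate(vecs) for t, w in vec.items()]
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)

    # Because every vector has unit length, the sum of weight products over
    # shared tokens is exactly the cosine similarity of the two strings.
    query = """
        SELECT a.tid, b.tid, SUM(a.weight * b.weight) AS sim
        FROM r1_weights AS a JOIN r2_weights AS b ON a.token = b.token
        GROUP BY a.tid, b.tid
        HAVING SUM(a.weight * b.weight) >= ?
        ORDER BY sim DESC
    """
    result = [(r1[i], r2[j], sim) for i, j, sim in conn.execute(query, (threshold,))]
    conn.close()
    return result


if __name__ == "__main__":
    left = ["AT&T Corporation", "IBM Research"]
    right = ["AT&T Corp.", "Intl. Business Machines", "IBM Research"]
    for s1, s2, sim in text_join(left, right, threshold=0.3):
        print(f"{s1!r} ~ {s2!r}: {sim:.3f}")
```

The join on `token` touches every pair of strings that shares at least one token, which suggests why an exact answer can become expensive on large relations. The approximate, sampling-based strategy that the abstract proposes is not sketched here, since the abstract gives no details of it.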