Text joins for data cleansing and integration in an RDBMS


Abstract

An organization's data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching is effectively performed via the cosine similarity metric from the information retrieval field. For robustness and scalability, these "text joins" are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. We propose an approximate, sampling-based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.
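As a rough illustration of what an exact text join under cosine similarity can look like when the join itself runs as SQL, here is a minimal sketch. It is not the sampling-based strategy the abstract proposes, and several details are assumptions made for the example: whitespace-separated word tokens, a tf-idf weighting with idf computed over both relations together, SQLite (via Python's `sqlite3` module) standing in for the RDBMS, and the hypothetical names `r1_weights`, `r2_weights`, `tfidf_rows`, and `text_join`.

```python
import math
import sqlite3
from collections import Counter


def tokenize(s):
    # Assumption: lowercase, whitespace-separated word tokens.
    return s.lower().split()


def tfidf_rows(strings, idf):
    """Yield (row_id, token, weight) with each string's vector scaled to unit length."""
    for row_id, s in enumerate(strings):
        tf = Counter(tokenize(s))
        raw = {tok: count * idf[tok] for tok, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        for tok, w in raw.items():
            yield row_id, tok, w / norm


def text_join(r1, r2, threshold):
    """Return (i, j, similarity) for pairs whose cosine similarity >= threshold."""
    all_strings = list(r1) + list(r2)
    n = len(all_strings)
    # Document frequency over both relations; idf is always positive here.
    df = Counter(tok for s in all_strings for tok in set(tokenize(s)))
    idf = {tok: math.log(1 + n / count) for tok, count in df.items()}

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE r1_weights (id INTEGER, token TEXT, weight REAL)")
    conn.execute("CREATE TABLE r2_weights (id INTEGER, token TEXT, weight REAL)")
    conn.executemany("INSERT INTO r1_weights VALUES (?, ?, ?)", tfidf_rows(r1, idf))
    conn.executemany("INSERT INTO r2_weights VALUES (?, ?, ?)", tfidf_rows(r2, idf))

    # With unit-length vectors, cosine similarity is the sum of weight
    # products over the tokens the two strings share.
    rows = conn.execute(
        """
        SELECT a.id, b.id, SUM(a.weight * b.weight) AS sim
        FROM r1_weights AS a
        JOIN r2_weights AS b ON a.token = b.token
        GROUP BY a.id, b.id
        HAVING SUM(a.weight * b.weight) >= ?
        ORDER BY a.id, b.id
        """,
        (threshold,),
    ).fetchall()
    conn.close()
    return rows


names_a = ["AT&T Labs Research", "Columbia University"]
names_b = ["AT&T Labs", "Columbia Univ.", "IBM Research"]
matches = text_join(names_a, names_b, threshold=0.5)
```

The join on `token` touches every pair of strings that share at least one token, which hints at why the abstract calls the exact computation expensive on large relations.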


Bibliographic details

Source: Knowledge-Based Systems for Safety Critical Applications, 1994, pp. 729-731 (3 pages)
Authors: Gravano L.; Ipeirotis P.G.; Koudas N.; Srivastava D.
Affiliation: Columbia Univ., NY, USA
Original format: PDF
Language: English
Chinese Library Classification: Automation technology, computer technology

Similar literature

1. Human Immunodeficiency Virus Reverse Transcriptase and Protease Sequence Database: an expanded data model integrating natural language text and sequence analysis programs. [J]. Kantor R, Machekano R, Gonzales MJ, Nucleic Acids Research. 2001, No. 1
2. Joining the dots: Can UW‐QoL free‐text data assist in understanding individual treatment experiences and QoL outcomes in head and neck cancer? [J]. Pateman K.A., Batstone M.D., Ford P.J. Psycho-Oncology: Journal of the Psychological Social and Behavioral Dimensions of Cancer. 2017, No. 12
3. A Mixed Methods Study of Public Perception of Social Distancing: Integrating Qualitative and Computational Analyses for Text Data [J]. Pauline Ho, Kaiping Chen, Anqi Shao, Journal of mixed methods research. 2021, No. 3
4. Text joins for data cleansing and integration in an RDBMS [C]. Gravano, L., Ipeirotis, . 2003
5. Integrative text mining and management in multidimensional text databases. [D]. Zhang, Duo. 2012
6. Human Immunodeficiency Virus Reverse Transcriptase and Protease Sequence Database: an expanded data model integrating natural language text and sequence analysis programs [O]. Rami Kantor, Rhoderick Machekano, Mathew J. Gonzales, 2001
7. Text Joins for Data Cleansing and Integration in an RDBMS [O]. Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, 2003
