
ADD: Arabic duplicate detector - a duplicate detection data cleansing tool


Abstract

Summary form only given. Data mining is a relatively new term, introduced in the 1990s. It is the process of extracting useful information from huge amounts of data, and is sometimes referred to as "data discovery" or "knowledge discovery" in databases. What counts as useful information depends on the goal of the data mining effort in the first place: such information can be used to increase revenue, to cut costs, or for research. Advances in hardware and software in the late 1990s made data centralizing possible; this is commonly called "data warehousing", and the resulting centralized store a "data warehouse". Centralization raised a very important issue, the quality of the centralized data, since centralization involves joining multiple data sources. The data given as input to the data mining process must be of high quality for the results of that process to be accurate and reliable. Before data can be mined to extract useful information, it goes through a process called data cleansing. This process is as old as the word "data" itself; however, the term regained significance in the 1990s. Data cleansing involves several steps and processes, each comprising one or more algorithms. We address one important step, duplicate data detection. We present a duplicate detection method called the efficient k-way sorting method, and a tool called Arabic duplicate detector (ADD), which is based on our method and is tailored for Arabic data.
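The abstract does not describe the efficient k-way sorting method itself, so the following is only a generic sketch of the family of techniques it belongs to: sorting records on several keys and comparing near neighbors within a sliding window (a multi-pass sorted-neighborhood approach). The function names, keys, window size, and similarity threshold are all illustrative assumptions, not the authors' algorithm.

```python
# Generic multi-pass sorted-neighborhood duplicate detection (illustrative
# only; the paper's "efficient k-way sorting method" is not specified here).
from difflib import SequenceMatcher


def find_duplicates(records, key_funcs, window=3, threshold=0.85):
    """Flag likely duplicate pairs.

    For each of the k sort keys, sort the record indices and compare each
    record only with its next (window - 1) neighbors in that ordering,
    accepting a pair when its string similarity exceeds the threshold.
    """
    pairs = set()
    for key in key_funcs:
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            for j in order[pos + 1 : pos + window]:
                a, b = records[i], records[j]
                if SequenceMatcher(None, a, b).ratio() >= threshold:
                    pairs.add(tuple(sorted((i, j))))  # canonical index pair
    return pairs


# Toy Arabic example: the first two names differ only in hamza spelling
# (ا vs. أ), a variation an Arabic-tailored tool would need to handle.
names = ["محمد احمد", "محمد أحمد", "سارة علي", "sara ali"]
dups = find_duplicates(names, [lambda s: s, lambda s: s[::-1]])
```

Using several sort keys (here the string itself and its reversal) is what makes the approach "k-way": records that sort far apart under one key may land adjacent under another, so multiple passes catch more duplicates without comparing all pairs.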
