
Scalable system for expectation maximization clustering of large databases


Abstract

In one exemplary embodiment, the invention provides a data mining system for finding clusters of data items in a database or any other data storage medium. Before data evaluation begins, the number M of models to be explored and the number K of clusters within each of the M models are chosen. The clusters are used to categorize the data in the database into K different clusters within each model. An initial set of estimates of the data distribution is provided for each model to be explored. A portion of the data in the database is then read from a storage medium into a rapid-access memory buffer whose size is determined by the user or the operating system, depending on available memory resources. The data in the buffer is used to update the original model data distributions in each of the K clusters over all M models. Some of the data belonging to a cluster is summarized or compressed and stored in a reduced form that represents sufficient statistics of the data. More data is then accessed from the database and the models are updated again. An updated set of cluster parameters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine whether further data should be accessed from the database.
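
The loop described in the abstract (load a buffer, update the clusters, compress confidently assigned data into sufficient statistics, check a stopping criterion) can be illustrated with a minimal sketch. It is not the patented implementation: it assumes one-dimensional data, Gaussian clusters, and a single model (M = 1), and every name and parameter in it (`scalable_em`, `confident`, `tol`, `em_iters`) is hypothetical. Compressed points are treated as fully assigned to one cluster, which is a simplification made for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def responsibilities(x, weights, means, variances):
    """E-step for one point: posterior probability of each cluster."""
    dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens) or 1e-300  # guard against division by zero on underflow
    return [d / total for d in dens]

def scalable_em(chunks, means, variances, weights,
                em_iters=5, confident=0.99, tol=1e-3, min_var=1e-6):
    """Illustrative single-model scalable EM over an iterable of data chunks."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    stats = [[0.0, 0.0, 0.0] for _ in range(k)]  # per cluster: count, sum, sum of squares
    retained = []  # points kept uncompressed in the buffer
    seen = 0       # number of points read so far
    for chunk in chunks:  # each chunk stands in for one buffer load
        buffer = retained + list(chunk)
        seen += len(chunk)
        previous = list(means)
        for _ in range(em_iters):
            # Start each pass from the summarized (compressed) data.
            acc = [list(s) for s in stats]
            for x in buffer:
                for j, r in enumerate(responsibilities(x, weights, means, variances)):
                    acc[j][0] += r
                    acc[j][1] += r * x
                    acc[j][2] += r * x * x
            total = sum(a[0] for a in acc)
            # M-step: re-estimate parameters from summaries plus buffered points.
            for j, (n, s, ss) in enumerate(acc):
                if n > 0:
                    means[j] = s / n
                    variances[j] = max(ss / n - means[j] ** 2, min_var)
                    weights[j] = n / total
        # Compression: fold confidently assigned points into sufficient statistics.
        retained = []
        for x in buffer:
            r = responsibilities(x, weights, means, variances)
            j = max(range(k), key=r.__getitem__)
            if r[j] >= confident:
                stats[j][0] += 1.0
                stats[j][1] += x
                stats[j][2] += x * x
            else:
                retained.append(x)
        # Stopping criterion (assumed): cluster means barely moved on this chunk.
        if max(abs(a - b) for a, b in zip(means, previous)) < tol:
            break
    return weights, means, variances, stats, retained, seen
```

Calling `scalable_em(chunks, [1.0, 8.0], [4.0, 4.0], [0.5, 0.5])` with `chunks` as a list of lists of floats returns the updated weights, means, and variances together with the per-cluster summaries, the points still held uncompressed, and the number of points read. Memory use is bounded by the chunk size plus the retained points, because compressed points survive only as a count, a sum, and a sum of squares per cluster.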

Bibliographic details

  • Publication number: US6263337B1
  • Publication date: 2001-07-17
  • Original format: PDF
  • Applicant/assignee: MICROSOFT CORPORATION
  • Application number: US19980083906
  • Inventors: PAUL S. BRADLEY; USAMA FAYYAD; CORY REINA
  • Filing date: 1998-05-22
  • Classification: G06F170/00
  • Country: US
  • Date indexed: 2022-08-22 01:03:48
