Clustering OCR-ed texts for browsing document image database


Abstract

Document clustering is a powerful tool for browsing a document database. Similar documents are gathered into clusters, and a representative document from each cluster is shown to users. For users to infer the content of the database from a few representatives, the documents must be separated into tight clusters, in which documents are connected by high similarities. At the same time, clustering must be fast enough for user interaction. We propose a cluster extraction method that runs in O(n^2) time and O(n) space. It is faster than conventional clustering methods, and its clusters compare favorably with those produced by Complete Link in terms of tightness. When dealing with OCR-ed documents, term loss caused by recognition faults can change the similarities between documents. We also examined the effect of recognition faults on the performance of document clustering.
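The effect of term loss on similarity can be illustrated with a minimal sketch. The abstract does not name a similarity measure, so cosine similarity over raw term counts is an assumption here. The sample sentences and the `drop_terms` helper are also hypothetical; the helper is only a stand-in for OCR recognition faults, not the authors' fault model.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bags of words given as Counter objects."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_terms(tokens, lost):
    """Hypothetical fault model: remove every token that OCR failed to recognize."""
    return [t for t in tokens if t not in lost]


# Two made-up documents that share most of their terms.
doc_a = "document clustering groups similar document images".split()
doc_b = "clustering similar document images for browsing".split()

# Similarity between the clean texts.
clean = cosine_similarity(Counter(doc_a), Counter(doc_b))

# Similarity after doc_a loses two shared terms to simulated recognition faults.
noisy_a = drop_terms(doc_a, {"clustering", "similar"})
noisy = cosine_similarity(Counter(noisy_a), Counter(doc_b))

print(f"clean: {clean:.3f}, after term loss: {noisy:.3f}")
```

In this toy case the lost terms were shared by both documents, so the similarity drops. Losing terms that only one document contains could instead raise it, which is why the abstract speaks of similarities changing rather than simply decreasing.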