Clustering OCR-ed texts for browsing document image database


Abstract

Document clustering is a powerful tool for browsing a document database. Similar documents are gathered into clusters, and a representative document of each cluster is shown to users. For users to infer the content of the database from a few representatives, the documents must be separated into tight clusters, in which documents are connected by high similarities. At the same time, clustering must be fast enough for user interaction. We propose an O(n^2) time, O(n) space cluster extraction method. It is faster than ordinary clustering methods, and its clusters compare favorably in tightness with those produced by Complete Link. With OCR-ed documents, term loss caused by recognition faults can change the similarities between documents. We also examined the effect of recognition faults on the performance of document clustering.
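The point about term loss can be illustrated with a small sketch. The abstract does not say which similarity measure or noise model the authors used, so everything here is an assumption for illustration: cosine similarity over raw term counts, a hypothetical `drop_terms` helper that discards each token independently with a fixed probability, and two made-up sample sentences.

```python
import math
import random
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def drop_terms(tokens, loss_rate, rng):
    """Assumed noise model: drop each token independently with probability loss_rate."""
    return [t for t in tokens if rng.random() >= loss_rate]


# Made-up sample documents, not taken from the paper.
doc_a = "document clustering groups similar documents into tight clusters".split()
doc_b = "tight clusters of similar documents help users browse a database".split()

rng = random.Random(0)  # fixed seed so the run is repeatable
clean = cosine(Counter(doc_a), Counter(doc_b))
noisy = cosine(
    Counter(drop_terms(doc_a, 0.3, rng)),
    Counter(drop_terms(doc_b, 0.3, rng)),
)
print(f"similarity on clean text: {clean:.3f}")
print(f"similarity after simulated term loss: {noisy:.3f}")
```

Running the sketch with different seeds or loss rates shows the similarity of the same document pair shifting as terms disappear, which is the effect the abstract says can disturb clustering.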


Bibliographic details

Source: 1995, pp. 171-174 (4 pages)
Authors: Tsuda, K.; Senda, S.
Original format: PDF
Chinese Library Classification: Radio electronics, telecommunications technology

Similar documents

1. Probing Text Databases and Clustering to Extract New Topic Documents [J]. Takanori MOURI, Hiroyuki KITAGAWA. 電子情報通信学会技術研究報告. デ-タ工学. Data Engineering. 2003, no. 192
2. DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering [J]. Lakshmi R., Baskar S. Journal of Information Science. 2019, no. 6
3. Clustering OCR-ed texts for browsing document image database [C]. Tsuda K., Senda S., Institute of Electric and Electronic Engineer International Conference on Document Analysis and Recognition. 1995
4. Text document topical recursive clustering and automatic labeling of a hierarchy of document clusters [D]. Li, Xiaoxiao. 2012
5. Swarm Intelligence Algorithms in Text Document Clustering with Various Benchmarks [O]. Suganya Selvaraj, Eunmi Choi. 2021
6. Document image database retrieval and browsing using texture analysis [O]. John F. Cullen, Jonathan J. Hull, Peter E. Hart. 1997
