Conference on Document Recognition and Retrieval; January 29-31, 2008; San Jose, CA (US)

Segmentation-Based Retrieval of Document Images from Diverse Collections


Abstract

We describe a methodology for retrieving document images from large, extremely diverse collections. First, we perform content extraction, that is, the location and measurement of regions containing handwriting, machine-printed text, photographs, blank space, etc., in documents represented as bilevel, greylevel, or color images. Recent experiments have shown that even modest per-pixel content classification accuracies can support usefully high recall and precision rates (e.g., 80-90%) for retrieval queries over document collections that seek pages containing a fraction of a certain type of content. When the distribution of content and the error rates are uniform across the entire collection, it is possible to derive IR measures from classification measures and vice versa. Our largest experiments to date, consisting of 80 training images totaling over 416 million pixels, are presented to illustrate these conclusions. This data set is more representative than those used in previous experiments, containing a more balanced distribution of content types. The data set also contains images of text obtained from handheld digital cameras, and the success of existing methods (with no modification) in classifying these images is discussed. Initial experiments in discriminating line art from the four classes mentioned above are also described. We also discuss methodological issues that affect both ground-truthing and evaluation measures.
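To make the link between per-pixel classification and page-level retrieval concrete, the following is a minimal sketch and not the authors' implementation. It assumes each page is given as a 2-D grid of per-pixel class labels, and that a query asks for pages whose share of one content type meets a threshold. The function names, the label strings, the 0.5 threshold, and the toy pages are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical label strings for the four content types named in the abstract.
CLASSES = ("handwriting", "machine_print", "photograph", "blank")


def content_fractions(label_map):
    """Return the fraction of pixels assigned to each class on one page."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    if total == 0:  # empty page: report zero for every class
        return {c: 0.0 for c in CLASSES}
    return {c: counts.get(c, 0) / total for c in CLASSES}


def retrieve(pages, content_type, min_fraction):
    """Return ids of pages whose share of content_type is at least min_fraction."""
    return {
        page_id
        for page_id, label_map in pages.items()
        if content_fractions(label_map)[content_type] >= min_fraction
    }


def precision_recall(retrieved, relevant):
    """Set-based precision and recall; an empty denominator yields 0.0 here."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy 2x2 "pages" (assumed data, not from the paper).
H, M, B = "handwriting", "machine_print", "blank"
truth = {
    "p1": [[H, H], [H, B]],  # 75% handwriting
    "p2": [[M, M], [M, H]],  # 25% handwriting
    "p3": [[H, H], [B, B]],  # 50% handwriting
}
predicted = {
    "p1": [[H, H], [M, B]],  # 50% handwriting after one pixel error
    "p2": [[M, H], [M, H]],  # 50% handwriting after one pixel error
    "p3": [[H, B], [B, B]],  # 25% handwriting after one pixel error
}

# Pages that truly satisfy the query versus pages returned from predictions.
relevant = retrieve(truth, H, 0.5)
retrieved = retrieve(predicted, H, 0.5)
precision, recall = precision_recall(retrieved, relevant)
print(sorted(relevant), sorted(retrieved), precision, recall)
```

In this toy case, one misclassified pixel per page is enough to move two pages across the threshold, so precision and recall both come out at 0.5. On realistically sized pages, a single pixel error shifts the measured fraction far less.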
