Journal: Knowledge-Based Systems

Semi-supervised cluster-and-label with feature based re-clustering to reduce noise in Thai document images

Abstract

Noise components are a major cause of poor performance in document analysis. To reduce undesired components, most recent research has applied image processing techniques. However, these techniques are effective only for Latin-script documents, not for non-Latin-script documents. Non-Latin scripts such as Thai are considerably more complicated than Latin script: they have many levels of character alignment, no word or sentence separators, and variable character sizes. When an image processing technique is applied to a Thai document, characters that are relatively close to noise are usually removed as well. Hence, in this paper, we propose a novel noise reduction method that applies a machine learning technique to classify and reduce noise in document images. The proposed method uses a semi-supervised cluster-and-label approach with an improved labeling method, namely feature-selected sub-cluster labeling. Feature-selected sub-cluster labeling focuses on the clusters that are incorrectly labeled by conventional labeling methods. These clusters are re-clustered into small groups using a new feature set that is selected according to class labels. The experimental results show that this method can significantly improve the accuracy of labeling examples and the performance of classification. We compared the noise reduction and character preservation performance of the proposed method with that of two related noise reduction approaches: a two-phased stroke-like pattern noise (SPN) removal method and a commercial noise reduction package, ScanFix Xpress 6.0. The results show that semi-supervised noise reduction is significantly better than the compared methods, achieving F-measures of 86.01 for characters and 97.82 for noise. (C) 2015 Elsevier B.V. All rights reserved.
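
The cluster-and-label idea with re-clustering of mixed clusters can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm, the feature-selection criterion, or the features themselves. The sketch below assumes plain k-means with farthest-point initialization, majority-vote labeling, and a simple feature score (the spread of per-class means over labeled points). All function names, parameters, and the toy data are hypothetical.

```python
from collections import Counter

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain k-means (assumed clusterer); returns one cluster index per point."""
    # Deterministic farthest-point initialization, starting from points[0].
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(points, key=lambda p: min(dist2(p, c) for c in centers))))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center.
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def majority(labels):
    """Most common known label, or None if every label is missing."""
    known = [l for l in labels if l is not None]
    return Counter(known).most_common(1)[0][0] if known else None

def select_features(points, labels, n_keep):
    """Assumed criterion: keep the features whose per-class means differ most."""
    classes = sorted({l for l in labels if l is not None})
    scores = []
    for j in range(len(points[0])):
        means = [sum(p[j] for p, l in zip(points, labels) if l == c) /
                 sum(1 for l in labels if l == c) for c in classes]
        scores.append(max(means) - min(means))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:n_keep]

def cluster_and_label(points, labels, k=2, sub_k=2, n_keep=1):
    """labels[i] is a class name for labeled points and None for unlabeled ones."""
    out = list(labels)
    assign = kmeans(points, k)
    for c in set(assign):
        idx = [i for i, a in enumerate(assign) if a == c]
        groups = [idx]
        # A cluster whose labeled members disagree is re-clustered into
        # sub-clusters on a feature subset chosen from its class labels.
        if len({labels[i] for i in idx} - {None}) > 1:
            feats = select_features([points[i] for i in idx],
                                    [labels[i] for i in idx], n_keep)
            reduced = [[points[i][j] for j in feats] for i in idx]
            sub = kmeans(reduced, min(sub_k, len(idx)))
            groups = [[i for i, s in zip(idx, sub) if s == g] for g in set(sub)]
        fallback = majority([labels[i] for i in idx])
        for group in groups:
            guess = majority([labels[i] for i in group])
            if guess is None:
                guess = fallback  # sub-cluster without any labeled member
            for i in group:
                if out[i] is None:
                    out[i] = guess
    return out

# Toy connected components described by two made-up features.
points = [[0.0, 5.0], [1.0, 5.2], [0.5, 4.8],
          [10.0, 5.0], [14.0, 5.1], [10.0, 1.0], [14.0, 0.9]]
labels = ["char", None, None, "char", None, "noise", None]
print(cluster_and_label(points, labels))
```

On this toy data the second top-level cluster contains one labeled character and one labeled noise component, so a single majority vote cannot label it reliably. Re-clustering it on the one selected feature splits it into a character-like and a noise-like sub-cluster, and the unlabeled points at indices 4 and 6 receive "char" and "noise" respectively.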
