International Conference on Computer Processing of Oriental Languages

Segmentation of Mixed Chinese/English Document Including Scattered Italic Characters


Abstract

Segmenting mixed Chinese/English documents is difficult when many italic characters are scattered through the text. Most prior work focuses on English documents, but mixed documents differ from English ones and have special features that must be considered. This paper proposes a new approach to the problem. First, an appropriate character area is chosen for italic detection. Next, a two-step strategy is adopted: italic determination is performed first, and the slant angle is estimated only if the character pattern is identified as italic. Finally, the italic character pattern is corrected by a shear transform. A method for italic determination based on a two-step weighted projection profile histogram is introduced, along with a fast algorithm for estimating the slant angle. Three large sample collections, consisting of characters, character pairs, and documents respectively, are used to evaluate the method, and encouraging results are achieved.
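The slant-estimation and shear-correction steps can be illustrated with a minimal sketch. It does not reproduce the paper's two-step weighted projection profile histogram or its fast slant estimator, which the abstract does not specify. Instead it assumes a generic brute-force search: shear the binary character image at each candidate angle and keep the angle whose vertical projection profile is most concentrated. The function names `shear_rows` and `estimate_slant`, the candidate range of -30 to 30 degrees, and the sum-of-squares score are all assumptions made for illustration.

```python
import math
import numpy as np

def shear_rows(img, angle_deg):
    """Horizontally shear a 2-D binary image (row 0 is the top).

    A positive angle shifts upper rows to the left, which undoes a
    right-leaning (italic) slant of that angle; a negative angle
    produces a right-leaning slant.
    """
    h, w = img.shape
    t = math.tan(math.radians(angle_deg))
    pad = math.ceil(abs(t) * (h - 1))          # room for the largest shift
    out = np.zeros((h, w + 2 * pad), dtype=img.dtype)
    for y in range(h):
        height = h - 1 - y                     # distance above the bottom row
        shift = int(round(-t * height))        # per-row horizontal offset
        out[y, pad + shift : pad + shift + w] = img[y]
    return out

def estimate_slant(img, candidates=range(-30, 31)):
    """Return the candidate angle (degrees) that best deslants img.

    Score (an assumed criterion, not the paper's): the sum of squared
    column sums, which is largest when strokes line up vertically.
    """
    best_angle, best_score = 0, -1
    for angle in candidates:
        profile = shear_rows(img, angle).sum(axis=0).astype(np.int64)
        score = int((profile ** 2).sum())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

def deslant(img):
    """Estimate the slant of img and return (angle, corrected image)."""
    angle = estimate_slant(img)
    return angle, shear_rows(img, angle)
```

In a pipeline like the one the abstract describes, `deslant` would be called only on character patterns that an earlier italic-determination step has already flagged, so upright characters are never sheared.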
