Conference: Advanced language technologies for digital libraries
Efficient Search in Hidden Text of Large DjVu Documents

Abstract

The paper describes an open-source tool that presents end users with the results of advanced language technologies. It relies on the DjVu format, which for some applications is still superior to other modern formats, including PDF/A. The GPL-licensed DjVu tools are not limited to the DjVuLibre library; they are being supplemented by various new programs, such as pdf2djvu, developed by Jakub Wilk. In particular, pdf2djvu converts the PDF output of popular OCR programs such as FineReader to DjVu while preserving the hidden text layer and some other features. The tool in question was conceived by the present author and consists of a modification of the Poliqarp corpus query tool, used for the National Corpus of Polish; his ideas have been very successfully implemented by Jakub Wilk. The new system, called here simply Poliqarp for DjVu, inherits from its origin not only the powerful search facilities based on two-level regular expressions, but also the ability to represent low-level ambiguities and other linguistic phenomena. Although at present the tool is used mainly to facilitate access to the results of dirty OCR, it is also ready to handle more sophisticated output of linguistic technologies.