International Conference on Universal Digital Library (ICUDL2005); October 31 to November 2, 2005; Hangzhou, China
Research on PDF Documents Information Extraction System Based on XML

Abstract

With the development of the Internet, the Web has become the largest repository of resources. Traditional approaches to information management cannot keep pace with increasingly abundant digital information resources, so a systematic technology for managing these resources is urgently needed. The digital library emerged to meet this need. It is a new and promising application in which information integration is a basic component. Digital library information systems each adopt their own resource formats: for example, Wangfan uses PDF and CNKI uses CAJ. This is inconvenient, because users must install a specific reader before they can read these resources. Our work is to develop a new digital library in which users can read resources directly, without a format-specific reader. From the standpoint of literature management, a resource must be decomposed in terms of both structure and semantics, and describing semantics is the strength of XML. XML is a content-based, platform-independent technology for the Internet. On this basis, we propose an XML-based metadata format embedded in the resources, through which a new, integrated data resource format can be built. First, each possible source format (PDF, CAJ, RTF, etc.) is converted into XML. Second, conversion rules are defined that map XML markup to HTML markup. The resources are thereby converted from their old formats to the new format, which a browser can display directly.
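
The second step, mapping XML markup to HTML markup through conversion rules, can be illustrated with a minimal sketch. The abstract does not give the actual schema or rule format, so the element names (`document`, `title`, `para`, and so on), the `RULES` table, and the `convert` function below are all hypothetical and only show the general idea, using Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical conversion rules: XML element name -> HTML element name.
# The real system's schema and rules are not described in the abstract.
RULES = {
    "document": "article",
    "title": "h1",
    "author": "address",
    "section": "section",
    "heading": "h2",
    "para": "p",
}


def to_html(node):
    """Recursively rebuild an XML element as an HTML element using RULES.

    Elements without a rule fall back to a generic <div>.
    """
    html = ET.Element(RULES.get(node.tag, "div"))
    html.text = node.text
    html.tail = node.tail
    for child in node:
        html.append(to_html(child))
    return html


def convert(xml_text):
    """Parse an XML string and return the converted HTML as a string."""
    root = ET.fromstring(xml_text)
    return ET.tostring(to_html(root), encoding="unicode", method="html")


# Assumed sample of what step one (PDF to XML) might produce.
SAMPLE = (
    "<document>"
    "<title>Sample Title</title>"
    "<para>Sample paragraph.</para>"
    "</document>"
)

if __name__ == "__main__":
    print(convert(SAMPLE))
```

Keeping the rules in a plain lookup table separates the mapping from the tree walk, so supporting another element only requires a new table entry. A production system would more likely express such rules in XSLT, which the abstract does not confirm either way.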
