Research on PDF Documents Information Extraction System Based on XML

Abstract

With the development of the Internet, the Web has become the largest repository of resources. Faced with increasingly abundant digital information resources, traditional approaches to information management cannot keep pace with the development of modern society, so a systematic technology for managing digital information resources is urgently needed. The digital library came into being to meet this need. The digital library is a new and promising application in which information integration is a basic component. Existing digital library information systems each adopt a specific resource format; for example, Wanfang uses PDF and CNKI uses CAJ. This is inconvenient, because users have to install a specific reader before they can read these resources. Our work is to develop a new digital library in which users can read resources directly, without a format-specific reader. From the point of view of literature management, a resource needs to be decomposed both structurally and semantically, and semantic description is the strength of XML. XML is a content-oriented, platform-independent technology. On this basis, we propose an XML-based metadata format embedded in the resources, which allows a new data resource format to be integrated. First, each possible source format (PDF, CAJ, RTF, etc.) is converted into XML. Second, conversion rules are defined that map XML markup to HTML markup. The resources are thereby converted from their old formats to a new format that a browser can display directly.
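The two-step conversion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the element names (`document`, `metadata`, `para`, and so on), the rule table `XML_TO_HTML_RULES`, and both helper functions are hypothetical, and step 1 is replaced by a stand-in that wraps already-extracted text, since the abstract does not say how PDF, CAJ, or RTF content is parsed.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table (an assumption, not from the paper):
# XML element name -> HTML element name.
XML_TO_HTML_RULES = {
    "document": "article",
    "title": "h1",
    "author": "address",
    "para": "p",
}


def build_xml(title, authors, paragraphs):
    """Step 1 (stand-in): wrap already-extracted text in XML with embedded metadata."""
    doc = ET.Element("document")
    meta = ET.SubElement(doc, "metadata")
    ET.SubElement(meta, "title").text = title
    for name in authors:
        ET.SubElement(meta, "author").text = name
    body = ET.SubElement(doc, "body")
    for text in paragraphs:
        ET.SubElement(body, "para").text = text
    return doc


def xml_to_html(node, rules=XML_TO_HTML_RULES):
    """Step 2: recursively rewrite XML markup into HTML markup using the rule table."""
    # Elements without a rule fall back to <div class="original-tag">.
    html_node = ET.Element(rules.get(node.tag, "div"))
    if node.tag not in rules:
        html_node.set("class", node.tag)
    html_node.text = node.text
    html_node.tail = node.tail
    for child in node:
        html_node.append(xml_to_html(child, rules))
    return html_node


doc = build_xml(
    "Research on PDF Documents Information Extraction System Based on XML",
    ["ZHANG Wen-de", "SONG Yan-juan"],
    ["The Web has become the largest repository of resources."],
)
# Serialize the converted tree as HTML that a browser can display directly.
print(ET.tostring(xml_to_html(doc), encoding="unicode", method="html"))
```

Keeping the mapping in a table rather than in code means a new source vocabulary only needs new rules, which is one plausible reading of the "conversion rule" in the abstract; a production system might express the same rules in XSLT instead.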

Bibliographic details

Source
International Conference on Universal Digital Library (ICUDL2005); 2005-10-31 to 2005-11-02; Hangzhou (CN) | 2005 | pp. 322-326 | 5 pages
Conference location: Hangzhou (CN)
Authors
ZHANG Wen-de; SONG Yan-juan
Author affiliation

Library of Fuzhou University, Fuzhou 350002, China

Original format: PDF
Language of text: English
Chinese Library Classification: electronic libraries, digital libraries
Keywords
information extraction; PDF; XML

Date added to database: 2022-08-26 14:22:28

