为了有效地从Web页面上提取数据信息,本文建立一种基于XML的Web信息收集数据库.利用开源工具JTidy将Web页面加以整理,利用XML良好的结构特性,使用Dom4j工具包解析XML文件;按照XML中的标签层次特点作为对数据进行储存的依据;最后使用Hibernate将数据持久化地储存于数据库中,方便数据的储存与查询.%In order to extract information and data from Web pages effectively, this paper constructs a database used for collecting data based on XML. The HTML documents are transformed to XHTML and analyzed by the open-source tools Jtidy and Dom4j. Data are extracted and saved based on the tag characteristics of XML documents. Finally the data are persisted in the database by the ORM tool-Hibernate.
展开▼