首页> 中文期刊>沈阳大学学报 >基于XML的Web内容挖掘方法

基于XML的Web内容挖掘方法

     

摘要

在分析Web内容挖掘特征的基础上,提出一种基于XML技术的Web内容挖掘模型.利用HITS算法确定权威Web页面,利用HTMLTidy工具将非XML文件经过数据清洗后转换成结构良好的XMI。文档,结合互联网上传统科技论文的自动抽取系统实例,采用文本聚类分类技术进行面向XML文档数据的数据挖掘.实验结果表明,该模型工作良好,可以自动、有效地提取网页内容.%The characteristics of Web content mining were analyzed and a model of Web content mining was proposed base on XML. The HITS algorithm was used to determine the authority of Web pages, the HTML Tidy tool was used for non-XML documents through the data cleansing and transform XML documents into well-formed, and text clustering techniques were used for XML document classification data in data mining. Combining with the examples of traditional scientific papers of automated extraction system from Internet, the model is proved to work well, and it can automatically and effectively extract web page content.

著录项

相似文献

  • 中文文献
  • 外文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号