Journal of Southeast University

Algorithms of mining data records from website automatically


Abstract

To improve the accuracy and completeness of mining data records from the web, the concepts of isomorphic pages and directory pages are introduced and three algorithms are proposed. Isomorphic web pages are a set of web pages that share a uniform structure and differ only in their main information. A web page containing many links to isomorphic web pages is called a directory page. Algorithm 1 finds directory pages in a website by analyzing the similarity of adjacent links. It first sorts the links and then counts the links in each directory. If the count exceeds a given threshold, it finds the similar sub-page links in that directory and returns them. A function for judging whether web pages are isomorphic is also proposed. Algorithm 2 mines data records from an isomorphic page using a noise information filter. It relies on the fact that the noise information is the same in two isomorphic pages and only the main information differs. Algorithm 3 mines data records from an entire website using a web spider. Experiments show that the proposed algorithms mine data records more completely than existing algorithms. Mining data records from isomorphic pages is an efficient method.
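The abstract gives only an outline of the algorithms, so the following is a minimal sketch of the two core ideas rather than the authors' implementation. It assumes that a "directory" means the host plus the parent path of a URL, and that a page is represented as a list of text blocks. The names `find_directory_candidates`, `filter_noise`, and `threshold` are hypothetical. The link-similarity analysis, the isomorphism judgment function, and the spider of Algorithm 3 are not specified in the abstract and are left out.

```python
from collections import defaultdict
from urllib.parse import urlparse
import posixpath


def find_directory_candidates(links, threshold=3):
    """Sort the links, count them per URL directory, and keep only the
    directories whose link count is greater than the threshold.

    Assumption: a directory is the host plus the parent path of the URL.
    """
    groups = defaultdict(list)
    for url in sorted(set(links)):
        parsed = urlparse(url)
        directory = parsed.netloc + posixpath.dirname(parsed.path)
        groups[directory].append(url)
    return {d: urls for d, urls in groups.items() if len(urls) > threshold}


def filter_noise(page_a, page_b):
    """Return the text blocks of page_a that do not occur in page_b.

    Assumption: each page is a list of text blocks. Blocks shared by two
    isomorphic pages are treated as noise; the rest is main information.
    """
    noise = set(page_b)
    return [block for block in page_a if block not in noise]
```

For example, given three links under http://example.com/news/ and one link to http://example.com/about.html, calling `find_directory_candidates` with `threshold=2` keeps only the `example.com/news` group. Calling `filter_noise` on two pages that share a menu and a footer returns only the blocks unique to the first page.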
