Conference: Youth Academic Annual Conference of Chinese Association of Automation

Design and application of intelligent dynamic crawler for web data mining

Abstract

Web data acquisition is the foundation of Web data mining. Web crawlers are an important tool for Web data acquisition, but frequent changes to Web data structures, data sources, and distribution channels make crawler programs costly to develop and maintain. To address this problem, this paper designs and implements an intelligent dynamic crawler that stores XPath data extraction rules in a database, loads the rules dynamically according to the target, and uses the TF-IDF method to calculate relevance. Because the Web crawling rules can be acquired automatically, the crawler is intelligent and dynamic, adapts better to the complex Web environment, and costs less to maintain and update. Finally, the paper applies the intelligent dynamic crawler to threat awareness of public vulnerabilities by collecting and analyzing data from vulnerability communities and a network node search engine. In the experiment, a prototype system collected and analyzed data from three vulnerability communities. The results show that the intelligent dynamic crawler provides efficient and flexible data collection and lays a foundation for Web data mining.
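
The rule-driven design described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the table layout, the site name `vuln.example.org`, the sample page, and every function name are hypothetical. It assumes the page is well-formed markup so that the standard library's limited XPath subset is enough (a real crawler would more likely use a full XPath engine), and it assumes relevance is the sum of the TF-IDF weights of a few topic keywords, which is only one of several possible definitions since the abstract does not specify one.

```python
import math
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse


def make_rule_store():
    """Create an in-memory rule table (hypothetical schema): one XPath per (site, field)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE rules (site TEXT, field TEXT, xpath TEXT)")
    db.executemany(
        "INSERT INTO rules VALUES (?, ?, ?)",
        [
            ("vuln.example.org", "title", ".//h1"),
            ("vuln.example.org", "body", ".//div[@class='detail']"),
        ],
    )
    return db


def load_rules(db, url):
    """Load the extraction rules for the site that serves `url` (empty dict if none)."""
    site = urlparse(url).netloc
    rows = db.execute("SELECT field, xpath FROM rules WHERE site = ?", (site,))
    return dict(rows.fetchall())


def extract(page, rules):
    """Apply each stored XPath to a well-formed page and collect the matched text."""
    root = ET.fromstring(page)
    record = {}
    for field, xpath in rules.items():
        node = root.find(xpath)
        record[field] = "".join(node.itertext()).strip() if node is not None else ""
    return record


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def relevance(vector, keywords):
    """Assumed relevance score: sum of the TF-IDF weights of the topic keywords."""
    return sum(vector.get(word.lower(), 0.0) for word in keywords)


# Hypothetical sample page standing in for a fetched vulnerability report.
PAGE = (
    "<html><body><h1>Buffer overflow in parser</h1>"
    "<div class='detail'>A remote overflow lets attackers crash the parser</div>"
    "</body></html>"
)

db = make_rule_store()
rules = load_rules(db, "https://vuln.example.org/item/1")
record = extract(PAGE, rules)

corpus = [record["body"], "weekly community newsletter and events"]
vectors = tfidf_vectors(corpus)
scores = [relevance(v, ["overflow", "parser"]) for v in vectors]
```

Because the XPath expressions live in the database rather than in the code, supporting a new site or a changed page layout in this sketch only requires inserting or updating rows in `rules`, which is the maintenance benefit the abstract points to.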
