International Conference on Big Data and Smart Computing

XPath based crawling method with crowdsourcing for targeted online market places

Abstract

A growing number of online marketplaces have emerged as online shopping has become more popular over the past couple of decades. During that time, the technologies used to build web sites have evolved as well, and AJAX is currently a representative technique for building dynamic web pages. Crawling is a basic tool for collecting information on the Internet, and traditional crawling techniques randomly choose and follow links represented by anchor tags in order to navigate the World Wide Web. However, when a traditional crawler is used to gather information from a targeted, up-to-date online marketplace, some critical problems arise. The first is that there are too many links, of which only a few are needed to reach every web page on the site. The second is that most links are produced by JavaScript rather than by anchor tags, and traditional web crawlers cannot follow them. To overcome these issues, we propose a web page crawling method that extracts only the necessary and sufficient links by adopting a crowdsourcing approach, and that follows JavaScript links by using navigation information represented as XPaths.
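
The two issues named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the page fragment, the `CROWD_XPATHS` list, and both helper functions are hypothetical, and the standard-library `xml.etree.ElementTree` module supports only a limited XPath subset. The sketch only selects elements; actually following a JavaScript link would additionally require a browser engine to run the `onclick` handler.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment: the anchor tags are ads,
# while the real navigation is driven by JavaScript onclick handlers.
PAGE = """
<html><body>
  <div id="banner"><a href="/ads/1">Ad</a><a href="/ads/2">Ad</a></div>
  <ul id="categories">
    <li onclick="loadCategory(1)">Books</li>
    <li onclick="loadCategory(2)">Music</li>
  </ul>
  <div id="pager"><span class="next" onclick="nextPage()">Next</span></div>
</body></html>
"""

# Assumed navigation information, e.g. XPaths recorded by crowd workers
# who marked which elements are needed to reach every page of the site.
CROWD_XPATHS = [
    ".//ul[@id='categories']/li",
    ".//div[@id='pager']/span[@class='next']",
]


def anchor_links(root):
    """What an anchor-only crawler sees: href values of <a> tags."""
    return [a.get("href") for a in root.findall(".//a[@href]")]


def xpath_targets(root, xpaths):
    """Elements selected by the given XPaths, with their onclick handlers."""
    targets = []
    for xp in xpaths:
        for el in root.findall(xp):
            targets.append((xp, el.get("onclick"), (el.text or "").strip()))
    return targets


root = ET.fromstring(PAGE)
print(anchor_links(root))
for xp, handler, label in xpath_targets(root, CROWD_XPATHS):
    print(label, handler, xp)
```

In this fragment the anchor-only view contains just the two ad links, while the XPath list picks out the three elements that drive navigation. A crawler could then hand those XPaths to a browser automation tool to trigger the clicks.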