International Conference on Industrial Enterprise and System Engineering

Comparison of Web Scraping Techniques: Regular Expression, HTML DOM and XPath


Abstract

Data collection is the initial stage of research, and the internet offers many data sources that can be used in the research process. Extracting data or information from websites is called web scraping. Web scraping methods include Regular Expression (regex), HTML DOM, and XPath. This study compares the performance of these three methods. Each method is tested by retrieving data from the target website, and the performance of each run is measured and compared. Process time, memory usage, and data consumption are the measurement parameters. The results show that the regex method uses the least memory compared with the HTML DOM and XPath methods, while HTML DOM requires the least time and the smallest data consumption compared with the regex and XPath methods.
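To make the three approaches concrete, the following is a minimal Python sketch, not the study's implementation: the abstract does not state the language, libraries, or target website used. The sample page `SAMPLE`, the function names, and the `measure` helper are all hypothetical, and the standard-library parsers used here require well-formed markup. The sketch measures process time and peak memory only; data consumption is not covered because nothing is fetched over the network.

```python
import re
import time
import tracemalloc
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not the study's target website).
SAMPLE = (
    "<html><body><ul>"
    '<li><a href="/a">Alpha</a></li>'
    '<li><a href="/b">Beta</a></li>'
    '<li><a href="/c">Gamma</a></li>'
    "</ul></body></html>"
)


def scrape_regex(html):
    """Regex: match link text directly in the raw markup, no tree is built."""
    return re.findall(r'<a href="[^"]*">([^<]*)</a>', html)


def scrape_dom(html):
    """DOM: parse into a document tree, then look up elements by tag name."""
    doc = xml.dom.minidom.parseString(html)
    return [a.firstChild.data for a in doc.getElementsByTagName("a")]


def scrape_xpath(html):
    """XPath: parse into a tree, then select nodes with a path expression.

    ElementTree implements only a limited subset of XPath.
    """
    root = ET.fromstring(html)
    return [a.text for a in root.findall(".//li/a")]


def measure(func, html, repeats=200):
    """Return (result, elapsed_seconds, peak_bytes) for one scraping function.

    Time and memory are measured in separate passes because tracemalloc
    slows execution while it is tracing.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        result = func(html)
    elapsed = time.perf_counter() - start

    tracemalloc.start()
    func(html)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    methods = [("regex", scrape_regex), ("dom", scrape_dom), ("xpath", scrape_xpath)]
    for name, func in methods:
        result, elapsed, peak = measure(func, SAMPLE)
        print(f"{name:6s} {result} time={elapsed:.4f}s peak={peak} bytes")
```

All three functions return the same link texts for the sample page; the timings and memory figures printed by this sketch depend on the machine and on these particular libraries, so they should not be read as reproducing the paper's results.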
