Journal of Software

A DOM-based Anchor-Hop-T Method for Web Application Information Extraction


Abstract

To fuse information about electronic products, a widely adopted approach is to extract information from the HTML structure of commercial websites and then process the data in depth. Modeling a Web application is difficult, however, because the data in HTML is semi-formal and appears as a DOM (Document Object Model) tree when it is analyzed with an XML schema. The first question to address is therefore how to understand and extract this information. The general Anchor-Hop model, which considers the text property and the label property, is a simple way to handle this problem and, as a result, has low effectiveness. The model is sensitive to the HTML structure: if the website structure changes even slightly, extraction accuracy suffers, and the extraction rules must be redefined for the changed structure. To improve extraction efficiency, this paper proposes Anchor-Hop-T, a DOM-based dynamic information extraction model. HTML tags such as table, ol, and ul can be searched and processed using XPath, which makes it convenient to extract the corresponding Anchor data block. Furthermore, the location of the Hop point is treated as invariant; on this basis, the new model, built on Anchor and Hop points, introduces further concepts for information extraction, such as the Anchor data block, the Anchor locating library, and the AH relevance value. Finally, an experiment is presented to demonstrate the applicability of the approach.
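The XPath step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the sample page, the function names, and the two "hop" rules (next cell in a table row, text after a colon in a list item) are assumptions made only for illustration, and the Anchor locating library and AH relevance value are not modeled. It uses the limited XPath subset of the standard-library `xml.etree.ElementTree`, so the input must be well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical product page fragment (well-formed so the stdlib parser accepts it).
PAGE = """
<html><body>
  <div><p>Free shipping on all orders</p></div>
  <table>
    <tr><td>Brand</td><td>Acme</td></tr>
    <tr><td>Screen size</td><td>15.6 inch</td></tr>
  </table>
  <ul>
    <li>Weight: 1.8 kg</li>
  </ul>
</body></html>
"""

BLOCK_TAGS = ("table", "ol", "ul")  # the tags named in the abstract


def find_blocks(root):
    """Collect candidate blocks with one XPath query per tag."""
    blocks = []
    for tag in BLOCK_TAGS:
        blocks.extend(root.findall(f".//{tag}"))
    return blocks


def extract_from_table(block, anchor):
    """Assumed hop rule: the value is the cell right after the anchor cell."""
    for row in block.iter("tr"):
        cells = [(cell.text or "").strip() for cell in row.findall("td")]
        for i, text in enumerate(cells[:-1]):
            if text == anchor:
                return cells[i + 1]
    return None


def extract_from_list(block, anchor):
    """Assumed hop rule: the value follows 'anchor:' inside a list item."""
    for item in block.iter("li"):
        label, sep, value = (item.text or "").partition(":")
        if sep and label.strip() == anchor:
            return value.strip()
    return None


def extract(root, anchor):
    """Search every candidate block and return the first value found."""
    for block in find_blocks(root):
        if block.tag == "table":
            value = extract_from_table(block, anchor)
        else:
            value = extract_from_list(block, anchor)
        if value is not None:
            return value
    return None


root = ET.fromstring(PAGE.strip())
print(extract(root, "Screen size"))
print(extract(root, "Weight"))
```

Because the search is keyed on the anchor text and a short relative hop rather than on an absolute path from the document root, moving the table elsewhere on the page would not change the result in this sketch; real pages would need an HTML-tolerant parser.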
