WEBPAGE ENTITY EXTRACTION THROUGH JOINT UNDERSTANDING OF PAGE STRUCTURES AND SENTENCES

Abstract

Described is a technology for understanding entities of a webpage, e.g., to label the entities on the webpage. An iterative and bidirectional framework processes a webpage, including a text understanding component (e.g., extended Semi-CRF model) that provides text segmentation features to a structure understanding component (e.g., extended HCRF model). The structure understanding component uses the text segmentation features and visual layout features of the webpage to identify a structure (e.g., labeled block). The text understanding component in turn uses the labeled block to further understand the text. The process continues iteratively until a similarity criterion is met, at which time the entities may be labeled. Also described is the use of multiple mentions of a set of text in the webpage to help in labeling an entity.
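
The iterative, bidirectional loop described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented method: `joint_label`, `agreement`, `toy_text`, and `toy_structure` are hypothetical names, the toy rules stand in for the extended Semi-CRF and HCRF models, a layout column number stands in for visual layout features, and the "similarity criterion" is assumed here to be the fraction of labels unchanged between two consecutive passes.

```python
from typing import List, Optional, Tuple

Node = Tuple[str, int]  # (text, layout column); the column is a toy visual layout feature


def agreement(a: List[str], b: List[str]) -> float:
    """Fraction of positions at which two label sequences match."""
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def joint_label(nodes, understand_text, understand_structure,
                min_agreement: float = 1.0, max_iters: int = 10):
    """Alternate text and structure understanding until the labels stabilize.

    Assumption: the stopping rule compares the labels of consecutive passes.
    """
    blocks: Optional[List[str]] = None
    seg: List[str] = []
    prev: Optional[List[str]] = None
    for iteration in range(1, max_iters + 1):
        seg = understand_text(nodes, blocks)      # text pass, may use block labels
        blocks = understand_structure(nodes, seg)  # structure pass, uses text labels
        current = seg + blocks
        if prev is not None and agreement(prev, current) >= min_agreement:
            return seg, blocks, iteration
        prev = current
    return seg, blocks, max_iters


def toy_text(nodes: List[Node], blocks: Optional[List[str]]) -> List[str]:
    """Hypothetical stand-in for the text understanding component."""
    labels = []
    for i, (text, _col) in enumerate(nodes):
        if any(ch.isdigit() for ch in text):
            labels.append("PHONE")
        elif blocks is not None and blocks[i] == "contact":
            labels.append("NAME")  # block context disambiguates the text
        else:
            labels.append("OTHER")
    return labels


def toy_structure(nodes: List[Node], seg: List[str]) -> List[str]:
    """Hypothetical stand-in for the structure understanding component."""
    contact_cols = {col for (_text, col), label in zip(nodes, seg) if label == "PHONE"}
    return ["contact" if col in contact_cols else "other" for _text, col in nodes]


page = [("Jane Doe", 0), ("555-0100", 0), ("Site map", 1)]
entities, blocks, passes = joint_label(page, toy_text, toy_structure)
```

In this toy run the first pass labels only the phone number, the structure pass then marks its column as a contact block, and the second text pass uses that block to label "Jane Doe" as a name. A third pass changes nothing, so the loop stops.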