System and method for recognizing non-body text in webpage
Abstract
The invention discloses a system and method for recognizing non-body text in a webpage, and relates to the field of main-body extraction. The system comprises: a webpage grabber configured to fetch the data of all webpages of a target website; a DOM tree construction unit configured to construct a DOM tree for each webpage of the target website; a DOM tree analysis unit configured to locate the unit text sections in each webpage according to its DOM tree; a text statistics unit configured to count the number of occurrences of each unit text section across all webpages of the target website; and a text recognition unit configured to recognize a unit text section as non-body text when its number of occurrences is greater than a predetermined threshold. The system and method overcome the recognition lag of prior-art methods for non-body text and achieve high recognition accuracy.
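
The counting and thresholding steps described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation, and it rests on several assumptions: the webpage grabber is left out and pages are passed in as HTML strings; Python's standard `html.parser` stands in for full DOM tree construction; a "unit text section" is taken to be the whitespace-normalized content of a single text node; and each section is counted at most once per page. The names `UnitTextCollector`, `unit_text_sections`, and `find_non_body_text` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class UnitTextCollector(HTMLParser):
    """Collects text nodes as unit text sections (an assumed definition)."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.sections = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace so equal text compares equal.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.sections.append(text)


def unit_text_sections(html):
    """Return the unit text sections found in one page, in document order."""
    collector = UnitTextCollector()
    collector.feed(html)
    collector.close()
    return collector.sections


def find_non_body_text(pages, threshold):
    """Return the sections that occur on more than `threshold` pages."""
    counts = Counter()
    for html in pages:
        # Assumption: a section is counted once per page, not per repetition.
        counts.update(set(unit_text_sections(html)))
    return {text for text, count in counts.items() if count > threshold}
```

Calling `find_non_body_text(pages, threshold)` on the pages of one site returns the text that repeats across pages, such as navigation bars and footers, while text unique to a page is left out. A suitable threshold depends on the size of the site, and the abstract does not specify one.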