WEB PAGE CLASSIFICATION BASED ON NOISE REMOVAL
Abstract
Systems and methods for improving the accuracy of web content classification by removing perceived noise are provided. The system receives the Uniform Resource Locator (URL) of a web page that needs to be classified and parses the web page to construct a tree containing a list of tags. Unwanted tags are removed from the list, yielding a tree that contains only the desired tags forming part of the web page. A list of hyperlinks is then extracted by processing the tree of desired tags; this list can include both unwanted (undesired or invalid) hyperlinks and valid hyperlinks. The unwanted hyperlinks are removed from the list, each valid hyperlink is categorized against a list of categories, and a final category for the web page is determined by a vector analysis of the categories assigned to the valid hyperlinks.
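The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made for illustration: the set of unwanted tags (`UNWANTED_TAGS`), the keyword table used to categorize a hyperlink (`CATEGORY_KEYWORDS`), and the reading of "vector analysis" as summing one count per category and taking the largest component. The sketch also takes already-fetched HTML instead of downloading the URL, and it tracks tag nesting while streaming instead of materializing the full tag tree.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed noise tags; the abstract does not enumerate them.
UNWANTED_TAGS = {"script", "style", "noscript", "iframe", "nav", "footer"}

# Hypothetical keyword table standing in for a real hyperlink categorizer.
CATEGORY_KEYWORDS = {
    "sports": ("sport", "football", "league"),
    "news": ("news", "politics", "world"),
    "shopping": ("shop", "cart", "store"),
}
CATEGORIES = sorted(CATEGORY_KEYWORDS)


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that sit outside unwanted subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an unwanted tag
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag in UNWANTED_TAGS:
            self.skip_depth += 1
        elif tag == "a" and self.skip_depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in UNWANTED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1


def valid_links(hrefs, base_url):
    """Drop fragment-only and non-HTTP(S) links; resolve the rest to absolute URLs."""
    kept = []
    for href in hrefs:
        href = href.strip()
        if not href or href.startswith("#"):
            continue
        absolute = urljoin(base_url, href)
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc:
            kept.append(absolute)
    return kept


def categorize(url):
    """Return the first category whose keyword appears in the URL, else None."""
    lowered = url.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return None


def classify_page(html, base_url):
    """Sum per-link category counts into one vector and return its largest component."""
    parser = LinkCollector()
    parser.feed(html)
    totals = [0] * len(CATEGORIES)
    for url in valid_links(parser.hrefs, base_url):
        category = categorize(url)
        if category is not None:
            totals[CATEGORIES.index(category)] += 1
    if not any(totals):
        return None  # no categorized links, so no decision
    return CATEGORIES[totals.index(max(totals))]
```

Calling `classify_page(html_text, "http://example.com/")` returns one of the category names, or `None` when no valid hyperlink matches a category. Links inside a `<nav>` or `<footer>` block are ignored, which is the point of the noise-removal step: site-wide navigation links would otherwise outvote the links that reflect the page's own content.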