To remove webpage noise more thoroughly and reduce its effect on the accuracy of news content extraction, we propose two cleaning methods: a template page-based method for identical noise blocks, and a class attribute-based method for similar noise blocks and special noise blocks. Building on these, and exploiting the characteristic content layout structure of news webpages, we present a news content extraction algorithm based on the beginning block and the end block. Experimental results show that, compared with existing algorithms, the proposed algorithm achieves higher extraction accuracy, handles body text stored in either a single block or multiple blocks, and effectively solves the extraction problem for short body text.
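The idea behind template page-based cleaning of identical noise blocks can be illustrated with a minimal sketch. It assumes that each page has already been split into text blocks and that a block counts as noise when it also appears, up to whitespace, in a template page from the same site. The block representation and the names `normalize` and `remove_same_noise_blocks` are hypothetical and chosen for illustration only; the abstract does not specify how the paper segments pages or matches blocks.

```python
from typing import List


def normalize(block: str) -> str:
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(block.split())


def remove_same_noise_blocks(page_blocks: List[str],
                             template_blocks: List[str]) -> List[str]:
    """Drop every page block that also occurs in the template page.

    Assumption: blocks shared with the template (navigation bars, footers)
    are noise; blocks unique to the page are kept in their original order.
    """
    template_set = {normalize(b) for b in template_blocks}
    return [b for b in page_blocks if normalize(b) not in template_set]


# Hypothetical example data: a template page and a news page from the same site.
template = ["Home | World | Sports", "Copyright Example News"]
page = [
    "Home | World | Sports",
    "Headline of the story",
    "First paragraph.",
    "Copyright  Example News",  # same footer, different spacing
]
cleaned = remove_same_noise_blocks(page, template)
```

In this sketch only the headline and the first paragraph survive. Similar and special noise blocks would not be caught this way, which is the gap the class attribute-based method is meant to cover.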