To acquire information about real-world events published on the Internet efficiently and conveniently, an unsupervised web event extraction framework is proposed. The framework exploits the parallel structure of the DOM tree to extract events from table pages, then uses those events as seeds to summarize the corresponding patterns in detail pages, and finally applies the summarized patterns to extract events from detail pages. The framework was applied to a large number of websites, and its results were compared with those of a commonly used wrapper-generation algorithm. The results show that the framework is effective and that its extraction quality on detail pages is better than that of the wrapper algorithm.
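The abstract does not give the algorithm itself, so the following is only a minimal sketch of what "parallel structure" extraction from a table page could look like, not the authors' implementation. It assumes that sibling DOM subtrees with identical tag shapes are records of the same kind, and that the input HTML has explicit closing tags. The names `Node`, `TreeBuilder`, `extract_records`, and `min_repeat` are hypothetical and introduced purely for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    """A bare-bones DOM element: tag name, child elements, and own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []

    def signature(self):
        # Structural shape only: the tag plus the shapes of its child elements.
        return self.tag + "(" + ",".join(c.signature() for c in self.children) + ")"

    def fields(self):
        # Own text first, then the text of descendants, depth-first.
        out = list(self.text)
        for child in self.children:
            out.extend(child.fields())
        return out


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every opened tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def extract_records(html, min_repeat=2):
    """Return one list of text fields per structurally repeated sibling subtree."""
    builder = TreeBuilder()
    builder.feed(html)
    records = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        counts = Counter(c.signature() for c in node.children)
        # Siblings are "parallel" if they have children and share a shape.
        parallel = [
            c for c in node.children
            if c.children and counts[c.signature()] >= min_repeat
        ]
        if parallel:
            records.extend(c.fields() for c in parallel)
        else:
            stack.extend(reversed(node.children))
    return records


page = """
<table>
  <tr><td>2011-05-01</td><td>Beijing</td><td>Spring Concert</td></tr>
  <tr><td>2011-05-08</td><td>Shanghai</td><td>Book Fair</td></tr>
</table>
"""
for record in extract_records(page):
    print(record)
```

On this made-up sample page, each `<tr>` has the same shape, so each row is returned as one candidate event record. In the framework described above, records like these would then serve as seeds for summarizing patterns in detail pages; that second stage is not sketched here because the abstract gives no details of how the patterns are represented.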