International Conference on Data Engineering

Bootstrapping Semantic Annotation for Content-Rich HTML Documents

Abstract

An enormous amount of semantic data is still encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable to Semantic Web processing. In this paper we describe a highly automated technique for annotating HTML documents, especially template-based, content-rich documents that contain many different semantic concepts per document. Starting with a small seed of hand-labeled instances of semantic concepts in a set of HTML documents, we bootstrap an annotation process that automatically identifies unlabeled concept instances present in other documents. The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality to learn a statistical model for accurately identifying different semantic concepts in HTML documents drawn from a variety of Web sources. We also present experimental results on the effectiveness of the technique.
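
The bootstrapping loop described in the abstract might be sketched roughly as follows. The abstract does not say which statistical model or features the authors use, so this sketch makes several assumptions: it swaps in a naive Bayes classifier with add-one smoothing, it describes each page element by three hypothetical categorical features (`tag`, `style`, `region`) standing in for presentation style and spatial locality, and it accepts a predicted label only above an assumed confidence threshold. None of these names or values come from the paper.

```python
import math
from collections import Counter, defaultdict


def train(labeled):
    """Count label and (feature, value) frequencies from (features, label) pairs."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)  # label -> Counter of (feature, value)
    for feats, label in labeled:
        label_counts[label] += 1
        for item in feats.items():
            feat_counts[label][item] += 1
    return label_counts, feat_counts


def predict(model, feats):
    """Return (best_label, posterior) under naive Bayes with add-one smoothing."""
    label_counts, feat_counts = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)
        for item in feats.items():
            # Smoothed estimate of P(feature=value | label); assumed, not from the paper.
            score += math.log((feat_counts[label][item] + 1) / (n + 2))
        log_scores[label] = score
    best = max(log_scores, key=log_scores.get)
    # Normalize in log space so the best label's posterior is 1 / norm.
    norm = sum(math.exp(s - log_scores[best]) for s in log_scores.values())
    return best, 1.0 / norm


def bootstrap(seeds, unlabeled, threshold=0.8, max_rounds=5):
    """Grow the labeled set from hand-labeled seeds by repeated self-training."""
    labeled = list(seeds)
    pending = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        accepted, remaining = [], []
        for feats in pending:
            label, conf = predict(model, feats)
            if conf >= threshold:
                accepted.append((feats, label))
            else:
                remaining.append(feats)
        if not accepted:  # nothing confident enough: stop early
            break
        labeled.extend(accepted)
        pending = remaining
    return labeled, pending


# Hypothetical seed instances and unlabeled elements from other documents.
seeds = [
    ({"tag": "h1", "style": "bold", "region": "top"}, "headline"),
    ({"tag": "span", "style": "plain", "region": "bottom"}, "price"),
]
unlabeled = [
    {"tag": "h1", "style": "bold", "region": "top"},
    {"tag": "span", "style": "plain", "region": "bottom"},
]
labeled, pending = bootstrap(seeds, unlabeled)
for feats, label in labeled:
    print(label, feats)
```

Elements that never reach the threshold stay in `pending` rather than being forced into a label, which keeps early mistakes from being fed back into later training rounds.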
