Journal: Computer Science & Information Technology

A Semantic Based Approach for Information Retrieval from Html Documents Using Wrapper Induction Technique

Abstract

Most internet applications are built using web technologies such as HTML. Web pages typically either display data records drawn from an underlying database or present text in an unstructured form, in both cases following a fixed template. Summarizing data dispersed across different web pages is tedious and demands considerable time and manual effort. Wrapper induction, a supervised learning technique, can be applied across web pages to learn data extraction rules, and applying the learned rules to web pages makes information extraction easier. This paper focuses on developing a tool for extracting information from unstructured data. Semantic web technologies considerably simplify the process. The tool makes it possible to query data scattered over multiple web pages in several different ways. This is accomplished in the following steps: extracting the data from multiple web pages, storing them as RDF triples, integrating multiple RDF files using an ontology, generating a SPARQL query from the user's query, and generating a report in the form of tables or charts from the results of the SPARQL query. The relationships between related web pages are identified using the ontology and used to improve querying, thereby enhancing search effectiveness.
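The first two steps listed in the abstract (learning extraction rules from labeled pages, then storing the extracted values as triples) can be illustrated with a minimal sketch. It is not the tool described in the paper: the sample pages, the function names `learn_lr_rule`, `extract` and `match`, and the `book:` and `dc:creator` labels are all assumptions made for illustration. The rule learner is a simplified left/right delimiter wrapper, and a plain list of tuples with a pattern matcher stands in for an RDF store queried with SPARQL (in practice a library such as rdflib would fill that role).

```python
import os


def common_suffix(strings):
    """Longest suffix shared by all strings (via prefix of the reversed strings)."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def learn_lr_rule(examples):
    """Learn a (left, right) delimiter pair from (page, target_value) examples.

    The left delimiter is the longest text that always precedes the target,
    the right delimiter the longest text that always follows it.
    """
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)  # first occurrence of the labeled value
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    left, right = common_suffix(lefts), os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("examples share no usable delimiters")
    return left, right


def extract(page, rule):
    """Apply a learned rule to a page and return every delimited value."""
    left, right = rule
    values, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        pos = end + len(right)
    return values


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


# Hypothetical labeled training pages that share one template.
page_a = "<li><b>Dune</b> by <i>Herbert</i></li>"
page_b = "<p>New:</p><li><b>Emma</b> by <i>Austen</i></li>"

title_rule = learn_lr_rule([(page_a, "Dune"), (page_b, "Emma")])
author_rule = learn_lr_rule([(page_a, "Herbert"), (page_b, "Austen")])

# An unseen page built from the same template.
new_page = ("<ul><li><b>Ulysses</b> by <i>Joyce</i></li>"
            "<li><b>Beloved</b> by <i>Morrison</i></li></ul>")
titles = extract(new_page, title_rule)
authors = extract(new_page, author_rule)

# Store the extracted records as (subject, predicate, object) triples.
triples = [(f"book:{t}", "dc:creator", a) for t, a in zip(titles, authors)]

# Pattern query: which books were written by Joyce?
joyce_books = match(triples, p="dc:creator", o="Joyce")
```

Learning delimiters from only two examples works here because both pages follow the same template; real pages would likely need more examples and a more robust rule language.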
