Conference: International Conference on Reliability, Infocom Technologies and Optimization

Stanford parser based approach for extraction of Link-Context from non-descriptive Anchor-Text

Abstract

Link context analysis has been widely explored as a way to determine the context of a target web page. Most researchers, however, have considered only descriptive (meaningful) anchor text and have ignored non-descriptive anchor text. An examination of the World Wide Web shows that a good percentage of web pages can be reached by following non-descriptive anchor text. This paper therefore proposes and implements an algorithm for Link Context Determination (LCD) that determines the context of non-descriptive anchor text. First, a corpus of web pages belonging to a common domain was assembled. The pages were then analyzed manually to discover the relationships between each anchor text and the words in its vicinity. Based on these relationships, a number of rules were formed and represented as a tree. The proposed LCD architecture has three components: (1) the Stanford parser, (2) the rules, and (3) the link context determiner. An input sentence is given to the Stanford parser, which creates a parse tree for it. The link context determiner then uses this parse tree, together with the appropriate rules tree, to determine the link context. The approach was implemented and validated on a limited sample of non-descriptive anchor texts (ATs). In these tests, the proposed LCD extracted the actual link context of every non-descriptive AT considered (100%).
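The pipeline described in the abstract (parse tree in, rule applied, link context out) can be illustrated with a minimal sketch. The abstract does not list the paper's actual rules, so the single rule used here is purely hypothetical: climb from the anchor word to the nearest enclosing clause (S) and collect the adjectives and nouns inside it. The bracketed parse is hand-written in the Penn Treebank style that the Stanford parser emits, not real parser output, and the names parse_bracketed, path_to_leaf, leaves_with_tags, and link_context are assumptions made for illustration.

```python
# Hand-written Penn Treebank style parse of
# "Click here to download the annual report."
# (assumed for illustration; not actual Stanford parser output)
PARSE = (
    "(ROOT (S (VP (VB Click) (ADVP (RB here)) "
    "(S (VP (TO to) (VP (VB download) "
    "(NP (DT the) (JJ annual) (NN report)))))) (. .)))"
)


def parse_bracketed(text):
    """Turn a bracketed parse string into nested (label, children) tuples.

    Leaf words are stored as plain strings inside their tag node.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def path_to_leaf(tree, word, path=()):
    """Return the tuple of nodes from the root down to the tag node of `word`."""
    _, children = tree
    for child in children:
        if isinstance(child, str):
            if child.lower() == word.lower():
                return path + (tree,)
        else:
            found = path_to_leaf(child, word, path + (tree,))
            if found:
                return found
    return None


def leaves_with_tags(tree):
    """Yield (part-of-speech tag, word) pairs in sentence order."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield label, child
        else:
            yield from leaves_with_tags(child)


def link_context(tree, anchor):
    """Hypothetical rule: nearest enclosing S clause, keep JJ* and NN* words."""
    path = path_to_leaf(tree, anchor)
    if path is None:
        return []
    for node in reversed(path):
        if node[0] == "S":
            return [
                word
                for tag, word in leaves_with_tags(node)
                if tag.startswith(("NN", "JJ"))
            ]
    return []


tree = parse_bracketed(PARSE)
print(link_context(tree, "here"))  # ['annual', 'report']
```

In a real setup the bracketed string would come from the parser, and the single hard-coded rule would be replaced by a lookup in a rules tree like the one the paper describes.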
