To improve the efficiency and harvest ratio of topic-focused web crawlers, this paper proposes an information search method based on topic-semantic URLs. The method maps each seed URL onto a topic node of a topic tree and expands the seed URL's semantics with the topic text along the topic path, guiding the crawler to fetch topic pages efficiently and accurately. It also uses link-importance and page-importance factors to automatically select and breed good new seed URLs during crawling. The paper focuses on the principle of this search method and its implementation in the system. Experimental results show that the method effectively improves the crawler's search efficiency and harvest ratio, and that the seed links it selects and breeds perform well.
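The abstract does not give the data structures or scoring formulas, so the following is only a minimal sketch of the semantic-expansion idea under stated assumptions. The `TopicNode` class, the `topic_path_terms` and `relevance` helpers, the example topics, and the `example.com` URL are all hypothetical, and the word-overlap score is a simple stand-in rather than the link-importance or page-importance factors used in the paper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class TopicNode:
    """One node of a hypothetical topic tree (not the paper's own structure)."""
    text: str
    parent: Optional["TopicNode"] = None
    children: list["TopicNode"] = field(default_factory=list)

    def add_child(self, text: str) -> "TopicNode":
        child = TopicNode(text, parent=self)
        self.children.append(child)
        return child


def topic_path_terms(node: TopicNode) -> set[str]:
    """Collect lowercase terms from the node up to the root (the topic path)."""
    terms: set[str] = set()
    current: Optional[TopicNode] = node
    while current is not None:
        terms.update(current.text.lower().split())
        current = current.parent
    return terms


def relevance(anchor_text: str, terms: set[str]) -> float:
    """Assumed score: fraction of anchor-text words found in the path terms."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    return sum(w in terms for w in words) / len(words)


# Build a tiny illustrative topic tree: sports -> football -> premier league.
root = TopicNode("sports")
football = root.add_child("football")
league = football.add_child("premier league")

# Map a (hypothetical) seed URL to a topic node and expand its semantics
# with the text of every node on the path back to the root.
seed_semantics = {"http://example.com/league": topic_path_terms(league)}
terms = seed_semantics["http://example.com/league"]

# Candidate links whose anchor text overlaps the expanded semantics score
# higher and would be crawled first.
on_topic = relevance("premier league fixtures", terms)
off_topic = relevance("stock market news", terms)
```

In this sketch a crawler would order its frontier by `relevance`, and a link that keeps yielding high-scoring pages would be a natural candidate for promotion to a new seed. How the paper combines link importance and page importance for that promotion is not stated in the abstract.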