At present, focused crawlers usually crawl pages using either the link structure or the page contents, but each approach has its flaws. We therefore designed an efficient crawling strategy that combines link structure with content similarity. We extract the topic feature vector automatically and judge the topic similarity of a page using a combination of link structure and page content. We also predict the topic similarity of URLs using the link structure among topic pages. Experiments show that this strategy effectively increases the precision of fetching topic pages.
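
As a rough illustration of how a content score and a link score might be combined, the following minimal sketch scores a page by a weighted sum of the two. It is not the method evaluated in the experiments: the function names `cosine` and `page_score`, the weight `alpha`, the whitespace tokenization, and the use of the mean score of linking pages as the link signal are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def page_score(topic_vec, page_text, parent_scores, alpha=0.5):
    """Combine content similarity with a link-based score.

    topic_vec     -- term-frequency vector describing the topic
    page_text     -- plain text of the candidate page
    parent_scores -- topic scores of pages linking to it (assumed signal)
    alpha         -- hypothetical weight between content and link evidence
    """
    # Content evidence: similarity of the page's terms to the topic vector.
    content = cosine(topic_vec, Counter(page_text.lower().split()))
    # Link evidence: mean score of linking pages, 0.0 if none are known.
    link = sum(parent_scores) / len(parent_scores) if parent_scores else 0.0
    return alpha * content + (1 - alpha) * link


topic = Counter({"crawler": 2, "web": 1})
on_topic = page_score(topic, "web crawler crawler", [1.0])
off_topic = page_score(topic, "cooking recipes", [])
```

In a crawler, scores like these would typically order the URL frontier so that links found on high-scoring pages are fetched first. The weight `alpha` would need tuning for a given topic and corpus.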