Open Source Forge (OSF) websites provide information on massive open source software projects, extracting these web data is important for open source research. Traditional extraction methods use string matching among pages to detect page template, which is time-consuming. A recent work published in VLDB exploits redundant entities among websites to detect web page coordinates of these entities. The experiment gives good results when these coordinates are used for extracting other entities of the target site. However, OSF websites have few redundant project entities. This paper proposes a modified version of that redundancy-based method tailored for OSF websites, which relies on a similar yet weaker presumption that entity attributes are redundant rather than whole entities. Like the previous work, we also construct a seed database to detect web page coordinates of the redundancies, but all at the attribute-level. In addition, we apply attribute name verification to reduce false positives during extraction. The experiment result indicates that our approach is competent in extracting OSF websites, in which scenario the previous method can not be applied.
展开▼