HTML pages are designed to convey semantic information to human users through visual emphases, demarcations, spatial cues and repeating patterns, which act as "perceptual markup". This human-centric syntax is not easy for machines to identify. Naturally occurring HTML, especially the machine-generated variety, rarely follows strict markup rules and provides no semantic cues. The visual cues humans use to extract information from a Web page, however, must be reflected in the page's markup. If a human could convey the relationship between visual cues, available to the program as markup patterns, and semantic categories, passed to the program as user-supplied labels, the program would thereby be instructed in "how to extract information from that page". Scriptor is a program which, run in tandem with a Web browser, allows a user to interactively design a data extraction script for a Web site. It is intended for highly structured, repetitive information such as is found in classified listings, online stores, tables of weather, stock or airline schedules, course listings, and other similar sources. Scriptor interleaves a variety of learning methods so that extraction rules can be specified with extremely simple interactions. These consist of repeating-pattern recognition, supervised learning, deictics through highlighting, and dialogs in which the user selects the desired result from a set of possible extraction rules. Learning is augmented by direct instructions such as: "label text following '~' as 'Author'". Performance data for the authors and for naive subjects are presented for a collection of Web pages, showing the potential of this form of highly interactive instruction. Our results demonstrate that very simple programming-by-example techniques can generate effective parse rules in highly repetitive domains.