Source: IEEE International Conference on Systems, Man, and Cybernetics

Scriptor: using deictics, dialog, and supervised learning to convey instructions

Abstract

HTML pages are designed to convey semantic information to human users through visual emphases, demarcations, spatial cues and repeating patterns, which act as "perceptual markup". This human-centric syntax is not easy for machines to identify. Naturally occurring HTML, especially the machine-generated variety, rarely follows strict markup rules and provides no semantic cues. The visual cues humans use to extract information from a Web page, however, must be reflected in the page's markup. If a human could convey the relationship between visual cues, available to the program as markup patterns, and semantic categories, passed to the program as user-supplied labels, the program would have been instructed in "how to extract information from that page". Scriptor is a program which, run in tandem with a Web browser, allows a user to interactively design a data extraction script for a Web site. It is intended for highly structured, repetitive information such as is found in classified listings, online stores, tables for weather, stock or airline schedules, course listings, and other similar sources. Scriptor interleaves a variety of learning methods to allow the specification of extraction rules using extremely simple methods. These consist of repeating pattern recognition, supervised learning, deictics through highlighting, and dialogs in which the user selects the desired result for a set of possible extraction rules. Learning is augmented by direct instructions such as: "label text following '~' as 'Author'". Performance data for the authors and for naive subjects are presented for a collection of Web pages, showing the potential of this form of highly interactive instruction. Our results demonstrate that very simple programming-by-example techniques can generate effective parse rules in highly repetitive domains.
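
To make the idea of a direct instruction concrete, the following is a minimal sketch of how a rule of the form "label text following a marker as 'Author'" might be applied to a repetitive listing. It is not Scriptor's implementation, which the abstract does not describe. The marker is taken to be a tilde, and the function `label_following`, the class `TextChunks`, and the sample HTML are all hypothetical names and data introduced only for illustration.

```python
import re
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect the visible text chunks of an HTML fragment, in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def label_following(html, marker, label):
    """Hypothetical rule: in each text chunk, label whatever follows `marker`."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    # Escape the marker so that it is matched literally, not as a regex.
    pattern = re.compile(re.escape(marker) + r"\s*(.+)")
    records = []
    for chunk in parser.chunks:
        match = pattern.search(chunk)
        if match:
            records.append({label: match.group(1).strip()})
    return records


# Invented sample data: a repetitive listing with a visual separator.
SAMPLE = """
<ul>
  <li><b>Learning Web Wrappers</b> ~ A. Example</li>
  <li><b>Parsing Messy Markup</b> ~ B. Sample</li>
</ul>
"""

records = label_following(SAMPLE, "~", "Author")
for record in records:
    print(record)
```

The sketch relies on the page being highly repetitive: one rule stated once is applied to every list item, which is the kind of domain the abstract says the approach targets.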