Record-Boundary Discovery in Web Documents

Abstract

Extraction of information from unstructured or semistructured Web documents often requires recognition and delimitation of records. (By "record" we mean a group of information relevant to some entity.) Without first chunking documents that contain multiple records according to record boundaries, extraction of record information is unlikely to succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (runs linearly for practical cases within the context of the larger data-extraction problem) and accurate (100% in the experiments we conducted).
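The pipeline outlined in the abstract (tag tree, subtree of interest, candidate separator tags) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the subtree is located or what the five heuristics are, so the sketch makes two labeled assumptions: it picks the node with the most direct children as the subtree of interest, and it ranks candidate tags by a single stand-in criterion, how often each tag appears among that node's children. The names `Node`, `TagTreeBuilder`, `build_tag_tree`, `highest_fanout`, `rank_separators`, and `SAMPLE` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never take a closing tag; they are stored as leaves.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element in the tree of nested HTML tags."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from start/end tag events, ignoring text."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray closes.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def highest_fanout(node):
    """ASSUMPTION: the subtree of interest is the node with most children."""
    best = node
    for child in node.children:
        candidate = highest_fanout(child)
        if len(candidate.children) > len(best.children):
            best = candidate
    return best


def rank_separators(subtree):
    """Stand-in heuristic (not from the abstract): most frequent child tag first."""
    counts = Counter(child.tag for child in subtree.children)
    return [tag for tag, _ in counts.most_common()]


# Invented sample page with three records separated by <hr>.
SAMPLE = (
    "<html><body><h1>Notices</h1><div>"
    "<hr><b>A</b> died 1 May"
    "<hr><b>B</b> died 2 May<br>Funeral Friday"
    "<hr><b>C</b> died 3 May<hr>"
    "</div></body></html>"
)

subtree = highest_fanout(build_tag_tree(SAMPLE))
print(subtree.tag, rank_separators(subtree))
```

In the approach the abstract describes, several such rankings would be produced independently and then combined into a consensus choice; the sketch stops at a single ranking.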

Bibliographic details

Source: ACM SIGMOD International Conference on Management of Data, 1999, 12 pages
Authors: D. W. Embley; Y. Jiang; Y.-K. Ng
Original format: PDF
Chinese Library Classification: TP3-532
