Characteristic Sets of Strings Common to Semi-structured Documents

Abstract

We consider the Maximum Agreement Problem, which is, given positive and negative documents, to find a characteristic set that matches many of the positive documents but rejects many of the negative ones. A characteristic set is a sequence (x_1, ..., x_d) of strings such that each x_i is a suffix of x_{i+1} and all x_i's appear in a document without overlaps. A characteristic set matches semi-structured documents with primitives or user-defined macros. For example, ("set", "characteristic set", " characteristic set") is a characteristic set extracted from an HTML file. However, an algorithm that solves the Maximum Agreement Problem does not output useless characteristic sets, such as those made only of HTML tags, since such characteristic sets may match most of the positive documents but also match most of the negative ones. We present an algorithm that, given an integer d, the number of strings in a characteristic set, solves the Maximum Agreement Problem in O(n^2 h^d) time, where n is the total length of the documents and h is the height of the suffix tree of the documents.
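
The definitions in the abstract can be illustrated with a small brute-force sketch. It is not the suffix-tree algorithm the paper presents and makes no attempt to reach the O(n^2 h^d) bound. Two points are assumptions, since the abstract does not spell them out: a document is taken to match when one occurrence of each string can be chosen so that no two chosen occurrences overlap (in any order), and the agreement of a characteristic set is taken to be the number of positive documents it matches plus the number of negative documents it rejects. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]


def is_suffix_chain(strings: Sequence[str]) -> bool:
    """True if each string is a suffix of the next one in the sequence."""
    return all(b.endswith(a) for a, b in zip(strings, strings[1:]))


def occurrences(pattern: str, text: str) -> List[Span]:
    """All (start, end) spans of pattern in text, overlapping ones included."""
    spans = []
    start = text.find(pattern)
    while start != -1:
        spans.append((start, start + len(pattern)))
        start = text.find(pattern, start + 1)
    return spans


def matches(strings: Sequence[str], text: str) -> bool:
    """Assumed semantics: pick one occurrence per string, pairwise non-overlapping."""
    if not is_suffix_chain(strings):
        return False
    candidates = [occurrences(s, text) for s in strings]

    def place(i: int, used: List[Span]) -> bool:
        # Backtracking: try every occurrence of the i-th string that
        # does not overlap any span already chosen.
        if i == len(candidates):
            return True
        for start, end in candidates[i]:
            if all(end <= u_start or u_end <= start for u_start, u_end in used):
                if place(i + 1, used + [(start, end)]):
                    return True
        return False

    return place(0, [])


def agreement(strings: Sequence[str],
              positives: Sequence[str],
              negatives: Sequence[str]) -> int:
    """Assumed score: matched positives plus rejected negatives."""
    hits = sum(matches(strings, doc) for doc in positives)
    rejections = sum(not matches(strings, doc) for doc in negatives)
    return hits + rejections
```

Under these assumptions, ("set", "characteristic set") matches "a characteristic set is a set", because the final "set" lies outside the occurrence of "characteristic set". It does not match "a characteristic set", where the only "set" sits inside the longer string.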

Bibliographic details

Source: International Conference on Discovery Science, 1999, 9 pages
Author: Daisuke Ikeda
Original format: PDF
Chinese Library Classification: Display technology

