Venue: International Conference on Discovery Science

Characteristic Sets of Strings Common to Semi-structured Documents

Abstract

We consider the Maximum Agreement Problem: given positive and negative documents, find a characteristic set that matches many of the positive documents but rejects many of the negative ones. A characteristic set is a sequence (x_1, ..., x_d) of strings such that each x_i is a suffix of x_{i+1} and all the x_i appear in a document without overlaps. A characteristic set matches semi-structured documents with primitives or user-defined macros. For example, ("set", "characteristic set", " characteristic set") is a characteristic set extracted from an HTML file. However, an algorithm that solves the Maximum Agreement Problem does not output useless characteristic sets, such as those consisting only of HTML tags, since such sets may match most of the positive documents but also most of the negative ones. We present an algorithm that, given an integer d (the number of strings in a characteristic set), solves the Maximum Agreement Problem in O(n^2 h^d) time, where n is the total length of the documents and h is the height of the suffix tree of the documents.
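
The definitions in the abstract can be illustrated with a small brute-force sketch in Python. This is not the paper's O(n^2 h^d) suffix-tree algorithm; it only checks whether a given characteristic set matches a document and counts agreement. All function names are hypothetical. The sketch assumes the strings are non-empty, that their occurrences may appear in any order in the document (the abstract does not specify an order), and that "agreement" means the number of positive documents matched plus the number of negative documents rejected.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open interval [start, end) in a document


def is_suffix_chain(xs: Sequence[str]) -> bool:
    """True if each x_i is a suffix of x_{i+1}."""
    return all(xs[i + 1].endswith(xs[i]) for i in range(len(xs) - 1))


def occurrences(text: str, pattern: str) -> List[Span]:
    """All spans where pattern occurs in text, including overlapping ones."""
    spans = []
    start = text.find(pattern)
    while start != -1:
        spans.append((start, start + len(pattern)))
        start = text.find(pattern, start + 1)
    return spans


def matches(xs: Sequence[str], doc: str) -> bool:
    """True if xs is a suffix chain and every string can be placed in doc
    so that the chosen occurrences are pairwise non-overlapping.
    Assumption: occurrences may appear in any order."""
    if not is_suffix_chain(xs):
        return False
    # Place longer strings first; they have fewer candidate positions.
    candidates = [occurrences(doc, x) for x in sorted(xs, key=len, reverse=True)]

    def place(i: int, used: List[Span]) -> bool:
        if i == len(candidates):
            return True
        for s, e in candidates[i]:
            # Two half-open spans are disjoint if one ends before the other starts.
            if all(e <= us or ue <= s for us, ue in used):
                if place(i + 1, used + [(s, e)]):
                    return True
        return False

    return place(0, [])


def agreement(xs: Sequence[str], positives: Sequence[str], negatives: Sequence[str]) -> int:
    """Assumed objective: positives matched plus negatives rejected."""
    matched = sum(1 for d in positives if matches(xs, d))
    rejected = sum(1 for d in negatives if not matches(xs, d))
    return matched + rejected
```

With xs = ("set", "characteristic set"), the document "a characteristic set is a set" matches, because the second "set" does not overlap "characteristic set". The document "characteristic set" alone does not match, since its only occurrence of "set" lies inside the longer string.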
