We consider Maximum Agreement Problem which is, given positive and negative documents, to find a characteristic set that matches many of positive documents but rejects many of negative ones. A characteristic set is a sequence (x_1, ..., x_d) of strings such that each x_i is a suffix of x_i+1 and all x_i's appear in a document without overlaps. A characteristic set matches semi-structured documetns with primitives or user's defined macros. For example, ("set", "characteristic set", " characteristic set") is a characteristic set extracted from an HTML file. But, an algorithm that solves Maximum Agreement Problem does not output useless characteristic sets, such as those made of only tags of HTML, since such characteristic sets may match most of positive documents but also match most of negative ones. We present an algorithm that, given an integer d which is the number of strings in a characteristic set, sovles Maximum Agreement Problem in O(n~2h~d) time, where n is the total length of documents and h is the height of the suffix tree of the documents.
展开▼