Tables in HTML Web pages have become valuable knowledge sources. It is therefore both reasonable and necessary to develop an algorithm that extracts knowledge from them. To do so, we need a system that identifies the boundary between the attributes and the values of an HTML table. In this paper, we propose an algorithm for this purpose. The underlying idea is that a row (or column) with low similarity to the other rows (or columns) is probably an attribute-name row (or column), whereas the remaining rows (or columns) hold value data. The algorithm based on this idea achieves 82% accuracy in lengthways recognition and 78% accuracy in sideways recognition on 300 HTML tables taken from Web pages downloaded from the Web.
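The low-similarity idea can be illustrated with a minimal sketch. This is not the implementation evaluated in the paper: the abstract does not specify the similarity measure, so the coarse cell-type feature (`cell_kind`), the position-wise match ratio (`row_similarity`), the threshold of 0.5, and all function names below are assumptions chosen only for illustration. The table is assumed to be already parsed into a list of rows of cell strings.

```python
from typing import List, Optional

Table = List[List[str]]


def cell_kind(cell: str) -> str:
    """Coarse cell type (assumed feature; the paper's real features are not given)."""
    text = cell.strip().replace(",", "")
    if not text:
        return "empty"
    try:
        float(text)
        return "number"
    except ValueError:
        return "text"


def row_similarity(a: List[str], b: List[str]) -> float:
    """Fraction of aligned cells whose coarse kinds agree (assumed measure)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(cell_kind(x) == cell_kind(y) for x, y in zip(a, b))
    return matches / n


def mean_similarities(rows: Table) -> List[float]:
    """Average similarity of each row to every other row."""
    scores = []
    for i, row in enumerate(rows):
        others = [row_similarity(row, other)
                  for j, other in enumerate(rows) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


def find_attribute_row(rows: Table, threshold: float = 0.5) -> Optional[int]:
    """Index of the least similar row if its score is below the (assumed)
    threshold; None when no row stands out."""
    scores = mean_similarities(rows)
    if not scores:
        return None
    idx = min(range(len(scores)), key=scores.__getitem__)
    return idx if scores[idx] < threshold else None


def find_attribute_column(rows: Table, threshold: float = 0.5) -> Optional[int]:
    """Same heuristic applied to columns by transposing the table."""
    columns = [list(col) for col in zip(*rows)]
    return find_attribute_row(columns, threshold)
```

For a table whose first row is `Name, Age, Height` and whose remaining rows each hold a name followed by two numbers, `find_attribute_row` flags the first row, because its all-text cells match the other rows in only one of three positions. A table whose rows all look alike yields `None`, meaning no attribute row is detected under these assumptions.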