Pattern-matching based document compression systems rely on finding a small set of patterns that can be used to represent the whole document. When analyzing and comparing this kind of system two factors have to be considered: the compression rate attained and the speed and associated complexity of the codebook building. In order to reduce the computational burden of the pattern matching operation while keeping a good compression ratio, we propose a new fast algorithm to carry out a WAN (weighted AND-NOT) matching process. Thus, codebook building is performed in two stages: the first step is based on FWAN (fast WAN) with a loose threshold; in the second one a more accurate but slower method (CTM, EPM) is applied over the initial approximate codebook. This screening greatly reduces the search space for the clustering procedure implicit in obtaining the library without altering the compression ratio. Experimental results show a very good speed performance for this new algorithm: at least three times faster than the usual WAN.
展开▼