The Shortest Superstring Problem (SSP) consists, for a set of strings S = {s_1, ? ? ? , s_n} (with no s_i substring of s_j), to find a minimum length string that contains all s_i, 1 ≤ i ≤ n, as substrings. This problem is proved to be NP-Complete and APX-hard. Guaranteed approximation algorithms have been proposed, the current best ratio being 2 (11)/(30), which has been achieved through a long and difficult process. SSP is highly used in practice on Next Generation Sequencing (NGS) data, which plays an increasingly important role in modern biological and medical research. In this note, we show that on NGS data the SSP approximation ratio reached by the classical algorithm of Blum et al. [2], is usually below 2 (11)/(30), while assuming specific characteristics of the data that are experimentally verified on a large sampling set. Moreover, we present an efficient linear time test for any input of strings of equal length, which allows to compute the approximation ratio that can be reached using the classical algorithm in [2].
展开▼