The tremendous expanse of search engines, dictionary and thesaurus storage,and other text mining applications, combined with the popularity of readilyavailable scanning devices and optical character recognition tools, hasnecessitated efficient storage, retrieval and management of massive textdatabases for various modern applications. For such applications, we propose anovel data structure, INSTRUCT, for efficient storage and management ofsequence databases. Our structure uses bit vectors for reusing the storagespace for common triplets, and hence, has a very low memory requirement.INSTRUCT efficiently handles prefix and suffix search queries in addition tothe exact string search operation by iteratively checking the presence oftriplets. We also propose an extension of the structure to handle substringsearch efficiently, albeit with an increase in the space requirements. Thisextension is important in the context of trie-based solutions which are unableto handle such queries efficiently. We perform several experiments portrayingthat INSTRUCT outperforms the existing structures by nearly a factor of two interms of space requirements, while the query times are better. The ability tohandle insertion and deletion of strings in addition to supporting all kinds ofqueries including exact search, prefix/suffix search and substring search makesINSTRUCT a complete data structure.
展开▼