Lightweight Pattern Matching Method for DNA Sequencing in Internet of Medical Things

J. A. M. Rexie; Kumudha Raimond; Mythily Murugaaboopathy; D. Brindha; Henock Mulugeta

Abstract

DNA sequencing is an area of medical science that is gaining prominence. It has been used to detect the genetic mutations responsible for disease. This research focuses on pattern identification methodologies for DNA-sequencing problems that arise in various applications. Examples of such problems include the alignment and assembly of short reads from next-generation sequencing (NGS), the comparison of DNA sequences, and determining the frequency of a pattern in a sequence. Because DNA sequences are often subject to mutation, approximate matching is as well suited to many applications as exact matching, and recognizing pattern similarity becomes necessary. Approximate matching can also be used in virtually every application that calls for pattern matching, for example, spell-checking, spam filtering, and search engines. In the traditional approach, finding a similar pattern in a sequence of length ls with a pattern of length lp takes O(ls * lp) time. This heavy processing is caused by repeatedly comparing every character of the sequence with the pattern. The research aims to reduce the time complexity of pattern matching by introducing an approach named "optimized pattern similarity identification" (OPSI). The method constructs a table, called "shift beyond for avoiding redundant comparison" (SBARC), to bypass characters in the text that have already been compared with the pattern. The table holds the distance, in characters, to skip during matching. OPSI finds the positions in the sequence where similar patterns occur, ignoring at most e mismatches. The experiments yielded a time complexity of O(ls * e), where the allowed number of mismatches e is much smaller than the pattern length. Aspects such as the scalability, generalizability, and performance of the OPSI algorithm are discussed. Compared with the Hamming distance-based approximate pattern matching algorithm, the proposed algorithm is found to be 69% more efficient.
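
For orientation, the traditional baseline described in the abstract can be sketched as a sliding-window comparison that counts mismatches (the Hamming distance) at every start position. This is only an illustrative sketch of that baseline, not the OPSI algorithm or the SBARC table, whose details the abstract does not give. The function name `approx_matches` and its parameter names are assumptions made for this example.

```python
def approx_matches(sequence: str, pattern: str, max_mismatches: int) -> list[int]:
    """Return every start position where `pattern` aligns with `sequence`
    with at most `max_mismatches` substituted characters.

    Hypothetical helper for illustration: a plain Hamming-distance scan,
    not the OPSI/SBARC method from the paper.
    """
    ls, lp = len(sequence), len(pattern)
    hits = []
    # Try every window of length lp in the sequence.
    for start in range(ls - lp + 1):
        mismatches = 0
        for offset in range(lp):
            if sequence[start + offset] != pattern[offset]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, abandon this window
        else:
            # Inner loop finished without exceeding the mismatch budget.
            hits.append(start)
    return hits


if __name__ == "__main__":
    # With one allowed mismatch, the window "ACGA" at position 4 also qualifies.
    print(approx_matches("ACGTACGAACGT", "ACGT", 1))
```

Every window is compared against the pattern from scratch, so the worst case remains O(ls * lp); this repeated comparison is the redundancy that, according to the abstract, the SBARC table is meant to avoid.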
