Constrained Error-Correcting Codes for DNA-Based Storage Systems

More Info
expand_more

Abstract

Humanity is producing data at exponential rates. Solutions need to be found in order to accommodate the bytes of today and the future in an efficient and environmentally friendly way. DeoxyriboNucleic Acid, well-known as DNA, is a suitable destination for this major collection of data, as it is a high-density storage medium with long storage endurance [1]. In the field of research about DNA-Based Data Storage Systems, people examine quaternary codes. The considered DNA-codes consist of equi-length DNA-words over the Alphabet {0, 1, 2, 3} representing the available DNA building blocks or nucleotides {A, T,C,G} standing for Adenine, Thymine, Cytosine and Guanine respectively. Combinatorial constrained coding is introduced to avoid DNA patterns prone to sequencing errors, that might occur when archival data is read from DNA-blocks [2]. If an error occurs anyway, an imposed minimum Hamming distance d that a DNA-code satisfies, enforces error correction and detection capabilities [3]. Methods are considered to optimize the number of DNA-words that can be stored in a DNA-code satisfying these restrictions. This research explores DNA-codes of which the DNA-words adhere to a fixed GC content referred to as the GC-weight w, as well as the no-runlength constraint r = 1 that will not allow adjacent symbol repetition. By increasing distance d, we build on the findings of Van Leeuwen [4] and Vermeer [5], who created DNA-codes satisfying d = 2 for r = 1 and r ≥ 1 respectively. The maximum size of a DNA-code satisfying d = 3 among other constraints is proven to be 12. Furthermore, we cover the case d = n where the length n of the DNA-words and the minimum distance d the DNA-code satisfies are equal. We derive maximum sizes of these DNA-codes for all possible GC-weights w. Finally, we present algorithms that create DNA-d codes satisfying d = 3 and d = 4 and succeed in improving sizes of the DNA-codes created by the algorithms designed by Limbachiya et. al. [6] and Van Leeuwen [4].