Codes with relaxed weight constraints for DNA-Based storage systems

More Info
expand_more

Abstract

Nowadays, we produce an enormous amount of data and all this data needs to be stored somewhere. In order to store all this data in an efficient and environmentally friendly way, solutions need to be found. A storage technique that has great potential to be a solution is synthetic DNA based storage.
A DNA-strand is made up of a sequence of four nucleotides, Adenine (A), Thymine (T), Guanine (G) and Cytosine (C). To save our data using DNA, our binary data is encoded to quaternary data, using this quaternary data DNA strands are made.
During this process, of saving our data in DNA, errors can occur, therefore two constraints are considered to minimise the number of errors in our codes.
The GC-weight, which is the number of G and C nucleotides in a DNA word and the maximal run-length, which is the maximum number of identical nucleotides in a row. In addition, the Hamming distance between words is also considered to be able to detect and even correct errors that still occur.

This research tries to find lower and upper bounds for the maximum size of DNA codes.
We continue the work of Van Leeuwen [1], Vermeer [2] and Laseur [3], who considered codes with fixed GC-weight. In this thesis we will relax this GC-weight constraint and we will look at codes where we have a set of GC-weights and all the DNA words in the code need to satisfy one of these weights. Furthermore, the codes we consider have a run-length of r=1. We will determine the maximum size of codes where the minimum Hamming distance d is equal to the length n of the DNA words. We determine lower and upper bounds for the case where d=n-1 and for specific sets of GC-weights we will determine the maximum size. Finally we will present nine algorithms, these algorithms create DNA-codes. The size of these codes give us lower bounds for the maximum size of DNA codes.