Project: Algorithm Engineering
Current information can be found in the Moodle course: moodle.uni-ulm.de/course/view.php
Problem
The task to be addressed in this project is related to pan-genome read mapping.
A ‘pan-genome’ is a DNA sequence that contains common genetic variations (on the order of 10^6 to 10^9 bp). A ‘read’ is a short fragment (substring) of the sequence (on the order of 100 to 1,000 bp). The problem is to find the position of a given read in the sequence. (It becomes a mapping problem when the read may contain errors, so that the position where the read fits best is sought.)
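To make the problem statement concrete, here is a minimal brute-force sketch in Python. It assumes that the sequence is a plain string without variations and that "fits best" means the fewest mismatching characters (Hamming distance, i.e. substitution errors only). Both assumptions and the function names are chosen for illustration and are not part of the task definition.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_position(sequence: str, read: str) -> tuple[int, int]:
    """Return (position, mismatches) of the leftmost best-fitting window.

    Assumption: only substitution errors are modelled; insertions and
    deletions in the read are ignored.
    """
    if not read or len(read) > len(sequence):
        raise ValueError("read must be non-empty and no longer than the sequence")
    best_pos, best_dist = 0, len(read) + 1
    # Try every window of the sequence that has the same length as the read.
    for pos in range(len(sequence) - len(read) + 1):
        dist = hamming(sequence[pos:pos + len(read)], read)
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    return best_pos, best_dist
```

For a sequence of length n and a read of length m, this sketch compares about n * m characters per read, so it serves only as a definition of the expected result, not as a usable method at pan-genome scale.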
Since the pan-genome is very large and there are usually many reads, string-matching algorithms that slide each read along the pan-genome are out of the question. Practical solutions instead preprocess the pan-genome so that queries can be answered more quickly. The fastest solutions to the read mapping problem on genomes without variations use techniques related to suffix trees. However, these are not directly transferable to pan-genomes. Another solution, which is more easily transferable, uses a k-mer index.
A k-mer is a fragment of the sequence of length k (on the order of k = 10 to 20 bp). A k-mer index stores, for each k-mer, all positions at which it occurs in the pan-genome. Knowing the positions of the k-mers, the positions of the reads can be determined ...
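The idea of a k-mer index can be sketched as follows. This is a minimal illustration in Python that assumes a plain linear sequence without variations; the function names are hypothetical, and the voting step is only one possible way to derive read positions from k-mer positions, not necessarily the method intended for the project.

```python
from collections import Counter, defaultdict


def build_kmer_index(sequence: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the sequence to the sorted list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return dict(index)


def candidate_positions(index: dict[str, list[int]], read: str, k: int) -> list[tuple[int, int]]:
    """Return (read start, votes) pairs, most-voted candidate first.

    Each k-mer of the read that occurs in the index votes for the read
    start position implied by that occurrence.
    """
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset  # where the read would begin in the sequence
            if start >= 0:
                votes[start] += 1
    return votes.most_common()


# Usage with a toy sequence; real values of k would be around 10 to 20.
index = build_kmer_index("ACGTACGTTAGC", 3)
candidates = candidate_positions(index, "CGTTAG", 3)
```

A read with a few errors still shares most of its k-mers with the correct position, so the candidate with the most votes is a natural place to check the read in full.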