TDT4287: Algorithms for Bioinformatics
Disclaimer: I haven't really attended any lectures, and I'm writing this as I go. Also, all code is pseudocode.
The following chapters in the book (Jones & Pevzner) are in the curriculum:
- 4.4-4.9 (Exhaustive Search)
- 5.1-5.5 (Greedy Algorithms)
- 6.4-6.14 (Dynamic Programming Algorithms)
- 7.2-7.4 (Divide-and-Conquer Algorithms)
- 8.1-8.9 (Graph Algorithms)
- 9.3-9.8 (Combinatorial Pattern Matching)
- 12.2-12.3 (Randomized Algorithms)
(or, at least these are the chapters listed under lecture plan on It's Learning)
We would like to infer a common sequence from a set
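Here a window slides over a sequence and yields each of its length-k substrings.

    from collections import Counter

    def most_common_kmer(S, k):
        seqs = Counter()  # how often each length-k substring occurs
        for s in S:
            # slide a window of length k over s
            for i in range(len(s) - k + 1):
                seqs[s[i:i + k]] += 1
        # return the substring with the highest count
        return max(seqs, key=seqs.get)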
However, sometimes the common sequence may have undergone simple mutations, which renders our algorithm useless. We could somehow align the sequences in
The motif finding problem is exactly this.
The main problem is to find the alignment of the sequences.
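We would want to define a scoring function that ranks an alignment;
for instance, we could sum, over the columns of the alignment, the number of occurrences of the most popular element in each column (this is what the book does), making the best possible score the number of sequences times the length of the common sequence.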
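A minimal sketch of such a scoring function in Python, together with the exhaustive search over all alignments. The names `consensus_score` and `brute_force_motif_search` are made up for illustration, and an alignment is assumed to be given as one start position per sequence.

```python
from collections import Counter
from itertools import product

def consensus_score(seqs, starts, k):
    """Sum, over the k columns, the count of the most common symbol."""
    # Cut one length-k window out of each sequence at its chosen start.
    windows = [s[i:i + k] for s, i in zip(seqs, starts)]
    # zip(*windows) walks the alignment column by column.
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*windows))

def brute_force_motif_search(seqs, k):
    """Try every combination of start positions and keep the best one."""
    ranges = [range(len(s) - k + 1) for s in seqs]
    return max(product(*ranges),
               key=lambda starts: consensus_score(seqs, starts, k))
```

For example, `brute_force_motif_search(["ACGT", "TACG", "GACT"], 3)` picks the windows `ACG`, `ACG` and `ACT`, which score 3 + 3 + 2 = 8. The number of start combinations grows exponentially with the number of sequences.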
Median String Problem
A related problem is the Median String Problem.
We would like to find a sequence that minimizes the Hamming distance between itself and all sequences in
Both of these algorithms are very slow; they are in fact exponential.
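To see where the exponential cost comes from, here is a brute-force sketch of the Median String problem in Python. It assumes that the distance between a word and a sequence is the smallest Hamming distance to any window of that sequence, and that the alphabet is `ACGT`; the function names are illustrative only.

```python
from itertools import product

def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def distance(word, seq):
    """Smallest Hamming distance between word and any window of seq."""
    k = len(word)
    return min(hamming(word, seq[i:i + k]) for i in range(len(seq) - k + 1))

def total_distance(word, seqs):
    """Sum of the distances from word to every sequence."""
    return sum(distance(word, s) for s in seqs)

def median_string(seqs, k, alphabet="ACGT"):
    """Try all len(alphabet)**k words and return the closest one."""
    candidates = ("".join(p) for p in product(alphabet, repeat=k))
    return min(candidates, key=lambda w: total_distance(w, seqs))
```

The loop over `product(alphabet, repeat=k)` visits 4^k candidates, which is the exponential part.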
However, there is a simple trick we can use to mitigate this: branch and bound.
Assume we are in the middle of the search, and that we keep track of the best candidate so far.
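If we are now in the middle of ranking a sequence and can already tell that it cannot beat the best candidate, we can abandon it, along with every candidate that extends it.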
For instance, say we are doing the Median String problem, and start with
Note that branch and bound doesn't improve the worst-case Big-O complexity of the algorithm, but in practice it often reduces the running time significantly.
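A sketch of branch and bound applied to the Median String problem, in Python. It assumes the same distance as the brute-force search (smallest Hamming distance to any window) and uses the fact that the total distance of a prefix is a lower bound on the total distance of every word starting with that prefix. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def total_distance(word, seqs):
    """Sum over all sequences of the best window's Hamming distance."""
    k = len(word)
    return sum(
        min(hamming(word, s[i:i + k]) for i in range(len(s) - k + 1))
        for s in seqs
    )

def median_string_bnb(seqs, k, alphabet="ACGT"):
    """Depth-first search over prefixes, pruning hopeless branches."""
    best_word, best_dist = None, float("inf")

    def extend(prefix):
        nonlocal best_word, best_dist
        d = total_distance(prefix, seqs) if prefix else 0
        if d >= best_dist:
            return  # bound: no extension of this prefix can do better
        if len(prefix) == k:
            best_word, best_dist = prefix, d  # new best full-length word
            return
        for c in alphabet:  # branch: try every one-letter extension
            extend(prefix + c)

    extend("")
    return best_word, best_dist
```

In the worst case nothing gets pruned and all 4^k words are still visited, which is why the asymptotic bound does not change.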
There are three types of alignment: global, local and multiple.
Global alignment tries to align entire strings, while local alignment tries to align substrings in order to get a better alignment.
Both global and local alignment use a scoring matrix (
We usually use DP for this. The recurrence relation for each entry is
that is, either skip one of
As there are
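A minimal sketch of the global alignment table in Python. It assumes a simple scoring scheme (match +1, mismatch -1, indel -1) in place of a full scoring matrix; the function name and the score values are illustrative, not taken from the book. Each entry takes the best of three options: skip a character of the first string, skip a character of the second string, or pair the two current characters.

```python
def global_alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Score of the best global alignment of v and w (assumed scoring)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against the empty string costs one indel per character.
    for i in range(1, n + 1):
        s[i][0] = i * indel
    for j in range(1, m + 1):
        s[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(
                s[i - 1][j] + indel,     # skip v[i-1]
                s[i][j - 1] + indel,     # skip w[j-1]
                s[i - 1][j - 1] + pair,  # pair v[i-1] with w[j-1]
            )
    return s[n][m]
```

The table has (n+1)(m+1) entries and each is filled in constant time, so this sketch runs in O(nm) time.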
For the naïve approach, we can use the recurrence from Global Alignment, and add a new case:
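One common way to write the extra case is a zero option, which lets an alignment start afresh at any cell. The Python sketch below assumes that, along with an illustrative scoring scheme (match +1, mismatch -1, indel -1); the function name is made up.

```python
def local_alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Score of the best local alignment of v and w (assumed scoring)."""
    n, m = len(v), len(w)
    # Row 0 and column 0 stay 0: a local alignment may start anywhere.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(
                0,                       # the new case: start over here
                s[i - 1][j] + indel,     # skip v[i-1]
                s[i][j - 1] + indel,     # skip w[j-1]
                s[i - 1][j - 1] + pair,  # pair v[i-1] with w[j-1]
            )
            best = max(best, s[i][j])    # the alignment may end anywhere
    return best
```

For example, `local_alignment_score("TTTACGTTT", "GGACGGG")` finds the shared `ACG` and returns 3, while the global score of the same pair is dragged down by the flanking mismatches.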
Multiple alignment takes more than two strings and tries to find the best joint alignment of all of them. The problem is NP-complete.
The Edit Distance problem finds the minimum number of "edits" to transform one string into another.
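A short Python sketch of the usual DP for edit distance, assuming that an "edit" is an insertion, a deletion or a substitution of a single character, each with cost 1. The function name is illustrative.

```python
def edit_distance(v, w):
    """Minimum number of single-character edits turning v into w."""
    n, m = len(v), len(w)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters of the prefix of v
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters of the prefix of w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # delete v[i-1]
                d[i][j - 1] + 1,                 # insert w[j-1]
                d[i - 1][j - 1] + substitution,  # keep or substitute
            )
    return d[n][m]
```

For instance, `edit_distance("kitten", "sitting")` is 3: two substitutions and one insertion.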
Longest Common Subsequence
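Given two strings, we want to find the longest sequence of characters that appears in both of them in the same order, though not necessarily contiguously.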
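A Python sketch of the standard DP for this problem, with backtracking to recover one longest common subsequence (there may be several of the same length). The function name `lcs` is an assumption for illustration.

```python
def lcs(v, w):
    """Return one longest common subsequence of v and w."""
    n, m = len(v), len(w)
    # s[i][j] is the LCS length of the prefixes v[:i] and w[:j].
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])
    # Walk back from the bottom-right corner to collect the characters.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1]:
            out.append(v[i - 1])
            i -= 1
            j -= 1
        elif s[i - 1][j] >= s[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For `ATCTGAT` and `TGCATA` the result has length 4 (for example `TCTA`). This is global alignment with matches scoring 1, indels scoring 0 and mismatches forbidden.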