TDT4287: Dynamic Programming Algorithms

Tags:

Disclaimer: I haven't really attended any lectures, and I'm writing this as I go. Also, all code is pseudocode.

Curriculum (2016)

The following chapters in the book (Jones & Pevzner) are in the curriculum:

4.4-4.9 (Exhaustive Search)
5.1-5.5 (Greedy Algorithms)
6.4-6.14 (Dynamic Programming Algorithms)
7.2-7.4 (Divide-and-Conquer Algorithms)
8.1-8.9 (Graph Algorithms)
9.3-9.8 (Combinatorial Pattern Matching)
12.2-12.3 (Randomized Algorithms)

(or, at least these are the chapters listed under lecture plan on It's Learning)

Exhausitve Search

We would like to infer a common sequence from a set $\mathcal{S}$ of DNA sequences. If we want exact matching and know the size of the common sequence, this problem is rather easy (window iterate over sublist 0,1,..,k, 1,2,..,k+1, ...):

seqs = set()
for s in S:
    for seq in s.window(k):
        seqs[seq] += 1
return max(seqs)

However, sometimes the common sequence may have undergone simple mutations, which renders our algorithm useless. We could somehow align the sequences in $\mathcal{S}$, count the elements downwards, and select the most frequent (see p.86 in the book).

Motif Finding

The motif finding problem is exactly this. The main problem is to find the alignment of the sequences. We would want to define a scoring function, which ranks the alignment; for instance we could sum the frequencies of the most popular elements (this is what the book does), making the best possible score $k|\mathcal{S}|$.

Median String Problem

A related problem is the Median String Problem. We would like to find a sequence that minimizes the Hamming distance between itself and all sequences in $\mathcal{S}$, starting from some offset in each sequence.

Running Time?

Both of these algorithms are very slow; they are in fact exponential. However, there is a simple trick we can use to leverage this: branch and bound. Assume we are in the middle of the search, and that we keep track of the best candidate so far. If now, we are in the middle of ranking a sequence $s_1..s_i..s_n$; if we get to $s_i$, and the match is so bad that we cannot possibly beat our current best, we might as well stop here, and continue with $s_1..s_1+1..$.

For instance, say we are doing the Median String problem, and start with $AAAAAA$. Then, after some iteration, we have a minimum score of 2. After yet more iteration, we are matching $CCA---$ and get a minimum score of 3 by just checking the first three elements. Now there is no way we will ever get less than 3 matches (because that does not make any sense!). Hence, we can skip all sequences starting with $CCA$, saving $4^3=64$ iterations.

Note that branch and bound doesn't improve the Big-O complexity of the algorithm, but it does reduce running time significantly.

Some Algorithms

Alignment

There are three types of alignment: global, local and multiple. Global alignment tries to align entire strings, and local alignments tries to align substrings, in order to get a better alignment. Both global and local uses a scoring matrix ($\delta$), which defined the scores for matches, and the penalties for mismatches and skips.

Global Alignment

We usually use DP for ths. The recurrence relation for each entry is

$$ s_{i,j} = \max \begin{cases} s_{i-1,j} + \delta(v_i, -)\\ s_{i,j-1} + \delta(-, w_i)\\ s_{i-1,j-1} + \delta(v_i, w_i)\\ \end{cases} $$

that is, either skip one of $v_i$ and $w_i$ and pay the price, or choose both and add the scoring.

As there are $mn$ entries in the matrix, and each one needs to be calculated, the algothims takes $O(mn)$ time and space.

Local Alignment

For the naïve approach, we can use the recurrence from Global Alignsment, and add a new case: $$ s_{i,j} = \max \begin{cases} 0\\ s_{i-1,j} + \delta(v_i, -)\\ s_{i,j-1} + \delta(-, w_i)\\ s_{i-1,j-1} + \delta(v_i, w_i)\\ \end{cases} $$ That is, either start the alignment here, or continue the previous alignment as usual. Naturally, this does not affect the complexity.

Multiple Alignment

Multiple Alignment takes multiple strings and tries to align them best. The problem is in NPC.

Edit Distance

The Edit Distance problem finds the minimum number of "edits" to transform one string $s_1$ to another string $s_2$. One edit is usually a deletion, an insertion, or a replacement. Solvable using DP, similar to Alignment.

Longest Common Subsequence

Given two strings $v, w$, find a string $s$ such that each element in $s$ are in both $v$ and $w$ in order, and that the length of $s$ is maximized. Also solvable using DP.

Written by

kennedy martinhath

Last updated: Sun, 3 Dec 2017 05:35:22 +0100 .