# Course:CPSC445/2011W2

Algorithms in Informatics | |
---|---|

CPSC 445 | |

Section: | W2 |

Instructor: | Holger Hoos |

Joel Ferstay | |

Email: | hoos@cs.ubc.ca |

Office: | |

Office Hours: | |

Class Schedule: | Tue+Thu, 9:30-11:00 |

Classroom: | DMP 301 |

Important Course Pages | |

Syllabus | |

Lecture Notes | |

Assignments | |

Course Discussion | |

Welcome to the home of TWiki.CS445. This area is for use by students of CS 445 / Algorithms in Bioinformatics.

Lecture notes can be added to here.

## Contents

## Office Hours

Holger: Wed, 9:00-10:00 (ICCS X541)

Additional office hours:

- Holger: Thu, 04/12, 9:00-10:00 (ICCS X541)
- Holger: Thu, 04/12, 14:00-14:45 (ICCS X541)

Joel's additional office hours:

Additional office hours:

- Joel: Wednesday, 04/11, 12:40-13:40 (ICCS X468)
- Joel: Wednesday, 04/11, 13:40-14:40 (ICCS X468)
- Joel: Friday, 04/13, 09:50-10:50 (ICCS X468)

## Course Content

Note - The course wiki is for students to post questions, comments, etc. - the instructional team will try to respond to them here. Some lecture notes are also posted here.

### Detailed Course Outline

Introduction to biological sequence analysis - The KMP Algorithm Pairwise Sequence Alignment - The Global Pairwise Sequence Alignment Problem - Finding optimal PSA (using linear gap scores; Needleman-Wunsch algorithm) - Local pairwise sequence alignment (Smith-Waterman algorithm) - PSA with affine gap scores - Other sequence alignment problems - Heuristic alignment (BLAST) - Scoring in pairwise sequence alignemnt Multiple Sequence Alignment - Multiple sequence alignment (MSA) - introduction and problem statement - Scoring multiple sequence alignments - How to find optimal global multiple sequence alignments - Progressive multiple sequence alignment - The Feng-Doolittle algorithm - Sequence profile alignment - The ClustalW algorithm - Iterative refinement methods - Other MSA approaches Phylogenetic Tree Construction - Introduction & problem statement - Distance-based algorithms - How do we assess quality of a phylogenetic tree obtained by a distance-based method? - Parsimony - The small parsimony problem (Fitch's algorithm) - Approaches for solving the large parsimony problem - Simultaneous Alignment and Phylogenetic Tree Construction - Maximum Likelihood Approach for Phylogenetic Tree Construction Motif Finding & Discovery - Introduction - Motif finding (maximum likelihood approach) - The Motif Discovery Problem - The Relative Entropy Site Selection Problem - Motif discovery: consensus pattern formulation RNA and Protein Structure Prediction - Introduction - RNA secondary structure - Comparative sequence analysis - Energy-based RNA secondary structure prediction - The Nussinov structure prediction algorithm - Free energy of RNA secondary structure (Turner model) - The dynamic programming algorithm by Zuker & Stiegler - Other problems related to RNA secondary structure - Protein Structure and Function - Forces that determine the native state of proteins - Computational problems related to protein structure - The Protein folding problem - Protein folding approaches - An example of a simplified model for protein structure prediction - the 2D/3D HP models - ROSETTA

### RNA Secondary Structure Prediction

- Nussinov Algorithm - you may find these links useful for understanding this algorithm better. http://ultrastudio.org/en/Nussinov_algorithm

### Motif Discovery Notes

- Found this online, seems like it's exact notes we took from lecture. Thought I'd share this here :) enjoy! -A. http://www.cs.washington.edu/education/courses/cse527/00wi/lectures/lect10.pdf

Thanks for posting, these are indeed good resources. Below, my own notes for part 1 of the motiv finding & discovery module. Note that weight matrix scores are sometimes called WMM scores (relevant for the assignment). - H.

CPSC 445/545 - Motif Finding & Discovery (1) [partly based on earlier lecture notes from Anne Condon, CPSC536A and on lecture notes from UW, CS527] --- introduction motif: pattern common to a set of short sequences, or sites, where the sites share some signal associated with their common biological property. examples: binding site for a regulatory protein, transcription start and stop sites, ribosome binding sites in prokaryotes, or intron splice sites [slide: For example, the cyclic AMP receptor protein (CRP) is a transcription factor in E. Coli. Its binding sites are DNA sequences of length approximately 22. The following table, taken from Stormo and Hartzell (1989), shows just positions 3-9 (out of the 22 sequence positions) in 23 bonafide CRP binding sites. TTGTGGC TTTTGAT AAGTGTC ATTTGCA CTGTGAG ATGCAAA GTGTTAA ATTTGAA TTGTGAT ATTTATT ACGTGAT ATGTGAG TTGTGAG CTGTAAC CTGTGAA TTGTGAC GCCTGAC TTGTGAT TTGTGAT GTGTGAA CTGTGAC ATGAGAC TTGTGAG Notice that in the second column T predominates and in the third column G predominates, for example. ] motifs can be summarised in a sequence profile = weight array [slide: weight array for example from above: A 0.35 0.043 0 0.043 0.13 0.83 0.26 C 0.17 0.087 0.043 0.043 0 0.043 0.3 G 0.13 0 0.78 0 0.83 0.043 0.17 T 0.35 0.87 0.17 0.91 0.043 0.087 0.26 ] Simplifying assumptions: - consider only motifs with same length - do not allow gaps - consider DNA sequences two problems: 1. Motif finding: finding instances of a known motif in a given sequence 2. Motif discovery: discovery of new motifs in given unaligned sequences --- Motif finding. problem 1a: Given a sequence x of the same length l, decide whether it is likely to be an instance of the motif. problem 1b: given a long sequence x, find an instance of the motif in x. solution to problem 1a can be used directly for solving problem 1b (sliding window) --- Maximum likelihood approach to problem 1a Recall our interpretation of (dynamic programming) ungapped alignment scores. An alignment score (with respect to a properly designed scoring matrix) is the log odds score, with respect to two models for generating a pair of ungapped, aligned sequences: the model that generates evolutionarily related sequences, and a random model that generates each sequences randomly, according to a background distribution. For problem 1a, we follow an analogous approach. We construct two models which generate sequences: M: a model that generates instances of the motif B: a model that generates "background" sequences Think of a model as simply a probability distribution over all sequences of length l. We then compute the log odds score (logs are to base 2): log ( L[M|x] / L[B|x] ) The numerator, L[M|x] is the likelihood that x was generated by the model M, and the denominator is the likelihood that x was generated by B. The higher the log odds score, the more likely the sequence x was generated by the model, i.e. is an instance of the motif. To make a decision, we can compare the log odds score to a threshold. How to define the model M? We will use the given instances of the motif for this purpose. A natural idea is, at each position of the motif, generate a base according to the frequency with which it appears in that position in the given instances of the motif. The frequency is the number times that the base appears (at a fixed position), divided by the total number of instances. Note that with this approach, the base at each position is chosen independently of the base at any other position. This method of defining M is justified since the resulting model is the one that best explains the data we are given, i.e. the given instances of the motif. (That is, M is the most likely estimator for the data.) Example 1, continued: The frequencies of bases at each position of the motif of Example 1 are given by the following frequency matrix, assuming the background model is uniform: pos 1 2 3 4 5 6 7 base A 0.35 0.04 0 0.04 0.13 0.84 0.26 C 0.17 0.09 0.04 0.04 0 0.04 0.3 G 0.13 0 0.79 0 0.83 0.04 0.18 T 0.35 0.87 0.17 0.92 0.04 0.08 0.26 Example 3, continued: The following are derived from 100 samples, so divide each number by 100 (TODO): pos 1 2 3 4 5 6 base A 2 95 26 59 51 1 C 9 2 14 13 20 3 G 10 1 16 15 13 0 T 79 3 44 13 17 96 How to define the background model B? A simple way would be to generate each base uniformly at random. If the sequences you are working with have high nucleotide bias, such as high GC content, you could account for this by using instead the background frequency of each base in a (large) sample of sequence data of interest. Let f_{b,i} denote the frequency of base b at position i of the frequency matrix representation of M, and f_b denote the probability of base b according to model B. Then the log odds score for sequence x = x_1 ... x_l is log ( (\prod_i f_{x_i,i}) / (\prod_i f_{x_i}) ) = sum_i log ( f_{x_i,i}/f_{x_i} ) In order to be able to calculate the log odds score quickly for any given sequence x, we use a weight matrix to store the scores log (f_{b,i}/f_b ) for each base b and position i. These values are typically rounded to the nearest integer. Example: the weight matrix for Example 1, assuming the background model is the uniform model, is: A 0.48 -2.5 -infty -2.5 -0.94 1.7 0.061 C -0.52 -1.5 -2.5 -2.6 -infty -2.5 0.28 G -0.94 -infty 1.6 -infty 1.7 -2.5 -0.52 T 0.48 1.8 -0.52 1.9 -2.5 -1.5 0.-61 The -infty entries are awkward, particularly since they may be due to insufficient data. To avoid, this, various sample correction formulas may be applied, such as replacing the f_{b,i} by a small positive number, and adjusting the other values in column i accordingly, before calculating the weight matrix. The higher the log odds score is calculated, the more likely the sequence was generated according to the model M, rather than according to R, and thus the more likely it is an instance of the motif. The score is 0 if it is equally likely to have been generated by M as by R and is negative if it is more likely to have been generated by R. To test the hypothesis that x is indeed an instance of the motif, compare the log odds score with some threshold. If a long sequence is being searched for an instance of a motif, you have to set the threshold to account for the length of the sequence, e.g. proportional to the log of the sequence length. This is similar to our handling of scores for alignments, when searching for similar sequences in a large database. --- Significance of scores It is good to have a way to measure how different a motif is from the background sequence distribution. This will be useful, for example, when we formulate the motif discovery problem (problem 2 above), since we will aim to find motifs that are as different as possible from the background distribution. We will use relative entropy for this purpose. Recall that we view our motif model M and background model B as probability distributions over sequences of length l. Let S (the sample space) be the set of all sequences of length l (DNA sequences, in our examples). Thus, M(s) is the probability that sequence s is generated by model M, and B(s) is the probability that sequence s is generated by model B. The relative entropy, or information content, of probability distribution M with respect to probability distribution B over sample space S is defined as: D_2(M || B) = sum_{s in S} M(s) log_2 (M(s)/B(s)). By convention, we define M(s) log_2 ( M(s)/B(s) ) = 0 if M(s) = 0 (in keeping with the fact that lim_{x --> 0} x log x = 0). The relative entropy is undefined if B(s) = 0 for some s. Since we are taking logs to the base 2, relative entropy is measured in bits. (We could use other bases.) In what follows, we drop the subscript 2 and write D rather than D_2. Note that the expected value of a function f(s) with respect to probability distribution P on sample space S is E(f(s)) = sum_{s in S} P(s) f(s). Thus, relative entropy is the expected value of the log likelihood ratio, when a sample is picked according to M. Intuitively, the relative entropy measures how different the distributions M and B are. When M and B are the same, the relative entropy is 0. The larger the relative entropy, the easier it should be to distinguish between instances of a motif and the background distribution. So, we can use relative entropy as our measure of how informative the log likelihood ratio test is. In our case, the sample space is all sequences of length l, and we are assuming that the base at each position is generated independently. Therefore, D(M || B) = sum_{j=1}^{l} D(M_j || B_j), where M_j and B_j are the distributions at the jth position according to the motif and background models. (Typically, B_j is the same for all j.) We call the quantities D(M_j || B_j) the positional relative entropies. Example: -------- Suppose the list of motif instances is: ATG ATG ATG ATG ATG GTG GTG TTG The frequency matrix is: 1 2 3 A 0.625 0 0 C 0 0 0 G 0.25 0 1 T 0.125 1 0 The weight (log likelihood) matrix is: A 1.32 -infty -infty C -infty -infty -infty G 0 -infty 2 T -1 2 -infty (For example, entry in row G, column 3 is log_2 (f_{G,3}/f_G) = log_2( 1/.25) = log_2 4 = 2.) The positional relative entropies of the model, with respect to the uniform background distribution, are: 0.72 2 2 (Go through the calculations.) The fact that at position 2, the motif instances always contain a "T" means that the motif is more different at position 2 from the uniform distribution than at position 1. This is consistent with the fact that, at position 2, the relative entropy is higher than at position 1. In fact, if the background distribution is the uniform distribution, the positional relative entropy at any position can be at most 2. Can you suggest how to change the background distribution, so that the relative entropy is greater than 2? If we changed the background distribution B so that f_A = f_T = .375, f_C = f_G = 0.125, then the weight matrix and positional relative entropies are: A 0.737 -infty -infty C -infty -infty -infty G 1 -infty 2 T -1.58 2 -infty 0.512 1.42 3 Finally, the positional relative entropies for the CRP binding site example of Example 1 are: 0.12 1.3 1.1 1.5 1.2 1.1 0.027 Note that 1.5 (middle position) is the highest relative entropy and corresponds to the most biased column (see Table 7.2). The value 0.027 (last position) is the lowest relative entropy because the distribution in this last position is the closest to the uniform background distribution (see Table 7.2). To conclude: if the relative entropy is low, it's hard to distinguish an instance of a motif from the background distribution using log odds scores. We will continue with a relative entropy formulation of problem 2 next time. --- Nonnegativity of Relative Entropy You may have noticed that the relative entropy has always been nonnegative. It is by no means obvious that this should be, since it is the expected value of the log likelihood ratio, which can take negative values. However, it turns out to always be the case. Theorem 8.8: For any probability distributions P and Q over a sample sapce S, D(P||Q) >= 0, with equality if and only if P and Q are identical. The proof (omitted) follows from the fact that ln x <= x-1 for all real numbers x, with equality if and only if x = 1. --- The Motif Discovery Problem goal: identify and characterise unknown, shared motifs (short, imperfectly conserved subsequences) in a set of unaligned sequences. -- How to quantify the "amount of information" in a motif - could use information content [def: ...] - better: relative entropy of set of sequences (allows to take into account background distribution) assumption: each position is independent Given: - background distribution on nucleotide frequencies Q(N) = prob of N, N \in {A,C,G,T} Q(N) is position independent - profile matrix = weight array P_j(N) = prob of N at pos j, N \in {A,C,G,T} Note: P_j(N) is a probability distribution over bases Relative entropy of P_j w.r.t. Q is defined as D(P_j||Q) := sum_{N in {A,C,G,T}} P_j(N) log_2 (P_j(N)/Q(N)). Relative entropy of motif given by profile matrix [P_j(N)]_{N,j}, length L: D(P||Q) := sum_{j=1}^{L} D(P_j||Q) For the example above, the positional relative entropy values, measured in base 2, are 0.12 1.3 1.1 1.5 1.2 1.1 0.027 total relative entropy = 6.347 (bits) Note: Relative entropy is a measure for the difference between two distributions; it is also called "Kulback Leibler distance" (but: it is not a proper distance metric, since it doesn't satisfy the triangle inequality) Note: - D(P||Q) >= 0 - P = Q => D(P||Q) = 0 - D(P||Q) <= 2*L (when does this happen?) --- Resources: Finding instances of unknown sites, Lecture 10, CS527, Winter 2000, CS&E, U. Washington. See http://www.cs.washington.edu/education/courses/527/00wi/ G. Z. Hertz and G. D. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7/8):563-577, July/August 1999. See http://ural.wustl.edu/publications.html Further Reading: C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: structure, function and genetics, 7:41-51, 1990. G. D. Stormo and G. W. Hartzell III. Identifying protein-binding sites from unaligned DNA fragments. Proc. Nat. Acad. Science, U.S.A., 86:1183-1187, 1989. T. L. Bailey and C. Elkan. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, 21(1-2):51-80, 1995. http://meme.sdsc.edu/meme/website/papers.html T. Akutsu. Hardness results on gapless local multiple sequence alignment, By Technical report 98-MPS-42-2, Information Processing Society of Japan, 1998. L. P. Lim and C. B. Burge. A computational analysis of sequence features involved in recognition of short introns, PNAS, 98(20):11193-11198, September 25, 2001. See www.pnas.org.