Course:CPSC445/2011W2

Algorithms in Bioinformatics
CPSC 445
Section: W2
Instructor: Holger Hoos
Joel Ferstay
Email: hoos@cs.ubc.ca
Class Schedule: Tue+Thu, 9:30-11:00
Classroom: DMP 301
Important Course Pages
Syllabus
Lecture Notes
Assignments
Course Discussion

Welcome to the home of TWiki.CS445. This area is for use by students of CS 445 / Algorithms in Bioinformatics.

Lecture notes can be added here.

Office Hours

Holger: Wed, 9:00-10:00 (ICCS X541)

Additional office hours:

  • Holger: Thu, 04/12, 9:00-10:00 (ICCS X541)
  • Holger: Thu, 04/12, 14:00-14:45 (ICCS X541)

Joel's additional office hours:

  • Joel: Wednesday, 04/11, 12:40-13:40 (ICCS X468)
  • Joel: Wednesday, 04/11, 13:40-14:40 (ICCS X468)
  • Joel: Friday, 04/13, 09:50-10:50 (ICCS X468)

Course Content

Note: The course wiki is for students to post questions, comments, etc.; the instructional team will try to respond to them here. Some lecture notes are also posted here.


Detailed Course Outline

Introduction to biological sequence analysis
- The KMP Algorithm

Pairwise Sequence Alignment
- The Global Pairwise Sequence Alignment Problem
- Finding optimal PSA (using linear gap scores; Needleman-Wunsch algorithm)
- Local pairwise sequence alignment (Smith-Waterman algorithm)
- PSA with affine gap scores
- Other sequence alignment problems
- Heuristic alignment (BLAST)
- Scoring in pairwise sequence alignment

Multiple Sequence Alignment
- Multiple sequence alignment (MSA) - introduction and problem statement
- Scoring multiple sequence alignments
- How to find optimal global multiple sequence alignments
- Progressive multiple sequence alignment
- The Feng-Doolittle algorithm
- Sequence profile alignment
- The ClustalW algorithm 
- Iterative refinement methods
- Other MSA approaches

Phylogenetic Tree Construction 
- Introduction & problem statement
- Distance-based algorithms
- How do we assess quality of a phylogenetic tree obtained by a distance-based method?
- Parsimony
- The small parsimony problem (Fitch's algorithm)
- Approaches for solving the large parsimony problem
- Simultaneous Alignment and Phylogenetic Tree Construction
- Maximum Likelihood Approach for Phylogenetic Tree Construction

Motif Finding & Discovery 
- Introduction
- Motif finding (maximum likelihood approach)
- The Motif Discovery Problem
- The Relative Entropy Site Selection Problem
- Motif discovery: consensus pattern formulation


RNA and Protein Structure Prediction
- Introduction
- RNA secondary structure
- Comparative sequence analysis 
- Energy-based RNA secondary structure prediction
- The Nussinov structure prediction algorithm 
- Free energy of RNA secondary structure (Turner model)
- The dynamic programming algorithm by Zuker & Stiegler
- Other problems related to RNA secondary structure
- Protein Structure and Function
- Forces that determine the native state of proteins
- Computational problems related to protein structure
- The Protein folding problem
- Protein folding approaches
- An example of a simplified model for protein structure prediction - the 2D/3D HP models
- ROSETTA

RNA Secondary Structure Prediction

Motif Discovery Notes

Thanks for posting; these are indeed good resources. Below are my own notes for part 1 of the motif finding & discovery module. Note that weight matrix scores are sometimes called WMM scores (relevant for the assignment). - H.

CPSC 445/545 - Motif Finding & Discovery (1)

[partly based on earlier lecture notes from Anne Condon, CPSC536A and on lecture notes from UW, CS527]

---
introduction

motif: pattern common to a set of short sequences, or sites, where the sites 
	share some signal associated with their common biological property.

examples: binding site for a regulatory protein, transcription start and stop
	sites, ribosome binding sites in prokaryotes, or intron splice sites

[slide: 
For example, the cyclic AMP receptor protein (CRP) is a transcription
factor in E. coli. Its binding sites are DNA sequences of length
approximately 22. The following table, taken from Stormo and
Hartzell (1989), shows just positions 3-9 (out of the 22 sequence positions)
in 23 bona fide CRP binding sites.

TTGTGGC
TTTTGAT
AAGTGTC
ATTTGCA
CTGTGAG
ATGCAAA
GTGTTAA
ATTTGAA
TTGTGAT
ATTTATT
ACGTGAT
ATGTGAG
TTGTGAG
CTGTAAC
CTGTGAA
TTGTGAC
GCCTGAC
TTGTGAT
TTGTGAT
GTGTGAA
CTGTGAC
ATGAGAC
TTGTGAG

Notice that in the second column T predominates and in the third
column G predominates, for example.
]

motifs can be summarised in a sequence profile = weight array

[slide: weight array for example from above:

A 0.35 0.043 0     0.043 0.13  0.83  0.26
C 0.17 0.087 0.043 0.043 0     0.043 0.3
G 0.13 0     0.78  0     0.83  0.043 0.17
T 0.35 0.87  0.17  0.91  0.043 0.087 0.26
]

Simplifying assumptions:
- consider only motifs whose instances all have the same length l
- do not allow gaps
- consider DNA sequences



two problems:
1. Motif finding: finding instances of a known motif in a given sequence

2. Motif discovery: discovery of new motifs in given unaligned sequences

---
Motif finding.

problem 1a:
   Given a sequence x of the same length l as the motif, decide whether
   it is likely to be an instance of the motif.

problem 1b: 
   given a long sequence x, find an instance of the motif in x.

solution to problem 1a can be used directly for solving problem 1b (sliding window)


---
Maximum likelihood approach to problem 1a


Recall our interpretation of (dynamic programming) ungapped alignment
scores.  An alignment score (with respect to a properly designed
scoring matrix) is the log odds score, with respect to two models for
generating a pair of ungapped, aligned sequences: the model that
generates evolutionarily related sequences, and a random model that
generates each sequence randomly, according to a background
distribution.

For problem 1a, we follow an analogous approach.  We construct two
models which generate sequences:

 M: a model that generates instances of the motif
 B: a model that generates "background" sequences

Think of a model as simply a probability distribution over all sequences
of length l.

We then compute the log odds score (logs are to base 2):

	log ( L[M|x] / L[B|x] )

The numerator, L[M|x] is the likelihood that x was generated by the
model M, and the denominator is the likelihood that x was generated by
B.  The higher the log odds score, the more likely the sequence x was
generated by the model, i.e. is an instance of the motif. To make a
decision, we can compare the log odds score to a threshold.

How to define the model M? We will use the given instances of the
motif for this purpose. A natural idea is, at each position of the
motif, to generate a base according to the frequency with which it
appears in that position in the given instances of the motif. The
frequency is the number of times that the base appears (at a fixed
position), divided by the total number of instances. Note that with
this approach, the base at each position is chosen independently of
the base at any other position. This method of defining M is justified
since the resulting model is the one that best explains the data we
are given, i.e. the given instances of the motif. (That is, M is the
maximum likelihood estimate for the data.)
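
[sketch: a minimal Python fragment, added for illustration and not part
of the original notes, showing how such a frequency matrix can be
computed from a list of equal-length motif instances; the function name
frequency_matrix is ours.]

from collections import Counter

BASES = "ACGT"

def frequency_matrix(instances):
    # freqs[i][b] = (# instances with base b at position i) / (# instances);
    # this is the maximum likelihood estimate for the model M.
    n = len(instances)
    length = len(instances[0])
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in instances)
        freqs.append({b: counts[b] / n for b in BASES})
    return freqs

# Example: the eight instances ATG, ..., TTG used later in these notes.
print(frequency_matrix(["ATG"] * 5 + ["GTG"] * 2 + ["TTG"])[0])
# {'A': 0.625, 'C': 0.0, 'G': 0.25, 'T': 0.125}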

Example 1, continued:
The frequencies of bases at each position of the motif of Example 1
are given by the following frequency matrix:

 pos   1     2     3     4     5    6      7
base
A     0.35  0.04  0     0.04  0.13  0.84  0.26
C     0.17  0.09  0.04  0.04  0     0.04  0.3
G     0.13  0     0.79  0     0.83  0.04  0.18
T     0.35  0.87  0.17  0.92  0.04  0.08  0.26

Example 3, continued: 
The following counts are derived from 100 samples; divide each number
by 100 to obtain the corresponding frequencies:

 pos    1    2    3    4    5    6
base
A       2   95   26   59   51    1
C       9    2   14   13   20    3
G      10    1   16   15   13    0
T      79    3   44   13   17   96

How to define the background model B? A simple way would be to
generate each base uniformly at random. If the sequences you are
working with have high nucleotide bias, such as high GC content, you
could account for this by using instead the background frequency of
each base in a (large) sample of sequence data of interest.

Let f_{b,i} denote the frequency of base b at position i of
the frequency matrix representation of M, and f_b denote the probability of base b
according to model B. Then the log odds score for sequence x = x_1 ... x_l is

log ( (prod_i f_{x_i,i}) / (prod_i f_{x_i}) ) = sum_i log ( f_{x_i,i} / f_{x_i} )

In order to be able to calculate the log odds score quickly for any
given sequence x, we use a weight matrix to store the scores log
(f_{b,i}/f_b ) for each base b and position i. These values are
typically rounded to the nearest integer.
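
[sketch: a companion Python fragment (again ours, not from the original
notes) computing the weight matrix entries log2(f_{b,i}/f_b) and the log
odds score of a sequence x; it omits the rounding to integers.]

import math

BASES = "ACGT"

def weight_matrix(freqs, background):
    # w[i][b] = log2(f_{b,i} / f_b); zero frequencies give -infinity.
    return [{b: math.log2(col[b] / background[b]) if col[b] > 0
             else float("-inf")
             for b in BASES}
            for col in freqs]

def log_odds_score(x, w):
    # sum_i log2(f_{x_i,i} / f_{x_i}) for a sequence x of length l.
    return sum(w[i][b] for i, b in enumerate(x))

# Example, using the ATG motif from later in these notes:
freqs = [{"A": 0.625, "C": 0.0, "G": 0.25, "T": 0.125},
         {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
         {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}]
uniform = {b: 0.25 for b in BASES}
print(round(log_odds_score("ATG", weight_matrix(freqs, uniform)), 2))
# 1.32 + 2 + 2 = 5.32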

Example: the weight matrix for Example 1, assuming the background
model is the uniform model, is:

A   0.48  -2.5    -infty  -2.5    -0.94   1.7   0.061
C  -0.52  -1.5    -2.5    -2.6    -infty -2.5   0.28
G  -0.94  -infty   1.6    -infty   1.7   -2.5  -0.52
T   0.48   1.8    -0.52    1.9    -2.5   -1.5   0.061

The -infty entries are awkward, particularly since they may be due to
insufficient data. To avoid this, various sample correction formulas
may be applied, such as replacing the zero frequencies f_{b,i} by a
small positive number, and adjusting the other values in column i
accordingly, before calculating the weight matrix.
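
[sketch: one common such correction, in Python; the pseudocount value
0.5 is an arbitrary illustrative choice, not prescribed by these notes.]

def corrected_frequencies(counts, n, pseudocount=0.5):
    # counts[b]: observed count of base b in one column of n instances.
    # Adding a pseudocount to every base keeps all frequencies positive,
    # so no weight matrix entry becomes -infinity.
    total = n + 4 * pseudocount
    return {b: (counts.get(b, 0) + pseudocount) / total for b in "ACGT"}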

The higher the log odds score, the more likely it is that the
sequence was generated according to the model M, rather than according
to B, and thus the more likely it is an instance of the motif. The
score is 0 if the sequence is equally likely to have been generated by
M as by B, and is negative if it is more likely to have been generated
by B.

To test the hypothesis that x is indeed an instance of the motif,
compare the log odds score with some threshold.  If a long sequence is
being searched for an instance of a motif, you have to set the
threshold to account for the length of the sequence, e.g. proportional
to the log of the sequence length. This is similar to our handling of
scores for alignments, when searching for similar sequences in a large
database.
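
[sketch: the sliding window search in Python; the helper name is ours,
and the threshold is a free parameter, set as discussed above.]

def find_motif_instances(x, w, threshold):
    # Slide a window of length l along x, score each window with the
    # weight matrix w, and report windows whose score meets the threshold.
    l = len(w)
    hits = []
    for start in range(len(x) - l + 1):
        s = sum(w[i][x[start + i]] for i in range(l))
        if s >= threshold:
            hits.append((start, s))
    return hits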


---
Significance of scores
 
It is good to have a way to measure how different a motif is from the
background sequence distribution. This will be useful, for example,
when we formulate the motif discovery problem (problem 2 above), since
we will aim to find motifs that are as different as possible from the
background distribution. We will use relative entropy for this
purpose.

Recall that we view our motif model M and background model B as
probability distributions over sequences of length l. Let S (the
sample space) be the set of all sequences of length l (DNA sequences,
in our examples). Thus, M(s) is the probability that sequence s
is generated by model M, and B(s) is the probability that sequence s
is generated by model B.

The relative entropy, or information content, of probability
distribution M with respect to probability distribution B 
over sample space S is defined as:

D_2(M || B) = sum_{s in S} M(s) log_2 (M(s)/B(s)).

By convention, we define M(s) log_2 ( M(s)/B(s) ) = 0 if M(s) = 0 (in
keeping with the fact that lim_{x --> 0} x log x = 0).  The relative
entropy is undefined if B(s) = 0 for some s. Since we are taking logs
to the base 2, relative entropy is measured in bits. (We could use
other bases.)  In what follows, we drop the subscript 2 and write D
rather than D_2.

Note that the expected value of a function f(s) with respect to
probability distribution P on sample space S is

E(f(s)) = sum_{s in S} P(s) f(s).

Thus, relative entropy is the expected value of the log likelihood
ratio, when a sample is picked according to M.  Intuitively, the
relative entropy measures how different the distributions M and B are.
When M and B are the same, the relative entropy is 0.  The larger the
relative entropy, the easier it should be to distinguish between
instances of a motif and the background distribution. So, we can use
relative entropy as our measure of how informative the log likelihood
ratio test is.

In our case, the sample space is all sequences of length l, and we
are assuming that the base at each position is generated independently.
Therefore,

D(M || B) = sum_{j=1}^{l} D(M_j || B_j),

where M_j and B_j are the distributions at the jth position according
to the motif and background models. (Typically, B_j is the same for
all j.) We call the quantities D(M_j || B_j) the positional relative
entropies.
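
[sketch: this decomposition means D(M || B) can be computed column by
column; a small Python fragment (function names are ours).]

import math

def positional_relative_entropy(m_col, b_col):
    # D(M_j || B_j) in bits; terms with M_j(b) = 0 contribute 0.
    return sum(p * math.log2(p / b_col[b])
               for b, p in m_col.items() if p > 0)

def relative_entropy(m_cols, b_col):
    # D(M || B) = sum over positions j of D(M_j || B_j), assuming the
    # background distribution B_j is the same at every position.
    return sum(positional_relative_entropy(col, b_col) for col in m_cols)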


Example: 
--------

Suppose the list of motif instances is:

ATG
ATG
ATG
ATG
ATG
GTG
GTG
TTG

The frequency matrix is:

  1         2       3
A 0.625     0       0
C 0         0       0
G 0.25      0       1
T 0.125     1       0

The weight (log likelihood) matrix is:

A  1.32  -infty   -infty
C -infty -infty   -infty
G  0     -infty     2
T -1        2     -infty

(For example, entry in row G, column 3 is 
  log_2 (f_{G,3}/f_G) = log_2( 1/.25) = log_2 4 = 2.)

The positional relative entropies of the model, with respect to the
uniform background distribution, are:

   0.70      2      2

(Go through the calculations.) The fact that at position 2, the motif
instances always contain a "T" means that the motif differs more from
the uniform distribution at position 2 than at position 1. This is
consistent with the fact that the relative entropy at position 2 is
higher than at position 1.
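
[sketch: these values can be checked with a few lines of Python; the
column dictionaries below just restate the frequency matrix of this
example.]

import math

uniform = {b: 0.25 for b in "ACGT"}
cols = [{"A": 0.625, "C": 0.0, "G": 0.25, "T": 0.125},
        {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
        {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}]
for col in cols:
    d = sum(p * math.log2(p / uniform[b]) for b, p in col.items() if p > 0)
    print(round(d, 2))   # prints 0.7, then 2.0, then 2.0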

In fact, if the background distribution is the uniform distribution,
the positional relative entropy at any position can be at most 2.
Can you suggest how to change the background distribution, so that
the relative entropy is greater than 2?

If we changed the background distribution B so that f_A = f_T = 0.375,
f_C = f_G = 0.125, then the weight matrix and positional relative
entropies are:

A  0.737  -infty   -infty
C -infty  -infty   -infty
G  1      -infty     3
T -1.58    1.42    -infty

   0.512   1.42       3

Finally, the positional relative entropies for the CRP binding site
example of Example 1 are:

0.12   1.3   1.1   1.5   1.2   1.1   0.027

Note that 1.5 (middle position) is the highest relative entropy and
corresponds to the most biased column (see Table 7.2). The value 0.027
(last position) is the lowest relative entropy because the
distribution in this last position is the closest to the uniform
background distribution (see Table 7.2).

To conclude: if the relative entropy is low, it's hard to distinguish
an instance of a motif from the background distribution using log odds
scores. We will continue with a relative entropy formulation of
problem 2 next time.


---
Nonnegativity of Relative Entropy

You may have noticed that the relative entropy has always been
nonnegative. It is by no means obvious that this should be, since it
is the expected value of the log likelihood ratio, which can take
negative values. However, it turns out to always be the case.

Theorem 8.8: For any probability distributions P and Q over a sample
space S, D(P||Q) >= 0, with equality if and only if P and Q are
identical.

The proof (omitted) follows from the fact that ln x <= x - 1 for
all real numbers x > 0, with equality if and only if x = 1.
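
[sketch: the core of the omitted argument, a standard derivation added
here for completeness; apply ln x <= x - 1 with x = Q(s)/P(s) and sum
over the s with P(s) > 0:]

D(P \| Q) = -\frac{1}{\ln 2} \sum_{s:\, P(s)>0} P(s) \ln \frac{Q(s)}{P(s)}
          \ge -\frac{1}{\ln 2} \sum_{s:\, P(s)>0} P(s) \left( \frac{Q(s)}{P(s)} - 1 \right)
          = \frac{1}{\ln 2} \Big( \sum_{s:\, P(s)>0} P(s) - \sum_{s:\, P(s)>0} Q(s) \Big)
          \ge 0,

since the P(s) sum to 1 and the Q(s) sum to at most 1; equality holds
throughout if and only if Q(s) = P(s) for all s.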



---
The Motif Discovery Problem

goal: identify and characterise unknown, shared motifs (short, imperfectly
	conserved subsequences) in a set of unaligned sequences.


--
How to quantify the "amount of information" in a motif

- could use information content [def: ...]
- better: relative entropy of the set of sequences
	(allows one to take the background distribution into account)

assumption: each position is independent

Given: 
- background distribution on nucleotide frequencies 
	Q(N) = prob of N, N \in {A,C,G,T}
	Q(N) is position independent
- profile matrix = weight array
	P_j(N) = prob of N at pos j, N \in {A,C,G,T}

Note: P_j(N) is a probability distribution over bases

Relative entropy of P_j w.r.t. Q is defined as
	D(P_j||Q) := sum_{N in {A,C,G,T}} P_j(N) log_2 (P_j(N)/Q(N)).

Relative entropy of motif given by profile matrix [P_j(N)]_{N,j}, length L:
	D(P||Q) := sum_{j=1}^{L} D(P_j||Q)

For the example above, the positional relative entropy values, measured
in base 2, are

0.12 1.3 1.1 1.5 1.2 1.1 0.027

total relative entropy = 6.347 (bits)


Note: 
Relative entropy is a measure of the difference between two distributions;
it is also called "Kullback-Leibler distance"
(but: it is not a proper distance metric, since it is not symmetric
and doesn't satisfy the triangle inequality)

Note:
- D(P||Q) >= 0
- P = Q => D(P||Q) = 0
- D(P||Q) <= 2*L for a uniform background distribution Q
	(when is this bound attained?)
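
[sketch: to answer the parenthetical question, assume Q is the uniform
distribution over the four bases; a short derivation added for clarity:]

D(P_j \| Q) = \sum_{N} P_j(N) \log_2 \frac{P_j(N)}{1/4}
            = 2 + \sum_{N} P_j(N) \log_2 P_j(N)
            = 2 - H(P_j) \le 2,

where H is the Shannon entropy. Summing over the L positions gives
D(P||Q) <= 2*L, with equality if and only if every column P_j puts all
its probability on a single base (H(P_j) = 0). For a non-uniform
background the positional values can exceed 2, as in the modified
background example earlier (where one position reached 3).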


---
Resources:

Finding instances of unknown sites, Lecture 10, CS527, Winter
2000, CS&E, U. Washington.  See
http://www.cs.washington.edu/education/courses/527/00wi/

G. Z. Hertz and G. D. Stormo.  Identifying DNA and protein
patterns with statistically significant alignments of multiple
sequences.  Bioinformatics, 15(7/8):563-577, July/August 1999.  See
http://ural.wustl.edu/publications.html


Further Reading:

C. E. Lawrence and A. A. Reilly.  An expectation maximization (EM)
algorithm for the identification and characterization of common sites
in unaligned biopolymer sequences.  Proteins: structure, function and
genetics, 7:41-51, 1990.
 
G. D. Stormo and G. W. Hartzell III.  Identifying protein-binding
sites from unaligned DNA fragments.  Proc. Natl. Acad. Sci. USA,
86:1183-1187, 1989.

T. L. Bailey and C. Elkan. Unsupervised learning of multiple motifs
in biopolymers using expectation maximization.  Machine Learning,
21(1-2):51-80, 1995.  
http://meme.sdsc.edu/meme/website/papers.html

T. Akutsu.  Hardness results on gapless local multiple sequence
alignment.  Technical Report 98-MPS-42-2, Information Processing
Society of Japan, 1998.

L. P. Lim and C. B. Burge. A computational analysis of sequence
features involved in recognition of short introns, PNAS,
98(20):11193-11198, September 25, 2001. 
See www.pnas.org.
