Course:CPSC522/Hidden Markov Models

Hidden Markov Models

Hidden Markov model is a statistical model, where the system is modeled follows Markov process with hidden (unobserved) states.^[1]

Principal Author: Jiahong Chen

Collaborators: Dandan wang, Mehrdad Ghomi

Abstract

This page describes HMM as a probabilistic model, introduces the framework of it as well as giving an example to tell how a HMM is established. Also, it introduces three main questions of HMM and gives detailed algorithms to describe how to solve these questions.

Builds on

A hidden Markov model (HMM) is a statistical Markov model, where the system is modeled follows Markov process with hidden (unobserved) states, and builds on Graphical_Models. Also, it can be presented as the simplest dynamic Bayesian network.

Related Pages

As mentioned above, A hidden Markov model (HMM) is a statistical Markov model, which builds on Graphical Models. It is also related to Bayesian network and Markov Decision Process.

Content

Introduction

Hidden Markov Model(HMM) is a probabilistic sequence model used for assigning a label or class to each unit in a sequence, and mapping a sequence of labels to a sequence of observations^[2]. It could be applied to numerous fields like speech recognition, natural language processing, bioinformatics, pattern recognize and so on^[3]. HMM are able to compute the probability of a sequence of unobservable events, which usually described as hidden events with the help of a sequence of observed events^[2].

HMM gets its name from two defining properties. The first one is states of hidden process follow Markov property, that is, given state $S_{t-1}$ , current state $S_{t}$ is independent to states prior to $S_{t-1}$ . The second property is observation $O_{t}$ is only dependent to $S_{t}$ , where $S_{t}$ is unobservable ^[4]^[5]. Figure [1] indicates the belief representation of an HMM^[4]^[5].

Figure 1: HMM belief representation^[1].

Framework of HMM

Definition of HMM

The main procedure of generating a HMM is as time pass by, states will change from one to another and there will be a observation in each new state. In this way, a HMM knows all the states will occur and all the observations it might meet. For example, it has $N$ possible states and $M$ possible observations. So, one of $N$ states will occur each time, and one of $M$ observations will appear each time based on current state. Besides, HMM knows the probability of changing from one state to another and the probability of what observation it might meet in each new state in order to increase the sequence. Thus, it is able to transfer from current state to a new one and making a new observation at this new state. Moreover, given a time period of length $T$ , the HMM will have $T$ states and observations, and at each time, state $i_{t}$ is paired to observation $o_{t}$ .

A Hidden Markov model could be defined as follow: Assuming $Q$ is the set of all possible states while $V$ is a set of all possible observations: $Q=\{q_{1},q_{2},\cdots ,q_{N}\},$ $V=\{v_{1},v_{2},\cdots ,v_{M}\}$ , where N refers to the number of possible states, and M refers to the number of possible observations. Besides, $I$ and $O$ both have the length of $T$ and indicate state sequence and observation sequence respectively: $I=(i_{1},i_{2},\cdots ,i_{T}),O=(o_{1},o_{2},\cdots ,o_{T})$ . In this way, the state transposition probability matrix $A$ could be built as: $A=[a_{ij}]_{N\times N}$ , where $a_{ij}$ means at state $q_{i}$ , there will be a probability $a_{ij}$ for this state to transform to state $q_{j}$ . It is defined as follow:

a_{ij}=P(i_{t+1}=q_{j}|i_{t}=q_{i}),i=1,2,\cdots ,N;j=1,2,\cdots ,N;t\in \{1,2,\cdots ,T-1\}

Moreover, output emission probability matrix $B$ , which indicates the probability to generate observation $v_{k}$ at given state $q_{j}$ , could be established as $B=[b_{j}(k)]_{N\times M}$ , where $b_{j}(k)$ means at time $t$ , given state as $q_{i}$ , there is a probability $b_{j}(k)$ to observe $v_{k}$ :

b_{j}(k)=P(o_{t}=v_{k}|i_{t}=q_{i}),k=1,2,\cdots ,M;j=1,2,\cdots ,N

Furthermore, initial state vector $\pi$ , which intend to provide the initial state for a HMM could be defined as $\pi =(\pi _{i})$ , where $\pi _{i}=P(i_{1}=q_{i}),i=1,2,\cdots ,N$ , and it means the probability for real observation $i$ to be $q_{i}$ at initial time 1.

A Hidden Markov model is decided by initial state vector $\pi$ , state transposition probability matrix $A$ and output emission probability matrix $B$ . Assuming a HMM is represented by $\lambda$ , the model could be expressed as: $\lambda =(A,B,\pi )$ . Initial state vector $\pi$ for state transposition probability matrix $A$ decides hidden Markov chain, and output emission probability matrix $B$ determines how the observations is generated from states.

Generation of observation sequence

An observation sequence $O=\{o_{1},o_{2},\cdots ,c_{T}\}$ could be generated according to the definition of HMM.

Algorithm 1 (generation of observation sequence)

 input:  $\lambda =(A,B,\pi )$  and  $T$ 
 output:  $O=\{o_{1},o_{2},\cdots ,c_{T}\}$ 
 procedure:
 1. Generate initial states  $i_{i}$  according to  $\pi$ .
 2. Let  $t$  = 1.
 3. Generate  $o_{t}$ , according to output emission matrix  $b_{i_{t}}(k)$ 
 4. Generate  $i_{t+1}$  according to state transposition
    probability  ${a_{i_{t},i_{t}+1}}$ 
 5. Let  $t=t+1$ 
 	if  $t<T$ , go to step 3
 	else, stop.

Example 1 Here is an example given to explain how a hidden Markov model is established. Assuming there are 4 opaque boxes, each of them contains 10 white or red balls. Balls assigned in each box are shown in Table 1.

Table 1: Ball arrangement
Box No.	1	2	3	4
Red balls	5	3	6	8
White balls	5	7	4	2

Grab balls to generate a observation sequence as following rules. First, randomly choose one box from all four boxes with same probability, then grab a ball from this opaque box to record its color and put it back. Second, transfer to next box as following rules: if current box is No.1, the next box must be box No.2; if current box is No.2 and No.3, they have a probability of 0.4 to transfer to the box No.1 and No.2 respectively, and a probability of 0.6 to transfer to box No.3 and No.4 respectively; if current box is No.4, it might stay in box No.4 with probability of 0.5, otherwise it will go to box No.3. Third, move to new box according to the status and rules in step 2. Last, repeat step 2 and 3, for $T$ times to generate the observation sequence of ball color with length $T$ . For example, let $T=10$ , we might have a observation sequence as follow:

O=\{red,red,red,white,red,red,white,red,white,white\}

During the whole ball grabbing process, observer will only able to get the color of ball, they cannot know in which box is the ball come from, in an other words, they cannot observe the sequence of box transformation. In another words, box sequence is hidden, while the sequence of balls is observable.

In this way, we could determine possible states $Q$ , possible observations $V$ and initial state vector $\pi$ .

Possible states are:

Q=\{Box~No.1,Box~No.2,Box~No.3,Box~No.4\},~where~N=4

Possible observations are:

V=\{red,~white\},~where~M=4

Initial state vector could be defined as:

\pi =(0.25,~0.25,~0.25,~0.25)^{T}

Furthermore, state transposition probability matrix $A$ (Table 2) and output emission probability matrix $B$ (Table 3) could be defined.

Table 2: State transposition probability matrix: A
A	1	2	3	4
1	0	1	0	0
2	0.4	0	0.6	0
3	0	0.4	0	0.6
4	0	0	0.5	0.5

Table 3: Output emission probability matrix
B	1	2
1	0.5	0.5
2	0.3	0.7
3	0.6	0.4
4	0.8	0.2

In this way a hidden Markov model could be defined as $\lambda =(A,B,\pi )$ .

Three core problem of HMM

There are three fundamental questions of HMM ^[3], which are widely used to solve real world problems:

1. Probability calculation: given $\lambda =(A,B,\pi )$ and observation sequence $O=(o_{1},o_{2},\cdots ,o_{T})$ , calculate the probability $P(O|\lambda )$ of the occurrence of $O$ .

2. Learning: given observation sequence $O=(o_{1},o_{2},\cdots ,o_{T})$ , estimate attributes of $\lambda$ under maximum probability of $P(O|\lambda )$ .

3. Predicting (also named as decoding): given $\lambda =(A,B,\pi )$ and observation sequence $O=(o_{1},o_{2},\cdots ,o_{T})$ , find out state sequence $I=(i_{1},i_{2},\cdots ,i_{T})$ with maximum probability $P(I|\lambda )$ .

Probability calculation

This question is to compute the probability of a particular output sequence by using parameters of the HMM: $\lambda =(A,B,\pi )$ . In general, probability calculation problem could be solved by forward-backward algorithm ^[3]^[6]. Forward-backward algorithm includes forward procedure and backward procedure, both of them are able to compute the probability of output sequence individually.

Forward procedure

Definition: Given Markov model $\lambda$ , the observation sequence to time $t$ as $o_{1},o_{2},\cdots ,o_{t}$ and current state $q_{i}$ , forward probability $\alpha _{t}(i)$ could be defined as:

\alpha _{t}(i)=P(o_{1},o_{2},\cdots ,o_{t},i_{t}=q_{i}|\lambda )

In this way, observation probability $P(O|\lambda )$ for current state could be calculated recursively.

 Algorithm 2 (forward algorithm)
 input:  $\lambda =(A,B,\pi )$ ,  $O$ .
 output:  $P(O|\lambda )$ .
 procedure:
 1. Initialization:
 	 $\alpha _{1}(i)=\pi _{i}b_{i}(o_{1}),~i=1,2,\cdots ,N$ .
 2. Recursion:
 	 $\alpha _{t+1}(i)=[\sum \limits _{j=1}^{N}\alpha _{t}(j)a_{ji}]b_{i}(o_{t+1})$ ,
 	where  $t=1,2,\cdots ,T-1,i=1,2,\cdots ,N$ .
 3. Terminal:
 	 $P(O|\lambda )=\sum \limits _{i=1}^{N}\alpha _{T}(i)$ .

Example 2 Here is an example to present how forward procedure works. Assuming there is a box-ball observing system like Example 1 with HMM $\lambda =(A,B,\pi ),$ $Q=\{1,2,3\},$ $V=\{red,white\},$ $\pi =(0.2,0.4,0.4)^{T}$ . State transposition probability matrix and output emission probability matrix are shown in Table 4 and Table 5 respectively. Letting $T=3,O=(red,white,red)$ , and using the forward procedure to solve it.

Table 4: State transposition probability matrix for example 2
A	1	2	3
1	0.5	0.2	0.3
2	0.3	0.5	0.2
3	0.2	0.3	0.5

Table 5: Output emission probability matrix for example 2
B	1	2
1	0.5	0.5
2	0.4	0.6
3	0.7	0.3

 Solution
 1. Calculate initial values:
    $\alpha _{1}(1)=\pi _{1}b_{1}(o_{1})=0.1$ 
    $\alpha _{1}(2)=\pi _{2}b_{2}(o_{1})=0.16$ 
    $\alpha _{1}(3)=\pi _{3}b_{3}(o_{1})=0.28$ 
 2. Compute recursively
    $\alpha _{2}(1)=[\sum _{i=1}^{3}\alpha _{1}(i)a_{i1}]b_{1}(o_{2})=0.154\times 0.5=0.077$ 
    $\alpha _{2}(2)=[\sum _{i=1}^{3}\alpha _{1}(i)a_{i2}]b_{2}(o_{2})=0.184\times 0.6=0.1104$ 
    $\alpha _{2}(3)=[\sum _{i=1}^{3}\alpha _{1}(i)a_{i3}]b_{3}(o_{2})=0.202\times 0.3=0.0606$ 
    $\alpha _{3}(1)=[\sum _{i=1}^{3}\alpha _{2}(i)a_{i1}]b_{1}(o_{3})=0.04187$ 
    $\alpha _{3}(2)=[\sum _{i=1}^{3}\alpha _{2}(i)a_{i2}]b_{2}(o_{3})=0.03551$ 
    $\alpha _{3}(3)=[\sum _{i=1}^{3}\alpha _{2}(i)a_{i3}]b_{3}(o_{3})=0.05284$ 
 3. Sum
    $P(O|\lambda )=\sum _{i=1}^{3}\alpha _{3}(i)=0.13022$

Backward procedure

Backward procedure is similar to forward procedure. Given Markov model $\lambda$ , current state $q_{i}$ at time $t$ and the observation sequence $o_{t+1},o_{t+2},\cdots ,o_{T}$ from time $t$ to time $T$ , backward probability $\beta _{t}(i)$ could be defined as:

\beta _{t}(i)=P(o_{t+1},o_{t+2},\cdots ,o_{T},i_{t}=q_{i}|\lambda )

Similarly, observation probability $P(O|\lambda )$ for current state could be calculated.

 Algorithm 3 (backward algorithm)
 input:  $\lambda =(A,B,\pi )$ ,  $O$ .
 output:  $P(O|\lambda )$ .
 procedure:
 1. Initialization:
 	 $\beta _{T}(i)=1,~i=1,2,\cdots ,N$ .
 2. Recursion: 
 	 $\beta _{t}(i)=\sum \limits _{j=1}^{N}\beta _{t+1}(j)a_{ij}b_{i}(o_{t+1})$ ,
 	where  $t=T-1,T-2,\cdots ,1,i=1,2,\cdots ,N$ .
 3. Terminal:
 	 $P(O|\lambda )=\sum \limits _{i=1}^{N}\pi _{i}b_{i}(o_{1})\beta _{1}(i)$ .

Learning

HMM is also used to find the best set of state transition and output emission probabilities. Given an output sequence, this method could derive the maximum likelihood estimate of the parameters of the HMM. In this section, Expectation Maximization algorithm^[7] helps HMM to find out that, given a sequence of observations, what are the values of state transposition probability matrix $A$ and output emission probability matrix $B$ makes such observation most likely ^[6].

 Algorithm 4 (Expectation Maximization algorithm)
 input:  $O$ .
 output:  $\lambda =(A,B,\pi )$ ,  $arg~maxP(O|\lambda )$ .
 procedure:
 1. Initialization: set values of matrix  $A$ ,  $B$  as random
    valid numbers, where  $A_{i0}=0$  and  $B_{0}(k)=0$ ,  $i=1,2,\cdots ,N$ ,
     $k=1,2,\cdots ,M$ .
 2. E-step: run forward-backward algorithm o compute  $\alpha _{i}$  and
     $\beta _{i}$  for  $i=1,2,\cdots ,N$ :
 	 $\gamma _{t}(i,j):=\alpha _{t}(i)a_{ij}b_{j}(o_{t+1})\beta _{t+1}(i)$ 
 3. M-step: re-estimate the maximum likelihood of  $A$  and  $B$ :
 	 $a_{ij}:={\frac {\sum \limits _{t=1}^{T}\gamma _{t}(i,j)}{\sum \limits _{j=1}^{M}\sum \limits _{t=1}^{T}\gamma _{t}(i,j)}}$ 
 	 $b_{j}(k):={\frac {\sum \limits _{j=1}^{M}\sum \limits _{t=1}^{T}1\{o_{t}=v_{k}\}\gamma _{t}(i,j)}{\sum \limits _{j=1}^{M}\sum \limits _{t=1}^{T}\gamma _{t}(i,j)}}$ 
 4. Repeat step 2 and 3, until converge.

This algorithm is a variant of forward-backward algorithm, also called Baum-Welch algorithm. It is used for parameter learning for HMM.

Prediction

The final question of HMM is to compute what was the most likely sequence of states $I$ with the help of a sequence of observation. Viterbi algorithm uses dynamic programming to solve such problem ^[3].

To apply Viterbi algorithm, some variables ( $\delta$ , $\psi$ ) should be introduced.

$\delta$ stands for maximum probability routine among all possible states ( $i_{1},i_{2},\cdots ,i_{T}$ ) at time $t$ :

\delta _{t}(i)=\max _{i_{1},i_{2},\cdots ,i_{t-1}}P(i_{t}=i,i_{t-1},\cdots ,i_{1},o_{1},o_{2},\cdots ,o_{t}|\lambda ),i=q,2,\cdots ,N

Also, $\delta _{t+1}(i)$ could be calculated:

\delta _{t+1}(i)=\max _{i_{1},i_{2},\cdots ,i_{t}}P(i_{t+1}=i,i_{t},\cdots ,i_{1},o_{1},o_{2},\cdots ,o_{t}|\lambda )

=\max _{1\leq j\leq N}[\delta _{t}(j)a_{ji}]b_{i}(o_{t+1}),i=1,2,\cdots ,N;t=1,2,\cdots ,T-1

$\psi$ stands for the $t-1$ node of maximum probability routine among all possible states ( $i_{1},i_{2},\cdots ,i_{T}$ ) at time $t-1$ :

\psi _{t}(i)=arg\max _{1\leq j\leq N}[\delta _{t-1}(j)a_{ji}],i=1,2,\cdots ,N

The process of Viterbi algorithm is shown as follow:

 Algorithm 5 (Viterbi algorithm)
 input:  $\lambda =(A,B,\pi )$ ,  $O$ .
 output: optimal state routine  $I^{*}=(i_{1}^{*},i_{2}^{*},\cdots ,i_{T}^{*})$ ,
 	probability of such routine  $P^{*}$ 
 1. Initialization: 
 	 $\delta _{1}(i)=pi_{i}b_{i}(o_{1}),i=1,2,\cdots ,N$ 
 	 $\psi _{1}(i)=0,i=1,2,\cdots ,N$ 
 2. for t = (2, 3, \cdots, T)
 	 $\delta _{t}(i)=\max \limits _{1\leq j\leq N}[\delta _{t-1}(j)a_{ji}]b_{i}(o_{t}),i=1,2,\cdots ,N;$ 
 	 $\psi _{t}(i)=arg\max \limits _{1\leq j\leq N}[\delta _{t-1}(j)a_{ji}],i=1,2,\cdots ,N$ 
 3. Stop:
 	 $P^{*}=\max \limits _{1\leq j\leq N}\delta _{t}(i)$ 
 	 $i_{T}^{*}=arg\max \limits _{1\leq j\leq N}[\delta _{t}(i)]$

The Viterbi algorithm is similar to forward procedure except only tracks maximum probability and record its corresponding state sequence, rather than tracking the total probability of all observations so far.

Applications

Hidden Markov Model are applied to many fields to discover not immediately observable data sequence ^[1]:

Single Molecule Kinetic analysis^[8]
Cryptanalysis
Speech recognition
Speech synthesis
Part-of-speech tagging
Document Separation in scanning solutions
Machine translation
Partial discharge
Gene prediction
sequence alignment|Alignment of bio-sequences
Time Series Analysis
Activity recognition
Protein folding^[9]
Metamorphic Virus Detection^[10]
DNA Motif Discovery^[11]

Annotated Bibliography

↑ ^{Jump up to: 1.0} ^1.1 ^1.2 Hidden Markov Model, Available at: https://en.wikipedia.org/wiki/\Hidden_Markov_model [Accessed Jan-27-2016].
↑ ^{Jump up to: 2.0} ^2.1 Daniel, Jurafsky (2014). Speech and Language Processing. Prentice Hall. ISBN 978-0131873216.
↑ ^{Jump up to: 3.0} ^3.1 ^3.2 ^3.3 Hang, Li (2012). Statistical Learning Methods (in Chinese). Tsinghua University Press. ISBN 978-7-302-27595-4.
↑ ^{Jump up to: 4.0} ^4.1 Zoubin, Ghahramani (2001). "An Introduction to Hidden Markov Models and Bayesian Networks". International Journal of Pattern Recognition and Artificial Intelligence.
↑ ^{Jump up to: 5.0} ^5.1 David, Poole (2010). Artificial Intelligence: Foundations of Computational Agents. Cambridge University Press. ISBN 978-0521519007.
↑ ^{Jump up to: 6.0} ^6.1 Hidden Markov Models Fundamentals, Available at: http://cs229.stanford.edu/section/cs229-hmm.pdf [Accessed Jan-29-2016]
↑ Expectation Maximization algorithm, Available at: https://en.wikipedia.org/wiki/Expectation–maximization_algorithm [Accessed Jan-30-2016].
↑ NICOLAI, CHRISTOPHER (2013). "SOLVING ION CHANNEL KINETICS WITH THE QuB SOFTWARE". Biophysical Reviews and Letters. 8 (3n04): 191–211. doi:10.1142/S1793048013300053.
↑ Stigler, J.; Ziegler, F.; Gieseke, A.; Gebhardt, J. C. M.; Rief, M. (2011). "The Complex Folding Network of Single Calmodulin Molecules". Science. 334 (6055): 512–516. doi:10.1126/science.1207598. PMID 22034433.
↑ Wong, W.; Stamp, M. (2006). "Hunting for metamorphic engines". Journal in Computer Virology. 2 (3): 211–229. doi:10.1007/s11416-006-0028-7.
↑ Wong, K. -C.; Chan, T. -M.; Peng, C.; Li, Y.; Zhang, Z. (2013). "DNA motif elucidation using belief propagation". Nucleic Acids Research. 41 (16): e153. doi:10.1093/nar/gkt574. PMC 3763557. PMID 23814189.

To Add

1. Add the proof process of three core questions of HMM

2. Add more examples to better explain three core questions of HMM

Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0

[hmmwiki-1] {Jump up to: 1.0} ^1.1 ^1.2 Hidden Markov Model, Available at: https://en.wikipedia.org/wiki/\Hidden_Markov_model [Accessed Jan-27-2016].

[Jurafsky-2] {Jump up to: 2.0} ^2.1 Daniel, Jurafsky (2014). Speech and Language Processing. Prentice Hall. ISBN 978-0131873216.

[hangli-3] {Jump up to: 3.0} ^3.1 ^3.2 ^3.3 Hang, Li (2012). Statistical Learning Methods (in Chinese). Tsinghua University Press. ISBN 978-7-302-27595-4.

[Ghahramani-4] {Jump up to: 4.0} ^4.1 Zoubin, Ghahramani (2001). "An Introduction to Hidden Markov Models and Bayesian Networks". International Journal of Pattern Recognition and Artificial Intelligence.

[poole-5] {Jump up to: 5.0} ^5.1 David, Poole (2010). Artificial Intelligence: Foundations of Computational Agents. Cambridge University Press. ISBN 978-0521519007.

[stanford-6] {Jump up to: 6.0} ^6.1 Hidden Markov Models Fundamentals, Available at: http://cs229.stanford.edu/section/cs229-hmm.pdf [Accessed Jan-29-2016]

[7] Expectation Maximization algorithm, Available at: https://en.wikipedia.org/wiki/Expectation–maximization_algorithm [Accessed Jan-30-2016].

[8] NICOLAI, CHRISTOPHER (2013). "SOLVING ION CHANNEL KINETICS WITH THE QuB SOFTWARE". Biophysical Reviews and Letters. 8 (3n04): 191–211. doi:10.1142/S1793048013300053.

[9] Stigler, J.; Ziegler, F.; Gieseke, A.; Gebhardt, J. C. M.; Rief, M. (2011). "The Complex Folding Network of Single Calmodulin Molecules". Science. 334 (6055): 512–516. doi:10.1126/science.1207598. PMID 22034433.

[10] Wong, W.; Stamp, M. (2006). "Hunting for metamorphic engines". Journal in Computer Virology. 2 (3): 211–229. doi:10.1007/s11416-006-0028-7.

[11] Wong, K. -C.; Chan, T. -M.; Peng, C.; Li, Y.; Zhang, Z. (2013). "DNA motif elucidation using belief propagation". Nucleic Acids Research. 41 (16): e153. doi:10.1093/nar/gkt574. PMC 3763557. PMID 23814189.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]