# Course:CPSC522/Hidden Markov Models

## Hidden Markov Models

A hidden Markov model is a statistical model in which the system being modeled is assumed to follow a Markov process with hidden (unobserved) states.[1]

Principal Author: Jiahong Chen

## Abstract

This page describes the hidden Markov model (HMM) as a probabilistic model, introduces its framework, and gives an example showing how an HMM is established. It also introduces the three main problems of HMMs and presents detailed algorithms for solving them.

### Builds on

A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to follow a Markov process with hidden (unobserved) states; it builds on Graphical_Models. It can also be viewed as the simplest dynamic Bayesian network.

### Related Pages

As mentioned above, a hidden Markov model (HMM) is a statistical Markov model that builds on Graphical Models. It is also related to Bayesian networks and Markov Decision Processes.

## Content

### Introduction

The Hidden Markov Model (HMM) is a probabilistic sequence model used for assigning a label or class to each unit in a sequence, mapping a sequence of labels to a sequence of observations[2]. It is applied in numerous fields such as speech recognition, natural language processing, bioinformatics, and pattern recognition[3]. HMMs can compute the probability of a sequence of unobservable events, usually described as hidden events, with the help of a sequence of observed events[2].

The HMM gets its name from two defining properties. The first is that the hidden states follow the Markov property: given state ${\displaystyle S_{t-1}}$, the current state ${\displaystyle S_{t}}$ is independent of all states prior to ${\displaystyle S_{t-1}}$. The second is that each observation ${\displaystyle O_{t}}$ depends only on the current state ${\displaystyle S_{t}}$, which is itself unobservable [4][5]. Figure 1 shows the belief representation of an HMM[4][5].

Figure 1: HMM belief representation[1].

### Framework of HMM

#### Definition of HMM

An HMM generates a sequence as follows: as time passes, the process moves from one state to another, and each new state emits an observation. The model specifies all states that may occur and all observations that may be emitted; say there are ${\displaystyle N}$ possible states and ${\displaystyle M}$ possible observations. At each time step, one of the ${\displaystyle N}$ states occurs, and one of the ${\displaystyle M}$ observations is emitted based on the current state. The model also specifies the probability of moving from one state to another and the probability of each possible observation being emitted in the new state, so it can move from the current state to a new one and emit a new observation there, extending the sequence. Over a time period of length ${\displaystyle T}$, the HMM therefore produces ${\displaystyle T}$ states and ${\displaystyle T}$ observations, where at each time ${\displaystyle t}$ state ${\displaystyle i_{t}}$ is paired with observation ${\displaystyle o_{t}}$.

A hidden Markov model can be defined as follows. Let ${\displaystyle Q}$ be the set of all possible states and ${\displaystyle V}$ the set of all possible observations: ${\displaystyle Q=\{q_{1},q_{2},\cdots ,q_{N}\},}$ ${\displaystyle V=\{v_{1},v_{2},\cdots ,v_{M}\}}$, where ${\displaystyle N}$ is the number of possible states and ${\displaystyle M}$ is the number of possible observations. Let ${\displaystyle I}$ and ${\displaystyle O}$, both of length ${\displaystyle T}$, denote the state sequence and the observation sequence respectively: ${\displaystyle I=(i_{1},i_{2},\cdots ,i_{T}),O=(o_{1},o_{2},\cdots ,o_{T})}$. The state transition probability matrix ${\displaystyle A}$ can then be built as ${\displaystyle A=[a_{ij}]_{N\times N}}$, where ${\displaystyle a_{ij}}$ is the probability of moving from state ${\displaystyle q_{i}}$ to state ${\displaystyle q_{j}}$:

${\displaystyle a_{ij}=P(i_{t+1}=q_{j}|i_{t}=q_{i}),i=1,2,\cdots ,N;j=1,2,\cdots ,N;t\in \{1,2,\cdots ,T-1\}}$

Moreover, the output emission probability matrix ${\displaystyle B}$, which gives the probability of generating observation ${\displaystyle v_{k}}$ in a given state ${\displaystyle q_{j}}$, can be established as ${\displaystyle B=[b_{j}(k)]_{N\times M}}$, where ${\displaystyle b_{j}(k)}$ is the probability of observing ${\displaystyle v_{k}}$ at time ${\displaystyle t}$ given state ${\displaystyle q_{j}}$:

${\displaystyle b_{j}(k)=P(o_{t}=v_{k}|i_{t}=q_{j}),k=1,2,\cdots ,M;j=1,2,\cdots ,N}$

Furthermore, the initial state vector ${\displaystyle \pi }$, which provides the initial state distribution of an HMM, can be defined as ${\displaystyle \pi =(\pi _{i})}$, where ${\displaystyle \pi _{i}=P(i_{1}=q_{i}),i=1,2,\cdots ,N}$ is the probability that the state at initial time 1 is ${\displaystyle q_{i}}$.

A hidden Markov model is determined by the initial state vector ${\displaystyle \pi }$, the state transition probability matrix ${\displaystyle A}$ and the output emission probability matrix ${\displaystyle B}$. Writing the HMM as ${\displaystyle \lambda }$, the model can be expressed as ${\displaystyle \lambda =(A,B,\pi )}$. The initial state vector ${\displaystyle \pi }$ and the state transition probability matrix ${\displaystyle A}$ determine the hidden Markov chain, while the output emission probability matrix ${\displaystyle B}$ determines how the observations are generated from states.

#### Generation of observation sequence

An observation sequence ${\displaystyle O=\{o_{1},o_{2},\cdots ,o_{T}\}}$ can be generated according to the definition of an HMM.

 Algorithm 1 (generation of observation sequence)
input: ${\displaystyle \lambda =(A,B,\pi )}$ and ${\displaystyle T}$
output: ${\displaystyle O=\{o_{1},o_{2},\cdots ,o_{T}\}}$
procedure:
1. Generate the initial state ${\displaystyle i_{1}}$ according to ${\displaystyle \pi }$.
2. Let ${\displaystyle t}$ = 1.
3. Generate ${\displaystyle o_{t}}$ according to the output emission probability ${\displaystyle b_{i_{t}}(k)}$.
4. Generate ${\displaystyle i_{t+1}}$ according to the state transition
probability ${\displaystyle {a_{i_{t},i_{t+1}}}}$.
5. Let ${\displaystyle t=t+1}$;
if ${\displaystyle t<T}$, go to step 3;
else, stop.


Example 1 Here is an example explaining how a hidden Markov model is established. Assume there are 4 opaque boxes, each containing 10 red or white balls. The balls in each box are shown in Table 1.

Table 1: Ball arrangement

| Box No. | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Red balls | 5 | 3 | 6 | 8 |
| White balls | 5 | 7 | 4 | 2 |

Balls are grabbed to generate an observation sequence by the following rules. First, randomly choose one of the four boxes with equal probability, grab a ball from that opaque box, record its color, and put it back. Second, move to the next box by these rules: if the current box is No.1, the next box must be No.2; if the current box is No.2 or No.3, there is a probability of 0.4 of moving to box No.1 or No.2 respectively, and a probability of 0.6 of moving to box No.3 or No.4 respectively; if the current box is No.4, it stays at box No.4 with probability 0.5, and otherwise moves to box No.3. Third, grab a ball from the new box as in step 1. Last, repeat steps 2 and 3 ${\displaystyle T}$ times to generate an observation sequence of ball colors with length ${\displaystyle T}$. For example, letting ${\displaystyle T=10}$, we might get the following observation sequence:

${\displaystyle O=\{red,red,red,white,red,red,white,red,white,white\}}$

During the whole ball-grabbing process, the observer can only see the color of each ball; they cannot tell which box each ball comes from, i.e., they cannot observe the sequence of box transitions. In other words, the box sequence is hidden, while the sequence of ball colors is observable.

In this way, we could determine possible states ${\displaystyle Q}$, possible observations ${\displaystyle V}$ and initial state vector ${\displaystyle \pi }$.

Possible states are:

${\displaystyle Q=\{Box~No.1,Box~No.2,Box~No.3,Box~No.4\},~where~N=4}$

Possible observations are:

${\displaystyle V=\{red,~white\},~where~M=2}$

Initial state vector could be defined as:

${\displaystyle \pi =(0.25,~0.25,~0.25,~0.25)^{T}}$

Furthermore, the state transition probability matrix ${\displaystyle A}$ (Table 2) and the output emission probability matrix ${\displaystyle B}$ (Table 3) can be defined.

Table 2: State transition probability matrix A

| A | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0.4 | 0 | 0.6 | 0 |
| 3 | 0 | 0.4 | 0 | 0.6 |
| 4 | 0 | 0 | 0.5 | 0.5 |

Table 3: Output emission probability matrix B

| B | red | white |
|---|---|---|
| 1 | 0.5 | 0.5 |
| 2 | 0.3 | 0.7 |
| 3 | 0.6 | 0.4 |
| 4 | 0.8 | 0.2 |

In this way a hidden Markov model could be defined as ${\displaystyle \lambda =(A,B,\pi )}$.
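
Algorithm 1 with Example 1's parameters can be sketched in plain Python. This is a minimal illustration, not part of the original text: the function name `sample_hmm` is an assumption, states are 0-indexed boxes, and observations are encoded as 0 = red, 1 = white.

```python
import random

# Example 1's parameters: states are boxes 1-4 (0-indexed); observations: 0 = red, 1 = white.
A = [[0.0, 1.0, 0.0, 0.0],
     [0.4, 0.0, 0.6, 0.0],
     [0.0, 0.4, 0.0, 0.6],
     [0.0, 0.0, 0.5, 0.5]]
B = [[0.5, 0.5],
     [0.3, 0.7],
     [0.6, 0.4],
     [0.8, 0.2]]
pi = [0.25, 0.25, 0.25, 0.25]

def sample_hmm(A, B, pi, T, seed=0):
    """Algorithm 1: sample a length-T state/observation sequence from lambda = (A, B, pi)."""
    rng = random.Random(seed)
    states, obs = [], []
    s = rng.choices(range(len(pi)), weights=pi)[0]                   # step 1: draw i_1 from pi
    for _ in range(T):
        states.append(s)
        obs.append(rng.choices(range(len(B[s])), weights=B[s])[0])   # step 3: emit o_t ~ b_{i_t}
        s = rng.choices(range(len(A[s])), weights=A[s])[0]           # step 4: move i_t -> i_{t+1}
    return states, obs

states, obs = sample_hmm(A, B, pi, T=10)
print([["red", "white"][o] for o in obs])
```

Running it prints one possible color sequence of length 10, analogous to the sequence ${\displaystyle O}$ above; the box sequence `states` is what an observer would not see.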

### Three core problems of HMM

There are three fundamental problems of HMMs [3], which are widely used to solve real-world problems:

1. Probability calculation: given ${\displaystyle \lambda =(A,B,\pi )}$ and observation sequence ${\displaystyle O=(o_{1},o_{2},\cdots ,o_{T})}$, calculate the probability ${\displaystyle P(O|\lambda )}$ of the occurrence of ${\displaystyle O}$.

2. Learning: given observation sequence ${\displaystyle O=(o_{1},o_{2},\cdots ,o_{T})}$, estimate the parameters of ${\displaystyle \lambda }$ that maximize ${\displaystyle P(O|\lambda )}$.

3. Prediction (also called decoding): given ${\displaystyle \lambda =(A,B,\pi )}$ and observation sequence ${\displaystyle O=(o_{1},o_{2},\cdots ,o_{T})}$, find the state sequence ${\displaystyle I=(i_{1},i_{2},\cdots ,i_{T})}$ with maximum probability ${\displaystyle P(I|O,\lambda )}$.

#### Probability calculation

This problem is to compute the probability of a particular output sequence using the parameters of the HMM ${\displaystyle \lambda =(A,B,\pi )}$. In general, it can be solved by the forward-backward algorithm [3][6], which consists of a forward procedure and a backward procedure; each of them can compute the probability of the output sequence on its own.

##### Forward procedure

Definition: given hidden Markov model ${\displaystyle \lambda }$, the observation sequence up to time ${\displaystyle t}$ as ${\displaystyle o_{1},o_{2},\cdots ,o_{t}}$ and current state ${\displaystyle q_{i}}$, the forward probability ${\displaystyle \alpha _{t}(i)}$ is defined as:

${\displaystyle \alpha _{t}(i)=P(o_{1},o_{2},\cdots ,o_{t},i_{t}=q_{i}|\lambda )}$

In this way, the observation probability ${\displaystyle P(O|\lambda )}$ can be calculated recursively.

 Algorithm 2 (forward algorithm)
input: ${\displaystyle \lambda =(A,B,\pi )}$, ${\displaystyle O}$.
output: ${\displaystyle P(O|\lambda )}$.
procedure:
1. Initialization:
${\displaystyle \alpha _{1}(i)=\pi _{i}b_{i}(o_{1}),~i=1,2,\cdots ,N}$.
2. Recursion:
${\displaystyle \alpha _{t+1}(i)=[\sum \limits _{j=1}^{N}\alpha _{t}(j)a_{ji}]b_{i}(o_{t+1})}$,
where ${\displaystyle t=1,2,\cdots ,T-1,i=1,2,\cdots ,N}$.
3. Termination:
${\displaystyle P(O|\lambda )=\sum \limits _{i=1}^{N}\alpha _{T}(i)}$.


Example 2 Here is an example showing how the forward procedure works. Assume there is a box-and-ball observing system like Example 1, with HMM ${\displaystyle \lambda =(A,B,\pi ),}$ ${\displaystyle Q=\{1,2,3\},}$ ${\displaystyle V=\{red,white\},}$ ${\displaystyle \pi =(0.2,0.4,0.4)^{T}}$. The state transition probability matrix and output emission probability matrix are shown in Table 4 and Table 5 respectively. Let ${\displaystyle T=3,O=(red,white,red)}$, and use the forward procedure to compute ${\displaystyle P(O|\lambda )}$.

Table 4: State transition probability matrix for Example 2

| A | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0.5 | 0.2 | 0.3 |
| 2 | 0.3 | 0.5 | 0.2 |
| 3 | 0.2 | 0.3 | 0.5 |

Table 5: Output emission probability matrix for Example 2

| B | red | white |
|---|---|---|
| 1 | 0.5 | 0.5 |
| 2 | 0.4 | 0.6 |
| 3 | 0.7 | 0.3 |
 Solution
1. Calculate initial values:
${\displaystyle \alpha _{1}(1)=\pi _{1}b_{1}(o_{1})=0.1}$
${\displaystyle \alpha _{1}(2)=\pi _{2}b_{2}(o_{1})=0.16}$
${\displaystyle \alpha _{1}(3)=\pi _{3}b_{3}(o_{1})=0.28}$
2. Compute recursively
${\displaystyle \alpha _{2}(1)=[\sum _{i=1}^{3}\alpha _{1}(i)a_{i1}]b_{1}(o_{2})=0.154\times 0.5=0.077}$
${\displaystyle \alpha _{2}(2)=[\sum _{i=1}^{3}\alpha _{1}(i)a_{i2}]b_{2}(o_{2})=0.184\times 0.6=0.1104}$
${\displaystyle \alpha _{2}(3)=[\sum _{i=1}^{3}\alpha _{1}(i)a_{i3}]b_{3}(o_{2})=0.202\times 0.3=0.0606}$
${\displaystyle \alpha _{3}(1)=[\sum _{i=1}^{3}\alpha _{2}(i)a_{i1}]b_{1}(o_{3})=0.04187}$
${\displaystyle \alpha _{3}(2)=[\sum _{i=1}^{3}\alpha _{2}(i)a_{i2}]b_{2}(o_{3})=0.03551}$
${\displaystyle \alpha _{3}(3)=[\sum _{i=1}^{3}\alpha _{2}(i)a_{i3}]b_{3}(o_{3})=0.05284}$
3. Sum
${\displaystyle P(O|\lambda )=\sum _{i=1}^{3}\alpha _{3}(i)=0.13022}$
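
The forward procedure can be sketched in plain Python; the numbers below are Example 2's model (observations encoded as 0 = red, 1 = white), and the function name `forward` is illustrative.

```python
# Example 2's model: 3 states; observations: 0 = red, 1 = white.
A = [[0.5, 0.2, 0.3],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]
B = [[0.5, 0.5],
     [0.4, 0.6],
     [0.7, 0.3]]
pi = [0.2, 0.4, 0.4]
O = [0, 1, 0]  # (red, white, red)

def forward(A, B, pi, O):
    """Algorithm 2: return P(O | lambda) and the table of forward probabilities alpha."""
    N = len(pi)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]        # 1. initialization
    for o in O[1:]:                                         # 2. recursion over t = 2..T
        prev = alpha[-1]
        alpha.append([sum(prev[j] * A[j][i] for j in range(N)) * B[i][o]
                      for i in range(N)])
    return sum(alpha[-1]), alpha                            # 3. termination: sum of alpha_T(i)

p, alpha = forward(A, B, pi, O)
print(round(p, 5))  # 0.13022, matching the hand calculation above
```

The intermediate rows of `alpha` reproduce the hand-computed values ${\displaystyle \alpha _{1}(i)}$, ${\displaystyle \alpha _{2}(i)}$, ${\displaystyle \alpha _{3}(i)}$ in the solution.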

##### Backward procedure

The backward procedure is similar to the forward procedure. Given hidden Markov model ${\displaystyle \lambda }$, the current state ${\displaystyle q_{i}}$ at time ${\displaystyle t}$ and the observation sequence ${\displaystyle o_{t+1},o_{t+2},\cdots ,o_{T}}$ from time ${\displaystyle t+1}$ to time ${\displaystyle T}$, the backward probability ${\displaystyle \beta _{t}(i)}$ is defined as:

${\displaystyle \beta _{t}(i)=P(o_{t+1},o_{t+2},\cdots ,o_{T}|i_{t}=q_{i},\lambda )}$

Similarly, observation probability ${\displaystyle P(O|\lambda )}$ for current state could be calculated.

 Algorithm 3 (backward algorithm)
input: ${\displaystyle \lambda =(A,B,\pi )}$, ${\displaystyle O}$.
output: ${\displaystyle P(O|\lambda )}$.
procedure:
1. Initialization:
${\displaystyle \beta _{T}(i)=1,~i=1,2,\cdots ,N}$.
2. Recursion:
${\displaystyle \beta _{t}(i)=\sum \limits _{j=1}^{N}a_{ij}b_{j}(o_{t+1})\beta _{t+1}(j)}$,
where ${\displaystyle t=T-1,T-2,\cdots ,1,i=1,2,\cdots ,N}$.
3. Termination:
${\displaystyle P(O|\lambda )=\sum \limits _{i=1}^{N}\pi _{i}b_{i}(o_{1})\beta _{1}(i)}$.
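
The backward procedure, applied to the same Example 2 model, gives the same answer as the forward procedure; this is a minimal sketch with an illustrative function name, using the 0 = red, 1 = white encoding.

```python
# Example 2's model again: 3 states; observations: 0 = red, 1 = white.
A = [[0.5, 0.2, 0.3],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]
B = [[0.5, 0.5],
     [0.4, 0.6],
     [0.7, 0.3]]
pi = [0.2, 0.4, 0.4]
O = [0, 1, 0]  # (red, white, red)

def backward(A, B, pi, O):
    """Algorithm 3: compute P(O | lambda) via backward probabilities beta."""
    N = len(pi)
    beta = [1.0] * N                           # 1. initialization: beta_T(i) = 1
    for o in reversed(O[1:]):                  # 2. recursion: t = T-1, ..., 1
        beta = [sum(A[i][j] * B[j][o] * beta[j] for j in range(N))
                for i in range(N)]
    # 3. termination: P(O|lambda) = sum_i pi_i b_i(o_1) beta_1(i)
    return sum(pi[i] * B[i][O[0]] * beta[i] for i in range(N))

print(round(backward(A, B, pi, O), 5))  # 0.13022, matching the forward procedure
```

Agreement between the two procedures is a useful sanity check when implementing either one.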


#### Learning

An HMM can also be used to find the best set of state transition and output emission probabilities: given an output sequence, derive the maximum likelihood estimate of the parameters of the HMM. In this section, the Expectation Maximization (EM) algorithm[7] is used to find, given a sequence of observations, the values of the state transition probability matrix ${\displaystyle A}$ and output emission probability matrix ${\displaystyle B}$ that make the observations most likely [6].

 Algorithm 4 (Expectation Maximization algorithm)
input: ${\displaystyle O}$.
output: ${\displaystyle \lambda =(A,B,\pi )}$ achieving ${\displaystyle arg~\max _{\lambda }P(O|\lambda )}$.
procedure:
1. Initialization: set the entries of matrices ${\displaystyle A}$ and ${\displaystyle B}$ to random
valid probabilities (each row summing to 1), ${\displaystyle i=1,2,\cdots ,N}$,
${\displaystyle k=1,2,\cdots ,M}$.
2. E-step: run the forward-backward algorithm to compute ${\displaystyle \alpha _{t}(i)}$ and
${\displaystyle \beta _{t}(i)}$ for ${\displaystyle i=1,2,\cdots ,N}$, then set
${\displaystyle \gamma _{t}(i,j):=\alpha _{t}(i)a_{ij}b_{j}(o_{t+1})\beta _{t+1}(j)}$
3. M-step: re-estimate ${\displaystyle A}$ and ${\displaystyle B}$ by maximum likelihood:
${\displaystyle a_{ij}:={\frac {\sum \limits _{t=1}^{T-1}\gamma _{t}(i,j)}{\sum \limits _{j=1}^{N}\sum \limits _{t=1}^{T-1}\gamma _{t}(i,j)}}}$
${\displaystyle b_{j}(k):={\frac {\sum \limits _{i=1}^{N}\sum \limits _{t=1}^{T-1}1\{o_{t}=v_{k}\}\gamma _{t}(i,j)}{\sum \limits _{i=1}^{N}\sum \limits _{t=1}^{T-1}\gamma _{t}(i,j)}}}$
4. Repeat steps 2 and 3 until convergence.


This algorithm, a variant of the forward-backward algorithm also called the Baum-Welch algorithm, is used for parameter learning in HMMs.
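
The learning step can be sketched as a compact pure-Python Baum-Welch implementation. This is one possible sketch under the standard ${\displaystyle \xi /\gamma }$ formulation (expected transition counts and expected state occupancies), not the article's exact notation; the function names and the random initialization scheme are illustrative assumptions, and no scaling is used, so it is only suitable for short sequences.

```python
import random

def forward_table(A, B, pi, O):
    """Forward probabilities alpha_t(i) for every t."""
    N = len(pi)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for o in O[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[j] * A[j][i] for j in range(N)) * B[i][o]
                      for i in range(N)])
    return alpha

def backward_table(A, B, O, N):
    """Backward probabilities beta_t(i) for every t."""
    beta = [[1.0] * N]
    for o in reversed(O[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

def baum_welch(O, N, M, iters=20, seed=1):
    """Estimate lambda = (A, B, pi) from O by EM, starting from a random valid model."""
    rng = random.Random(seed)
    norm = lambda v: [x / sum(v) for x in v]
    A = [norm([rng.random() + 0.1 for _ in range(N)]) for _ in range(N)]
    B = [norm([rng.random() + 0.1 for _ in range(M)]) for _ in range(N)]
    pi = norm([rng.random() + 0.1 for _ in range(N)])
    T = len(O)
    for _ in range(iters):
        al, be = forward_table(A, B, pi, O), backward_table(A, B, O, N)
        pO = sum(al[-1])
        # E-step: xi_t(i,j) = P(i_t=q_i, i_{t+1}=q_j | O), gamma_t(i) = P(i_t=q_i | O)
        xi = [[[al[t][i] * A[i][j] * B[j][O[t + 1]] * be[t + 1][j] / pO
                for j in range(N)] for i in range(N)] for t in range(T - 1)]
        gamma = [[al[t][i] * be[t][i] / pO for i in range(N)] for t in range(T)]
        # M-step: re-estimate the parameters from the expected counts
        pi = gamma[0][:]
        A = [norm([sum(xi[t][i][j] for t in range(T - 1)) for j in range(N)])
             for i in range(N)]
        B = [norm([sum(gamma[t][i] for t in range(T) if O[t] == k)
                   for k in range(M)]) for i in range(N)]
    return A, B, pi
```

Each EM iteration cannot decrease ${\displaystyle P(O|\lambda )}$, and the algorithm converges to a local maximum that depends on the random initialization.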

#### Prediction

The final problem of HMM is to compute the most likely sequence of states ${\displaystyle I}$ given a sequence of observations. The Viterbi algorithm uses dynamic programming to solve this problem [3].

To apply Viterbi algorithm, some variables (${\displaystyle \delta }$, ${\displaystyle \psi }$) should be introduced.

${\displaystyle \delta _{t}(i)}$ stands for the maximum probability over all partial state routines (${\displaystyle i_{1},i_{2},\cdots ,i_{t-1}}$) that end in state ${\displaystyle i}$ at time ${\displaystyle t}$:

${\displaystyle \delta _{t}(i)=\max _{i_{1},i_{2},\cdots ,i_{t-1}}P(i_{t}=i,i_{t-1},\cdots ,i_{1},o_{1},o_{2},\cdots ,o_{t}|\lambda ),i=1,2,\cdots ,N}$

Also, ${\displaystyle \delta _{t+1}(i)}$ could be calculated:

${\displaystyle \delta _{t+1}(i)=\max _{i_{1},i_{2},\cdots ,i_{t}}P(i_{t+1}=i,i_{t},\cdots ,i_{1},o_{1},o_{2},\cdots ,o_{t+1}|\lambda )}$
${\displaystyle =\max _{1\leq j\leq N}[\delta _{t}(j)a_{ji}]b_{i}(o_{t+1}),i=1,2,\cdots ,N;t=1,2,\cdots ,T-1}$

${\displaystyle \psi _{t}(i)}$ stands for the state at time ${\displaystyle t-1}$ on the maximum-probability routine that ends in state ${\displaystyle i}$ at time ${\displaystyle t}$:

${\displaystyle \psi _{t}(i)=arg\max _{1\leq j\leq N}[\delta _{t-1}(j)a_{ji}],i=1,2,\cdots ,N}$

The process of Viterbi algorithm is shown as follow:

 Algorithm 5 (Viterbi algorithm)
input: ${\displaystyle \lambda =(A,B,\pi )}$, ${\displaystyle O}$.
output: optimal state routine ${\displaystyle I^{*}=(i_{1}^{*},i_{2}^{*},\cdots ,i_{T}^{*})}$,
probability of such routine ${\displaystyle P^{*}}$
1. Initialization:
${\displaystyle \delta _{1}(i)=\pi _{i}b_{i}(o_{1}),i=1,2,\cdots ,N}$
${\displaystyle \psi _{1}(i)=0,i=1,2,\cdots ,N}$
2. Recursion: for ${\displaystyle t=2,3,\cdots ,T}$
${\displaystyle \delta _{t}(i)=\max \limits _{1\leq j\leq N}[\delta _{t-1}(j)a_{ji}]b_{i}(o_{t}),i=1,2,\cdots ,N;}$
${\displaystyle \psi _{t}(i)=arg\max \limits _{1\leq j\leq N}[\delta _{t-1}(j)a_{ji}],i=1,2,\cdots ,N}$
3. Termination:
${\displaystyle P^{*}=\max \limits _{1\leq i\leq N}\delta _{T}(i)}$
${\displaystyle i_{T}^{*}=arg\max \limits _{1\leq i\leq N}[\delta _{T}(i)]}$
4. Backtracking: for ${\displaystyle t=T-1,T-2,\cdots ,1}$
${\displaystyle i_{t}^{*}=\psi _{t+1}(i_{t+1}^{*})}$


The Viterbi algorithm is similar to the forward procedure, except that it tracks only the maximum probability and records the corresponding state sequence, rather than tracking the total probability over all state sequences so far.
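
Applied to Example 2's model with ${\displaystyle O=(red,white,red)}$, the algorithm can be sketched in plain Python (function name illustrative; observations encoded as 0 = red, 1 = white; states reported 1-indexed to match the text):

```python
# Example 2's model: 3 states; observations: 0 = red, 1 = white.
A = [[0.5, 0.2, 0.3],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]
B = [[0.5, 0.5],
     [0.4, 0.6],
     [0.7, 0.3]]
pi = [0.2, 0.4, 0.4]
O = [0, 1, 0]  # (red, white, red)

def viterbi(A, B, pi, O):
    """Algorithm 5: most likely state sequence I* and its probability P*."""
    N, T = len(pi), len(O)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]     # 1. initialization
    psi = []
    for t in range(1, T):                              # 2. recursion
        scores = [[delta[j] * A[j][i] for j in range(N)] for i in range(N)]
        psi.append([max(range(N), key=lambda j: scores[i][j]) for i in range(N)])
        delta = [max(scores[i]) * B[i][O[t]] for i in range(N)]
    best = max(range(N), key=lambda j: delta[j])       # 3. termination
    path = [best]
    for back in reversed(psi):                         # 4. backtracking
        path.append(back[path[-1]])
    return [s + 1 for s in reversed(path)], max(delta)

path, p = viterbi(A, B, pi, O)
print(path, round(p, 4))  # [3, 3, 3] 0.0147
```

So the most likely box routine here is ${\displaystyle I^{*}=(3,3,3)}$ with ${\displaystyle P^{*}=0.0147}$, in contrast to the forward procedure's total probability of 0.13022 for the same observations.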

## Applications

Hidden Markov models are applied in many fields to uncover data sequences that are not immediately observable [1], such as ion channel kinetics[8], single-molecule folding analysis[9], metamorphic virus detection[10] and DNA motif discovery[11].

## Annotated Bibliography

1. Hidden Markov Model, Available at: https://en.wikipedia.org/wiki/Hidden_Markov_model [Accessed Jan-27-2016].
2. Daniel, Jurafsky (2014). Speech and Language Processing. Prentice Hall. ISBN 978-0131873216.
3. Hang, Li (2012). Statistical Learning Methods (in Chinese). Tsinghua University Press. ISBN 978-7-302-27595-4.
4. Zoubin, Ghahramani (2001). "An Introduction to Hidden Markov Models and Bayesian Networks". International Journal of Pattern Recognition and Artificial Intelligence.
5. David, Poole (2010). Artificial Intelligence: Foundations of Computational Agents. Cambridge University Press. ISBN 978-0521519007.
6. Hidden Markov Models Fundamentals, Available at: http://cs229.stanford.edu/section/cs229-hmm.pdf [Accessed Jan-29-2016]
7. Expectation Maximization algorithm, Available at: https://en.wikipedia.org/wiki/Expectation–maximization_algorithm [Accessed Jan-30-2016].
8. NICOLAI, CHRISTOPHER (2013). "SOLVING ION CHANNEL KINETICS WITH THE QuB SOFTWARE". Biophysical Reviews and Letters. 8 (3n04): 191–211. doi:10.1142/S1793048013300053.
9. Stigler, J.; Ziegler, F.; Gieseke, A.; Gebhardt, J. C. M.; Rief, M. (2011). "The Complex Folding Network of Single Calmodulin Molecules". Science. 334 (6055): 512–516. doi:10.1126/science.1207598. PMID 22034433.
10. Wong, W.; Stamp, M. (2006). "Hunting for metamorphic engines". Journal in Computer Virology. 2 (3): 211–229. doi:10.1007/s11416-006-0028-7.
11. Wong, K. -C.; Chan, T. -M.; Peng, C.; Li, Y.; Zhang, Z. (2013). "DNA motif elucidation using belief propagation". Nucleic Acids Research. 41 (16): e153. doi:10.1093/nar/gkt574. PMC . PMID 23814189.

## To Add

1. Add the proofs for the three core problems of HMM

2. Add more examples to better explain the three core problems of HMM

 Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0