Probabilistic Models for Single Cell Sequencing
This page showcases how the paper "CopyMix: Mixture Model Based Single-Cell Clustering and Copy Number Profiling using Variational Inference" builds on the paper "Interactive analysis and assessment of single-cell copy-number variations" to create an effective model for single cell sequencing data.
Principal Author: Katherine Breen
Abstract
Single cell sequencing has become a relevant area in cancer research over the last decade; however, modelling this data in a useful form is difficult. Clustering single cell sequences into smaller sub-populations of cells based on their mutations offers insight into cancer progression, metastasis, therapy resistance and more [1]. Traditionally, to cluster these sequences, models have first identified mutations in cells and then clustered the cells using hierarchical clustering [2]. Newer probabilistic models identify mutations and cluster cells simultaneously [3]. These models have outperformed the traditional clustering models in terms of computational efficiency and V-measure scores [3]. This page showcases how a current probabilistic model, CopyMix, was built to overcome the limitations of a previous state-of-the-art clustering model, Ginkgo [3][2].
Builds on
This article applies Markov Chains and Variational Inference to create a probabilistic model.
Related Pages
There are no UBC wiki pages that reference this page.
Content
Single Cell Sequencing for Cancer Research
Understanding the genetic mutations that cause cancer is key to improving cancer treatments [1]. When genes work properly, cells grow and divide to replace damaged or aging cells. When a gene becomes damaged, the gene can mutate and cause normal cells to divide and grow out of control, leading to cancer [4]. Over the past decade, cell sequencing has improved researchers’ understanding of these mutations [1].
Cell sequencing is used to observe the genomes of a cell sample [1]. In cancer research, a cell sample is typically a tumour sample [3]. Tumours consist of several cell sub-populations, each with its own genetic properties [3]. Clustering single cell data allows researchers to identify these sub-populations, an important step towards understanding cancer progression, therapy resistance and more [1].
To identify these cell sub-populations, researchers identify the mutations contained in individual cells from a group of cells and then cluster the individual cells based on their mutations [1]. The types of mutations are typically referred to as copy numbers [1]. Initially, models were built to complete these tasks sequentially, that is, to identify the copy numbers in each single cell first and then cluster the cells [2]. However, these sequential models fall short of the best possible performance even if each task is performed optimally [3]. Newer models complete these tasks simultaneously and have been shown to outperform sequential models [3].
This article showcases how the newer simultaneous model, named CopyMix, builds off of the sequential model, named Ginkgo. Although these models have cancer-based applications, this article focuses on the methods used to build them and thus gives only high-level explanations of the biological application. For further information on single-cell sequencing and cancer cells, additional links are provided in the text and at the end of this article.
Ginkgo: A hierarchical clustering method for single cell sequencing
Overview
Ginkgo is an open source single-cell sequencing platform based on hierarchical clustering. Before clustering the single-cell data, this model performs a large amount of data processing, during which outliers are removed. The length of the sequence is also adjusted and accounted for [2].
Once the data is processed, the model maps the single cell data to the genome. From this mapped genome, the mutations in each sample are observed. Using the mapped genomes and mutation data, the data is then clustered using hierarchical clustering [2].
To perform clustering, a distance matrix containing the distances between all pairs of cells is computed. The distance is determined using one of six possible distance metrics: Euclidean, maximum, Manhattan, Canberra, binary and Minkowski. The distance metric is selected by the user. With the distance matrix, Ginkgo then uses agglomerative clustering to group the cells into clusters representing different cell subpopulations [2]. In agglomerative clustering, each sample starts as its own cluster. Each cluster is then merged step-by-step with its nearest cluster until there is only one cluster left. How many merges are made is a hyper-parameter that determines the number of resulting clusters. An example of this is shown in Figure 1. In Figure 1, there are three clusters created in the first step of the clustering. These three clusters are cells #1-#3, cells #4-#5, and cell #6. Ginkgo offers four different agglomeration methods: single linkage, complete linkage, average linkage and Ward linkage [2].
Figure 1 is a small-scale example of Ginkgo. In a real application, there would be thousands of cells to cluster and the clusters would be determined by the types of mutations in the cells.
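The distance-matrix and agglomerative clustering steps described above can be sketched in Python with SciPy. This is only an illustration, not Ginkgo's own code: the six-cell copy-number matrix is made up to mirror the grouping in Figure 1, and `cluster_cells` is a hypothetical helper name.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Made-up copy-number profiles: 6 cells (rows) x 5 genomic bins (columns).
profiles = np.array([
    [2, 2, 3, 3, 2],  # cell 1
    [2, 2, 3, 3, 2],  # cell 2
    [2, 2, 3, 4, 2],  # cell 3
    [1, 1, 2, 2, 2],  # cell 4
    [1, 1, 2, 2, 1],  # cell 5
    [4, 4, 4, 2, 2],  # cell 6
], dtype=float)


def cluster_cells(profiles, metric="euclidean", method="average", n_clusters=3):
    """Hypothetical helper: pairwise distances, then agglomerative clustering."""
    # Condensed pairwise distance matrix between all cells.
    # SciPy calls the maximum metric "chebyshev" and Manhattan "cityblock".
    distances = pdist(profiles, metric=metric)
    # Build the full merge tree (every cell starts as its own cluster).
    tree = linkage(distances, method=method)
    # Cut the tree so that at most n_clusters clusters remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


labels = cluster_cells(profiles)
print(labels)  # one cluster label per cell
```

The metric, linkage method and number of clusters are all arguments here, which mirrors the hyper-parameters a Ginkgo user has to choose.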
Limitations
Ginkgo has been successful at clustering single cell data; however, it does have limitations. Ginkgo’s initial data processing step removes a large amount of data due to outliers and noise corrections. In one example, this resulted in 80% of the sample cells being removed [3]. Ginkgo also requires the user to determine many hyper-parameters such as bin size, distance metric and agglomerative clustering type [2]. Furthermore, once the clustering is performed, the number of clusters must be decided based on the tree [3]. While selecting hyper-parameters and tree depth gives flexibility, it can also be computationally expensive and time consuming. This takes time away from the goal of this research: analyzing the subpopulations of tumours.
Ginkgo is also limited by dataset size and computational run time. Ginkgo is offered as an online application and was not able to run when tested with a larger dataset of 891 cells [3]. Ginkgo takes a few hours to cluster just 90 cells [2]. Most single-cell datasets are much larger, containing up to thousands of cells, a size which Ginkgo would not be able to handle.
CopyMix: A probabilistic model for single cell sequencing
Overview
Given the limitations of Ginkgo, a solution to clustering single-cell data based on a probabilistic model, called CopyMix, was proposed. CopyMix requires less hyper-parameter tuning and is able to handle larger datasets [3]. CopyMix identifies mutations in cells and clusters cell sub-populations simultaneously using a mixture model with components corresponding to the different cell subpopulations [3].
The probabilistic framework of CopyMix is advantageous due to its transparency, uncertainty measurements and modeling flexibility. CopyMix models each cluster as a sequence of latent variables and assumes this sequence is governed by a Markov Chain [3]. In a Markov Chain, the next state depends only on the current state in the sequence. This is an effective way to model the human genome because the human genome is made up of a sequence of nucleic acids represented by the letters A, G, C, and T, as shown in Figure 2 [5]. For Markov Chains, each state corresponds to the current position in the sequence. As a result, each sample of the single cell data is modelled as a sequence where the next part depends on the current part.
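The Markov property described above can be illustrated with a short simulation. This is a minimal sketch under stated assumptions: the three copy-number states, the initial distribution and the "sticky" transition matrix are made-up values, and `simulate_copy_numbers` is a hypothetical name, not part of CopyMix.

```python
import numpy as np

# Made-up values: three possible copy-number states and a "sticky" transition
# matrix, so neighbouring bins usually share the same copy number.
copy_number_states = np.array([1, 2, 3])
initial = np.array([0.2, 0.6, 0.2])
transition = np.array([
    [0.90, 0.08, 0.02],
    [0.05, 0.90, 0.05],
    [0.02, 0.08, 0.90],
])


def simulate_copy_numbers(n_bins, initial, transition, rng):
    """Sample one copy-number sequence from a first-order Markov chain."""
    idx = np.empty(n_bins, dtype=int)
    idx[0] = rng.choice(len(initial), p=initial)
    for m in range(1, n_bins):
        # The next state depends only on the current state (row idx[m - 1]).
        idx[m] = rng.choice(len(initial), p=transition[idx[m - 1]])
    return copy_number_states[idx]


rng = np.random.default_rng(seed=0)
chain = simulate_copy_numbers(20, initial, transition, rng)
print(chain)
```

Because the same transition matrix is used at every position, the chain is homogeneous; each row of the matrix is the distribution of the next state given the current one.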
CopyMix
CopyMix’s graphical model contains observable variables denoted by $Y$ (the single cell data), the latent copy number states which form a Markov chain denoted by $C$ (the different mutations), and the latent cell-specific cluster assignment variables denoted by $Z$ [3].
In a single cell dataset, the genome of a single cell is read multiple times [1]. In CopyMix, each genome considered is partitioned into equally-sized segments called bins. The number of reads aligned to bin $m$ for cell $n$ is given by $Y_{nm}$. The cells $Y_1, \dots, Y_N$ are assumed to be independent. The cluster assignment of cell $n$ is given by the latent variable $Z_n$ and there are up to $K$ clusters. All $Z_n$ are independent, following a categorical distribution with $P(Z_n = k) = \pi_k$ and $\sum_{k=1}^{K} \pi_k = 1$. The distribution of $Y_n$ depends on the vector of the true hidden copy number states, defined as $C_k = (C_{k1}, \dots, C_{kM})$, with each $C_k$ assumed to follow a discrete-time homogeneous Markov chain. Finally, $\Psi = \{A, \theta, \pi\}$ is given as the set containing the unknown model parameters, where $A$ are the priors over $C$, $\theta_n$ is the cell-specific rate, and $\pi$ are the priors over $Z$ [3]. Figure 3, taken from the original paper, is a graphical model of CopyMix [3].
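To make the structure of this mixture model concrete, the sketch below samples synthetic data from a model of the same shape: a categorical cluster assignment per cell, one Markov chain of copy numbers per cluster, and read counts that depend on the cell's rate and its cluster's copy number. Several pieces are assumptions for illustration rather than statements from this article: the Poisson read-count distribution, a single transition matrix shared by all clusters, all numeric values, and the function name `sample_mixture`.

```python
import numpy as np


def sample_mixture(n_cells, n_bins, mixing, initial, transition, rates, rng):
    """Sample assignments, cluster copy-number chains and read counts."""
    n_clusters, n_states = len(mixing), len(initial)
    # One copy-number Markov chain per cluster (states 0..J-1 map to 1..J).
    copy_numbers = np.empty((n_clusters, n_bins), dtype=int)
    for k in range(n_clusters):
        state = rng.choice(n_states, p=initial)
        for m in range(n_bins):
            if m > 0:
                state = rng.choice(n_states, p=transition[state])
            copy_numbers[k, m] = state + 1
    # Independent categorical cluster assignment for every cell.
    assignments = rng.choice(n_clusters, size=n_cells, p=mixing)
    # Assumed emission: Poisson counts with mean = cell rate * copy number.
    means = rates[:, None] * copy_numbers[assignments]
    reads = rng.poisson(means)
    return assignments, copy_numbers, reads


rng = np.random.default_rng(seed=1)
mixing = np.array([0.5, 0.5])            # made-up mixing proportions
initial = np.array([0.2, 0.6, 0.2])      # made-up initial state distribution
transition = np.array([
    [0.90, 0.08, 0.02],
    [0.05, 0.90, 0.05],
    [0.02, 0.08, 0.90],
])
rates = np.full(8, 10.0)                 # made-up cell-specific rates
assignments, copy_numbers, reads = sample_mixture(
    n_cells=8, n_bins=12, mixing=mixing, initial=initial,
    transition=transition, rates=rates, rng=rng,
)
print(reads.shape)  # one row of read counts per cell, one column per bin
```

Inference runs in the opposite direction: given only the read counts, the cluster assignments, copy-number chains and parameters have to be estimated.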
Variational inference is used to infer the values of the latent copy number states $C$ and cluster assignments $Z$, in addition to the unknown model parameters $\Psi$. Variational inference is used because it estimates posterior distributions, protects against overfitting and allows for the selection of the optimal number of mixture components, which in this case is the number of cell subpopulations [3].
Building from Ginkgo to CopyMix
As a benchmark for CopyMix's performance, CopyMix was compared with Ginkgo on clustering a dataset of 891 cells. In the test, Ginkgo removed 80% of the given cells during data processing while CopyMix was able to cluster cells without removing any [3]. Additionally, Ginkgo does not suggest an optimal number of clusters while CopyMix does [2][3]. The V-measure score, a clustering performance measure, was used to compare the two models. A higher V-measure means better clustering performance. To achieve the same V-measure score as CopyMix, Ginkgo required a shallow cutting of its cluster tree [3]. When the threshold for cutting the hierarchical cluster tree was increased slightly, Ginkgo identified four clusters but only achieved a V-measure of 55% compared to CopyMix’s V-measure of 67% [3].
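For readers unfamiliar with the metric, V-measure is the harmonic mean of homogeneity (each cluster contains only one true class) and completeness (each true class ends up in one cluster). The following is a minimal from-scratch sketch using only the standard library; the function names and label lists are made up for illustration and are not taken from either paper.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a label list."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """Entropy of `target` labels given the `given` labels."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(true_labels, predicted):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_true, h_pred = entropy(true_labels), entropy(predicted)
    homogeneity = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true_labels, predicted) / h_true
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, true_labels) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Made-up labels: cluster names do not matter, only the grouping does.
truth = [0, 0, 1, 1]
perfect = v_measure(truth, [1, 1, 0, 0])   # same grouping, renamed clusters
merged = v_measure(truth, [0, 0, 0, 0])    # everything in a single cluster
print(perfect, merged)
```

The score ranges from 0 to 1, so the 55% and 67% quoted above correspond to 0.55 and 0.67 on this scale.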
The probabilistic model CopyMix outperformed the hierarchical clustering model Ginkgo in model performance and run time; however, CopyMix was built in response to the limitations of the Ginkgo model. Ginkgo was one of the first easily accessible single-cell clustering methods [2]. After it was published in 2015, its limitations in computational efficiency, performance and hyper-parameter selection came to light [3]. From these limitations came a proposed solution: a probabilistic model that would increase computational efficiency and performance and limit hyper-parameter selection.
Annotated Bibliography
- [1] Bai, Fan; Li, Ruoyan; Xue, Ruidong (2016). "Single cell sequencing: technique, application, and future development". Science Bulletin – via Elsevier Science Direct.
- [2] Garvin, Taylor; Aboukhalil, Robert; Kendall, Jude; Baslan, Timour; Atwal, Gurinder; Hicks, James; Wigler, Michael; Schatz, Michael (2015). "Interactive analysis and assessment of single-cell copy-number variations". Nature Methods.
- [3] Safinianaini, Negar; de Souza, Camila; Roth, Andrew; Koptagel, Hazel; Toosi, Hosein; Lagergren, Jens (2020). "CopyMix: Mixture Model Based Single-Cell Clustering and Copy Number Profiling using Variational Inference". BioRxiv.
- [4] "How Cancer Starts, Grows and Spreads". Canadian Cancer Society.
- [5] "ACGT". National Human Genome Research Institute.