This page introduces the problem of missingness, the three main categories of missingness, and common methods for treating missing data in inference and machine learning.
Principal Author: Nam Hee Gordon Kim
Sensors are often noisy and unreliable. As a result, we observe missingness in data: one or more entries of data collected from the same population are missing due to some hindrance or loss. On this page, we introduce the three fundamental categories of assumptions regarding missingness: MCAR (missing completely at random), MAR (missing at random), and NMAR (not missing at random). We formalize these definitions and introduce methods, where they exist, for treating missing data under each assumption. In particular, we introduce imputation, which generates a "best guess" for each missing entry by reasoning over the observed data, and we show its limitations.
Data collection mechanisms are inherently noisy, so it is difficult to guarantee that data are complete. For example, a person taking a survey may omit a response, either by mistake or intentionally, and surveyors usually have little control over whether they do so. As another example, a DNA sequencer may aim to sequence the whole genome of an organism, but only a subset of the genome may end up being sequenced. In these cases, probabilistic models and methods must perform inference in the presence of missingness and do as well as they can with the data that were observed.
Depending on the assumption about how the data came to be missing, the same treatment of missing data can produce very different results. For example, when data are missing purely by chance (missing completely at random; MCAR), simply removing the examples and/or features with missing entries may suffice [1]. However, if an entry is missing because of the very value that would have been observed (not missing at random; NMAR), e.g. a light sensor overloads and returns an error, then the same approach will amplify the bias and decrease the reliability of the model [1]. These are the two extreme cases. In a more moderate third case, entries for one variable may be missing due to the values of variables other than itself (missing at random; MAR), e.g. living in a certain neighborhood makes a person less likely to reveal their political views on survey forms. We first frame each of these three scenarios as a random process represented by a simple graphical model. We then introduce imputation as a technique for inference in the MCAR and MAR cases, and demonstrate some examples in which the MAR assumption holds.
The treatment of missing data usually falls within the scope of statistics and machine learning. We therefore primarily use conventional machine learning notation to introduce the problem.
We now define variables and entries. The notation follows the conventions in Kevin Murphy's textbook [2].
An example dataset with missing entries may look like this:
As the example above suggests, some entries in and may be marked as missing, as with the entries marked by . A missing value is typically the result of omission or noise in the data-gathering process. For the purposes of this article, we assume that each missing entry has a ground-truth value that was not observed.
It is helpful to distinguish between missing values and hidden variables. Within the scope of this article, a missing value is a particular feature of an example (or observation) that is unobserved despite having a ground-truth value. Hidden variables are features that were not collected as part of the data at all. Hidden variables can be thought of as a special case: features whose values are missing for every example.
For ease of understanding, we now define a missingness process, which models how the missing entries come into being. The notation follows the convention in Mohan et al. [1].
We treat both and as binary random variables. The exact properties of and depend on the particular missingness assumption pertaining to the data. It is helpful to frame the missing entries as a result of the missingness indicator's role as a masking function, i.e.
Note that the values of the binary random variables and are determined separately for each example . In other words, and define Bernoulli random processes whose time steps correspond to the indices of the observations. Although these random processes are not necessarily IID (independent and identically distributed), we assume IID for simplicity.
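The masking view of missingness can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `x_true`, `r`, and `x_obs` are hypothetical, the ages are made up, and the convention that an indicator value of 1 means "missing" is assumed here rather than taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth values for one feature (hypothetical ages).
x_true = np.array([26.0, 41.0, 29.0, 17.0])

# Missingness indicator, drawn separately for each example.
# Assumed convention: r[i] == 1 means entry i is missing.
# Here every r[i] is an IID Bernoulli(0.5) draw.
r = rng.binomial(n=1, p=0.5, size=x_true.shape)

# Masking: the observed data keeps x_true where r == 0
# and holds NaN (our marker for "missing") where r == 1.
x_obs = np.where(r == 1, np.nan, x_true)

print("indicator:", r)
print("observed: ", x_obs)
```

The ground-truth array still exists after masking; the analyst only ever sees `x_obs`, which is what distinguishes a missing value from a value that never existed.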
This process can be translated into graphical models, with the following notation (see Figure 1):
Following the notation in the Preliminaries section, we define three separate assumptions regarding missingness. The names for the assumptions follow Kevin Murphy's textbook [2]. These assumptions were first proposed by Rubin (1976) [7].
Table 1. An example dataset in which some "Age" entries are missing (marked "?").
Index  Name  Age  Occupation
1  John Doe  26  Construction worker
2  Dohyun Nam  ?  Doctor
3  Mostafa Sharif  ?  Graduate student
4  Jane Lee  17  High school student
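To make the three assumptions concrete, the following sketch simulates each missingness mechanism on synthetic data. Everything in it is an illustrative assumption rather than part of the formal definitions: the variable names (`z`, `x`, `r_mcar`, `r_mar`, `r_nmar`), the logistic form of the missingness probability, and the coefficients. Here `z` is always observed, `x` is the variable that may go missing, and the true mean of `x` is 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two correlated features: z is always observed, x may go missing.
z = rng.normal(size=n)
x = 0.8 * z + 0.6 * rng.normal(size=n)  # mean 0, variance 1


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


# MCAR: missingness is independent of both x and z.
r_mcar = rng.random(n) < 0.3
# MAR: missingness of x depends only on the observed variable z.
r_mar = rng.random(n) < sigmoid(2.0 * z)
# NMAR: missingness of x depends on the value of x itself.
r_nmar = rng.random(n) < sigmoid(2.0 * x)

for name, r in [("MCAR", r_mcar), ("MAR", r_mar), ("NMAR", r_nmar)]:
    print(f"{name}: fraction missing = {r.mean():.2f}, "
          f"mean of observed x = {x[~r].mean():+.3f}")
```

In this setup the mean of the observed `x` stays close to the true mean only under MCAR. Under MAR the observed mean is also shifted, but the shift is explained by `z`, which is fully observed, so it can in principle be corrected. Under NMAR the shift depends on the unobserved values themselves.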
Depending on the missingness assumption and whether the data is complete during learning or prediction, one may or may not recover reasonable performance in the presence of missing data. In this section, we discuss imputation as a general technique to address missing entries in data and discuss some common imputation methods. We motivate a case for imputation using listwise deletion as a baseline method.
Listwise deletion refers to the deletion of examples or features that contain missing entries. For example, in Table 1, since the "Age" entries are missing for examples 2 and 3, we can exclude those examples from the analysis. This is a reasonable treatment under the MCAR assumption, provided we have a sufficiently large number of IID examples, because the deletions do not change the overall distribution of the data. However, MCAR is often implausible. If MCAR does not hold, then excluding examples and/or features introduces bias in inference and learning, because the remaining data deviate from the true distribution. Hence, we work with the MAR assumption and introduce imputation as an approach to treating missing data under it.
Imputation refers to replacing missing entries with estimates of their values, typically the most likely values given the observed data. Imputation preserves the dimensionality of the data while taking its missingness into account during inference. The imputation strategy must match the missingness assumption: a strategy based on an incorrect assumption can increase the bias of the inference model and degrade its performance.
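As a minimal sketch, listwise deletion on the data of Table 1 might look as follows. The record layout, the use of `None` as the missing marker, and the function names `listwise_delete` and `drop_incomplete_features` are hypothetical choices made for this illustration.

```python
# Table 1 as a list of records; None marks a missing entry.
records = [
    {"name": "John Doe", "age": 26, "occupation": "Construction worker"},
    {"name": "Dohyun Nam", "age": None, "occupation": "Doctor"},
    {"name": "Mostafa Sharif", "age": None, "occupation": "Graduate student"},
    {"name": "Jane Lee", "age": 17, "occupation": "High school student"},
]


def listwise_delete(rows):
    """Drop every example (row) that has at least one missing entry."""
    return [row for row in rows if all(v is not None for v in row.values())]


def drop_incomplete_features(rows):
    """Drop every feature (column) that is missing for at least one example."""
    complete = [k for k in rows[0] if all(row[k] is not None for row in rows)]
    return [{k: row[k] for k in complete} for row in rows]


print([row["name"] for row in listwise_delete(records)])
# prints ['John Doe', 'Jane Lee']
print(list(drop_incomplete_features(records)[0].keys()))
# prints ['name', 'occupation']
```

Either variant discards observed information: deleting examples halves this dataset, and deleting features removes "Age" entirely.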
In Table 2 below, we provide some example imputation methods corresponding to each missingness assumption.
Table 2. Example imputation methods for each missingness assumption.
Assumption  Method  Description
MCAR  Mean Substitution  Replace each missing entry with the mean value of the corresponding feature.
MCAR  Hot Deck  Replace each missing entry based on similar examples.
MAR  Expectation Maximization  Start from an initial guess for the missing entries and the underlying parameters, e.g. . Iteratively optimize the expected log-likelihood.
NMAR  Collaborative Filtering  Extract principal components from the available entries and reconstruct the original.
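The first two rows of Table 2 can be sketched with NumPy as follows. The toy matrix, the function names `mean_substitution` and `hot_deck`, and the choice of Euclidean distance over the observed features as the notion of "similar" are all assumptions made for illustration; hot deck imputation has many variants.

```python
import numpy as np

# Toy data: column 0 is fully observed, column 1 has missing entries (NaN).
X = np.array([
    [1.0, 10.0],
    [2.5, np.nan],
    [3.0, 30.0],
    [8.0, np.nan],
    [9.0, 90.0],
])


def mean_substitution(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X_imp = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X_imp[rows, cols] = col_means[cols]
    return X_imp


def hot_deck(X):
    """Replace each NaN with the value from the most similar complete row.

    Similarity is Euclidean distance over the features that the
    incomplete row has observed (an illustrative choice).
    """
    X_imp = X.copy()
    incomplete = np.isnan(X).any(axis=1)
    donors = X[~incomplete]
    for i in np.where(incomplete)[0]:
        obs = ~np.isnan(X[i])
        dists = np.linalg.norm(donors[:, obs] - X[i, obs], axis=1)
        X_imp[i, ~obs] = donors[np.argmin(dists), ~obs]
    return X_imp


print(mean_substitution(X))
print(hot_deck(X))
```

Mean substitution gives both incomplete rows the same value, regardless of their first feature, while the hot deck sketch copies from the nearest complete row, so the two imputed values differ.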
Consider some datapoint . Classification refers to the task of predicting the discrete response based on the observation of the datapoint. With missing data, it is useful to limit our scope to inference with generative models, since there is no principled solution to this problem in discriminative models [2]. Generative models postulate the joint distribution based on available and missing entries in order to answer probability queries such as . In his textbook, Murphy shows how a generative classifier may mitigate the missingness in data, depending on whether the data is complete at training time [2]. The discussion is summarized below.
Here, incomplete data refers to data with missing entries. For example, Table 1 from the Examples section is incomplete because some of the "Age" entries are missing. Consider the case where the features in are complete at training time but incomplete at test time, i.e. some features of a test point are missing. When these test-time features are MAR, we can handle the missing entries via marginalization. Following the notation in the Preliminaries section, consider computing the following probability:
As apparent, feature is missing for the test datapoint. Then, assuming MAR, the best we can do is marginalize out and compute the following probability instead:
Now, note:
Which corresponds to this equality:
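As a sketch of this marginalization argument, in the spirit of the treatment in Murphy [2], assume a test point with $D$ features whose first feature $x_1$ is the missing one, class labels $c$, and model parameters $\theta$. These symbols, and the naive Bayes factorization used in the second step, are assumptions made for this illustration.

```latex
% Assumed setup: D features, x_1 missing at test time, parameters \theta.
% Step 1: marginalize the missing feature out of the class-conditional density.
p(y = c \mid x_{2:D}, \theta)
  \;\propto\; p(y = c \mid \theta)\, p(x_{2:D} \mid y = c, \theta)
  \;=\; p(y = c \mid \theta) \sum_{x_1} p(x_1, x_{2:D} \mid y = c, \theta)

% Step 2 (extra assumption: naive Bayes, features independent given the class).
% The sum over the missing feature is then a sum over a normalized distribution:
\sum_{x_1} p(x_1 \mid \theta_{1c}) = 1

% so the missing feature simply drops out of the product:
p(y = c \mid x_{2:D}, \theta)
  \;\propto\; p(y = c \mid \theta) \prod_{j=2}^{D} p(x_j \mid \theta_{jc})
```

Under the naive Bayes assumption, then, handling a feature that is missing at test time amounts to skipping its factor. For a continuous feature the sum becomes an integral, and without the factorization the sum has to be carried out over the joint class-conditional density.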
When data is missing at training time, marginalization alone cannot treat the missing data, because the joint conditional probability in (1) above cannot be estimated directly from the training data. Computing the MLE or MAP estimate is then no longer a simple optimization problem [2].
Handling missing data carefully is essential for reliable inference. To avoid introducing further sources of error, it is important to reason about the process by which the data is generated and to deduce the missingness pattern in the data. Identifying the correct missingness assumption and using an appropriate technique for treating missing data helps keep the analysis unbiased despite missing entries.
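One common way to approach this harder estimation problem is expectation maximization, listed in Table 2. The following is a minimal sketch, assuming the data are jointly Gaussian and missing at random. The Gaussian model, the function name `em_gaussian_impute`, the fixed iteration count, and the synthetic data are all illustrative assumptions, not a method prescribed by the sources cited here.

```python
import numpy as np


def em_gaussian_impute(X, n_iter=50):
    """EM for a multivariate Gaussian with NaN-marked missing entries.

    Returns the estimated mean, the estimated covariance, and the data with
    each missing entry replaced by its conditional expectation.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    missing = np.isnan(X)

    # Initial guess: observed column means and a diagonal covariance.
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    X_hat = np.where(missing, mu, X)

    for _ in range(n_iter):
        # E-step: expected sufficient statistics under current parameters.
        extra_cov = np.zeros((d, d))
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if o.any():
                # Regression of the missing features on the observed ones.
                coef = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
                X_hat[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
                cond_cov = sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]
            else:
                X_hat[i, m] = mu[m]
                cond_cov = sigma[np.ix_(m, m)]
            extra_cov[np.ix_(m, m)] += cond_cov

        # M-step: maximize the expected complete-data log-likelihood.
        mu = X_hat.mean(axis=0)
        centered = X_hat - mu
        sigma = (centered.T @ centered + extra_cov) / n

    return mu, sigma, X_hat


# Synthetic MAR example: z is always observed, x is more likely to be
# missing when z is large. The true mean of x is 0.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
x = 0.8 * z + 0.6 * rng.normal(size=n)
X = np.column_stack([z, x])
X[rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * z)), 1] = np.nan

mu, sigma, X_hat = em_gaussian_impute(X)
print("mean of observed x only:", np.nanmean(X[:, 1]))
print("EM estimate of E[x]:    ", mu[1])
```

In this example the mean of the observed `x` values alone is biased downward, because large values of `x` tend to go missing together with large `z`. The EM estimate uses the fully observed `z` to correct for this, which is possible because the missingness is MAR. The same procedure gives no such guarantee under NMAR.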
