Recent developments of copula-based models to handle missing data of mixed-type in multivariate analysis
Abstract
In this thesis, we propose innovative imputation models to handle
missing data of mixed type. Our imputation models can handle 1)
multilevel data sets through random effects; 2) heterogeneity in
a population by specifying infinite mixture models; and 3) a
large number of variables using graphical lasso methods. Two
clinical data sets, a randomised controlled trial of acute stroke
care patients and a survey of menstrual disorder among teenagers,
are used for the real data application examples, although we
believe that the proposed methods can also be applied to other
data sets with similar structures.
In Chapter 2, we propose a copula-based method to handle missing
values in multivariate data of mixed type in multilevel data
sets. Building upon the extended rank likelihood approach
combined with a multinomial probit model formulation, our model
is a latent variable model that captures the relationships among
variables of different types and accounts for the clustering
structure. We evaluate the proposed method through simulations
on both artificial data and the acute stroke data set, comparing
it with several conventional methods of handling missing data.
We conclude that our proposed copula-based imputation model for
mixed-type variables achieves good imputation accuracy and
recovery of parameters in some models of interest, and that
adding random effects enhances performance when the clustering
effect is strong.
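As a rough illustration of the latent-variable idea behind copula-based imputation, the sketch below maps each column to normal scores through its ranks, estimates a latent correlation matrix from complete rows, and draws each missing latent value from its conditional normal distribution before mapping it back to an observed value. This is a simplified plug-in version and not the thesis's method: it omits the extended rank likelihood sampler, the multinomial probit formulation for nominal variables, and the random effects. The function names `normal_scores` and `copula_impute` are hypothetical, and the sketch assumes every row has at least one observed value.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for cdf and inverse cdf


def normal_scores(col):
    """Map observed values to normal scores via their ranks; NaNs stay NaN."""
    z = np.full(col.shape, np.nan)
    obs = ~np.isnan(col)
    vals = col[obs]
    # Scaled empirical cdf in (0, 1); tied values share the same score.
    u = np.array([(vals <= v).sum() for v in vals]) / (vals.size + 1)
    z[obs] = [_STD.inv_cdf(p) for p in u]
    return z


def copula_impute(X, rng):
    """Single imputation under a Gaussian copula (plug-in sketch)."""
    X = np.asarray(X, dtype=float)
    Z = np.column_stack([normal_scores(X[:, j]) for j in range(X.shape[1])])
    complete = ~np.isnan(Z).any(axis=1)
    # Latent correlation estimated from complete rows only (an assumption).
    R = np.corrcoef(Z[complete], rowvar=False)
    out = X.copy()
    for i in np.where(~complete)[0]:
        m = np.isnan(Z[i])  # missing entries in this row
        o = ~m              # observed entries in this row
        # Conditional normal of the missing latent values given the observed.
        B = R[np.ix_(m, o)] @ np.linalg.inv(R[np.ix_(o, o)])
        mean = B @ Z[i, o]
        cov = R[np.ix_(m, m)] - B @ R[np.ix_(o, m)]
        z_draw = rng.multivariate_normal(mean, cov)
        for j, zj in zip(np.where(m)[0], z_draw):
            # Back-transform through the empirical quantile function, so the
            # imputed value is always one of the observed values of column j.
            vals = np.sort(X[~np.isnan(X[:, j]), j])
            k = int(np.ceil(_STD.cdf(zj) * vals.size)) - 1
            out[i, j] = vals[min(max(k, 0), vals.size - 1)]
    return out
```

Because the back-transformation uses each column's empirical quantiles, binary and ordinal columns keep their original categories after imputation, which is the main appeal of working on the latent scale.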
In Chapter 3, we consider an infinite mixture of elliptical
copulas induced by a Dirichlet process mixture to build a
flexible copula function as the imputation model. A slice
sampling algorithm is used in conjunction with a prior parallel
tempering algorithm to sample from the infinite-dimensional
parameter space and to overcome the mixing issue when sampling
from a multimodal distribution. Using simulations, we demonstrate
that the infinite mixture copula model provides a better overall
fit than its single-component counterparts, and performs
better at capturing tail dependence features of the data. The
application of this model is also demonstrated using the acute
stroke data set.
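For readers unfamiliar with Dirichlet process mixtures, the mixture weights can be pictured through the stick-breaking construction. The sketch below draws a truncated set of weights. The truncation level and the function name `stick_breaking_weights` are assumptions made for illustration only; the slice sampler described above is used precisely so that no fixed truncation is needed.

```python
import numpy as np


def stick_breaking_weights(alpha, n_components, rng):
    """Draw the first n_components stick-breaking weights of a DP(alpha)."""
    # Each v_k is the fraction broken off the stick that is still left.
    v = rng.beta(1.0, alpha, size=n_components)
    # Length of stick remaining before the k-th break: prod_{l<k} (1 - v_l).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining
```

A small concentration parameter `alpha` puts most of the mass on a few components, while a larger value spreads it over many, which is how the model lets the data decide how many copula components are effectively used.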
In Chapter 4, we propose a Gaussian copula model with a graphical
lasso prior to analyse the conditional associations among more
than 100 questions in a study of menstrual disorder among
teenagers. Our data come from a large population-based study of
menstrual disorder in Australian teenagers, conducted in 2005 and
2016. We also compare cohort differences in menstruation
over the 11-year interval and use the model to predict girls with
a higher risk of developing endometriosis. The model is based on
the model proposed in Chapter 2, but with a graphical lasso prior
to shrink the elements in the precision matrix of the Gaussian
distribution to encourage a sparse graphical structure. The level
of shrinkage adapts to the strength of the conditional
associations among questions in the survey. We find that
menstrual disturbance is reported more prominently in 2016 than
in 2005, and that the questions in the questionnaire form several
clusters with strong associations.
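The link between a sparse precision matrix and conditional associations can be made concrete with a small example. In a Gaussian model, the partial correlation between variables i and j given all others is minus the (i, j) precision entry divided by the square root of the product of the two diagonal entries, so a zero in the precision matrix means no conditional association. The sketch below only illustrates this identity; it does not implement the graphical lasso prior, and the function name `partial_correlations` is hypothetical.

```python
import numpy as np


def partial_correlations(precision):
    """Convert a precision matrix into a matrix of partial correlations."""
    d = np.sqrt(np.diag(precision))
    pc = -precision / np.outer(d, d)
    np.fill_diagonal(pc, 1.0)
    return pc


# A chain of three variables: 0 and 2 are linked only through 1.
chain_precision = np.array([[2.0, -1.0, 0.0],
                            [-1.0, 2.0, -1.0],
                            [0.0, -1.0, 2.0]])
chain_partial = partial_correlations(chain_precision)
chain_covariance = np.linalg.inv(chain_precision)
```

In this chain, variables 0 and 2 have zero partial correlation even though their marginal covariance is nonzero, which is the kind of structure a sparsity-inducing prior on the precision matrix is meant to reveal.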