Selection bias in plots of microarray or other data that have been sampled from a high-dimensional space
Date
2005
Authors
Maindonald, John
Burden, Conrad
Publisher
Australian Mathematical Society
Abstract
For data that have many more features than observations, finding a low-dimensional representation that accurately reflects known prior groupings is non-trivial. Microarray gene expression data, used to create a "signature" or discrimination rule that distinguishes cancer tissues that are classified according to type of cancer, is an important special case. The optimal number of features is suitably determined using cross-validation, in which each of several parts of the data becomes in turn the test set, with the remaining data used for training. At each such division or "fold" of the data into a training and test set, both the selection of features and the derivation of the discriminant rule must be repeated. Use of the complete data for prior selection of features can lead to a grossly optimistic assessment of predictive accuracy and, in scatter-plot graphs that show discriminant function scores, to a spurious or exaggerated separation between groups. At each division or fold, a second versus first discriminant axis plot of test scores can be drawn. This paper presents a method for bringing these different plots, which have different choices of features and relate to different coordinate systems, into a single plot in which the configuration of points fairly reflects the accuracy of the discriminant procedure. The methodology is applicable, in principle, to use of any discriminant analysis methodology, or of ordination or multidimensional scaling, for obtaining a low-dimensional graphical representation of data.
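The selection bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' method: it assumes pure-noise data (no real group difference), ranks features by absolute difference in class means, and uses a nearest-centroid rule as a stand-in for a discriminant analysis. The function names (`top_features`, `predict`, `cv_accuracy`) and all sizes are hypothetical choices made for illustration. It contrasts selecting features once on the complete data with repeating the selection inside each fold.

```python
import random

def top_features(X, y, rows, k):
    """Indices of the k features with the largest absolute difference
    in class means, computed from the observations in `rows` only."""
    g0 = [i for i in rows if y[i] == 0]
    g1 = [i for i in rows if y[i] == 1]

    def gap(j):
        m0 = sum(X[i][j] for i in g0) / len(g0)
        m1 = sum(X[i][j] for i in g1) / len(g1)
        return abs(m0 - m1)

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def predict(X, y, train, feats, i):
    """Nearest-centroid label for observation i (illustrative stand-in
    for a discriminant rule), with centroids from the training rows."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        grp = [r for r in train if y[r] == label]
        dist = 0.0
        for j in feats:
            centroid = sum(X[r][j] for r in grp) / len(grp)
            dist += (X[i][j] - centroid) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cv_accuracy(X, y, k, n_folds, select_inside_fold):
    """Cross-validated accuracy. If select_inside_fold is False, features
    are chosen once from ALL data, which leaks test information."""
    n = len(X)
    all_rows = list(range(n))
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    global_feats = top_features(X, y, all_rows, k)
    correct = 0
    for test in folds:
        train = [i for i in all_rows if i not in test]
        if select_inside_fold:
            feats = top_features(X, y, train, k)  # repeated at every fold
        else:
            feats = global_feats                  # prior selection on complete data
        correct += sum(predict(X, y, train, feats, i) == y[i] for i in test)
    return correct / n

# Assumed sizes: 40 observations, 2000 noise features, 10 selected features.
rng = random.Random(0)
n_obs, n_feat = 40, 2000
X = [[rng.gauss(0.0, 1.0) for _ in range(n_feat)] for _ in range(n_obs)]
y = [i % 2 for i in range(n_obs)]  # two balanced groups with no true difference

biased = cv_accuracy(X, y, k=10, n_folds=5, select_inside_fold=False)
honest = cv_accuracy(X, y, k=10, n_folds=5, select_inside_fold=True)
print("selection on complete data:", biased)
print("selection within each fold:", honest)
```

Because the features carry no signal, an honest estimate should sit near chance, while the estimate based on prior selection from the complete data tends to look far better than it should. The sketch covers only the accuracy estimate; it does not attempt the paper's procedure for combining per-fold discriminant plots into a single plot.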
Source
ANZIAM Journal
Type
Journal article
Restricted until
2037-12-31