Relevant Genes in Microarray Data
Joaquim F. Pinto da Costa1,
Hugo Alonso1, Luís A.C. Roque2 and
1Departamento de Matemática
Aplicada, Universidade do Porto, Portugal
2Departamento de Matemática, ISEP, Universidade do Porto,
3Departamento de Matemática, Universidade de Évora, Portugal
In this work we consider the problem of selecting informative genes
from the thousands of genes that are usually measured in microarray
experiments. Firstly, the selection is done by taking into account the
information about the class membership (disease) of each individual; we try
to find which of the measured genes have relevant information to
discriminate between the different classes by using Decision Trees (Breiman et al., 1984).
Surprisingly, in the five datasets analysed, only a few of the thousands of
genes that are usually measured were selected; it seems that most of the
genes are not useful for discriminating between the diseases. Secondly, we
approach the problem by finding the Principal Components of the most
expressed genes. Two variants are used: the usual PCA using the Pearson
correlation matrix and a “weighted” version which is introduced in this
work. This weighted PCA uses an adaptation of a new rank
correlation coefficient that gives more importance to the higher ranks and
which was introduced by Pinto da Costa & Soares (2005).
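
To make the idea of a rank correlation that emphasises the top ranks concrete, the following is a minimal sketch in Python. It assumes the coefficient takes the form r_W = 1 - 6 * sum((R_i - Q_i)^2 * ((n - R_i + 1) + (n - Q_i + 1))) / (n^4 + n^3 - n^2 - n), which is our reading of the measure of Pinto da Costa & Soares; the adaptation used for the weighted PCA is not specified in this abstract and is not reproduced here. The function names, the convention that rank 1 goes to the largest value, and the tie-breaking rule are illustrative assumptions.

```python
def ranks_descending(values):
    """Assign rank 1 to the largest value (assumed convention).

    Ties are broken by position, which is a simplification made for this sketch.
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def weighted_rank_correlation(x, y):
    """Weighted rank correlation between two equally long sequences.

    Each squared rank difference is weighted by (n - R_i + 1) + (n - Q_i + 1),
    so disagreements near rank 1 are penalised more than disagreements
    near rank n. The formula is assumed, as stated in the text above.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    r = ranks_descending(x)
    q = ranks_descending(y)
    total = sum(
        (ri - qi) ** 2 * ((n - ri + 1) + (n - qi + 1))
        for ri, qi in zip(r, q)
    )
    return 1 - 6 * total / (n ** 4 + n ** 3 - n ** 2 - n)


if __name__ == "__main__":
    reference = [4.0, 3.0, 2.0, 1.0]       # ranks 1, 2, 3, 4
    top_swapped = [3.0, 4.0, 2.0, 1.0]     # the two top ranks exchanged
    bottom_swapped = [4.0, 3.0, 1.0, 2.0]  # the two bottom ranks exchanged
    print(weighted_rank_correlation(reference, top_swapped))
    print(weighted_rank_correlation(reference, bottom_swapped))
```

Under these assumptions, exchanging the two top-ranked values lowers the coefficient more than exchanging the two bottom-ranked ones, which is the behaviour that distinguishes this measure from Spearman's coefficient.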
Keywords: Microarrays, Decision Trees, PCA, Weighted Rank Correlation
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J., 1984.
Classification and Regression Trees. Wadsworth, Belmont.
Pinto da Costa, J.F. and Soares, C., 2005. A weighted rank measure of
correlation. Australian & New Zealand Journal of Statistics (to appear).