Hotel Estoril Eden, Monte Estoril,
5-8 October 2005




Estimating Gene Expression Missing Data Using PLS Regression

Lígia P. Brás and José C. Menezes
Centre for Biological and Chemical Engineering, Department of Chemical Engineering, IST, Technical University of Lisbon, Portugal

We present a method for estimating missing values (MVs) in DNA microarray data. It is based on partial least squares (PLS) regression and iteratively reprocesses the imputed data, using correlations both between genes and between arrays. The method is called alternating PLS imputation (APLSimpute).

The imputation efficiency of APLSimpute was assessed under different conditions (type of data, fraction of data missing and missing structure) using the normalised root mean squared error and the squared Pearson correlation coefficient between actual and estimated values, and was compared with that of other imputation methods. Specifically, we considered the cluster-based weighted K-nearest neighbours imputation (KNNimpute) of Troyanskaya et al. (2001), the local least squares imputation (LLSimpute) proposed by Kim et al. (2005), the partial least squares imputation (PLSimpute) method of Nguyen et al. (2004), and the Bayesian principal component analysis (BPCA) of Oba et al. (2003). The following publicly available DNA microarray datasets were used: a time-series study of cell cycle regulated genes in yeast (Spellman et al., 1998); a mixed dataset from a study of gene expression regulated by the calcineurin/Crz1p signalling pathway in yeast (Yoshimoto et al., 2002); and a non-time series dataset from a human cancer cell lines study (Ross et al., 2000).
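As an illustration of the two evaluation measures named above, the following is a minimal sketch and not the authors' code. The abstract does not state how the root mean squared error is normalised; the sketch assumes division by the standard deviation of the actual values, which is one common choice. The function names nrmse and squared_pearson are hypothetical.

```python
import math


def nrmse(actual, estimated):
    """Root mean squared error divided by the standard deviation of the
    actual values (ASSUMED normalisation; the abstract does not specify it)."""
    n = len(actual)
    mse = sum((a - e) ** 2 for a, e in zip(actual, estimated)) / n
    mean_a = sum(actual) / n
    var_a = sum((a - mean_a) ** 2 for a in actual) / n
    return math.sqrt(mse / var_a)


def squared_pearson(actual, estimated):
    """Squared Pearson correlation coefficient between actual and
    estimated values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_e = sum(estimated) / n
    cov = sum((a - mean_a) * (e - mean_e) for a, e in zip(actual, estimated))
    ss_a = sum((a - mean_a) ** 2 for a in actual)
    ss_e = sum((e - mean_e) ** 2 for e in estimated)
    return (cov * cov) / (ss_a * ss_e)


# Entries that were artificially removed, and the values an imputation
# method returned for them (made-up numbers, for illustration only).
actual = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.2, 3.8]
print(nrmse(actual, estimated), squared_pearson(actual, estimated))
```

Under this normalisation, replacing every missing entry with the mean of the actual values gives a normalised error of exactly 1, so values well below 1 indicate that a method does better than that trivial baseline.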

Across the different proportions of missing data, LLSimpute and BPCA showed the best performance on time-series and mixed data. However, PLS-based methods are preferable when imputing non-time series data. Combining gene-based and array-based correlations in the estimation process, as APLSimpute does, enhances prediction ability for non-time series data with high missing rates. The results also suggest that, for time course data, the PLS-based methods should be further improved and optimised in terms of variable selection, because they are not as capable as LLS or BPCA imputation of taking advantage of the local similarity structures present in such data.

Keywords: Gene expression data, DNA microarray data, missing value estimation

Kim, H., Golub, G.H., Park, H. (2005). Bioinformatics, 21, 187-198.
Nguyen, D.V., Wang, N., Carroll, R.J. (2004). J. Data Science, 2, 347-370.
Oba, S., Sato, M.A., Takemasa, I., Monden, M., Matsubara, K., Ishii, S. (2003). Bioinformatics, 19, 2088-2096.
Ross, D.T., et al. (2000). Nature Genet., 24, 227-235.
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B. (1998). Mol. Biol. Cell, 9(12), 3273-3297.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B. (2001). Bioinformatics, 17, 520-525.
Yoshimoto, H., Saltsman, K., Gasch, A.P., Li, H.X., Ogawa, N., Botstein, D., Brown, P.O., Cyert, M.S. (2002). J. Biol. Chem., 277(34), 31079-31088.