Hotel Estoril Eden, Monte Estoril,
5-8 October 2005




An Improved Algorithm for Segmenting ArrayCGH Data

John Marioni (1,2), Natalie P. Thorne (1,2), Jessica C. Pole (3), Paul A.W. Edwards (3) and Simon Tavaré (1,2)
(1) Hutchison/MRC Research Centre, Department of Oncology, University of Cambridge, UK
(2) Department of Applied Mathematics and Theoretical Physics, University of Cambridge, UK
(3) Hutchison/MRC Research Centre, Departments of Oncology and Pathology, University of Cambridge, UK

The statistical analysis of arrayCGH data can be divided into a number of different steps: data capture, normalisation, segmenting the data to find patterns of copy number alterations and post-segmentation analysis. In this presentation, we will concentrate on the third of these problems: the segmentation of the genome into regions of chromosomal gain and loss.

In particular, we will describe how the segmentation can be significantly improved by incorporating biological information such as clone quality, clone length, the distance between clones or (in a tiling path setting) the overlap between clones. The model we propose extends the Hidden Markov Model approach of Fridlyand et al. [1] by allowing the underlying Markov chain to be heterogeneous rather than homogeneous, so that the transition probabilities can differ from one pair of adjacent clones to the next. This enables large amounts of additional information to be included in the model with only a small increase in the number of parameters. Additionally, our model allows the analysis to be carried out on a whole-genome basis rather than chromosome by chromosome, as is common at present.
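
As a rough illustration of what a heterogeneous chain means in practice, the sketch below lets the transition matrix between two adjacent clones depend on the distance separating them, and then decodes the most likely sequence of copy-number states with the Viterbi algorithm. It is not the model presented here: the exponential "stay" probability, the Gaussian emissions with a common standard deviation, and all names (transition_matrix, viterbi, base, rate) are assumptions made purely for the example.

```python
import numpy as np


def transition_matrix(base, distance, rate):
    """Transition matrix between two adjacent clones.

    Assumed form (not taken from the abstract): with probability
    exp(-rate * distance) the chain keeps its current state; otherwise
    it moves according to the homogeneous matrix `base`.
    """
    stay = np.exp(-rate * distance)
    return stay * np.eye(base.shape[0]) + (1.0 - stay) * base


def viterbi(log_ratios, distances, means, sd, base, rate, init):
    """Most likely state path under the distance-dependent chain.

    distances[t - 1] is the distance between clone t - 1 and clone t.
    Entries of `base` and `init` are assumed strictly positive.
    """
    x = np.asarray(log_ratios, dtype=float)
    mu = np.asarray(means, dtype=float)
    n, k = len(x), len(mu)
    # Gaussian log density per clone and state; the normalising constant
    # is shared by all states (common sd), so it is dropped.
    log_emit = -0.5 * ((x[:, None] - mu[None, :]) / sd) ** 2
    score = np.log(init) + log_emit[0]
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        log_trans = np.log(transition_matrix(base, distances[t - 1], rate))
        cand = score[:, None] + log_trans  # cand[i, j]: move from i to j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = np.empty(n, dtype=int)
    path[-1] = score.argmax()
    for t in range(n - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


# Hypothetical toy data: three states (loss, normal, gain), six clones.
base = np.full((3, 3), 1.0 / 3.0)
init = np.full(3, 1.0 / 3.0)
path = viterbi(
    log_ratios=[0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    distances=[1.0, 1.0, 1.0, 1.0, 1.0],
    means=[-0.5, 0.0, 0.5],
    sd=0.1,
    base=base,
    rate=1.0,
    init=init,
)
```

Under this assumed form the only extra parameter is rate, yet every pair of adjacent clones receives its own transition matrix: closely spaced clones are encouraged to share a state, while widely separated clones are linked only weakly.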

To assess our model, we used a simulated dataset to compare its performance with that of a number of commonly used algorithms. Additionally, we examined its performance in segmenting data obtained from an arrayCGH experiment involving about 50 tumour (mainly breast) cell lines (Pole et al. [2]). The arrays used (Huang et al. [3]) had low-resolution coverage for the majority of the genome, 1.5 Mb coverage for chromosome 8, and an additional tiled region in 8p12. Breast cancer cell lines frequently have highly rearranged and complex genomes, with breaks commonly occurring in 8p12. For several of these cell lines, detailed FISH analysis has been used to determine the number of copies in the region of interest. These data provide an excellent framework for illustrating the efficacy of our model.

1. Fridlyand, J. et al., Hidden Markov models approach to the analysis of array CGH data, Journal of Multivariate Analysis (2004), 90, 132-153
2. Pole, J.C. et al., unpublished, see separate abstract
3. Huang, H-E. et al., A recurrent chromosome breakpoint in breast cancer at the NRG1/neuregulin/heregulin gene, Cancer Research (2004), 64, 6840-6844