Hotel Estoril Eden, Monte Estoril,
5-8 October 2005




Gene Expression and Annotation

Clare Foyle and Nick Fieller
Department of Probability & Statistics, University of Sheffield, UK

Various forms of oligonucleotide microarray allow direct measurement of gene expression in samples from human subjects. Such measurements are made with the aim of providing insight into the biological processes underlying some condition (e.g. cancer), for example which genes play key roles in its development. Typically, many thousands of genes are measured on relatively few subjects, with sparse replication. From the statistical viewpoint, the major problem is therefore the analysis of very high-dimensional data with few observations and poor replication.

However, additional information is available. Most obviously, there is concomitant information on the subjects themselves, including severity of condition and demographic information. Appropriate use of this will enhance statistical analysis. Less well known is the availability of information on the genes, which could play a dual role in the analysis. The broad term for this information is ‘annotation’. Just as subjects with common characteristics might be expected to have similar gene expression profiles, genes sharing some annotation feature might be anticipated to display similarities.

A particular form of annotation is whether a gene has been referred to in connection with a biological function or disease. Text mining techniques can determine the number of such citations in a textbase relating genes to a Medical Subject Heading (MeSH) category, as defined in the US National Library of Medicine’s controlled vocabulary used for indexing. This can provide a measure of linkage between genes. Since such information is typically extremely sparse, the published MeSH hierarchies of terms can be used to group categories at various levels, and hence to obtain a measure of further connections between genes.
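
As an illustration only, the idea of grouping sparse citation counts at coarser levels of a hierarchy could be sketched as follows. Everything here is an assumption made for the sketch rather than the method of the abstract: the gene names and counts are invented, the category codes merely imitate the dotted form of MeSH tree numbers, and cosine similarity is one possible choice of linkage measure among many.

```python
from collections import Counter
from math import sqrt

# Hypothetical citation counts: gene -> {category code: number of citations}.
# Codes imitate dotted MeSH tree numbers; the values are invented.
citations = {
    "GENE_A": {"C04.588.180": 3},
    "GENE_B": {"C04.588.274": 2},
    "GENE_C": {"C14.280.647": 5},
}


def roll_up(counts, depth):
    """Sum counts over categories sharing the first `depth` code levels."""
    rolled = Counter()
    for code, n in counts.items():
        prefix = ".".join(code.split(".")[:depth])
        rolled[prefix] += n
    return rolled


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def linkage(gene1, gene2, depth):
    """Assumed linkage measure: cosine similarity after rolling up to `depth`."""
    return cosine(roll_up(citations[gene1], depth), roll_up(citations[gene2], depth))


if __name__ == "__main__":
    for depth in (3, 2, 1):
        print(depth, linkage("GENE_A", "GENE_B", depth), linkage("GENE_A", "GENE_C", depth))
```

In this toy example GENE_A and GENE_B share no category at the finest level, so their linkage is zero there, but they become linked once categories are grouped one level up; GENE_C remains unlinked to GENE_A at every level.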