March 9-10, 2007
Hotel Quinta das Lágrimas
Organized by CIM
(Centro Internacional de Matemática)
The objective of this "Follow-up
Meeting" is to assess the impact of the thematic term "Statistics in
Genomics and Proteomics", one and a half years after the event. We aim for a rather
informal atmosphere to encourage discussion of the different
perspectives on the development of the subject in Portugal and elsewhere.
Simon Tavaré, Sophie Schbath and
Wolfgang Urfer will give talks on topics related to the themes of the
Workshop. There will be round table discussions
with the objective of identifying new (and perhaps not so new) important
statistical issues in genomics, proteomics and other "omics", as well as
strengthening and/or establishing further collaboration on these themes. There
will be a contributed paper session to which we invite participants to bring
their work in progress, to gather ideas for further development and to
promote a healthy flow of discussion.
The number of participants is limited to 30.
Cheques should be made payable to “CIM” and sent to the address above:
Centro Internacional de
Tel: +351 239 802370/1
Participants should make their own reservations at one of the following hotels, indicating that they are taking part in a "CIM meeting":
€ 80 /per day/ with buffet
€ 43,80 /per day/ with buffet breakfast
March 9 (Friday)
March 10 (Saturday)
Contributed Papers (Work in Progress)
Inferring the behaviour of colon crypts by exploiting methylation patterns
Nicolas, Darryl Shibata and Simon Tavaré1
The analysis of methylation patterns is a promising approach to investigate the genealogy of cell populations in an organism. In a stem cell niche scenario, sampled methylation patterns are the stochastic outcome of a complex interplay between niche structural features such as the number of stem cells within a niche and the niche succession time, the methylation/demethylation process, and the randomness due to sampling. As a consequence, methylation pattern studies can reveal niche characteristics but also require appropriate statistical methods. The analysis of methylation patterns sampled from colon crypts is a prototype of such a study. Previous analyses were based on forward simulation of the cell content of the whole crypt and subsequent comparisons between simulated and experimental data using a few statistics as a proxy to summarize the data. Here I describe a more powerful method to analyze these data based on coalescent modeling and Bayesian inference. Results support a scenario where the colon crypt is maintained by a high number of stem cells; the posterior indicates a number greater than 8 and the posterior mode is between 15 and 20. The results also provide further evidence for synergistic effects in the methylation/demethylation process that could for the first time be quantitatively assessed through their long term consequences such as the coexistence of hyper- and hypo-methylated patterns in the same colon crypt.
Identification of tissue-specific splicing-related factor signatures
Grosso1,2, Anita Gomes1, Nuno Morais2,
Sandra Caldeira1, Natalie Thorne2, Maria Carmo-Fonseca1
Developing computational analysis tools that can contribute to understanding the higher-level regulation of splicing in mammalian organisms is a key challenge. Here, we established a biologically coherent module of 181 splicing-related genes and we developed a joint analysis approach to explicitly search for tissue-specific changes in expression of these genes. We applied the method to multiple data sources from differentiating cultured cells and diverse adult mouse tissues. By examining the behavior of individual splicing genes across complex cell types and tissues, we have identified a total of 74 tissue-specific signatures associated with unique expression profiles of splicing-related genes. The genes that were found to be specifically up- or down-regulated in a particular tissue included well-known RNA binding proteins that can modulate the association of core components of the spliceosome with the pre-mRNA, such as members of the hnRNP and SR protein families, SR protein kinases, DEAD-box RNA helicases and tissue-specific splicing regulators. However, a number of core snRNP components (i.e., Sm, Lsm and snRNP-specific proteins) were additionally found among the tissue-specific signatures. Moreover, we identified robust signatures associated with testis and whole brain, two organs previously described as containing a high percentage of genes undergoing alternative splicing events. Thus, our results support the view that tissue-specific variations in the expression of splicing factors contribute to the divergent pattern of alternative splicing frequency seen in different tissues.
This work was supported by FCT in Portugal (“Fundação para a Ciência e Tecnologia”: SFRH/BD/22825/2005 and POCTI/MGI/49430/2002) and the European Commission funded Network of Excellence EURASNET (European Alternative Splicing Network).
Identification of distinct subset of cellular mRNA associated with splicing factors:
Emiliano Barreto Hernandez, Carina Silva-Fortes, Margarida Gama-Carvalho, Lisete Sousa and Maria Antónia Amaral Turkman
Universidad Nacional de Colombia, CEAUL
Pre-mRNA splicing is an essential step in gene expression that occurs cotranscriptionally in the cell nucleus, involving a large number of RNA binding protein splicing factors, in addition to core spliceosome components. Several of these proteins are required for the recognition of intronic sequence elements, transiently associating with the primary transcript during splicing. Some protein splicing factors, such as the U2 small nuclear RNP auxiliary factor (U2AF), are known to be exported to the cytoplasm, despite being implicated solely in nuclear functions. This observation raises the question of whether U2AF associates with mature mRNA-ribonucleoprotein particles in transit to the cytoplasm, participating in additional cellular functions.
Using a combination of immunoprecipitation and microarray analysis, it was possible to identify subsets of mRNAs that associate differentially with U2AF65 and PTB, corresponding to approximately 10% of all cellular mRNAs expressed in HeLa cells, and to demonstrate that U2AF65 binds either directly or indirectly to defined spliced mRNA.
Isolation of U2AF65- or PTB-associated mRNAs under native conditions was performed by immunoprecipitation from precleared HeLa cell lysate, using specific antibodies. For microarray hybridization, PCR-amplified cDNA from input and immunoprecipitated samples was prepared and hybridized to Affymetrix GeneChip Human Genome U133 Plus 2.0 arrays. To identify mRNAs enriched by the immunoprecipitation procedure, a comparative analysis was performed between each experimental pair of datasets (input and immunoprecipitation samples) with output of all genes.
This experiment is not a common gene expression comparison between two different conditions: here the wild sample must be processed to obtain a new one in which the mRNAs associated with a specific RNA binding protein have been enriched. The major difference between this experiment and a normal microarray experiment is that, in this case, more than 20% of the mRNA is differentially expressed between the two samples, whereas the common normalization methods assume only small differences between the samples.
Therefore, it is important to explore how the normalization methods and the methods used to detect differentially expressed genes can be adapted and applied to these cases. Applying the usual normalization and gene selection methods to these data did not produce reliable results, since we obtained quite different results for different procedures.
Some questions arise from this kind of data: (1) Is it possible to modify existing normalization methods to deal with the data or is it necessary to develop a specific normalization method? (2) Is it possible to modify existing gene selection methods to deal with the data or is it necessary to develop a specific gene selection method? (3) Is it possible to use only one expression condition to detect which genes are expressed? (4) Which are the best reference conditions for selecting differentially expressed genes in this experiment (wild sample or a mock sample)?
We are interested in evaluating how ROC analysis could improve background correction and gene selection methods: (1) Could we use ROC analysis to measure background noise from signal intensities? (2) Could we use ROC analysis to select optimal cutoff values to identify differential genes in multiple chip analysis and to select present calls in single chip analysis?
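As an illustration of the second question, the sketch below picks a cutoff from a ROC curve by maximizing Youden's J (true positive rate minus false positive rate). This is only one common criterion and is not taken from the abstract. The function names, the toy scores and the assumption that a set of genes with known status (for example, validated targets and non-targets) is available are all hypothetical.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (thresholds, tpr, fpr) when 'score >= threshold' is called positive."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)[::-1]          # candidate cutoffs, descending
    pos, neg = scores[labels], scores[~labels]
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    return thresholds, tpr, fpr

def youden_cutoff(scores, labels):
    """Cutoff maximizing Youden's J = TPR - FPR (an assumed criterion)."""
    thresholds, tpr, fpr = roc_points(scores, labels)
    best = int(np.argmax(tpr - fpr))
    return float(thresholds[best]), float(tpr[best]), float(fpr[best])

# Hypothetical enrichment scores; label 1 = gene known to be truly enriched.
scores = [0.55, 0.7, 0.8, 0.9, 0.1, 0.2, 0.6, 0.65]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
cutoff, tpr, fpr = youden_cutoff(scores, labels)
print(cutoff, tpr, fpr)
```

In practice the labelled set would have to come from independent knowledge of the biology, which is precisely what makes the question non-trivial for this kind of data.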
Key-words: microarrays, Affymetrix, mRNA binding proteins, normalization, gene
Statistical analysis of multipatients CGH array data
CGH arrays are now frequently used in the medical community for research and for diagnosis, in particular in cancer studies. A typical experiment aims at detecting gains or losses of genomic material. From a statistical point of view, this results in a breakpoint detection problem for which several solutions have been proposed. One of the most efficient is based on a dynamic programming approach, which provides the optimal segmentation.
The intensive use of CGH arrays now requires tools to analyze several arrays
simultaneously. The mixed linear model seems to be a promising strategy to
handle both covariates and possible correlations between CGH profiles.
However, standard estimation algorithms need to be adapted to include the
segmentation step. We present an E-M algorithm which takes advantage of
dynamic programming and makes it possible to handle several CGH profiles at the same
time.
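The dynamic programming segmentation mentioned in this abstract can be sketched for a single profile as follows. This is a minimal illustration, assuming a piecewise-constant mean, a least-squares cost and a number of segments fixed in advance. It does not include the mixed linear model or the E-M step, and the function name and toy profile are hypothetical.

```python
import numpy as np

def segment(signal, n_segments):
    """Least-squares optimal split of a profile into n_segments constant pieces.

    Returns the indices at which a new segment starts (the breakpoints).
    """
    y = np.asarray(signal, dtype=float)
    n = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))        # prefix sums of y
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))   # prefix sums of y^2

    def cost(i, j):
        """Sum of squared deviations of y[i:j] from its mean (j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # best[k, j]: minimal cost of splitting y[:j] into k segments
    best = np.full((n_segments + 1, n + 1), np.inf)
    last = np.zeros((n_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1, i] + cost(i, j)
                if c < best[k, j]:
                    best[k, j] = c
                    last[k, j] = i

    # Backtrack from the full profile to recover segment starts.
    starts, j = [], n
    for k in range(n_segments, 0, -1):
        j = int(last[k, j])
        starts.append(j)
    return sorted(starts)[1:]   # drop the trivial start at index 0

# Toy log-ratio profile with one gained region.
profile = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
breakpoints = segment(profile, 3)
print(breakpoints)
```

The triple loop costs on the order of K n^2 operations for K segments and n probes, which is the price paid for exploring all segmentations exactly.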
Assessing the exceptionality of network motifs
Obtaining and analyzing biological interaction networks is at the core of systems biology. To help understand these complex networks, many recent works have suggested focusing on motifs which occur more frequently than expected in random networks (Milo et al., 2002; Shen-Orr et al., 2002; Prill et al., 2005). Such motifs indeed seem to reflect functional or computational units which combine to regulate the cellular behavior as a whole. The common method used so far to detect significantly over-represented motifs is based on heavy simulations: random graphs are first generated, then the p-value is derived either from the empirical distribution of the count or via a Gaussian approximation of the z-score calculated from the empirical mean and variance of the count.
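The simulation-based approach described in this paragraph can be sketched as follows. This is a minimal illustration only: it assumes the triangle as the motif, an undirected graph, and an Erdős-Rényi null model with the observed edge density. All function names and the toy network are hypothetical.

```python
import numpy as np

def count_triangles(adj):
    """Number of triangles in an undirected graph given its 0/1 adjacency matrix."""
    a = np.asarray(adj, dtype=np.int64)
    return int(np.trace(a @ a @ a) // 6)   # each triangle is counted 6 times

def random_graph(n, p, rng):
    """Undirected Erdős-Rényi graph: each pair is connected with probability p."""
    upper = np.triu(rng.random((n, n)) < p, k=1)
    return (upper | upper.T).astype(np.int64)

def simulated_significance(adj, n_sim=1000, seed=0):
    """Observed count, z-score and empirical p-value from simulated random graphs."""
    rng = np.random.default_rng(seed)
    a = np.asarray(adj)
    n = a.shape[0]
    p = a.sum() / (n * (n - 1))            # density; a.sum() counts each edge twice
    observed = count_triangles(a)
    counts = np.array([count_triangles(random_graph(n, p, rng))
                       for _ in range(n_sim)])
    z = (observed - counts.mean()) / counts.std(ddof=1)
    p_emp = (1 + (counts >= observed).sum()) / (1 + n_sim)
    return observed, float(z), float(p_emp)

# Toy network: two 4-node cliques plus two isolated nodes.
adj = np.zeros((10, 10), dtype=np.int64)
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                adj[i, j] = 1
observed, z, p_emp = simulated_significance(adj)
print(observed, z, p_emp)
```

The cost grows with the number of simulated graphs, and the empirical p-value cannot go below 1 / (1 + n_sim), which is part of what makes this approach heavy.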
To identify exceptional motifs in a given network, we propose a statistical and analytical method which does not require any simulation (Picard et al., 2007). For this, we first provide an analytical expression of the mean and variance of the count under any stationary random graph model. Then we approximate the motif count distribution by a compound Poisson distribution whose parameters are derived from the mean and variance of the count. Using simulations, we show that the quality of our compound Poisson approximation is very good and much better than that of a Gaussian or a Poisson one. The compound Poisson distribution can then be used to get an approximate p-value and to decide whether an observed count is significantly high or not.
Beyond the p-value calculation, the assessment of the motif exceptionality in a given network relies on the choice of a suitable random graph model. This model should indeed fit some relevant characteristics of the observed network. The degree sequence is usually an important feature to take into account. Unfortunately, the well-known and well-studied Erdős-Rényi model does not fit biological networks well; in particular, it does not account for heterogeneity. We therefore emphasize the recent and promising model called ERMG, for Erdős-Rényi Mixture for Graphs, proposed by Daudin et al. (2007). The ERMG model assumes that nodes are spread into several classes of connectivity and that the probability for two nodes to be connected depends on their classes. The goodness-of-fit of this model on real biological networks is very satisfactory.
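The generative description of the ERMG model given above can be sketched as follows. This covers sampling only, not the variational estimation of Daudin et al. The function name, the parameter names and the numerical values are hypothetical, and the connection probability matrix is assumed symmetric because the graph is undirected.

```python
import numpy as np

def sample_ermg(n, class_probs, connect_probs, seed=0):
    """Sample an undirected graph in which edge probability depends on node classes.

    class_probs[q]      : probability that a node belongs to class q
    connect_probs[q, l] : probability that a class-q node and a class-l node are linked
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(class_probs, dtype=float)
    pi = np.asarray(connect_probs, dtype=float)
    classes = rng.choice(len(alpha), size=n, p=alpha)   # hidden class of each node
    edge_prob = pi[classes][:, classes]                 # n x n matrix of pair probabilities
    upper = np.triu(rng.random((n, n)) < edge_prob, k=1)
    adj = (upper | upper.T).astype(int)                 # symmetric, no self-loops
    return classes, adj

# Two classes: a small, densely connected group of hubs and a sparse periphery.
classes, adj = sample_ermg(50, [0.2, 0.8], [[0.9, 0.3], [0.3, 0.05]])
print(adj.sum(axis=1))   # node degrees, heterogeneous across classes
```

With a single class the model reduces to the Erdős-Rényi model, which is why it can be seen as a mixture extension of it.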
Preliminary results will be shown on the protein-protein interaction network of H. pylori.
Daudin, J.-J., Picard, F. and Robin, S. (2007) Mixture model for random graphs: a variational approach. SSB Preprint n°4 (http://genome.jouy.inra.fr/ssb/preprint/)
Picard, F., Daudin, J.-J., Schbath, S. and Robin, S. (2007) Assessing the exceptionality of network motifs. SSB Preprint n°1 (http://genome.jouy.inra.fr/ssb/preprint/)
Prill, R.J., Iglesias, P.A. and Levchenko, A. (2005) Dynamic properties of network motifs contribute to biological network organization. PLoS Biology, 3(11)
Milo, R., Shen-Orr, S., Itzkovitz, S., Newman, M.E.J. and Alon, U. (2002) Network motifs: simple building blocks of complex networks. Science, 298, 824-827
Shen-Orr, S.S., Milo, R., Mangan, S. and Alon, U. (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics, 31, 64-68
Statistical tools for proteomics
This talk will report on ongoing projects at several research institutions and companies for proteomic technologies and drugs.
The statistical methods used address the problem of extracting signal content from protein mass spectrometry data, protein-protein interaction experiments, time-dependent protein expression data and signalling networks. Special emphasis will be given to partial least squares regression, mixed linear models, genetic algorithms, Bayesian networks and wavelet-analysis.
All the projects represent the continuation and motivation of activities presented in talks and posters at the workshop on Statistics in Genomics and Proteomics in 2005.