March 9-10, 2007

Hotel Quinta das Lágrimas
Coimbra, Portugal

Organized by CIM

(Centro Internacional de Matemática)




Updated 2.Mar.07

This meeting follows the event “Workshop on Statistics in Genomics and Proteomics”, which took place in Estoril from 5 to 8 October 2005.

The objective of this “Follow-up Meeting” is to assess the impact of the thematic term “Statistics in Genomics and Proteomics”, one and a half years after the event. We aim for a rather informal atmosphere to encourage discussion of the different perspectives on the development of the subject in Portugal and elsewhere.

Simon Tavaré, Sophie Schbath and Wolfgang Urfer will give talks on topics related to the themes of the Workshop. There will be round-table discussions with the objective of identifying new (and perhaps not so new) important statistical issues in genomics, proteomics and other "omics", as well as strengthening and/or establishing further collaboration on these themes. There will also be a contributed paper session, to which we invite participants to bring their work in progress, to gather ideas for further development and to promote a healthy flow of discussion.


  Fee: € 100
 (This fee covers coffee-breaks, dinner on Friday and lunch on Saturday)

  The number of participants is limited to 30.

  Intention of participation (form) should be sent by January 2, 2007, by email to Antónia Turkman.

  Cheques should be made payable to “CIM” and sent to the address below:

         Centro Internacional de Matemática
         WSGP Follow-Up Meeting

         Complexo do Observatório Astronómico
         Almas de Freire
         3040 Coimbra

         Tel: +351 239 802370/1
         Fax: +351 239445380



  Participants should make their own reservations at one of the following hotels, indicating that they are taking part in a "CIM meeting":

  Hotel Quinta das Lágrimas

        Single room:   € 80 per day, with buffet breakfast
        Double room:   € 100 per day, with buffet breakfast

  Hotel D. Luís

        Single room:   € 43.80 per day, with buffet breakfast
        Double room:   € 52.60 per day, with buffet breakfast



  March 9 (Friday)

  16:00 - 17:00  SIMON TAVARÉ
                          Inferring the behaviour of colon crypts by exploiting methylation patterns
  17:00 - 17:30  Coffee-Break
  17:30 - 19:30  Round-Table Discussion

  20:00  Dinner

  March 10 (Saturday)

    9:00 - 11:00  Contributed Papers (Work in Progress)
                          Identification of tissue-specific splicing-related factor signatures
                          Identification of distinct subset of cellular mRNA associated with splicing factors:
                          evaluation, optimization of normalization and gene selection methods
                          Statistical analysis of multipatients CGH array data

  11:00 - 11:30  Coffee-Break
  11:30 - 12:30  SOPHIE SCHBATH
                          Assessing the exceptionality of network motifs

  12:30 - 14:00  Lunch

  14:00 - 15:00  WOLFGANG URFER
                          Statistical tools for proteomics
  15:00 - 16:30  Round-Table Discussion and Closing





Ana Rita Grosso

Universidade de Lisboa, Portugal

Ana Subtil

Universidade de Lisboa, Portugal

Antónia Turkman

Universidade de Lisboa, Portugal

Bruno Sousa

Universidade do Minho, Portugal

Carina Silva-Fortes

Escola Superior de Tecnologia da Saúde de Lisboa, Portugal

Carlos Daniel Paulino

Instituto Superior Técnico, Portugal

Emiliano Barreto

Universidad Nacional de Colombia, Colombia

Feridun Turkman

Universidade de Lisboa, Portugal

Lisete Sousa

Universidade de Lisboa, Portugal

Luísa C. Castro

Universidade de Lisboa, Portugal

Luzia Gonçalves

Instituto de Higiene e Medicina Tropical, Portugal

Manuela Neves

Instituto Superior de Agronomia, Portugal

Margarida Gama-Carvalho

Instituto de Medicina Molecular, Portugal

Maria do Carmo Fonseca

Instituto de Medicina Molecular, Portugal

Marília Antunes

Universidade de Lisboa, Portugal

Marta Mesquita

Instituto Superior de Agronomia, Portugal

Nuno Sepúlveda

Instituto Gulbenkian de Ciência, Portugal

Pedro Fernandes

Instituto Gulbenkian de Ciência, Portugal

Rita Gaio

Universidade do Porto, Portugal

Rute Vieira

Universidade de Lisboa, Portugal

Sandra Ramos

Instituto Politécnico do Porto, Portugal

Simon Tavaré

University of Southern California, USA

Sophie Schbath

Institut National de la Recherche Agronomique, France

Stéphane Robin

Institut National de la Recherche Agronomique, France

Susana Vinga

Instituto de Engenharia de Sistemas e Computadores - ID, Portugal

Valeska Andreozzi

Universidade de Lisboa, Portugal

Wolfgang Urfer

University of Dortmund, Germany




Inferring the behaviour of colon crypts by exploiting methylation patterns

 Pierre Nicolas, Darryl Shibata and Simon Tavaré¹
¹ University of Southern California, USA

 The analysis of methylation patterns is a promising approach to investigate the genealogy of cell populations in an organism. In a stem cell niche scenario, sampled methylation patterns are the stochastic outcome of a complex interplay between niche structural features such as the number of stem cells within a niche and the niche succession time, the methylation/demethylation process, and the randomness due to sampling. As a consequence, methylation pattern studies can reveal niche characteristics but also require appropriate statistical methods. The analysis of methylation patterns sampled from colon crypts is a prototype of such a study. Previous analyses were based on forward simulation of the cell content of the whole crypt and subsequent comparisons between simulated and experimental data using a few statistics as a proxy to summarize the data. Here I describe a more powerful method to analyze these data based on coalescent modeling and Bayesian inference. Results support a scenario where the colon crypt is maintained by a high number of stem cells; the posterior indicates a number greater than 8 and the posterior mode is between 15 and 20. The results also provide further evidence for synergistic effects in the methylation/demethylation process that could for the first time be quantitatively assessed through their long term consequences such as the coexistence of hyper- and hypo-methylated patterns in the same colon crypt.



Identification of tissue-specific splicing-related factor signatures 

Ana Rita Grosso¹˒², Anita Gomes¹, Nuno Morais², Sandra Caldeira¹, Natalie Thorne², Maria Carmo-Fonseca¹
¹ Institute of Molecular Medicine, Faculty of Medicine, University of Lisbon, Portugal
² University of Cambridge, Department of Oncology, Hutchison-MRC Research Centre, Cambridge CB2 2XZ, UK

 Developing computational analysis tools that can contribute to understanding the higher-level regulation of splicing in mammalian organisms is a key challenge. Here, we established a biologically coherent module of 181 splicing-related genes and developed a joint analysis approach to explicitly search for tissue-specific changes in the expression of these genes. We applied the method to multiple data sources from differentiating cultured cells and diverse adult mouse tissues. By examining the behavior of individual splicing genes across complex cell types and tissues, we identified a total of 74 tissue-specific signatures associated with unique expression profiles of splicing-related genes. The genes found to be specifically up- or down-regulated in a particular tissue included well-known RNA binding proteins that can modulate the association of core components of the spliceosome with the pre-mRNA, such as members of the hnRNP and SR protein families, SR protein kinases, DEAD-box RNA helicases and tissue-specific splicing regulators. However, a number of core snRNP components (i.e., Sm, Lsm and snRNP-specific proteins) were also found among the tissue-specific signatures. Moreover, we identified robust signatures associated with testis and whole brain, two organs previously described as containing a high percentage of genes undergoing alternative splicing events. Thus, our results support the view that tissue-specific variations in the expression of splicing factors contribute to the divergent pattern of alternative splicing frequency seen in different tissues.

This work was supported by FCT in Portugal (“Fundação para a Ciência e Tecnologia”: SFRH/BD/22825/2005 and POCTI/MGI/49430/2002) and the European Commission funded Network of Excellence EURASNET (European Alternative Splicing Network).



Identification of distinct subset of cellular mRNA associated with splicing factors:
evaluation, optimization of normalization and gene selection methods

Emiliano Barreto Hernandez, Carina Silva-Fortes, Margarida Gama-Carvalho, Lisete Sousa and Maria Antónia Amaral Turkman

1 Universidad Nacional de Colombia, CEAUL
2 Escola Superior de Tecnologia da Saúde de Lisboa, CEAUL, Portugal
3 Faculdade de Ciências, Universidade de Lisboa, CEAUL, Portugal

Pre-mRNA splicing is an essential step in gene expression that occurs cotranscriptionally in the cell nucleus, involving a large number of RNA binding protein splicing factors, in addition to core spliceosome components. Several of these proteins are required for the recognition of intronic sequence elements, transiently associating with the primary transcript during splicing. Some protein splicing factors, such as the U2 small nuclear RNP auxiliary factor (U2AF), are known to be exported to the cytoplasm, despite being implicated solely in nuclear functions. This observation raises the question of whether U2AF associates with mature mRNA-ribonucleoprotein particles in transit to the cytoplasm, participating in additional cellular functions [1].

Using a combination of immunoprecipitation and microarray analysis, it was possible to identify subsets of mRNAs that associate differentially with U2AF65 and PTB, corresponding to approximately 10% of all cellular mRNAs expressed in HeLa cells, and to demonstrate that U2AF65 binds either directly or indirectly to defined spliced mRNA [1].

Isolation of U2AF65- or PTB-associated mRNAs under native conditions was performed by immunoprecipitation from precleared HeLa cell lysate, using specific antibodies. For microarray hybridization, PCR-amplified cDNA from input and immunoprecipitated samples was prepared and hybridized to Affymetrix GeneChip Human Genome U133 Plus 2.0 arrays. To identify mRNAs enriched by the immunoprecipitation procedure, a comparative analysis was performed between each pair of experimental datasets (input and immunoprecipitation samples) with output of all genes [1].

This experiment is not a common gene expression comparison between two different conditions: here the wild sample must be processed to obtain a new one in which the mRNAs associated with a specific RNA binding protein have been enriched. The major difference between this experiment and a standard microarray experiment is that, in this case, more than 20% of the mRNA is differentially expressed between the two samples, whereas the common normalization methods assume only small differences between the samples.

Therefore, it is important to explore how the normalization methods and the methods used to detect differentially expressed genes can be adapted and applied to these cases. Applying the usual normalization and gene selection methods to these data did not produce reliable results, since we obtained quite different results for different procedures.

Some questions arise from this kind of data: (1) Is it possible to modify existing normalization methods to deal with the data or is it necessary to develop a specific normalization method? (2) Is it possible to modify existing gene selection methods to deal with the data or is it necessary to develop a specific gene selection method? (3) Is it possible to use only one expression condition to detect which genes are expressed? (4) Which are the best reference conditions for selecting differentially expressed genes in this experiment (wild sample or a mock sample)?

We are interested in evaluating how ROC analysis could improve background correction and gene selection methods: (1) Could we use ROC analysis to separate background noise from signal intensities? (2) Could we use ROC analysis to select optimal cutoff values to identify differentially expressed genes in multiple chip analysis and to select present calls in single chip analysis?
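To make the second question concrete, here is a minimal sketch of ROC-based cutoff selection. It assumes a set of scores with known true labels and uses Youden's J (sensitivity + specificity - 1) as the optimality criterion. The scores, the labels, the function names and the choice of criterion are all illustrative assumptions; the abstract does not specify any of them.

```python
def roc_points(scores, labels):
    """Return (threshold, tpr, fpr) for each distinct score used as a cutoff.

    An item is called positive when its score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, tp / n_pos, fp / n_neg))
    return points


def youden_cutoff(scores, labels):
    """Cutoff maximizing Youden's J = TPR - FPR."""
    return max(roc_points(scores, labels), key=lambda p: p[1] - p[2])[0]


# Hypothetical enrichment scores; label 1 marks a truly enriched mRNA.
scores = [0.2, 0.4, 0.5, 0.9, 1.3, 1.6, 2.1, 2.4]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
cutoff = youden_cutoff(scores, labels)
print(cutoff)
```

In practice the true labels are unknown, so a sketch like this would need a reference set of known positives and negatives (for example, spike-ins or validated targets), which is itself an assumption.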

Key-words: microarrays, Affymetrix, mRNA binding proteins, normalization, gene

[1] Gama-Carvalho, M.; Barbosa-Morais, N.L.; Brodsky, A.S.; Silver, P.A. and Carmo-Fonseca, M. (2006). Genome-wide identification of functionally distinct subsets of cellular mRNAs associated with two nucleocytoplasmic-shuttling mammalian splicing factors. Genome Biology, 7:R113.



Statistical analysis of multipatients CGH array data

Stéphane Robin
Institut National de la Recherche Agronomique, France

CGH arrays are now frequently used in the medical community for research and for diagnosis, in particular in cancer studies. A typical experiment aims at detecting gains or losses of genomic material. From a statistical point of view, this results in a breakpoint detection problem for which several solutions have been proposed. One of the most efficient is based on a dynamic programming approach, which provides the optimal segmentation.
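The dynamic programming idea can be sketched for a single profile as follows. This is a generic illustration, not the authors' implementation: it assumes a piecewise-constant mean, a squared-error criterion and a number of segments k fixed in advance, and the function name and data are hypothetical.

```python
def segment(y, k):
    """Split y into k contiguous segments minimizing within-segment squared error.

    Returns (total_cost, breaks), where breaks holds the start indices of
    segments 2..k.
    """
    n = len(y)
    # Prefix sums give the squared-error cost of any segment in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(y):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):
        # Squared error of y[i:j] around its own mean.
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    inf = float("inf")
    # best[r][j]: minimal cost of splitting y[:j] into r segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for r in range(1, k + 1):
        for j in range(r, n + 1):
            for i in range(r - 1, j):
                c = best[r - 1][i] + cost(i, j)
                if c < best[r][j]:
                    best[r][j] = c
                    back[r][j] = i
    # Trace back the start index of each segment, last segment first.
    breaks, j = [], n
    for r in range(k, 1, -1):
        j = back[r][j]
        breaks.append(j)
    return best[k][n], breaks[::-1]


# Hypothetical log-ratios with a gained region in the middle.
profile = [0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 0.0, 0.1]
total_cost, breaks = segment(profile, 3)
print(breaks)
```

The triple loop costs O(k n²) operations, which is the usual price for an exact segmentation of this kind.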

The intensive use of CGH arrays now requires tools to analyze several arrays simultaneously. The mixed linear model seems to be a promising strategy to handle both covariates and possible correlations between CGH profiles. However, standard estimation algorithms need to be adapted to include the segmentation step. We present an E-M algorithm which takes advantage of dynamic programming and can handle several CGH profiles at the same time. We apply this approach to the segmentation of profiles obtained from cancer patients.



Assessing the exceptionality of network motifs

Sophie Schbath
Statistics for Systems Biology, INRA, France

Obtaining and analyzing biological interaction networks is at the core of systems biology. To help understand these complex networks, many recent works have suggested focusing on motifs which occur more frequently than expected at random (Milo et al., 2002; Shen-Orr et al., 2002; Prill et al., 2005). Such motifs indeed seem to reflect functional or computational units which combine to regulate the cellular behavior as a whole. The method commonly used so far to detect significantly over-represented motifs is based on heavy simulations: random graphs are first generated, then the p-value is derived either from the empirical distribution of the count or via a Gaussian approximation of the z-score calculated from the empirical mean and variance of the count.
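The simulation-based procedure described above can be sketched as follows. The example assumes the triangle as the motif and an Erdős-Rényi null model with the same edge density as the observed graph; both choices, the function names and the toy network are assumptions made for illustration only.

```python
import itertools
import math
import random


def count_triangles(n, edges):
    """Count triangles in an undirected graph on nodes 0..n-1."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return sum(
        1
        for u, v, w in itertools.combinations(range(n), 3)
        if v in adj[u] and w in adj[u] and w in adj[v]
    )


def random_graph(n, p, rng):
    """Erdős-Rényi graph: each pair is an edge independently with probability p."""
    return [e for e in itertools.combinations(range(n), 2) if rng.random() < p]


def simulated_pvalues(n, edges, n_sim=500, seed=0):
    rng = random.Random(seed)
    p = len(edges) / (n * (n - 1) / 2)  # edge density of the observed graph
    observed = count_triangles(n, edges)
    counts = [count_triangles(n, random_graph(n, p, rng)) for _ in range(n_sim)]
    # Empirical p-value (with the usual +1 correction).
    p_emp = (1 + sum(c >= observed for c in counts)) / (n_sim + 1)
    # Gaussian approximation from the empirical mean and variance.
    mean = sum(counts) / n_sim
    var = sum((c - mean) ** 2 for c in counts) / (n_sim - 1)
    z = (observed - mean) / math.sqrt(var)
    p_gauss = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return observed, p_emp, z, p_gauss


# Hypothetical network: a 5-clique (10 triangles) plus 5 isolated nodes.
n_nodes = 10
edges = list(itertools.combinations(range(5), 2))
observed, p_emp, z, p_gauss = simulated_pvalues(n_nodes, edges)
print(observed, p_emp, z, p_gauss)
```

The empirical p-value can never fall below 1/(n_sim + 1), which is one reason why small p-values require very many simulated graphs.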

To identify exceptional motifs in a given network, we propose a statistical and analytical method which does not require any simulation (Picard et al., 2007). For this, we first provide an analytical expression of the mean and variance of the count under any stationary random graph model. Then we approximate the motif count distribution by a compound Poisson distribution whose parameters are derived from the mean and variance of the count. Using simulations, we show that the quality of our compound Poisson approximation is very good and much better than that of a Gaussian or a Poisson one. The compound Poisson distribution can then be used to get an approximate p-value and to decide whether an observed count is significantly high or not.
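As an illustration of deriving compound Poisson parameters from a mean and a variance, the sketch below uses the geometric-Poisson (Pólya-Aeppli) family, in which a Poisson(lam) number of clumps each has a geometric size with parameter a. The abstract does not name the family, so this choice is an assumption, as are the function names and the numerical values. The probabilities are computed with the standard Panjer recursion for compound Poisson laws.

```python
import math


def polya_aeppli_params(mean, var):
    """Match a geometric-Poisson law to a given mean and variance (var >= mean).

    Uses mean = lam / (1 - a) and var = lam * (1 + a) / (1 - a) ** 2.
    """
    a = (var - mean) / (var + mean)
    lam = (1 - a) * mean
    return lam, a


def pmf(lam, a, n_max):
    """P(N = 0), ..., P(N = n_max) via the Panjer recursion."""
    # f[k]: probability that a clump has size k (geometric on 1, 2, ...).
    f = [0.0] + [(1 - a) * a ** (k - 1) for k in range(1, n_max + 1)]
    p = [math.exp(-lam)]
    for n in range(1, n_max + 1):
        p.append(lam / n * sum(k * f[k] * p[n - k] for k in range(1, n + 1)))
    return p


def upper_tail(lam, a, observed):
    """Approximate p-value P(N >= observed)."""
    if observed <= 0:
        return 1.0
    return 1.0 - sum(pmf(lam, a, observed - 1))


# Hypothetical moments of a motif count under the null model.
lam, a = polya_aeppli_params(4.0, 12.0)
p_value = upper_tail(lam, a, 15)
print(lam, a, p_value)
```

When the variance equals the mean, a is 0 and the law reduces to an ordinary Poisson distribution.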

Beyond the p-value calculation, the assessment of motif exceptionality in a given network relies on the choice of a suitable random graph model. This model should fit some relevant characteristics of the observed network. The degree sequence is usually an important feature to take into account. Unfortunately, the well-known and well-studied Erdős-Rényi model does not fit biological networks well; in particular, it does not account for heterogeneity. We therefore highlight the recent and promising ERMG model (Erdős-Rényi Mixture for Graphs) proposed by Daudin et al. (2007). The ERMG model assumes that nodes are spread into several classes of connectivity and that the probability for two nodes to be connected depends on their classes. The goodness-of-fit of this model on real biological networks is very satisfactory.
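The generative side of such a mixture model can be sketched in a few lines: each node first draws a class, then each pair of nodes is connected with a probability that depends on the two classes. The function name, the class proportions and the connection matrix below are hypothetical, and the sketch covers only sampling, not the variational estimation of Daudin et al. (2007).

```python
import random


def sample_ermg(n, class_probs, connect, seed=0):
    """Draw a graph from a mixture model: classes first, then edges.

    class_probs[q] : probability that a node belongs to class q
    connect[q][l]  : probability that a class-q node and a class-l node
                     are connected (symmetric matrix)
    """
    rng = random.Random(seed)
    classes = rng.choices(range(len(class_probs)), weights=class_probs, k=n)
    edges = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if rng.random() < connect[classes[i]][classes[j]]
    ]
    return classes, edges


# Hypothetical two-class network: a few well-connected hubs and a sparse periphery.
classes, edges = sample_ermg(
    50,
    class_probs=[0.1, 0.9],
    connect=[[0.8, 0.3], [0.3, 0.02]],
)
print(len(edges))
```

With a single class the model reduces to the Erdős-Rényi model, which is why several classes are needed to capture heterogeneous degrees.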

Preliminary results will be shown on the protein-protein interaction network of H. pylori.


Daudin, J.-J., Picard, F. and Robin, S. (2007) Mixture model for random graphs: a variational approach. SSB Preprint n°4.

Picard, F., Daudin, J.-J., Schbath, S. and Robin, S. (2007) Assessing the exceptionality of network motifs. SSB Preprint n°1.

Prill, R.J., Iglesias, P.A. and Levchenko, A. (2005) Dynamic properties of network motifs contribute to biological network organization. PLoS Biology, 3(11).

Milo, R., Shen-Orr, S., Itzkovitz, S., Newman, M.E.J. and Alon, U. (2002) Network motifs: simple building blocks of complex networks. Science, 298, 824-827.

Shen-Orr, S.S., Milo, R., Mangan, S. and Alon, U. (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics, 31, 64-68.



Statistical tools for proteomics

Wolfgang Urfer
University of Dortmund, Germany

This talk will report on ongoing projects at several research institutions and companies for proteomic technologies and drugs.

The statistical methods used address the problem of extracting signal content from protein mass spectrometry data, protein-protein interaction experiments, time-dependent protein expression data and signalling networks. Special emphasis will be given to partial least squares regression, mixed linear models, genetic algorithms, Bayesian networks and wavelet analysis.

All the projects continue, and were motivated by, activities presented in talks and posters at the Workshop on Statistics in Genomics and Proteomics in 2005.