EvolHHuPro
The goal of our project is to define a complete set of evolutionary histories (cascades of phylogenetic events) for the human proteome and to analyse them at the genome scale.
The genetic information encoded in the genome sequence contains the blueprint for the potential development and activity of an organism. This information can only be fully comprehended in the light of the evolutionary events (duplication, loss, recombination, mutation…) acting on the genome, which are reflected in changes in the sequence, structure and function of the gene products (nucleic acids and proteins) and, ultimately, in the biological complexity of the organism.
The recent availability of the complete genome sequences of a large number of model organisms means that we can now begin to understand the mechanisms involved in the evolution of the genome and their consequences for the study of biological systems. This is illustrated by the important role that evolutionary analyses and phylogenetic inferences play in most functional genomics studies, e.g. of promoters (‘phylogenetic footprinting’), of interactomes (the notion of ‘interologs’, based on the presence and degree of conservation of counterparts of interacting proteins), and in comparisons of transcriptomes or proteomes (the notions of phylogenetic proximity and co-regulation/co-expression).
At the same time, theoretical advances in information representation and management have revolutionised the way experimental information is collected, stored and exploited. Ontologies, such as the Gene Ontology (GO) or the Sequence Ontology (SO), provide a formal representation of the data that computers can parse automatically and at high throughput. These ontologies are being exploited in new information management systems to allow large-scale data mining, pattern discovery and knowledge inference.
Unfortunately, the vast number and complexity of the events shaping eukaryotic genomes mean that a complete understanding of evolution at the genomic level is not currently feasible. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation.
We propose to characterise and study the evolutionary histories of the human proteome, defined as the impact on human proteins (extensions, insertions, deletions…) of the cascade of genetic events (duplication, lateral transfer, inversion, transposition, deletion, insertion…) that occurred during the evolution of the vertebrate genomes. This ambitious objective is now achievable thanks to the emergence of formal descriptions of biological data and to recent developments in accurate phylogenetic reconstruction and genome analysis (Partner 1: Figenix platform) and in automated, reliable and exploitable protein sequence alignment (Partner 1 & 2: TCOFFEE, PipeAlign, MAO, MACSIMS…). These methodologies will be combined into a multi-agent expert system for the construction of evolutionary histories. A new ontology will be developed to facilitate the automatic definition of the important genetic events shaping a single protein and of their potential causalities at the genome level. In a subsequent step, the evolutionary histories of the complete human proteome will be reconstructed, classified into sets of proteins sharing typical evolutionary histories, and these sets will be analysed functionally. An analysis at the genomic level will then be carried out for a selected number of proteins identified in the classification and functional analysis step.
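As a purely illustrative sketch of the classification step, an evolutionary history could be represented as an ordered sequence of genetic events, and proteins could be grouped when their sequences are identical. The names below (`EventType`, `Event`, `group_by_history`, the lineage labels and the protein identifiers) are hypothetical and are not part of the project's ontology or software; exact matching stands in for whatever similarity measure the classification would actually use.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    """Genetic event types named in the text (illustrative subset)."""
    DUPLICATION = "duplication"
    LATERAL_TRANSFER = "lateral_transfer"
    INVERSION = "inversion"
    TRANSPOSITION = "transposition"
    DELETION = "deletion"
    INSERTION = "insertion"


@dataclass(frozen=True)
class Event:
    """One event in a history; `lineage` is a hypothetical branch label."""
    kind: EventType
    lineage: str


def group_by_history(histories):
    """Group protein identifiers whose ordered event sequences are identical.

    Assumption: exact equality of the event sequence defines a shared
    history. A real classification would likely need a softer criterion.
    """
    groups = defaultdict(list)
    for protein, events in histories.items():
        groups[tuple(events)].append(protein)
    return dict(groups)


# Made-up example data, for illustration only.
histories = {
    "protein_A": [Event(EventType.DUPLICATION, "vertebrates"),
                  Event(EventType.INSERTION, "mammals")],
    "protein_B": [Event(EventType.DUPLICATION, "vertebrates"),
                  Event(EventType.INSERTION, "mammals")],
    "protein_C": [Event(EventType.DELETION, "primates")],
}

groups = group_by_history(histories)
for history, proteins in groups.items():
    print([e.kind.value for e in history], proteins)
```

Making `Event` a frozen dataclass keeps events hashable, so a whole history can serve directly as a dictionary key.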