Difference between revisions of "Decrypthon Data Center"
Revision as of 05:57, 19 June 2008
Overview
Since 2000, the availability of the human genome and rapid progress in biotechnology and information technology have led to the generation of numerous large biomedical datasets. Modern biomedical information therefore amounts to a high volume of heterogeneous data that doubles in size every year (Statistics NCBI) and covers very different data types: patient data (of phenotypic, environmental or behavioral origin), gene data (genome environment, gene expression status, enzymatic activity, gene product modification…) and the processes, protocols or treatments used to generate the information. In this context, systemic approaches are now being developed to analyze and compare this large amount of information, in order to identify genes and to predict their functions in the cascade of events and networks involved, for example, in the emergence of a disease. This requires dynamic and powerful systems to store, assemble, integrate and process very large datasets from different sources.

Recently, the Decrypthon initiative (Decrypthon), a collaboration between AFM, CNRS and IBM, was launched with two aims: first, to develop a computing grid that connects hundreds of processors installed in various data-processing centres of French universities and, second, to give the biological research community easier access to the data. Several biomedical projects in progress within the initiative require both strong computational capacity and the deployment, in the grid environment, of a data integration system able to manage large volumes of heterogeneous data automatically, to process complex queries quickly and to handle versioning. This system is called the Decrypthon Data Center.
The BIRD System was used to implement the Décrypthon Data Center.
• Sharing of large-scale biological data for applications (Macsims, MS2PH, Magos, Ordalie...) running on the Décrypthon Grid.
• Management of generated data (results) on the Grid.
• Sharing of data and services for the scientific community: http://decrypthon.u-strasbg.fr/birdweb/query.do
Constraints of this Center
The Decrypthon Data Center is able to manage databases of various types; Table 1 details the data banks often used by biologists. These biological datasets are widely distributed over the Internet and made available in different formats. The integration system must be able to cope with five strong constraints:
• Volume of the data: public biological information consists of genome and gene sequences, protein structures and sequence alignments, representing more than one terabyte.
• Heterogeneity and management of the data: ranging from three-dimensional models and sequence alignments to images of gels and scientific articles, these data are provided in different formats. This requires the development of specific parsers for each data type. Additionally, the data generated by the scientific projects have to be stored and indexed in real time.
• Safety and confidentiality: the system must also integrate clinical databases. For obvious reasons of confidentiality and ethics, it is not always possible to duplicate such databases or to transfer sensitive data over networks. The pertinent information thus remains under the control of its owners, who allow restricted access to the remote database.
• Interoperability capacities: in bioinformatics, many toolkits (Thompson et al., 2006) (Plewniak et al., 2003) have been developed in different programming languages and are composed of several independent software components sharing the same data through different protocols. Thus the system must provide easy access for this external software through independent methods such as an HTTP service, a web service or an API.
• Query expression and treatment: the system must also provide a simple protocol allowing users, who are generally not computer scientists, to easily express their query or retrieval protocol on data banks without knowing the structure of the relational database. This protocol should be written in flat files or XML format so that it can be reused in any other data warehouse created by the same integration system.
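The last constraint, a query written in XML without knowledge of the relational structure, can be illustrated with a minimal Python sketch. It does not reproduce the actual BIRD query protocol: the XML element names, the `SCHEMA` mapping, the table and column names and the sample rows are all hypothetical and exist only to show the idea of translating a user-facing description into SQL.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical query description: element and attribute names are invented
# for this sketch and are not the BIRD protocol.
QUERY_XML = """
<query bank="proteins">
  <select>accession</select>
  <select>organism</select>
  <where field="organism" equals="Homo sapiens"/>
</query>
"""

# Hypothetical mapping from user-facing names to the relational structure,
# which the user never has to see.
SCHEMA = {
    "proteins": {
        "table": "protein_entry",
        "fields": {"accession": "acc", "organism": "org_name", "length": "seq_len"},
    }
}


def compile_query(xml_text, schema=SCHEMA):
    """Translate the XML description into a parameterised SQL string and its values."""
    root = ET.fromstring(xml_text)
    bank = schema[root.get("bank")]
    fields = bank["fields"]
    # Column names come only from the trusted mapping, never from the XML itself.
    columns = [fields[el.text.strip()] for el in root.findall("select")]
    sql = f"SELECT {', '.join(columns)} FROM {bank['table']}"
    clauses, params = [], []
    for cond in root.findall("where"):
        clauses.append(f"{fields[cond.get('field')]} = ?")
        params.append(cond.get("equals"))
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


def run_demo():
    """Run the compiled query against an in-memory database with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein_entry (acc TEXT, org_name TEXT, seq_len INTEGER)")
    conn.executemany(
        "INSERT INTO protein_entry VALUES (?, ?, ?)",
        [("ACC_A", "Homo sapiens", 120), ("ACC_B", "Mus musculus", 98)],
    )
    sql, params = compile_query(QUERY_XML)
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return sql, rows
```

Because the XML file names only logical fields, the same file could in principle be reused against another warehouse by supplying a different mapping, which is the reuse property the constraint asks for.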