Decrypthon Data Center

From Wikili
Revision as of 06:03, 19 June 2008 by Nguyen (talk | contribs) (Functionnalities of thi Center)

Overview

Since 2000, the availability of the human genome and rapid progress in biotechnology and information technology have led to the generation of numerous large biomedical datasets. Modern biomedical information is therefore a large volume of heterogeneous data that doubles in size every year (Statistics NCBI) and covers very different data types, including patient data (of phenotypic, environmental or behavioral origin), gene data (genome environment, gene expression status, enzymatic activity, gene product modification…) and the processes, protocols or treatments used to generate the information. In this context, systemic approaches are now being developed to analyze and compare this huge amount of information, in order to identify genes and to predict their functions in the cascades of events and networks involved, for example, in the emergence of a disease. This requires dynamic and powerful systems to store, assemble, integrate and process very large datasets from different sources.

The Decrypthon initiative (Decrypthon), a collaboration between AFM, CNRS and IBM, was recently launched with two aims: first, to develop a computing grid that connects hundreds of processors installed in data-processing centres at French universities and, second, to give the biological research community easier access to the data. Within this initiative, several biomedical projects are in progress. They require both strong computational capacity and the deployment, in the grid environment, of a data integration system that can automatically manage large volumes of heterogeneous data, quickly process complex queries and handle versioning. This system is called the Decrypthon Data Center.



The BIRD System was used to implement the Décrypthon Data Center:

  Sharing of large-scale biological data for applications (Macsims, MS2PH, Magos, Ordalie…) running on the Décrypthon Grid
  Management of generated data (results) on the Grid
  Sharing of data and services for the scientific community
  http://decrypthon.u-strasbg.fr/birdweb/query.do


File:Bird ddc.jpg

Functionalities

The Decrypthon Data Center is able to manage databases of various types; Table 1 details the data banks often used by biologists. These biological datasets are widely distributed over the Internet and made available in different formats. The integration system must be able to cope with five strong constraints:

• Volume of the data: the public biological information consists of genome and gene sequences, protein structures and sequence alignments representing more than one terabyte.

• Heterogeneity and management of the data: ranging from three-dimensional models and sequence alignments to gel images and scientific articles, these data are provided in different formats. This requires the development of a specific parser for each data type. In addition, the data generated by the scientific projects have to be stored and indexed in real time.

• Safety and confidentiality: the system must also integrate clinical databases. For obvious reasons of confidentiality and ethics, it is not always possible to duplicate such databases or to transfer sensitive data over networks. The pertinent information thus remains under the control of its owners, who allow restricted access to the remote database.

• Interoperability capacities: in bioinformatics, many toolkits (Thompson et al., 2006) (Plewniak et al., 2003) have been developed in different programming languages and are composed of several independent software components that share the same data through different protocols. The system must therefore provide easy access for this external software through independent methods such as an HTTP service, a web service or an API.

• Query expression and treatment: the system must also provide a simple protocol that allows users, who are generally not computer scientists, to express their queries or retrieval protocols on data banks easily, without knowing the structure of the relational database. Such a protocol should be written as a flat file or in XML format so that it can be reused in any other data warehouse created by the same integration system.
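The heterogeneity constraint above calls for one parser per data type. The following is a minimal sketch of how such parsers could be organised behind a single entry point. It is not the BIRD implementation: the registry `PARSERS`, the `register` decorator and the `parse` function are hypothetical names introduced for illustration, and FASTA is used only as a familiar example of a sequence format.

```python
from typing import Callable, Dict, List, Tuple

# (identifier, content) pair produced by every parser in this sketch.
Record = Tuple[str, str]

# Hypothetical registry: format name -> parser function.
PARSERS: Dict[str, Callable[[str], List[Record]]] = {}


def register(fmt: str):
    """Decorator that records a parser under the given format name."""
    def decorator(func: Callable[[str], List[Record]]):
        PARSERS[fmt] = func
        return func
    return decorator


@register("fasta")
def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (identifier, sequence) records."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            parts = line[1:].split()
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse(fmt: str, text: str) -> List[Record]:
    """Dispatch to the parser registered for this format."""
    if fmt not in PARSERS:
        raise ValueError(f"no parser registered for format {fmt!r}")
    return PARSERS[fmt](text)
```

With this layout, supporting a new data type means registering one more function; the code that stores and indexes the resulting records does not need to change.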

Implementation

The data centre is integrated directly into the Decrypthon computing grid in order to share efficiently all the data needed by biological applications that require substantial computing and storage capacity. The Decrypthon data centre contains a local database of nucleotide, genomic and proteomic sequences. It provides access methods for heterogeneous and distributed data, as well as query processing and data analysis tools.


In this project, BIRD is the core of the centre (figure xx). BIRD uses DB2 as its indexing and very large data storage unit, and provides a Java API and query services for several high-level applications. BIRD is installed alongside the storage node of the computing grid.

WebSphere Federation Server (WFS), referred to here as the federation database, is used to integrate the remote clinical database hosted by the Universal Mutation Database (UMD) software. The federation database communicates with each data source by means of software modules called wrappers. A wrapper holds the characteristics of a specific data source. It provides a DB2 relational model for the remote data and assists the federated engine in query processing by sending sub-queries to the remote data source. The “Registrer” module allows us to publish automatically the metadata and other important information of a remote database. Thanks to this module, the WFS system generates a virtual relational schema that matches the remote database. It will also be possible to search the metadata of a registered data source.

Figures 1 and 2 illustrate the data centre implemented with BIRD, which integrates data from various sources and shares them with the applications running on the computing grid and with the biological community. The centre shares its integrated data with applications, clients and users in several ways: through a Java API, through a Web interface (figure 3), and through the BIRD-QL language sent over HTTP requests.
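The wrapper mechanism described above can be illustrated with a small sketch. It does not use WebSphere or DB2: `InMemoryWrapper` and `FederatedEngine` are hypothetical classes, the rows are made-up placeholders, and a "sub-query" is reduced to a simple equality filter. The point is only to show the division of labour, in which the engine forwards a sub-query to each wrapper and merges the rows that come back.

```python
from typing import Dict, List, Protocol

Row = Dict[str, str]


class Wrapper(Protocol):
    """Interface every data-source wrapper exposes to the engine."""
    def sub_query(self, column: str, value: str) -> List[Row]: ...


class InMemoryWrapper:
    """Hypothetical wrapper; a list of rows stands in for a data source."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def sub_query(self, column: str, value: str) -> List[Row]:
        # The filtering happens at the source; only matches are returned.
        return [dict(r) for r in self._rows if r.get(column) == value]


class FederatedEngine:
    """Forwards a sub-query to each registered wrapper and merges results."""

    def __init__(self) -> None:
        self._wrappers: Dict[str, Wrapper] = {}

    def register(self, name: str, wrapper: Wrapper) -> None:
        self._wrappers[name] = wrapper

    def query(self, column: str, value: str) -> List[Row]:
        results: List[Row] = []
        # Sources are visited in name order so the output is deterministic.
        for name in sorted(self._wrappers):
            for row in self._wrappers[name].sub_query(column, value):
                tagged = dict(row)
                tagged["source"] = name  # record where the row came from
                results.append(tagged)
        return results
```

In this sketch the caller sees a single `query` method and never contacts a source directly, which mirrors the way the remote clinical data stay under the control of their owners while still being reachable through the federation layer.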

supports