JbetGenoret

This is only the abstract ... there are some changes in the article sent September 10th.

The EVI-Genoret Database: collecting, managing and publishing heterogeneous clinical and biological data to create and federate knowledge about retinal diseases

Guillaume Berthommier, Laëtitia Poidevin, Ravikiran Reddy, Hoan Nguyen, Olivier Poch and Raymond Ripp

Laboratoire de BioInformatique et Génomique Intégrative, UMR 7104, Institut de Génétique et de Biologie Moléculaire et Cellulaire, 1, rue Laurent Fries, 67404 Illkirch – Strasbourg

Abstract:

This paper presents the EVI-Genoret Database, developed as part of the European Integrated Project EVI-GENORET “functional genomics of the retina in development, health and disease”. The aim of the Genoret Database is to provide an infrastructure for managing and systematising biological sample handling and processing, data acquisition and analysis, and templates for data storage, data mining and integration. In addition to the relational database and its associated website, a common integrated network system has been developed, allowing the IP members to query the database and to participate in its construction according to the standardized SOPs and protocols.

Introduction

The European FP6 Integrated Project EVI-GENORET [1] aims to construct, through a comprehensive and systematic research strategy, the basis for deciphering, understanding and integrating the biology and function of major genes involved in retinal development, maintenance, physiology and degeneration. The project is multidisciplinary, involving studies of gene expression, proteomics, bioinformatics (including comparative genomics) and population genetics. It provides possibly the most coordinated and focused approach to the functional genomics of the retina in development, health and disease. A specific component of the project, the Genoret Database, seeks to exploit the information generated by the various high-throughput experiments in order to develop practical tools and resources (software tools and databases, integration with other related genomic/proteomic databases) that can be used for diagnosis, disease screening, drug target evaluation and the development of new therapeutic strategies for blinding diseases that threaten over 12 million people in Europe.

Three major axes

The EVI-GENORET project involves a large variety of experts, including clinicians, geneticists, computer scientists and biologists, and it is not a simple task to define common references that allow cross-correlations between the multiple heterogeneous sources of data. In addition, the format, structure and meaning of the collected information have to be defined manually in many cases and, where possible, used to provide standards for future collaborations. With this in mind, the Genoret Database has been designed around three main axes of data organisation and treatment, namely Patient data, Genes and Protocols.

1 Patient data

The Patient data are furnished by each ophthalmologic centre and are not yet standardized across Europe. Some centres have only paper records and are trying to collect this unstructured data. Excel spreadsheets or small Access databases are sometimes available, but they were generally created to be read by humans and cannot easily be explored by programs without standardisation. Data mining could be useful but would be very difficult. Nevertheless, a common “level 1” of information can be defined that permits at least the creation of a “catalog” of available data. This catalog provides a preliminary level of database search that represents a so-called “first contact” between a requesting scientist and the owner of the selected data. Subsequently, a more complete integration of almost all the higher levels of information is performed, with the constraint that the confidentiality of the data must be preserved.
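As an illustration of the “level 1” catalog idea, the sketch below shows one possible shape for a catalog entry and a “first contact” search. It is not the Genoret schema: the field names, the `find_contacts` function and all sample values are assumptions made up for this example. The entry deliberately carries no patient-level information, only what a centre holds and whom to ask.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Level-1 description of a data set held by a centre (hypothetical fields)."""
    centre: str
    disease: str
    patient_count: int
    contact: str


def find_contacts(catalog, disease):
    """Return sorted (centre, contact) pairs for centres holding data on `disease`."""
    wanted = disease.strip().lower()
    return sorted(
        (entry.centre, entry.contact)
        for entry in catalog
        if entry.disease.lower() == wanted and entry.patient_count > 0
    )


# Invented sample entries, for illustration only.
catalog = [
    CatalogEntry("Centre A", "Retinitis pigmentosa", 120, "owner-a@example.org"),
    CatalogEntry("Centre B", "Retinitis pigmentosa", 0, "owner-b@example.org"),
    CatalogEntry("Centre C", "Stargardt disease", 45, "owner-c@example.org"),
]

print(find_contacts(catalog, "retinitis pigmentosa"))
```

The search returns contact points rather than data, which matches the role of the catalog as a first contact between the requesting scientist and the data owner.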

2 Genes

In addition to the information directly available for each gene from generalist databases such as Swissprot or Genbank, various other biological and genomic data sources are parsed and analysed to extract the pertinent information, including transcriptomic data and more specific data mining results. External systems and databases have therefore been developed and are constantly updated. Examples are GenoretGenes, a warehousing system centralising information about all genes which may be involved in the retina, and RetinoBase, a relational database hosting all transcriptomic data available in the consortium, ranging from the raw data to higher-level analysis results. Exchange and mining of the additional information sources are performed using the in-house BIRD (Biological Integration and Retrieval of Data) system. This generalist data is generally well structured and characterised and, as a consequence, it is easier to integrate than the patient-related data. It permits the assembly of a central information resource shared by almost all the components of the project. Links to these data are directly integrated in the Genoret relational database and, in addition, automatic queries are built on the fly whenever necessary.

3 Protocols

An important aspect of the Genoret Database is the possibility to store and exchange experimental standards and protocols (SOPs). The widespread use of common tools and data sets should encourage the establishment and diffusion of de facto standards. We therefore provide numerous conversion tools specifically adapted to each case. Like the protocols that have been defined for data exchange and communication in computer science, all operating procedures, from biochemistry to clinical practice, have to be standardized. The Genoret Database provides a central repository for experimental protocols, and offers user-friendly tools for uploading new protocols and retrieving existing ones.

Technical specifications

The Genoret Database/Website is based on a PostgreSQL relational database, an Apache/PHP HTTP web server and a number of additional Java and Tcl/Tk tools. The whole system, including GenoretGenes and RetinoBase, can be included in a federative database powered by the IBM WebSphere Federation Server (WFS).

Uploading

In addition to the clinical and genomic data, the Genoret Database/Website allows any Genoret user to upload their own data files. Individual users access the database with a unique login and password. The data is first stored as a flat file with its associated information (mainly ownership, access rights, description and purpose) and, if possible, the full data is then integrated in the relational database.
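The upload step can be sketched as follows. This is a minimal illustration in Python with SQLite, not the PostgreSQL/PHP implementation used by the Genoret Website; the table names (`uploaded_file`, `file_reader`), the function names and the sample logins are all hypothetical. The file is kept as a flat file, its metadata is recorded alongside it, any user can see that the file exists, and only the owner or a listed reader can retrieve the content.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_schema(conn):
    """Create two hypothetical tables: file metadata and per-file reader logins."""
    conn.execute(
        "CREATE TABLE uploaded_file ("
        "id INTEGER PRIMARY KEY, owner TEXT, path TEXT, description TEXT, purpose TEXT)"
    )
    conn.execute("CREATE TABLE file_reader (file_id INTEGER, login TEXT)")


def store_upload(conn, storage_dir, owner, filename, content, description, purpose, readers):
    """Save the file unchanged and record its metadata; return the new file id."""
    path = Path(storage_dir) / filename
    path.write_bytes(content)  # name collisions are not handled in this sketch
    cur = conn.execute(
        "INSERT INTO uploaded_file (owner, path, description, purpose) VALUES (?, ?, ?, ?)",
        (owner, str(path), description, purpose),
    )
    file_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO file_reader (file_id, login) VALUES (?, ?)",
        [(file_id, login) for login in readers],
    )
    return file_id


def list_files(conn):
    """Every user may learn that a file exists: id, owner and description only."""
    return conn.execute(
        "SELECT id, owner, description FROM uploaded_file ORDER BY id"
    ).fetchall()


def read_file(conn, file_id, login):
    """Return the file content only to its owner or to a listed reader."""
    row = conn.execute(
        "SELECT owner, path FROM uploaded_file WHERE id = ?", (file_id,)
    ).fetchone()
    if row is None:
        raise KeyError(file_id)
    owner, path = row
    is_reader = conn.execute(
        "SELECT 1 FROM file_reader WHERE file_id = ? AND login = ?", (file_id, login)
    ).fetchone() is not None
    if login != owner and not is_reader:
        raise PermissionError(f"{login} may not read file {file_id}")
    return Path(path).read_bytes()


# Invented example: "alice" uploads a file and grants read access to "bob".
conn = sqlite3.connect(":memory:")
create_schema(conn)
storage = tempfile.mkdtemp()
fid = store_upload(conn, storage, "alice", "chip01.csv", b"probe,value\n",
                   "Raw microarray export", "unpublished raw data", ["bob"])
print(list_files(conn))
```

Separating `list_files` from `read_file` keeps the existence of a data set visible to everyone while its content stays restricted.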

Direct access to external databases

In many cases, the manual upload procedure has been automated in order to handle large data sets or real-time processes. For this, specific exchange protocols using direct access, an HTTP server or PHP technology have been installed in some centres, querying small local SQL databases (such as MySQL or MS Access), Excel files or flat files. These direct access possibilities can also be used by the IBM WebSphere Federation Server (WFS) to federate other laboratory data sources in the centralized virtual relational database, allowing cross queries between the different centres.
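The idea of a cross query over several centres can be illustrated with a small stand-in. The sketch below does not use the WebSphere Federation Server; instead it uses SQLite's `ATTACH DATABASE` so that two separate databases become visible through one connection. The centre names, the `patient` table layout and the rows are invented for the example.

```python
import sqlite3

# One connection plays the role of the central virtual database.
conn = sqlite3.connect(":memory:")
# Each attached database stands in for the local database of one centre.
conn.execute("ATTACH DATABASE ':memory:' AS centre_a")
conn.execute("ATTACH DATABASE ':memory:' AS centre_b")

for centre in ("centre_a", "centre_b"):
    conn.execute(f"CREATE TABLE {centre}.patient (local_id TEXT, gene TEXT)")

# Invented rows, for illustration only.
conn.executemany(
    "INSERT INTO centre_a.patient VALUES (?, ?)",
    [("A-001", "RHO"), ("A-002", "ABCA4")],
)
conn.executemany("INSERT INTO centre_b.patient VALUES (?, ?)", [("B-017", "RHO")])


def patients_by_gene(conn, gene):
    """Cross query: find matching patient identifiers in every attached centre."""
    query = """
        SELECT 'centre_a' AS centre, local_id FROM centre_a.patient WHERE gene = ?
        UNION ALL
        SELECT 'centre_b' AS centre, local_id FROM centre_b.patient WHERE gene = ?
        ORDER BY centre, local_id
    """
    return conn.execute(query, (gene, gene)).fetchall()


print(patients_by_gene(conn, "RHO"))
```

In a real federation the sources would differ in engine and schema, so a mapping layer would be needed in front of a query like this one.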

Data integration tools

A number of tools have been specifically created to perform automatic data integration. For example, ImAnno is a web interface developed to allow the annotation of in situ hybridisation images of the mouse embryo. It can also be exploited for other applications, such as the annotation of eye fundus images or of GenoretGenes entries by human experts. Another example is JavOO (Java Odbc for Office), which allows a direct connection to query Excel or Access files on a PC running MS Windows, providing access to small remote databases. Numerous display tools are also available, facilitating the query and visualisation of the complex and heterogeneous information present in the database.

Relations

Deploying the data in the database makes it possible to create relations between them. Standards have been established for the format and content of the uploaded data files in order to allow their automatic integration. The links to external databases such as GenoretGenes, the functional genomics retinal gene analysis, provide a complete description of almost all important genes, while the link to the transcriptomics database, RetinoBase, furnishes important information concerning expression during different development stages. Patient data are now available from distant centres and can be directly accessed through WFS. In addition, common data such as the Gene Identity Card allow us to link mutations, polymorphisms, pathologies and clinical data, tissue and cellular expression, as well as animal models and literature.

SOPs and protocols

The Quality Assurance manager is currently involved in the development of the SOPs and the collection of the protocols from the entire EVI-GENORET consortium. Several meetings have been held to discuss the development of the SOPs, and a general checklist for writing an SOP and a template for patient exploration have been provided. A large number of SOPs have been uploaded by the users to the Genoret Database and can consequently be disseminated to all the consortium members.

Discussion

The main role of the central Genoret Database/Website is to collect, manage and publish IP members’ clinical and biological data. Subsequently, this should lead to the establishment of common standards for the procedures of the consortium. Due to the varying data sources, formats and restrictions of each community, the content and organisation of the information could not be fixed in advance. Furthermore, in the initial stages of the project, the extent and pertinence of the data exchanges between the different groups were not well defined. The database was therefore designed to provide different levels of integration, starting with loosely correlated information and constructing progressively more extensive relations throughout the duration of the project.

Another important aspect of the database design was the possibility to keep track of the ownership and access rights of the data furnished by the users, particularly for uploaded files containing unpublished raw data, where it is important to know that they exist although their full content should not be accessible. Except for specific cases, this protection cannot be applied to information that is highly integrated in the relational database.

Since the database can be accessed by non-expert users, user-friendly and powerful querying and navigation tools were developed. For example, an automatic query builder that works on any SQL table allows an easy display of its content through a graphical interface. More complex queries can also be built by the user and stored for future use. Documentation is also provided, describing the general organisation of the web site in the form of clickable diagrams that explain the relations and cross links between all components and related data, as well as through transversal views related to common projects. Last but not least, the database can be queried by programs through its own SQL language (for local use) or through the WebSphere Federation Server. This allows the users to perform cross-correlations and offers secure access for external programs.
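To make the notion of an automatic query builder on any SQL table more concrete, here is a minimal sketch in Python with SQLite. It is an assumption about how such a builder could work, not a description of the Genoret tool: the function names, the `gene` table and its rows are hypothetical. The builder reads the column names from the database itself, accepts only known tables and columns, and passes filter values as parameters rather than inserting them into the SQL text.

```python
import sqlite3


def table_columns(conn, table):
    """Return the column names of `table`, refusing names the database does not know."""
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table}")
    # Column 1 of each PRAGMA table_info row is the column name.
    return [row[1] for row in conn.execute('PRAGMA table_info("' + table + '")')]


def build_query(conn, table, filters=None, columns=None):
    """Build a parameterised SELECT on `table`; return (sql, parameters)."""
    available = table_columns(conn, table)
    selected = list(columns) if columns else available
    filters = filters or {}
    for name in selected + list(filters):
        if name not in available:
            raise ValueError(f"unknown column: {name}")
    column_list = ", ".join('"' + name + '"' for name in selected)
    sql = "SELECT " + column_list + ' FROM "' + table + '"'
    if filters:
        sql += " WHERE " + " AND ".join('"' + name + '" = ?' for name in filters)
    return sql, list(filters.values())


# Small illustrative table; the layout is invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (symbol TEXT, chromosome TEXT, disease TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [("RHO", "3", "Retinitis pigmentosa"), ("ABCA4", "1", "Stargardt disease")],
)

sql, params = build_query(conn, "gene", {"chromosome": "3"}, ["symbol"])
print(sql)
print(conn.execute(sql, params).fetchall())
```

A graphical interface would only need to offer the columns returned by `table_columns` as form fields and hand the user's choices to `build_query`.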

References and links

[1] EVI-GENORET FP6 Integrated Project LSHG-CT-2005-512036, http://www.evi-genoret.org