Difference between revisions of "Vep"

Revision as of 14:34, 15 October 2013

Date : 2013/10/14 Author : kchennen

Variant Effect Predictor

  • Installation on studio with Raymond
    • installation in /biolo/vep
 > curl "http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-tools/scripts/variant_effect_predictor.tar.gz?view=tar&root=ensembl&pathrev=branch-ensembl-73" | tar xz
 > cd variant_effect_predictor  
  • Install the API with a local cache in /biolo/vep/cache
 > perl INSTALL.pl -c /biolo/vep/cache
   Hello! This installer is configured to install v73 of the Ensembl API for use by the VEP.
   It will not affect any existing installations of the Ensembl API that you may have.
   
   It will also download and install cache files from Ensembl's FTP server.
   Checking for installed versions of the Ensembl API...done
   It looks like you already have v73 of the API installed.
   You shouldn't need to install the API
   
   Skip to the next step (n) to install cache files
   
   Do you want to continue installing the API (y/n)?y
   Setting up directories
       
   Downloading required files
    - fetching ensembl
    - unpacking ./Bio/tmp/ensembl.tar.gz
    - moving files
    - fetching ensembl-variation
    ** GET http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-variation.tar.gz?root=ensembl&view=tar&only_with_tag=branch-ensembl-73 ==> 301 Moved
    ** GET http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-variation.tar.gz?pathrev=branch-ensembl-73&root=ensembl&view=tar ==> 200 OK (8s)
    - unpacking ./Bio/tmp/ensembl-variation.tar.gz
    - moving files
    - fetching ensembl-functgenomics
    ** GET http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-functgenomics.tar.gz?root=ensembl&view=tar&only_with_tag=branch-ensembl-73 ==> 301 Moved
    ** GET http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-functgenomics.tar.gz?pathrev=branch-ensembl-73&root=ensembl&view=tar ==> 200 OK (5s)
    - unpacking ./Bio/tmp/ensembl-functgenomics.tar.gz
    - moving files
    - fetching BioPerl
    ** GET http://bioperl.org/DIST/BioPerl-1.6.1.tar.gz ==> 200 OK (15s)
    - unpacking ./Bio/tmp/BioPerl-1.6.1.tar.gz
    - moving files
       
    Testing VEP script
    - OK!
  • Install local cache for database connections for homo sapiens
       The VEP can either connect to remote or local databases, or use local cache files. Using local cache files is the fastest and most efficient way to run the VEP
       Cache files will be stored in /biolo/vep/cache
       Do you want to install any cache files (y/n)? y
       Cache directory /biolo/vep/cache does not exists - do you want to create it (y/n)? y
       
       Downloading list of available cache files
       The following species/files are available; which do you want (can specify multiple separated by spaces): 
       1 : ailuropoda_melanoleuca_vep_73.tar.gz
       2 : anas_platyrhynchos_vep_73.tar.gz
       3 : anolis_carolinensis_vep_73.tar.gz
       ...
       25 : homo_sapiens_refseq_vep_73.tar.gz
       26 : homo_sapiens_vep_73.tar.gz
       ...
       
       ? 26
        - downloading ftp://ftp.ensembl.org/pub/release-73/variation/VEP/homo_sapiens_vep_73.tar.gz
       ** GET ftp://ftp.ensembl.org:21/pub/release-73/variation/VEP/homo_sapiens_vep_73.tar.gz ==> 200 OK (305s)
        - unpacking homo_sapiens_vep_73.tar.gz
       
  • Download FASTA files for homo sapiens
      
       The VEP can use FASTA files to retrieve sequence data for HGVS notations and reference sequence checks.
       FASTA files will be stored in /biolo/vep/cache
       Do you want to install any FASTA files (y/n)? y
       FASTA files for the following species are available; which do you want (can specify multiple separated by spaces, "0" to install for species specified for cache download): 
       1 : ailuropoda_melanoleuca
       2 : anas_platyrhynchos
       3 : ancestral_alleles
       ...
       26 : homo_sapiens
       ...
       
       ? 26
       Downloading Homo_sapiens.GRCh37.73.dna.primary_assembly.fa.gz
       ** GET ftp://ftp.ensembl.org:21/pub/release-73/fasta//homo_sapiens/dna/Homo_sapiens.GRCh37.73.dna.primary_assembly.fa.gz ==> 200 OK (99s)
       Extracting data
       The FASTA file should be automatically detected by the VEP when using --cache or --offline. If it is not, use "--fasta /biolo/vep/cache/homo_sapiens/73/Homo_sapiens.GRCh37.73.dna.primary_assembly.fa"        
       Success
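
A quick way to confirm the result of the two installer steps above is to check that the species directory and the FASTA file exist where the log says they were written. The following is only a minimal sketch: the function name check_vep_cache and its arguments are hypothetical, and the only thing it assumes about the layout is the <cache>/<species>/<version>/ path printed in the installer's --fasta hint.

```shell
#!/bin/sh
# Sketch: check that the cache directory and FASTA file from the installer log exist.
# check_vep_cache and its arguments are hypothetical names, not part of the VEP.

check_vep_cache() {
    cache_dir=$1   # e.g. /biolo/vep/cache
    species=$2     # e.g. homo_sapiens
    version=$3     # e.g. 73
    fasta=$4       # e.g. Homo_sapiens.GRCh37.73.dna.primary_assembly.fa

    species_dir="$cache_dir/$species/$version"

    # The cache tarball and the FASTA file both end up under <cache>/<species>/<version>
    if [ ! -d "$species_dir" ]; then
        echo "missing cache directory: $species_dir"
        return 1
    fi
    if [ ! -f "$species_dir/$fasta" ]; then
        echo "missing FASTA file: $species_dir/$fasta"
        return 2
    fi
    echo "ok: $species_dir"
}
```

For the installation described here, the call would be: check_vep_cache /biolo/vep/cache homo_sapiens 73 Homo_sapiens.GRCh37.73.dna.primary_assembly.fa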
 
  • Configure
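    • Add plugins
      • Download the latest archive of VEP plugins from https://github.com/ensembl-variation/VEP_plugins
      • Move all the plugins into the plugin directory /biolo/vep/cache/Plugins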
    • Create the configuration file vep.ini in /biolo/vep/cache
  ##########################
  ## general features flags 
  ##########################
  force_overwrite    1
  verbose            1
  species            homo_sapiens
  fork               4
  
  ###########################
  ## output annotation flags 
  ###########################
  sift                 b # the SIFT prediction and score, with both given as prediction(score)
  polyphen             b # the PolyPhen prediction and score
  regulatory           1 # Look for overlaps with regulatory regions. The script can also report whether a variant falls in a high information position within a transcription factor binding site.
  numbers              1 # Adds affected exon and intron numbering to the output.
  domains              1 # Adds names of overlapping protein domains to output. 
              
  terms                so
  
  
  ################################
  ## output identification flags 
  ################################
  hgvs               1 # Add HGVS nomenclature based on Ensembl stable identifiers to the output.
  symbol             1 # Adds the gene symbol (e.g. HGNC) (where available) to the output.
  ccds               1 # Adds the CCDS transcript identifier (where available) to the output.
  protein            1 # Add the Ensembl protein identifier to the output where appropriate.
  canonical          1 # Adds a flag indicating if the transcript is the canonical transcript for the gene.
  biotype            1 # Adds the biotype of the transcript. Not used by default
  xref_refseq        1 # Output aligned RefSeq mRNA identifier for transcript.
  
  
  
  #############################
  ## Co-located variants flags 
  #############################
  gmaf                 1 # Add the global minor allele frequency (MAF) from 1000 Genomes Phase 1 data for any existing variant to the output.
  #maf_1kg             1 # Add MAF from continental populations (AFR,AMR,ASN,EUR) of 1000 Genomes Phase 1 to the output.
  maf_esp              1 # Include MAF from NHLBI-ESP populations.
  pubmed               1 # Report Pubmed IDs for publications that cite existing variant. 
  check_alleles        1 # When checking for existing variants, only report a co-located variant if none of the alleles supplied are novel.
  check_svs            1 # Checks for the existence of structural variants that overlap your input. 
  ##failed             1 # When checking for co-located variants, by default the script will exclude variants that have been flagged as failed. Set to 1 to include them.
  
  
  #############################
  ##  Filtering and QC options 
  #############################
  #check_ref          1 # Force the script to check the supplied reference allele against the sequence stored in the Ensembl Core database.
  #coding_only        1 # Only return consequences that fall in the coding regions of transcripts.
  no_intergenic       1 # Do not include intergenic consequences in the output.
  #most_severe        1 # Output only the most severe consequence per variation. 
  #summary            1 # Output only a comma-separated list of all observed consequences per variation.
  #per_gene           1 # Output only the most severe consequence per gene.  
  filter_common       1 # Shortcut for the freq_* filters: excludes variants that have a co-located existing variant with global MAF > 0.01 (1%). Can be adjusted with any of the freq_* options.
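
Once vep.ini is in place, it can help to see which options are actually switched on and how the script would be invoked with that file. The sketch below is an assumption-laden illustration, not the invocation used on this installation. The helper names active_options and build_vep_command are hypothetical, and the input and output file names are placeholders. The options --config, --cache, --dir, -i and -o are standard options of variant_effect_predictor.pl.

```shell
#!/bin/sh
# Sketch: inspect vep.ini and print a matching command line.
# active_options and build_vep_command are hypothetical helper names.

# Print "key value" for every option that is not commented out in the config file.
active_options() {
    config_file=$1
    grep -v '^[[:space:]]*#' "$config_file" | awk 'NF { print $1, $2 }'
}

# Print (without running it) a command that uses the local cache together with vep.ini.
build_vep_command() {
    cache_dir=$1   # directory holding the cache and vep.ini, e.g. /biolo/vep/cache
    input=$2       # placeholder input file
    output=$3      # placeholder output file
    printf 'perl variant_effect_predictor.pl --config %s/vep.ini --cache --dir %s -i %s -o %s\n' \
        "$cache_dir" "$cache_dir" "$input" "$output"
}
```

Printing the command instead of executing it keeps the sketch safe to try. For example, build_vep_command /biolo/vep/cache input.vcf output.txt shows the full line to run from the variant_effect_predictor directory.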