Q: What are Structural Variations (SV)?

SV are generally defined as variations in a DNA region that range in length from ~50 base pairs to many megabases. They include several classes such as translocations, inversions, insertions and deletions.

Q: What are Copy Number Variations (CNV)?

CNV are deletions and duplications in the genome (unbalanced SV) that vary in length from ~50 base pairs to many megabases.

Q: What are the differences between SV and CNV?

CNV are unbalanced SV with gain or loss of genomic material. For example, a heterozygous duplication described as a CNV is characterized by its start and end coordinates and by its number of copies, which is 3.

Q: Can AnnotSV annotate every format of SV?

AnnotSV supports both VCF and BED formats as input.
- VCF format supports complex rearrangements with breakends, which can arbitrarily be summarized as a set of novel adjacencies, as described in the Variant Call Format Specification VCFv4.4 (Jan 2023).
- BED format doesn’t allow inter-chromosomal feature definitions (e.g. inter-chromosomal translocation). A new file format (BEDPE) is proposed in order to concisely describe disjoint genome features but it is not supported by AnnotSV.

Q: I would like to annotate my SV with new annotation sources but I don’t know how to do that…

No problem. AnnotSV is under active and continuous development. You can email me with a detailed request and I will answer as quickly as possible.

Q: I have just updated AnnotSV or the annotation sources and the annotation process is longer than usual. Is it normal?

After an update of AnnotSV or of its annotation sources, some files are reprocessed, which takes some additional time. Subsequent runs of AnnotSV will be quicker!

Q: How to cite AnnotSV in my work?

We do appreciate citations very much. So, if you are using AnnotSV, please cite our work using the following references:

AnnotSV and knotAnnotSV: a web server for human structural variations annotations, ranking and analysis.
Geoffroy V, Guignard T, Kress A, Gaillard JB, Solli-Nowlan T, Schalk A, Gatinois V, Dollfus H, Scheidecker S, Muller J.
NAR. 2021 May 22. doi: 10.1093/nar/gkab402

AnnotSV: An integrated tool for Structural Variations annotation.
Geoffroy V, Herenger Y, Kress A, Stoetzel C, Piton A, Dollfus H, Muller J.
Bioinformatics. 2018 Apr 14. doi: 10.1093/bioinformatics/bty304

And if you use the phenotype-driven analysis in your work, please also cite the following articles:
• Next-generation diagnostics and disease-gene discovery with the Exomiser. Smedley D., et al, Nature Protocols (2015) doi:10.1038/nprot.2015.124
• Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Köhler S., et al, Nucleic Acids Research (2019) doi: 10.1093/nar/gky1105

Q: What are the WARNINGs that AnnotSV mentions while running?

AnnotSV writes the progress of the analysis to the standard output, including warnings about issues or missing information. These warnings can be either blocking or simply informative.

Q: Why are some values empty or set to -1 in the output files?

When no information is available for a specific type of annotation, then the value is empty. Regarding the frequencies, the default is set to -1.
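
When post-processing the output, empty cells and the -1 frequency placeholder are probably best treated as "no information" rather than as real values. The following is a minimal Python sketch of that idea. It assumes a tab-separated output, and the column name `example_AF` is hypothetical (it is not an actual AnnotSV column name).

```python
import csv
import io

def parse_frequency(raw):
    """Return a float frequency, or None when no information is available."""
    raw = raw.strip()
    if raw == "":
        return None                      # empty cell: no information
    value = float(raw)
    # -1 is the default placeholder used for frequencies
    return None if value == -1 else value

# Tiny in-memory table standing in for an annotated output file.
# "example_AF" is a hypothetical column name used for illustration only.
example_tsv = "AnnotSV_ID\texample_AF\nsv1\t0.02\nsv2\t-1\nsv3\t\n"
rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
frequencies = {row["AnnotSV_ID"]: parse_frequency(row["example_AF"]) for row in rows}
print(frequencies)
```

Mapping both cases to None avoids accidentally keeping -1 when filtering on "frequency below a threshold".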

Q: Why do some SV have an empty gene annotation in the output file?

If an SV is located in an intergenic region and so doesn’t cover a gene, then the SV is reported in the output file but without gene annotation.

Q: Why can we have several gene annotations for one SV?

In some cases, one SV overlaps a large portion of the genome including several genes. In these cases, the annotation of the SV is split on several lines.

Annotation example for the deletion 1:16892807-17087595:
AnnotSV keeps all gene annotations, with only one transcript annotation for each gene:
SV chrom  SV start  SV end    SV type  Gene name     NM            CDS length  tx length  location
1         16892807  17087595  DEL      CROCCP2       NR_026752     1           12652      txStart-txEnd
1         16892807  17087595  DEL      ESPNP         NR_026567     1           28941      txStart-txEnd
1         16892807  17087595  DEL      FAM231A       NM_001282321  511         511        txStart-txEnd
1         16892807  17087595  DEL      FAM231C       NM_001310138  511         656        txStart-txEnd
1         16892807  17087595  DEL      LOC102724562  NR_135824     1           2998       txStart-txEnd
1         16892807  17087595  DEL      MIR3675       NR_037446     1           75         txStart-txEnd
1         16892807  17087595  DEL      MST1L         NM_001271733  2015        6468       txStart-exon14
1         16892807  17087595  DEL      MST1P2        NR_027504     1           4848       txStart-txEnd
1         16892807  17087595  DEL      NBPF1         NM_017940     2912        47294      intron3-txEnd

Q: I am confused by the difference between the 'full' and the 'split' annotation mode. CNVs have been split into several lines, but each line gets different DB annotations (DGV, 1000g…). I thought that the same region should have the same annotations (excluding gene/transcript)?

AnnotSV builds 2 types of annotations, one based on the full-length SV (corresponding to Annotation_mode = "full") and one based on each gene within the SV (corresponding to Annotation_mode = "split"). Thus you will have access to both the full-length annotation and the gene-by-gene annotation of each SV.
Be careful: the first 3 columns (SV chrom, SV start and SV end) remain the same whether the line is of "full" or "split" type.

Regarding these "split" lines:

Q: What do OMIM and GenCC Inheritance annotations mean?

AD = "Autosomal dominant"
AR = "Autosomal recessive"
XLD = "X-linked dominant"
XLR = "X-linked recessive"
YLD = "Y-linked dominant"
YLR = "Y-linked recessive"
XL = "X-linked"
YL = "Y-linked"
ADm = "Autosomal dominant with maternal imprinting"
ADp = "Autosomal dominant with paternal imprinting"
2G = "Digenic inheritance"
MT = "Mitochondrial"
sD = "Semidominant"
SOM = "Somatic mosaicism"
IPVE = "Incomplete Penetrance and/or Variable Expressivity" (according to the recommendations from the French ACHRO-PUCE network)

Q: Why do I get this error message: “Feature (10:134136286-134136486) beyond the length of 10 size (133797422 bp). Skipping.”

One possibility is that you are using the wrong “-genomeBuild” value. For example, you are using a BED input file with SV coordinates on GRCh37 together with the “-genomeBuild GRCh38” option.
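
A quick way to see which build a coordinate can belong to is to compare it with the chromosome length in each build. The sketch below is only an illustration in Python, not part of AnnotSV. The GRCh38 length of chromosome 10 is the one quoted in the error message, the GRCh37 length is added here for comparison, and the function name is hypothetical.

```python
# Chromosome 10 lengths in bp. The GRCh38 value comes from the error message;
# the GRCh37 value is added here for illustration.
CHR10_LENGTH = {"GRCh37": 135534747, "GRCh38": 133797422}

def builds_compatible_with(end, lengths=CHR10_LENGTH):
    """Return the genome builds whose chromosome is long enough to contain `end`."""
    return [build for build, size in lengths.items() if end <= size]

# End coordinate of the feature reported in the error message (10:134136286-134136486)
print(builds_compatible_with(134136486))
```

Here the feature end lies beyond the GRCh38 chromosome end but within the GRCh37 one, which suggests the coordinates are on GRCh37.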

Q: Is AnnotSV available for other organisms?

The main objective of AnnotSV is to annotate SV information from human data. By default, all the annotations are based on human-specific databases. Nevertheless, some additional annotation files can be added for mouse. If you are interested, please see the specific mouse README file.

Q: Is there an option to just generate SV “split” by gene?

You can choose to keep only the split annotation lines with the "-annotationMode" option (i.e. "-annotationMode split").

Q: I am unable to run the code on the input files provided. It crashes on the Repeat annotation step due to a bad_alloc error. Do you have any ideas on why this is happening?

AnnotSV needs to be run with sufficient RAM (depending on the annotations used). Allocating 10 GB of memory should solve the problem.

Q: I'm getting the error: “ANNOTSV environment variable not specified. Please define it before running AnnotSV. Exit”. How can I fix this problem?

ANNOTSV is the environment variable defining the installation path of the software.

• In csh, you can define it with the following command line:
setenv ANNOTSV /path_of_AnnotSV_installation/bin
• In bash, you can define it with the following command line:
export ANNOTSV=/path_of_AnnotSV_installation/bin

I advise you to save the appropriate command in your .cshrc or .bashrc file.

Q: My annotated SV is intersecting both a benign SV and a pathogenic SV. How can I explain that?

Several possible explanations can be considered:
• The pathogenicity can concern a recessive disease. So the pathogenic SV can be present in the heterozygous state in the healthy population (with a low frequency in DGV)
• The pathogenic region of the dbVar SV is not overlapping the DGV SV

Q: I’m getting the error: “-- max size for a Tcl value (2147483647 bytes) exceeded”. How can I fix this problem?

You are probably using AnnotSV to annotate a very large SV input file (from a large cohort). Thus you are facing a memory issue caused either by the current machine specification or by the programming language used for AnnotSV (Tcl). To solve this, you can split your input file into smaller files, run AnnotSV on each of them and then merge the results into a single output file. This will be fixed in a future release.
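
The splitting step can be sketched as follows for a VCF input. This is a minimal Python illustration and not a tool shipped with AnnotSV. The function name and the chunk size are hypothetical, and it simply repeats every header line at the top of each chunk so that each chunk remains a valid input on its own.

```python
def split_vcf_lines(lines, max_records):
    """Split VCF lines into chunks of at most `max_records` records,
    repeating all header lines (starting with '#') at the top of each chunk."""
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line.strip() and not line.startswith("#")]
    return [header + records[i:i + max_records]
            for i in range(0, len(records), max_records)]

# Small in-memory example standing in for a large VCF file.
example_vcf = [
    "##fileformat=VCFv4.3",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t3085911\t.\tN\t<DEL>\t.\tPASS\tEND=3095542",
    "5\t25635\t.\tN\t<DUP>\t.\tPASS\tEND=26358",
    "7\t5085911\t.\tN\t<DEL>\t.\tPASS\tEND=5095542",
]
chunks = split_vcf_lines(example_vcf, max_records=2)
print([len(chunk) for chunk in chunks])
```

Each chunk can then be written to its own file and annotated separately.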

Q: For a VCF with only “BND” events, which refers to breakpoints, how are these being shown in the AnnotSV output when SVminSize is set to 50bp? Since a breakpoint start and stop positions only differ by 1bp, I am wondering why these are not filtered out by AnnotSV.

AnnotSV is designed to annotate SV, not SNV/indel, from a VCF; excluding SNV/indel is the aim of the "SVminSize" option.
Actually, SV can be described in three different ways in a VCF file:
• Type1: explicit sequences: ref="G" and alt="ACTGCTAACGATCCGTTTGCTGCTAACGATCTAACGATCGGGATTGCTAATCTCGGG"
• Type2: alt="<INS>", "<DEL>", "<BND>"...
• Type3: complex rearrangements with breakends: alt="G]17:1584563]"

The “SVminSize” parameter is only used to exclude SNV/indel (small deletion, insertion or duplication) from a VCF input file.

Q: How is the “SV length” annotation calculated?

Note that in the VCF specification:
• For imprecise structural variants (i.e. symbolic alleles, with an angle-bracketed notation, e.g. <DUP>):
END = POS + length of REF allele
• For precise structural variants:
END = POS + length of REF allele - 1

AnnotSV determines the SV length as follows:
• AnnotSV reports the “SVLEN” value if given in a VCF input file.
• Otherwise, when the SV is described with explicit sequences in a VCF input file (e.g. ref="G" and alt="ACTGCTAACGATCCGTTTGCTGCTAACGATCTAACGATCGGGATTGCTAATCTCGGG"), AnnotSV calculates the SV length as "alt length" - "ref length".
• Else, AnnotSV calculates the SV length only for deletions, duplications and inversions (as "SVend - SVstart", with a negative value for deletions). Indeed, this calculation cannot be done for insertions, breakends...
• The SV length is set to 0 for translocations.
• Else, the SV length is blank.
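
The cascade of rules above can be sketched in Python as follows. This is only an illustration of the described order, not the AnnotSV source code. The function name, its parameters and the "TRA" label used for translocations are assumptions made for the example, and a blank value is represented by None.

```python
def sv_length(sv_type, start=None, end=None, svlen=None, ref=None, alt=None):
    """Sketch of the SV length rules listed above; returns an int, or None for blank."""
    if svlen is not None:
        return svlen                              # SVLEN given in the VCF: reported as is
    if ref and alt and set((ref + alt).upper()) <= set("ACGTN"):
        return len(alt) - len(ref)                # explicit sequences: alt length - ref length
    if sv_type in ("DEL", "DUP", "INV") and start is not None and end is not None:
        length = end - start                      # SVend - SVstart
        return -length if sv_type == "DEL" else length
    if sv_type == "TRA":                          # "TRA" is a hypothetical label for translocations
        return 0
    return None                                   # insertion, breakend...: blank

print(sv_length("DEL", start=16892807, end=17087595))   # deletion from the example 1:16892807-17087595
print(sv_length("INS", ref="G", alt="GACGT"))            # explicit sequences
print(sv_length("INS", start=100, end=100, ref="N", alt="<INS>"))
```

The first call returns a negative value because the SV is a deletion, and the last one returns None because the length of a symbolic insertion cannot be derived from its coordinates.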

Q: Why do I get negative values in the SV_length column?

Note that deletions have negative values. Other SV types have positive values.

Q: What does the candidateGenesFile parameter refer to?

The candidateGenesFile contains the candidate genes of the user. This information is used to filter out the SV annotations that do not overlap a candidate gene (-candidateGenesFiltering 1).

Q: My input bed file contains ~10000 SV, but only ~2000 SV are annotated. Why?

AnnotSV does not annotate:
• The SNV/indel (size<50bp)
• The SV in an invalid format
• The SV for which the “END” is not defined.
AnnotSV creates a report of unannotated variants (“.unannotated.tsv” file).
If you want to annotate SNV/indel, please set the -SVminSize to 1.
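
To get a rough idea of how many lines of a BED file fall into each category before running AnnotSV, the input can be pre-screened. The sketch below is a hypothetical Python helper, not AnnotSV's own logic. It assumes that the size of a BED feature is "end - start" and that a line with an end lower than its start counts as an invalid format.

```python
from collections import Counter

def triage_bed_line(line, sv_min_size=50):
    """Classify one BED line as 'annotated', 'too_small' or 'bad_format'."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return "bad_format"            # chromosome, start or end is missing
    try:
        start, end = int(fields[1]), int(fields[2])
    except ValueError:
        return "bad_format"            # empty or non-numeric coordinates
    if end < start:
        return "bad_format"            # assumption: reversed coordinates are invalid
    if end - start < sv_min_size:
        return "too_small"             # assumption: size is measured as end - start
    return "annotated"

example_bed = [
    "1\t16892807\t17087595",   # large deletion-sized interval
    "2\t1000\t1020",           # 20 bp: below the 50 bp threshold
    "3\t5000",                 # END not defined
    "4\tabc\t6000",            # non-numeric start
]
print(Counter(triage_bed_line(line) for line in example_bed))
```

Comparing these counts with the content of the “.unannotated.tsv” report can help to understand why some SV were skipped.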

Q: How are overlaps (%) calculated?

AnnotSV provides 3 different types of annotations:

- An annotation with features overlapping the SV (DGV, 1000 genomes…):

- An annotation with features overlapped with the SV (pathogenic SV from dbVar, promoters, enhancers…):

- A gene-based annotation:
Each gene overlapped by the SV to annotate is reported (even with a 1 bp overlap).

Q: Why not use a reciprocal overlap for features overlapped by the SV to annotate?

Let’s take the example of pathogenic SV as features.

=> AnnotSV would lose some information if using a reciprocal overlap.
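
The effect can be illustrated with a small worked example in Python. The coordinates and the 70% threshold are hypothetical and chosen only for the illustration, and intervals are treated as half-open for simplicity. A reciprocal overlap requires the overlap to reach the threshold on both the SV and the feature.

```python
def overlap_fractions(sv, feature):
    """Return (fraction of the SV covered, fraction of the feature covered).
    Intervals are (start, end) pairs, treated as half-open for simplicity."""
    overlap = max(0, min(sv[1], feature[1]) - max(sv[0], feature[0]))
    return overlap / (sv[1] - sv[0]), overlap / (feature[1] - feature[0])

def reciprocal_overlap_ok(sv, feature, threshold):
    """True only when BOTH fractions reach the threshold."""
    return min(overlap_fractions(sv, feature)) >= threshold

sv_to_annotate = (1_000_000, 2_000_000)   # hypothetical 1 Mb SV
pathogenic_sv = (1_400_000, 1_410_000)    # hypothetical 10 kb pathogenic SV inside it

sv_fraction, feature_fraction = overlap_fractions(sv_to_annotate, pathogenic_sv)
print(sv_fraction, feature_fraction)      # 0.01 1.0
print(reciprocal_overlap_ok(sv_to_annotate, pathogenic_sv, 0.7))   # False
```

The pathogenic SV is entirely contained in the SV to annotate, yet it covers only 1% of it, so a reciprocal overlap criterion would discard this relevant annotation.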

Q: Concerning custom annotations, for example with benign genomic regions, where exactly should I put my bed file of interest between the “FtIncludedInSV”, “SVincludedInFt” and “AnyOverlap” directories?

This question refers to the “Custom annotations: External BED annotation files (optional)” section of the README.
The BED file should be placed in the “SVincludedInFt” directory. Indeed, the "benign SV" annotation might only be fully useful if overlapping 100% of your "SV to annotate". Otherwise, if the "SV to annotate" includes additional genomic material (compared to the benign SV), you cannot definitely conclude on its benign/pathogenic status.

Q: I have a custom BED file with genomic regions and their AF identified from healthy population. How to use it for filtering my SV?

Your BED file should be placed in the “SVincludedInFt” directory (please see the above question).

There are 2 different methods to use your BED file:

  • Either you can directly annotate your SV with your BED file to get the AF values associated with the overlapped SV. Then you can filter out your SV that are completely overlapped by a common SV (filtering based on the reported AF and a given threshold).
    => This way, you will have to filter your data for each analysis (repetitive). But you keep the possibility to change the threshold (AF > 0.05 (i.e. 5%), AF > 0.01 (i.e. 1%)…) at any time.

  • Or you can first filter your BED file to only keep common SV (which can be considered as benign; population AF > 0.01, i.e. 1%) and then just remove the SV that get annotated with AF values.
    => This way, you will only need to filter the BED file once. But you will not be able to modify the threshold.

Each method has its own pros and cons. If your threshold is fixed, the second method is more appropriate. Otherwise, use the first one.
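
The pre-filtering step of the second method could look like the following Python sketch. It is only an illustration under stated assumptions: the function name is hypothetical, the position of the AF column depends on your own BED file (here the 4th column, index 3), and header lines starting with "#" are kept unchanged.

```python
def keep_common_sv(bed_lines, af_column, min_af=0.01):
    """Keep header lines and the BED lines whose AF (0-based `af_column`) exceeds `min_af`."""
    kept = []
    for line in bed_lines:
        if line.startswith("#"):
            kept.append(line)                      # header line: kept as is
            continue
        fields = line.rstrip("\n").split("\t")
        try:
            af = float(fields[af_column])
        except (IndexError, ValueError):
            continue                               # no usable AF value: line dropped
        if af > min_af:
            kept.append(line)
    return kept

# Hypothetical custom BED content: chrom, start, end, AF
example_bed = [
    "#Chrom\tStart\tEnd\tAF",
    "1\t10000\t20000\t0.25",
    "1\t50000\t90000\t0.002",
    "2\t30000\t45000\t0.03",
]
common_only = keep_common_sv(example_bed, af_column=3, min_af=0.01)
print(len(common_only) - 1, "common SV kept")
```

With the first method, the same threshold test would instead be applied to the AF values reported in the AnnotSV output.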

Q: What are the minimal info/headers needed in a VCF input file to run AnnotSV?

AnnotSV uses the VCF format following the official VCF v4.3 specification. Nevertheless, some flexibility is allowed:
- No meta-information line (prefixed with “##”) is required
But the following is mandatory:
- A header line (prefixed with “#CHROM”)
- The following keys are required: GT (in the FORMAT column), SVLEN and END (in the INFO column).

The interpretation of the square-bracketed notations relies on the homogenization rules from the variant-extractor tool developed by Rodrigo Martín.

In order to be able to classify the SV, the SV type is extracted from the "ALT" column.
The SV type should be:
- An angle-bracketed notation among <DEL>, <INS>, <DUP>, <INV>, <BND>, <LINE1>, <SVA>, <ALU>, <CN0>, <CN2>, <CN3> ...
- A square-bracketed notation

The (deprecated) SVTYPE value from the “INFO” column can also be used by AnnotSV if the SV type is not available in the “ALT” column.

In order to use the “snvIndelPASS” option (which uses the variants only if they passed all filters during the calling), the FILTER column value is mandatory.
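
The extraction of the SV type described above can be sketched in Python as follows. This is a simplified illustration, not the AnnotSV implementation. The function name is hypothetical, and labelling every square-bracketed ALT as "BND" is an assumption made for the example.

```python
import re

def sv_type_from_record(alt, info=""):
    """Return the SV type from ALT, falling back to INFO/SVTYPE, or None."""
    symbolic = re.fullmatch(r"<([^<>]+)>", alt)
    if symbolic:
        return symbolic.group(1)              # angle-bracketed notation, e.g. "<DEL>" -> "DEL"
    if "[" in alt or "]" in alt:
        return "BND"                          # assumption: square-bracketed notation labelled BND
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "SVTYPE" and value:
            return value                      # deprecated SVTYPE used as a fallback
    return None

print(sv_type_from_record("<DEL>"))
print(sv_type_from_record("G]17:1584563]"))
print(sv_type_from_record("N", "SVTYPE=DUP;END=26358"))
```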

Q: I’m getting the error: “ERROR: chromosome sort ordering for file … is inconsistent with other files”. How can I fix this problem?

The locale specified by your environment can affect the traditional “sort” order that uses native byte values. Please, set LC_ALL=C.
• In csh, you can define it with the following command line:
setenv LC_ALL C
• In bash, you can define it with the following command line:
export LC_ALL=C

Q: I’m getting the error: « unexpected token "END" at position 0; expecting VALUE » while running Exomiser. How can I fix this problem?

You are facing a memory issue. Please, try increasing RAM/MEM on your compute node.

Q: I have some concerns about data sharing. Does AnnotSV connect somehow with the web version of Exomiser?

AnnotSV doesn't connect with the web version of Exomiser.
All necessary Exomiser data (to score a gene by using the HPO terms) is installed locally in the $ANNOTSV/share/annotSV/Annotations_Exomiser/ directory.
The code for the Exomiser module was extracted directly from Exomiser (thanks to developer Jules Jacobsen). A minimal Java 8 installation is required. Moreover, the Exomiser module writes to the /tmp/spring.log file, which must therefore be writable.
Given the input (a gene name and HPO terms), this module returns a score.

Q: What is knotAnnotSV?

knotAnnotSV is a freely accessible web interface that allows you to explore your annotated SV dataset in a user-friendly way. This interface is described in detail in the README.knotAnnotSV_latest.pdf file available on GitHub.

Q: I have annotated an SV VCF file and some “Samples_ID” are empty. What is happening?

Some VCF files might contain multiple samples. Thus, each SV might have been called for only some or all of the samples (indicated with the GT feature). Since AnnotSV annotates all the SV from the input file, the reported “Samples_ID” output column specifically lists the samples for which the SV was called.

As an example, from this SV VCF input file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2 sample3
1 3085911 . N <DEL> 4752.09 PASS SVTYPE=DEL;END=3095542 GT 0/0 0/1 1/1
5 25635 . N <DUP> 4256.36 PASS SVTYPE=DUP;END=26358 GT ./. ./. 0/0
7 5085911 . N <DEL> 3752.09 PASS SVTYPE=DEL;END=5095542 GT 0/1 0/1 0/0
AnnotSV will report the following “Samples_ID” column:
AnnotSV_ID Samples_ID
1_3085911_3095542_DEL_1 sample2,sample3
7_5085911_5095542_DEL_1 sample1,sample2
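
The way the sample list follows from the GT values can be sketched in Python. This is an illustration of the principle, not the AnnotSV code. The function name is hypothetical, and the sketch assumes that a sample carries the SV when its genotype contains at least one allele other than "0" (reference) or "." (missing).

```python
def called_samples(header_line, record_line):
    """Return the names of the samples whose GT carries the variant."""
    sample_names = header_line.rstrip("\n").split("\t")[9:]
    fields = record_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")        # position of GT in the FORMAT column
    called = []
    for name, sample_field in zip(sample_names, fields[9:]):
        genotype = sample_field.split(":")[gt_index]
        alleles = genotype.replace("|", "/").split("/")
        if any(allele not in ("0", ".") for allele in alleles):
            called.append(name)
    return called

header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\tsample3"
records = [
    "1\t3085911\t.\tN\t<DEL>\t4752.09\tPASS\tSVTYPE=DEL;END=3095542\tGT\t0/0\t0/1\t1/1",
    "5\t25635\t.\tN\t<DUP>\t4256.36\tPASS\tSVTYPE=DUP;END=26358\tGT\t./.\t./.\t0/0",
    "7\t5085911\t.\tN\t<DEL>\t3752.09\tPASS\tSVTYPE=DEL;END=5095542\tGT\t0/1\t0/1\t0/0",
]
for record in records:
    print(",".join(called_samples(header, record)))
```

For the second record no sample carries the duplication, so the resulting list is empty.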

Q: Should I submit all possible HPO terms related to a patient to figure out which rare disease this patient has?

For each overlapped gene, Exomiser assigns a similarity score based on "all" the submitted HPO terms.
It is important to keep in mind that:
• The analysis of genomic data in rare disorders mostly considers the presence of single gene variants in coding regions that follow a concrete monogenic mode of inheritance. In this case, the use of all HPO terms makes sense.
• A digenic inheritance, with variants in two functionally-related genes in the same individual, is a plausible alternative that might explain the genetic basis of the disease in some cases. In this case, the use of all HPO terms will skew the Exomiser score.

Q: Regarding square-bracketed ALT notation, how does AnnotSV handle missing breakends in VCF input files?

The interpretation of the square-bracketed notations relies on the homogenization rules from the variant-extractor tool developed by Rodrigo Martín.

Duplication, inversion, deletion and insertion:
As breakends are always reciprocal, AnnotSV returns just one full annotation per SV (one full annotation per breakend pair). For this reason, considering paired breakends, the ALT feature with the lowest position is returned. The other one is reported in the unannotated output file.

Translocation:
AnnotSV returns one full annotation for each breakend of the pair.

Regarding your question, note that from one breakend you can always infer the other. Indeed, the only thing that could differ in the mate breakend is its CIPOS INFO field (but it should be provided in the CIEND field of the other breakend). Regarding GT, both breakends must have the same GT because they represent the same event.
The CIEND/CIPOS relationship means that you can use the CIEND info from the breakend you already have to set the CIPOS field of the new "inferred" breakend, as in the following pair:
1 67452229 166_1 N [1:67452635[N 353.30 . SVTYPE=BND;CIPOS=-2,0;CIEND=-10,9;CIPOS95=-1,0;CIEND95=-2,1;MATEID=166_2
1 67452635 166_2 N [1:67452229[N 353.30 . SVTYPE=BND;CIPOS=-10,9;CIEND=-2,0;CIPOS95=-2,1;CIEND95=-1,0;MATEID=166_1

So, to conclude, AnnotSV rescues the missing breakend when possible.
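
The inference of a missing mate can be sketched in Python. This is an illustrative reconstruction based on the breakend notation of the VCF specification, not the code of AnnotSV or of variant-extractor. The function name is hypothetical, and since the reference base at the mate position is unknown the sketch writes it as "N".

```python
import re

BND_ALT = re.compile(
    r"^(?P<before>[A-Za-z.]*)(?P<bracket>[\[\]])"
    r"(?P<chrom>[^:\[\]]+):(?P<pos>\d+)[\[\]](?P<after>[A-Za-z.]*)$"
)

def infer_mate(chrom, pos, rec_id, alt, info):
    """Build the missing mate of a breakend record.
    `info` is a dict of INFO key/values; returns (chrom, pos, id, ref, alt, info)."""
    m = BND_ALT.match(alt)
    if m is None:
        raise ValueError(f"not a square-bracketed ALT: {alt!r}")
    bracket, seq_before = m["bracket"], bool(m["before"])
    # t[p[ and ]p]t are mates of each other; t]p] and [p[t keep their own form.
    if (seq_before and bracket == "[") or (not seq_before and bracket == "]"):
        bracket = "]" if bracket == "[" else "["
        seq_before = not seq_before
    here = f"{bracket}{chrom}:{pos}{bracket}"
    mate_alt = f"N{here}" if seq_before else f"{here}N"   # unknown mate base written as "N"
    # The confidence intervals of the two ends are exchanged.
    swap = {"CIPOS": "CIEND", "CIEND": "CIPOS", "CIPOS95": "CIEND95", "CIEND95": "CIPOS95"}
    mate_info = {swap.get(key, key): value for key, value in info.items()}
    mate_info["MATEID"] = rec_id
    return m["chrom"], int(m["pos"]), info.get("MATEID"), "N", mate_alt, mate_info

info_166_1 = {"SVTYPE": "BND", "CIPOS": "-2,0", "CIEND": "-10,9",
              "CIPOS95": "-1,0", "CIEND95": "-2,1", "MATEID": "166_2"}
print(infer_mate("1", 67452229, "166_1", "[1:67452635[N", info_166_1))
```

Applied to the first record of the pair shown above, the sketch reproduces the position, ID, ALT and confidence intervals of the second record.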