SEQMINER can integrate data from various sources (e.g. annotation). Annotation information is necessary for determining analysis units (e.g., single variant tests or gene-based tests) and variant priority in statistical analysis. SEQMINER implements an efficient and powerful variant annotator for sequence data in generic TSV files, supporting both gene-based and region-based annotation.
Our goal is to provide abundant information for genetic variants efficiently. Some annotation software output annotations for only one transcript, while SEQMINER will provide annotations to all transcripts. SEQMINER supports various file format: VCF file, plain text file, plink association output file and EPACTS file.
We also provide a similar standalone command line tool, TabAnno (short for "Tab-delimited file Annotation"). This tool are best described (here).
You first need to install and load seqminer. Under R computation environment, use:
# Installation:
install.packages("seqminer")
# Load SEQMINER library:
library("seqminer")
You have an input file in VCF format, and your goal is to annotate it using refFlat gene definition. Then you need the following command:
reference <- system.file("tabanno/test.fa", package = "seqminer")
geneFile <- system.file("tabanno/test.gene.txt", package = "seqminer")
param <- list(reference = reference,
geneFile = geneFile)
param <- makeAnnotationParameter(param)
inVcf <- system.file("tabanno/input.test.vcf", package = "seqminer")
outVcf <- paste0(getwd(), "/", "output.vcf")
annotateVcf (inVcf, outVcf, param)
Annotated VCF will be named output.vcf under current directory. Annotation results have two additional tags: ANNO and ANNOFULL. The first tag ANNO showed the most important annotation types (determined by priority file); the second tag ANNOFULL showed detailed annotations. Let's see one example annotation:
ANNO=Nonsynonymous:GENE1|GENE3;ANNOFULL=GENE1/CODING_GENE:+:Nonsynonymous(CCT/Pro/P->CAT/His/H:Base3/30:Codon1/10:Exon1/5):Normal_Splice_Site:Exon|GENE3/CODING_GENE:-:Nonsynonymous(AGG/Arg/R->ATG/Met/M:Base30/30:Codon10/10:Exon5/5):Normal_Splice_Site:Exon|GENE2/NON_CODING_GENE:+:Upstream
ANNO tag displayed the most important variant type is Nonsynonymous and that happened at GENE1 and GENE3; ANNOFULL tag is the full set of annotation and it consists of two parts, for GENE1 and GENE3 respectively. The first part is for GENE1:
GENE1/CODING_GENE:+:Nonsynonymous(CCT/Pro/P->CAT/His/H:Base3/30:Codon1/10:Exon1/5):Normal_Splice_Site:Exon
The format can be explained by sections, and we use annotation for GENE1 as an example:
The workflow of SEQMINER data integration has two steps. In the first step, you create annotation parameter (refer to the Parameters section), then you apply the parameter, input file name and output file name in a annotation related function calls (e.g. annotateVCF).
Parameter is a R list where you can customize data integration steps. In the simplest case, you can use makeAnnotationParameter() to create a default parameter:
param <- makeAnnotationParameter()
You can optionally change some parameters, for example, change reference genome:
reference <- "reference.fa"
param <- makeAnnotationParameter(reference = reference)
Most parameter names are self explanatory, and here we list all of them.
reference: Specify a FASTA format reference genome file.
Specifying reference option will enable SEQMINER to give more nucleotide level information, for example, instead of just annotating a variant as exon, it will tell you which codon, which exon that variant locates, whether its synonymous/non-synonymous and etc.
SEQMINER requires a FASTA index file and that will save running memory and speed up annotation. You can use "samtools faidx FileName.fa" to generate Fasta index file.
For example, you can specify a FASTA file of the whole genome and use param <- list(reference = "/path/to/human.g1k.v37.fa
Currently, SEQMINER supports gene file in refFlat, knownGene and refGene format. A prepared list of all gene obtained of UCSC website is: refFlat_hg19.txt.gz . To use that file, you need to download it to local hard drive and use the option geneFile.
As SEQMINER supports refFlat file by default, you can skip specifying the gene file format to be refFlat (no need to specify geneFileFormat). However, to use knownGene or refGene format, you need to specify both geneFile and
geneFileFormat to tell SEQMINER which gene file path and its format. For example,
param <- list(geneFile = "/net/fantasia/home/zhanxw/anno/knownGene.txt.gz", geneFileFormat = "knownGene"
.
Codon file can tell the relationship between triplet codons and amino
acids. A default file (human codon) is provided with the package in:
system.file("tabanno/codon.txt", package = "seqminer")
.
If you have special codon file, you can specify that using flag codonFile, otherwise, SEQMINER will use the
default codon file:
default codon file
# DNA codon table. Stop codon is coded as: Stp, O, Stop
#Codon AminoAcid Letter FullName
AAA Lys K Lysine
AAC Asn N Asparagine
AAG Lys K Lysine
AAT Asn N Asparagine
ACA Thr T Threonine
...
upstreamRange: how far apart from 5'-end of the gene are counted as upstream
downstreamRange: how far apart from 3'-end of the gene are counted as downstream
The upstreamRange and downstreamRange parameters define the range of upstream and downstream.
spliceIntoExon: how far apart from 5'-end of the gene are counted as upstream
spliceIntoIntron: how far apart from 3'-end of the gene are counted as upstream
The spliceIntoExon and spliceIntoIntron defines the splice region. By default, 3 bases into the exon and 8 bases into the introns are defined as splicing sites. If mutations happen in these regions, we will annotate it as "Normal_Splice_Site" unless the mutations happens in the traditionally important "GU...AG" region in the intron, in that case, we will "Essential_Splice_Site" as annotation.
bed : Specify the bed file and tag (e.g. ONTARGET1=a1.bed,ONTARGET2=a2.bed)
BED file is commonly used to represent range. Here you will check if certain genome position is contained in one or more ranges specified in the BED format. An example BED file, example.bed, is as follows:
1 10 20 a
1 15 40 b
There are two ranges: range a is from chromsome 1 position 10 (inclusive) to chromosome 1 position 20 (exclusive); range b is from chromsome 1 position 15 (inclusive) to chromosome 1 position 40 (exclusive).
When use param <- list(bed="ONTARGET=example.bed)
as parameter, the output may look
like this: ONTARGET=a,b in the VCF INFO field. That means the variant overlap both region a(chromosome 1, base position 10 to 20) and b (chromosome 1, base position 15 to 40).
genomeScore : Specify the folder of genome score (e.g. GERP=dirGerp/,SIFT=dirSift/)
Genome score is a special binary file format (designed by Hyun Ming Kang) storing
per-position genome score, such as GERP score, SIFT score. You will need
to pre-process those scores in a directory. To annotate genome score,
use param <- list(genomeScore="TAG_NAME=DIRECTORY")
. In the VCF/BCF output file, you will
have TAG_NAME=some_value in the INFO field. In output files with other formats, you will see a separate column.
tabix : Specify the tabix file and tag (e.g. abc.txt.gz(chrom=1,pos=7,ref=3,alt=4,SIFT=7,PolyPhen=10)
Tabix file can be used in annotation. Here we required you provide bgzipped, indexed file. In above example, we use abc.txt.gz as input, and column 1,7,3,4 as chromosome, position, reference allele, alternative allele. We also use column 7 as SIFT score and column 10 as PolyPhen score. Similar syntax as previous, you will get outputs similar to 'SIFT=0.110;PolyPhen=0.00' in the VCF INFO field. NOTE, in bash, please use quote around parenthesis, otherwise parenthesis is not correctly treated. A working example is as follows:
param <- list(tabix = "hg19_ljb_all.txt.gz(chrom=1,pos=2,ref=4,alt=5,mySift=6,mySC=7,myPP2=8,myPC=9)")
In the above example, hg19_ljb_all.txt.gz is dbNSFP resource file which includes multiple scores and you can refer to Data Set to download it. The output file will have mySift, mySC, myPP2 and myPC tags/columns corresponding to the 6, 7, 8 and 9 columns of file hg19_ljb_all.txt.gz.
indexOutput : Specify whether or not index the output
Enable this option (param <- list(indexOutput=TRUE)
) will let SEQMINER index the output file. You will need to specify .gz as output file name suffix.
The output file will then be compressed by BGZIP algorithm, and the index file will be calculated using Tabix.
indexOutput : Specify whether or not index the output
Enable this option (param <- list(checkReference=TRUE)
) will examine the reference alleles against the reference genome
After you obtain annotation parameters, you can use one the following annotation functions:
annotateVCF(inputFile, outputFile, param)
Annotate input VCF file specified by inputFile, and output file is specified by outputFIle using parameter specified by param.
annotateVCF(inputPlain, outputFile, param)
Annotate tab-delimited file specified by inputFile, and output file is specified by outputFIle using parameter specified by param. The first columns of the inputFile are used as chromosome name, position, reference allele and alternative allele. If the beginning lines have the first character '#', then these are treated as header lines and will be not annotated.
SEQMINER can annotate a VCF file and also output statistics of 4 frequency table: annotation type; base change; codon change; indel size. We will provides details for the these output files.
In the tutorial part, we demonstrated commands for gene based annotations:
# 1. Provide necessary annotation resource
reference <- system.file("tabanno/test.fa", package = "seqminer")
geneFile <- system.file("tabanno/test.gene.txt", package = "seqminer")
# 2. Construct annotation parameters
param <- list(reference = reference,
geneFile = geneFile)
param <- makeAnnotationParameter(param)
# 3. Provide input and output file names
inVcf <- system.file("tabanno/input.test.vcf", package = "seqminer")
outVcf <- paste0(getwd(), "/", "output.vcf")
# 4. Perform data integration using annotateVcf()
annotateVcf (inVcf, outVcf, param)
As outVcf specifies the main output file is output.vcf, and there are several other frequency report files: out.vcf.anno.frq, out.vcf.top.anno.frq, out.vcf.base.frq, out.vcf.indel.frq and out.vcf.codon.frq.
The annotated VCF file is output.vcf
#VCF_test
#from http://www.sph.umich.edu/csg/liyanmin/vcfCodingSnps/Tutorial.shtml
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878
1 3 . A G 50 depth=20 ANNO=GENE1/CODING_GENE:+:Exon:Utr5:Normal_Splice_Site|GENE3/CODING_GENE:-:Exon:Utr3:Normal_Splice_Site|GENE2/NON_CODING_GENE:+:Upstream GT:GQ:GD 1/0:31:12 0/0:28:14
1 5 . C A 50 depth=20 ANNO=GENE1/CODING_GENE:+:Exon:Nonsynonymous(CCT/Pro/P->CAT/His/H:Base3/30:Codon1/10:Exon1/5):Normal_Splice_Site|GENE3/CODING_GENE:-:Exon:Nonsynonymous(AGG/Arg/R->ATG/Met/M:Base30/30:Codon10/10:Exon5/5):Normal_Splice_Site|GENE2/NON_CODING_GENE:+:Upstream GT:GQ:GD 1/0:31:12 0/0:28:14
...
The annotation results are stored in the INFO column after ANN= tag. The annotation format is defined as following:
"|" separates different transcripts, e.g. in the first line, chromosome 1 position 3, there are 3 annotations: "GENE1/CODING_GENE:+:Exon:Utr5:Normal_Splice_Site" and "GENE3/CODING_GENE:-:Exon:Utr3:Normal_Splice_Site" and "GENE2/NON_CODING_GENE:+:Upstream"
":" separates within gene annotation in the following order: gene, strand, exon/intron, details.
Five frequency table will be generated after annotation: out.vcf.anno.frq, out.vcf.top.anno.frq, out.vcf.base.frq, out.vcf.indel.frq and out.vcf.codon.frq.
Their contents are self explanatory.
out.vcf.top.anno.freq
StructuralVariation 1
Stop_Loss 1
Frameshift 1
CodonGain 1
Insertion 1
Synonymous 1
Essential_Splice_Site 1
Utr5 1
Upstream 1
Intergenic 2
Nonsynonymous 3
Normal_Splice_Site 3
out.vcf.anno.frq
Stop_Loss 1
Utr5 2
Utr3 2
CodonRegion 2
CodonGain 2
Frameshift 2
Synonymous 3
StructuralVariation 3
Noncoding 3
Nonsynonymous 4
Deletion 6
Upstream 6
Insertion 6
Essential_Splice_Site 8
Downstream 8
Intron 12
Exon 21
Normal_Splice_Site 25
out.vcf.base.frq
A->G 1
T->C 1
T->G 2
A->C 2
C->A 5
out.vcf.codon.frq
Arg->Met 1
Pro->Thr 1
Arg->Arg 1
Pro->His 1
Gly->Gly 1
Pro->Pro 1
Stp->Tyr 1
Leu->Val 1
out.vcf.indel.frq
1 1
-4 1
3 1
-3 1
We list here some commonly used data sets during annotation. For more resources (e.g. other reference genome, gene files et al), please see Data Set.
Reference genomes
hs37d5.fa Reference genome in NCBI build 37 (you also need to download hs37d5.fa.fai
Gene file refFlat_hg19.txt.gz Gene database in refFlat format (from UCSC website). You can also use Gencode version 7 or Gencode version 11.
Annotation configuration files
Please contact Xiaowei Zhan zhanxw@gmail.com, Dajiang Liu dajiang.liu@outlook.com or Goncalo Abecasis (goncalo@umich.edu).
Authors sincerly appreicate Yanming Li for his wonderful tutorial on gene annotation software vcfCodingSnps, and Hyun Ming Kang for his code related to genome scores and his consistent suggestions and feedbacks.
Last update: November 13, 2014