(Last update November 10, 2013)
TaSer (TabAnno and SEqmineR) is an integrated toolset for next-generation sequence data analysis. It consists of three components:
TabAnno an efficient command line annotation tool.
SeqMiner R packge to handle VCF/BCF files or METAL format files.
SeqMinerCmd Command-line script for extracting data from VCF/BCF files.
This tutorial introduces a quick analysis workflow for sequencing data. If you want to learn more about how to use each software, please refer to their documentations by clicking on their names above.
TaSer has unique advantages comparing to Database or Tabix based solutions. We construct the following comparison table and we hope it can convince you that TaSer is a light-weight and efficient tool to processing next-generation sequence data.
TaSer | Database | Tabix | |
---|---|---|---|
Data pre-processing | Annotate VCF files using TabAnno (fast) | Build database project (computationally intensive) | Indexing (fast) |
Data extraction by gene | Via build-in function (1) (in R) | Via SQL (requires knowledge of database) | NA (need additional scripting) |
Data extraction by region | Via built-function (2) (in R) | Via SQL (requires knowledge of database) | By genomic position (fast) |
Standard statistical analysis | Existing R function e.g. glm() etc (convenient) | Via integrated tools, e.g. VAT or Plink/Seq (convenient) | NA (need additional scripting) |
Implementing novel analysis | Implement functions in R (convenient and flexible) | Wrap R code in integrated toolkit e.g. VAT (less flexible, and R code has to follow a special format etc) | NA (need additional scripting) |
Conclusion | Flexible and efficient solution to extract genetics informationfrom sequencing data in R | Powerful solution to perform routine tasks per software implementation, limited flexibility, high computational cost | Suitable for interval query. Limited function for other analysis |
(1) readVCFToMatrixByGene, readVCFToListByGene
(2) readVCFToMatrixByRange, readVCFToListByRange
We use 1000 Genome Project release in the VCF format as an example.
First, we download the variants located in the CFH region.
For Linux user, tabix support retrieving VCF data directly. The command line is:
tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 1:196621008-196716634 |bgzip -c > CFH.vcf.gz
For Mac or Windows usr, you can download the same VCF file here
Then use tabix to create a index and the index file named as CFh.vcf.gz.tbi will be created:
tabix -p vcf CFH.vcf.gz
library(seqminer)
## Loading required package: stringr
setwd("~/Documents/TASER Manual/data/")
fileName = "CFH.vcf.gz"
cfh <- readVCFToMatrixByRange(fileName, "1:196621007-196716634", "")[[1]]
## 1 region to be extracted.
dim(cfh)
## [1] 1442 1092
cfh[1:5, 1:5]
## HG00096 HG00097 HG00099 HG00100 HG00101
## 1:196621053 0 0 0 0 0
## 1:196621121 0 0 0 0 0
## 1:196621169 0 0 0 0 0
## 1:196621201 0 0 0 0 0
## 1:196621212 0 0 0 0 0
The results are store in the variable cfh. Here we demonstrate SeqMiner can extract genotypes directly from VCF files within a specified region. Besides genotypes, VCF can have more information. To extract them, try ?readVCFToListByList in your R terminal to find out.
TaSer provide a annotation component, TabAnno. It can integrate various data in a generic tab-delimited file such as VCF file. The comand line to annotate the VCF file is:
~/anno/anno -i CFH.vcf.gz -o CFH.anno.vcf.gz -r ~/anno/resources/hs37d5.fa -g ~/anno/resources/refFlat_hg19.txt.gz --indexOutput -p ~/anno/priority.txt -c ~/anno/codon.txt
The resource files can be downloaded separated: hs37d5.fa, refFlat_hg19.txt.gz, codon.txt, and priority.txt
You should see the following message:
..............................................
... Anno(tation) ...
... Xiaowei Zhan, Goncalo Abecasis ...
... Speical Thanks: ...
... Hyun Ming Kang, Yanming Li ...
... zhanxw@umich.edu ...
... Sep 2011 ...
................................................
The following parameters are available. Ones with "[]" are in effect:
Available Options
Required Parameters : -i [CFH.vcf.gz], -o [CFH.anno.vcf.gz]
Gene Annotations :
-g [/net/fantasia/home/zhanxw/anno/resources/refFlat_hg19.txt.gz]
-r [/net/fantasia/home/zhanxw/anno/resources/hs37d5.fa]
--inputFormat [], --checkReference, -f []
-p [/net/fantasia/home/zhanxw/anno/priority.txt]
-c [/net/fantasia/home/zhanxw/anno/codon.txt], -u []
-d [], --se [], --si []
Other Annotations : --genomeScore [], --bed [], --tabix []
Auxillary Functions : --outputFormat [], --indexOutput [true]
Load reference genome /net/fantasia/home/zhanxw/anno/resources/hs37d5.fa...
DONE: 86 chromosomes and 3137454505 bases are loaded.
Load codon file /net/fantasia/home/zhanxw/anno/codon.txt...
DONE: codon file loaded.
Load priority file /net/fantasia/home/zhanxw/anno/priority.txt...
DONE: 25 priority annotation types loaded.
Load gene file /net/fantasia/home/zhanxw/anno/resources/refFlat_hg19.txt.gz...
DONE: 39173 gene loaded.
DONE: Generated frequency of each annotype type in [ CFH.anno.vcf.gz.anno.frq ].
DONE: Generated frequency of each highest priority annotation type in [ CFH.anno.vcf.gz.top.anno.frq ].
DONE: Generated frequency of each base change in [ CFH.anno.vcf.gz.base.frq ].
DONE: Generated frequency of each codon change in [ CFH.anno.vcf.gz.codon.frq ].
DONE: Generated frequency of indel length in [ CFH.anno.vcf.gz.indel.frq ].
DONE: 1442 varaints are annotated.
DONE: Generated annotation output in [ CFH.anno.vcf.gz ].
DONE: Indexing succeed!
Annotation succeed!
To extract genotype by gene, you will need annoated VCF files as described in previous step. We have annotated vcf file CFH.anno.vcf.gz and it is automatically indexed by TabAnno. So extracting genotype is straightforward in R:
library(seqminer)
setwd("~/Documents/TASER Manual/data/")
fileName = "CFH.anno.vcf.gz"
geneFile = "refFlat_hg19.txt.gz"
cfh <- readVCFToMatrixByGene(fileName, geneFile, "CFH", "Nonsynonymous")[[1]]
## 1 region to be extracted.
dim(cfh)
## [1] 37 1092
cfh[1:5, 1:5]
## HG00096 HG00097 HG00099 HG00100 HG00101
## 1:196642174 0 0 0 0 0
## 1:196642221 0 0 0 0 0
## 1:196642233 0 0 2 0 1
## 1:196646650 0 0 0 0 0
## 1:196646698 0 0 0 0 0
Similar as region-based extraction, SeqMiner has an advanced function to extract information flexibly (not limited to genotype). Try ?readVCFToListByGene in your R terminal.
Rare variant association analysis can be performed using rvtests. Using --meta score,cov, rvtests will generate association statistis in METAL format. Similar to VCF files, you will need an extra step to bgzip and index the association results.
We have provided an exemplar file (Download). You need to first uncompress the downloaded file:
tar zvxf rvtests.tar.gz
To extract association results by region, you can use this command:
library(seqminer)
setwd("~/Documents/TASER Manual/data/rvtests")
score.fileName = "rvtest.MetaScore.assoc.anno.gz"
cov.fileName = "rvtest.MetaCov.assoc.gz"
cfh.test <- rvmeta.readDataByRange(score.fileName, cov.fileName, "1:196621169-196622041")
## 1 gene/region to be extracted.
## Read score tests...
## In study 0
## Done read score file: rvtest.MetaScore.assoc.anno.gz
## Read cov files ...
## Done read cov file: rvtest.MetaCov.assoc.gz
## Finished calculation.
cfh.test
## $Range1
## $Range1$ref
## $Range1$ref[[1]]
## [1] "A" "A"
##
##
## $Range1$alt
## $Range1$alt[[1]]
## [1] "G" "G"
##
##
## $Range1$nSample
## $Range1$nSample[[1]]
## [1] 1092 1092
##
##
## $Range1$af
## $Range1$af[[1]]
## [1] 0.02473 0.03068
##
##
## $Range1$ac
## $Range1$ac[[1]]
## [1] 54 67
##
##
## $Range1$callrate
## $Range1$callrate[[1]]
## [1] 1 1
##
##
## $Range1$hwe
## $Range1$hwe[[1]]
## [1] 0.003295 0.076875
##
##
## $Range1$nref
## $Range1$nref[[1]]
## [1] 1042 1028
##
##
## $Range1$nhet
## $Range1$nhet[[1]]
## [1] 46 61
##
##
## $Range1$nalt
## $Range1$nalt[[1]]
## [1] 4 3
##
##
## $Range1$ustat
## $Range1$ustat[[1]]
## [1] 6.965 -16.593
##
##
## $Range1$vstat
## $Range1$vstat[[1]]
## [1] 7.698 8.295
##
##
## $Range1$effect
## $Range1$effect[[1]]
## [1] 0.1175 -0.2412
##
##
## $Range1$pVal
## $Range1$pVal[[1]]
## [1] 0.36556 0.04546
##
##
## $Range1$cov
## $Range1$cov[[1]]
## [,1] [,2]
## [1,] 0.054266 0.001543
## [2,] 0.001543 0.063010
##
##
## $Range1$pos
## [1] "1:196621169" "1:196622041"
##
## $Range1$anno
## [1] "Utr5:CFH" "Intron:CFH"
##
## $Range1$covXZ
## $Range1$covXZ[[1]]
##
## [1,]
## [2,]
##
##
## $Range1$covZZ
## $Range1$covZZ[[1]]
## <0 x 0 matrix>
##
##
## $Range1$hweCase
## $Range1$hweCase[[1]]
## [1] NA NA
##
##
## $Range1$hweCtrl
## $Range1$hweCtrl[[1]]
## [1] NA NA
##
##
## $Range1$afCase
## $Range1$afCase[[1]]
## [1] NA NA
##
##
## $Range1$afCtrl
## $Range1$afCtrl[[1]]
## [1] NA NA
As you can see, the results are returned as a list. Several useful statistics include allele frequency, call rate, Hardy-Weinburg Equilibirum P-value and score test P-value.
TabAnno website contains download information and detailed usage documentation.
SeqMiner website have more examples of how to use SeqMiner R package and demonstrate rare variant analysis using SKAT.
SeqMinerCmd website described a command-line tools that can extract genotpyes or other information from VCF/BCF files.
TaSer is developed by Xiaowei Zhan and Dajiang Liu. We welcome your questions and feedbacks.