(Updated May 10, 2014)
To receive email notification of updated manual, please Sign up
SeqMiner (formerly, vcf2geno) is an R package that helps you efficiently extracting sequencing data in the VCF format files. We hope SeqMiner becomes a valuable tool for bioinformaticians, compuational genetists and statistical genetists.
In this tutorial, we will go over some basic SeqMiner functions of handling VCF files. These include gene-based annotations, GERP-score integration and genetic data extraction. These instructions are also available in SeqMiner demos, which can be invoked by > demo("workflow", package = "seqminer").
NOTE First thing to note, SeqMiner requires VCF/BCF files indexed by Tabix.
We will need to prepare annotation parameters. At minimal, we need (1) a reference genome files (FASTA format) and its index (.fa.fai) and (2) gene definitation file (refFlat format)
library(seqminer)
## Loading required package: stringr
param <- list(reference = system.file("tabanno/test.fa", package = "seqminer"),
geneFile = system.file("tabanno/test.gene.txt", package = "seqminer"))
param <- makeAnnotationParameter(param)
We have an example VCF file input.demo.vcf
and we will specify an output file out.vcf.gz
input <- system.file("tabanno/input.demo.vcf", package = "seqminer")
output <- paste0(getwd(), "/", "out.vcf.gz")
annotateVcf(input, output, param)
## Output file are written to [ /Library/Frameworks/R.framework/Versions/3.1/Resources/library/seqminer/tabanno/out.vcf.gz ].
## NULL
The above messages shows the annotation steps finish successfully.
We will demonstrate how to perform gene-based annotation, region-based annotation and GERP-score integration in just one step. In order to integrate external bioinformatics databas (DB), we use a region-based file as example (BED format). Here are relevant parameters:
setwd(system.file("tabanno", package = "seqminer"))
param <- list(reference = "test.fa", geneFile = "test.gene.txt", bed = "REGION=test.bed",
tabix = "test.dbNSFP.gz(SIFT=9,PolyPhen=10)", indexOutput = TRUE)
param <- makeAnnotationParameter(param)
input <- "input.demo.vcf"
output <- "out.vcf.gz"
annotateVcf(input, output, param)
## Output file are written to [ out.vcf.gz ].
## NULL
Now we are able to query the VCF by region or by gene
A quick way to obtain gentoype matrix:
genotype <- readVCFToMatrixByRange(fileName = output, range = "1:1-10", annoType = "")
## 1 region to be extracted.
print(genotype)
## $`1:1-10`
## NA12891 NA12892
## 1:3 1 0
## 1:5 1 0
## 1:7 1 0
You can also fine-control what fields from VCFs to extract:
genotypeList <- readVCFToListByRange(fileName = output, range = "1:1-10", annoType = "",
vcfColumn = c("CHROM", "POS"), vcfInfo = c("ANNO", "SIFT"), vcfIndv = "GT")
print(genotypeList)
## $CHROM
## [1] "1" "1" "1"
##
## $POS
## [1] 3 5 7
##
## $ANNO
## [1] "Normal_Splice_Site:GENE1|GENE3" "Nonsynonymous:GENE1|GENE3"
## [3] "Monomorphic"
##
## $SIFT
## [1] "0.0" "NA" "NA"
##
## $GT
## [,1] [,2] [,3]
## [1,] "1/0" "1/0" "1/0"
## [2,] "0/0" "0/0" "0/0"
##
## $sampleId
## [1] "NA12891" "NA12892"
Another useful feature for seqminer is to extract data by gene. Similar to 'readVCFToMatrixByRange', we use 'readVCFToMatrixByGene':
geneFile = system.file("tabanno/test.gene.txt", package = "seqminer")
genotype <- readVCFToMatrixByGene(fileName = output, geneFile = geneFile, geneName = "GENE1",
annoType = "")
## 1 region to be extracted.
print(genotype)
## $GENE1
## NA12891 NA12892
## 1:3 1 0
## 1:5 1 0
## 1:7 1 0
## 1:13 1 0
## 1:14 1 0
## 1:20 1 0
## 1:22 1 0
## 1:25 1 0
## 1:43 1 0
## 1:44 1 0
## 1:50 1 0
## 1:53 1 0
## 1:53 1 0
## 1:53 1 0
## 1:66 1 0
Similar to readVCFToMatrixByRange, we use readVCFToMatrixByGene:
genotypeList <- readVCFToListByGene(fileName = output, geneFile = geneFile,
geneName = "GENE1", annoType = "", vcfColumn = c("CHROM", "POS"), vcfInfo = "ANNO",
vcfIndv = "GT")
print(genotypeList)
## $CHROM
## [1] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
##
## $POS
## [1] 3 5 7 13 14 20 22 25 43 44 50 53 53 53 66
##
## $ANNO
## [1] "Normal_Splice_Site:GENE1|GENE3"
## [2] "Nonsynonymous:GENE1|GENE3"
## [3] "Monomorphic"
## [4] "Synonymous:GENE1|GENE3"
## [5] "CodonGain:GENE1|GENE3"
## [6] "Frameshift:GENE1|GENE3"
## [7] "Insertion:GENE1|GENE2|GENE3"
## [8] "Stop_Loss:GENE1"
## [9] "Normal_Splice_Site:GENE1|GENE3"
## [10] "Normal_Splice_Site:GENE1|GENE3"
## [11] "Essential_Splice_Site:GENE1|GENE3"
## [12] "StructuralVariation:GENE1|GENE2|GENE3"
## [13] "Nonsynonymous:GENE1"
## [14] "Nonsynonymous:GENE1"
## [15] "Utr5:GENE3"
##
## $GT
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] "1/0" "1/0" "1/0" "1/0" "1/0" "1/0" "1/0" "1/0" "1/0" "1/0" "1/0"
## [2,] "0/0" "0/0" "0/0" "0/0" "0/0" "0/0" "0/0" "0/0" "0/0" "0/0" "0/0"
## [,12] [,13] [,14] [,15]
## [1,] "1/0" "1/0" "1/0" "1/0"
## [2,] "0/0" "0/0" "0/0" "0/0"
##
## $sampleId
## [1] "NA12891" "NA12892"
SeqMiner is developed by Xiaowei Zhan and Dajiang Liu. We welcome your questions and feedbacks.