(Updated November 10, 2013)
To receive email notification of updated manual, please Sign up
SeqMiner (formerly, vcf2geno) is an R package that helps you efficiently extracting sequencing data in the VCF format files. SeqMiner is one of the two components in our analysis toolset, TASER, which stands for (TabAnno and SEqmineR): an integrated toolset for next-generation sequence data analysis. The other component in TASER is TabAnno. It is a very efficient annotation tool.
In this tutorial, we will only introduce SeqMiner. For annotation purpose, please visit documentation of TabAnno at link; for integrate these two software packages and perform general analyses, please refer to link.
NOTE1 First thing to note, SeqMiner requires VCF files indexed by Tabix.
NOTE2 To use gene based extraction, VCF need to be annotated by TabAnno.
We have provided an examplar VCF file in this package. You can use the following command to find its location on your computer:
## Loading required package: stringr
fileName = system.file("vcf/all.anno.filtered.extract.vcf.gz", package = "seqminer") fileName
##  "/Library/Frameworks/R.framework/Versions/3.0/Resources/library/seqminer/vcf/all.anno.filtered.extract.vcf.gz"
This tutorial demonstrates a quick guide of a major set of functaions of SeqMiner.
Extract genotype matrix from 1:196621007-196716634 (the genomic range of CFH gene), and only extract those Nonsynonymous variants:
cfh <- readVCFToMatrixByRange(fileName, "1:196621007-196716634", "Nonsynonymous")
## 1 region to be extracted.
## $`1:196621007-196716634` ## NA12286 NA12341 NA12342 ## 1:196642233 1 2 0 ## 1:196659237 0 2 1
In this example, we use the fileName for the VCF file path, 1:196621007-196716634 for the genomic range, and Nonsynonymous for the variants type. Extraction results are stored in the variable cfh. Its row names are variant positions, while column names are sample IDs.
Similarly, we can extract genotype matrix per gene:
geneFile = system.file("vcf/refFlat_hg19_6col.txt.gz", package = "seqminer") geneFile
##  "/Library/Frameworks/R.framework/Versions/3.0/Resources/library/seqminer/vcf/refFlat_hg19_6col.txt.gz"
cfh <- readVCFToMatrixByGene(fileName, geneFile, "CFH", "Synonymous")
## 1 region to be extracted.
## $CFH ## NA12286 NA12341 NA12342 ## 1:196654324 1 2 1 ## 1:196682947 2 2 1
Note here we require a gene file in refFlat format. We have prepared several gene definition files in refFlat format using NCBI human build 37. From this website (Resource part), you can obtain these resources files.
The supported annotation variant type are:
# large strucutre variant StructuralVariation # indels Stop_Gain Stop_Loss Start_Gain Start_Loss Frameshift /* Indel length is not divisible by 3 */ CodonGain /* Insertion length is divisible by 3 */ CodonLoss /* Deletion length is divisible by 3 */ CodonRegion /* Just say the variant is in the Coding Region used in Structrual Varition*/ Insertion Deletion # single variants Nonsynonymous Synonymous Essential_Splice_Site Normal_Splice_Site Utr5 Utr3 Exon Intron Upstream Downstream SNV Noncoding Monomorphic
For advanced data extraction from VCF file, there are two functions: readVCFToListByRange and readVCFToListByGene. Their usage are similar to the previous two functions: readVCFToMatrixByRange and readVCFToMatrixByGene but the results are organized in list.
result <- readVCFToListByGene(fileName, geneFile, "CFH", "Synonymous", c("CHROM", "ID"), c("AC", "AN"), "GT") result
## $CHROM ##  "1" "1" ## ## $ID ##  "." "." ## ## $AC ##  "4" "5" ## ## $AN ##  "6" "6" ## ## $GT ## [,1] [,2] ## [1,] "0/1" "1/1" ## [2,] "1/1" "1/1" ## [3,] "0/1" "0/1" ## ## $sampleId ##  "NA12286" "NA12341" "NA12342"
There are 7 parameters in readVCFToListByGene: 1. VCF file path 2. Gene definition file (refFlat foramt) 3. Character array of gene names (can be one gene or more than one gene names) 4. Annotation type 5. VCF field to extract (explain later) 6. VCF tags in the INFO field to extract (explain later) 7. VCF tags in the individual field to extract (explain later)
The last three parameter are closely related to VCF format. In the VCF data lines, there are 8 fixed fields per record: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO. These field, as well as the FORMAT field can be specifed in the 5th parameter. For example, specify c("CHROM", "POS") will extract the genomic position.
In the VCF INFO field, there may be information including variant depth (DP), variant allele counts (AC), total allele counts (AN). To extract those information, use the 6th paramter.
In the VCF genotype fileds, which are from the 10th column to the last column per record, it contains the individual level information. Mostly commonly used information are genotype (GT), genotype depth (GD) and genotype quality (GQ). To extract these, specify c("GT", "GD", "GQ") in the 7th parameter.
The extraction results are stored in result in our example. All values are characters for simplicity. The extraction results is a list, and it contains CHROM, ID, AC, AN, GT and sampleId. CHROM and ID are speicified in the 5th parameter; AC and AN are speicified by the 6th parameter; GT are specified in the last parameter. We also list sample IDs in the sampleId part.
SeqMiner can greatly reduce your effort to process huge sequencing files in VCF format. That indicates your valuable time can be spared on more important things (e.g. investigating causal variants).
Here we demonstrate how to apply SKAT method to study CFH gene. The genotype file we used is extracted from 1000 Genomes Project. It contains 85 individuals in CEU population and the variants are all located in the CFH gene region. We will use simulated phenotype in this demonstration.
# Extract genotype matrix vcf.1000g <- system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "seqminer") geneFile = system.file("vcf/refFlat_hg19_6col.txt.gz", package = "seqminer") geneName <- "CFH" cfh <- readVCFToMatrixByGene(vcf.1000g, geneFile, geneName, "Nonsynonymous")[]
## 1 region to be extracted.
# now cfh is a marker by people matrix dim(cfh)
##  8 85
# Simulation phenotype under the null model pheno <- rnorm(ncol(cfh)) # Apply SKAT method library(SKAT) obj <- SKAT_Null_Model(pheno ~ 1, out_type = "C") SKAT(t(cfh), obj)$p.value
## Warning: Genotypes of some variants are not the number of minor alleles! ## These genotypes are flipped!
##  0.5662
Another way to extract genotype is
readVCFToListByGene function, which provides more options to extract genotypical information from VCF file. In the following example, we extract chromosome name, chromosomal position, reference allele, annotation and genotype:
# Extract genotype matrix vcf.1000g <- system.file("vcf/1000g.phase1.20110521.CFH.var.anno.vcf.gz", package = "seqminer") geneFile = system.file("vcf/refFlat_hg19_6col.txt.gz", package = "seqminer") geneName <- "CFH"
cfh <- readVCFToListByGene(vcf.1000g, geneFile, geneName, "Nonsynonymous", c("CHR", "POS", "REF", "ALT"), c("ANNO"), c("GT"))
##  85 8
# Convert 0|0, 0|1, 1|1 style genotype to 0, 1, 2 geno is a people by marker # matrix geno <- matrix(NA, nr = nrow(cfh$GT), ncol(cfh$GT)) geno[cfh$GT == "0|0"] = 0 geno[cfh$GT == "0|1"] = 1 geno[cfh$GT == "1|1"] = 2 # Apply SKAT method library(SKAT) obj <- SKAT_Null_Model(pheno ~ 1, out_type = "C") SKAT(geno, obj)$p.value
## Warning: 3 SNPs with either high missing rates or no-variation are excluded! ## Warning: The missing genotype rate is 0.025882. Imputation is applied.
##  0.9297
SeqMiner is developed by Xiaowei Zhan and Dajiang Liu. We welcome your questions and feedbacks.