% bgzip JK.vcf
% bgzip P.vcf
% bgzip R.vcf
% bgzip S.vcf
% bgzip W.vcf
% ls -alh
total 70032
drwxr-xr-x 7 rodney staff 224B Feb 16 14:22 .
drwxr-xr-x 21 rodney staff 672B Feb 16 14:10 ..
-rw-r--r-- 1 rodney staff 3.1M Feb 16 14:20 JK.vcf.gz
-rw-r--r-- 1 rodney staff 5.8M Feb 16 14:22 P.vcf.gz
-rw-r--r-- 1 rodney staff 4.6M Feb 16 14:22 R.vcf.gz
-rw-r--r-- 1 rodney staff 14M Feb 16 14:22 S.vcf.gz
-rw-r--r-- 1 rodney staff 5.8M Feb 16 14:22 W.vcf.gz
Merging VCF Files
The first thing we need to to is zip and then index the individual vcf files. As a reminder, these files have been created by individual freebayes
SNP calling of chunks of individuals partitioned into groups based upon the first letter of the popualtion from which they were collected. So, in the following examples, the file R.vcf
contains all the individuals in populations starting with the letter R
. When there are not that many individuals in a group, they are clustered—below we have JK.vcf
which has all the individuals from populations starting with the letters J
and K
.
So, lets compress these raw files first.
Now we need to index these files.
% tabix -p vcf JK.vcf.gz
% tabix -p vcf P.vcf.gz
% tabix -p vcf R.vcf.gz
% tabix -p vcf S.vcf.gz
% tabix -p vcf W.vcf.gz
Then we can merge the files together into a single file. Here, the output file indicates the populations that contribute to it.
% vcf-merge JK.vcf.gz P.vcf.gz R.vcf.gz S.vcf.gz W.vcf.gz > JKPRSW.vcf
So here are a few things that I’ve found that provide some insights into the data we are getting. Here is the sequencing depth per individual for only the SNP sites who have a minimum quality score of 20.
% vcftools --vcf JKPRSW.vcf --remove-indels --minQ 20 --depth -c > depth.txt
This produced a distribution of sequencing depths for these individuals.
Alternatively—and perhaps more important for us—we can look at the per-site depths. The goal here is to retain sites that have enough sequencing depth such that we are confident that it is a good site but not too much such that it is a reapeaty region in the geniome. For simplicity, I’ll drop the lower and upper standard deviation.
For simplicity, I’m just going to take the inner-50% of the data, dropping off the lower and upper quantiles.
quantile( site_depth$SUM_DEPTH,
prob = c(0.25,0.75))
25% 75%
110 964
This leaves us with 5047 sites.
Now, we can go back to vcftools
and use these values to select only sites with a quality score exceeding 20, and are in the middle 50 percentile of the depth coverage and have a minor allele frequency of at least 5%.
% vcftools --vcf JKPRSW.vcf --remove-indels --min-meanDP 110 --max-meanDP 964 --minQ 20 --maf 0.05 --min-alleles 2 --max-alleles 2 --freq
After filtering, kept 302 out of 302 Individuals
Outputting Frequency Statistics...
After filtering, kept 120 out of a possible 22604 Sites
Run Time = 1.00 seconds
So, we have 120 loci to work with from this subset.