library( tidyverse )
<- list.files("./processing", pattern=".bam", full.names=TRUE)
bam for( i in 1:length(bam) ) {
<- bam[i]
file <- str_replace( file, pattern="-RG.bam", replacement="")
base
<- paste( base, "-sorted.bam", sep="")
fsorted <- paste( "samtools sort", file, "-o", fsorted )
cmd system( cmd )
<- paste( base, "-sorted-md.bam", sep="")
mdfile <- paste( "samtools markdup", fsorted, mdfile )
cmd system( cmd )
<- paste( "samtools index", mdfile)
cmd system( cmd )
cat( i, base, "\n")
}
For this, I wanted to sort, mark duplicates, and index each of the bam files we are working with. I softlinked all the bam files into a new folder called processing
via for file in ../samples/all005/*.bam; do ln -s $file; done
.
Then I fired up R and loaded the latest modeule on the server module load R/4.1.2
and then hooked up VSCode over SSH to write the following snippet.
Running this made several derivded files for each one, a sorted one, a one with marked duplicates, and one that had the index on it. So now each sample looks like this.
[rjdyer@huff10 4_processing]$ ls -1 Zoo_969*
Zoo_969-RG.bam -> ../samples/all005/Zoo_969-RG.bam
Zoo_969-sorted.bam
Zoo_969-sorted-md.bam
Zoo_969-sorted-md.bam.bai
After testing to make sure the commands were properly set up, I ran the script in as a batch. It didn’t take that long to finish, maybe 3-4 seconds per sample.