SNV¶
SNV analysis functions.
Quality Control¶
VCF¶
GATK¶
run GATK¶
-
baseq.snv.gatk.
alignment
(fq1, fq2, sample, genome, thread=8)[source]¶ Map low-divergent sequences against reference genome using BWA. Add ReadGroup(more details about ReadGroup )to bamfile using the input sample name. Outfile is in BAM format and indexed for the downstream analysis.
Usage:
baseq-SNV run_bwa -1 Reads.1.fq.gz -2 Read.2.fq.gz -g hg38 -n Test
Return:
Test.bam Test.bam.bai
-
baseq.snv.gatk.
bqsr
(markedbam, bqsrbam, genome, disable_dup_filter=False)[source]¶ Run BQSR. This function performs the two-steps process called base quality score recalibration. the first ster generates a recalibration table based on various covariates which is recruited to the second step to correct the systematic bias in input BAM file. More details about BaseRecalibrator and ApplyBQSR .
Usage:
Default mode filters duplicate reads (reads with “markduplicate” tags) before applying BQSR
baseq-SNV run_bqsr -m Test.marked.bam -g hg38 -q Test.marked.bqsr.bam
Disable reads filter before analysis.
baseq-SNV run_bqsr -m Test.marked.bam -g hg38 -q Test.marked.bqsr.bam -f Yes
Return:
Test.marked.bam.table Test.marked.bqsr.bai Test.marked.bqsr.bam
-
baseq.snv.gatk.
create_pon
(genome, list, path, interval)[source]¶ Create_pon function helps you create PoN(panel of normals) file necessary for mutect2 calling. The PoN captures common artifactual and germline variant sites. Mutect2 then uses the PoN to filter variants at the site-level.
Example of samples list (tab delimited):
Content of columns are normal sample name, normal BAM file, tumor sample name, tumor BAM file(order cannot be distruped). Absoulte path of all BAM files should be added if directory of BAM files and analysis directory are different.
N504 N504_marked_bqsr.bam T504 T504_marked_bqsr.bam N505 N505_marked_bqsr.bam T505 T505_marked_bqsr.bam N506 N506_marked_bqsr.bam T506 T506_marked_bqsr.bam N509 N509_marked_bqsr.bam T509 T509_marked_bqsr.bam N510 N510_marked_bqsr.bam T510 T510_marked_bqsr.bam
Usage:
Interval list defines genomic regions where analysis is restricted. Introduction of interval list format and its function, please see here.
# designated a intervals.list baseq-SNV create_pon -g hg37 -l sample_list.txt -p ./ -L interval.list # Using the dafalut intervals.list baseq-SNV create_pon -g hg37 -l sample_list.txt -p ./
-
baseq.snv.gatk.
mutect2
(genome, normalname, normalbam, tumorname, tumorbam, vcffile, pon, germline)[source]¶ Mutect2 is aim to call somatic SNVs and indels via local assembly of haplotypes. This function requires both tumor BAM file and its matched normal BAM file. tumorname and normalname should be consistent with the ReadGroup(ID) of tumor BAM file and normal BAM file respectively. PoN is refer to panel of normals callset(more infomation about PoN and how to create it, please see PoN ). Germline resource, also in VCF format, is used to annotate variant alleles. Download germline resource.
Usage:
Simplified Mutect2 command line
# single sample baseq-SNV run_mutect2 -g hg37 -n normal -N normal_marked_bqsr.bam \ -t tumor -T tumor_marked_bqsr.bam -o ./ # multiple samples baseq-SNV run_mutect2 -g hg37 -l sample_list.txt -o ./
Specify PoN(panels of normals) VCF file and germline VCF file Default germline VCF file comes form GATK resource bundle and is recruited if germline isn’t designated.
# single sample baseq-SNV run_mutect2 -g hg37 -n normal -N normal_marked_bqsr.bam \ -t tumor -T tumor_marked_bqsr.bam -o ./ \ -p pon.vcf.gz -G af-only-gnomad.raw.sites.b37.vcf.gz # multiple samples baseq-SNV run_mutect2 -g hg37 -l sample_list.txt -o ./ \ -p pon.vcf.gz -G af-only-gnomad.raw.sites.b37.vcf.gz
-
baseq.snv.gatk.
run_callvar
(bqsrbam, rawvcf, genome, disable_dup_filter=False)[source]¶ Call germline SNPs and indels via local re-assembly of haplotypes. BAM file recalbrated by BQSR do recommand as input BAM file and this functin only run the single sample genotypeVCF calling. More details see also HaplotypeCaller
Usage:
baseq-SNV run_callvar -q Test.marked.bqsr.bam -r Test.raw.indel.snp.vcf -g hg38
Return:
Test.raw.indel.snp.vcf
-
baseq.snv.gatk.
run_markdup
(bamfile, markedbam)[source]¶ Run MarkDuplicates of Picard. this function tags duplicate reads with “markduplicate” in BAM file. See also MarkDuplicates in GATK.
Usage:
baseq-SNV run_markdup -b Test.bam -m Test.marked.bam
Return: metrics file indicates the numbers of duplicates for both single- and paired-end reads
Test.marked.bam Test.marked.bam.bai Test.marked.bam.metrics
-
baseq.snv.gatk.
selectvar
(rawvcf, selectvcf, filtervcf, genome, run=True)[source]¶ This function selects SNPs from a VCF file which is usually the output file of HaplotypeCaller. Then, all SNPs are filtered by certain criteria based on INFO and/or FORMAT annotations. Criteria used here is “QD < 2.0 || FS > 60.0 || MQ < 40.0”. More details about SelectVariants and VariantFiltration
Usage:
baseq-SNV run_selectvar -r Test.raw.indel.snp.vcf -s Test.raw.snp.vcf -f Test.filtered.snp.vcf -g hg38
Return:
Test.raw.snp.vcf Test.filtered.snp.vcf