SNV

SNV analysis functions.

Quality Control

Enrichment

baseq.snv.qc.enrich.quality_control(samplename, bampath, intervals)[source]

Check the coverage depth and enrichment quality.

quality_control("sample01", "xx.bam", "panel.bed")
Return:
Sample/Total/Mapped/Map_Ratio/Dup_ratio/PCT_10X/PCT_30X/…

VCF

VCF stats and Filter

baseq.snv.vcf.GATK.vcf_stats(sample, vcfpath, min_depth=50)[source]

Stats on the VCF from GATK

vcf_stats("sample1", "path/to/vcf", min_depth=30)
Return:
A dict/json containing: Samplename/counts/mean_depth/GT_01/GT_11/MAF MAF is minor allel frequency.

GATK

run GATK

baseq.snv.gatk.alignment(fq1, fq2, sample, genome, thread=8)[source]

Map low-divergent sequences against reference genome using BWA. Add ReadGroup(more details about ReadGroup )to bamfile using the input sample name. Outfile is in BAM format and indexed for the downstream analysis.

Usage:

baseq-SNV run_bwa -1 Reads.1.fq.gz -2 Read.2.fq.gz -g hg38 -n Test

Return:

Test.bam
Test.bam.bai
baseq.snv.gatk.bqsr(markedbam, bqsrbam, genome, disable_dup_filter=False)[source]

Run BQSR. This function performs the two-steps process called base quality score recalibration. the first ster generates a recalibration table based on various covariates which is recruited to the second step to correct the systematic bias in input BAM file. More details about BaseRecalibrator and ApplyBQSR .

Usage:

  • Default mode filters duplicate reads (reads with “markduplicate” tags) before applying BQSR

    baseq-SNV run_bqsr -m Test.marked.bam -g hg38 -q Test.marked.bqsr.bam
    
  • Disable reads filter before analysis.

    baseq-SNV run_bqsr -m Test.marked.bam -g hg38 -q Test.marked.bqsr.bam -f Yes
    

Return:

Test.marked.bam.table
Test.marked.bqsr.bai
Test.marked.bqsr.bam
baseq.snv.gatk.create_pon(genome, list, path, interval)[source]

Create_pon function helps you create PoN(panel of normals) file necessary for mutect2 calling. The PoN captures common artifactual and germline variant sites. Mutect2 then uses the PoN to filter variants at the site-level.

Example of samples list (tab delimited):

  • Content of columns are normal sample name, normal BAM file, tumor sample name, tumor BAM file(order cannot be distruped). Absoulte path of all BAM files should be added if directory of BAM files and analysis directory are different.

    N504    N504_marked_bqsr.bam   T504    T504_marked_bqsr.bam
    N505    N505_marked_bqsr.bam   T505    T505_marked_bqsr.bam
    N506    N506_marked_bqsr.bam   T506    T506_marked_bqsr.bam
    N509    N509_marked_bqsr.bam   T509    T509_marked_bqsr.bam
    N510    N510_marked_bqsr.bam   T510    T510_marked_bqsr.bam
    

Usage:

  • Interval list defines genomic regions where analysis is restricted. Introduction of interval list format and its function, please see here.

    # designated a intervals.list
    baseq-SNV create_pon -g hg37 -l sample_list.txt -p ./ -L interval.list
    # Using the dafalut intervals.list
    baseq-SNV create_pon -g hg37 -l sample_list.txt -p ./
    
baseq.snv.gatk.mutect2(genome, normalname, normalbam, tumorname, tumorbam, vcffile, pon, germline)[source]

Mutect2 is aim to call somatic SNVs and indels via local assembly of haplotypes. This function requires both tumor BAM file and its matched normal BAM file. tumorname and normalname should be consistent with the ReadGroup(ID) of tumor BAM file and normal BAM file respectively. PoN is refer to panel of normals callset(more infomation about PoN and how to create it, please see PoN ). Germline resource, also in VCF format, is used to annotate variant alleles. Download germline resource.

Usage:

  • Simplified Mutect2 command line

    # single sample
    baseq-SNV run_mutect2 -g hg37 -n normal -N normal_marked_bqsr.bam \
                                  -t tumor -T tumor_marked_bqsr.bam -o ./
    # multiple samples
    baseq-SNV run_mutect2 -g hg37 -l sample_list.txt -o ./
    
  • Specify PoN(panels of normals) VCF file and germline VCF file Default germline VCF file comes form GATK resource bundle and is recruited if germline isn’t designated.

    # single sample
    baseq-SNV run_mutect2 -g hg37 -n normal -N normal_marked_bqsr.bam \
                                  -t tumor -T tumor_marked_bqsr.bam -o ./ \
                                  -p pon.vcf.gz -G af-only-gnomad.raw.sites.b37.vcf.gz
    # multiple samples
    baseq-SNV run_mutect2 -g hg37 -l sample_list.txt -o ./ \
                                  -p pon.vcf.gz -G af-only-gnomad.raw.sites.b37.vcf.gz
    
baseq.snv.gatk.run_callvar(bqsrbam, rawvcf, genome, disable_dup_filter=False)[source]

Call germline SNPs and indels via local re-assembly of haplotypes. BAM file recalbrated by BQSR do recommand as input BAM file and this functin only run the single sample genotypeVCF calling. More details see also HaplotypeCaller

Usage:

baseq-SNV run_callvar -q Test.marked.bqsr.bam -r Test.raw.indel.snp.vcf -g hg38

Return:

Test.raw.indel.snp.vcf
baseq.snv.gatk.run_markdup(bamfile, markedbam)[source]

Run MarkDuplicates of Picard. this function tags duplicate reads with “markduplicate” in BAM file. See also MarkDuplicates in GATK.

Usage:

baseq-SNV run_markdup -b Test.bam -m Test.marked.bam

Return: metrics file indicates the numbers of duplicates for both single- and paired-end reads

Test.marked.bam
Test.marked.bam.bai
Test.marked.bam.metrics
baseq.snv.gatk.selectvar(rawvcf, selectvcf, filtervcf, genome, run=True)[source]

This function selects SNPs from a VCF file which is usually the output file of HaplotypeCaller. Then, all SNPs are filtered by certain criteria based on INFO and/or FORMAT annotations. Criteria used here is “QD < 2.0 || FS > 60.0 || MQ < 40.0”. More details about SelectVariants and VariantFiltration

Usage:

baseq-SNV run_selectvar -r Test.raw.indel.snp.vcf -s Test.raw.snp.vcf -f Test.filtered.snp.vcf -g hg38

Return:

Test.raw.snp.vcf
Test.filtered.snp.vcf