CNV¶
baseq-CNV is a toolkit for inferring and visualizing copy number from high-throughput DNA sequencing data. It is designed for whole-genome sequencing (WGS) data from both bulk and single-cell experiments.
Copy number is inferred from the read counts per genomic region (bin). The regions are predefined so that low-complexity parts of the genome are excluded or discounted. The pipeline has four steps:
- Read alignment using Bowtie2
- Bin counting of uniquely mapped reads
- Normalization by GC content
- Circular binary segmentation (CBS) to reduce noise
Result¶
Name | Description |
---|---|
sample.bowtie.bam | Aligned bam file |
sample.bin.counts.txt | The counts of reads for each bin in the dynamic_bin_file |
sample.CNV_plot_[size].png | CNV plot figure for each bin-size |
sample.GC.png | GC content plot |
Pipeline¶
Run the complete pipeline with a single command:
baseq-CNV run_pipeline ./Tn5_S1.fq.gz -g hg19
BinCounting¶
According to dynamicbin … The command is:
baseq-CNV bincount -g hg19 -i ./sample.bam -o normbincounts.txt
Normalize¶
Normalize the raw read counts.
baseq-CNV normalize -g hg19 -i ./bincounts.txt -o bincounts_norm.txt
Config¶
[CNV]
bowtie2 = /mnt/gpfs/Database/softs/anaconda2/bin/bowtie2
samtools = /mnt/gpfs/Database/softs/anaconda2/bin/samtools
[CNV_ref_hg19]
bowtie2_index = /mnt/gpfs/Database/ref/hg19/hg19
dynamic_bin = /mnt/gpfs/Users/zhangxiannian/basematic/cnv/hg19.dynabin.txt
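The file above uses standard INI syntax, so its values can be inspected with Python's `configparser`. The sketch below only illustrates the section layout shown above. How baseq-CNV itself loads the file is not documented here, and the helper names `load_config` and `reference_paths` are hypothetical.

```python
import configparser

# Config text copied from the example above; in practice it would live in a file.
CONFIG_TEXT = """
[CNV]
bowtie2 = /mnt/gpfs/Database/softs/anaconda2/bin/bowtie2
samtools = /mnt/gpfs/Database/softs/anaconda2/bin/samtools

[CNV_ref_hg19]
bowtie2_index = /mnt/gpfs/Database/ref/hg19/hg19
dynamic_bin = /mnt/gpfs/Users/zhangxiannian/basematic/cnv/hg19.dynabin.txt
"""


def load_config(text):
    """Parse INI-formatted text into a ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def reference_paths(parser, genome):
    """Hypothetical helper: return the reference section for a genome such as 'hg19'."""
    return dict(parser["CNV_ref_" + genome])


config = load_config(CONFIG_TEXT)
print(config["CNV"]["bowtie2"])
print(reference_paths(config, "hg19")["dynamic_bin"])
```

Under this layout, supporting another genome would presumably mean adding a matching `[CNV_ref_<genome>]` section, although the source only shows hg19.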
Quality Control¶
Alignment information and MAD
- Alignment: total reads and mapping ratio
- MAD: median absolute deviation, which indicates the technical noise level of the sample.
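As an illustration of the MAD metric, here is a minimal sketch of the textbook definition, median(|x - median(x)|), applied to a list of per-bin values. Whether baseq-CNV computes MAD on the normalized bin values themselves or on differences between neighboring bins is not stated in this document, so the function name `mad` and the toy input are assumptions.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    return median(deviations)


# Toy per-bin values (hypothetical); one outlier barely moves the MAD.
bin_values = [2, 4, 4, 5, 9]
print(mad(bin_values))
```

Because it is built on medians, the MAD is robust to a few extreme bins, which makes it a reasonable summary of technical noise in the presence of real copy number changes.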
APIs¶
Align¶
baseq.align.bowtie2.bowtie2_sort(fq1, fq2, bamfile, genome, reads=5000000, thread=8)
Align the fastq reads using bowtie2 and sort the samfile.
from baseq.align.bowtie2 import bowtie2_sort
bowtie2_sort("read.1.fq.gz", "", "sample.bam", "hg19")  # single-end reads
bowtie2_sort("read.1.fq.gz", "read.2.fq.gz", "sample.bam", "hg19")  # paired-end reads
Results:
- sample.bam
- sample.bam.stats
Bincount¶
baseq.cnv.bincount.counting(genome, bamfile, out)
Bin counting using bisect for the dynamicbin.
from baseq.cnv.bincount.counting import counting
counting("hg19", "aligned.bam", "bincount.txt")
This will generate:
bincount.txt: a TSV file containing two columns, "index" and "counts"
Process:
- Read the dynamic bin file;
- Read the BAM file using the samtools view command;
- Keep only reads with mapping quality >= 40;
- Map each genome position to a bin ID and sum the reads per bin;
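The counting steps above can be sketched with Python's `bisect` module. This is a minimal illustration, not the baseq-CNV implementation: the dynamic bin file format is not shown in this document, so bins are represented here as hypothetical `(chrom, start)` tuples, each bin is assumed to extend to the start of the next one, and reads are toy `(chrom, pos, mapq)` tuples rather than parsed `samtools view` output.

```python
import bisect
from collections import Counter

MIN_MAPQ = 40  # mapping quality threshold from the process list above


def build_index(bins):
    """bins: list of (chrom, start). Returns {chrom: (sorted starts, bin IDs)}."""
    per_chrom = {}
    for bin_id, (chrom, start) in enumerate(bins):
        per_chrom.setdefault(chrom, []).append((start, bin_id))
    index = {}
    for chrom, entries in per_chrom.items():
        entries.sort()
        index[chrom] = ([s for s, _ in entries], [b for _, b in entries])
    return index


def count_reads(index, reads):
    """reads: iterable of (chrom, pos, mapq). Returns Counter of bin ID -> reads."""
    counts = Counter()
    for chrom, pos, mapq in reads:
        if mapq < MIN_MAPQ or chrom not in index:
            continue  # drop low-quality reads and unknown chromosomes
        starts, ids = index[chrom]
        i = bisect.bisect_right(starts, pos) - 1  # last bin whose start <= pos
        if i >= 0:
            counts[ids[i]] += 1
    return counts


# Toy data (hypothetical bins and reads)
bins = [("chr1", 0), ("chr1", 1000), ("chr1", 2500), ("chr2", 0)]
reads = [("chr1", 10, 42), ("chr1", 1000, 42), ("chr1", 2499, 42),
         ("chr1", 3000, 10), ("chr2", 5, 40)]
counts = count_reads(build_index(bins), reads)
print(sorted(counts.items()))
```

Binary search keeps each lookup logarithmic in the number of bins per chromosome, which matters when bins have variable width and a simple division by bin size cannot be used.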
Normalize¶
baseq.cnv.normalize.normalize(genome, bincount, name)
Normalize the raw bin counts by bin length and GC content, and estimate the ploidy.
normalize("hg19", "bincounts.txt", "CNVsample")
This will generate two files:
- Norm.Counts.CNVsample.txt, with columns 'chr', 'start', 'absstart', 'norm_by_GC', 'norm_by_GC_Ploidy'
- Norm.CNVsample.png
Process:
- Read the dynamic bin file;
- Aggregate the bins into 500 kb windows;
- Normalize by bin length;
- Normalize by GC content;
- Detect the ploidy;
- Output:
- GC_content_image: images Normalized bin counts (1M)
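The length and GC normalization steps above can be illustrated with a small sketch. The exact method baseq-CNV uses is not described in this document, and many tools fit a smooth curve such as LOWESS. The version below swaps in a simpler technique, dividing each bin by the median of bins in the same GC stratum, purely to show the idea. The function names, the 5% stratum width, and the toy data are all assumptions, and ploidy detection is not covered.

```python
from statistics import mean, median


def normalize_by_length(counts, lengths):
    """Convert raw counts to reads per base, then scale so the mean is 1."""
    density = [c / n for c, n in zip(counts, lengths)]
    avg = mean(density)
    return [d / avg for d in density]


def normalize_by_gc(values, gc_percent, step=5):
    """Divide each value by the median of bins in the same GC stratum."""
    strata = {}
    for v, g in zip(values, gc_percent):
        strata.setdefault(int(g // step), []).append(v)
    medians = {k: median(vs) for k, vs in strata.items()}
    result = []
    for v, g in zip(values, gc_percent):
        m = medians[int(g // step)]
        result.append(v / m if m > 0 else 0.0)  # avoid division by zero
    return result


# Toy data (hypothetical): GC-rich bins attract 50% more reads per base.
counts = [100, 200, 150, 300]
lengths = [1000, 2000, 1000, 2000]
gc_percent = [40, 41, 55, 56]

by_length = normalize_by_length(counts, lengths)
corrected = normalize_by_gc(by_length, gc_percent)
print(by_length)
print(corrected)
```

In this toy case the length-normalized values still differ between the low-GC and high-GC bins, and the stratified correction brings all four bins to the same level.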